info-vax@ucbvax.ARPA (12/18/84)
From: speck@cit-vax (Don Speck)

Our brand-new (Rev. 7) 750's came with RL02's instead of tape drives (bad idea!) so to bring up 4.2bsd I've put enough on the RL02 to be able to copy the rest via Ethernet. Partway through the copy I get hard RL02 errors (data late, operation incomplete); when I power-cycle and reboot, fsck finds the same thing, same sector. The DEC FE speculates that that sector's header gets messed up. VMS BAD and EVRAA give the pack a clean bill of health; if I rewrite it, it's fine again until I try copying via Ethernet onto the RA81. Same on both vaxes, I tried pack swaps, etc.

The RL211 controller is at the head of the Unibus, followed by the DZ-11 (unused so far), an Interlan 10 Mbit Ether card, and the UDA-50. During the copy the latter two boards must be really hogging the bus - is that what the `data late' error-bit is about? DEC blames it on the software 'cause it does the same thing on both machines.

Anybody have any clues/fixes/suggestions?

Don Speck
speck@cit-vax.arpa

I get unix-wizards but not info-vax.
info-vax@ucbvax.ARPA (12/19/84)
From: "Richard Kenner" <KENNER@NYU-CMCL1.ARPA>

Data-late errors are fairly common on RL01/2's. But they should be retried by the driver, so they are not a real problem. (There was a time early in the history of the RL01's when a data late in the wrong part of a write would clobber the next sector, but we found an ECO for that and DEC has had it in the controllers for at least 4 years by now.) I do not know if the UNIX driver will retry them, but the RSX and VMS drivers do.

However, you do NOT seem to be getting data late errors! Look at the descriptions of those bits more carefully! Here's the description from the RL01 User's Manual (it is the same for the RL02 but I don't have that manual handy):

    bit 10  Operation Incomplete (OPI)
            When set, this bit indicates that the current command was not completed within 200ms.

    bit 12  Data Late (DLT) or Header Not Found (HNF error)
            When OPI (bit 10) is cleared and bit 12 is set, it indicates that, on a write operation, the silo was empty and, therefore, a word was not available for writing; or, on a read operation, that the silo was full and unable to store another word from the drive.
            When OPI (bit 10) is set and bit 12 is also set, it indicates that a 200ms timeout occurred while the controller was searching for the correct sector to read or write (no header compare).

So what you are really getting is a Header Not Found. This is most likely a bad sector on the device. As to why VMS BAD and EVRAA don't find it, the only possibility that I can think of is that it is in a reserved area of the pack (such as the last track) which UNIX, for some reason, is trying to access. Your FE is right -- there does seem to be a problem with a bad header.

As to solutions, here's what I can think of:

(1) Try another pack.

(2) It is conceivable that the system is trying to access an invalid sector. Look at the sector number to make sure that it is valid. Another possible problem area is that seeks on the RL01/2 are given as deltas from the current position, and the system might be confused as to the current position. The driver, when it gets an HNF error, should attempt recovery by forcing the drive to cylinder zero and retrying. If it doesn't do this, it should.

(3) Alternatively, try forcing EVRAA to run on just a small range centered around the problem area. I THINK (but am not sure) that this can be done with EVRAA.

-------
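The bit decoding and the recovery suggested in (2) can be sketched in C roughly as follows. This is only an illustration under stated assumptions, not the actual 4.2bsd, RSX, or VMS driver code. The bit positions come from the manual text quoted above; every name here (rl_classify, rl_recovery, rl_address_valid, the enums) is hypothetical. The geometry constants (40 sectors per track, 2 heads, 256 cylinders on an RL01 and 512 on an RL02) are not from the message and should be checked against the drive manual. Retrying on a plain OPI timeout is likewise a choice made for this sketch.

```c
#include <stdint.h>

/* Bits 10 and 12 of the controller status word, per the manual text quoted above. */
#define RL_OPI      (1u << 10)   /* Operation Incomplete */
#define RL_DLT_HNF  (1u << 12)   /* Data Late, or Header Not Found when OPI is also set */

/* Assumed geometry (not from the message; verify against the drive manual). */
#define RL_SECTORS  40u
#define RL_HEADS    2u
#define RL01_CYLS   256u
#define RL02_CYLS   512u

enum rl_error {
    RL_ERR_NONE,
    RL_ERR_OPI,               /* bit 10 only: command timed out */
    RL_ERR_DATA_LATE,         /* bit 12 only: silo empty on write or full on read */
    RL_ERR_HEADER_NOT_FOUND   /* bits 10 and 12: timed out searching for the sector */
};

enum rl_action {
    RL_ACT_NONE,
    RL_ACT_RETRY,
    RL_ACT_RECALIBRATE_AND_RETRY,  /* force the drive to cylinder zero, then retry */
    RL_ACT_FAIL
};

/* Classify an error from the status word using only bits 10 and 12. */
enum rl_error rl_classify(uint16_t cs)
{
    int opi   = (cs & RL_OPI) != 0;
    int bit12 = (cs & RL_DLT_HNF) != 0;

    if (opi && bit12)
        return RL_ERR_HEADER_NOT_FOUND;
    if (bit12)
        return RL_ERR_DATA_LATE;
    if (opi)
        return RL_ERR_OPI;
    return RL_ERR_NONE;
}

/* Pick a recovery step. attempts counts the retries already made. */
enum rl_action rl_recovery(enum rl_error err, unsigned attempts, unsigned max_attempts)
{
    if (err == RL_ERR_NONE)
        return RL_ACT_NONE;
    if (attempts >= max_attempts)
        return RL_ACT_FAIL;
    if (err == RL_ERR_HEADER_NOT_FOUND)
        return RL_ACT_RECALIBRATE_AND_RETRY;
    return RL_ACT_RETRY;  /* data late, or plain timeout (assumption of this sketch) */
}

/* Check that a cylinder/head/sector address lies on the pack at all. */
int rl_address_valid(unsigned cyl, unsigned head, unsigned sector, int is_rl02)
{
    unsigned cyls = is_rl02 ? RL02_CYLS : RL01_CYLS;
    return cyl < cyls && head < RL_HEADS && sector < RL_SECTORS;
}
```

With this decoding, a status word with both bits set (for example 0x1400) is reported as a Header Not Found rather than a Data Late, which is the distinction drawn in the reply above. The address check only tests whether the address exists; it does not know about reserved areas such as the last track.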