[fa.info-vax] RL02 headache

info-vax@ucbvax.ARPA (12/18/84)

From: speck@cit-vax (Don Speck)

    Our brand-new (Rev. 7) 750's came with RL02's instead of tape drives
(bad idea!) so to bring up 4.2bsd I've put enough on the RL02 to be able
to copy the rest via Ethernet.  Partway through the copy I get hard RL02
errors (data late, operation incomplete); when I power-cycle and reboot,
fsck finds the same thing, same sector.  The DEC FE speculates that that
sector's header gets messed up.  VMS BAD and EVRAA give the pack a clean
bill of health; if I rewrite it, it's fine again until I try copying via
Ethernet onto the RA81.  Same on both vaxes, I tried pack swaps, etc.

    The RL211 controller is at the head of the Unibus, followed by the
DZ-11 (unused so far), an Interlan 10 Mbit Ether card, and the UDA-50.
During the copy the latter two boards must be really hogging the bus -
is that what the `data late' error-bit is about?  DEC blames it on the
software 'cause it does the same thing on both machines.  Anybody have
any clues/fixes/suggestions?
				Don Speck	speck@cit-vax.arpa
				I get unix-wizards but not info-vax.

info-vax@ucbvax.ARPA (12/19/84)

From: "Richard Kenner" <KENNER@NYU-CMCL1.ARPA>

Data-late errors are fairly common on RL01/2's.  But they should be retried
by the driver so they are not a real problem (there was a time early in
the history of the RL01's when a data late in the wrong part of a write
would clobber the next sector, but we found an ECO for that and DEC has had
it in the controllers for at least 4 years by now).  I do not know if the
UNIX driver will retry them but the RSX and VMS drivers do.

However, you do NOT seem to be getting data late errors!  Look at the
descriptions of those bits more carefully!

Here's the description from the RL01 User's Manual (it is the same for
the RL02 but I don't have that manual handy):

bit 10	Operation Incomplete (OPI)	When set, this bit indicates that
					the current command was not
					completed within 200ms.

bit 12	Data Late (DLT) or Header
	Not Found (HNF error)		When OPI (bit 10) is cleared and
					bit 12 is set, it indicates that,
					on a write operation, the silo was
					empty and, therefore, a word was
					not available for writing; or,
					on a read operation, that the 
					silo was full and unable to store
					another word from the drive.

					When OPI (bit 10) is set and bit
					12 is also set, it indicates that
					a 200ms timeout occurred while the
					controller was searching for the
					correct sector to read or write
					(no header compare).
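
To make the reading of those two bits concrete, here is a minimal sketch in C
of how a driver might classify an error from them.  The bit positions come
from the manual text quoted above; the names (rl_classify, RL_OPI, and so on)
are made up for illustration and are not taken from any actual driver.

```c
#include <stdint.h>

/* Bit positions as given in the manual excerpt above. */
#define RL_OPI      (1u << 10)  /* Operation Incomplete */
#define RL_DLT_HNF  (1u << 12)  /* Data Late, or Header Not Found if OPI is also set */

/* Hypothetical classification, for illustration only. */
enum rl_error {
    RL_ERR_NONE,       /* neither bit set */
    RL_ERR_OPI,        /* bit 10 alone: command not completed within 200ms */
    RL_ERR_DATA_LATE,  /* bit 12 alone: silo empty on write or full on read */
    RL_ERR_HNF         /* bits 10 and 12: timeout while searching for the header */
};

enum rl_error rl_classify(uint16_t status)
{
    int opi = (status & RL_OPI) != 0;
    int b12 = (status & RL_DLT_HNF) != 0;

    if (opi && b12)
        return RL_ERR_HNF;
    if (b12)
        return RL_ERR_DATA_LATE;
    if (opi)
        return RL_ERR_OPI;
    return RL_ERR_NONE;
}
```

With both bits set, as in the report of "data late, operation incomplete",
rl_classify() returns RL_ERR_HNF rather than RL_ERR_DATA_LATE.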

So what you are really getting is a Header Not Found.  This is most likely
a bad sector on the device.  As to why VMS BAD and EVRAA don't find it,
the only possibility that I can think of is that it is in a reserved
area of the pack (such as the last track) which UNIX, for some reason,
is trying to access.

Your FE is right -- there does seem to be a problem with a bad header.

As to solutions, here's what I can think of:

(1) Try another pack.
(2) It is conceivable that the system is trying to access an invalid sector.
    Look at the sector number to make sure that it is valid.  Another
    possible problem area is that seeks to the RL01/2 are given as deltas
    from the current position and the system might be confused as to the
    current position.  The driver, when it gets an HNF error, should attempt
    recovery by forcing the drive to cylinder zero and retrying.  If it
    doesn't do this, it should.
(3) Alternatively, try forcing EVRAA to just run on a small range centered
    around the problem area.  I THINK (but am not sure) that this can be
    done with EVRAA.
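
The checks in (2) can be sketched as a small simulation in C.  This is not
driver code: the drive is just a struct holding the real head position, and
sim_home() stands in for whatever the hardware needs to get back to cylinder
zero.  The geometry constants (512 cylinders, 2 heads, 40 sectors per track)
are my assumption for an RL02 pack, and every name here is hypothetical.

```c
#include <stdbool.h>

/* Assumed RL02 geometry; not taken from the messages above. */
#define RL02_CYLINDERS 512
#define RL02_HEADS     2
#define RL02_SECTORS   40

struct sim_drive  { int actual_cyl; };                  /* where the heads really are */
struct sim_driver { int believed_cyl; int recoveries; }; /* what the software thinks */

/* Is the requested disk address inside the assumed geometry? */
bool rl_addr_valid(int cyl, int head, int sector)
{
    return cyl >= 0 && cyl < RL02_CYLINDERS &&
           head >= 0 && head < RL02_HEADS &&
           sector >= 0 && sector < RL02_SECTORS;
}

/* Seeks are relative: move by a delta from the current position. */
void sim_seek_delta(struct sim_drive *d, int delta)
{
    int c = d->actual_cyl + delta;
    if (c < 0) c = 0;
    if (c >= RL02_CYLINDERS) c = RL02_CYLINDERS - 1;
    d->actual_cyl = c;
}

/* Hypothetical stand-in for forcing the drive to cylinder zero. */
void sim_home(struct sim_drive *d)
{
    d->actual_cyl = 0;
}

/* One attempt: returns false (think HNF) if the heads land on the wrong cylinder. */
bool sim_access(struct sim_drive *d, struct sim_driver *drv, int target_cyl)
{
    sim_seek_delta(d, target_cyl - drv->believed_cyl);
    drv->believed_cyl = target_cyl;
    return d->actual_cyl == target_cyl;
}

/* On failure, go to cylinder zero, resynchronize the driver, and retry once. */
bool rl_access_with_recovery(struct sim_drive *d, struct sim_driver *drv,
                             int target_cyl)
{
    if (sim_access(d, drv, target_cyl))
        return true;
    sim_home(d);
    drv->believed_cyl = 0;
    drv->recoveries++;
    return sim_access(d, drv, target_cyl);
}
```

If the driver believes it is at cylinder 90 while the heads are really at
100, a seek to cylinder 200 lands on 210 and the first attempt fails; after
homing, the retry seeks from a known position and succeeds.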
 
-------