[comp.os.vms] 4.6 lat and decnet problems?

CAMPBELL@UTOROCI.BITNET (01/19/88)

        A week ago I posted about our DHU/V 4.6 hanging problem, and
was reassured by several responses that my troubles were not unique
and that my fix was the right one. Many thanks to the half-dozen who
wrote.

        Now how about the LATSYM bugs that I also saw mentioned? We
have seen the following problems on the laser printer that's on our
decserver200:

        (1) It occasionally prints one or more pages twice.
        (2) Print jobs occasionally abort on error and are retained
with errors "SYSTEM-W-DEVALLOC, device already allocated to another
user" or "SYSTEM-F-ABORT".
        (3) The above sometimes begins to occur to every print job;
deleting the queue and restarting it fixes it.
        (4) Once (most recently), attempting to restart the queue
didn't work; the command START/QUEUE hung, and when I ^Y'd out, the
queue showed up as "starting" - and stayed that way. Same for
INIT/QUEUE/START. Furthermore, SET DEV/NOSPOOL reported "device has
channels allocated" - after the queue was deleted. I entered LCP and
told it to delete the port; it accepted the command but did nothing.
(The port was showing up as "interactive", not "application".)
Creating a new LTA name for the port and restarting the queue did not
help; INIT/START hung with both port names. I rebooted the vax and the
queue came up normally.

        Our setup: VAX 11/780 with delua. The ethernet is a thinwire
segment that goes about 20m (when I say LAN, I mean LAN). Of two
decserver200s, one has no ports connected, the other has 3 users and
the printer. The users' terminals seem ok. A microvax, with no users
at the time, also has a queue using the printer. During (4), it printed
normally to the printer. The uVax is also at 4.6. We had the printer
working in this setup under vms 4.4, and saw nothing like this. There
is nothing else on the ethernet, and we have nothing else that should
be using the delua.

        While I'm on the subject:

        (5) Transfers of large files (eg: 4600-block save-set) over
our async decnet lines fail with the messages: "RMS-F-BUG_DAP, Data
Access Protocol error detected; DAP code = 00019008" and "RMS-E-CRC,
network DAP level CRC check failed". The messages and recovery book
says to SPR those errors. The transfer works fine when I force routing
through a common node connected to us via a dmr32 synch line. The
other systems involved were uVMS 4.3 and 4.5. Possible hint: the line
is connected through a port on a DHU emulator, and thus goes through
the resurrected 4.4 YFdriver.

        Well, has anyone else seen stuff like this? Is it associated
with 4.6 - and does 4.7 fix it? Any suggestions as to what pieces of
4.4 to dust off?

Thanks,
Chip Campbell, Ontario Cancer Institute, Toronto
bitnet/NetNorth: campbell@utoroci

LEICHTER@VENUS.YCC.YALE.EDU ("Jerry Leichter ", LEICHTER-JERRY@CS.YALE.EDU) (01/19/88)

I can't comment on the other problems you are seeing, but...

	Transfers of large files (eg: 4600-block save-set) over our async
	decnet lines fail with the messages: "RMS-F-BUG_DAP, Data Access
	Protocol error detected; DAP code = 00019008" and "RMS-E-CRC, network
	DAP level CRC check failed". The messages and recovery book says to
	SPR those errors. The transfer works fine when I force routing through
	a common node connected to us via a dmr32 synch line. The other
	systems involved were uVMS 4.3 and 4.5. Possible hint: the line is
	connected through a port on a DHU emulator, and thus goes through the
	resurrected 4.4 YFdriver.

BUG_DAP could be a genuine bug, but the occurrence of DAP-level CRC errors
opens up the possibility of hardware problems.

A little background:  There are checksums computed on the data as it enters
each DDCMP link, and as it exits the link at the other end.  There are similar
checksums on stuff as it is placed on and removed from the Ethernet.  A fail-
ure of a checksum at this level is invisible to higher levels - the failed
packet is simply sent again later.  The result is that a DDCMP or Ethernet
link will appear, to higher levels of the protocols, as an error-free path.
In theory, that's all there is to it - but DAP is extra careful:  The sender
computes a checksum of all the data it sends (over a theoretically error-free
channel) and the receiver checks it.  This "end-to-end" checksum covers the
ENTIRE DAP transaction, so a DAP level CRC check failure can only occur as
the connection is closing down.
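
As a rough illustration of why the failure only shows up at close, here is a
minimal sketch of a running end-to-end checksum.  It is not DAP itself: the
function name dap_like_transfer is made up for the example, and zlib.crc32
stands in for whatever CRC DAP actually uses.  Each side folds every record
into a running CRC, and the two values are compared once, at the end.

```python
import zlib


def dap_like_transfer(records, corrupt_index=None):
    """Return True if the sender's and receiver's running CRCs agree at close.

    Hypothetical helper for illustration; CRC-32 is an assumed stand-in
    for the real DAP checksum.
    """
    sender_crc = 0
    receiver_crc = 0
    for i, record in enumerate(records):
        # The sender folds each record into its running checksum.
        sender_crc = zlib.crc32(record, sender_crc)

        delivered = record
        if i == corrupt_index:
            # Simulate one flipped bit that no link-level check noticed.
            delivered = bytes([record[0] ^ 0x01]) + record[1:]

        # The receiver folds in whatever actually arrived.
        receiver_crc = zlib.crc32(delivered, receiver_crc)

    # The comparison happens once, after the last record.
    return sender_crc == receiver_crc


if __name__ == "__main__":
    blocks = [b"block one", b"block two", b"block three"]
    print(dap_like_transfer(blocks))                   # clean transfer
    print(dap_like_transfer(blocks, corrupt_index=1))  # damaged mid-transfer
```

Note that a record damaged early in a 4600-block transfer is not reported
until the whole transfer has finished, which matches the behavior described
above.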

So, why would a DAP connection of "error-free" channels sometimes find errors?
You have to examine very closely what the checksums are really covering.  Con-
sider the case of the Ethernet checksum:  Data is pulled from memory and hand-
ed to the Ethernet controller.  It computes a checksum, and sends the packet.
The receiving controller pulls the data off of Ether, checks the checksum, and
writes the results to memory.  The checksum covers the transfer of the data on
the Ethernet - it CANNOT say anything about the transfers between memory and
the Ethernet controllers.
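
The gap in coverage can be sketched the same way.  This is a toy model under
stated assumptions, not a description of any real controller: the function
names are invented, zlib.crc32 stands in for both the link-level and the
end-to-end checksum, and "the bus" is just a function that may flip a bit
before the controller ever sees the data.

```python
import zlib


def flip_first_bit(data):
    """Return a copy of data with the low bit of the first byte inverted."""
    return bytes([data[0] ^ 0x01]) + data[1:]


def controller_send(bytes_from_bus):
    # The controller checksums the bytes it was handed, not the bytes
    # that were originally in memory.
    return bytes_from_bus, zlib.crc32(bytes_from_bus)


def controller_receive(frame, link_crc):
    # The link-level check only compares the frame against its own checksum.
    return zlib.crc32(frame) == link_crc


def send_over_link(memory_data, corrupt_on_bus=False):
    """Return (link_check_ok, end_to_end_check_ok) for one transfer."""
    on_bus = flip_first_bit(memory_data) if corrupt_on_bus else memory_data
    frame, link_crc = controller_send(on_bus)
    link_ok = controller_receive(frame, link_crc)
    # The end-to-end check compares against a checksum of the original data.
    end_to_end_ok = zlib.crc32(frame) == zlib.crc32(memory_data)
    return link_ok, end_to_end_ok


if __name__ == "__main__":
    print(send_over_link(b"save-set block"))
    print(send_over_link(b"save-set block", corrupt_on_bus=True))
```

In the corrupted case the link-level check still passes, because the
controller computed its checksum over data that was already damaged; only
the end-to-end comparison fails.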

In practice, the most common bad link not covered by a checksum is the memory-
to-Unibus-to-DMC or -DMR path.  Usually the cause is power supply problems,
especially just exceeding the rating of the power supply for the Unibus.
Machines that do a lot of routing are often configured with a Unibus contain-
ing little besides a couple of DMR's, so the problem may never show up
in any other way.
							-- Jerry