CAMPBELL@UTOROCI.BITNET (01/19/88)
A week ago I posted our problems with the DHU/V 4.6 hanging problem, and was reassured by several responses that my troubles were not unique and that my fix was the right one. Many thanks to the half-dozen who wrote.

Now how about the LATSYM bugs that I also saw mentioned? We have seen the following problems on the laser printer that's on our decserver200:

(1) It occasionally prints one or more pages twice.

(2) Print jobs occasionally abort on error and are retained with errors "SYSTEM-W-DEVALLOC, device already allocated to another user" or "SYSTEM-F-ABORT".

(3) The above sometimes begins to occur to every print job; deleting the queue and restarting it fixes it.

(4) Once (most recently), attempting to restart the queue didn't work; the command START/QUEUE hung, and when I ^Y'd out, the queue showed up as "starting" - and stayed that way. Same for INIT/QUEUE/START. Furthermore, SET DEV/NOSPOOL reported "device has channels allocated" - after the queue was deleted. I entered LCP and told it to delete the port; it accepted the command but did nothing. (The port was showing up as "interactive", not "application".) Creating a new LTA name for the port and restarting the queue did not help; INIT/START hung with both port names. I rebooted the vax and the queue came up normally.

Our setup: VAX 11/780 with delua. The ethernet is a thinwire segment that goes about 20m (when I say LAN, I mean LAN). Of two decserver200s, one has no ports connected, the other has 3 users and the printer. The users' terminals seem ok. A microvax, with no users at the time, also has a queue using the printer. During (4), it printed normally to the printer. The uVax is also at 4.6. We had the printer working in this setup under vms 4.4, and saw nothing like this. There is nothing else on the ethernet, and we have nothing else that should be using the delua.
While I'm on the subject:

(5) Transfers of large files (eg: 4600-block save-set) over our async decnet lines fail with the messages: "RMS-F-BUG_DAP, Data Access Protocol error detected; DAP code = 00019008" and "RMS-E-CRC, network DAP level CRC check failed". The messages and recovery book says to SPR those errors. The transfer works fine when I force routing through a common node connected to us via a dmr32 synch line. The other systems involved were uVMS 4.3 and 4.5. Possible hint: the line is connected through a port on a DHU emulator, and thus goes through the resurrected 4.4 YFdriver.

Well, has anyone else seen stuff like this? Is it associated with 4.6 - and does 4.7 fix it? Any suggestions as to what pieces of 4.4 to dust off?

Thanks,
Chip Campbell, Ontario Cancer Institute, Toronto
bitnet/NetNorth: campbell@utoroci
LEICHTER@VENUS.YCC.YALE.EDU ("Jerry Leichter ", LEICHTER-JERRY@CS.YALE.EDU) (01/19/88)
I can't comment on the other problems you are seeing, but...

    Transfers of large files (eg: 4600-block save-set) over our async decnet lines fail with the messages: "RMS-F-BUG_DAP, Data Access Protocol error detected; DAP code = 00019008" and "RMS-E-CRC, network DAP level CRC check failed". The messages and recovery book says to SPR those errors. The transfer works fine when I force routing through a common node connected to us via a dmr32 synch line. The other systems involved were uVMS 4.3 and 4.5. Possible hint: the line is connected through a port on a DHU emulator, and thus goes through the resurrected 4.4 YFdriver.

BUG_DAP could be a genuine bug, but the occurrence of DAP-level CRC errors opens up the possibility of hardware problems. A little background:

There are checksums computed on the data as it enters each DDCMP link, and as it exits the link at the other end. There are similar checksums on stuff as it is placed on and removed from the Ethernet. A failure of a checksum at this level is invisible to higher levels - the failed packet is simply sent again later. The result is that a DDCMP or Ethernet link will appear, to higher levels of the protocols, as an error-free path.

In theory, that's all there is to it - but DAP is extra careful: The sender computes a checksum of all the data it sends (over a theoretically error-free channel) and the receiver checks it. This "end-to-end" checksum covers the ENTIRE DAP transaction, so a DAP level CRC check failure can only occur as the connection is closing down.

So, why would a DAP connection of "error-free" channels sometimes find errors? You have to examine very closely what the checksums are really covering. Consider the case of the Ethernet checksum: Data is pulled from memory and handed to the Ethernet controller. It computes a checksum, and sends the packet. The receiving controller pulls the data off of Ether, checks the checksum, and writes the results to memory.
The checksum covers the transfer of the data on the Ethernet - it CANNOT say anything about the transfers between memory and the Ethernet controllers.

In practice, the most common bad link not covered by a checksum is the memory-to-Unibus-to-DMC or -DMR path. Usually the cause is power supply problems, especially just exceeding the rating of the power supply for the Unibus. Machines that do a lot of routing are often configured with a Unibus containing little besides a couple of DMR's, so the problem may never show up in any other way.

-- Jerry
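
The checksum argument above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not DAP or DDCMP code: the function names (`send_over_link`, `transfer`) are hypothetical, CRC-32 from Python's `zlib` stands in for whatever polynomial the real protocols use, and a single flipped bit stands in for a flaky memory-to-bus-to-controller path. The link checksum is computed by the controller over the data it was handed, so damage done before the controller sees the data passes the link check, while the running end-to-end checksum over the whole transfer catches it at the end.

```python
import zlib


def flip_bit(data: bytes, index: int) -> bytes:
    """Return a copy of data with the low bit of one byte inverted."""
    corrupted = bytearray(data)
    corrupted[index] ^= 0x01
    return bytes(corrupted)


def send_over_link(block: bytes, corrupt_before_controller: bool):
    """Model one hop; returns (data delivered to receiver, link_check_ok).

    Assumption: the only fault is on the sender's memory-to-controller
    path, which the link-level checksum does not cover.
    """
    # Memory -> bus -> controller: unprotected by the link checksum.
    at_controller = flip_bit(block, 0) if corrupt_before_controller else block
    # The controller checksums what it was handed, damaged or not.
    link_crc = zlib.crc32(at_controller)
    # The wire itself is modelled as clean here.
    received = at_controller
    return received, zlib.crc32(received) == link_crc


def transfer(blocks, bad_block=None):
    """Send all blocks; returns (all_link_checks_ok, end_to_end_ok)."""
    sender_crc = 0
    receiver_crc = 0
    all_links_ok = True
    for i, block in enumerate(blocks):
        # Sender accumulates a checksum over the data as it left memory.
        sender_crc = zlib.crc32(block, sender_crc)
        delivered, link_ok = send_over_link(block, i == bad_block)
        all_links_ok = all_links_ok and link_ok
        # Receiver accumulates a checksum over what actually arrived.
        receiver_crc = zlib.crc32(delivered, receiver_crc)
    # The end-to-end comparison is only possible once the transfer is done.
    return all_links_ok, sender_crc == receiver_crc


if __name__ == "__main__":
    blocks = [bytes([i]) * 512 for i in range(4)]
    print(transfer(blocks))               # clean path
    print(transfer(blocks, bad_block=2))  # one block damaged before the controller
```

In the damaged run every link-level check still passes, and only the comparison made at the end of the transfer fails, which matches the observation that the DAP CRC error shows up as the connection closes down.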