reinier@.parcom.nl (Reinier van den Born) (11/15/90)
I have a uucp vs. trailblazer problem that I would like some help with. I have posted this to both comp.mail.uucp and comp.dcom.modems, but unfortunately we don't receive the latter here, so please reply by mail. If anything interesting turns up I will post a summary. I am not a regular reader of either of these groups, so I apologize if this problem has been discussed here before.

I am afraid things get a bit technical below, so I suppose it will only be interesting to uucp-gurus, trailblazer-lovers, and PK-addicts (:-). However, I would like to hear from sites that have or had similar problems. Therefore I will start with a superficial problem description before getting into detail.

The configuration:
- Sony NWS-1750 workstation running NEWS 3.2 (BSD4.3 variant)
- TELEBIT TrailBlazer Plus running UUCP support
- a dialin connection to hp4nl, which is the uucp backbone for the Netherlands (connects to TrailBlazers)

What happens:
Almost any file transfer (> some kbytes) fails due to a checksum error happening at an apparently random place/time.

Conditions:
This only happens when using the UUCP support of the TrailBlazer. Otherwise everything works just fine, except that the connection appears to be even slower than a 2400 baud modem. (As a result we currently use such a modem and are considering getting rid of the TrailBlazer. ;-(
No difference in behaviour is noticeable when talking to the trailblazer at 9600 or 19200 baud.

Note:
During the short period we had a Sony NWS-1850 workstation (which has a separate I/O processor and is therefore much faster) the problems occurred considerably less often.

------ stop reading here if you are not a guru, lover, or addict ---------

The problem lies in the packet (PK) communication protocol employed by uucico and supported (whatever that means) by the trailblazer. I gratefully used a protocol description by G.L. Chesson (Bell Labs) to get as far as I have now. Terminology is taken from there.
I got my hands on some uucico sources and after some hacking I discovered that the checksum error was caused by control packets appearing at a random position in a data packet. After the data packet uucico finds exactly n*6 bytes of "noise" (where n is the number of control packets present and 6 the size of such a packet) before recognising the next data packet. Of course, sometimes the "noise" contains a DLE (^P), which causes even more errors.

Uucico has some means to recover from transmission errors. A checksum error causes uucico to request that the last packets be resent. However, the result is exactly the same, i.e., the control packet(s) appear again at exactly the same location(s). (However, when uucico is started again to transmit the same file, the error will be found at another place or may not occur at all.)

The misaligned control packet found is always an RR packet. The sequence number of the last correctly received packet found in the RR packet is consistent with that found in the data packets received. (So it isn't a packet originating from our machine that is inadvertently echoed.)

So we get, for instance, something like this (C in octal, others hex):

      :
    receive C = 227
                        send C = 042
    receive C = 237
                        send C = 043
    receive C = 247 - checksum error
        control packet found: k = 0x9, c0 = 0x83, c1 = 0xaa, C = 047
                        send C = 023
        6 bytes noise
    receive C = 257
    receive C = 267
    receive C = 247 - checksum error
        control packet found: k = 0x9, c0 = 0x83, c1 = 0xaa, C = 047
                        send C = 023
        6 bytes noise
      :
    etc.

After some further hacking I had uucico remove the control packet from the data packet (this only works sometimes: the control packet must not appear in the header or end up partly in the noise at the end of the data part). To my surprise, communication continues normally after such a recovery.

Considering these facts I come to the following conclusions:
- something is dead wrong here, as those control packets should never end up in data packets.
- the fact that hp4nl and its trailblazers maintain a large number of fast uucp connections, and that such problems have never been heard of there, suggests that the problem doesn't come from there.
- the fact that normal communication is possible if UUCP support is switched off, or if the error is corrected, suggests that the TrailBlazer is the bad guy.
- the fact that Sony workstations are uncommon, and ours is probably the only one in the Netherlands connected to hp4nl, suggests that the Sony itself causes the problem. Problems with the serial device drivers have been reported. However, this is mere circumstantial evidence. The fact that the faster Sony had fewer problems, however, might be a stronger indication.

The only explanation I can think of at the moment is the following:
- hp4nl sends the RR packets synchronously with the data packets
- somehow one of the trailblazers doesn't recognize an RR packet and passes it straight through (as it would with any data not fitting the protocol??). Thus it gets out of sync.
- it then ends up at a random place in the protocol buffers of our TrailBlazer.
- when our TrailBlazer receives the resend request from our uucico it just sends its buffers again, repeating exactly the same error.
- a faster Sony will service the trailblazer, and therefore hp4nl, faster and will therefore cause fewer RR packets to be sent

I should note that I know little of TrailBlazers, especially the extent of the protocol support, so this explanation may be way off.

For the time being I will try to make uucico recover from these errors as well as possible. It seems difficult to repair them all, but say 98% seems feasible. This will still leave a reduced transmission speed (uucico sleeps 1 second before continuing after an error), an occasional disruption of the communication, and the bad taste of an improper solution.

So any suggestions as to the cause or a proper solution of the problem are welcome.
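
For anyone who wants to check the numbers in the trace above, here is a minimal Python sketch (not the actual uucico change, which is in C). It assumes the framing as I understand it from the protocol description: a 6-byte header DLE, k, c0, c1, C, x with x = k xor c0 xor c1 xor C, k = 9 for control packets, a control-packet check value of 0xAAAA - C, and a control byte laid out as tt/xxx/yyy (2/3/3 bits). The message numbering in CONTROL_NAMES and all function names are assumptions made for illustration and should be checked against the protocol description.

```python
DLE = 0x10        # ^P, the byte that starts every packet header
K_CONTROL = 9     # k value that marks a control packet
MAGIC = 0xAAAA    # control packets carry MAGIC - C as their check value

# Control message names indexed by the xxx field (assumed numbering).
CONTROL_NAMES = {1: "CLOSE", 2: "RJ", 3: "SRJ", 4: "RR",
                 5: "INITC", 6: "INITB", 7: "INITA"}
DATA_KINDS = {1: "ALT", 2: "DATA", 3: "SHORT"}


def make_control(c):
    """Build the 6-byte header DLE k c0 c1 C x for control byte c."""
    check = (MAGIC - c) & 0xFFFF
    c0, c1 = check & 0xFF, check >> 8
    return bytes([DLE, K_CONTROL, c0, c1, c, K_CONTROL ^ c0 ^ c1 ^ c])


def describe(c):
    """Split a control byte into its tt/xxx/yyy fields and name it."""
    tt, xxx, yyy = c >> 6, (c >> 3) & 7, c & 7
    if tt == 0:
        return f"{CONTROL_NAMES.get(xxx, '?')} {yyy}"
    return f"{DATA_KINDS[tt]} seq={xxx} ack={yyy}"


def is_control_header(buf, i):
    """True if the 6 bytes at buf[i:] form a consistent control packet."""
    if len(buf) - i < 6:
        return False
    dle, k, c0, c1, c, x = buf[i:i + 6]
    return (dle == DLE and k == K_CONTROL
            and x == k ^ c0 ^ c1 ^ c
            and (c0 | c1 << 8) == (MAGIC - c) & 0xFFFF)


def strip_control(buf):
    """Remove embedded control packets from buf.

    Returns (cleaned bytes, list of the control bytes that were removed).
    """
    out, found, i = bytearray(), [], 0
    while i < len(buf):
        if is_control_header(buf, i):
            found.append(buf[i + 4])   # remember C of the stray packet
            i += 6
        else:
            out.append(buf[i])
            i += 1
    return bytes(out), found


if __name__ == "__main__":
    for c in (0o227, 0o042, 0o247, 0o047, 0o023):
        print(f"C = {c:03o}: {describe(c)}")
    stray = make_control(0o047)
    print("stray packet:", stray.hex(" "))
```

With these assumptions, C = 047 gives c0 = 0x83 and c1 = 0xaa, which matches the control packet reported in the trace. Note that legitimate data could contain the same 6 bytes by coincidence, so a stripping step like this only makes sense after a checksum error, and the checksum of the cleaned data must be verified again before the packet is accepted.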
Reinier van den Born
Parallel Computing
E-mail: reinier@parcom.nl
Phone: +31-20-233274
S-nail: Postbus 16775, 1001 RG Amsterdam, The Netherlands
--
Reinier van den Born
Parallel Computing, Amsterdam. +31-20-233274
E-mail: reinier@parcom.nl