hedrick@topaz.RUTGERS.EDU (Charles Hedrick) (03/03/86)
First, I should warn you that the problem I am about to describe was observed on a Pyramid 90X. However, a quick perusal of other source suggests that the problem is probably present in our Sun 2.0 source and in 4.3. So I conclude that this problem is generic to 4bsd implementations. However, symptoms may or may not be present on other systems, depending upon the details of how they use the variable rcv_adv.

The symptom is that connections attempting to send data from a DEC-20 or Symbolics 3600 to Unix hang. Or connections from any kind of system may become super-slow (like about 1000 bits/sec on an Ethernet).

I now believe that the problem is due to incorrect initialization of rcv_adv. This variable indicates the receive window advertised to the other end. However, it is not a window size. It is a sequence number, namely the largest sequence number that the other end has ever been authorized to send. This is sort of a "high water mark", since silly-window prevention can cause the window to shrink. In such cases rcv_adv does not decrease. Except when this window shrinking has happened, the actual advertised window size is rcv_adv - rcv_nxt.

Now for the bug. rcv_adv is set in only one place, in tcp_output:

    if (SEQ_GT(tp->rcv_nxt+win, tp->rcv_adv))
        tp->rcv_adv = tp->rcv_nxt + win;

This works fine, except for the first time. rcv_adv is initialized to zero. Unfortunately, sequence numbers are compared using modulo arithmetic, such that some sequence numbers are actually less than zero. If a connection has such "negative" sequence numbers, then this test always fails, and rcv_adv is never updated.

rcv_adv is used in only one place, in tcp_output, to calculate when to issue window updates. For connections that have bad values of rcv_adv, the effect can be missing window updates. If the TCP implementation on the other end is correct, it will eventually issue a probe, and the connection will be restarted. However, such connections may be mysteriously slow.
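To make the failing comparison concrete, here is a minimal sketch. It assumes SEQ_GT is the usual 4BSD-style signed-difference macro (the post does not quote its definition), and it uses a hypothetical cut-down struct, toy_tcpcb, holding only the two fields under discussion. The helper update_rcv_adv is likewise invented for illustration; it wraps the two lines quoted from tcp_output.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed definition: compare sequence numbers via a signed 32-bit difference. */
#define SEQ_GT(a, b) ((int32_t)((uint32_t)(a) - (uint32_t)(b)) > 0)

/* Hypothetical stand-in for the real tcpcb; only the fields discussed here. */
struct toy_tcpcb {
    uint32_t rcv_nxt;   /* next sequence number expected from the peer */
    uint32_t rcv_adv;   /* highest sequence number advertised so far */
};

/*
 * The update quoted from tcp_output.
 * Returns 1 if rcv_adv was advanced, 0 if the SEQ_GT test failed.
 */
static int update_rcv_adv(struct toy_tcpcb *tp, uint32_t win)
{
    assert(tp != 0);
    if (SEQ_GT(tp->rcv_nxt + win, tp->rcv_adv)) {
        tp->rcv_adv = tp->rcv_nxt + win;
        return 1;
    }
    return 0;
}
```

With rcv_adv starting at zero, a connection whose rcv_nxt is 0x00001000 passes the test and rcv_adv advances. A connection whose rcv_nxt has the high bit set (say 0x80001000) yields a negative signed difference against zero, so the test fails and rcv_adv stays at zero.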
If the TCP implementation at the other end does not issue zero-window probes (TOPS-20), or issues them incorrectly (Symbolics, apparently -- there is some evidence that their probe has a data length of zero), then the connection will simply hang. Different Unix versions may use slightly different tests for when to do window updates. So the probability of hanging will depend upon the implementation.

The fix that I recommend is to change the definition of tcp_rcvseqinit so that it initializes rcv_adv as well as rcv_nxt.

    #define tcp_rcvseqinit(tp) \
        (tp)->rcv_nxt = (tp)->irs + 1; (tp)->rcv_adv += (tp)->rcv_nxt

The obvious code would be (tp)->rcv_adv = (tp)->rcv_nxt. However sometimes rcv_adv is given a non-zero value before the sequence numbers are initialized. So it seems safer to use the code above.
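A minimal sketch of the recommended initialization follows, under stated assumptions. The struct init_tcpcb and the names toy_rcvseqinit and window_update_possible are hypothetical, and SEQ_GT is assumed to be the usual signed-difference macro. The two assignments are the ones from the posted macro; the only change is a do/while(0) wrapper so the macro behaves as a single statement, for example under an unbraced if.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed definition: compare sequence numbers via a signed 32-bit difference. */
#define SEQ_GT(a, b) ((int32_t)((uint32_t)(a) - (uint32_t)(b)) > 0)

/* Hypothetical stand-in for the real tcpcb. */
struct init_tcpcb {
    uint32_t irs;       /* initial receive sequence number from the peer's SYN */
    uint32_t rcv_nxt;   /* next sequence number expected */
    uint32_t rcv_adv;   /* highest sequence number advertised so far */
};

/* Same two assignments as the posted macro, wrapped to act as one statement. */
#define toy_rcvseqinit(tp) \
    do { \
        (tp)->rcv_nxt = (tp)->irs + 1; \
        (tp)->rcv_adv += (tp)->rcv_nxt; \
    } while (0)

/* Returns 1 if advertising `win` bytes would now move rcv_adv forward. */
static int window_update_possible(const struct init_tcpcb *tp, uint32_t win)
{
    assert(tp != 0);
    return SEQ_GT(tp->rcv_nxt + win, tp->rcv_adv);
}
```

After toy_rcvseqinit, rcv_adv sits at or just above rcv_nxt rather than at zero, so the comparison in tcp_output succeeds even when irs has its high bit set. Because the macro uses +=, any offset stored in rcv_adv beforehand is carried over instead of being overwritten.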
jbn@wdl1.UUCP (03/15/86)
There are actually two bugs here. 4.3BSD(beta) has the sequence number bug as described; I posted a fix for this a few weeks ago and it follows below.

The very slow connection bug is in a formal sense a bug in the TCP specification, in that it is possible to implement window management policies that result in near-stalled transfers. In 4.3BSD, once the zero window state has been reached, notification of more window will not start transmission unless enough window is available to send one maximum-sized segment. If the implementation at the other end sends a window notification only when the available window goes from zero to nonzero, and then holds off on further window notification until some data is received, the TCP connection stalls. Eventually the zero window probe mechanism gets things going again, but you typically get only one window's worth of data per zero window timeout interval, which slows things down to a crawl. This is strictly speaking a bug in the receiver end; 4.3BSD is not wrong here. This one you have to fix at the other end.

We have observed this bug when talking to our UNET implementation (which we fixed) and Imagen laser printers (about which we informed Geof Cooper at Imagen). A work-around for Imagen users is to reconfigure the software with a maximum IP datagram size of 576 instead of the default size based on Ethernet packet size. This works because it negotiates down the maximum TCP segment size on the connection, making 4.3BSD act on the Imagen's window updates of 1K or so.

Incidentally, when studying problems like this, the "trpt(8C)" program, which edits the headers saved by socket-level debugging, is immensely useful. The program can also be easily modified to print out any other fields of interest in the TCP control tables.

                    John Nagle

===REPOSTING OF BUG FIXES===

In response to popular demand, I am sending out two fixes to 4.3BSD (beta release).
Fix #1 affects interoperability with non-4.xBSD systems, apparently including TOPS-20 machines. Fix #2 reduces network congestion on long-haul nets. (Yes, yet another of Nagle's continuing attempts to get network congestion under control.) The effect of #2 is substantial; in some situations, an order of magnitude improvement in file transfer speeds will be observed.

With these in, 4.3BSD TCP behaves quite well. In 4.3, all the right machinery is there, but there are a few easily-fixed bugs. These fixes are going out via several routes (net.bugs.4bsd, the Berkeley buglist, and to some key individuals) because they have a marked effect on interoperability and Internet performance.

                    John Nagle

===============================================================================
Index:  sys/netinet/tcp_input.c 4.3BSD-beta Fix

Description:
    TCP connections to some non-BSD systems open, but will not accept data from the remote system. Known problem when trying to open connections to TOPS-20 systems.

    The "advertised window", rcv_adv, was not initialized during connection synchronization. Also, one comparison on sequence numbers was made incorrectly, using a difference of unsigned values, which in C is always positive(!).

                    John Nagle

Repeat-By:
    Try to establish a TCP connection with a system which sets the high bit in the TCP sequence number. (A 4.3BSD system which has been up for more than 195 days will do this, or you can change the initial value of tcp_iss to some value with the high bit set.)

Fix:
tcp_input.c
327a328,329
>    * Be careful with arithmetic here; differences of sequence
>    * numbers compare in unexpected ways.  Hence the (int) cast.
329c331
<   tp->rcv_wnd = MAX(sbspace(&so->so_rcv), tp->rcv_adv - tp->rcv_nxt);
---
>   tp->rcv_wnd = MAX(sbspace(&so->so_rcv),(int)(tp->rcv_adv-tp->rcv_nxt));
tcp_seq.h:
22a23
>  * Note that our rcv_adv variable needs to be initialized too.
25c26
<   (tp)->rcv_nxt = (tp)->irs + 1
---
>   (tp)->rcv_adv = (tp)->rcv_nxt = (tp)->irs + 1
===============================================================================
Index:  ucb/netinet/tcp_timer.c 4.3BSD-beta Fix

Description:
    Excessive retransmissions on long-haul nets. Serious congestion in Internet gateways. File transfer speeds under 10% of expected values over 9600 baud point-to-point links. Angry network managers.

    The basic machinery is right but some of the special cases are wrong, resulting in bad host behavior on slow links. Several problems combine to result in very short retransmit intervals:

    1) The smoothed round-trip time is zero until the first successful round-trip without retransmission. If there is a retransmission of the first packet, the zero value is actually used to compute the round-trip time, resulting in a minimum retransmission time.

    2) The standard backoff algorithm not only backs off rather slowly, but due to an incorrect calculation, the first retransmit interval is 2.0*t_srtt, but the second is only 1.0*t_srtt, and not until retransmit #4 or so does the retransmit time get back up to 2*t_srtt. The supplied "experimental" backoff algorithm backs off at rate 2**n, which reduces retransmits under overload conditions.

                    John Nagle

Repeat-By:
    Connect two 4.3BSD systems via a 9600 baud DMR link. Try a big file transfer with ftp(I). Be prepared for a long wait.

Fix:
tcp_timer.c
112c112
< int   tcpexprexmtbackoff = 0;
---
> int   tcpexprexmtbackoff = 1;     /* use exponential backoff if 1 */
154a155,169
>       /*
>        * Calculate retransmit timer for non-first try.
>        * Start with the same value used for the first retransmit.
>        * Then use either the table tcp_backoff to scale this up
>        * based on the number of retransmits, or if the patchable
>        * flag tcpexprexmtbackoff is set, just multiply it by
>        * 2**number of retransmits.
>        * If t_srtt is zero when we get here, we have never
>        * had a successful round-trip and are already retransmitting,
>        * which indicates trouble, so we apply a larger initial guess
>        * for the round-trip time.  This prevents serious network
>        * overload when talking to faraway hosts, especially when
>        * they aren't answering.
>        */
>       if (tp->t_srtt == 0) tp->t_srtt = TCPTV_SRTTRTRAN;
156c171
<           (int)tp->t_srtt, TCPTV_MIN, TCPTV_MAX);
---
>           (int)(tcp_beta * tp->t_srtt), TCPTV_MIN, TCPTV_MAX);
tcp_timer.h:
60a61,62
> #define TCPTV_SRTTRTRAN ( 10*PR_SLOWHZ)     /* base roundtrip time if retran
>                                                before 1st good roundtrip */
===============================================================================
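The retransmit-timer change can be sketched as a standalone function, with several assumptions. Only the fallback of 10*PR_SLOWHZ comes from the patch. The tick rate (2 per second), the clamp bounds (1 and 30 seconds) and the value 2.0 for tcp_beta are illustrative guesses, and all TOY_/toy_ names are hypothetical. The table-driven tcp_backoff path is left out; only the 2**n branch is shown.

```c
#include <assert.h>

/* Illustrative values in slow-timer ticks; only TOY_SRTTRTRAN mirrors the patch. */
#define TOY_SLOWHZ     2                    /* assumed ticks per second */
#define TOY_TV_MIN     (1 * TOY_SLOWHZ)     /* assumed lower clamp */
#define TOY_TV_MAX     (30 * TOY_SLOWHZ)    /* assumed upper clamp */
#define TOY_SRTTRTRAN  (10 * TOY_SLOWHZ)    /* fallback RTT, as in the patch */

static const double toy_beta = 2.0;         /* assumed value of tcp_beta */

/* Clamp v into the closed range [lo, hi]. */
static int toy_clamp(int v, int lo, int hi)
{
    if (v < lo) return lo;
    if (v > hi) return hi;
    return v;
}

/*
 * Retransmit interval in ticks after nrexmt retransmissions (nrexmt >= 1),
 * using 2**nrexmt backoff. If *srtt is still zero, it is replaced in place
 * by the larger initial guess before the interval is computed.
 */
static int toy_rexmt_ticks(double *srtt, int nrexmt)
{
    assert(srtt != 0 && nrexmt >= 1);
    if (*srtt == 0.0)
        *srtt = TOY_SRTTRTRAN;
    int base = toy_clamp((int)(toy_beta * *srtt), TOY_TV_MIN, TOY_TV_MAX);
    if (nrexmt > 16)
        nrexmt = 16;                        /* keep the shift well-defined */
    return toy_clamp(base << nrexmt, TOY_TV_MIN, TOY_TV_MAX);
}
```

Under these assumed constants, a smoothed round-trip time of 1 tick gives intervals of 4, 8 and 16 ticks for the first three retransmissions, and the result saturates at the upper clamp after that. A zero round-trip time is first replaced by the 20-tick guess, so the interval no longer collapses to the minimum.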