craig@LOKI.BBN.COM.UUCP (12/07/86)
I'm in the midst of doing comparisons between an RDP implementation and the 4.2/4.3 TCP implementations and have run into a problem which I'm hoping someone else can shed light on.

I'm running tests on two machines, a VAX 750 running 4.3 and a SUN workstation running 4.2. The two machines are on the same Ethernet and use the same gateway. If I set up an experiment to test behaviour over paths with long network delays (for example, bouncing packets off Goonhilly), the TCP connections are established and then typically fail part way through the transfer. I don't understand this because the RDP connections work just fine, and typically complete in 1/4 the time it takes for a TCP connection to send about 20% of the data and faint.

The experiment generally involves passing 50-100 segments of anywhere from 64 to 1024 bytes to the protocols to send. This is on weekends so the delays aren't that long.

The question I'm trying to answer is whether the problem is in the RDP implementation (what anti-social things could it be doing to maintain that connection?), or the TCP implementation (what might it be doing wrong to die where another implementation succeeds?).

If I can, I'd like to discourage invective. I'm simply trying to figure out why this is happening so I can identify and fix the problem and do a comparison between the two implementations/protocols. (And soon -- hair pulling over this problem is beginning to threaten the health of my scalp and beard).

General information on the RDP implementation: it will retransmit up to 10 times and calculates the round-trip time based on the first packet sent with the caveat that it ignores round-trip times of segments with sequence numbers lower than those of segments whose round-trip time has already been computed (this feature is an experiment which may not stay). The maximum RTT is 2 minutes, the minimum is 2 seconds.

Craig
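The RTT sampling rule described in the last paragraph of Craig's message can be sketched roughly as below. This is only an illustration in C, not the RDP implementation under test: the struct, the function name rdp_rtt_sample, and the millisecond units are hypothetical, sequence-number wraparound is ignored, and any smoothing of samples is left out because the message does not describe it. Only the 2 second and 2 minute bounds and the "ignore lower sequence numbers" rule come from the message.

```c
#include <stdint.h>

#define RDP_RTT_MIN_MS   2000u    /* 2 seconds, from the message */
#define RDP_RTT_MAX_MS 120000u    /* 2 minutes, from the message */

/* Hypothetical per-connection RTT state. */
struct rdp_rtt {
    uint32_t highest_timed_seq;  /* highest sequence number timed so far */
    uint32_t rtt_ms;             /* current estimate, kept within bounds */
    int      have_sample;        /* 0 until the first sample is accepted */
};

/*
 * Offer one RTT measurement for segment `seq`.
 * Returns 1 if the sample was used, 0 if it was ignored because a
 * segment with a higher sequence number has already been timed.
 */
static int rdp_rtt_sample(struct rdp_rtt *s, uint32_t seq, uint32_t sample_ms)
{
    if (s->have_sample && seq < s->highest_timed_seq)
        return 0;                          /* stale segment: ignore */
    if (sample_ms < RDP_RTT_MIN_MS)
        sample_ms = RDP_RTT_MIN_MS;        /* clamp to the 2 s floor */
    if (sample_ms > RDP_RTT_MAX_MS)
        sample_ms = RDP_RTT_MAX_MS;        /* clamp to the 2 min ceiling */
    s->rtt_ms = sample_ms;
    s->highest_timed_seq = seq;
    s->have_sample = 1;
    return 1;
}
```

Under these assumptions a late measurement for an old segment cannot pull the estimate back down once a newer segment has been timed, which is the experimental behaviour the message mentions.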
hedrick@TOPAZ.RUTGERS.EDU.UUCP (12/07/86)
I suspect that what you are seeing is the fact that Sun TCP (to be fair, most 4.2 based TCPs) doesn't perform well with bad connections. A number of sites have replaced Sun's TCP modules with their 4.3 equivalents. This makes a dramatic difference in dealing with difficult cases.
van@LBL-CSAM.ARPA.UUCP (12/07/86)
What you observe is probably poor tcp behavior, not antisocial rdp behavior. If the link is lossy or the mean round trip time is greater than 15 seconds, the 4.3bsd tcp throughput degrades rapidly. For long transfers, a link that gives 2.7KB/s throughput with a 1% loss rate, gives 0.07KB/s throughput with a 10% loss rate. (As appalling as this looks, 4.2bsd, TOPS-20 tcp, and some other major implementations that I've measured, get worse faster. The 4.3 behavior was the best of everything I looked at.)

I know some of the reasons for the degradation. As one might expect, the failure seems to be due to the cumulative effect of a number of small things. Here's a list, in roughly the order that they might bear on your experiment.

1. There is a kernel bug that causes IP fragments to be generated and ip fragments have only a 7.5s TTL.

In the distribution 4.3bsd, there is a bug in the routine in_localaddr that makes it say all addresses are "local". In most cases, this makes tcp use a 1k mss which results in a lot of ip fragmentation. On high loss or long delay circuits, a lot of the tcp traffic gets timed out and discarded at the destination's ip level. The bug fix is to change the line:

    if (net == subnetsarelocal ? ia->ia_net : ia->ia_subnet)

in netinet/in.c to

    if (net == (subnetsarelocal ? ia->ia_net : ia->ia_subnet))

I also changed IPFRAGTTL in ip.h to two minutes (from 15 to 240) because we have more memory than net bandwidth.

2. The retransmit timer is clamped at 30s.

The 4.3 tcp was put together before the arpanet went to hell and has some optimistic assumptions about time. Since the retransmit timer is set to 2 * RTT, an RTT > 15s is treated as 15s. (Last week, the mean daytime rtt from LBL to UCB was 17s.) On a circuit with 2min rtt, most packets would be transmitted four times and the protocol pipelining would be effectively turned off (if 4.3 is retransmitting, it only sends one segment rather than filling the window).
When running in this mode, you're very sensitive to loss since each dropped packet or ack effectively uses up 4 of your 12 retries.

I would at least change TCPTV_MAX in netinet/tcp_timer.h to a more realistic value, say 5 minutes (remembering to adjust related timers like MSL proportionally). I changed the TCPT_RANGESET macro to ignore the maximum value because I couldn't see any justification for a clamp.

3. It takes a long time for tcp to learn the rtt.

I've harped on this before. With the default 4k socket buffers and a 512 byte mss, 4.3 tcp will only try to measure the rtt of every 8th packet. It will get a measurement only if that packet and its 7 predecessors are transmitted and acked without error. Based on trpt trace data, tcp gets the rtt of only one in every 80 packets on a link with a 5% drop rate. Then, because of the gross filtering suggested in rfc793, only 10% of the new measurement is used. For a 15s rtt, this means it takes at least 400 packets to get the estimate from the default 3s to 7.5s (where you stop doing unnecessary retransmits for segments with average delay) and 1700 packets to get the estimate to 14s (where you stop unnecessary retransmits because of variance in the delay). Also, if the minimum delay is greater than 6s (2*TCPTV_SRTTDFLT), tcp can never learn the rtt because there will always be a retransmit canceling the measurement.

There are several things we want to try to improve this situation. I won't suggest anything until we've done some experiments. But, the problem becomes easier to live with if you pick a larger value for TCPTV_SRTTDFLT, say 6s, and improve the transient response in the srtt filter (lower TCP_ALPHA to, say, .7).

4. The retransmit backoff is wimpy.

Given that most of the links are congested and exhibit a lot of variance in delay, you would like the retransmit timer to back off pretty aggressively, particularly given the lousy rtt estimates. 4.3 backs off linearly most of the time.
The actual intervals, in units of 2*rtt, are:

    1 1 2 4 6 8 10 15 30 30 30 ...

While this is only linear up to 10, the 30s clamp on timers means you never back off as far as 10 if the mean rtt is >1.5s. The effect of this slow backoff is to use up a lot of your potential retries early in a service interruption. E.g., a 2 minute outage when you think the rtt is 3s will cost you 9 of your 12 retries. If the outage happens while you were trying to retransmit, you probably won't survive it.

This is another area where we want to do some experiments. It seems to me that you want to back off aggressively early on, say 1 4 8 16 ... for the first part of the table. It also seems like you want to go linear or constant at some point; waiting 8192*rtt for the 12th retry has to be pointless. The dynamic range depends to some extent on how good your rtt estimator is and on how robust the retransmit part of your tcp code is. Also, based on some modelling of gateway congestion that I did recently, you don't want the retransmit time to be deterministic. Our first cut here will probably look a lot like the backoff on an ethernet.

5. "keepalive" ignores rtt.

If you are setting SO_KEEPALIVE on any of your sockets, the connection will be aborted if there's no inbound packet for 6 minutes (TCPTV_MAXIDLE). With a 2 minute rtt, that could happen in the worst case with one dropped packet followed by one dropped ack. ("Sendmail" sets keepalive and we were having a lot of problems with this when we first brought up 4.3.) A fix is to multiply by t_srtt when setting the keepalive timer and divide t_idle by t_srtt when comparing against MAXIDLE.

6. The initial retransmit of a dropped segment happens, at best, after 3*rtt rather than 2*rtt.

If the delay is large compared to the window, the steady state traffic looks like a burst of acks interleaved with data, an ~rtt delay, a burst of acks interleaved with data and repeat. 4.3 doesn't time individual segments.
It starts a 2*rtt timer for the first segment, then, when the first segment is acked, restarts the timer at 2*rtt to time the next segment. Since the 2nd segment went out at approximately the same time as the first and since the ack for the first segment took rtt to come back, the retransmit time for the 2nd segment is 3*rtt. In the usual internet case of 4k windows and an mss of 512, the probability of a loss taking 3*rtt to detect is 7/8.

The situation is actually worse than this on lossy circuits. Because segments are not individually timed, all retransmits will be timed 2*rtt from the last successful transfer (i.e., the last ack that moved snd_una). This tends to add the time taken by previous retransmissions into the retransmission time of the current segment, increasing the mean rexmit time and, thus, lowering the average throughput. On a link with a 5% loss rate, for long transfers, I've measured the mean time to retransmit a segment as ~10*rtt.

The preceding may not be clear without a picture (it sure took me a long time to figure out what was going on) but I'll try to give an example. Say that the window is 4 segments, the rtt is R, you want to ship segments A-G and segments B and D are going to get dropped. At time zero you spit out A B C D. At time R you get back the ack for A, set the retransmit timer to go off at 3R ("now" + 2*rtt), and spit out E. At 3R the timer goes off and you retransmit B. At 4R you get back an ack for C, set the retransmit timer to go off at 6R and transmit F G. At 6R the timer goes off, you retransmit D. [D should have been retransmitted at 2R.] Even if we count the retransmit of B delaying everything by 2R (in what is essentially a congestion control measure), there is an extra 2R added to D's retransmit because its retransmit time is slaved to B's ack. Also note that the average throughput has gone from 8 packets in 2R (if no loss) to 8 packets in 7R, a factor of four degradation.
The obvious fix here is to time each segment. Unfortunately, this would add 14 bytes to a tcpcb which would then no longer fit in an mbuf. So, we're still trying to decide what to do. It's (barely) possible to live within the space limitations by, say, timing the first and last segments and assuming the segments were generated at a uniform rate.

7. The retransmit policy could be better.

In the preceding example, you might have wondered why F G were shipped after the ack for C rather than D. If I'd changed the example so that C was dropped rather than D, C D E F would have been shipped when the ack for B came in (unnecessarily resending D and E). In either case the behavior is "wrong". The reason it happens is because an ack after a retransmit is treated the same way as a normal ack. I.e., because of data that might be in transit you ignore what the ack tells you to send next and just use it to open the window. But, because the ack after a retransmit comes 3*rtt after the last new data was injected, the two sides are essentially in sync and the ack usually does tell you what to send next.

It's pretty clear what the retransmit policy should be. We haven't even started looking into the details of implementing that policy in tcp_input.c & tcp_output.c. If a grad student would like a real interesting project ...

------------

There's more but you're probably as tired of reading as I am of writing. If none of this helps and if you have any Sun-3s handy, I can probably send you a copy of my tcp monitor (as long as our lawyers don't find out). This is something like "etherfind" except it prints out timestamps and all the tcp protocol info. You'll have to agree to post anything interesting you find out though...

Good luck.

- Van
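Point 3 of the message above (how slowly the rfc793 filter learns a long rtt) can be checked with a few lines of arithmetic. The sketch below is only an illustration and is not taken from the 4.3 sources: the function name samples_to_reach is hypothetical, it works in floating-point seconds rather than kernel ticks, and it assumes every measurement returns exactly the true rtt. It only handles the case described in the message, where the estimate starts below the target.

```c
/*
 * Count how many rtt measurements the smoothing filter
 *     srtt = alpha * srtt + (1 - alpha) * measured
 * needs before srtt rises to `target`, starting from `start` and fed a
 * constant `measured` value (all in seconds).
 * Returns -1 if the target is not reached within `limit` measurements.
 */
static int samples_to_reach(double start, double measured,
                            double alpha, double target, int limit)
{
    double srtt = start;
    int n;

    for (n = 0; n < limit; n++) {
        if (srtt >= target)
            return n;          /* reached after n measurements */
        srtt = alpha * srtt + (1.0 - alpha) * measured;
    }
    return srtt >= target ? limit : -1;
}
```

With a 3s initial estimate, a 15s true rtt and alpha = 0.9 (only 10% of each new measurement used), samples_to_reach(3.0, 15.0, 0.9, 7.5, 1000) gives 5 and samples_to_reach(3.0, 15.0, 0.9, 14.0, 1000) gives 24. At one successful measurement per 80 packets, that is about 400 and roughly 1900 packets, the same order as the 400 and 1700 quoted in the message. Lowering alpha to 0.7 brings the counts down to 2 and 7 measurements under the same assumptions.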
walsh@HARVARD.HARVARD.EDU.UUCP (12/08/86)
By having a maximum RTT of 2+ minutes, your RDP connection will stay open at times when the Berkeley UNIX system will arbitrarily close the connection after 30 seconds (rather than just informing the application about the problem). You are also seeing the benefits of EACKs. bob
braden@ISI.EDU.UUCP (12/08/86)
Craig:

The amount of your hair pulling must be small compared to the time integral of hair pulled by our UCL friends over the years. Quite simply, their conclusion was that most TCP implementations have design problems that make them behave poorly over paths which have very long delay and moderate to high loss. SATNET sometimes (often?) exhibits that behaviour. Recently, the ARPANET+core_gateway system has also exhibited that behaviour, and many TCPs have not been up to it... lots of broken connections, etc.

I suggest that the cause of this situation is a performance/robustness tradeoff inherent to TCP implementations. Most of the currently-available TCPs have been implemented and tested in a LAN environment, to provide optimal performance in a low-delay, low-error situation. On the other hand, when we wrote the original experimental implementations of TCP, we found the little beasties to be amazingly robust; they would tenaciously hold on for minutes (or hours!) retransmitting until a path came back, and would get the data through in spite of terrible bugs. But we were writing and testing them for equally-experimental gateway implementations and frequently testing to UCL, and did not demand high throughput or low delay.

It would certainly be interesting to understand exactly how these TCPs have failed. I suspect it is a combination of a Zhang-catastrophe (RTT measurement diverging towards infinity due to high loss rate) with an implementation-imposed upper bound on retransmission time before the connection breaks. On the other hand, the answer may be that selective retransmission is really absolutely essential to deal with the long-delay, lossy situation. I would like to get someone interested in running some experiments on this (maybe you just did??)

Would it be possible for you to disable just the selective retransmission feature of RDP and try again?

Bob Braden
karels%okeeffe@UCBVAX.BERKELEY.EDU (Mike Karels) (12/08/86)
Bob, The timeout for TCP on 4.2/3 is rather longer than 30 sec. The actual time depends on the round-trip time, as the limit is on the number of retransmissions. On 4.3, the timeout is at least 108 sec. with short RTT's. The limit on 4.2 was nearer 45 sec. As Van Jacobson says in his message, the keepalive time will have to be adjusted for long RTT's as well, but I doubt that Craig is using the keepalive timer. How can the TCP "just" inform the application about the problem? Unless there's a control channel to the application that allows the passage of status data, a send call must return an error. After that error, the application can't tell how much of the data, if any, was transmitted. "Reliable byte stream with possible gaps in case of error" isn't very satisfying. Mike
walsh@HARVARD.HARVARD.EDU (Bob Walsh) (12/08/86)
Mike,

> How can the TCP "just" inform the application about the problem?
> Unless there's a control channel to the application that allows
> the passage of status data, a send call must return an error.
> After that error, the application can't tell how much of the data,
> if any, was transmitted. "Reliable byte stream with possible gaps
> in case of error" isn't very satisfying.

Informing the application that the networking system is having trouble getting acknowledgements does not mean that the networking system has given up on sending that data. My intent was to point out that such a decision should be left up to the application, which may in turn defer the decision to the user. This was one of the qualities of the BBN TCP/IP software, as you know. It is one of the reasons Bob Gilligan used the BBN software for his demos.

Whether it is 30, 45 or 108 seconds doesn't matter. What does matter is that Craig is preserving the RDP connection for a longer time under such circumstances and therefore is less likely to see the connection fail. There is also the point of extended acknowledgements.

Bob Walsh
van@LBL-CSAM.ARPA.UUCP (12/08/86)
I've been told that my 6th & 7th points (4.3bsd retransmit timers need some work) were incomprehensible. That's what you get when you reply to messages at 3am Sunday morning. Since the retransmit timer behavior results in the biggest performance loss (the other problems affect congestion & stability more than performance), I'll take a crack at explaining it better.

Attached is a picture of the problem. It is taken directly from a trace but the window size has been reduced from 8*MSS to 4*MSS to simplify the drawing. Time runs down the page. The time axis has tick marks at multiples of the round trip time, R. The sender is on the left, the receiver on the right. Seven segments are sent, labeled A through G. Two segments, B and D, get lost or damaged in transit. A lower case letter is used for a receiver's ack (e.g., "g" is the ack for all bytes up to and including the last byte of segment "G"). A list of all the segments successfully received so far is in square brackets at the point where each ack is generated. Holes in the sequence space are indicated by "-".

All the traffic goes one direction (this was an ftp). 4.3 almost always sends MSS byte segments and all these were of size MSS (512B). Because of the 4.3 delayed ack code, the receiver almost always reports a full size window (4KB in 4.3, 2KB in this example) in an ack. All these acks report a 4 MSS (2KB) window.

All sends are timed. The retransmit timer is set to 2 times the smoothed round trip time (TCP_BETA * t_srtt). The timer is set on each ack that's not a duplicate of a previous ack (i.e., that changes the "sent but unacknowledged" pointer, snd_una). If the timer times out, the segment starting at snd_una is retransmitted and the timer is restarted at 2*srtt. Exactly one segment is retransmitted. Periodic retransmissions of that segment continue until it is acked.
When the segment is acked, the retransmit timer is set to 2*srtt and "normal" behavior resumes (see rfc793 if you're not sure what normal behavior is).

 0-| A                       (set timer to 2R, send enough
   | B\                       packets to fill window (4))
   | C\\
   | D\\\
   |  \\*\
   |   *\ a [A]              (ack A)
   |     X
   |    / a [A - C]          (save C but can only ack through A)
   |   / /
   |  / /
1R-| E /                     (A ack received, set timer to 3R,
   |  \                       ack opens window by 1 so send E)
   | - \                     (duplicate A ack discarded)
   |    \
   |     \
   |      a [A - C - E]      (save E but can only ack through A)
   |     /
   |    /
   |   /
   |  /
2R-| -                       (duplicate A ack discarded)
   |
   |
   |
   |
   |
   |
   |
   |
   |
3R-| B                       (timer goes off, rexmit first
   |  \                       unacked segment (B), timer set to 5R)
   |   \
   |    \
   |     \
   |      c [A B C - E]      ("B" fills in sequence space up
   |     /                    through "C", ack C)
   |    /
   |   /
   |  /
4R-| F                       (ack of C opens window for 2 more
   | G\                       segments, timer set to 6R)
   |  \\
   |   \\
   |    \\
   |     \c [A B C - E F]    (missing D, can only ack through C)
   |     /c [A B C - E F G]
   |    //
   |   //
   |  //
5R-| -/                      (duplicate acks for C discarded)
   | -
   |
   |
   |
   |
   |
   |
   |
   |
6R-| D                       (timer goes off, rexmit first
   |  \                       unacked segment (D), timer set 8R)
   |   \
   |    \
   |     \
   |      g [A B C D E F G]  (sequence space complete, ack G)
   |     /
   |    /
   |   /
   |  /
7R-|

There are two problems here: the gap between 2R & 3R and the fact that we don't send D at 4R.

The idle time from 2-3 (and from 5-6) happens because our timer is always 2*R from the last useful ack and is essentially unrelated to when a segment is originally sent (The code wasn't intended to work this way and on low delay circuits it works correctly.) We should really be retransmitting B 2*R from its first transmission (i.e., 1 line after the 2R tick mark). It's not too hard to show analytically that this (the current 4.3 algorithm) "feeds forward" (e.g., the recovery for D is moved later in time and is more likely to conflict with F,G recovery) which is why throughput degrades much faster than linearly with increasing loss rate.

You can view the late transmission of D two ways. It could be another example of the timer problem.
I.e., we should have retransmitted D 3 lines after the 2R tick. We held off sending it then because we thought the network might be congested and we wanted to send a minimum amount of data until we got back an indication (the ack) that the congestion had cleared up. But we certainly should have sent D at 4R when we got the "c" ack. Or you can say that when we get the "c" ack after the retransmission of B, no packets have been injected into the network for 2*R. The ack tells you pretty clearly that the receiver is missing D. (Either point of view will do the "right" thing in this case but treating a retransmit ack specially buys you a bit in one other case).

If the two problems are corrected, the total time drops from 7R to 4R (2R is the total time if no packets are lost). If we don't do the send-1-packet-on-rexmit congestion control, the total time drops to 3R, the minimum possible if one or more packets is dropped.

Also, this partly illustrates why I thought Craig's measurements demonstrated a problem in TCP rather than the superiority of RDP. Even with EACKs, it takes RDP 3R to send the data if the same two packets are lost, exactly the same time it takes TCP. I think I can show that EACKs aren't a big win until the drop rate is >50%, if TCP is working as well as it can (that's not to say RDP isn't a win for other reasons).

- Van
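The difference between the two timer policies discussed in this message can be put into a few lines of C. This is a sketch for illustration only and is not 4.3bsd code: the function names are hypothetical, and times are expressed in tenths of R so that they line up with the rows of the trace (ten rows per R).

```c
/* All times are in tenths of the round trip time R (10 rows of the
 * trace per R). */
#define R_TENTHS 10

/* Behaviour described for 4.3: the timer is restarted at 2*R from the
 * last ack that advanced snd_una, whatever segment is outstanding. */
static int rexmit_time_from_last_ack(int last_useful_ack)
{
    return last_useful_ack + 2 * R_TENTHS;
}

/* Per-segment timing: the timer runs 2*R from the segment's own
 * first transmission. */
static int rexmit_time_per_segment(int first_sent)
{
    return first_sent + 2 * R_TENTHS;
}

/* Extra wait a lost segment suffers because its retransmit time is
 * tied to the last useful ack instead of its own send time. */
static int extra_wait(int first_sent, int last_useful_ack)
{
    return rexmit_time_from_last_ack(last_useful_ack)
         - rexmit_time_per_segment(first_sent);
}
```

For the trace, B is first sent one row after time 0 and the last useful ack before its retransmission arrives at 1R, so the two policies give 30 (3R) and 21 (one line after the 2R tick). D is first sent three rows after time 0 and the last useful ack arrives at 4R, giving 60 (6R) against 23 (three lines after the 2R tick).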
gds@EDDIE.MIT.EDU.UUCP (12/12/86)
> Mike,
>     How can the TCP "just" inform the application about the problem?
>     Unless there's a control channel to the application that allows
>     the passage of status data, a send call must return an error.
>     After that error, the application can't tell how much of the data,
>     if any, was transmitted. "Reliable byte stream with possible gaps
>     in case of error" isn't very satisfying.
>
> Informing the application that the networking system is having trouble
> getting acknowledgements does not mean that the networking system has
> given up on sending that data. My intent was to point out that such a
> decision should be left up to the application, which may in turn defer
> the decision to the user. This was one of the qualities of the BBN
> TCP/IP software, as you know. It is one of the reasons Bob Gilligan used
> the BBN software for his demos.
>

There are some Unix applications, like the 4.2 version of telnet, which do a close() if they get something like ETIMEDOUT, which can occur if a TCP timer has gone off. In the 4.2 BBN TCP/IP, the timer going off does not cause the connection to be closed. Instead, a routine called advise_user lets the higher layers know about the problem and lets them deal with it.

I was rather surprised to see that 4.2 telnet was giving up when the actual connection was not remotely closed. With a quick patch to telnet to prevent it from closing for errors when a TCP timer has gone off, you can maintain telnet connections for hours. The reconstitution protocols required that connections remain open for long periods of time during network dynamics.

I haven't looked at 4.3 telnet or TCP to see if this is fixed or handled differently.

--gregbo
rhc@hplb.CSNET.UUCP (12/12/86)
I would like to reinforce Bob's concern at the Internet layer. As an ex-UCL person I am also concerned at the current tradeoff in TCP implementations towards LAN-specific high performance and away from Internet-general robustness.

The maximum packet size on SATNET is 256 bytes. If you are sending large (>576) TCP packets then the gateway to the ARPANET will fragment, then the ARPANET/SATNET gateway will fragment each of these again. The result is a large number of fragments bursting onto SATNET, pinging off Goonhilly (why do you have to use the busiest European earth station anyway?), and then each fragment tries to get back across the ARPANET. At the SATNET/ARPANET gateway we have the ARPANET flow control problem. It is quite likely that every packet sent from the host has turned into at least 4 and possibly 6 packets for the return!

Have you considered your reassembly timeout? I have seen packets hang around in these gateways for 16 minutes (yes, that's right, MINUTES). After a little while the gateway queues will fill up and the TCP will time out because the IP reassembly is not able to get enough fragments to build enough packets to keep the TCP happy.

Now if the TCP were able to use the same IP packet number for all retransmissions then the situation would improve - and I suspect that this is where RDP is winning! Just because the TCP connection is failing does not mean that the TCP layer is the only one that is broken.

Happy satellites,

Robert.
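The "at least 4 and possibly 6" figure above can be reproduced with a small fragment-counting sketch. It is an illustration under stated assumptions, not a description of any gateway's code: the function names are hypothetical, the IP header is assumed to be 20 bytes with no options, and 576 and 256 are treated as the maximum IP datagram sizes on the two hops. The only protocol fact relied on is that IP fragment offsets count 8-byte units, so every fragment except the last carries a multiple of 8 data bytes.

```c
#define IP_HDR 20   /* assumed: IP header without options */

/* Largest payload a non-final fragment can carry on a link with the
 * given MTU: fragment offsets count 8-byte units, so round down. */
static int frag_payload(int mtu)
{
    return ((mtu - IP_HDR) / 8) * 8;
}

/* Number of fragments produced when a datagram of total_len bytes
 * (header included) crosses one link with the given MTU. */
static int frag_count(int total_len, int mtu)
{
    int payload = total_len - IP_HDR;
    int per = frag_payload(mtu);

    if (total_len <= mtu)
        return 1;                       /* fits, no fragmentation */
    return (payload + per - 1) / per;   /* ceiling division */
}

/* Fragments arriving after two hops with mtu1 then mtu2, with no
 * reassembly in between: each first-hop fragment is split again. */
static int frag_count_two_hops(int total_len, int mtu1, int mtu2)
{
    int remaining = total_len - IP_HDR;
    int per1 = frag_payload(mtu1);
    int total = 0;

    if (total_len <= mtu1)
        return frag_count(total_len, mtu2);
    while (remaining > 0) {
        int chunk = remaining < per1 ? remaining : per1;
        total += frag_count(chunk + IP_HDR, mtu2);
        remaining -= chunk;
    }
    return total;
}
```

Under these assumptions a 600-byte datagram (just over 576) becomes 4 fragments on the 256-byte hop, and a 1064-byte datagram (1024 data bytes plus 20-byte TCP and IP headers, the top of Craig's range) becomes 6. Losing any one of them forces the whole datagram to be resent.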