karn@JUPITER.BELLCORE.COM.UUCP (04/08/87)
I'd like to describe a TCP retransmission algorithm that I've been using with excellent results for the last several months.

Background

TCP has the following well-known problem. When an ACK finally returns for a segment that was retransmitted, you don't know which transmission caused it. Therefore you can't be certain what the round trip time (RTT) for this packet was. Different implementations react differently to this problem:

1. Some assume that the first transmission(s) was/were lost and restart the round trip timer on each retransmission.

2. Others ignore the measurement (since it isn't reliable), keeping their previous estimates of the round trip time.

Both approaches share a serious problem. Consider the case where the retransmission occurred because the smoothed round trip time (SRTT) was too small, not because packets are being lost. If approach #1 is used, the returning ACK is in fact responding to the FIRST transmission, so timing from a later transmission gives a measurement that is too small. Because the measurements are erroneously small, SRTT never adapts to its true value; because SRTT never adapts to its true value, unnecessary retransmissions keep happening, resulting in erroneously small RTT measurements. This vicious circle can go on essentially forever. If the measurement is instead rejected as unreliable (approach #2), SRTT is never updated, and again unnecessary retransmissions go on forever. Believe me, I've seen more than my share of TCPs behave this way.

The currently accepted way out of this problem is to assume the worst whenever a timeout occurs by measuring the RTT from the FIRST transmission of a given segment. If network loading suddenly increases, or if the initial estimate of the round trip time was too low, there will be a few retransmissions, but the correct value *will* be found; the TCP eventually learns and unnecessary retransmissions stop. As long as the network packet loss rate is low, this approach works quite well.
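[For reference, the smoother and timeout computation under discussion are the standard RFC 793 ones. A minimal sketch in Python, using the RFC's suggested example constants; names are illustrative, not from any particular TCP:]

```python
# Classic RFC 793 retransmission timeout computation (sketch).
ALPHA = 0.9   # smoothing gain: weight given to the old estimate
BETA = 2.0    # delay variance factor used to derive the timeout

def update_srtt(srtt, rtt_sample):
    """Feed one round trip measurement into the smoothed estimate."""
    return ALPHA * srtt + (1 - ALPHA) * rtt_sample

def rto(srtt):
    """Retransmission timeout derived from the smoothed estimate."""
    return BETA * srtt
```

[Everything in this thread turns on which rtt_sample values are allowed into update_srtt.]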
Retransmissions caused by the occasional dropped packet result in erroneously large round trip time measurements being fed into the smoother, but this is better than erring on the low side. Since few packets are being dropped, subsequent packets will most likely bring the SRTT value back down to its "real" value long before the next packet is dropped, so the timeout won't waste too much time.

A serious problem develops, however, when the network is very lossy. A "raw" amateur packet radio channel (no link level acknowledgments) often loses 25% or more of the packets sent due to collisions and poor signal levels. Such rates drive this algorithm crazy, especially when consecutive packets are lost. In this situation, retransmission intervals are increasing rapidly due to back-off, and these intervals are added to the total round trip measurement for the segment. Enough such measurements can drive the smoothed round trip time estimate right through the roof. This is especially true when a nonlinear smoothing function is used that reacts more quickly to increases in round trip delay than to decreases (e.g., Mills, RFC-889).

Some people might say that the answer is simple: just constrain the round trip time to some arbitrary limit, say, 30 or 60 seconds. But there are real problems with this. Just how do you pick this "arbitrary" value? Through measurement? Just like the Dow Jones average, any number you see today is almost guaranteed to be exceeded tomorrow. If the delay through the Internet exceeds your arbitrary timeout limit, you start retransmitting unnecessarily, further adding to whatever it is that is causing the large delay in the first place. Perhaps we need robots to stand in the office of every protocol hacker:

WARNING WARNING DANGER WILL ROBINSON! ARBITRARY PROTOCOL TIMEOUT DETECTED! EVENTUAL CONGESTION COLLAPSE IS NOW GUARANTEED! (SOONER OR LATER)

A Better Way

Thinking there *had* to be a better way, I devised and implemented the following rules:

1. If a segment being ACKed was sent only once, update the estimated round trip time and recompute the timeout interval. (This is standard behavior.)

2. If a timeout occurs, back off the timeout timer interval (e.g., double it), retransmit the oldest unacknowledged segment, and restart the timeout timer. (Again, nothing unusual.)

3. When an ACK comes in for a segment that was sent more than once (i.e., retransmitted at least once), do NOT attempt to measure its round trip time; leave the smoothed estimate alone. FURTHERMORE, KEEP THE BACKED-OFF TIMER SETTING FOR THE *NEXT* SEGMENT, AND DO NOT RECOMPUTE IT FROM (BETA*SRTT) UNTIL IT OR A LATER SEGMENT HAS BEEN ACKED WITHOUT A RETRANSMISSION.

The purpose of this last rule is as follows. If the retransmission was caused by a too-small SRTT, keeping the backed-off timeout for the following segment gives an ACK a chance to come back without the RTT measurement being spoiled by an unnecessary retransmission. This gives a valid data point that can be fed into the smoothed estimate, driving it toward its true value. On the other hand, if the timeout was caused by the packet being lost altogether, the estimated round trip time will not (and should not) change. Only valid RTT measurements are fed into the smoother, and the timeout is always backed off whenever retransmissions occur until valid RTT measurements are again obtained.

This algorithm seems to work extremely well in practice. Even when the packet loss rate is so high as to cause many packets to be lost in a row, SRTT always reflects a "sane" value. If the network delay suddenly increases, there may be a short period where the retransmission timeout "oscillates" between a value based on the SRTT and the backed-off interval necessary to allow an ACK to come back without a retransmission, but this stops when the computed SRTT catches up with the current network delay. Dave Mills' nonlinear algorithm shortens this period by allowing the smoothed estimate to react more quickly to sudden increases, but even without it the algorithm always converges.

Comments?

Phil Karn
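[A minimal sketch of the three rules, as I read them, in Python. The connection-state object and constant values are illustrative assumptions, not code from Phil's implementation:]

```python
ALPHA = 0.9   # RFC 793 smoothing gain (illustrative)
BETA = 2.0    # timeout = BETA * SRTT when not backed off

class RtoState:
    """Hypothetical per-connection retransmission state."""
    def __init__(self, srtt):
        self.srtt = srtt
        self.rto = BETA * srtt
        self.backed_off = False   # True while rule 3's hold is in effect

    def on_timeout(self):
        # Rule 2: back off (double) the timer and retransmit the oldest
        # unacknowledged segment; the SRTT itself is not touched.
        self.rto *= 2
        self.backed_off = True

    def on_ack(self, was_retransmitted, rtt_sample):
        if was_retransmitted:
            # Rule 3: the measurement is ambiguous -- discard it, and
            # keep the backed-off timer for the next segment.
            return
        # Rule 1 (and the tail of rule 3): a clean measurement updates
        # the smoothed estimate and ends the backed-off hold.
        self.srtt = ALPHA * self.srtt + (1 - ALPHA) * rtt_sample
        self.backed_off = False
        self.rto = BETA * self.srtt
```

[The key point is that on_ack for a retransmitted segment returns without touching either srtt or rto.]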
craig@LOKI.BBN.COM.UUCP (04/08/87)
Phil,

I've been looking at the same problem from the perspective of RDP for some months. Some of my conclusions are in a paper I'm presenting at the June USENIX. Most of them are the same as yours, except I came up with a different selection algorithm based on the assumption that the network *usually* maintains relative order (if packet A is sent before packet B, then it is likely that A will reach the destination before B).

By the way, in support of the view that we need to look at retransmission timeouts, I've seen the SRTT explode at loss rates below 10% on some systems.

One other point -- your algorithm is engaged in throwing away a certain number of good RTT values because the filtering isn't perfect. If the RTT suddenly increases sharply, you have to throw out the first packet that has the new RTT, and wait for the second packet to be accepted. Out of curiosity, what is beta in your implementation? I.e., how much does the RTT have to increase to cause this problem?

Craig
geof@apolling.UUCP.UUCP (04/08/87)
> 1. If a segment being ACKed was sent only once, update the estimated round
> trip time and recompute the timeout interval. (This is standard behavior).
>
> 2. If a timeout occurs, back off the timeout timer interval (e.g., double
> it), retransmit the oldest unacknowledged segment, and restart the timeout
> timer. (Again, nothing unusual).
>
> 3. When an ACK comes in for a segment that was sent more than once (i.e.,
> retransmitted at least once) do NOT attempt to measure its round trip time;
> leave the smoothed estimate alone. FURTHERMORE, KEEP THE BACKED-OFF TIMER
> SETTING FOR THE *NEXT* SEGMENT, AND DO NOT RECOMPUTE IT FROM (BETA*SRTT) UNTIL
> IT OR A LATER SEGMENT HAS BEEN ACKED WITHOUT A RETRANSMISSION.

Looks good. If it works, all the better! I like the separation of the RTT estimate from the timeout computation. I think it would help the description of the algorithm to make the separation clearer. The RTT estimate is a particular metered value, so it makes a lot of sense not to try to update it based on data that is questionable. The timeout value is a function of the RTT estimate, but need not be a simple function -- sometimes it might not depend on the RTT estimate at all, as you cleverly suggest.

The possible problem is that the RTT might not be estimated for a long time (if packets are lost frequently). It looks like the algorithm will correctly play with the timeout in the absence of measured RTT values. I'm not sure what you do in [1], though. If you compute something like

    RTTnew = (7*RTTold + RTTmeasured)/8

then your behavior may be bad when the RTT had become seriously out of date and was then measured. You'll get a damped oscillatory response, I think.

An example: Say the RTT estimate is 1 second and the retransmission timer is 2 s. Now imagine that the network conditions change abruptly so that the real network RTT is 10 seconds. For several segments, you get spurious retransmissions and don't compute the RTT.
Meanwhile the retransmission timer is increased using normal backoff. Eventually, it hits ~10 seconds or so, and you get a valid computation of RTT. This gives you:

    RTTnew = (7*1 + 10)/8 = 2.125

Then you might compute

    timer = RTT * 2 = 4.25 s

This means that the next segment will also be retransmitted spuriously, and the process repeats. Maybe my example is too severe when compared with the real world (I don't think so, if you have both satnets, X.25 links and direct lines as alternate paths to each other). I guess the moral of the story is to use a method of smoothing the RTT estimate that converges very quickly, like

    RTTnew = (RTTold + RTTmeasured)/2

or even

    RTTnew = RTTmeasured

One element of incompleteness in the algorithm is the turn-on transition. You don't specify how to make the initial RTT and Timer guesses. In view of the above, that sounds especially important in this algorithm.

- Geof Cooper
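[Geof's oscillation can be checked with a quick simulation. This is a simplified model under the assumptions of his example (timer = 2*SRTT, doubling backoff, true RTT jumps from 1 s to 10 s, only the one clean sample per cycle reaches the smoother); function and parameter names are illustrative:]

```python
def settle(gain_num, gain_den, srtt=1.0, true_rtt=10.0):
    """Count clean RTT samples needed before the timeout (2*SRTT)
    finally covers the new true RTT, using the smoother
    SRTT = (gain_num*SRTT + RTT) / gain_den."""
    samples = 0
    while True:
        # Spurious retransmissions: the timer doubles until it is long
        # enough for an ACK to come back unmolested.
        timer = 2 * srtt
        while timer < true_rtt:
            timer *= 2
        # One clean measurement gets through; smooth it in.
        srtt = (gain_num * srtt + true_rtt) / gain_den
        samples += 1
        if 2 * srtt >= true_rtt:
            return samples
```

[With the 7/8 smoother this takes five cycles of spurious retransmission before the timer stays up (SRTT crawls 1 -> 2.125 -> 3.11 -> 3.97 -> 4.72 -> 5.38); with the 1/2 smoother a single sample suffices, which is Geof's point about fast convergence.]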
JSLove@MIT-MULTICS.ARPA.UUCP (04/10/87)
Counterproposal: the following rules are rather complex, and require more memory than some implementations may use (e.g., a timestamp per packet), but they may prevent the sort of explosion we see in SRTT somewhat better, and they use information which is already available. This combines several schemes, used without credit, so if you think you recognize the serial numbers, you probably do.

1. Keep a timestamp associated with each packet in the retransmission queue which tells when it was first transmitted. This timestamp is provided by IP, and indicates when the packet was given to the device driver. Since retransmissions should use the same IP identifier for reassembly, it is reasonable for TCP and IP to share the same copy of the packet for retransmission purposes.

2. Never retransmit a packet which hasn't been transmitted yet. There are a number of implementations which seem to have this problem. In fact, never retransmit a packet which hasn't had a certain minimum delay since its last transmission:

    SRTT * VR * RB ** RC

where SRTT is smoothed round trip time, VR is variance ratio, RB is retransmission base, and RC is retransmission count.

3. Let's set VR and RB to 2. This is arbitrary; substitute values to taste. RC is initially zero, and is reset for every packet. It doesn't have to be stored in the packet, since only one packet is retransmitted at a time. The SRTT for the first packet is unknown and doesn't matter. The RTTs used to calculate the SRTT use a variable weight, SF (smoothing factor), which is initially zero. The MSF (maximum SF) is also arbitrary, but values from 4 to 8 seem reasonable.

4. When we transmit the first packet, we set the timer to one second. After the timer has elapsed, we retransmit, and increase the interval geometrically (by RB), so we retransmit after 2 more seconds, then 4, and so on.

5. When the acknowledgement comes back after 10 seconds from the first transmission, we have possible RTTs of 10, 9, 7, and 3. We take the WCRTT (worst case RTT) of 10.

6. For subsequent datagrams, we distinguish between three kinds of RTT:

   a) The datagram was retransmitted (so we have a WCRTT).

   b) The datagram wasn't retransmitted, but it was acknowledged at the same time as a subsequent datagram, or a retransmission of a preceding datagram was sent subsequent to the transmission of this datagram. We can't be certain how much of the RTT was due to the influence of other datagrams, so we have an MRTT (merged RTT).

   c) The datagram wasn't retransmitted, and no retransmissions of preceding datagrams were transmitted after this one. We have an ARTT (actual RTT).

7. Each time an acknowledgement packet is processed, we may produce RTTs, including at most one WCRTT or ARTT, and possibly several MRTTs. All contribute to the SRTT, but at different weights. For the first packet (SYN), or if they would tend to decrease the SRTT, all types count at full weight, so

    SF = minimum (SF + 1, MSF)
    SRTT = ((SF - 1) * SRTT + RTT) / SF

If they would tend to increase the SRTT, then the weights differ, so that an ARTT counts at full weight, a WCRTT at a lesser weight (perhaps half), and an MRTT somewhere in between. This means that an estimate of the RTT is more strongly influenced by optimistic or reliable data. This should help the algorithm converge on a small number.

8. Depending on the nature of the RTT from the preceding packet, we choose one of the following functions to generate the minimum retransmission interval for the current packet:

    Delay = maximum (SRTT, ARTT) * VR
    Delay = maximum (SRTT * VR, MRTT)
    Delay = maximum (SRTT * VR, WCRTT * PLR)

PLR is the packet loss ratio. Lacking a filter to determine PLR as well as VR, it should be set to some arbitrary value sufficient to keep the delay from blowing up in the face of high packet loss rates. Let us assume that packet loss rates (in both directions) never average more than 50%, so a PLR of .5 should be sufficient.

9. Don't ever use a simple timer to determine if a connection has died. Instead, kill the connection when RC gets to a large value. Assuming that SRTT is one second, a large value might be 8 (nearly 5 minutes), but the general idea is that the geometric backoff means that a long delay will be needed to determine that the connection is broken, so rather than pick a fixed time interval you can base this on the number of consecutive retransmissions of a packet.

Note that failing to acknowledge a packet isn't necessarily enough information, since if the remote site closes its window it can still indicate that it is alive by acknowledging old data. However, the geometric backoff means that it might take a very long time to detect the reopened window. A window opening ACK wouldn't be acceptable unless it opened the window farther than it had been before being closed. Does anyone admit to actually closing windows? This might be an argument to limit the maximum retransmission interval to, say, two minutes. Even then, you might want a connection to keep trying forever.

Comments?
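[The backoff rule in point 2 and the candidate-RTT bookkeeping of points 4-5 can be sketched directly, using the example constants from point 3 (VR = RB = 2); function names are illustrative:]

```python
VR = 2.0  # variance ratio (point 3's example value)
RB = 2.0  # retransmission base

def min_retransmit_delay(srtt, rc):
    """Minimum delay before retransmission number rc of a packet:
    SRTT * VR * RB ** RC (point 2)."""
    return srtt * VR * RB ** rc

def candidate_rtts(ack_time, send_times):
    """Possible RTTs for a retransmitted packet: the ACK could belong
    to any one of its transmissions (points 4-5)."""
    return [ack_time - t for t in send_times]

# Point 4's schedule: first send at t=0, retransmissions at t=1, 3, 7.
# An ACK at t=10 yields candidate RTTs of 10, 9, 7 and 3 (point 5);
# the WCRTT is the largest of these, 10.
```

[Note the contrast with Phil's rule 3: rather than discarding the ambiguous sample, this scheme keeps the worst-case interpretation but admits it to the smoother at reduced weight (point 7).]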