[mod.protocols.tcp-ip] A New TCP Retransmission Algorithm

karn@JUPITER.BELLCORE.COM.UUCP (04/08/87)

I'd like to describe a TCP retransmission algorithm that I've been using
with excellent results for the last several months.

Background

TCP has the following well known problem.  When an ACK finally returns for a
segment that was retransmitted, you don't know which transmission caused it.
Therefore you can't be certain what the round trip time (RTT) for this
packet was.  Different implementations react differently to this problem:

1.  Some assume that the first transmission(s) was/were lost and restart
the round trip timer on each retransmission.

2.  Others ignore the measurement (since it isn't reliable), keeping their
previous estimates of the round trip time.
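To make the ambiguity concrete, consider a made-up timeline: a segment is
first sent at t=0, the retransmission timer fires and a second copy goes
out at t=3, and an ACK arrives at t=4.

    RTT measured from the first transmission:   4 - 0 = 4 seconds
    RTT measured from the retransmission:       4 - 3 = 1 second

If the ACK was really provoked by the first copy, timing from the
retransmission (approach #1) records 1 second when the true figure was 4;
if it was provoked by the second copy, timing from the first transmission
records 4 seconds when the true figure was 1.  Nothing in the ACK says
which case applies.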

Both approaches share a serious problem.  Consider the case where the
retransmission occurred because the smoothed round trip time (SRTT) was too
small, not because packets are being lost.  If approach #1 is used, the
returning ACK is in fact responding to the FIRST transmission, so timing
from a later transmission gives a measurement that is too small. Because the
measurements are erroneously small, SRTT never adapts to the true value;
because the SRTT never adapts to its true value, unnecessary retransmissions
keep happening, resulting in erroneously small RTT measurements.  This
vicious circle can go on essentially forever.  If the measurement is instead
rejected as unreliable, SRTT is never updated, and again unnecessary
retransmissions go on forever.  Believe me, I've seen more than my share of
TCPs behave this way.

The currently accepted way out of this problem is to assume the worst whenever
a timeout occurs by measuring the RTT from the FIRST transmission of a given
segment.  If network loading suddenly increases, or if the initial estimate
of the round trip time was too low, there will be a few retransmissions but
the correct value *will* be found; the TCP eventually learns and unnecessary
retransmissions stop.  As long as the network packet loss rate is low, this
approach works quite well. Retransmissions caused by the occasional dropped
packet result in erroneously large round trip time measurements being fed
into the smoother, but this is better than erring on the low side.  Since
few packets are being dropped, subsequent packets will most likely bring the
SRTT value back down to its "real" value long before the next packet is
dropped, so the timeout won't waste too much time.
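The "smoother" here is the usual exponentially weighted average, with the
timeout taken as BETA times the smoothed estimate.  A minimal C sketch,
with illustrative constants not taken from any particular TCP:

    #define ALPHA   0.875           /* smoothing gain, illustrative */
    #define BETA    2.0             /* timeout factor, illustrative */

    static double srtt;             /* smoothed round trip time, seconds */
    static double rto = 3.0;        /* retransmission timeout, seconds */

    /* Feed one measurement (taken from the FIRST transmission of the
     * segment, as described above) into the smoother and recompute
     * the timeout.
     */
    void
    rtt_measurement(double rtt)
    {
        if (srtt == 0.0)
            srtt = rtt;             /* first sample primes the estimate */
        else
            srtt = ALPHA * srtt + (1.0 - ALPHA) * rtt;
        rto = BETA * srtt;
    }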

A serious problem develops, however, when the network is very lossy.  A
"raw" amateur packet radio channel (no link level acknowledgments) often
loses 25% or more of the packets sent due to collisions and poor signal
levels.  Such rates drive this algorithm crazy, especially when consecutive
packets are lost.  In this situation, retransmission intervals are
increasing rapidly due to back-off, and these intervals are added to the
total round trip measurement for the segment.  Enough such measurements can
drive the smoothed round trip time estimate right through the roof.
This is especially true when a nonlinear smoothing function is used that
reacts more quickly to increases in round trip delay than to decreases
(e.g., Mills, RFC-889).
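A made-up example shows how fast the estimate can inflate.  Suppose the
true round trip is 3 seconds, the timeout has been backed off to 4
seconds, and two consecutive copies of a segment are lost:

    t=0    first copy sent; lost
    t=4    timeout; second copy sent; lost; interval backed off to 8
    t=12   timeout; third copy sent; this one gets through
    t=15   ACK arrives

    RTT measured from the first transmission = 15 seconds

Every burst of consecutive losses produces a measurement like this, five
times the true value here, and a steady diet of them is what drives the
smoothed estimate through the roof.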

Some people might say that the answer is simple: just constrain the round
trip time to some arbitrary limit, say, 30 or 60 seconds, but there are real
problems with this.  Just how do you pick this "arbitrary" value? Through
measurement? Just like the Dow Jones average, any number you see today is
almost guaranteed to be exceeded tomorrow.  If the delay through the
Internet exceeds your arbitrary timeout limit, you start retransmitting
unnecessarily, further adding to whatever it is that is causing the large
delay in the first place.  Perhaps we need robots to stand in the office of
every protocol hacker:

	WARNING WARNING DANGER WILL ROBINSON!
	ARBITRARY PROTOCOL TIMEOUT DETECTED!
	EVENTUAL CONGESTION COLLAPSE IS NOW GUARANTEED!
	(SOONER OR LATER)

A Better Way

Thinking there *had* to be a better way, I devised and implemented the
following rules:

1. If a segment being ACKed was sent only once, update the estimated round
trip time and recompute the timeout interval. (This is standard behavior).

2. If a timeout occurs, back off the timeout timer interval (e.g., double
it), retransmit the oldest unacknowledged segment, and restart the timeout
timer. (Again, nothing unusual).

3. When an ACK comes in for a segment that was sent more than once (i.e.,
retransmitted at least once) do NOT attempt to measure its round trip time;
leave the smoothed estimate alone.  FURTHERMORE, KEEP THE BACKED-OFF TIMER
SETTING FOR THE *NEXT* SEGMENT, AND DO NOT RECOMPUTE IT FROM (BETA*SRTT) UNTIL
IT OR A LATER SEGMENT HAS BEEN ACKED WITHOUT A RETRANSMISSION.

The purpose of this last rule is as follows.  If the retransmission was
caused by a too-small SRTT, keeping the backed-off timeout for the following
segment gives an ACK a chance to come back without the RTT measurement being
spoiled by an unnecessary retransmission.  This gives a valid data point
that can be fed into the smoothed estimate, driving it toward its true
value.  On the other hand, if the timeout was caused by the packet being
lost altogether, the estimated round trip time will not (and should not)
change.  Only valid RTT measurements are fed into the smoother, and the
timeout is always backed off whenever retransmissions occur until valid RTT
measurements are again obtained. 
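A minimal sketch of the three rules in C.  The names, constants, and the
particular smoothing function are mine, not part of the rules; note that
the timeout is only recomputed from BETA*SRTT on an unambiguous ACK, so a
backed-off value automatically carries over to the following segment,
which is rule 3.

    #define ALPHA           0.875   /* smoothing gain (illustrative) */
    #define BETA            2.0     /* timeout = BETA * SRTT */
    #define BACKOFF_FACTOR  2.0     /* e.g., double on each timeout */

    struct tcb {
        double  srtt;               /* smoothed round trip time, seconds */
        double  rto;                /* current retransmission timeout, seconds */
    };

    /* Rules 1 and 3: an ACK has covered a segment.  'rexmitted' is nonzero
     * if that segment was ever retransmitted; 'rtt' is the time since its
     * first transmission.
     */
    void
    tcp_ack(struct tcb *tp, int rexmitted, double rtt)
    {
        if (rexmitted)
            return;                 /* rule 3: no measurement, keep the
                                     * backed-off rto for the next segment */

        /* rule 1: unambiguous ACK -- update the estimate and the timeout */
        tp->srtt = ALPHA * tp->srtt + (1.0 - ALPHA) * rtt;
        tp->rto  = BETA * tp->srtt;
    }

    /* Rule 2: the retransmission timer has expired. */
    void
    tcp_timeout(struct tcb *tp)
    {
        tp->rto *= BACKOFF_FACTOR;  /* back off the interval */
        /* ...retransmit the oldest unacknowledged segment and restart
         * the timer with the backed-off tp->rto... */
    }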

This algorithm seems to work extremely well in practice. Even when the
packet loss rate is so high as to cause many packets to be lost in a row,
SRTT always reflects a "sane" value.  If the network delay suddenly
increases, there may be a short period where the retransmission timeout
"oscillates" between a value based on the SRTT and the backed-off interval
necessary to allow an ACK to come back without a retransmission, but this
stops when the computed SRTT catches up with the current network delay.
Dave Mills' nonlinear algorithm shortens this period by allowing the
smoothed estimate to react more quickly to sudden increases, but even
without it the algorithm always converges.

Comments?

Phil Karn

craig@LOKI.BBN.COM.UUCP (04/08/87)

Phil,

    I've been looking at the same problem from the perspective of
RDP for some months.  Some of my conclusions are in a paper I'm
presenting at the June USENIX.  Most of them are the same as yours,
except I came up with a different selection algorithm based on the
assumption that the network *usually* maintains relative order
(if packet A is sent before packet B, then it is likely that A
will reach the destination before B).

    By the way, in support of the view that we need to look at
retransmission timeouts, I've seen the SRTT explode at loss rates
below 10% on some systems.

    One other point -- your algorithm is engaged in throwing away a
certain number of good RTT values because the filtering isn't perfect.
If the RTT suddenly increases sharply, you have to throw out the first
packet that has the new RTT, and wait for the second packet to be
accepted.  Out of curiosity, what is beta in your implementation?
I.e., how much does the RTT have to increase to cause this problem?

Craig

geof@apolling.UUCP.UUCP (04/08/87)

 >      1. If a segment being ACKed was sent only once, update the estimated round
 >      trip time and recompute the timeout interval. (This is standard behavior).
 >      
 >      2. If a timeout occurs, back off the timeout timer interval (e.g., double
 >      it), retransmit the oldest unacknowledged segment, and restart the timeout
 >      timer. (Again, nothing unusual).
 >      
 >      3. When an ACK comes in for a segment that was sent more than once (i.e.,
 >      retransmitted at least once) do NOT attempt to measure its round trip time;
 >      leave the smoothed estimate alone.  FURTHERMORE, KEEP THE BACKED-OFF TIMER
 >      SETTING FOR THE *NEXT* SEGMENT, AND DO NOT RECOMPUTE IT FROM (BETA*SRTT) UNTIL
 >      IT OR A LATER SEGMENT HAS BEEN ACKED WITHOUT A RETRANSMISSION.

Looks good.  If it works, all the better!

I like the separation of the RTT estimate from the timeout computation.
I think it would help the description of the algorithm to make the
separation clearer.  The RTT estimate is a particular metered value, so
it makes a lot of sense to not try to update it based on data that is
questionable.  The timeout value is a function of the RTT estimate, but
need not be a simple function -- sometimes it might not depend on the RTT
estimate at all, as you cleverly suggest.

The possible problem is that the RTT might not be estimated for a long
time (if packets are lost frequently).  It looks like the algorithm will
correctly play with the timeout in the absence of measured RTT values.
I'm not sure what you do in [1], though.  If you compute something like
    RTTnew = (7*RTTold + RTTmeasured)/8
then your behavior may be bad when the RTT had become seriously out of
date and was then measured.  You'll get a damped oscillatory response,
I think.

An example:

Say the RTT estimate is 1 second and the retransmission timer is 2 s.
Now imagine that the network conditions change abruptly
so that the real network RTT is 10 seconds.  For several segments,
you get spurious retransmissions and don't compute the RTT.  Meanwhile
the retransmission timer is increased using normal backoff.  Eventually,
it hits ~10 seconds or so, and you get a valid computation of RTT.
This gives you:
    RTTnew = (7*1 + 10)/8 = 2.125
Then you might compute
    timer = RTT * 2 = 4.25 s

This means that the next segment will also be retransmitted spuriously,
and the process repeats.  Maybe my example is too severe when compared
with the real world (I don't think so, if you have both satnets, X.25
links and direct lines as alternate paths to each other).
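Carrying the same numbers forward, with the 7/8 smoother and timer =
RTT * 2, each cycle lets exactly one valid 10-second measurement through
once the backed-off timer clears 10 seconds, and then the recomputed
timer drops back below it:

    RTT = (7*1     + 10)/8 = 2.13    timer = 4.25   (spurious again)
    RTT = (7*2.13  + 10)/8 = 3.11    timer = 6.22   (spurious again)
    RTT = (7*3.11  + 10)/8 = 3.97    timer = 7.94   (spurious again)
    RTT = (7*3.97  + 10)/8 = 4.72    timer = 9.45   (spurious again)
    RTT = (7*4.72  + 10)/8 = 5.38    timer = 10.77  (finally clears 10)

so it takes about five such cycles of spurious retransmission before the
timer stays above the real 10-second RTT.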

I guess the moral of the story is to use a method of smoothing the RTT
estimate that converges very quickly, like
    RTTnew = (RTTold + RTTmeasured)/2
or even
    RTTnew = RTTold

One element of incompleteness in the algorithm is the turn-on transition.
You don't specify how to make the initial RTT and Timer guesses.  In view
of the above, that sounds especially important in this algorithm.

- Geof Cooper

JSLove@MIT-MULTICS.ARPA.UUCP (04/10/87)

Counterproposal:  the following rules are rather complex, and require
more memory than some implementations use (e.g., a timestamp per
packet), but they may do somewhat better at preventing the sort of
explosion we see in SRTT, and they use information which is already
available.
This combines several schemes, used without credit, so if you think you
recognize the serial numbers, you probably do.

1.  Keep a timestamp associated with each packet in the retransmission
queue which tells when it was first transmitted.  This timestamp is
provided by IP, and indicates when the packet was given to the device
driver.  Since retransmissions should use the same IP identifier for
reassembly, it is reasonable for TCP and IP to share the same copy of
the packet for retransmission purposes.

2.  Never retransmit a packet which hasn't been transmitted yet.  There
are a number of implementations which seem to have this problem.  In
fact, never retransmit a packet which hasn't had a certain minimum delay
since its last transmission:

   SRTT * VR * RB ** RC

Where SRTT is smoothed round trip time, VR is variance ratio, RB is
retransmission base, and RC is retransmission count.

3.  Let's set VR and RB to 2.  This is arbitrary; substitute values to
taste.  RC is initially zero, and is reset for every packet.  It doesn't
have to be stored in the packet, since only one packet is retransmitted
at a time.  The SRTT for the first packet is unknown and doesn't matter.
The RTTs used to calculate the SRTT use a variable weight, SF (smoothing
factor) which is initially zero.  The MSF (maximum SF) is also
arbitrary, but values from 4 to 8 seem reasonable.
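A small C sketch of the minimum-interval test in rules 2 and 3, using the
arbitrary VR = RB = 2 from above; the function name is invented, and RC
is assumed to count how many times the packet at the head of the
retransmission queue has already been retransmitted:

    #include <math.h>

    #define VR      2.0     /* variance ratio (arbitrary, per rule 3) */
    #define RB      2.0     /* retransmission base (arbitrary, per rule 3) */

    /* Nonzero if the packet may be retransmitted now.  'already_sent'
     * enforces "never retransmit a packet which hasn't been transmitted
     * yet"; 'now' and 'last_xmit' are in seconds; 'rc' is this packet's
     * retransmission count.
     */
    int
    may_retransmit(int already_sent, double now, double last_xmit,
                   double srtt, int rc)
    {
        double min_interval = srtt * VR * pow(RB, (double)rc);

        if (!already_sent)
            return 0;
        return now - last_xmit >= min_interval;
    }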

4.  When we transmit the first packet, we set the timer to one second.
After the timer has elapsed, we retransmit, and increase the interval
geometrically (by RB), so we retransmit after 2 more seconds, then 4,
and so on.

5.  When the acknowledgement comes back 10 seconds after the first
transmission (the copies having gone out at 0, 1, 3, and 7 seconds), we
have possible RTTs of 10, 9, 7, and 3.  We take the WCRTT (worst case
RTT) of 10.

6.  For subsequent datagrams, we distinguish between three kinds of RTT:

a) the datagram was retransmitted (so we have a WCRTT).

b) the datagram wasn't retransmitted, but it was acknowledged at the
same time as a subsequent datagram, or a retransmission of a preceding
datagram was sent subsequent to the transmission of this datagram.  We
can't be certain how much of the RTT was due to the influence of other
datagrams, so we have an MRTT (merged RTT).

c) the datagram wasn't retransmitted, and no retransmissions of
preceding datagrams were transmitted after this one.  We have an ARTT
(actual RTT).

7.  Each time an acknowledgement packet is processed, we may produce
RTTs, including at most one WCRTT or ARTT, and possibly several MRTTs.
All contribute to the SRTT, but at different weights.  For the first
packet (SYN) or if they would tend to decrease the SRTT, all types count
at full weight, so

   SF = minimum (SF + 1, MSF)
   SRTT = ((SF - 1) * SRTT + RTT) / SF

If they would tend to increase the SRTT, then the weights differ, so
that an ARTT counts at full weight, a WCRTT at a lesser weight (perhaps
half), and an MRTT somewhere in between.  This means that an estimate of
the RTT is more strongly influenced by optimistic or reliable data.
This should help the algorithm converge on a small number.
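A sketch of that update in C.  MSF and the per-type weights are
illustrative, and "counts at a lesser weight" is read here as pulling the
new sample part way toward the current SRTT before smoothing, which is
only one possible interpretation:

    #define MSF     8                       /* maximum smoothing factor */

    enum rtt_kind { ARTT, MRTT, WCRTT };    /* see item 6 above */

    static int    sf;                       /* smoothing factor, starts at 0 */
    static double srtt;                     /* smoothed round trip time */

    void
    srtt_update(double rtt, enum rtt_kind kind)
    {
        double weight = 1.0;                /* full weight by default */
        double sample;

        if (sf > 0 && rtt > srtt) {
            /* sample would increase SRTT: discount it by type */
            if (kind == WCRTT)
                weight = 0.5;               /* "perhaps half" */
            else if (kind == MRTT)
                weight = 0.75;              /* somewhere in between */
        }
        sample = weight * rtt + (1.0 - weight) * srtt;
        if (sf < MSF)
            sf++;
        srtt = ((sf - 1) * srtt + sample) / sf;
    }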

8.  Depending on the nature of the RTT from the preceding packet, we
choose one of the following functions to generate the minimum
retransmission interval for the current packet:

   Delay = maximum (SRTT, ARTT) * VR
   Delay = maximum (SRTT * VR, MRTT)
   Delay = maximum (SRTT * VR, WCRTT * PLR)

PLR is the packet loss ratio.  Lacking a filter to determine PLR as well
as VR, it should be set to some arbitrary value sufficient to keep the
delay from blowing up in the face of high packet loss rates.  Let us
assume that packet loss rates (in both directions) never average more
than 50%, so a PLR of .5 should be sufficient.
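A sketch of that selection in C (the rtt_kind enumeration is repeated
from the previous sketch so this stands alone); which formula goes with
which kind of RTT is taken from the order they are listed in above:

    #define VR      2.0     /* variance ratio, as above */
    #define PLR     0.5     /* assumed bound on packet loss ratio, as above */

    enum rtt_kind { ARTT, MRTT, WCRTT };

    static double
    max2(double a, double b)
    {
        return a > b ? a : b;
    }

    /* Minimum retransmission interval for the current packet, given the
     * kind and value of the RTT produced by the preceding packet.
     */
    double
    min_rexmit_delay(double srtt, enum rtt_kind prev_kind, double prev_rtt)
    {
        switch (prev_kind) {
        case ARTT:                  /* reliable measurement */
            return max2(srtt, prev_rtt) * VR;
        case MRTT:                  /* merged, partly trustworthy */
            return max2(srtt * VR, prev_rtt);
        case WCRTT:                 /* worst case, discounted by PLR */
        default:
            return max2(srtt * VR, prev_rtt * PLR);
        }
    }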

9.  Don't ever use a simple timer to determine if a connection has died.
Instead, kill the connection when RC gets to a large value.  Assuming
that SRTT is one, a large value might be 8 (nearly 5 minutes), but the
general idea is that the geometric backoff means that a long delay will
be needed to determine that the connection is broken, so rather than
pick a fixed time interval you can base the decision on a count of
consecutive lost packets.
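One way to read the "nearly 5 minutes": against the one-second, doubling
schedule of item 4, eight consecutive timeouts before giving up span
roughly

    1 + 2 + 4 + 8 + 16 + 32 + 64 + 128 = 255 seconds

or a bit over four minutes.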

Note that failing to acknowledge a packet isn't necessarily enough
information since if the remote site closes its window it can still
indicate that it is alive by acknowledging old data.  However, the
geometric backoff means that it might take a very long time to detect
the reopened window.  A window opening ACK wouldn't be acceptable unless
it opened the window farther than it had been before being closed.  Does
anyone admit to actually closing windows?  This might be an argument to
limit the maximum retransmission interval to say, two minutes.  Even
then, you might want a connection to keep trying forever.

Comments?
