[comp.protocols.tcp-ip] more on tcp congestion control

van@LBL-CSAM.ARPA (Van Jacobson) (02/23/88)

Phil -

Thanks for correcting the CUTE reference (I've got to start
putting JSAC and COM in separate piles) and thanks for your
interesting message on changes to the ka9q package.  You
bring up a lot of things that I should have covered so I'm
cc'ing this reply to the tcp/ip list (in the unlikely event
that anyone else is interested).

> I suggest rounding your computed RTO values up to the next
> highest number of ticks when setting the timer. This is
> especially important when you calculate retransmission timeouts
> based on mean deviations, since they can be very small.
> Otherwise it's possible for truncation effects to set the timer
> LESS than the current SRTT.

Good point.  I should have edited that timer msg I inserted
because it contained several rounding errors.  The line

	rto = (Avg >> 3) + (Mdev >> 1);

should have read

	rto = ((Avg >> 2) + Mdev + 3) >> 1;

which is how things read in the timer rfc & the bsd code -- Avg
could also be rounded but it doesn't make much practical
difference.  The mysterious "+ 3" is half a tick for rounding
plus one tick because the send is phased randomly with respect
to the clock (i.e., there's a +-1/2 tick uncertainty in when
the timer goes off).  Also, because of the timer uncertainty, the
minimum value for rto should be two ticks.  You can save a
"if (rto < 2) rto = 2" step, at the cost of a half-tick bias
in rto, if you make the above

	rto = ((Avg >> 2) + Mdev + 4) >> 1;
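Putting that together, a minimal sketch of the whole computation (assuming, as the formula implies, that Avg is the smoothed rtt scaled by 8 and Mdev the mean deviation scaled by 4, both in clock ticks):

```c
/* rto from the smoothed rtt and mean deviation, in clock ticks.
   Assumed fixed-point: Avg = 8 * srtt, Mdev = 4 * mdev.
   The "+ 4" is half a tick for rounding, one tick for the send/clock
   phase uncertainty, and a half-tick bias that stands in for an
   explicit "if (rto < 2) rto = 2" clamp. */
static int rto_ticks(int Avg, int Mdev)
{
    return ((Avg >> 2) + Mdev + 4) >> 1;
}
```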

> 3. When you refer to "packets" in the context of setting the
> congestion window in TCP, what do you mean? One MSS?

Yes, by "packet" without a qualifier (such as "arbitrary size
packet") I always mean a one MSS sized packet.

> I tried that first, but I ran into problems. When I have an
> entire SLIP link to myself, the mean deviation sits at a very
> low value.  Thus the RTO sits just above the SRTT.  But when
> slowstart opens up the window to send more, the round trip time
> immediately increases beyond the retransmission timeout and an
> unnecessary retransmission occurs.  The congestion window
> oscillates rapidly, with an unnecessary retransmission in each
> cycle.

I think I understand the problem (although I think you're
confusing the slowstart & congestion control algorithms -- more
on that in the next section) but I'm not sure I agree with the
analysis.  I run tcp stuff, bind & NFS over a 2400 baud slip
link from home to work and I've spent a bit of time
investigating the link's behavior (waiting for the damn slow
link gave me a lot of time when I couldn't do anything but
investigate the link behavior :-).

Having the rtt variance in addition to the rtt is a big help
when you're trying to start a bandwidth-limited link.  Since
errors in the variance decay quickly, you can afford to pump it
up whenever you're uncertain about how long a send is going to
take.  In particular, when starting we set the variance to a
`big number' (default is 3 sec but this can be overridden on a
per-route basis) and the rtt to either our best estimate (also
kept in the route) or zero.  After the syn packet, we'll know
the rtt for a zero length packet and rttvar will get decayed by
at most 25%.  As long as the rtt for the first data packet
changes by < 2*.75*bigNumber, we won't get a spurious timeout.
We're free to overestimate bigNumber initially because even a
factor of 10 error will be `forgotten' in 8 packet exchanges.
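A sketch of that connection-start setup (the 3 sec default and the per-route cache are from the discussion above; the tick size and the names are my assumptions):

```c
#define TICK_MS 500                 /* assumed clock granularity */
#define BIG_NUMBER_MS 3000          /* default initial variance: 3 sec */

struct rt_cache {                   /* hypothetical per-route cache */
    long rtt;                       /* best rtt estimate, ticks (may be 0) */
    long rttvar;                    /* per-route `big number' override */
};

static void start_timing(const struct rt_cache *rt, long *rtt, long *rttvar)
{
    *rtt = rt->rtt;                             /* cached estimate or zero */
    *rttvar = rt->rttvar ? rt->rttvar
                         : BIG_NUMBER_MS / TICK_MS;
}
```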

The same scheme applies for the 2nd and future packet exchanges.
If the link is delay limited, increasing the window from one to
two packets will have no effect.  If the link is bandwidth
limited, doubling the window size will double the rtt.  In
general, given that we know the rtt, R(W), for a window size of
W, the worst case rtt change for a window size of W+dW is
R(W)*dW/W.  We don't know if we'll see the worst case but the
above reflects our "uncertainty" in the rtt of the next packet
if we increase the window size.  Thus it should get lumped into rttvar:

	rttvar += rtt * dW/W;

During slowstart, dW is the max segment size, mss, and W is
snd.cwnd so the above becomes

	rttvar += rtt * mss/snd.cwnd

before each snd.cwnd change.  (The above needs appropriate
scaling and rounding but they're implementation dependent.
Also, it overestimates the possible rtt change when snd.cwnd is
small but that turns out to be desirable.)
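As a sketch (with rtt in ticks, window quantities in bytes, and the fixed-point scaling left out, as noted above):

```c
/* Fold the worst-case rtt uncertainty from opening the window by
   mss into rttvar before each snd_cwnd change.  Names and the
   missing scaling are simplifications for illustration. */
static long bump_rttvar(long rttvar, long rtt,
                        long mss, long snd_cwnd)
{
    return rttvar + rtt * mss / snd_cwnd;
}
```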

> So I'm trying something else. I initialize the congestion window
> at one MSS, but then I increase the congestion window by the
> actual amount of data acked each time an ACK comes in.  This
> effectively slows the slow-start algorithm so the timer can have
> more time to adapt.  Or at least the congestion window
> oscillation occurs at a much lower rate. What do you think?

If you mean open cwnd by MIN(amount.acked, mss), that's a real
interesting idea and I'm going to think about it.  The above
timer hacking takes care of spurious retransmits but there's
another problem:  Slowstart is designed to prevent hitting a
gateway with bursts of packets.  If you send a few small packets
when the connection starts, their acks will open the window so
that if you suddenly decide to send a bunch of data, you'll hit
the gateway with a burst.  This small/large switch is exactly
what happens in an SMTP conversation & it looks like your mod
will cure the problem.

But, if you mean what you said, I don't think it's a good idea.
Consider what happens when you send 1,2,3,4 and 1 is dropped.
2-4 are cached on the other side so when you retransmit 1 and
fill the hole in the sequence space, all four packets will be
acked.  If you increment by the amount acked, cwnd will get
fully opened and you'll blast out a window's worth of
back-to-back packets, almost guaranteeing another drop.
(I've shown the IETF lots of pictures of this kind of stable
failure mode.)
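For the record, a sketch of the MIN(amount.acked, mss) opening, which avoids that blast when a filled hole acks several packets at once (names are assumed):

```c
/* Open cwnd by at most one mss per ack, however much data the ack
   covers; a filled sequence hole then opens the window by one
   packet instead of four. */
static unsigned long open_by_ack(unsigned long snd_cwnd,
                                 unsigned long acked,
                                 unsigned long mss)
{
    return snd_cwnd + (acked < mss ? acked : mss);
}
```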

> 4. I'm confused by the modified slow-start algorithm you
> described, the one that uses a threshold to change the amount
> the congestion window is increased. Again, what does it mean to
> increase the congestion window by 1/cwind? Is cwind in packets
> (what size?) or bytes (increase the window by fractions of a
> byte?)

We found it was important to distinguish startup behavior from
steady-state behavior.  (It sounds as if you're lumping the two
together.)  The thing we call "slow start" just has to do with
starting data flowing.  The algorithm has a very short lifetime
-- it typically runs for the first 3-5 rtt after you first start
sending on a connection or restart after a drop.  When slow
start shuts down, the link is in equilibrium (the pipe is full
and you've exchanged a full window of data so you know the ack
"clock" is working).  At this point another algorithm,
congestion avoidance, takes over.  This algorithm is responsible
for estimating how much data is needed to fill the pipe under
the current (time-varying) load.  The congestion avoidance &
slowstart are separate algorithms with completely different
objectives.  They just happen to run consecutively and just
happen to accomplish their individual objectives by twiddling
the same state variable (cwnd).
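The handoff between the two can be sketched as the usual per-ack test against ssthresh (names assumed, everything in bytes):

```c
/* Per-ack window opening: slow start below ssthresh, congestion
   avoidance above it.  Two algorithms, one state variable. */
static unsigned long window_update(unsigned long snd_cwnd,
                                   unsigned long snd_ssthresh,
                                   unsigned long mss)
{
    if (snd_cwnd < snd_ssthresh)
        snd_cwnd += mss;                  /* slow start: open fast */
    else
        snd_cwnd += mss * mss / snd_cwnd; /* congestion avoidance:
                                             about one mss per rtt */
    return snd_cwnd;
}
```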

The congestion avoidance algorithm is described well in DEC
technical report DEC-TR-506, "Congestion Avoidance in Computer
Networks with a Connectionless Network Layer" by Raj Jain, K. K.
Ramakrishnan, Dah-Ming Chiu, August, 1987.  The idea is that when
you try to use a window larger than the bandwidth*delay of the
network path, the excess window won't fit in the pipe so it
must show up in a queue somewhere.  In the usual case, "somewhere"
will be the outbound queue of the slowest link in the path.
That link will probably be a bottleneck for a lot of other
conversations and it's likely its queue will fill up and overflow.
We treat this as a signal that our window is too large and
reduce it (Jain doesn't wait for an overflow -- his signal is
a bit in the packet header that the gateway can set to say it's
congested.)

But the `bandwidth' in the delay-bandwidth product is *your
share* of the fixed link bandwidth and your share can change as
conversations using the same path come and go.  The gateway
tells you when you're using too much but says nothing when
you're using too little.  To find out if someone has shut down
and your share is now larger, you continuously test by slowly
increasing your window size until the gateway says you're using
too much.  (Thus the window will oscillate around some mean
value.  The algorithm is designed to do this.  But we've tried
to set the adjustable parameters that determine the oscillation
so that the bandwidth lost to testing the path is at most 1%.)

From a linear system argument, you can show that the best probe
policy is to make your bandwidth vs. time obey dB/dt = C for
some constant C.  Since B = W / R, where W is the window size
and R is the round trip time, and R is constant, this says you
want dW/dt = c for some other c.  In other words, in one rtt we
want the window to go smoothly from size W to size W+c.  I want
the algorithm to be "self clocked" so we go looking for how to
do the above change using acks rather than time to drive the dW.
If the max packet size is mss and we have decent silly window
code, a window of W will generate at most W/mss acks and it will
generate them spread out over one rtt.  Thus if we increment by
c/(W/mss) = c*mss/W per ack, we will have opened the window by c
after one rtt.  An appropriate c for current arpanet conditions
turns out to be one mss so the increment part of the congestion
avoidance turns into

	snd.cwnd += mss*mss / snd.cwnd;

(you might have to worry about the fixed-point in this if you
were trying to send small packets over a net with a large bandwidth
delay product but it's not a problem in any net I know of.)
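If you did have to worry about it, one fix is to floor the increment at one (a sketch; the guard is my assumption, not the 4.3 code):

```c
/* Congestion-avoidance increment with a floor so the window can
   still grow when mss*mss < snd_cwnd (small packets over a large
   bandwidth*delay path would otherwise truncate the increment
   to zero and the window would stick). */
static unsigned long ca_open(unsigned long snd_cwnd, unsigned long mss)
{
    unsigned long inc = mss * mss / snd_cwnd;
    return snd_cwnd + (inc ? inc : 1);
}
```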

> 5. You didn't specifically mention ICMP source quench, but I
> assume you mean that the congestion window should be cut in half
> each time you get one.  To make this meaningful I also limit the
> congestion window to be no more than the offered window. One
> question: should the congestion window be allowed to decrease
> below one MSS? I would say yes, but I'd like to hear your
> opinion.

I didn't mention source quench because it's a disaster and I keep
hoping it will go away.  There must be at least one rtt of hysteresis
or "dead time" for any control algorithm to be stable.  By using
packet drops and timeouts to drive things, we get this automatically.
But it's very, very hard to get the right kind of hysteresis with
source quench.  It's also possible to show that source quench leads
to a very unfair bandwidth allocation -- SQ is preferentially 
delivered to the hosts using the smallest windows (I have lots of
trace data showing this and can explain why it happens but the
explanation is mathematical and quite involved).

So, since SQ currently happens only when one or more packets have
been dropped, all we do is close snd.cwnd down to one packet (so
no more will get dropped) but we don't touch snd.ssthresh
(because the slowstart time is short, ssthresh is the crucial
window control, not cwnd).  Ssthresh and the rest of the
congestion control stuff will get handled correctly, and with
the right "time constants", when we eventually take the timeout
for the lost packet.
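In code form, the whole SQ handler is tiny (control-block field names assumed):

```c
struct tcb {                       /* assumed control-block fields */
    unsigned long snd_cwnd;
    unsigned long snd_ssthresh;
    unsigned long mss;
};

/* On source quench: close cwnd to one packet so no more get
   dropped, but leave ssthresh alone -- the eventual timeout for
   the lost packet runs the real congestion machinery. */
static void on_source_quench(struct tcb *tp)
{
    tp->snd_cwnd = tp->mss;
}
```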

We limit cwnd to be at most the offered window (you violate
rfc793 if you ever send more than the offered window).  I don't
think it's a good idea to let cwnd get below one mss.  Dropping
cwnd below mss forces you to send small packets, increasing the
overhead per data byte xferred.  Thus you are using the network
less efficiently at a time it's heavily loaded and you really
want to make the most efficient possible use of it.  This is an
unstable positive feedback that can lead to collapse.  If mss is
set correctly, it never makes sense to send a smaller packet.
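Both limits fit in one clamp when deciding how much to send (a sketch, names assumed):

```c
/* Usable window: never below one mss (no forced small packets),
   never above the offered window (rfc793). */
static unsigned long usable_window(unsigned long snd_cwnd,
                                   unsigned long snd_wnd,
                                   unsigned long mss)
{
    if (snd_cwnd < mss)
        snd_cwnd = mss;
    return snd_cwnd < snd_wnd ? snd_cwnd : snd_wnd;
}
```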

> 6. It's interesting to note that the "set the congestion window
> to one packet after a timeout" rule automatically implies the
> "first-only retransmission" rule, thus simplifying my code
> somewhat.  In effect, a timeout is just a very drastic source
> quench.

Yes, there were several sort-of ad-hoc pieces of code, like the
first-only rexmit and all the source quench handling, that were
taken out because their job was handled in a more unified way by
the slowstart/cong. avoidance framework -- it was very
satisfying.  We removed something like 15 lines and replaced
them with 6.  Of course, we then stuck in your clamped
retransmit backoff algorithm, a fast retransmit algorithm (that
got described at one of the IETF meetings) and a few more
goodies, and ended up +3 lines from the original 4.3 code
instead of the -9 we expected (not including, of course, dozens
of lines of bug fixes).

I prefer to think of source quench as a poorly considered, ill
behaved timeout :-).

 - Van