[mod.protocols.tcp-ip] 4.2/4.3 TCP and long RTTs

craig@LOKI.BBN.COM.UUCP (12/07/86)

    I'm in the midst of doing comparisons between an RDP implementation and
the 4.2/4.3 TCP implementations and have run into a problem which I'm hoping
someone else can shed light on.

    I'm running tests on two machines, a VAX 750 running 4.3 and a SUN
workstation running 4.2.  The two machines are on the same Ethernet
and use the same gateway.  If I set up an experiment to test behaviour over
paths with long network delays (for example, bouncing packets off
Goonhilly), the TCP connections are established and then typically
fail part way through the transfer.  I don't understand this because
the RDP connections work just fine, and typically complete in 1/4 the
time it takes for a TCP connection to send about 20% of the data and
faint.  The experiment generally involves passing 50-100 segments of anywhere
from 64 to 1024 bytes to the protocols to send.  This is on weekends so
the delays aren't that long.

    The question I'm trying to answer is whether the problem is in the
RDP implementation (what anti-social things could it be doing to maintain
that connection?), or the TCP implementation (what might it be doing wrong
to die where another implementation succeeds?).  If I can, I'd like to
discourage invective.  I'm simply trying to figure out why this is happening
so I can identify and fix the problem and do a comparison between the two
implementations/protocols.  (And soon -- hair pulling over this problem
is beginning to threaten the health of my scalp and beard).

    General information on the RDP implementation:  it will retransmit
up to 10 times and calculates the round-trip time based on the first
packet sent with the caveat that it ignores round-trip times of segments
with sequence numbers lower than those of segments whose round-trip time
has already been computed (this feature is an experiment which may not
stay).  The maximum RTT is 2 minutes, the minimum is 2 seconds.

Craig

hedrick@TOPAZ.RUTGERS.EDU.UUCP (12/07/86)

I suspect that what you are seeing is the fact that Sun TCP (to be
fair, most 4.2-based TCP's) doesn't perform well with bad connections.
A number of sites have replaced Sun's TCP modules with their 4.3
equivalents.  This makes a dramatic difference in dealing with
difficult cases.

van@LBL-CSAM.ARPA.UUCP (12/07/86)

What you observe is probably poor tcp behavior, not antisocial rdp
behavior.  If the link is lossy or the mean round trip time is 
greater than 15 seconds, the 4.3bsd tcp throughput degrades rapidly.
For long transfers, a link that gives 2.7KB/s throughput with a
1% loss rate gives 0.07KB/s throughput with a 10% loss rate.  (As
appalling as this looks, 4.2bsd, TOPS-20 tcp, and some other major
implementations that I've measured, get worse faster.  The 4.3
behavior was the best of everything I looked at.)

I know some of the reasons for the degradation.  As one might
expect, the failure seems to be due to the cumulative effect of
a number of small things.  Here's a list, in roughly the order
that they might bear on your experiment.

1. There is a kernel bug that causes IP fragments to be generated
   and ip fragments have only a 7.5s TTL.

In the distribution 4.3bsd, there is a bug in the routine
in_localaddr that makes it say all addresses are "local".  In
most cases, this makes tcp use a 1k mss which results in a lot of
ip fragmentation.  On high loss or long delay circuits, a lot of
the tcp traffic gets timed out and discarded at the destination's
ip level. 

The bug fix is to change the line:
	if (net == subnetsarelocal ? ia->ia_net : ia->ia_subnet)
in netinet/in.c to
	if (net == (subnetsarelocal ? ia->ia_net : ia->ia_subnet))
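
The parentheses matter because of C precedence: "==" binds more
tightly than "?:", so the unparenthesized test parses as
(net == subnetsarelocal) ? ia->ia_net : ia->ia_subnet and reduces to
whichever (nonzero) address field the conditional picks, i.e. it is
true for nearly every address.  A minimal standalone illustration
(made-up address values, not the kernel code):

    #include <stdio.h>

    int
    main(void)
    {
        unsigned long net = 0x0a0000;        /* hypothetical net number */
        unsigned long ia_net = 0x802000;     /* interface's (nonzero) net */
        unsigned long ia_subnet = 0x802100;  /* interface's subnet */
        int subnetsarelocal = 1;

        /* buggy form: parses as (net == subnetsarelocal) ? ... : ... ,
         * so the if() tests a nonzero address and is always taken */
        if (net == subnetsarelocal ? ia_net : ia_subnet)
            printf("unparenthesized test: \"local\"\n");

        /* fixed form: compares net against the selected field */
        if (net == (subnetsarelocal ? ia_net : ia_subnet))
            printf("parenthesized test: local\n");
        else
            printf("parenthesized test: not local\n");
        return 0;
    }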

I also changed IPFRAGTTL in ip.h to two minutes (from 15 to 240)
because we have more memory than net bandwidth.


2. The retransmit timer is clamped at 30s.

The 4.3 tcp was put together before the arpanet went to hell and
has some optimistic assumptions about time.  Since the retransmit
timer is set to 2 * RTT, an RTT > 15s is treated as 15s.  (Last
week, the mean daytime rtt from LBL to UCB was 17s.) On a circuit
with 2min rtt, most packets would be transmitted four times and
the protocol pipelining would be effectively turned off (if 4.3
is retransmitting, it only sends one segment rather than filling
the window).  When running in this mode, you're very sensitive to
loss since each dropped packet or ack effectively uses up 4 of
your 12 retries. 

I would at least change TCPTV_MAX in netinet/tcp_timer.h to a
more realistic value, say 5 minutes (remembering to adjust
related timers like MSL proportionally).  I changed the
TCPT_RANGESET macro to ignore the maximum value because I
couldn't see any justification for a clamp.
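
The clamp is applied by the TCPT_RANGESET macro.  Here is a sketch of
the effect; the macro below is an approximation of the one in
tcp_timer.h (not the distribution source) and the 1s minimum is
assumed:

    #include <stdio.h>

    /* approximate shape of the clamp in netinet/tcp_timer.h */
    #define TCPT_RANGESET(tv, value, tvmin, tvmax) { \
        (tv) = (value); \
        if ((tv) < (tvmin)) \
            (tv) = (tvmin); \
        else if ((tv) > (tvmax)) \
            (tv) = (tvmax); \
    }

    int
    main(void)
    {
        int rexmt;
        int rtt = 17;    /* last week's LBL-UCB daytime rtt, in seconds */

        /* the retransmit timer wants 2*rtt = 34s but gets TCPTV_MAX */
        TCPT_RANGESET(rexmt, 2 * rtt, 1, 30);
        printf("2*rtt = %ds, timer actually set to %ds\n", 2 * rtt, rexmt);
        return 0;
    }

This is also where the "transmitted four times" figure above comes
from: with the timer pinned at 30s, a segment sent at time 0 gets
retransmitted at roughly 30s, 60s and 90s before its 2-minute ack can
possibly arrive.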


3. It takes a long time for tcp to learn the rtt.  

I've harped on this before.  With the default 4k socket buffers
and a 512 byte mss, 4.3 tcp will only try to measure the rtt of
every 8th packet.  It will get a measurement only if that packet
and its 7 predecessors are transmitted and acked without error. 
Based on trpt trace data, tcp gets the rtt of only one in every
80 packets on a link with a 5% drop rate.  Then, because of the
gross filtering suggested in rfc793, only 10% of the new
measurement is used.  For a 15s rtt, this means it takes at least
400 packets to get the estimate from the default 3s to 7.5s
(where you stop doing unnecessary retransmits for segments with
average delay) and 1700 packets to get the estimate to 14s (where
you stop unnecessary retransmits because of variance in the
delay).  Also, if the minimum delay is greater than 6s
(2*TCPTV_SRTTDFLT), tcp can never learn the rtt because there
will always be a retransmit canceling the measurement.

There are several things we want to try to improve this
situation.  I won't suggest anything until we've done some
experiments.  But, the problem becomes easier to live with if
you pick a larger value for TCPTV_SRTTDFLT, say 6s, and improve
the transient response in the srtt filter (lower TCP_ALPHA to,
say, .7).
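
To put numbers on the "at least 400 packets" claim, here is a toy run
of the rfc793 filter (srtt = alpha*srtt + (1-alpha)*measurement) under
the conditions described above.  The one-measurement-per-80-packets
figure comes from the text; the rest is just arithmetic:

    #include <stdio.h>

    int
    main(void)
    {
        double alpha = 0.9;          /* rfc793 default filter gain */
        double srtt = 3.0;           /* TCPTV_SRTTDFLT, in seconds */
        double rtt = 15.0;           /* true round trip time */
        int pkts_per_sample = 80;    /* one usable measurement per ~80
                                      * packets at a 5% drop rate */
        int i;

        for (i = 1; i <= 24; i++) {
            srtt = alpha * srtt + (1.0 - alpha) * rtt;
            printf("sample %2d (~%4d packets): srtt = %5.2fs\n",
                i, i * pkts_per_sample, srtt);
        }
        return 0;
    }

The estimate passes 7.5s on the 5th sample (~400 packets) and is still
short of 14s after 20-odd samples, which is in the neighborhood of the
1700-packet figure above.  Dropping alpha to .7 gets past 7.5s in two
samples.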


4. The retransmit backoff is wimpy.

Given that most of the links are congested and exhibit a lot of
variance in delay, you would like the retransmit timer to back
off pretty aggressively, particularly given the lousy rtt
estimates.  4.3 backs off linearly most of the time.  The actual
intervals, in units of 2*rtt, are:
  1  1  2  4  6  8  10  15  30  30  30 ...
While this is only linear up to 10, the 30s clamp on timers
means you never back off as far as 10 if the mean rtt is >1.5s.
The effect of this slow backoff is to use up a lot of your 
potential retries early in a service interruption.  E.g., a
2 minute outage when you think the rtt is 3s will cost you 9
of your 12 retries.  If the outage happens while you were
trying to retransmit, you probably won't survive it.

This is another area where we want to do some experiments.  It
seems to me that you want to back off aggressively early on, say
 1 4 8 16 ...
for the first part of the table.  It also seems like you want
to go linear or constant at some point; waiting 8192*rtt for the
12th retry has to be pointless.  The dynamic range depends to
some extent on how good your rtt estimator is and on how robust
the retransmit part of your tcp code is.  Also, based on some
modelling of gateway congestion that I did recently, you don't 
want the retransmit time to be deterministic.  Our first cut
here will probably look a lot like the backoff on an ethernet.
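
As a purely illustrative comparison, the program below prints the
quoted 4.3 intervals next to one possible "back off hard early, go
constant later, add jitter" schedule.  The alternative table and the
randomization rule are guesses of the kind described above, not a
tested design:

    #include <stdio.h>
    #include <stdlib.h>

    int
    main(void)
    {
        /* 4.3 retransmit intervals, in units of 2*rtt (from the text) */
        static int bsd43[] = { 1, 1, 2, 4, 6, 8, 10, 15, 30, 30, 30, 30 };
        /* illustrative alternative: exponential early, constant later */
        static int alt[] = { 1, 4, 8, 16, 32, 64, 64, 64, 64, 64, 64, 64 };
        int i;

        srandom(1);
        for (i = 0; i < 12; i++) {
            /* randomize within [alt/2, alt] so separate connections
             * don't retransmit in lockstep through a congested gateway */
            long jittered = alt[i] - random() % (alt[i] / 2 + 1);

            printf("retry %2d:  4.3 = %2d   alternative = %3d (jittered %3ld)\n",
                i + 1, bsd43[i], alt[i], jittered);
        }
        return 0;
    }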


5. "keepalive" ignores rtt.

If you are setting SO_KEEPALIVE on any of your sockets, the 
connection will be aborted if there's no inbound packet for
6 minutes (TCPTV_MAXIDLE).  With a 2m rtt, that could happen
in the worst case with one dropped packet followed by one
dropped ack.  ("Sendmail" sets keepalive and we were having
a lot of problems with this when we first brought up 4.3.)

A fix is to multiply by t_srtt when setting the keepalive
timer and divide t_idle by t_srtt when comparing against
MAXIDLE.
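
A toy rendering of that idea (this shows just the scaling, not the 4.3
keepalive code; the "nominal" 1 second rtt that MAXIDLE is implicitly
sized for is an assumption):

    #include <stdio.h>

    int
    main(void)
    {
        double maxidle = 6 * 60;    /* TCPTV_MAXIDLE: 6 minutes of silence */
        double nominal_rtt = 1.0;   /* assumed rtt the 6 minutes suits */
        double srtt = 120.0;        /* this path: 2 minute round trip */
        double idle = 8 * 60;       /* silence after a couple of drops */

        printf("fixed limit (%.0fs):      %s\n", maxidle,
            idle >= maxidle ? "connection aborted" : "connection kept");
        printf("rtt-scaled limit (%.0fs): %s\n",
            maxidle * srtt / nominal_rtt,
            idle >= maxidle * srtt / nominal_rtt ?
            "connection aborted" : "connection kept");
        return 0;
    }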


6. The initial retransmit of a dropped segment happens, at
   best, after 3*rtt rather than 2*rtt.

If the delay is large compared to the window, the steady state
traffic looks like a burst of acks interleaved with data, an ~rtt
delay, a burst of acks interleaved with data and repeat.  4.3
doesn't time individual segments.  It starts a 2*rtt timer for
the first segment, then, when the first segment is acked,
restarts the timer at 2*rtt to time the next segment.  Since the
2nd segment went out at approximately the same time as the first
and since the ack for the first segment took rtt to come back,
the retransmit time for the 2nd segment is 3*rtt.  In the usual
internet case of 4k windows and an mss of 512, the probability of
a loss taking 3*rtt to detect is 7/8. 

The situation is actually worse than this on lossy circuits.
Because segments are not individually timed, all retransmits
will be timed 2*rtt from the last successful transfer (i.e.,
the last ack that moved snd_una).  This tends to add the time
taken by previous retransmissions into the retransmission time
of the current segment, increasing the mean rexmit time
and, thus, lowering the average throughput.  On a link with
a 5% loss rate, for long transfers, I've measured the mean time
to retransmit a segment as ~10*rtt.

The preceding may not be clear without a picture (it sure took
me a long time to figure out what was going on) but I'll try to
give an example.  Say that the window is 4 segments, the rtt is
R, you want to ship segments A-G and segments B and D are going
to get dropped.  At time zero you spit out A B C D.  At time R you
get back the ack for A, set the retransmit timer to go off at 3R
("now" + 2*rtt), and spit out E.  At 3R the timer goes off and you
retransmit B.  At 4R you get back an ack for C, set the retransmit
timer to go off at 6R and transmit F G. At 6R the timer goes off,
you retransmit D.  [D should have been retransmitted at 2R.]  Even
if we count the retransmit of B as delaying everything by 2R (in
what is essentially a congestion control measure), there is an
extra 2R added to D's retransmit because its retransmit time is
slaved to B's ack.  Also note that the average throughput has
gone from 8 packets in 2R (if no loss) to 8 packets in 7R, a
factor of four degradation.

The obvious fix here is to time each segment.  Unfortunately,
this would add 14 bytes to a tcpcb which would then no longer fit
in an mbuf.  So, we're still trying to decide what to do.  It's
(barely) possible to live within the space limitations by, say,
timing the first and last segments and assuming the segments were
generated at a uniform rate. 
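
For what it's worth, the interpolation itself is cheap.  A toy version
(the function name and the numbers are made up for the example):

    #include <stdio.h>

    /* estimate the send time of a byte in [first_seq, last_seq], assuming
     * the segments between the two timed ones left at a uniform rate */
    static double
    est_send_time(long seq, long first_seq, double first_time,
        long last_seq, double last_time)
    {
        if (last_seq == first_seq)
            return first_time;
        return first_time + (last_time - first_time) *
            (double)(seq - first_seq) / (double)(last_seq - first_seq);
    }

    int
    main(void)
    {
        /* eight 512-byte segments sent over 70ms starting at t = 0 */
        printf("estimated send time of the 5th segment: %.0f ms\n",
            est_send_time(4 * 512, 0, 0.0, 7 * 512, 70.0));
        return 0;
    }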


7. The retransmit policy could be better.

In the preceding example, you might have wondered why F G were
shipped after the ack for C rather than D.  If I'd changed the
example so that C was dropped rather than D, C D E F would have
been shipped when the ack for B came in (unnecessarily resending
D and E).  In either case the behavior is "wrong".  The reason it
happens is because an ack after a retransmit is treated the same
way as a normal ack.  I.e., because of data that might be in
transit you ignore what the ack tells you to send next and just
use it to open the window.  But, because the ack after a
retransmit comes 3*rtt after the last new data was injected, the
two sides are essentially in sync and the ack usually does tell
you what to send next. 

It's pretty clear what the retransmit policy should be.  We
haven't even started looking into the details of implementing
that policy in tcp_input.c & tcp_output.c.  If a grad student
would like a real interesting project ...

------------
There's more but you're probably as tired of reading as I am of
writing.  If none of this helps and if you have any Sun-3s handy,
I can probably send you a copy of my tcp monitor (as long as our
lawyers don't find out).  This is something like "etherfind"
except it prints out timestamps and all the tcp protocol info.
You'll have to agree to post anything interesting you find out
though...

Good luck.

  - Van

walsh@HARVARD.HARVARD.EDU.UUCP (12/08/86)

By having a maximum RTT of 2+ minutes, your RDP connection will stay open
at times when the Berkeley UNIX system will arbitrarily close the connection
after 30 seconds (rather than just informing the application about the
problem).  You are also seeing the benefits of EACKs.

bob

braden@ISI.EDU.UUCP (12/08/86)

Craig:

The amount of your hair pulling must be small compared to the time integral
of hair pulled by our UCL friends over the years.  Quite simply, their
conclusion was that most TCP implementations have design problems
that make them behave poorly over paths which have very long delay and
moderate to high loss.  SATNET sometimes (often?) exhibits that behaviour.
Recently, the ARPANET+core_gateway system has also exhibited that 
behaviour, and many TCP's have not been up to it... lots of broken
connections, etc.

I suggest that the cause of this situation is a performance/robustness
tradeoff inherent to TCP implementations.  Most of the
currently-available TCPs have been implemented and tested in a LAN
environment, to provide optimal performance in a low-delay, low-error
situation.  On the other hand, when we wrote the original experimental
implementations of TCP, we found the little beasties to be amazingly
robust; they would tenaciously hold on for minutes (or hours!)
retransmitting until a path came back, and would get the data through in
spite of terrible bugs.  But we were writing and testing them for
equally-experimental gateway implementations and frequently testing to
UCL, and did not demand high throughput or low delay.

It would certainly be interesting to understand exactly how these TCP's
have failed. I suspect it is a combination of a Zhang-catastrophe (RTT
measurement diverging towards infinity due to high loss rate) with an
implementation-imposed upper bound on retransmission time before the
connection breaks.  On the other hand, the answer may be that selective
retransmission is really absolutely essential to deal with the
long-delay, lossy situation.  I would like to get someone interested in
running some experiments on this (maybe you just did??).  Would it be
possible for you to disable just the selective retransmission feature of
RDP and try again?

Bob Braden 

 

karels%okeeffe@UCBVAX.BERKELEY.EDU (Mike Karels) (12/08/86)

Bob,
The timeout for TCP on 4.2/3 is rather longer than 30 sec.
The actual time depends on the round-trip time, as the limit
is on the number of retransmissions.  On 4.3, the timeout
is at least 108 sec. with short RTT's.  The limit on 4.2
was nearer 45 sec.  As Van Jacobson says in his message,
the keepalive time will have to be adjusted for long RTT's
as well, but I doubt that Craig is using the keepalive timer.

How can the TCP "just" inform the application about the problem?
Unless there's a control channel to the application that allows
the passage of status data, a send call must return an error.
After that error, the application can't tell how much of the data,
if any, was transmitted.  "Reliable byte stream with possible gaps
in case of error" isn't very satisfying.

		Mike

walsh@HARVARD.HARVARD.EDU (Bob Walsh) (12/08/86)

Mike,
	How can the TCP "just" inform the application about the problem?
	Unless there's a control channel to the application that allows
	the passage of status data, a send call must return an error.
	After that error, the application can't tell how much of the data,
	if any, was transmitted.  "Reliable byte stream with possible gaps
	in case of error" isn't very satisfying.

Informing the application that the networking system is having trouble
getting acknowledgements does not mean that the networking system has
given up on sending that data.  My intent was to point out that such a
decision should be left up to the application, which may in turn defer
the decision to the user.  This was one of the qualities of the BBN
TCP/IP software, as you know.  It is one of the reasons Bob Gilligan used
the BBN software for his demos.

Whether it is 30, 45 or 108 seconds doesn't matter.  What does matter is
that Craig is preserving the RDP connection for a longer time under such
circumstances and therefore is less likely to see the connection fail.

There is also the point of extended acknowledgements.

Bob Walsh

van@LBL-CSAM.ARPA.UUCP (12/08/86)

I've been told that my 6th & 7th points (4.3bsd retransmit timers
need some work) were incomprehensible.  That's what you get
when you reply to messages at 3am Sunday morning.  Since the
retransmit timer behavior results in the biggest performance
loss (the other problems affect congestion & stability more than
performance), I'll take a crack at explaining it better.

Attached is a picture of the problem.  It is taken directly from
a trace but the window size has been reduced from 8*MSS to 4*MSS
to simplify the drawing.  Time runs down the page.  The time axes
has tick marks at multiples of the round trip time, R.  The sender
is on the left, the receiver on the right.  Seven segment are
sent, labeled A through G.  Two segments, B and D, get lost or 
damaged in transit.  A lower case letter is used for a receiver's
ack (e.g., "g" is the ack for all bytes up to and including the
last byte of segment "G").  A list of all the segments successfully
received so far is in square brackets at the point where each
ack is generated.  Holes in the sequence space are indicated by "-".

All the traffic goes one direction (this was an ftp).  4.3 almost
always sends MSS byte segments and all these were of size MSS (512B).
Because of the 4.3 delayed ack code, the receiver almost always
reports a full size window (4KB in 4.3, 2KB in this example) in an
ack.  All these acks report a 4 MSS (2KB) window.

All sends are timed.  The retransmit timer is set to 2 times the
smoothed round trip time (TCP_BETA * t_srtt).  The timer is set
on each ack that's not a duplicate of a previous ack (i.e., that
changes the "sent but unacknowleged" pointer, snd_una).  If the
timer times out, the segment starting at snd_una is retransmitted
and the timer is restarted at 2*srtt.  Exactly one segment is
retransmitted.  Periodic retransmissions of that segment continue
until it is acked.  When the segment is acked, the retransmit
timer is set to 2*srtt and "normal" behavior resumes (see rfc793
if you're not sure what normal behavior is). 

	  0-| A				(set timer to 2R, send enough
	    | B\			 packets to fill window (4))
	    | C\\
	    | D\\\
	    |  \\*\
	    |   *\ a [A]		(ack A)
	    |     X
	    |    / a [A - C]		(save C but can only ack through A)
	    |   / /
	    |  / /
	 1R-| E /			(A ack received, set timer to 3R,
	    |  \			 ack opens window by 1 so send E)
	    | - \			(duplicate A ack discarded)
	    |    \
	    |     \
	    |      a [A - C - E]	(save E but can only ack through A)
	    |     /
	    |    /
	    |   /
	    |  /
	 2R-| -				(duplicate A ack discarded)
	    | 
	    |
	    |
	    |
	    |
	    |
	    |
	    |
	    |
	 3R-| B				(timer goes off, rexmit first
	    |  \			 unacked segment (B), timer set to 5R)
	    |   \
	    |    \
	    |     \
	    |      c [A B C - E]	("B" fills in sequence space up
	    |     /			 through "C", ack C)
	    |    /
	    |   /
	    |  /
	 4R-| F				(ack of C opens window for 2 more
	    | G\			 segments, timer set to 6R)
	    |  \\
	    |   \\
	    |    \\
	    |     \c [A B C - E F]	(missing D, can only ack through C)
	    |     /c [A B C - E F G]
	    |    //
	    |   //
	    |  //
	 5R-| -/			(duplicate acks for C discarded)
	    | -
	    |
	    |
	    |
	    |
	    |
	    |
	    |
	    |
	 6R-| D				(timer goes off, rexmit first
	    |  \			 unacked segment (D), timer set 8R)
	    |   \
	    |    \
	    |     \
	    |      g [A B C D E F G]	(sequence space complete, ack G)
	    |     /
	    |    /
	    |   /
	    |  /
	 7R-| 


There are two problems here: the gap between 2R & 3R and the fact
that we don't send D at 4R.  The idle time from 2-3 (and from
5-6) happens because our timer is always 2*R from the last useful
ack and is essentially unrelated to when a segment is originally
sent.  (The code wasn't intended to work this way and on low delay
circuits it works correctly.)  We should really be retransmitting
B 2*R from its first transmission (i.e., 1 line after the 2R tick
mark).  It's not too hard to show analytically that this (the
current 4.3 algorithm) "feeds forward" (e.g., the recovery for D
is moved later in time and is more likely to conflict with F,G
recovery) which is why throughput degrades much faster than
linearly with increasing loss rate. 

You can view the late transmission of D two ways.  It could be
another example of the timer problem.  I.e., we should have
retransmitted D 3 lines after the 2R tick.  We held off sending
it then because we thought the network might be congested and we
wanted to send a minimum amount of data until we got back an
indication (the ack) that the congestion had cleared up.  But we
certainly should have sent D at 4R when we got the "c" ack. 

Or you can say that when we get the "c" ack after the
retransmission of B, no packets have been injected into the
network for 2*R.  The ack tells you pretty clearly that the
receiver is missing D. (Either point of view will do the "right"
thing in this case but treating a retransmit ack specially buys
you a bit in one other case). 

If the two problems are corrected, the total time drops from 7R
to 4R (2R is the total time if no packets are lost).  If we don't
do the send-1-packet-on-rexmit congestion control, the total time
drops to 3R, the minimum possible if one or more packets are
dropped.

Also, this partly illustrates why I thought Craig's measurements
demonstrated a problem in TCP rather than the superiority of RDP.
Even with EACKs, it takes RDP 3R to send the data if the same two
packets are lost, exactly the same time it takes TCP.  I think I
can show that EACKs aren't a big win until the drop rate is >50%,
if TCP is working as well as it can (that's not to say RDP isn't
a win for other reasons). 

  - Van

gds@EDDIE.MIT.EDU.UUCP (12/12/86)

> Mike,
> 	How can the TCP "just" inform the application about the problem?
> 	Unless there's a control channel to the application that allows
> 	the passage of status data, a send call must return an error.
> 	After that error, the application can't tell how much of the data,
> 	if any, was transmitted.  "Reliable byte stream with possible gaps
> 	in case of error" isn't very satisfying.
> 
> Informing the application that the networking system is having trouble
> getting acknowledgements does not mean that the networking system has
> given up on sending that data.  My intent was to point out that such a
> decision should be left up to the application, which may in turn defer
> the decision to the user.  This was one of the qualities of the BBN
> TCP/IP software, as you know.  It is one of the reasons Bob Gilligan used
> the BBN software for his demos.
> 

There are some Unix applications, like the 4.2 version of telnet,
which do a close() if they get something like ETIMEDOUT, which can
occur if a TCP timer has gone off.  In the 4.2 BBN TCP/IP, the timer
going off does not cause the connection to be closed.  Instead, a
routine called advise_user lets the higher layers know about the
problem and lets them deal with it.  I was rather surprised to see
that 4.2 telnet was giving up when the actual connection was not
remotely closed.  With a quick patch to telnet to prevent it from
closing for errors when a TCP timer has gone off, you can maintain
telnet connections for hours.  The reconstitution protocols required
that connections remain open for long periods of time during network
dynamics.

I haven't looked at 4.3 telnet or TCP to see if this is fixed or
handled differently.

--gregbo

rhc@hplb.CSNET.UUCP (12/12/86)

I would like to reinforce Bob's concern at the Internet layer. As an
ex-UCL person I am also concerned at the current tradeoff in TCP
implementations towards LAN-specific high performance and away from
Internet-general robustness.

The maximum packet size on SATNET is 256 bytes. If you are sending
large (>576) TCP packets then the gateway to the ARPANET will
fragment, then the ARPANET/SATNET gateway will fragment each of these
again. The result is a large number of fragments bursting onto SATNET,
pinging off Goonhilly (why do you have to use the busiest European
earth station, anyway?), then each fragment tries to get back across the
ARPANET. At the SATNET/ARPANET gateway we have the ARPANET flow
control problem. It is quite likely that every packet sent from the
host has turned into at least 4 and possibly 6 packets for the return!
Have you considered your reassembly timeout?
I have seen packets hang around in these gateways for 16 minutes (yes,
that's right, MINUTES).
After a little while the gateway queues will fill up and the TCP will
time out because the IP reassembly is not able to get enough fragments
to build enough packets to keep the TCP happy.
Now if the TCP were able to use the same IP packet number for all
retransmissions then the situation would improve - and I suspect that
this is where RDP is winning!
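
The fragment arithmetic above is easy to check roughly.  The sketch
below counts only the fragmentation forced by SATNET's 256-byte limit
(20-byte IP headers and fragment payloads rounded to a multiple of 8
are assumptions, and the intermediate ARPANET fragmentation is
ignored, so the real counts can be higher):

    #include <stdio.h>

    /* rough fragment count for one datagram at a given MTU */
    static int
    fragments(int datagram_len, int mtu)
    {
        int data = datagram_len - 20;           /* bytes to be carried */
        int per_frag = ((mtu - 20) / 8) * 8;    /* data per fragment */

        return (data + per_frag - 1) / per_frag;
    }

    int
    main(void)
    {
        int seg;

        for (seg = 512; seg <= 1024; seg += 512) {
            int datagram = seg + 20 + 20;       /* tcp + ip headers */

            printf("%4d byte segment -> %d fragments at a 256 byte MTU\n",
                seg, fragments(datagram, 256));
        }
        return 0;
    }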

Just because the TCP connection is failing does not mean that the TCP
layer is the only one that is broken.

Happy satellites,
Robert.