[mod.protocols.tcp-ip] Yet more on RTTs

van@lbl-csam.arpa (Van Jacobson) (11/14/86)

A few weeks ago there was a query about estimating packet round trip 
time for an RDP implementation.  I replied with some local measurements
that suggested that TCP's problems might be similar to RDP's:  most
TCP conversations were so short that they behaved as datagrams.  Based
on this, I suggested that a part of RTT maintenance be moved from
the TCP layer to the IP layer.  If RTT really is common to RDP and TCP,
the IP layer is the logical place to put it.  Based on measurement and
simulation, I have reason to believe that this move would improve
the Internet's present, abysmal performance.

I've been out of touch for two weeks (a problem of interpersonal
congestion control) and read the past two weeks of TCP-IP messages
last night.  The RTT messages were disappointing:  they addressed
problems whose solution is known and which are being solved (albeit
slowly).  I think we're facing a whole new set of problems.  In an
effort to promote some light (or heat), what follows is my simple
minded explanation of what's going on and what we might start to
do about it.  (In what follows, "connection" means a conversation
between two processes over a network, not a TCP connection.)

RTT is measured to help deal with unreliable packet delivery.  When
packets are delivered reliably, TCP and RDP are self-clocking.  Since
we know that delivery is unreliable, we design our protocols to make
an educated guess about whether a particular packet has been lost:
If no "clock" has been received for a "long time" (relative to the
round trip time), the packet probably needs to be retransmitted.
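
To make the "educated guess" concrete, here is a minimal C sketch of
the estimator the current specs call for: the smoothed round trip
time of RFC 793 driving the retransmit timeout.  The gains and the
initial estimate are illustrative values from within the spec's
suggested ranges, not from any particular implementation.

    #include <stdio.h>

    #define ALPHA 0.9   /* smoothing gain, within RFC 793's range */
    #define BETA  2.0   /* delay variance factor, likewise */

    static double srtt = 1.0;   /* smoothed RTT estimate, seconds */

    /* Fold a new measurement into the estimate and return the
     * timeout: if no ACK ("clock") arrives within this interval,
     * the packet is presumed lost. */
    static double update_rto(double measured_rtt)
    {
        srtt = ALPHA * srtt + (1.0 - ALPHA) * measured_rtt;
        return BETA * srtt;
    }

    int main(void)
    {
        double samples[] = { 0.5, 0.6, 2.0, 0.7 };  /* hypothetical */
        for (int i = 0; i < 4; i++)
            printf("rtt = %.2fs  rto = %.2fs\n", samples[i],
                   update_rto(samples[i]));
        return 0;
    }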

There are two reasons for losing a packet:
  1) It was damaged or misplaced in transit.
  2) It was discarded due to congestion.
The appropriate recovery strategy depends on the reason:  For (1),
the packet should be retransmitted as soon as possible.  For (2),
the retransmission should happen after a "long" time (many times
the round trip time) so the congestion has a chance to clear (I'm
making the assumption that there's substantial buffering in the
subnet so the time constants for congestion are long -- this is
true of the nets I deal with and, given current memory prices,
likely to remain true).  
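
In sketch form, assuming for the moment that the sender knew the
cause (it usually doesn't, as the next paragraph argues); the factor
of ten stands in for "many times the round trip time" and is not a
measured constant:

    #include <stdio.h>

    enum loss_cause { LOSS_DAMAGE, LOSS_CONGESTION };

    /* How long to wait before resending, if the cause were known. */
    static double retransmit_delay(enum loss_cause cause, double rtt)
    {
        if (cause == LOSS_DAMAGE)
            return 0.0;          /* (1): resend as soon as possible */
        return 10.0 * rtt;       /* (2): let the congestion drain first */
    }

    int main(void)
    {
        printf("damage: wait %.1fs, congestion: wait %.1fs\n",
               retransmit_delay(LOSS_DAMAGE, 2.0),
               retransmit_delay(LOSS_CONGESTION, 2.0));
        return 0;
    }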

Given that the sender doesn't know whether (1) or (2) has occurred,
what strategy should be used?  If the strategy for (2) is chosen when (1)
is the cause, the throughput on this connection will go down a bit.  If
the strategy for (1) is chosen when (2) is the cause, the problem will
get much worse, both for this host and others on the net.  In the
absence of other information, the principle of Least Damage tells us to
use the strategy for (2).  [Experience also suggests that damaged
packets are unlikely -- the error rate on our worst net is <0.1%.  But,
if you really have to get maximum throughput on a connection, as
opposed to maximum aggregate throughput on all your connections, the
Pollaczek-Khinchine equation says that the variance of the RTT
estimates can be used to distinguish (1) & (2).  I design networks for
real-time control in hostile environments and occasionally make use of
this.  It's not generally useful.]
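
A sketch of the shape of that test, with invented gains and an
invented threshold (a real version would calibrate against the path's
measured statistics).  Queueing inflates both the mean and the
variance of the RTT, so a timeout seen while the deviation is large
is probably (2); a timeout on a quiet, steady path is probably (1):

    #include <stdio.h>
    #include <math.h>

    static double rtt_avg = 1.0;    /* exponential averages of the */
    static double rtt_dev = 0.0;    /* RTT and its mean deviation  */

    static void rtt_sample(double rtt)
    {
        double err = rtt - rtt_avg;
        rtt_avg += 0.125 * err;                    /* illustrative */
        rtt_dev += 0.25 * (fabs(err) - rtt_dev);   /* gains        */
    }

    static int timeout_suggests_congestion(void)
    {
        return rtt_dev > 0.5 * rtt_avg;  /* 0.5 is an assumed threshold */
    }

    int main(void)
    {
        double path[] = { 1.0, 1.1, 3.0, 4.5, 6.0 };  /* hypothetical */
        for (int i = 0; i < 5; i++)
            rtt_sample(path[i]);
        printf("congestion? %s\n",
               timeout_suggests_congestion() ? "probably" : "unlikely");
        return 0;
    }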

The strategy for (1) is very "local".  It can be detected by the
process running the connection and that process can take corrective
action that should both cure the problem and have negligible effect
on other connections.  The strategy for (2) is global.  The congestion
detected on a connection is probably not caused by that connection.
In fact, it is likely that no single connection is the cause.  Thus no
single action is going to cure the problem, only the combined effect of
several connections reducing their traffic rate.  For this to happen,
each of those connections, including newly opened ones, has to discover
the problem (which means sending packets, which aggravates the
problem).  The recovery time is clearly going to be exponential with a
long time constant.

A way to reduce the recovery time is to introduce more coupling between
the connections.  Congestion is a property of network path(s), not of
connections.  When one connection discovers congestion on a path,
that information should be made available to all connections in the
same machine using that path.  This isn't hard to implement:  A lot
of the congestion happens over paths that look like:


 Host A-|                                     |
        |                                     |
        |-GwyB---------------------------GwyC-|
        |                                     |
        |                                     |-Host D

Where A is talking to D, the vertical lines are relatively high-speed,
local nets and the horizontal line is one or more low-speed, long-haul nets.  The
difference in net speeds means that any congestion will almost certainly
occur somewhere on the path from B to C.  This means that, from A's
point of view, the round trip time to D is characteristic of the RTT to
any host served by C (I have data which says that the gateway accounts
for 90% of the variation in RTT).  If A contains a routing entry for C
(IP requires a routing entry in A for B or C or both), a slot could be
left in that entry for RTT.  If TCP, RDP, etc., used that slot for
the value in all their RTT calculations, information about the path
would automatically be shared (and also wouldn't be lost when a TCP
connection closed).  Just this much change would eliminate the "turn-
on transient" of retransmissions that occur while a tcp connection
is learning the RTT.
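
A minimal sketch of that change, with invented structures and names
(a real version would live in the IP routing code):

    #include <stdio.h>
    #include <string.h>

    /* Hang a smoothed RTT off the routing entry so every
     * conversation through a gateway shares one estimate. */
    struct route_entry {
        char   gateway[32];   /* e.g. "GwyC" */
        double srtt;          /* shared smoothed RTT via this gateway */
    };

    static struct route_entry routes[16];
    static int nroutes;

    static struct route_entry *route_for(const char *gateway)
    {
        for (int i = 0; i < nroutes; i++)
            if (strcmp(routes[i].gateway, gateway) == 0)
                return &routes[i];
        struct route_entry *r = &routes[nroutes++];
        strncpy(r->gateway, gateway, sizeof r->gateway - 1);
        r->srtt = 3.0;    /* conservative guess for a brand-new path */
        return r;
    }

    /* TCP, RDP, etc. all fold their measurements into the slot. */
    static void note_rtt(const char *gateway, double rtt)
    {
        struct route_entry *r = route_for(gateway);
        r->srtt = 0.9 * r->srtt + 0.1 * rtt;
    }

    int main(void)
    {
        note_rtt("GwyC", 2.2);   /* one connection learns the path... */
        note_rtt("GwyC", 2.4);
        /* ...and a later connection inherits the path's RTT instead
         * of suffering the turn-on transient of retransmissions. */
        printf("new connection via GwyC starts at srtt = %.2fs\n",
               route_for("GwyC")->srtt);
        return 0;
    }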

Once one stops regarding RTT as a property of connections and starts to
regard it as a measured, dynamic property of the topology, some related
ideas start to look interesting.  Like B telling A topology and A
telling B transit times (the stability problems of the old Arpanet
routing protocol shouldn't show up if this is only done locally and
RTT(local) << RTT(long haul)).  Or treating "source quench" as if it
meant "I'm congested" rather than "You should shut up" (it obviously
means both).  Under the first interpretation, it is information about
the state of part of the path.  If gateways along the return path
wiretap, the information serves several hosts rather than a
single TCP conversation (and we start to get distributed congestion
control via "choke packets" which have some nice properties if there's
enough buffering in the subnet to handle the diffusion time.)
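
In sketch form, with everything invented for illustration; the point
is only that one quench updates state shared by every connection on
the path:

    #include <stdio.h>

    struct path_state {
        double srtt;        /* shared smoothed RTT, as above */
        int    congested;   /* set by source quench; a real version
                               would decay this with time */
    };

    static void on_source_quench(struct path_state *p)
    {
        p->congested = 1;   /* "I'm congested" -- path information */
    }

    static double send_interval(const struct path_state *p)
    {
        /* "You should shut up" -- every connection consulting the
         * shared state stretches its sending interval. */
        return p->congested ? 10.0 * p->srtt : p->srtt;
    }

    int main(void)
    {
        struct path_state path = { 2.0, 0 };
        printf("before quench: %.1fs between packets\n",
               send_interval(&path));
        on_source_quench(&path);
        printf("after quench:  %.1fs between packets\n",
               send_interval(&path));
        return 0;
    }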

[I'm sure I'll be toasted for something in the preceding paragraph,
if not for the rest of this opus.  We learn by making mistakes.]

I'll close with a brief reiteration of my context.  As the round trip
time of the Internet has gotten worse, the nature of our (locally
generated) traffic has changed.  Only someone desperate or mad would
try to telnet.  Our ftp lacks an automatic retry and it took only a
few "connection timed out"s to make our users abandon file transfer.
The result is a high proportion of mail traffic.  Our usual congestion
is not caused by a few hosts flooding the net with packets (perhaps
because the few hosts we've found doing this were quickly and forcibly
disconnected).  Our usual congestion is the result of a large fraction
of the 200 hosts on an ethernet trying to ship mail through a gateway
with a 9.6Kbit output line.  Each host sends 3 small packets and one big
one, "HELO", "MAIL FROM", "RCPT TO" and "DATA..." (the small packets
are SMTP's fault -- thanks to John Nagle's accumulate-until-the-ack, all
the packets are as big as they can be).  The destinations are usually
different.  I don't know of a congestion algorithm that deals with this
situation but I feel we need one.
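
For reference, a minimal sketch of the accumulate-until-the-ack rule
(RFC 896), with invented field names.  SMTP is lock-step, so each
command has usually been acknowledged before the next is written and
the rule has nothing to accumulate; it can only promise that no
packet is smaller than it has to be:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    struct tcb {
        size_t unacked_bytes;   /* sent but not yet acknowledged */
        size_t queued_bytes;    /* waiting in the send buffer */
        size_t mss;             /* maximum segment size */
    };

    /* Send now only with a full segment or an idle pipe; otherwise
     * hold the small data until the outstanding ACK returns. */
    static bool may_send(const struct tcb *t)
    {
        return t->queued_bytes >= t->mss || t->unacked_bytes == 0;
    }

    int main(void)
    {
        struct tcb t = { .unacked_bytes = 6, .queued_bytes = 20,
                         .mss = 512 };
        /* A small write arrives while earlier data is outstanding:
         * it waits for the ACK instead of going out as a tinygram. */
        printf("send now? %s\n", may_send(&t) ? "yes" : "no");
        return 0;
    }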

  - Van Jacobson

karn@FLASH.BELLCORE.COM (Phil R. Karn) (11/15/86)

I feel that the "round trip timing problem" by itself is a red herring.
It's really a symptom of a larger problem.  A TCP with a very simple RTT
algorithm that always errs on the high side would still perform quite well
if the network didn't drop so many packets. The network wouldn't drop so
many packets if it wasn't being swamped by so many badly designed and
mistuned TCPs.

Last summer in Monterey there was a lot of discussion about vendor
certification and how much hard work it takes to test a protocol
implementation. The thing is, we already HAVE a "validation suite"; it's
called "operational use in the ARPA Internet"!  Furthermore, it has already
revealed some serious problems in some very popular implementations, but the
vendors have yet to fix the damn things (including the maker of the
workstation I'm typing this on). I've seen several (object only) releases of
software for this system come and go since RFC-896 came out, and they
STILL don't have the Nagle algorithm.  Given how popular this system is, it's
no surprise that the Internet is in such trouble.  I think we should
concentrate on fixing known problems before we invent new ones to solve.

Phil

van@LBL-CSAM.ARPA (Van Jacobson) (11/15/86)

RTTs are not a red herring.  As the specs now stand, RTT and the
associated TCP retransmit algorithm are the *only* congestion
control for 99%+ of the Internet traffic.  The Nagle algorithm is
not for congestion control; it increases the line efficiency so
you are less likely to need congestion control.  This only
postpones the day of reckoning.  We have on the order of 30,000
networked computers sitting behind the Internet backbone and the
number is increasing exponentially.  Long-haul services like
NSFNet supercomputer access are going to increase the traffic
those computers send across the backbone.  There is a factor of
100 difference between the number of customers and the number of
backbone circuits.  There is a factor of 200 impedance mismatch between
the local nets and the backbone.  With these numbers, congestion
is guaranteed, even with everyone running every algorithm that
John devises. 

The problem could be avoided if there were some way to solve it in
the gateways.  RFC970 proposes one such algorithm.  I started to
implement it but took some data that convinced me it wouldn't
help.  In fact, I couldn't see anything that would help short of
improving the congestion algorithms in the endnodes.  I saw a
useful endnode change, implemented part of it and it worked.  But
to work well it requires more topology information going between
the gateways and the endnodes.  This is undesirable, as is the
thought of changing the TCP in all 1000 of our local computers. 
Thus it seemed worthwhile to continue the discussion in this
forum. 

If congestion problems can be solved in the gateways, we have
quite a bit of time and only need to do trial implementations and
measurements to verify that the proposed algorithms work in real
world traffic.  If problems have to be solved in the endnodes, we
have to implement and verify solutions now, then start leaning
hard on vendors to adopt those solutions.  If we went to the
vendors today, it would be at least a year until we could buy the
fruits of our labor. 

With luck, John Nagle's algorithms, and DCA/NSF infusions of
money into the backbone, we have a year to solve the next set of
problems.  But we still have to solve them.  Telling vendors to
hurry up and market things they should have had yesterday is very
important.  So is figuring out what to tell them tomorrow. 

  - Van

karn@FLASH.BELLCORE.COM (Phil R. Karn) (11/16/86)

Okay, you're right. RTTs are not a red herring, because yet ANOTHER
unfixed bug in my Brand X workstation (and widespread on the net)
is the incorrect computation of round trip time from the last transmission
to the first ACK of a sequence number.  Fix this one, throw in the Nagle
algorithm, and set the initial RTT to a reasonable value like 5-10 sec,
and I think we'll be in good shape for quite some time.
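
A conservative sketch of the fix, with invented names: since an ACK
for a retransmitted segment is ambiguous (it may answer the first
transmission or the last), simply refuse to take an RTT sample from
it:

    #include <stdbool.h>
    #include <stdio.h>

    struct segment {
        double first_sent;      /* time of the original transmission */
        bool   retransmitted;   /* set on any retransmission */
    };

    /* Store a sample only when the ACK is unambiguous. */
    static bool rtt_sample_on_ack(const struct segment *s, double now,
                                  double *rtt)
    {
        if (s->retransmitted)
            return false;     /* ambiguous: don't poison the estimate */
        *rtt = now - s->first_sent;
        return true;
    }

    int main(void)
    {
        struct segment ok  = { 10.0, false };
        struct segment dup = { 10.0, true  };
        double rtt;
        if (rtt_sample_on_ack(&ok, 12.5, &rtt))
            printf("clean sample: %.1fs\n", rtt);
        if (!rtt_sample_on_ack(&dup, 12.5, &rtt))
            printf("retransmitted segment: sample skipped\n");
        return 0;
    }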

Phil