van@lbl-csam.arpa (Van Jacobson) (11/14/86)
A few weeks ago there was a query about estimating packet round trip time for an RDP implementation. I replied with some local measurements that suggested that TCP's problems might be similar to RDP's: most TCP conversations were so short that they behaved as datagrams. Based on this, I suggested that part of RTT maintenance be moved from the TCP layer to the IP layer. If RTT really is common to RDP and TCP, the IP layer is the logical place to put it. Based on measurement and simulation, I have reason to believe that this move would improve the Internet's present, abysmal performance.

I've been out of touch for two weeks (a problem of inter-personal congestion control) and read the past two weeks of TCP-IP messages last night. The RTT messages were disappointing: they addressed problems whose solution is known and which are being solved (albeit slowly). I think we're facing a whole new set of problems. In an effort to promote some light (or heat), what follows is my simple-minded explanation of what's going on and what we might start to do about it. (In what follows, "connection" means a conversation between two processes over a network, not a TCP connection.)

RTT is measured to help deal with unreliable packet delivery. When packets are delivered reliably, TCP and RDP are self-clocking. Since we know that delivery is unreliable, we design our protocols to make an educated guess about whether a particular packet has been lost: if no "clock" has been received for a "long time" (relative to the round trip time), the packet probably needs to be retransmitted.

There are two reasons for losing a packet:

  1) It was damaged or misplaced in transit.
  2) It was discarded due to congestion.

The appropriate recovery strategy depends on the reason. For (1), the packet should be retransmitted as soon as possible. For (2), the retransmission should happen after a "long" time (many times the round trip time) so the congestion has a chance to clear. (I'm making the assumption that there's substantial buffering in the subnet, so the time constants for congestion are long -- this is true of the nets I deal with and, given current memory prices, likely to remain true.)

Given that the sender doesn't know whether (1) or (2) has occurred, what strategy should be used? If the strategy for (2) is chosen when (1) is the cause, the throughput on this connection will go down a bit. If the strategy for (1) is chosen when (2) is the cause, the problem will get much worse, both for this host and others on the net. In the absence of other information, the principle of Least Damage tells us to use the strategy for (2).

[Experience also suggests that damaged packets are unlikely -- the error rate on our worst net is <0.1%. But if you really have to get maximum throughput on a connection, as opposed to maximum aggregate throughput on all your connections, the Pollaczek-Khinchine equation says that the variance of the RTT estimates can be used to distinguish (1) & (2). I design networks for real-time control in hostile environments and occasionally make use of this. It's not generally useful.]
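To make the Least Damage rule concrete, here is a sketch of a retransmit timer that treats every timeout as congestion and backs off to many round trip times. The structure, names, and constants are made up for illustration; this isn't code from any particular TCP:

    /*
     * On an ACK (our "clock") we trust the path and track the
     * measured RTT.  On a timeout we can't tell damage (1) from
     * congestion (2), so per Least Damage we assume (2) and back
     * the timer off to "many times the round trip time".
     */
    #define RTO_MIN     1.0       /* seconds */
    #define RTO_MAX   240.0       /* seconds */
    #define BACKOFF     4.0       /* timer growth per timeout */

    struct rtx_timer {
            double srtt;          /* smoothed RTT estimate */
            double rto;           /* current retransmit timeout */
    };

    void on_ack(struct rtx_timer *t, double measured_rtt)
    {
            /* RFC-793 style exponential smoothing, alpha = 0.9 */
            t->srtt = 0.9 * t->srtt + 0.1 * measured_rtt;
            t->rto = 2.0 * t->srtt;              /* beta = 2 */
            if (t->rto < RTO_MIN)
                    t->rto = RTO_MIN;
    }

    void on_timeout(struct rtx_timer *t)
    {
            /* Assume (2): wait many RTTs before retransmitting. */
            t->rto *= BACKOFF;
            if (t->rto > RTO_MAX)
                    t->rto = RTO_MAX;
    }

(On the Pollaczek-Khinchine remark: for an M/G/1 queue the mean wait is W = lambda * E[S^2] / (2 * (1 - rho)), which blows up as the utilization rho approaches 1. So a loss accompanied by rapidly growing RTT mean and variance looks like (2), while a loss that leaves the delay distribution unchanged looks like (1).)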
The strategy for (1) is very "local". The loss can be detected by the process running the connection, and that process can take corrective action that should both cure the problem and have negligible effect on other connections. The strategy for (2) is global. The congestion detected on a connection is probably not caused by that connection. In fact, it is likely that no single connection is the cause. Thus no single action is going to cure the problem, only the combined effect of several connections reducing their traffic rate. For that to happen, each of those connections has to discover the problem (which means sending packets, which aggravates the problem), including newly opened connections. The recovery time is clearly going to be an exponential with a long time constant.

A way to reduce the recovery time is to introduce more coupling between the connections. Congestion is a property of network path(s), not of connections. When one connection discovers congestion on a path, that information should be made available to all connections in the same machine using that path. This isn't hard to implement. A lot of the congestion happens over paths that look like:

    Host A-|
           |
           |
           |
           |-GwyB---------------------------GwyC-|
                                                 |
                                                 |
                                                 |
                                                 |-Host D

where A is talking to D, the vertical lines are relatively high-speed local nets, and the horizontal line is a low-speed, long-haul net (or nets). The difference in net speeds means that any congestion will almost certainly occur somewhere on the path from B to C. This means that, from A's point of view, the round trip time to D is characteristic of the RTT to any host served by C (I have data which says that the gateway accounts for 90% of the variation in RTT). If A contains a routing entry for C (IP requires a routing entry in A for B or C or both), a slot could be left in that entry for RTT. If TCP, RDP, etc., used that slot for the value in all their RTT calculations, information about the path would automatically be shared (and also wouldn't be lost when a TCP connection closed). Just this much change would eliminate the "turn-on transient" of retransmissions that occur while a TCP connection is learning the RTT.
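Concretely, the slot might look like this. The structure and field names are invented for illustration -- the point is only that the smoothed RTT lives with the route instead of with the connection:

    /*
     * RTT kept with the route (i.e., with the path through GwyC)
     * rather than in each TCP's connection state.
     */
    struct route_entry {
            unsigned long dst_net;   /* net reached via this gateway */
            unsigned long gateway;   /* e.g., GwyC in the picture */
            double        srtt;      /* shared smoothed RTT for the path */
    };

    /* Every connection's measurement updates the path estimate. */
    void route_rtt_update(struct route_entry *rt, double measured_rtt)
    {
            if (rt->srtt == 0.0)
                    rt->srtt = measured_rtt;   /* first sample on path */
            else
                    rt->srtt = 0.9 * rt->srtt + 0.1 * measured_rtt;
    }

    /* A new TCP (or RDP) connection starts from the path estimate
     * instead of a wired-in constant, so it never has to relearn
     * the RTT -- no turn-on transient of retransmissions. */
    double route_rtt_initial(struct route_entry *rt)
    {
            return rt->srtt != 0.0 ? rt->srtt : 3.0;  /* else guess */
    }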
Once one stops regarding RTT as a property of connections and starts to regard it as a measured, dynamic property of the topology, some related ideas start to look interesting. Like B telling A topology and A telling B transit times (the stability problems of the old Arpanet routing protocol shouldn't show up if this is only done locally and RTT(local) << RTT(long haul)). Or treating "source quench" as if it meant "I'm congested" rather than "You should shut up" (it obviously means both). Under the first interpretation, it is information about the state of part of the path. If gateways along the return path wiretap, the information does service to several hosts rather than a single TCP conversation (and we start to get distributed congestion control via "choke packets", which have some nice properties if there's enough buffering in the subnet to handle the diffusion time).

[I'm sure I'll be toasted for something in the preceding paragraph, if not for the rest of this opus. We learn by making mistakes.]

I'll close with a brief reiteration of my context. As the round trip time of the Internet has gotten worse, the nature of our (locally generated) traffic has changed. Only someone desperate or mad would try to telnet. Our ftp lacks an automatic retry, and it took only a few "connection timed out"s to make our users abandon file transfer. The result is a high proportion of mail traffic. Our usual congestion is not caused by a few hosts flooding the net with packets (perhaps because the few hosts we've found doing this were quickly and forcibly disconnected). It is the result of a large fraction of the 200 hosts on an ethernet trying to ship mail through a gateway with a 9.6Kbit output line. Each host sends three small packets and one big one: "HELO", "MAIL FROM", "RCPT TO" and "DATA..." (the small packets are SMTP's fault -- thanks to John Nagle's accumulate-until-the-ack, all the packets are as big as they can be). The destinations are usually different. I don't know of a congestion algorithm that deals with this situation, but I feel we need one.
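For reference, the accumulate-until-the-ack rule is a one-line test at output time. This sketch is my paraphrase of RFC-896, with invented structure and field names:

    /*
     * Send a small segment only when nothing is outstanding;
     * otherwise queue the data so it can coalesce into one
     * full-sized segment by the time the ACK comes back.
     */
    struct tcb {
            unsigned long snd_una;  /* oldest unacked sequence number */
            unsigned long snd_nxt;  /* next sequence number to send */
            unsigned long mss;      /* max segment size for this path */
    };

    int ok_to_send(struct tcb *tp, unsigned long queued_bytes)
    {
            int idle = (tp->snd_nxt == tp->snd_una);  /* all acked */
            return idle || queued_bytes >= tp->mss;
    }

And that's why the small packets are SMTP's fault rather than TCP's: SMTP is a lock-step protocol, so the sender writes "HELO", waits for the reply, writes "MAIL FROM", and so on. Each command meets an idle connection with an empty queue and legitimately goes out as its own small packet; there is nothing for the rule above to coalesce.

 - Van Jacobson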
karn@FLASH.BELLCORE.COM (Phil R. Karn) (11/15/86)
I feel that the "round trip timing problem" by itself is a red herring. It's really a symptom of a larger problem. A TCP with a very simple RTT algorithm that always errs on the high side would still perform quite well if the network didn't drop so many packets. And the network wouldn't drop so many packets if it weren't being swamped by so many badly designed and mistuned TCPs.

Last summer in Monterey there was a lot of discussion about vendor certification and how much hard work it takes to test a protocol implementation. The thing is, we already HAVE a "validation suite"; it's called "operational use in the ARPA Internet"! Furthermore, it has already revealed some serious problems in some very popular implementations, but the vendors have yet to fix the damn things (including the maker of the workstation I'm typing this on). I've seen several (object only) releases of software for this system come and go since RFC-896 came out, and they STILL don't have the Nagle algorithm. Given how popular this system is, it's no surprise that the Internet is in such trouble.

I think we should concentrate on fixing known problems before we invent new ones to solve.

Phil
van@LBL-CSAM.ARPA (Van Jacobson) (11/15/86)
RTTs are not a red herring. As the specs now stand, RTT and the associated TCP retransmit algorithm are the *only* congestion control for 99%+ of the Internet traffic. The Nagle algorithm is not congestion control; it increases line efficiency so you are less likely to need congestion control. This only postpones the day of reckoning.

We have on the order of 30,000 networked computers sitting behind the Internet backbone and the number is increasing exponentially. Long-haul services like NSFNet supercomputer access are going to increase the traffic those computers send across the backbone. There is a factor of 100 difference between the number of customers and the number of backbone circuits, and a factor of 200 impedance mismatch between the local nets and the backbone. With these numbers, congestion is guaranteed, even with everyone running every algorithm that John devises.

The problem could be avoided if there were some way to solve it in the gateways. RFC-970 proposes one such algorithm. I started to implement it but took some data that convinced me it wouldn't help. In fact, I couldn't see anything that would help short of improving the congestion algorithms in the endnodes. I saw a useful endnode change, implemented part of it, and it worked. But to work well it requires more topology information flowing between the gateways and the endnodes. This is undesirable, as is the thought of changing the TCP in all 1000 of our local computers. Thus it seemed worthwhile to continue the discussion in this forum.

If congestion problems can be solved in the gateways, we have quite a bit of time: we only need to do trial implementations and measurements to verify that the proposed algorithms work on real-world traffic. If the problems have to be solved in the endnodes, we have to implement and verify solutions now, then start leaning hard on vendors to adopt them. If we went to the vendors today, it would be at least a year until we could buy the fruits of our labor. With luck, John Nagle's algorithms, and DCA/NSF infusions of money into the backbone, we have a year to solve the next set of problems. But we still have to solve them. Telling vendors to hurry up and market things they should have had yesterday is very important. So is figuring out what to tell them tomorrow.
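For those who haven't read it: RFC-970 proposes that a gateway keep a separate queue per source host, serve the queues round-robin, and drop from the longest queue when buffers run out. The sketch below is my own rendering of the service discipline -- structures and names invented, drop policy omitted -- included only to make the reference concrete, since as I said our data convinced me it isn't sufficient by itself:

    #include <stddef.h>

    #define NQUEUE 256          /* one queue per (hashed) source host */

    struct packet {
            struct packet *next;
            unsigned long  src; /* IP source address */
    };

    struct pktq {
            struct packet *head, *tail;
            int count;
    };

    static struct pktq queue[NQUEUE];
    static int rr_next;         /* round-robin service pointer */

    /* A host's packets go on that host's own queue, so a flooding
     * host grows only its own queue and can't starve anyone else. */
    void gw_enqueue(struct packet *p)
    {
            struct pktq *q = &queue[p->src % NQUEUE];

            p->next = NULL;
            if (q->tail)
                    q->tail->next = p;
            else
                    q->head = p;
            q->tail = p;
            q->count++;
    }

    /* Serve the queues round-robin: every active source gets one
     * packet per sweep of the output line, whatever its offered load. */
    struct packet *gw_dequeue(void)
    {
            int i;

            for (i = 0; i < NQUEUE; i++) {
                    struct pktq *q = &queue[(rr_next + i) % NQUEUE];
                    if (q->head) {
                            struct packet *p = q->head;
                            q->head = p->next;
                            if (q->head == NULL)
                                    q->tail = NULL;
                            q->count--;
                            rr_next = (rr_next + i + 1) % NQUEUE;
                            return p;
                    }
            }
            return NULL;        /* output line goes idle */
    }

 - Van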
karn@FLASH.BELLCORE.COM (Phil R. Karn) (11/16/86)
Okay, you're right. RTTs are not a red herring, because yet ANOTHER unfixed bug in my Brand X workstation (and a widespread one on the net) is the incorrect computation of round trip time: measuring from the last transmission of a segment to the first ACK of its sequence number. Fix this one, throw in the Nagle algorithm, and set the initial RTT to a reasonable value like 5-10 seconds, and I think we'll be in good shape for quite some time.
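The reason this bug hurts so much: when a segment has been retransmitted, the first ACK may actually be answering the original transmission, so timing from the last transmission yields a spuriously low sample. The low sample shrinks the retransmit timer, which triggers more needless retransmissions, more bogus samples, and more load on an already congested net. The cheap fix is to take RTT samples only from segments that were sent exactly once. A sketch, with invented names:

    /*
     * An ACK that covers a retransmitted segment is ambiguous --
     * it may answer any of the transmissions -- so sample the RTT
     * only when the segment went out exactly once.
     */
    struct seg_state {
            double first_sent;  /* time of first transmission */
            int    rexmits;     /* retransmission count */
    };

    /* Returns the RTT sample, or -1 if the ACK is ambiguous. */
    double rtt_sample(struct seg_state *s, double now)
    {
            if (s->rexmits > 0)
                    return -1.0;  /* don't feed this sample to SRTT */
            return now - s->first_sent;
    }

Phil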