jbn@GLACIER.STANFORD.EDU (John B. Nagle) (11/02/86)
In general, it is better to overestimate the round-trip time and wait a bit longer than to underestimate it and cause congestion. Very short initial RTT estimates have caused considerable trouble in the past; at various times, TOPS-20, 4.2BSD, and Symbolics TCP implementations have had unreasonably short initial guesses. And of course, if you consistently use an RTT estimate smaller than the actual RTT, you will fill up the link with multiple copies of the packet and cause congestion collapse. So think big. I would argue for 5 seconds as a good first guess.
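For concreteness, here is a minimal sketch of the classic smoothed-RTT estimator described in RFC 793, seeded with the conservative 5-second first guess argued for above. The constants are the values RFC 793 suggests; the variable names and the sample data are mine, purely for illustration.

#include <stdio.h>

#define ALPHA  0.9    /* smoothing gain; RFC 793 suggests 0.8 to 0.9        */
#define BETA   2.0    /* delay variance factor; RFC 793 suggests 1.3 to 2.0 */
#define UBOUND 60.0   /* upper bound on the timeout, seconds                */
#define LBOUND 1.0    /* lower bound on the timeout, seconds                */

static double srtt = 5.0;    /* smoothed RTT: start big, as argued above */

/* Fold one round-trip measurement (in seconds) into the smoothed
   estimate and return the retransmission timeout for the next segment:
   SRTT = ALPHA*SRTT + (1-ALPHA)*RTT; RTO = clamp(BETA*SRTT).           */
double update_rto(double measured_rtt)
{
    srtt = ALPHA * srtt + (1.0 - ALPHA) * measured_rtt;
    double rto = BETA * srtt;
    if (rto > UBOUND) rto = UBOUND;
    if (rto < LBOUND) rto = LBOUND;
    return rto;
}

int main(void)
{
    /* A path whose true RTT is about 2 seconds: the timeout converges
       downward from the safe initial guess, instead of starting low
       and flooding the link with spurious retransmissions.            */
    double samples[] = { 2.1, 1.9, 2.0, 2.2, 1.8 };
    for (int i = 0; i < 5; i++)
        printf("sample %.1fs -> rto %.2fs\n", samples[i], update_rto(samples[i]));
    return 0;
}

Note that the estimate can converge downward only as fast as ALPHA allows. That is the price of starting safe, and it is a price worth paying.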
Unfortunately, there are problems that lead to a strong desire to use a shorter interval. Most of them reflect bugs or weak design in the systems involved, but they are nonetheless real. Here are a few.

There are some implementations that lose the first packet at the link level due to a simple-minded implementation of ARP. If both source and destination have this problem, the first two packets may be lost. If source and destination are both on LANs interconnected by gateways, everybody uses ARP, and nobody has the relevant entries cached yet, it may be necessary to send a TCP SYN packet FIVE (5) times before a reply makes it back to the source host: each cache miss eats the packet that triggered the ARP request, so misses on the forward path cost a SYN apiece, and each miss on the reply's return path costs yet another SYN retransmission. And this is in the absence of any packet loss for other reasons. (A sketch of the usual fix appears at the end of this message.)

There are some TCP implementations that lose the first SYN packet received; the first SYN triggers the firing off of a server task, but the server task doesn't get the SYN packet and has to wait for a duplicate of it. Again, this is a bad TCP implementation, but such exist.

And then, of course, there is packet loss through congestion, about which I have written before. The previous comment about an observed 90% packet loss rate makes it clear that the problems have become more severe in the past few months. Loss rates like that can only come from vast overloads caused by badly-behaved implementations.

The combination of all these problems does indeed tempt one to use a short RTT in hopes of improving one's own performance at the expense of everybody else. But resist the temptation. It won't fix the problem, which is elsewhere, and it will make the congestion situation worse.

Incidentally, I see nothing wrong with loading up a network to 100% link utilization. If all the proper strategies are in place, this should work just fine. We did this routinely at Ford Aerospace, with file transfers running in the background sopping up all the idle line time while TELNET sessions continued to receive adequate service. It's no worse than running a computer with a mixed batch/time-sharing load at 100% CPU utilization. (UNIX users may feel otherwise, but UNIX has traditionally had a weak CPU dispatcher, being designed for a pure time-sharing load.) The problem is not legitimate load; it's junk traffic due to bad implementations. We know this because if everybody did it right, the net would slow down more or less linearly with load, instead of going into a state of semi-collapse.

If some node is dropping 90% of its packets, somebody should be examining the dropped packets to find out who is sending them. The party responsible should be spoken to. Or disconnected. A little logging, some analysis, and some hard-nosed management can cure this problem.

For most implementations, the fixes exist. They usually just need to be installed. There are still a lot of stock 4.2BSD systems out there blithering away, especially from vendors that don't track Berkeley too closely. As I pointed out in RFC970 (which will appear in IEEE Trans. on Data Communications early next year, by the way), even a simple scheduling algorithm in the gateways should alleviate this problem and prevent one bad host from effectively bringing the net down.
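To make that concrete, here is a minimal sketch of one such discipline in the spirit of RFC970: one queue per source host, served round-robin. The fixed-size tables, the names, and the drop policy (drop the sender's own overflow) are illustrative simplifications of mine, not the RFC text; the point is that a flooding host can fill only its own queue while everyone else's packets keep moving.

#include <stdio.h>

#define MAX_HOSTS 8
#define QUEUE_LEN 16

struct host_queue {
    unsigned src;                 /* source host address; 0 = unused slot */
    int head, count;
    const char *pkts[QUEUE_LEN];  /* FIFO of this host's packets */
};

static struct host_queue q[MAX_HOSTS];
static int rr_next;               /* round-robin service pointer */

/* Enqueue on the sender's own queue; when a flooding host overflows
   its queue, only its own packets get dropped.                       */
void gw_enqueue(unsigned src, const char *pkt)
{
    struct host_queue *h = NULL, *spare = NULL;
    for (int i = 0; i < MAX_HOSTS; i++) {
        if (q[i].src == src) { h = &q[i]; break; }
        if (q[i].src == 0 && spare == NULL) spare = &q[i];
    }
    if (h == NULL) h = spare;
    if (h == NULL || h->count == QUEUE_LEN) return;   /* full: drop */
    h->src = src;
    h->pkts[(h->head + h->count) % QUEUE_LEN] = pkt;
    h->count++;
}

/* Serve the queues in rotation: every host with traffic pending gets
   one packet per turn, no matter how much it has queued.             */
const char *gw_dequeue(void)
{
    for (int i = 0; i < MAX_HOSTS; i++) {
        struct host_queue *h = &q[(rr_next + i) % MAX_HOSTS];
        if (h->count > 0) {
            rr_next = (rr_next + i + 1) % MAX_HOSTS;
            const char *p = h->pkts[h->head];
            h->head = (h->head + 1) % QUEUE_LEN;
            h->count--;
            return p;
        }
    }
    return NULL;    /* all queues empty */
}

int main(void)
{
    for (int i = 0; i < 4; i++)
        gw_enqueue(1, "bulk packet from badly-behaved host 1");
    gw_enqueue(2, "TELNET keystroke from host 2");
    const char *p;
    while ((p = gw_dequeue()) != NULL)
        printf("%s\n", p);   /* host 2's packet goes out second, not fifth */
    return 0;
}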
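Finally, here is a sketch of the usual fix for the ARP problem described earlier: on a cache miss, hold the datagram that triggered the ARP request and transmit it when the reply arrives, instead of dropping it. The toy one-entry cache, the helper names, and the one-deep hold queue are mine, purely for illustration.

#include <stdio.h>
#include <stdbool.h>

struct packet { const char *payload; };

/* Toy one-entry ARP cache and link layer, just enough to run the sketch. */
static unsigned cached_ip;
static bool     cache_valid;

static bool arp_cache_lookup(unsigned ip) { return cache_valid && cached_ip == ip; }
static void arp_send_request(unsigned ip) { printf("ARP: who-has host %u?\n", ip); }
static void link_transmit(struct packet *p) { printf("sent: %s\n", p->payload); }

static struct packet *arp_pending;      /* one-deep hold queue */
static unsigned       arp_pending_ip;

void ip_output(struct packet *p, unsigned next_hop)
{
    if (arp_cache_lookup(next_hop)) {
        link_transmit(p);
        return;
    }
    arp_send_request(next_hop);
    arp_pending    = p;        /* right: hold the packet for the reply */
    arp_pending_ip = next_hop;
    /* The simple-minded implementation discards p here instead, so the
       first packet to any host not yet in the cache is always lost.   */
}

void arp_reply_received(unsigned ip)
{
    cached_ip   = ip;
    cache_valid = true;
    if (arp_pending != NULL && arp_pending_ip == ip) {
        link_transmit(arp_pending);     /* flush the held packet */
        arp_pending = NULL;
    }
}

int main(void)
{
    struct packet syn = { "TCP SYN" };
    ip_output(&syn, 42);       /* cache miss: request goes out, SYN held */
    arp_reply_received(42);    /* reply arrives: the held SYN is sent    */
    return 0;
}

Good luck, everybody.

John Nagle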