rwolski@lll-lcc.UUCP (Richard Wolski) (08/29/89)
Greetings netlanders. My question regards the Nagle algorithms for small-packet
avoidance, which have been implemented in the various flavors of UNIX now
running around. A colleague of mine has written an RPC mechanism to run over
TCP sockets on UNIX systems, and we are seeing some very strange performance
numbers for certain kinds of messaging. An ethernet trace and several minutes
with the source code convinced us that TCP was delaying both data sends and
acknowledgements in an effort to avoid silly windows.

Part of the problem comes from the RPC implementation, which makes two send
system calls for each request or reply (no flames please, we are dealing with
some serious history here), with the result that UNIX sends out the
information in two different packets. The first goes out immediately (as the
Nagle algorithm prescribes) while the second is delayed until the first is
acked. Unfortunately, the ack is delayed as part of the receiver's half of the
bargain, and so we were seeing a whopping 10 packets per second.

Now for my question. Is there any way to defeat the Nagle algorithms under
standard implementations? I seem to recall that the Tahoe (or is it Tajo)
release of 4.3 had such a defeat which was put in for X-11, but we don't have
systems which are quite so modern. Specifically, we are using SunOS 3.4,
Ultrix 3.0 and UTS something-or-other with WIN-UTS.

My boss's boss says that there are some versions of RPC which use TCP streams.
If they exist, what are they called? Do they suffer this problem, or do they
rely on the apparent fact that UNIX will attempt to put all of a send system
call in a single packet (which is not part of the spec to my knowledge)?

Thank you all in advance,

Rich
A tourist in Technical Disneyland

rwolski@lll-lcc.llnl.gov      Internet
(415) 423-8594                Bell-net
Rich Wolski                   Mail-net
President, Mark Boolootian ski-boat foundation (an inside joke)
P.O. Box 808, L-60
Livermore, CA  94550
nagle@well.UUCP (John Nagle) (08/29/89)
In article <2581@lll-lcc.UUCP> rwolski@lll-lcc.UUCP (Richard Wolski) writes:
>
>My question regards the Nagle algorithms for small-packet avoidance, which
>have been implemented in the various flavors of UNIX now running around.
>
>A colleague of mine has written an RPC mechanism to run over TCP sockets
>on UNIX systems and we are seeing some very strange performance numbers
>for certain kinds of messaging. An ethernet trace and several minutes with
>the source code convinced us that TCP was delaying both data sends and
>acknowledgements in an effort to avoid silly windows.
>
>Part of the problem comes from the RPC implementation which makes two
>send system calls for each request or reply (no flames please we are dealing
>with some serious history here), with the result that UNIX sends out the
>information in two different packets. The first goes out immediately (as the
>Nagle algorithm prescribes) while the second is delayed until the first is
>acked. Unfortunately, the ack is delayed as part of the receiver's half of
>the bargain and so we were seeing a whopping 10 packets per second.
>
>Now for my question. Is there any way to defeat the Nagle algorithms under
>standard implementations? I seem to recall that the Tahoe (or is it Tajo)
>release of 4.3 had such a defeat which was put in for X-11, but we don't
>have systems which are quite so modern. Specifically, we are using SunOS
>3.4, Ultrix 3.0 and UTS something-or-other with WIN-UTS.

What seems to have happened here is that several mechanisms in TCP are
interacting with a strange kind of application to produce poor performance.

First, "silly window syndrome" is irrelevant here. Silly window syndrome
occurs when the window is full most of the time, but here, we have the
tinygram problem, which occurs when the window is empty most of the time.
It's a common misconception that the two problems are the same. They are
not. They are handled by separate code in the UNIX implementation,
incidentally.

The problem here seems to be that tinygram elimination and delayed ACKs,
both performance improvements in TCP, are interacting with this new RPC
application.

Delayed ACKs are something that first appeared in TOPS-20, which is a system
that runs TELNET in remote echo most of the time. (So do most UNIX TELNET
implementations, not that they really need to.) The idea there was to make
the bet that when a packet came in on a TCP connection, the application would
probably have something to reply with shortly. Therefore, TCP was gimmicked
to delay sending an ACK for about 100ms, in hopes that the application layer
would send something back and that the ACK could be piggybacked on the
application layer's reply. Note that this is an assumption built into the
transport layer about the behavior of the application layer. Delayed ACKs
will cut traffic in half on slow TELNET operations, and they don't bother
FTP. So they seemed like a big win at the time, when there were few TCP
applications beyond TELNET, FTP, and mail.

TCP with delayed ACKs and tinygram elimination will support the following
kinds of applications well:

	1) Big data pipes, like FTP.
	2) TELNET-type interaction.
	3) Request-reply type transaction protocols.

But here, we have someone who is trying something that has the property
that it does

	SEND
	SEND
	wait for reply.

This doesn't take turns the way a typical transaction protocol does, so the
guesses built into TCP are bad for this situation. Thus, this sort of use
creates problems.
Of course, as the writer points out, doing multiple tiny sends is a bad thing
on general principles. It's always better to do one big send than lots of
little ones, given that you're not waiting anxiously for an answer. Sending a
1-byte message takes 41 bytes across the net, plus any overhead at the link
layer. It's that 40:1 expansion that led to the need for tinygram elimination.
Presumably this RPC package is sending a bit more data at a time, so the
expansion factor may be smaller, but it may still be significant.

One solution might be to make whatever RPC package he's using go through the
standard UNIX I/O library (stdio) and flush the output stream just before
reading from the input stream or before reading from another source. This
would improve the buffering situation. (A rough sketch follows below.) This
is really a buffering problem, after all.

TCP is trying to protect the network from dumb applications. We fixed it back
in 1985 so that when the application is dumb, the application suffers, not
the network. We have here a dumb application layer.

					John Nagle
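A minimal sketch of that stdio approach (untested; the stream is assumed to
have been made once with fdopen() on the connected socket, and the buffer
names are just placeholders):

	#include <stdio.h>

	/*
	 * Sketch: buffer the two logical sends in stdio and flush once,
	 * just before blocking for the reply.  "out" is a stream created
	 * elsewhere with fdopen() on the connected TCP socket; "hdr" and
	 * "body" stand in for the RPC header and data buffers.
	 */
	void rpc_send(FILE *out, char *hdr, int hlen, char *body, int blen)
	{
	    fwrite(hdr,  1, hlen, out);  /* first logical send: fills the stdio buffer */
	    fwrite(body, 1, blen, out);  /* second logical send: same buffer           */
	    fflush(out);                 /* one write() -> one segment, done just
	                                    before waiting for the reply              */
	}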
jg@max.crl.dec.com (Jim Gettys) (09/07/89)
As noted, 4.3BSD has a socket option to defeat the small-packet send delay,
specifically put there at our request (X happened to be the first application
to run into the problem, during beta test of 4.3, back well before X11). This
has been picked up by essentially everyone. Ultrix has had the option since
V2 days. The option (found in /usr/include/netinet/tcp.h) is TCP_NODELAY:

	/*
	 * User-settable options (used with setsockopt).
	 */
	#define	TCP_NODELAY	0x01	/* don't delay send to coalesce packets */

					- Jim Gettys
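For anyone who decides the option is appropriate, turning it on looks roughly
like this (an untested sketch; "sock" is assumed to be an already-connected
TCP socket):

	#include <sys/types.h>
	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <netinet/tcp.h>

	/* Sketch: disable the small-packet send delay on an existing socket. */
	int set_nodelay(int sock)
	{
	    int on = 1;
	    /* returns -1 (and sets errno) on failure */
	    return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY,
	                      (char *)&on, sizeof(on));
	}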
rwolski@lll-lcc.UUCP (Richard Wolski) (09/13/89)
Hello one last time. Some time ago I posted a request for information about
the Nagle algorithms included in various flavors of Unix and got a rather
surprising response from John Nagle himself. As I indicated in my previous
posting, my questions regarded an RPC system that a colleague of mine had
written which sent each message in two "send" calls (instead of one, which we
now know is the accepted Unix method). After returning from vacation and
reading the various correspondence in which I had been engaging, he felt
compelled to contribute his thoughts, but unfortunately does not have access
to a machine which gets a news feed. I am, therefore, posting this reply on
his behalf in as objective a manner as possible.

Rich Wolski

P.S. Please reply directly to John Fletcher as I don't relish the prospect of
intercepting scores of nasty-grams.

-------------------------------------------------------------------------------

In Article 8533 of comp.protocols.tcp-ip, John Nagle writes:

> What seems to have happened here is that several mechanisms in TCP
>are interacting with a strange kind of application to produce poor performance.
>
> First, "silly window syndrome" is irrelevant here. Silly window
>syndrome occurs when the window is full most of the time, but here,
>we have the tinygram problem, which occurs when the window is empty most
>of the time. It's a common misconception that the two problems are the
>same. They are not. They are handled by separate code in the UNIX
>implementation, incidentally.
 . . .
>But here, we have someone who is trying something that has the property
>that it does
>
>	SEND
>	SEND
>	wait for reply.
>
>This doesn't take turns the way a typical transaction protocol does, so
>the guesses built into TCP are bad for this situation.
 . . .
>TCP is trying to protect the network from dumb applications. We fixed it
>back in 1985 so that when the application is dumb, the application suffers,
>not the network. We have here a dumb application layer.

John Fletcher (the author of a locally designed RPC package) replies:

My colleague Rich Wolski recently inquired over the net in regard to the poor
performance of TCP sockets when the typical behavior was alternately to send
two buffers and then to receive two similar buffers. A rate of ten buffer
pairs per second was obtained. As I understand it, the second buffer in a
pair was not actually sent by the interface until the first was acknowledged,
and that acknowledgement was delayed.

The responses he received were not very helpful. The most detailed one
suggested using library calls that would copy the two buffers into one, an
overhead that we are seeking to avoid. It also described the behavior of the
application as "dumb" and therefore presumably not deserving of adequate
support; this approach, if extended more broadly, could put us all out of
work.

On our own we found that the difficulty was avoided by using the "sendmsg"
call to send the two buffers in one call rather than using two "send" calls,
one for each buffer. Later, one correspondent suggested the same. (A rough
sketch appears at the end of this message.)

Wolski's more fundamental question remains unanswered: What exactly is the
model of TCP socket communication that the applications programmer should
have in his mind, such that he can correctly determine the most appropriate
way to achieve his purposes? Is there such a model, and is it
machine-independent so that portable yet efficient programs can be written?
It was mainly our experience as systems programmers, who have implemented
communication systems of our own, that led us to suspect that "sendmsg"
_might_ consolidate multiple buffers into one packet and that one packet
_might_ fare better than more. The typical applications programmer has likely
never heard of packets; to him the choice of one "sendmsg" vs. many "sends"
is mainly one of programming convenience, which might well come down in favor
of the latter. In fact, in our case the multiple "send" approach yields the
shortest, cleanest program.

My own suspicion is that there is no clear model, because in general I find
that such is characteristic of Unix. I was raised as a physicist and
therefore always carry the hope that the rule of parsimony will apply: learn
a few universal rules, and derive all else from them. I am almost prepared to
believe that Unix and its milieu are the exact opposite: no matter how many
rules you learn, there is always one more, and it will affect the next thing
that you do.

John Fletcher
fletch@ocfmail.ocf.llnl.gov
P.O. Box 808, L-60
Livermore, CA  94550
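For concreteness, the sendmsg() approach mentioned above amounts to roughly
the following (an untested sketch; the names and the two-buffer layout are
placeholders, not the actual package):

	#include <sys/types.h>
	#include <sys/socket.h>
	#include <sys/uio.h>
	#include <string.h>

	/*
	 * Sketch: hand both buffers to the kernel in a single call so that
	 * they can leave the machine as a single (small) segment.
	 */
	int send_pair(int sock, char *hdr, int hlen, char *body, int blen)
	{
	    struct iovec  iov[2];
	    struct msghdr msg;

	    iov[0].iov_base = hdr;   iov[0].iov_len = hlen;
	    iov[1].iov_base = body;  iov[1].iov_len = blen;

	    memset((char *)&msg, 0, sizeof(msg));  /* connected socket: no address needed */
	    msg.msg_iov    = iov;
	    msg.msg_iovlen = 2;

	    return sendmsg(sock, &msg, 0);         /* one system call, one write */
	}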
nagle@well.UUCP (John Nagle) (09/15/89)
It's interesting that the sequence

	SEND
	SEND
	wait for reply

is almost the worst possible case here. If the application did (rapidly)

	SEND
	SEND
	SEND
	SEND
	wait for reply

on a previously idle TCP connection, the first send would take place
immediately, and the remaining three would be consolidated into one segment
(assuming they are small) which would be transmitted when the ACK was
received for the first send.

In a UNIX application, the usual way to handle this sort of thing is to write
to the socket via the standard I/O package, specifying buffering, and to
flush the buffer just before going into a wait state for any reason. If
you're using SELECT(II) for all input, do a SELECT poll (with a zero timeval
structure) to determine if there is any input, and if not, flush output
buffers before then doing a SELECT which will block. The effect of this is
that the application accumulates pending output until it needs some (any)
input, then flushes the output. This should eliminate the round-trip waits
while minimizing network traffic. (A rough sketch follows at the end of this
message.)

Bypassing the tinygram prevention mechanism is not a good idea if the
software is ever to be used in a wider world than a single LAN. The impact on
gateways is well known, and filling up the gateways of others with tinygrams
is considered antisocial.

					John Nagle
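Roughly, and untested (the descriptor and stream names are placeholders),
the poll-then-flush idea looks like this:

	#include <stdio.h>
	#include <sys/types.h>
	#include <sys/time.h>

	/*
	 * Sketch: poll for input with a zero timeout; if nothing is pending,
	 * flush buffered output, then block in select() until input arrives.
	 * "sock" is the socket descriptor, "out" the stdio stream wrapped
	 * around it.
	 */
	void flush_then_wait(int sock, FILE *out)
	{
	    fd_set rfds;
	    struct timeval zero;

	    FD_ZERO(&rfds);
	    FD_SET(sock, &rfds);
	    zero.tv_sec = zero.tv_usec = 0;                 /* zero timeout: just a poll */

	    if (select(sock + 1, &rfds, (fd_set *)0, (fd_set *)0, &zero) == 0)
	        fflush(out);                                /* no input waiting: push output now */

	    FD_ZERO(&rfds);
	    FD_SET(sock, &rfds);
	    select(sock + 1, &rfds, (fd_set *)0, (fd_set *)0,
	           (struct timeval *)0);                    /* now block until input arrives */
	}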
pprindev@wellfleet.com (Philip Prindeville) (09/17/89)
> On our own we found that the difficulty was avoided by using the "sendmsg" call
> to send the two buffers in one call rather than using two "send" calls, one
> for each buffer. Later, one correspondent suggested the same.

You could also use writev() for doing scatter/gather writes; a rough sketch
appears at the end of this message.

> _might_ fare better than more. The typical applications programmer has likely
> never heard of packets; to him the choice of one "sendmsg" vs. many "sends"
> is mainly one of programming convenience, which might well come down in
> favor of the latter. In fact, in our case the multiple "send" approach
> yields the shortest, cleanest program.

Perhaps. But he has heard of "transactions". If one were to assume that one
transaction => one write, then The Correct Thing would be done by the
operating system. (I believe that UNIX just gathers all the iov descriptors
together into an mbuf chain, thus avoiding the dreaded copy.)

If this were a disk-based application, you would want transactions to occur
atomically, and hence would use writev()s. This way, if the program were
interrupted or the system crashed, you would have the best chances of the
operation completing. (As opposed to taking a signal between the first and
second write.) So "packets" are not conceptually different from "records".

> My own suspicion is that there is no clear model, because in general I find
> that such is characteristic of Unix. I was raised as a physicist and therefore
> always carry the hope that the rule of parsimony will apply: learn a few
> universal rules, and derive all else from them. I am almost prepared to

Unix is based on such a philosophy -- consistency is its watchword [kiss].
Perhaps you were weaned on a more "sophisticated" system, and hence have
certain expectations...

					-Philip
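A sketch of the writev() variant, for comparison (untested; the two-buffer
layout is a placeholder):

	#include <sys/types.h>
	#include <sys/uio.h>

	/* Sketch: same gather idea as sendmsg(), via the plain writev() call. */
	int write_pair(int sock, char *hdr, int hlen, char *body, int blen)
	{
	    struct iovec iov[2];

	    iov[0].iov_base = hdr;   iov[0].iov_len = hlen;
	    iov[1].iov_base = body;  iov[1].iov_len = blen;

	    return writev(sock, iov, 2);   /* both buffers handed over in one call */
	}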
nagle@well.UUCP (John Nagle) (09/18/89)
In article <8909162241.AA03169@tien.Wellfleet.Com> pprindev@wellfleet.com
(Philip Prindeville) writes:
>You could also use writev() for doing scatter/gather writes.
>...
>If this were a disk-based application, you would want transactions
>to occur atomically, and hence would use writev()s. This way, if
>the program were interrupted or the system crashed, you would have
>the best chances of the operation completing. (As opposed to taking
>a signal between the first and second write.) So "packets" are not
>conceptually different from "records".

First, bear in mind that TCP really is a stream protocol. You're guaranteed
that the bytes you write get to the other end in the same order, but you are
definitely not guaranteed that a unit of data written with one write shows up
as a single unit at the receive end. If you need atomic transactions, a
suitable protocol must be constructed for that purpose, either on top of TCP
or in some other way, such as on top of UDP. Sun's NFS, for example, is
implemented on top of UDP.

Second, given that we were discussing writes of small amounts of data,
copying cost isn't a big issue. So "writev" isn't particularly useful.

I'd still recommend going through the standard I/O package and doing "flush"
operations at the appropriate time as the most effective way of dealing with
the problems described here. Attempts to "increase efficiency" by bypassing
the standard I/O package generally make things worse, rather than better,
unless very large blocks are involved.

					John Nagle