[comp.protocols.tcp-ip] Nagle algo. in Unix-TCP

rwolski@lll-lcc.UUCP (Richard Wolski) (08/29/89)

Greetings netlanders.
My question regards the Nagle algorithms for small-packet avoidance, which
have been implemented in the various flavors of UNIX now running around.  

A colleague of mine has written an RPC mechanism to run over TCP sockets
on UNIX systems and we are seeing some very strange performance numbers
for certain kinds of messaging.  An ethernet trace and several minutes with
the source code convinced us that TCP was delaying both data sends and
acknowledgements in an effort to avoid silly windows.  

Part of the problem comes from the RPC implementation, which makes two
send system calls for each request or reply (no flames please, we are dealing
with some serious history here), with the result that UNIX sends out the information in
two different packets.  The first goes out immediately (as the Nagle
algorithm prescribes) while the second is delayed until the first is acked.
Unfortunately, the ack is delayed as part of the receiver's half of the 
bargain and so we were seeing a whopping 10 packets per second.
Now for my question.  Is there any way to defeat the Nagle algorithms under
standard implementations?  I seem to recall that the Tahoe (or is it Tajo) 
release of 4.3 had such a defeat which was put in for X-11, but we don't 
have systems which are quite so modern.  Specifically, we are using SunOS
3.4, Ultrix 3.0 and UTS something-or-other with WIN-UTS.  

My boss's boss says that there are some versions of RPC which use TCP streams.
If they exist, what are they called?  Do they suffer this problem or do they
rely on the apparent fact that UNIX will attempt to put all of a send
system call in a single packet (which is not part of the spec to my knowledge)?
Thank you all in advance,

Rich
A tourist in Technical Disneyland

rwolski@lll-lcc.llnl.gov			Internet
(415)423-8594					Bell-net
Rich Wolski					Mail-net
President, Mark Boolootian ski-boat foundation (an inside joke)
P.O. Box 808, L-60
Livermore, CA.  94550

nagle@well.UUCP (John Nagle) (08/29/89)

In article <2581@lll-lcc.UUCP> rwolski@lll-lcc.UUCP (Richard Wolski) writes:
>
>My question regards the Nagle algorithms for small-packet avoidance, which
>have been implemented in the various flavors of UNIX now running around.  
>
>A colleague of mine has written an RPC mechanism to run over TCP sockets
>on UNIX systems and we are seeing some very strange performance numbers
>for certain kinds of messaging.  An ethernet trace and several minutes with
>the source code convinced us that TCP was delaying both data sends and
>acknowledgements in an effort to avoid silly windows.  
>
>Part of the problem comes from the RPC implementation, which makes two
>send system calls for each request or reply (no flames please, we are dealing 
>with some serious history here), with the result that UNIX sends out the information in
>two different packets.  The first goes out immediately (as the Nagle
>algorithm prescribes) while the second is delayed until the first is acked.
>Unfortunately, the ack is delayed as part of the receiver's half of the 
>bargain and so we were seeing a whopping 10 packets per second.
>Now for my question.  Is there any way to defeat the Nagle algorithms under
>standard implementations?  I seem to recall that the Tahoe (or is it Tajo) 
>release of 4.3 had such a defeat which was put in for X-11, but we don't 
>have systems which are quite so modern.  Specifically, we are using SunOS
>3.4, Ultrix 3.0 and UTS something-or-other with WIN-UTS.  

       What seems to have happened here is that several mechanisms in TCP
are interacting with a strange kind of application to produce poor performance.

       First, "silly window syndrome" is irrelevant here.  Silly window
syndrome occurs when the window is full most of the time, but here,
we have the tinygram problem, which occurs when the window is empty most
of the time.  It's a common misconception that the two problems are the
same.  They are not.  They are handled by separate code in the UNIX
implementation, incidentally.

       The problem here seems to be that tinygram elimination and
delayed ACKs, both performance improvements in TCP, are interacting
with this new RPC application.  Delayed ACKs
are something that first appeared in TOPS-20, which is a system that
runs TELNET in remote echo most of the time.  (So do most UNIX TELNET
implementations, not that they really need to.)  The idea there was
to make the bet that when a packet came in on a TCP connection, the
application would probably have something to reply with shortly.
Therefore, TCP was gimmicked to delay sending an ACK for about 100ms,
in hopes that the application layer would send something back and that
this would be piggybacked on the application layer's reply.  Note that
this is an assumption built into the transport layer about the behavior
of the application layer.  

        Delayed ACKs will cut traffic in half on slow TELNET operations,
and they don't bother FTP.  So they seemed like a big win at the time,
when there were few TCP applications beyond TELNET, FTP, and mail.

        TCP with delayed ACKs and tinygram elimination will support 
the following kinds of applications well.

	1) Big data pipes, like FTP.
	2) TELNET-type interaction.
	3) Request-reply type transaction protocols.

But here, we have someone who is trying something that has the property
that it does

	SEND
	SEND
	wait for reply.

This doesn't take turns the way a typical transaction protocol does, so
the guesses built into TCP are bad for this situation.  Thus, this sort of use
creates problems.  Of course, as the writer points out, doing multiple
tiny sends is a bad thing on general principles.  It's always better to
do one big send than lots of little ones, given that you're not waiting
anxiously for an answer.  Sending a 1-byte message takes 41 bytes across
the net (20 bytes of IP header, 20 of TCP header, and 1 of data), plus any
overhead at the link layer.  It's that 40:1 expansion
that led to the need for tinygram elimination.  Presumably this RPC
package is sending a bit more data at a time, so the expansion factor
may be smaller, but it may still be significant.

One solution might be to make whatever RPC package he's using go through
the standard UNIX I/O library (stdio) and flush the output stream just
before reading from the input stream or before reading from another
source.  This would improve the buffering situation.  This is really a
buffering problem, after all.
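
Roughly, and just as a sketch (the names here are illustrative, not from
the RPC package in question): wrap the socket in a pair of stdio streams,
let the small writes accumulate in the stdio buffer, and flush only when
a reply is about to be read.

#include <stdio.h>
#include <unistd.h>

/* Sketch only: separate stdio streams for reading and writing on one
   connected socket.  Small fwrite()s just fill the buffer; fflush()
   hands the kernel one large write before we block for the reply. */
static FILE *rpc_in, *rpc_out;

void rpc_wrap(int sock)
{
        rpc_in  = fdopen(sock, "r");
        rpc_out = fdopen(dup(sock), "w");   /* separate descriptor for writes */
}

void rpc_call(char *hdr, int hdrlen, char *body, int bodylen,
              char *reply, int replylen)
{
        fwrite(hdr, 1, hdrlen, rpc_out);
        fwrite(body, 1, bodylen, rpc_out);
        fflush(rpc_out);                    /* one big send to the socket */
        fread(reply, 1, replylen, rpc_in);  /* now wait for the answer */
}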

TCP is trying to protect the network from dumb applications.  We fixed it
back in 1985 so that when the application is dumb, the application suffers, 
not the network.  We have here a dumb application layer.

					John Nagle

jg@max.crl.dec.com (Jim Gettys) (09/07/89)

As noted, 4.3BSD has a socket option to defeat the delayed sending of
small packets, specifically put there at our request (X happened to be
the first application to run into the problem, during beta test of 4.3,
well before X11).

This has been picked up by essentially everyone.  Ultrix had the option
since V2 days.

The option (found in /usr/include/netinet/tcp.h) is TCP_NODELAY.

/*
 * User-settable options (used with setsockopt).
 */
#define TCP_NODELAY     0x01    /* don't delay send to coalesce packets */
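
For example, to turn it on for an already-connected socket (a minimal
sketch; the wrapper function name is mine, not part of any system header):

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Sketch: disable small-packet coalescing on a connected TCP socket.
   Returns 0 on success, -1 on failure (errno set by setsockopt). */
int set_nodelay(int sock)
{
        int on = 1;

        return setsockopt(sock, IPPROTO_TCP, TCP_NODELAY,
                          (char *) &on, sizeof (on));
}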

				- Jim Gettys

rwolski@lll-lcc.UUCP (Richard Wolski) (09/13/89)

Hello one last time.

Some time ago I posted a request for information about the Nagle algorithms
included in various flavors of Unix and got a rather surprising response from
John Nagle himself.  As I indicated in my previous posting, my questions
regarded an RPC system that a colleague of mine had written, which sent
each message in two "send" calls (instead of one, which we now know is the
accepted Unix method).  After returning from vacation and reading the
correspondence in which I had been engaged, he felt compelled to contribute
his thoughts, but unfortunately he does not have access to a machine that gets
a news feed.  I am, therefore, posting this reply on his behalf in as
objective a manner as possible.

Rich Wolski

P.S.
Please reply directly to John Fletcher as I don't relish the prospect of 
intercepting scores of nasty-grams.

-------------------------------------------------------------------------------

In Article 8533 of comp.protocols.tcp-ip, John Nagle writes:


>       What seems to have happened here is that several mechanisms in TCP
>are interacting with a strange kind of application to produce poor performance.
>
>       First, "silly window syndrome" is irrelevant here.  Silly window
>syndrome occurs when the window is full most of the time, but here,
>we have the tinygram problem, which occurs when the window is empty most
>of the time.  It's a common misconception that the two problems are the
>same.  They are not.  They are handled by separate code in the UNIX
>implementation, incidentally.

				.
				.
				.

>But here, we have someone who is trying something that has the property
>that it does
> 
>        SEND
>        SEND
>        wait for reply.
> 
>This doesn't take turns the way a typical transaction protocol does, so
>the guesses built into TCP are bad for this situation.

				.
				.
				.

>TCP is trying to protect the network from dumb applications.  We fixed it
>back in 1985 so that when the application is dumb, the application suffers,
>not the network.  We have here a dumb application layer.

John Fletcher (the author of a locally designed RPC package) replies:

My colleague Rich Wolski recently inquired over the net in regard to the
poor performance of TCP sockets when the typical behavior was alternately to
send two buffers and then to receive two similar buffers.  A rate of ten
buffer pairs per second was obtained.  As I understand it, the second buffer
in a pair was not actually sent by the interface until the first was 
acknowledged, and that acknowledgement was delayed.

The responses he received were not very helpful.  The most detailed one
suggested using library calls that would copy the two buffers into one, an
overhead that we are seeking to avoid.  It also described the behavior 
of the application as "dumb" and therefore presumably not deserving of
adequate support; this approach, if extended more broadly, could put us all out
of work.

On our own we found that the difficulty was avoided by using the "sendmsg" call
to send the two buffers in one call rather than using two "send" calls, one
for each buffer.  Later, one correspondent suggested the same.
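
For reference, this is roughly what the fix looks like; a sketch only,
with illustrative names, not the actual RPC source:

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <string.h>

/* Sketch: hand the header and the data to the kernel in one sendmsg()
   call, so both buffers can go out in a single segment. */
int send_both(int sock, char *hdr, int hdrlen, char *data, int datalen)
{
        struct iovec iov[2];
        struct msghdr msg;

        iov[0].iov_base = hdr;
        iov[0].iov_len  = hdrlen;
        iov[1].iov_base = data;
        iov[1].iov_len  = datalen;

        memset((char *) &msg, 0, sizeof (msg)); /* no address: connected socket */
        msg.msg_iov    = iov;
        msg.msg_iovlen = 2;

        return sendmsg(sock, &msg, 0);
}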

Wolski's more fundamental question remains unanswered:  What exactly is the
model of TCP socket communication that the applications programmer should
have in his mind, such that he can correctly determine the most appropriate
way to achieve his purposes?  Is there such a model, and is it machine-
independent so that portable yet efficient programs can be written?  It was
mainly our experience as systems programmers, who have implemented
communication systems of our own, that led us to suspect that "sendmsg"
_might_ consolidate multiple buffers into one packet and that one packet
_might_ fare better than more.  The typical applications programmer has likely 
never heard of packets; to him the choice of one "sendmsg" vs. many "sends"
is mainly one of programming convenience, which might well come down in
favor of the latter.  In fact, in our case the multiple "send" approach
yields the shortest, cleanest program.

My own suspicion is that there is no clear model, because in general I find
that such is characteristic of Unix.  I was raised as a physicist and therefore
always carry the hope that the rule of parsimony will apply: learn a few
universal rules, and derive all else from them.  I am almost prepared to
believe that Unix and its milieu are the exact opposite:  No matter how
many rules you learn, there is always one more, and it will affect the next
thing that you do.

John Fletcher
fletch@ocfmail.ocf.llnl.gov
P.O. Box 808, L-60
Livermore, CA. 94550

nagle@well.UUCP (John Nagle) (09/15/89)

      It's interesting that the sequence

	SEND
	SEND
	wait for reply

is almost the worst possible case here.  If the application did (rapidly)

	SEND
	SEND
	SEND
	SEND
	wait for reply

on a previously idle TCP connection,
the first send would take place immediately, and the remaining three would
be consolidated into one segment (assuming they are small) which would
be transmitted when the ACK was received for the first send.

      In a UNIX application, the usual way to handle this sort of thing
is to write to the socket via the standard I/O package, specifying
buffering, and to flush the buffer just before going into a wait state
for any reason.  If you're using SELECT(II) for all input, do a SELECT
poll (with a zero timeval structure) to determine if there is any input,
and if not, flush output buffers before doing a SELECT which will block.
block.  The effect of this is that the application accumulates pending
output until it needs some (any) input, then flushes the output.  This
should eliminate the round-trip waits while minimizing network traffic.
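
In outline, that pattern might look something like this; a sketch only,
assuming the connection is on descriptor "sock" with its buffered output
on the stdio stream "out" (both names are illustrative):

#include <stdio.h>
#include <sys/types.h>
#include <sys/time.h>
#include <unistd.h>

/* Sketch: flush buffered output only when we are about to block for input. */
void wait_for_input(int sock, FILE *out)
{
        fd_set rfds;
        struct timeval poll_tv;

        FD_ZERO(&rfds);
        FD_SET(sock, &rfds);
        poll_tv.tv_sec = 0;                 /* zero timeout: poll, don't block */
        poll_tv.tv_usec = 0;
        if (select(sock + 1, &rfds, (fd_set *) 0, (fd_set *) 0, &poll_tv) == 0)
                fflush(out);                /* nothing pending; push output now */

        FD_ZERO(&rfds);                     /* the poll may have cleared the set */
        FD_SET(sock, &rfds);
        (void) select(sock + 1, &rfds, (fd_set *) 0, (fd_set *) 0,
                      (struct timeval *) 0);  /* now block until input arrives */
}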

      Bypassing the tinygram prevention mechanism is not a good idea if
the software is ever to be used in a wider world than a single LAN.
The impact on gateways is well known, and filling up the gateways of
others with tinygrams is considered antisocial.

					John Nagle

pprindev@wellfleet.com (Philip Prindeville) (09/17/89)

> On our own we found that the difficulty was avoided by using the "sendmsg" call
> to send the two buffers in one call rather than using two "send" calls, one
> for each buffer.  Later, one correspondent suggested the same.

You could also use writev() for doing scatter/gather writes.

> _might_ fare better than more.  The typical applications programmer has likely 
> never heard of packets; to him the choice of one "sendmsg" vs. many "sends"
> is mainly one of programming convenience, which might well come down in
> favor of the latter.  In fact, in our case the multiple "send" approach
> yields the shortest, cleanest program.

Perhaps.  But he has heard of "transactions".  If one were to assume
that one transaction => one write, then The Correct Thing would be
done by the operating system.  (I believe that UNIX just gathers all
the iov descriptors together into an MBUF chain, thus avoiding the
dreaded copy.)

If this were a disk-based application, you would want transactions
to occur atomically, and hence would use writev()s.  This way, if
the program were interrupted or the system crashed, you would have
the best chances of the operation completing.  (As opposed to taking
a signal between the first and second write.)  So "packets" are not
conceptually different from "records".

> My own suspicion is that there is no clear model, because in general I find
> that such is characteristic of Unix.  I was raised as a physicist and therefore
> always carry the hope that the rule of parsimony will apply;  Learn a few
> universal rules, and derive all else from them.  I am almost prepared to

Unix is based on such a philosophy -- consistency is its watchword [kiss].
Perhaps you were weaned on a more "sophisticated" system, and hence have
certain expectations...

-Philip

nagle@well.UUCP (John Nagle) (09/18/89)

In article <8909162241.AA03169@tien.Wellfleet.Com> pprindev@wellfleet.com (Philip Prindeville) writes:
>You could also use writev() for doing scatter/gather writes.
>...
>If this were a disk-based application, you would want transactions
>to occur atomically, and hence would use writev()s.  This way, if
>the program were interrupted or the system crashed, you would have
>the best chances of the operation completing.  (As opposed to taking
>a signal between the first and second write.)  So "packets" are not
>conceptually different from "records".

      First, bear in mind that TCP really is a stream protocol.  You're
guaranteed that the bytes you write get to the other end in the same
order, but you are definitely not guaranteed that a unit of data
written with one write shows up as a single unit at the receive end.
If you need atomic transactions, a suitable protocol must be constructed
for that purpose, either on top of TCP or in some other way, such as
on top of UDP.  Sun's NFS, for example, is implemented on top of UDP.
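
As a sketch of what such framing might look like on top of TCP (illustrative
only): prefix each record with its length in network byte order, and make
the receiver loop, because a single receive may return only part of a record.

#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

/* Sketch: receive exactly len bytes, looping as needed. */
static int recv_all(int sock, char *buf, int len)
{
        int got, n;

        for (got = 0; got < len; got += n) {
                n = recv(sock, buf + got, len - got, 0);
                if (n <= 0)
                        return -1;          /* error or connection closed */
        }
        return 0;
}

/* Sketch: read one length-prefixed record; the prefix is assumed to be
   a 4-byte unsigned count in network byte order. */
int recv_record(int sock, char *buf, int maxlen)
{
        unsigned int netlen;                /* assumed to be 32 bits here */
        int len;

        if (recv_all(sock, (char *) &netlen, sizeof (netlen)) < 0)
                return -1;
        len = ntohl(netlen);
        if (len < 0 || len > maxlen)
                return -1;                  /* record won't fit caller's buffer */
        if (recv_all(sock, buf, len) < 0)
                return -1;
        return len;
}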

      Second, given that we were discussing writes of small amounts of
data, copying cost isn't a big issue.  So "writev" isn't particularly
useful.

      I'd still recommend going through the standard I/O package and
doing "flush" operations at the appropriate time as the most effective
way of dealing with the problems described here.  Attempts to "increase
efficiency" by bypassing the standard I/O package generally make things
worse, rather than better, unless very large blocks are involved.

					John Nagle