[mod.protocols.tcp-ip] TCP on SUN computers.

robert@SRI-SPAM.ARPA.UUCP (06/06/86)

Hello there,

	I have a question regarding the implementation of TCP
used on SUN computers.  Specifically, the question concerns
version 2.2 of SUN UNIX, but it will no doubt extend to their later
(and earlier) releases.

	My question follows:

	I have had occasion to use non-blocking sockets (TCP
connections) as a link across the Internet between two or more SUNs.
Empirically, I have discovered a limit of 2048 bytes that can be
written in a single non-blocking write.  Anything more than that and
the error "EMSGSIZE" is returned, which suggests that the internal
buffer is only 2048 bytes.  Note that if I use blocking sockets, the
size is unlimited.  There is also a limit of 4096 bytes total which
can be held "in the pipe" before the receiving side of the socket
must do a read to clear the internal buffers.
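
	For concreteness, the pattern I'm using is roughly the
following (a simplified sketch: the peer address and port are
placeholders, error handling is minimal, and on 4.2-era kernels the
non-blocking flag is FNDELAY/FIONBIO rather than the POSIX O_NONBLOCK):

    /*
     * Sketch: a single large non-blocking write on a connected TCP
     * socket, which is the operation that fails with EMSGSIZE for me.
     */
    #include <stdio.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int main()
    {
        struct sockaddr_in peer;
        char buf[8192];
        int s, n;

        s = socket(AF_INET, SOCK_STREAM, 0);

        memset(&peer, 0, sizeof(peer));
        peer.sin_family = AF_INET;
        peer.sin_port = htons(5000);                  /* placeholder port */
        peer.sin_addr.s_addr = inet_addr("10.0.0.2"); /* placeholder host */
        connect(s, (struct sockaddr *)&peer, sizeof(peer));

        fcntl(s, F_SETFL, O_NONBLOCK);  /* mark the socket non-blocking */

        memset(buf, 'x', sizeof(buf));
        n = write(s, buf, sizeof(buf)); /* anything over 2048 bytes fails */
        if (n < 0)
            perror("write");            /* EMSGSIZE in my case */
        else
            printf("wrote %d of %d bytes\n", n, (int)sizeof(buf));

        close(s);
        return 0;
    }
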
	In a client/server relationship, the server will read the
bytes which the client writes into the socket.  I've noticed that
with non-blocking sockets, when more than 2K bytes have been written,
successive calls to recv() or read() by the server return 2K-byte
chunks.  This leads me to believe that the SUN implementation can
only handle a maximum of 2K bytes per transfer, and only two of these
before a read must be done.  I've talked with SUN tech support and
they tell me that this 'seems to be' the case, although without
talking directly to an engineer I could not get a good idea of why.
	Finally, my question is this.  Why is there a 2K limit, and
can it be changed, perhaps with a kernel reconfiguration?  It seems
to me that putting such a limit on the sockets is a poor implementation
for that layer, which should not restrict data size.  Shouldn't
such 'packetizing' be done at a lower layer than what sockets are
the entry point to?  Or are sockets actually a lower layer interface
than I think?

	Any comments, pointers, or commiserations would be greatly
appreciated.

					Robert Allen

					(415) 859-2143,

					robert@sri-spam.ARPA

root@BU-CS.BU.EDU.UUCP (06/06/86)

There are two questions in your query:

1.> ...Finally, my question is this.  Why is there a 2K limit, and
>can it be changed, perhaps with a kernel reconfiguration?

It almost certainly can be changed given sources; perhaps it can be
changed otherwise, or perhaps that is a suggestion that might be
reasonable for binary-only sites.  Perhaps for performance reasons it
doesn't make sense to change it at all (what is a better number, and
why?  Simply that it would make it easier for you to ignore the fact
that you are doing non-blocking I/O is not a great reason.)

2.> It seems
>to me that putting such a limit on the sockets is a poor implementation
>for that layer, which should not restrict data size.  Shouldn't
>such 'packetizing' be done at a lower layer than what sockets are
>the entry point to?  Or are sockets actually a lower layer interface
>than I think?

This is a different question.  Most importantly, you state that this
occurs only with non-blocking I/O.

In the first place, this is not 'packetizing'.  You provided 2K bytes
and it received 2K bytes (and made them available to your consumer
process.)  The fact that they are both the same size is neither a
coincidence nor a fault of the implementation.  Something like
packetizing might be implied had you read less than 2K and found the
difference thrown on the floor (I guess; I'm not sure there is any way
this term can be worked into this conversation sensibly.)

At some point a resource runs out; there is no way around that.  With
blocking (normal mode) I/O this results in the process being blocked,
waiting until resources are available.  When you requested non-blocking
I/O the system still had to do *something* when resources ran out (as
I said, exactly how much of this resource should be allocated to a given
process is an entirely different question.)  What would you suggest it
do?  Ignore your I/O?  Do it "anyhow"?  Block you "anyhow"?  I think it
is doing the right thing (telling you immediately that the I/O is not
possible right now.)

I think this behavior was simply necessitated by allowing the user to
specify non-blocking I/O; there is no fundamental difference between
this and blocking I/O.  The reason you felt it was "unlimited" in the
blocking case is only that the system took the responsibility for
making you wait.  As soon as you requested non-blocking I/O you stated
that you wished to take that responsibility on yourself (i.e. you
probably had something else to do while waiting.)
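
In practice, "taking that responsibility on yourself" just means
noticing the short write or the error and waiting until the socket is
writable again.  A rough sketch (write_all() and the shrink-on-EMSGSIZE
heuristic are mine, not anything in the SUN library; "s" is assumed to
be a connected non-blocking TCP socket):

    #include <errno.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/time.h>
    #include <sys/socket.h>

    /* Push an entire buffer through a non-blocking socket. */
    int
    write_all(int s, char *buf, int len)
    {
        int sent = 0, n, chunk = len;
        fd_set wfds;

        while (sent < len) {
            if (chunk > len - sent)
                chunk = len - sent;
            n = write(s, buf + sent, chunk);
            if (n > 0) {
                sent += n;              /* a short write is fine; keep going */
                continue;
            }
            if (n < 0 && errno == EMSGSIZE) {
                chunk /= 2;             /* single write too big: try smaller pieces */
                if (chunk < 1)
                    return -1;
                continue;
            }
            if (n < 0 && errno == EWOULDBLOCK) {
                FD_ZERO(&wfds);         /* buffers full: wait until writable */
                FD_SET(s, &wfds);
                if (select(s + 1, (fd_set *)0, &wfds, (fd_set *)0,
                           (struct timeval *)0) < 0)
                    return -1;
                continue;
            }
            return -1;                  /* some other error */
        }
        return sent;
    }
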

>Robert Allen

	-Barry Shein, Boston University

mike@BRL.ARPA.UUCP (06/06/86)

Sockets themselves buffer a limited amount of data.  The size of this socket
buffering varies, depending on the type of socket (i.e. pipe, TCP, UDP,
etc.).

The system-wide default for TCP and UDP can be changed in 4.2 and 4.3 VAX
UNIX systems by using ADB to poke the kernel variables tcp_sendspace,
tcp_recvspace, udp_sendspace, and udp_recvspace.  These variables become
parameters to the soreserve() routine each time a connection is opened.

Note that in 4.2 the maximum size is 31k, in 4.3 it is 64k-1.  These values
are stored in the socket structure in SHORT ints; in 4.2 if the sign bit
comes on, the world will break, thus the 31k limit rather than 32k or 64k-1.
These variables can not casually be increased to LONG ints, because that
would cause the socket struct not to fit into an MBUF, which is currently
necessary, if a tad inelegant.  31k of buffering on each end is, in fact,
quite reasonable.
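
For what it's worth, the sign-bit hazard is easy to see in isolation
(a trivial sketch, assuming a 16-bit short on a two's-complement
machine):

    #include <stdio.h>

    int main()
    {
        short hiwat = 32 * 1024;        /* 0x8000: the sign bit comes on */

        printf("%d\n", hiwat);          /* prints -32768 */
        printf("%d\n", 31 * 1024);      /* 31744 still fits comfortably */
        return 0;
    }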

Also note that in 4.3 there is an option to the setsockopt() sys-call that
can be used to adjust these values on a per-connection basis, rather than
having to change the system-wide defaults.  For those of you who have
this, it is the preferred method of increasing buffering.
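
For concreteness, the per-connection adjustment looks roughly like
this (SO_SNDBUF and SO_RCVBUF are the 4.3BSD option names; the 32K
figure is only an example and is subject to the limits above):

    #include <stdio.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* Enlarge both socket buffers; call before connect() or listen(). */
    int
    set_buffers(int s, int size)
    {
        if (setsockopt(s, SOL_SOCKET, SO_SNDBUF,
                       (char *)&size, sizeof(size)) < 0) {
            perror("setsockopt SO_SNDBUF");
            return -1;
        }
        if (setsockopt(s, SOL_SOCKET, SO_RCVBUF,
                       (char *)&size, sizeof(size)) < 0) {
            perror("setsockopt SO_RCVBUF");
            return -1;
        }
        return 0;
    }

    /* e.g.  set_buffers(s, 32 * 1024); */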

I don't know at which SUN version the 4.3 capabilities will emerge, but
almost certainly not in SUN 2.x (but I don't actually know).

	Best,
	 -Mike Muuss

AUERBACH@SRI-CSL.ARPA (Karl Auerbach) (06/06/86)

This may be incorrect, but I have heard that 4.2/4.3 based TCP
implementations set the push flag for every user write.  This
somewhat transforms the TCP stream into packets as viewed by the
receiver.

I'm sure someone out there knows whether this is true or not.

			--karl--  (Karl Auerbach, Epilogue Technology Corp.)
			auerbach@sri-csl.arpa
-------

Casner@USC-ISIB.ARPA (Stephen Casner) (06/09/86)

I have found that with Sun release 2.0 "the world will break" if I patch
tcp_sendspace and tcp_recvspace to be only 16K, not 32K.  At 15K it works,
but 16K crashes.  I would like very much to get up to 31K because we need
that much for efficient use of the Wideband Satellite Network.  Any clue
as to how I might get there?

						-- Steve Casner
-------

nowicki%rose@SUN.COM (Bill Nowicki) (06/10/86)

	I have found that with Sun release 2.0 "the world will break"
	if I patch tcp_sendspace and tcp_recvspace to be only 16K, not
	32K.  At 15K it works, but 16K crashes.

Please clarify what you mean by "world will break".  Last week I did
some tests and tried various sizes like 16K and 32K without any
"crashes".  There is a severe performance problem if one side has a
large recvspace and the other has a sendspace that is less than or
equal to one quarter of that recvspace.  The "silly window syndrome
avoidance feature" then takes effect, and you can send only five times
the sendspace per second.  For example, if a machine with the default
sendspace of 2K sends to a machine whose recvspace has been increased
to 8K, then throughput is 10 Kbytes/second (five times the 2K
sendspace), about twenty times slower than 2K sending to 2K!  This is
probably true for 4.2 in general, but I have not tried the 4.3BSD
SWSA code.

Note that my experiments are using Sun-3s running 3.x -- you should
upgrade from 2.x as soon as you can, since 3.x has several of the 4.2BSD
bugs removed.

	-- Bill Nowicki
	   Sun Microsystems

Casner@USC-ISIB.ARPA.UUCP (06/10/86)

I borrowed the quote about the world breaking from Mike Muuss.  What I
meant by it was that with tcp_sendspace and tcp_recvspace patched to
16K in a 2.0 kernel, the system crashed during the process of booting.
I don't remember the details of the crash, but I believe it was a panic
or the like.

I'd like very much to have my machine upgraded to 3.x, but first we have
to buy more memory => more swap space => no user file space => more NFS
space...
						-- Steve Casner
-------

mike@BRL.ARPA (Mike Muuss) (06/12/86)

Well, I guess for SUN 2.0, the magic number must be 16k-1.
At least this is an improvement :-)
	-M