[net.bugs.4bsd] 4.2BSD non-blocking sockets and selects

thomson@utcsrgv.UUCP (Brian Thomson) (12/21/83)

Index: sys/uipc_socket.c h/socketvar.h 4.2BSD

Description:
	If you do a select() for writing on a non-blocking 
    SOCK_STREAM socket, and there is some send queue buffer space
    available, it will tell you the socket can be written.
    But sosend() insists that all writes to non-blocking sockets
    be atomic, and will return EWOULDBLOCK if there is not enough
    buffer space for the entire write to go in one shot.

	This behaviour is OK for non-stream sockets, but streams
    should allow partial writes.  A couple of distributed utilities
    agree with me ...

Repeat-by:
	Both rlogind(8C) and telnetd(8C) are prepared for partial
    socket writes.  Try this:
	% rlogin localhost
	< message of the day >
	% cat /usr/dict/words
	<blah>
	<blah>
	<blah>
	~^Z		(i.e. suspend the rlogin locally)
	Stopped
	% jobs
	[1] Stopped		rlogin localhost
	% 
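    An iostat at this point will show (unless you happen to exactly fill
    the send queue) that your system is being eaten alive by rlogind:
    select() keeps reporting the socket writable, the write keeps
    failing with EWOULDBLOCK, and rlogind goes straight back to select().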
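
	A toy user-space model of that loop, for illustration only.  None
    of this is kernel code: the struct and the model_* names are made up,
    and the loop is capped so that it terminates.  It assumes select()
    tests for "any space at all" while the send path wants room for the
    whole write.

```c
#include <assert.h>
#include <errno.h>

/* Toy model (not kernel code): free space in a socket's send queue. */
struct model_sockbuf {
	int space;	/* bytes of buffer space currently free */
};

/* What select() for writing checks: is there any space at all? */
static int model_writable(const struct model_sockbuf *sb)
{
	return sb->space > 0;
}

/*
 * What the distributed sosend() does on a non-blocking socket:
 * take the whole write or none of it.
 * Returns bytes accepted, or -1 with errno set to EWOULDBLOCK.
 */
static int model_send_all_or_nothing(struct model_sockbuf *sb, int len)
{
	assert(len > 0);
	if (sb->space < len) {
		errno = EWOULDBLOCK;
		return -1;
	}
	sb->space -= len;
	return len;
}

/*
 * A daemon-style loop: check "select", then try to write.
 * Returns how many of max_tries attempts failed with EWOULDBLOCK
 * even though the socket had been reported writable.
 */
static int model_count_spins(struct model_sockbuf *sb, int len, int max_tries)
{
	int spins = 0;
	int i;

	for (i = 0; i < max_tries; i++) {
		if (!model_writable(sb))
			break;		/* a real select() would sleep here */
		if (model_send_all_or_nothing(sb, len) >= 0)
			break;		/* the write went through */
		spins++;		/* select said yes, send said no */
	}
	return spins;
}
```

    With 100 bytes free and a 512-byte write, every one of the permitted
    attempts is wasted; with no space at all, or with enough space, the
    loop does not spin.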

Fix:
	Allow partial writes to non-blocking sockets unless the
    underlying protocol is atomic.  This is consistent with the
    behaviour of non-blocking ttys, which are a good model
    for stream-oriented sockets.
    In file /sys/h/socketvar.h, change:

	#define	sosendallatonce(so) \
            (((so)->so_state & SS_NBIO) || ((so)->so_proto->pr_flags & PR_ATOMIC))
  
    to
	#define	sosendallatonce(so) \
            ((so)->so_proto->pr_flags & PR_ATOMIC)
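
    To see what the macro change does in isolation, here is a small
    stand-alone sketch.  The model_* structs are cut-down stand-ins for
    the kernel ones, the flag values are illustrative rather than copied
    from the headers, and the _old/_new suffixes and helper functions are
    hypothetical names used only to compare the two definitions.

```c
/* Illustrative flag values; check the real headers before relying on them. */
#define SS_NBIO		0x100	/* socket is marked non-blocking */
#define PR_ATOMIC	0x01	/* protocol exchanges atomic messages only */

/* Cut-down stand-ins for struct protosw and struct socket. */
struct model_protosw { short pr_flags; };
struct model_socket  { short so_state; struct model_protosw *so_proto; };

/* Distributed definition: non-blocking implies all-at-once. */
#define sosendallatonce_old(so) \
	(((so)->so_state & SS_NBIO) || ((so)->so_proto->pr_flags & PR_ATOMIC))

/* Proposed definition: only the protocol decides. */
#define sosendallatonce_new(so) \
	((so)->so_proto->pr_flags & PR_ATOMIC)

/* Helper: evaluate the old macro for a given state/flags pair. */
static int all_at_once_old(int so_state, int pr_flags)
{
	struct model_protosw proto;
	struct model_socket so;

	proto.pr_flags = (short)pr_flags;
	so.so_state = (short)so_state;
	so.so_proto = &proto;
	return sosendallatonce_old(&so) != 0;
}

/* Helper: evaluate the new macro for a given state/flags pair. */
static int all_at_once_new(int so_state, int pr_flags)
{
	struct model_protosw proto;
	struct model_socket so;

	proto.pr_flags = (short)pr_flags;
	so.so_state = (short)so_state;
	so.so_proto = &proto;
	return sosendallatonce_new(&so) != 0;
}
```

    The only combination on which the two disagree is a non-blocking
    socket over a non-atomic (stream) protocol: the old macro says
    "all at once", the new one does not.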


    In file /sys/sys/uipc_socket.c, routine sosend(), diff -c shows:

	***************
	*** 281,286
		register int space;
		int len, error = 0, s, dontroute;
		struct sockbuf sendtempbuf;
	  
		if (sosendallatonce(so) && uio->uio_resid > so->so_snd.sb_hiwat)
			return (EMSGSIZE);

	--- 287,293 -----
		register int space;
		int len, error = 0, s, dontroute;
		struct sockbuf sendtempbuf;
	+ 	int sentsome = 0;
	  
		if (sosendallatonce(so) && uio->uio_resid > so->so_snd.sb_hiwat)
			return (EMSGSIZE);
	***************
	*** 324,329
				goto release;
			}
			mp = &top;
		}
		if (uio->uio_resid == 0) {
			splx(s);

	--- 331,337 -----
				goto release;
			}
			mp = &top;
	+ 		sentsome = 1;
		}
		if (uio->uio_resid == 0) {
			splx(s);
	***************
	*** 336,342
			if (space <= 0 ||
			    sosendallatonce(so) && space < uio->uio_resid) {
				if (so->so_state & SS_NBIO)
	! 				snderr(EWOULDBLOCK);
				sbunlock(&so->so_snd);
				sbwait(&so->so_snd);
				splx(s);

	--- 344,353 -----
			if (space <= 0 ||
			    sosendallatonce(so) && space < uio->uio_resid) {
				if (so->so_state & SS_NBIO)
	! 				if(sentsome)
	! 					{ splx(s); goto release; }
	! 				else
	! 					snderr(EWOULDBLOCK);
				sbunlock(&so->so_snd);
				sbwait(&so->so_snd);
				splx(s);
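
    The combined effect of the two changes on a non-blocking socket can
    be modelled roughly as below.  This is a simplified sketch, not the
    kernel routine: model_sosend() and its parameters are hypothetical,
    and it leaves out locking, mbufs, the EMSGSIZE check and blocking
    sockets altogether.  It also sets sentsome as soon as bytes are
    queued, whereas the patch sets it when data is handed to the protocol.

```c
#include <assert.h>
#include <errno.h>

/*
 * Model of the sosend() loop for a NON-BLOCKING socket.
 *   space        free bytes in the send queue on entry
 *   resid        bytes the caller wants to write
 *   atomic_proto non-zero if the protocol has PR_ATOMIC set
 *   patched      non-zero to model the proposed change
 *   sent         out: bytes actually queued
 * Returns 0 on (possibly partial) success, or EWOULDBLOCK.
 */
static int model_sosend(int space, int resid, int atomic_proto,
    int patched, int *sent)
{
	/* Old macro: any non-blocking socket sends all at once. */
	int all_at_once = patched ? (atomic_proto != 0) : 1;
	int sentsome = 0;
	int chunk;

	assert(space >= 0 && resid >= 0);
	*sent = 0;
	for (;;) {
		if (resid == 0)
			return 0;		/* everything queued */
		if (space <= 0 || (all_at_once && space < resid)) {
			if (patched && sentsome)
				return 0;	/* partial write, no error */
			return EWOULDBLOCK;
		}
		/* Queue as much as fits. */
		chunk = space < resid ? space : resid;
		space -= chunk;
		resid -= chunk;
		*sent += chunk;
		sentsome = 1;
	}
}
```

    For example, with 100 bytes free and a 512-byte write to a stream
    socket, the unpatched model returns EWOULDBLOCK with nothing sent,
    while the patched model queues 100 bytes and returns success.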



Reservation:
	You should probably HOLD OFF installing this change until
    it gets batted about the net a bit.  The original behaviour appears
    to have been quite deliberate, and although I do think it's wrong,
    I'd like to give someone in the know a chance to explain the
    unobvious reason that it was right in the first place!
-- 
			Brian Thomson,	    CSRG Univ. of Toronto
			{linus,ihnp4,uw-beaver,floyd,utzoo}!utcsrgv!thomson

ka@hou3c.UUCP (Kenneth Almquist) (12/22/83)

It looks like Brian's proposed change would make writes to pipes non-atomic.
					Kenneth Almquist

thomson@utcsrgv.UUCP (Brian Thomson) (01/17/84)

It's been almost a month since I suggested allowing partial writes
to non-blocking stream sockets.  Faithful readers will remember that
such a change cures a tendency of rlogind(8C) and telnetd(8C) to
devour cpu cycles.  I also asked for comments.  Let's delve into the
ol' newsbag and see what lively discussion ensued:

>From: ka@hou3c.UUCP (Kenneth Almquist)
>Message-ID: <148@hou3c.UUCP>
>Date: Thu, 22-Dec-83 10:17:31 EST
>Organization: Bell Labs, Holmdel, NJ
>
>It looks like Brian's proposed change would make writes to pipes non-atomic.
>					Kenneth Almquist

Well ... yes and no.  It only affects writes to pipes whose write ends have
been marked non-blocking.  Such writes are indeed rendered non-atomic, but
I don't know how important that is.  But note that ordinary, synchronous pipes
may not be 'atomic' in 4.2 depending on your definition of that term.
Two reasonable definitions are:
 1) No other process attempting to write on the socket (pipe) can
    intersperse data.
 2) The written data will be buffered in the kernel at one go, so the
    write can complete without a corresponding read being performed.
Previous Unix systems guaranteed 1) if the write was less than PIPSIZ, and
guaranteed 2) if the pipe was not full and the write was less than PIPSIZ.

On synchronous stream sockets, the distributed 4.2 guarantees property 1)
for writes of any size, but 2) only happens if there is adequate space
in the pipe's send queue.
My modifications do not affect this behaviour.

On non-blocking stream sockets, distributed 4.2 REQUIRES that all writes
to pipes be less than PIPSIZ (the limit also exists for non-pipes but may
differ in value).  It guarantees either that both 1) and 2) will be satisfied
or that the write will fail outright: EMSGSIZE if it exceeds the limit,
EWOULDBLOCK if there is not enough room at the moment.
My modifications remove the size requirement for all stream sockets
by permitting partial writes, but sacrifice atomicity by both definitions.
I also make select() and send() agree on whether a socket can be written,
which was the original intent of the change.

If it becomes necessary, atomicity can be restored by employing the
(currently unused) so->so_snd.sb_lowat field.  Set it to PIPSIZ-1 for pipes,
and leave it 0 for everyone else, then change the definition of
sowriteable() to include the term
	sbspace(&so->so_snd) > so->so_snd.sb_lowat
instead of
	sbspace(&so->so_snd) > 0
But I don't want to start using a field that the Berkeley people may
have designs for, and besides, I'm not so sure that anything is broken
enough to need fixing.  Just how dependent are people on atomic pipes,
and what kind of atomicity do they need?
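
The sb_lowat idea can be sketched in user space as follows.  This is only
a model under stated assumptions: the struct is a cut-down stand-in for the
kernel's sockbuf, model_sbspace() ignores everything except the byte count,
the PIPSIZ value is illustrative, and the function covers only the
buffer-space term of sowriteable(), not the rest of the macro.

```c
#include <assert.h>

#define MODEL_PIPSIZ	4096	/* illustrative value, not taken from the source */

/* Cut-down stand-in for struct sockbuf. */
struct model_sockbuf {
	int sb_cc;	/* bytes currently queued */
	int sb_hiwat;	/* maximum bytes allowed in the queue */
	int sb_lowat;	/* proposed use: space that must be exceeded */
};

/* Simplified sbspace(): free bytes, ignoring any other limits. */
static int model_sbspace(const struct model_sockbuf *sb)
{
	assert(sb->sb_cc >= 0 && sb->sb_cc <= sb->sb_hiwat);
	return sb->sb_hiwat - sb->sb_cc;
}

/* Current term: writable as soon as there is any space. */
static int model_writable_now(const struct model_sockbuf *sb)
{
	return model_sbspace(sb) > 0;
}

/* Proposed term: writable only when space exceeds sb_lowat. */
static int model_writable_lowat(const struct model_sockbuf *sb)
{
	return model_sbspace(sb) > sb->sb_lowat;
}
```

With sb_lowat set to MODEL_PIPSIZ-1, the socket is reported writable only
when at least MODEL_PIPSIZ bytes are free, so a write of up to that size is
certain to fit in one go; with sb_lowat left at 0 the two tests agree.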

Hmmm ... I seem to have reached the bottom of the ol' newsbag.
Not such a lively discussion after all.  Maybe nobody cares.  Sniff.
-- 
			Brian Thomson,	    CSRG Univ. of Toronto
			{linus,ihnp4,uw-beaver,floyd,utzoo}!utcsrgv!thomson

thomson@utcsrgv.UUCP (Brian Thomson) (01/18/84)

Oops.  I said ...
  "On synchronous stream sockets, the distributed 4.2 guarantees 
   [no interspersal of data from contending writes on the same socket]
   for writes of any size..."

Sorry.  I was, uh, mistaken.  Ok, ok, to not mince words, I was WRONG.
It does no such thing: the send queue is not locked continuously during
long writes.
-- 
			Brian Thomson,	    CSRG Univ. of Toronto
			{linus,ihnp4,uw-beaver,floyd,utzoo}!utcsrgv!thomson