[comp.unix.programmer] Detecting Closed Sockets

LordBah@cup.portal.com (Jeffrey J Vanepps) (03/10/91)

I have a problem with telling when the other side of a socket connection
has exited.  Normally a read/recv which returns zero after select(2) says
that there is data to be read signifies a closed connection.  In my case,
the application can't use this technique.  Here's the situation:

Program R is a communications router.  It accepts connections from data
producers and data consumers.  Normally, it waits until data is available
from some producer, reads the data, and then writes the data to each
consumer who requested the type of data being generated by that producer.
However, data is not allowed to be lost or thrown away, so if a producer
produces data for which there is not yet a consumer, then R does not
read that data.  It is left in the socket buffer, eventually blocking
the producer.  

Now, after this has happened, and there is a buffer full of data waiting 
to be read from a producer, R has no way to tell that the producer process 
has exited.  Normally when one side of a socket connection exits, the
other side is told via select(2) that there is data to be read, but the
recv(2) call returns zero bytes.  But in the case described above, R can't
try to recv because it has nowhere to put the data.
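
For reference, the usual select-then-recv test described above might look
roughly like the sketch below.  The function name peer_closed_by_read and its
return convention are made up for illustration.  Note that the check consumes
data from the socket, which is exactly what R cannot afford to do.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Hypothetical helper: poll fd once and try to read from it.
 * Returns 1 if the peer has closed (recv returned 0),
 *         0 if nothing was pending or some data was read into buf,
 *        -1 on error.
 * Caution: up to buflen bytes are consumed from the socket. */
int peer_closed_by_read(int fd, char *buf, size_t buflen)
{
    fd_set rfds;
    struct timeval tv;
    ssize_t n;

    tv.tv_sec = 0;                /* zero timeout: poll, do not block */
    tv.tv_usec = 0;
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);

    if (select(fd + 1, &rfds, NULL, NULL, &tv) < 0)
        return -1;
    if (!FD_ISSET(fd, &rfds))
        return 0;                 /* not readable: peer state unknown */

    n = recv(fd, buf, buflen, 0);
    if (n < 0)
        return -1;
    return n == 0;                /* "readable" plus zero bytes means EOF */
}
```

The end-of-file indication only arrives after every buffered byte has been
read, so with unread data in the buffer this test cannot report the close.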

This is on a Sparc 1+ and IPC, SunOS 4.1.1.

Some attempted solutions:

- Always reading all available data and queueing it (in memory or on disk)
  is not acceptable.  Data volumes under consideration are too high.  We
  basically want the producer to block while there is no consumer.

- The FIONREAD ioctl(2) call always says that there are many bytes to read,
  since there are many bytes left in the buffer.

- getsockopt(SO_ERROR) never returns any error.

- getpeername(2) still thinks that the socket is connected.

- select(2) still thinks that the socket is writeable, even though the
  other side has exited.

- No exceptional condition is ever apparent to select(2).  Given what I've
  seen written about exceptional conditions, I didn't expect this to work.

- I don't receive SIGURG or even SIGPIPE when the other side closes.
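
The FIONREAD attempt in the list above can be sketched as follows.  The
helper name pending_bytes is hypothetical, and on some systems FIONREAD is
declared in <sys/filio.h> rather than <sys/ioctl.h>.

```c
#include <sys/ioctl.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: return the number of unread bytes queued on fd,
 * or -1 if the ioctl fails. */
int pending_bytes(int fd)
{
    int n = 0;

    if (ioctl(fd, FIONREAD, &n) < 0)
        return -1;
    return n;
}
```

The count is the same whether or not the producer is still alive, so a
non-zero result says nothing about the state of the other end.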

So, is there any method provided by the system for determining whether the
other side of a socket has closed?  I'd rather not do any type of handshaking
because throughput is also an issue with R.

--
--------------------------------------------------------------------
    Jeff Van Epps    amusing!lordbah@bisco.kodak.com
                     lordbah@cup.portal.com
                     sun!portal!cup.portal.com!lordbah

torek@elf.ee.lbl.gov (Chris Torek) (03/10/91)

In article <39989@cup.portal.com> LordBah@cup.portal.com
(Jeffrey J Vanepps) writes:
>Program R is a communications router.  It accepts connections from data
>producers and data consumers.  Normally, it waits until data is available
>from some producer, reads the data, and then writes the data to each
>consumer who requested the type of data being generated by that producer.
>However, data is not allowed to be lost or thrown away, so if a producer
>produces data for which there is not yet a consumer, then R does not
>read that data.  It is left in the socket buffer, eventually blocking
>the producer.  
>
>Now, after this has happened, and there is a buffer full of data waiting 
>to be read from a producer, R has no way to tell that the producer process 
>has exited.  Normally when one side of a socket connection exits, the
>other side is told via select(2) that there is data to be read, but the
>recv(2) call returns zero bytes.  But in the case described above, R can't
>try to recv because it has nowhere to put the data.

There is something missing from the above problem specification, because
as given, there is nothing wrong, nothing to fix.

If data may never be discarded, then the following sequence has to work,
and R has no need to notice that P is gone:

	producer P runs, outputs data of type T
	producer P exits
	several hours later, consumer C requests data of type T

There also appears to be a potential race condition:

	producer P runs, outputs data of type T
	consumer C1 runs, requests data of type T, gets first two items
		(which thus disappear from the queue)
	consumer C2 runs, requests data of type T; now both C1 and C2
		get the remaining items.  C2 has lost out due to a race.

If the missing goal above is to have router R spawn new producers, then:

   - R can be the parent of *all* producers
     (and thus use wait3() or wait() to detect vanished ones)

or

   - Each producer P can also have a second socket to router R, one
     which exists only to identify P to R initially and then has no
     further data written to it.  When P exits, this socket will see
     an end-of-file condition.
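
A minimal sketch of the second-socket idea, assuming the producer opens a
separate control connection and never writes to it.  The name producer_gone
and the timeout parameter are illustrative, not part of any existing router.

```c
#include <stddef.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

/* Hypothetical helper: wait up to timeout_sec seconds for end-of-file on
 * a control socket that the producer never writes to.
 * Returns 1 if the producer has gone away, 0 if it still appears to be
 * there, -1 on error. */
int producer_gone(int ctl_fd, int timeout_sec)
{
    fd_set rfds;
    struct timeval tv;
    char c;
    ssize_t n;

    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;
    FD_ZERO(&rfds);
    FD_SET(ctl_fd, &rfds);

    if (select(ctl_fd + 1, &rfds, NULL, NULL, &tv) < 0)
        return -1;
    if (!FD_ISSET(ctl_fd, &rfds))
        return 0;                 /* nothing to read: producer still there */

    /* The control socket carries no data, so reading here costs nothing. */
    n = recv(ctl_fd, &c, 1, 0);
    if (n < 0)
        return -1;
    return n == 0;                /* zero bytes means the producer closed */
}
```

R would add the control descriptor to its normal select(2) set; because
nothing is ever queued on it, the end-of-file is visible at once even while
the data socket's buffer stays full.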

If producers and consumers run on different machines than R, some sort
of keep-alive mechanism may be required as well, since a broken
connection and an idle one are otherwise completely identical.
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

fkittred@bbn.com (Fletcher Kittredge) (03/22/91)

In article <39989@cup.portal.com> LordBah@cup.portal.com (Jeffrey J Vanepps) writes:
>I have a problem with telling when the other side of a socket connection
>has exited.  Normally a read/recv which returns zero after select(2) says
>that there is data to be read signifies a closed connection.  In my case,
>the application can't use this technique.  Here's the situation:

So why don't you set SO_KEEPALIVE on the socket and respond to the SIGIO?
If that is a problem for you, you could have R write to the producer's socket.
If the send fails (ENOTCONN or EPIPE), then take whatever action is necessary.
Throughput should not be a problem since the write won't block, and the
producer can ignore the message.
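
Both suggestions might be sketched as below.  The helper names
enable_keepalive and probe_peer are invented for illustration, and the
one-byte probe assumes the producer is written to ignore such bytes.
SIGPIPE is ignored around the send so that a dead peer shows up as an
errno value instead of killing R.

```c
#include <errno.h>
#include <signal.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: turn on SO_KEEPALIVE.  Returns 0 or -1. */
int enable_keepalive(int fd)
{
    int on = 1;

    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, (char *)&on, sizeof on);
}

/* Hypothetical helper: write one byte toward the producer.
 * Returns 1 if the send was accepted, 0 if the connection is known to
 * be gone, -1 on any other error (errno is preserved). */
int probe_peer(int fd)
{
    void (*old)(int);
    ssize_t n;
    int saved;

    old = signal(SIGPIPE, SIG_IGN);       /* avoid dying on a dead peer */
    n = send(fd, "", 1, 0);               /* a single NUL byte */
    saved = errno;
    if (old != SIG_ERR)
        signal(SIGPIPE, old);             /* restore previous handler */

    if (n >= 0)
        return 1;
    errno = saved;
    if (saved == EPIPE || saved == ECONNRESET || saved == ENOTCONN)
        return 0;
    return -1;
}
```

Over TCP the first probe after the producer exits may still be accepted,
since the error is only known once the peer's reset comes back, so a later
probe is the one that fails.  The send can also block if the socket is in
blocking mode and the producer's receive buffer is full.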

Note that doing your own polling is considered by many to be a more elegant
solution than setting SO_KEEPALIVE.  SO_KEEPALIVE causes additional network
load.

regards,
fletcher
Fletcher Kittredge
Platforms and Tools Group, BBN Software Products
10 Fawcett Street,  Cambridge, MA. 02138
617-873-3465  /  fkittred@bbn.com  /  fkittred@das.harvard.edu

pww@bnr.ca (Peter Whittaker) (03/23/91)

In article <63357@bbn.BBN.COM> fkittred@spca.bbn.com (Fletcher Kittredge) writes:
>In article <39989@cup.portal.com> LordBah@cup.portal.com (Jeffrey J Vanepps) writes:
>>I have a problem with telling when the other side of a socket connection
>>has exited.  Normally a read/recv which returns zero after select(2) says
>>that there is data to be read signifies a closed connection.  In my case,
>>the application can't use this technique.  Here's the situation:
>
>So why don't you set SO_KEEPALIVE on the socket and respond to the SIGIO?
>
oooo, my least favourite topic :->

>Note that doing your own polling is considered by many to be a more elegant
>solution than setting SO_KEEPALIVE.  SO_KEEPALIVE causes additional network
>load.

That's not all it does!  Beware: certain OSes have broken KEEPALIVE handling,
notably HP-UX pre-7.0.5 (it seems to be fixed in 7.0.5), and (maybe)
SunOS pre-4.1.1.  The nature of the problem is this:

(Assume a TCP connection between any Sun or HP server and an HP-UX 6.5 client):

The client sets KEEPALIVE;  the server crashes;  the server's kernel tries
to close the connection gracefully and sends a RST;  the client ACKs the RST,
and should then wait ~15 minutes before sending its own RST - which the server
side kernel should ACK, whereupon the connection is closed.

What actually happens:  the client receives the RST, ACKs it, then starts
waiting the 15 minutes.  Before the 15 minutes are up, THE CLIENT SENDS
a KEEPALIVE (i.e. an ACK!).  The server ACKs the KEEPALIVE.  Result?
The connection lives forever!

Note that this buggy behaviour depends on the server ACKing the KEEPALIVE
even though it has sent a RST!  Unfortunately, no one seems to have accounted
for that possibility!  (Not surprising:  once you send a RST, the 'other side'
should - one would think - stop KEEPALIVING!  Of course, KEEPALIVE is not TCP!).

Beware the SO_KEEPALIVE, my son, and keep your vorpal ready!
(in other words, use select()) 


--
Peter Whittaker      [~~~~~~~~~~~~~~~~~~~~~~~~~~]   Open Systems Integration
pww@bnr.ca           [                          ]   Bell Northern Research 
Ph: +1 613 765 2064  [                          ]   P.O. Box 3511, Station C
FAX:+1 613 763 3283  [__________________________]   Ottawa, Ontario, K1Y 4H7