[comp.unix.questions] reading on sockets when connection breaks

mnl%IDTSUN1.E-TECHNIK.TH-DARMSTADT.DE@BRL.MIL (Michael@CUNYVM.CUNY.EDU N. Lipp) (12/06/90)

Hello,

I have a program that establishes a TCP-connection with another machine,
requests the server to send some packets of data and then does a

while (read (fd, &packet, sizeof (packet)) == sizeof (packet)) { ... }

This program hangs frequently. I made it QUIT and found it hanging in the
read. As this program frequently connects to diskless machines that are
switched off at night, I assume that the connection comes down while
the program is reading.

I am wondering: shouldn't read return with an error status if the connection
breaks? As it apparently does not, what is the most reasonable fix?
A blocking read with timeout comes to my mind, but what is the best
way to do this?

Thanks, Michael

-----------------,------------------------------,------------------------------
Michael N. Lipp  !  Institut fuer Datentechnik  !  XDATMNLX@DDATHD21.BITNET
                 !  Merckstr. 25                !  (local: mnl@idtsun1)
                 !  D-6100 Darmstadt (Germany)  `------------------------------
                 !  Phone: 49-6151-163776          Fax: 49-6151-164976
-----------------'------------------------------'------------------------------

barmar@think.com (Barry Margolin) (12/06/90)

In article <25205@adm.brl.mil> mnl%IDTSUN1.E-TECHNIK.TH-DARMSTADT.DE@BRL.MIL (Michael@CUNYVM.CUNY.EDU N. Lipp) writes:
>I have a program that establishes a TCP-connection with another machine,
>requests the server to send some packets of data and then does a
>
>while (read (fd, &packet, sizeof (packet)) == sizeof (packet)) { ... }
>
>This program hangs frequently. I made it QUIT and found it hanging in the
>read. As this program frequently connects to diskless machines that are
>switched off at night, I assume that the connection comes down while
>the program is reading.
>
>I am wondering: shouldn't read return with an error status if the connection
>breaks? As it apparently does not, what is the most reasonable fix?
>A blocking read with timeout comes to my mind, but what is the best
>way to do this?

Are the diskless machines simply switched off, or are they shut down with
software?  If they're just switched off, then they won't be able to send
the appropriate "close this connection" packets (either a FIN or a RST) on
these connections.

Unfortunately, there's no reliable way to determine whether another machine
is up or down on many network media (Ethernet, in particular).  Lack of
communication can result from a number of other causes: network congestion,
router/bridge failure, a flaky cable or connector, etc.

If you're willing to assume that incommunicado means dead you can use a
keepalive, an empty packet that is sent periodically in order to elicit an
acknowledgement.  If you're using Unix sockets, the SO_KEEPALIVE option can
be enabled to automate this.
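
A minimal sketch of turning the option on, assuming `fd` is an already connected stream socket. The helper name `enable_keepalive` is made up for illustration. How often probes are sent is up to the system; on many systems the first probe only goes out after roughly two hours of idle time, so this is not a quick timeout.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: enable keepalive probes on a connected socket.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt). */
int enable_keepalive(int fd)
{
    int on = 1;

    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
}
```

A caller would invoke `enable_keepalive(fd)` once, right after connect() succeeds, and treat a later read() or write() error on `fd` as a dead peer.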

By the way, there's another bug in your code, in the "== sizeof (packet)".
Read() is permitted to return fewer bytes than you asked for; the third
argument is only a maximum.  You should use something like

	while ((count = read(...)) > 0) { ... }
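
A fuller sketch of such a loop, combined with the timeout the original poster asked about. It assumes select() is available; the name `read_full`, the per-chunk timeout in whole seconds, and the choice of ETIMEDOUT to report a timeout are illustrative assumptions, not anything from the original program.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read exactly len bytes from fd, waiting at most
 * timeout_sec seconds for each chunk to arrive.
 * Returns len on success, a short count (possibly 0) if the peer closed
 * the connection first, or -1 on error.  A timeout is reported as -1
 * with errno set to ETIMEDOUT.  fd must be below FD_SETSIZE. */
ssize_t read_full(int fd, void *buf, size_t len, int timeout_sec)
{
    char *p = buf;
    size_t got = 0;

    while (got < len) {
        fd_set rfds;
        struct timeval tv;
        int ready;
        ssize_t n;

        /* select() may modify both arguments, so rebuild them each time. */
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);
        tv.tv_sec = timeout_sec;
        tv.tv_usec = 0;

        ready = select(fd + 1, &rfds, NULL, NULL, &tv);
        if (ready < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: try again */
            return -1;
        }
        if (ready == 0) {
            errno = ETIMEDOUT;  /* nothing arrived within the timeout */
            return -1;
        }

        n = read(fd, p + got, len - got);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (n == 0)
            break;              /* end of file: peer closed the connection */
        got += (size_t)n;       /* short read: keep going for the rest */
    }
    return (ssize_t)got;
}
```

With this, the original loop would become something like `while (read_full(fd, &packet, sizeof packet, 30) == sizeof packet) { ... }`, where the 30 seconds is an arbitrary example value.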
--
Barry Margolin, Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar

damenf@motcid.UUCP (Frederick Damen) (12/12/90)

In article <1990Dec6.055353.23846@Think.COM> barmar@think.com (Barry Margolin) writes:
>If you're willing to assume that incommunicado means dead you can use a
>keepalive, an empty packet that is sent periodically in order to elicit an
>acknowledgement.  If you're using Unix sockets, the SO_KEEPALIVE option can
>be enabled to automate this.

After RTFMs, I have a few questions and assumptions that need confirming:

1) After reading some related information on SIGPIPE and running a test program
   it seems as though SIGPIPE is only raised on a pipe/socket that has been 
   written to.  Most, if not all, of the documentation on SIGPIPE that I have
   seen refers only to writing to the pipe/socket.  In a test program that I
   have written, read(2) returns 0 if there have not been any writes to that
   end of the socket, but raises SIGPIPE if that end of the socket has
   previously been written to.  This happens with or without SO_KEEPALIVE set.

   Q: Is SIGPIPE only raised for the end of the socket/pipe that has been written to?

2) The SunOS 4.0.1 man page for setsockopt(2) says:

     SO_KEEPALIVE enables the periodic transmission of messages on a
     connected socket.  Should the connected party fail to respond to
     these messages, the connection is considered broken and processes
     using the socket are notified using a SIGPIPE signal.

   Q: What is the period of these messages?
   Q: When is the SIGPIPE sent:
         After n(n=1) messages are not responded to?
         When the next I/O operation is performed on this socket after a non-response?
   Q: Define processes *using* the socket.
      Is this:
         Processes that have written to the socket?
         Processes that have an open file descriptor for this socket?
         Processes at both ends of socket connection?
         Processes that are currently performing an I/O operation on the socket?

3) After the signal handler for SIGPIPE is called how do/should you tell which
   socket caused the SIGPIPE?

I am on a Sun 3/80 running SunOS 4.0.1.  I am using AF_INET, SOCK_STREAM.  I have
RTFM and then some.  I have written some programs using sockets and understand(?)
the basics.

Thanks in advance for any answers or RTFM (references to f___ manuals) that might
be more enlightening.

Fred
-- 
Fred Damen                                    1501 W. Shure Drive
Motorola, Inc.                                Arlington Heights, IL 60004
Cellular Infrastructure Division              708 632-4641
...!uunet!motcid!damenf

barmar@think.com (Barry Margolin) (12/12/90)

In article <5727@navy40.UUCP> damenf@motcid.UUCP (Frederick Damen) writes:
>1) After reading some related information on SIGPIPE and running a test program
>   it seems as though SIGPIPE is only raised on a pipe/socket that has been 
>   written to.

Someone please correct me if I'm wrong (my answers are mostly educated
guesses), but I think Unix keepalives are implemented by periodically
retransmitting a packet with the sequence number of the last one sent.  The
other host acknowledges having received all bytes up to that point, and
this acknowledgement serves as the indication that it is still alive.  But
if you haven't yet written anything, then there is nothing to retransmit.

>   Q: When is the SIGPIPE sent:
>         After n(n=1) messages are not responded to?
>         When the next I/O operation is performed on this socket after a non-response?

After n (n > 1, possibly a settable kernel parameter) messages are not
responded to.  There would be no reason to use a signal if it waited for
an I/O operation to be performed, as it could simply return an error in
that case.

>   Q: Define processes *using* the socket.
>      Is this:
>         Processes that have written to the socket?
>         Processes that have an open file descriptor for this socket?
>         Processes at both ends of socket connection?
>         Processes that are currently performing an I/O operation on the socket?

I would expect the second definition.  The third definition is unlikely,
because the process at the other end of the socket isn't likely to be on
this host, so it's not possible to send a Unix signal to it (it might not
even be on a Unix system); also if the keepalive is accurately detecting a
crashed host, the process at the other end doesn't even exist.  The fourth
definition is also unlikely for the reason I gave in the answer to the
previous question.  And the first definition seems unlikely because I don't
think the kernel keeps track of which processes have written to a socket;
there's a single buffer that all file descriptors for the socket reference.

>3) After the signal handler for SIGPIPE is called how do/should you tell which
>   socket caused the SIGPIPE?

I don't think there's a reliable way.  I think the intent of the keepalive
mechanism was to provide a way for the process to be killed automatically
if the other end died.  It doesn't provide for much fine control.  It's
probably the case that trying to read on a socket that caused a SIGPIPE
will get an error, but I wouldn't stake much on it.
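
One possible workaround, sketched under the assumption that the SIGPIPE is raised by a write to the broken socket: ignore the signal for the whole process, so the failing write() returns -1 with errno set to EPIPE. The error then arrives on the specific descriptor being written, which answers the "which socket" question. The function names below are hypothetical, and on some systems a broken TCP connection may show up as a different errno (ECONNRESET, for example), which this sketch reports as a generic error.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Ignore SIGPIPE process-wide so that a write to a broken connection
 * fails with EPIPE instead of delivering a signal.
 * Returns 0 on success, -1 on failure. */
int ignore_sigpipe(void)
{
    return signal(SIGPIPE, SIG_IGN) == SIG_ERR ? -1 : 0;
}

/* Write len bytes to fd.  Returns 0 on success, 1 if the write failed
 * with EPIPE (the peer of this particular descriptor is gone), and -1
 * on any other error. */
int write_or_detect_broken(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;               /* interrupted: retry */
            return errno == EPIPE ? 1 : -1;
        }
        p += n;                         /* partial write: send the rest */
        len -= (size_t)n;
    }
    return 0;
}
```

The trade-off is that the process is no longer killed automatically when the other end dies, so every write has to have its result checked.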
--
Barry Margolin, Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar

terryl@sail.LABS.TEK.COM (12/14/90)

In article <1990Dec12.060545.7673@Think.COM> barmar@think.com (Barry Margolin) writes:
>In article <5727@navy40.UUCP> damenf@motcid.UUCP (Frederick Damen) writes:
>>1) After reading some related information on SIGPIPE and running a test program
>>   it seems as though SIGPIPE is only raised on a pipe/socket that has been 
>>   written to.
>
>Someone please correct me if I'm wrong (my answers are mostly educated
>guesses), but I think Unix keepalives are implemented by periodically
>retransmitting a packet with the sequence number of the last one sent.  The
>other host acknowledges having received all bytes up to that point, and
>this acknowledgement serves as the indication that it is still alive.  But
>if you haven't yet written anything, then there is nothing to retransmit


     OK, consider yourself corrected!!! (-:

     What actually happens is that TCP sends out a packet with a sequence number
of "send unacknowledged" - 1, which should have been the last byte sent and
already acknowledged, which is what I guess Barry was trying to say.....
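
     As a tiny illustration of that arithmetic (not actual kernel code): if
"send unacknowledged" is held in a 32-bit variable, called `snd_una` here after
RFC 793's SND.UNA, the probe's sequence number could be computed as below. The
function name is made up for the example.

```c
#include <stdint.h>

/* Sequence number for a keepalive probe as described above: one less
 * than the oldest unacknowledged sequence number.  TCP sequence numbers
 * are 32 bits wide and wrap around, so unsigned arithmetic gives the
 * right answer even when snd_una is 0. */
uint32_t keepalive_probe_seq(uint32_t snd_una)
{
    return (uint32_t)(snd_una - 1u);
}
```

Because that byte has already been acknowledged, a live peer sees the probe as old data and answers with an ACK carrying its current state, which is the sign of life the keepalive is looking for.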

__________________________________________________________
Terry Laskodi		"There's a permanent crease
     of			 in your right and wrong."
Tektronix		Sly and the Family Stone, "Stand!"
__________________________________________________________