[comp.protocols.tcp-ip] Monitoring TCP/IP sockets

litwin@ROBOTICS.JPL.NASA.GOV (Todd Litwin) (01/29/91)

I have a program that uses TCP/IP sockets and needs to know quickly, within a
second or so, if the physical connection between the two systems is broken. It
appears that the operating system is very tolerant of physical disruptions, and
won't time out the connection and formally break it even if the problem lasts
several minutes. I'm using setsockopt() to turn on SO_KEEPALIVE, but this
doesn't help, either. Is there any way that I can force a socket to disconnect
after a second or so of failure to communicate (short of sending my own
heartbeats)? I am running under SunOS 4.0.2, but also will need to move a
version of this software to the Silicon Graphics world, and to the VxWorks
real-time operating system. Any suggestions would be greatly appreciated.

 		Todd Litwin
 		Jet Propulsion Laboratory
 		(818) 354-5028
 		litwin@robotics.jpl.nasa.gov
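
For reference, turning on SO_KEEPALIVE as described in the post above looks
roughly like the following. This is only a sketch in Python (the original
program was presumably written in C), and the helper name enable_keepalive is
hypothetical. Note that enabling the option only asks the system to probe an
idle connection; the probe interval is chosen by the system, which is why it
does not give one-second detection.

```python
import socket


def enable_keepalive(sock):
    """Turn on SO_KEEPALIVE for a TCP socket and report whether it took effect.

    Hypothetical helper for illustration; the probe timing itself is left
    to the operating system's defaults.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Read the option back: a nonzero value means keepalive is enabled.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


if __name__ == "__main__":
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    print("keepalive enabled:", enable_keepalive(s))
    s.close()
```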

romkey@ASYLUM.SF.CA.US (John Romkey) (01/30/91)

You can't always tell within a second or two whether the physical
connection between two systems is broken. Sometimes the break is a
router crashing. Sometimes it's an AT&T fiberoptic cable cut by a
backhoe in upstate New York when you're in Dallas and the computer
you're talking to is in San Francisco. Most applications want to be
tolerant of order-of-several-minutes disruption of communications,
because there are too many real world transient conditions that aren't
readily distinguishable from long term failures.
		- john romkey			Epilogue Technology
USENET/UUCP/Internet:  romkey@asylum.sf.ca.us	FAX: 415 594-1141

BILLW@MATHOM.CISCO.COM (William "Chops" Westfield) (01/30/91)

    I have a program that uses TCP/IP sockets and needs to know quickly,
    within a second or so, if the physical connection between the two
    systems is broken.

Foo.  tcp/ip is designed for reliability over many media.  You have no
guarantee that your packet will even get to its destination within a
second, even if the network is working perfectly.

If you really need to know that quickly whether the network has gone
away, tcp/ip is not a suitable protocol to be using.

BillW
-------

henry@zoo.toronto.edu (Henry Spencer) (01/31/91)

In article <9101291553.AA06606@litwin.jpl.nasa.gov.> litwin@ROBOTICS.JPL.NASA.GOV (Todd Litwin) writes:
>I have a program that uses TCP/IP sockets and needs to know quickly, within a
>second or so, if the physical connection between the two systems is broken.

This basically can't be done; it's easy to get transient interruptions that
last longer than that, and there is no reliable way to distinguish them
from a real break in the link.  If you're willing to consider even such a
hiccup as a failure, then you need some sort of keepalive protocol at a
higher level.  TCP/IP is deliberately very tolerant of outages.

>... I'm using setsockopt() to turn on SO_KEEPALIVE, but this
>doesn't help, either....

SO_KEEPALIVE is a kludge; its timeout period is non-adjustable and quite
long.  You're going to have to do it yourself.
-- 
If the Space Shuttle was the answer,   | Henry Spencer at U of Toronto Zoology
what was the question?                 |  henry@zoo.toronto.edu   utzoo!henry
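
The "do it yourself" keepalive suggested above might be sketched as follows.
This is a minimal illustration, not anyone's actual implementation: it assumes
the peer sends a small heartbeat message at least once per timeout window on a
connection reserved for heartbeats, and the function name peer_alive is
hypothetical. Any silence longer than the window is reported as a failure,
even if it is only a transient hiccup.

```python
import socket


def peer_alive(sock, timeout=1.0):
    """Return True if the peer sends at least one byte within `timeout` seconds.

    Assumes `sock` carries only heartbeat traffic, so consuming a byte here
    does not disturb application data.
    """
    sock.settimeout(timeout)
    try:
        data = sock.recv(1)
    except socket.timeout:
        return False  # silent for the whole window: treat as a failure
    except OSError:
        return False  # connection reset or other socket error
    return len(data) > 0  # b"" means the peer closed the connection


if __name__ == "__main__":
    # Local demonstration using a connected pair of sockets.
    a, b = socket.socketpair()
    b.sendall(b"\x00")          # one heartbeat
    print(peer_alive(a, 0.2))   # heartbeat waiting
    print(peer_alive(a, 0.2))   # nothing more sent within the window
    a.close()
    b.close()
```

The receiving side would call peer_alive() in a loop and declare the link
broken on the first False, or after a few consecutive misses if a single lost
heartbeat should be tolerated.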

mleech@bwdlh131.bnr.ca (Marcus Leech) (01/31/91)

In article <1991Jan30.172337.7084@zoo.toronto.edu>,
henry@zoo.toronto.edu (Henry Spencer) writes:
|> In article <9101291553.AA06606@litwin.jpl.nasa.gov.> litwin@ROBOTICS.JPL.NASA.GOV (Todd Litwin) writes:
|> 
|> SO_KEEPALIVE is a kludge; its timeout period is non-adjustable and quite
|> long.  You're going to have to do it yourself.
I'll second that, and add that I have successfully used application-level
  "keep-alives" (which should properly be called "make-deads") to detect
  server processes going away.  This *only* works if you have very tight
  control over where your packets are routed.  It happens that my applications
  have servers "in the next room", so network prop delays are quite
  predictable (barring technicians severing cables ;-( ).
--
Marcus Leech, 4Y11             Bell-Northern Research  |opinions expressed
mleech@bnr.ca                  P.O. Box 3511, Stn. C   |are my own, and not
VE3MDL@VE3JF.ON.CAN.NA         Ottawa, ON, CAN K1Y 4H7 |necessarily BNR's

nelson@sun.soe.clarkson.edu (Russ Nelson) (01/31/91)

In article <12657895596.14.BILLW@mathom.cisco.com> BILLW@MATHOM.CISCO.COM (William "Chops" Westfield) writes:

       I have a program that uses TCP/IP sockets and needs to know quickly,
       within a second or so, if the physical connection between the two
       systems is broken.

   Foo.  tcp/ip is designed for reliability over many media.  You have no
   guarantee that your packet will even get to its destination within a
   second, even if the network is working perfectly.

   If you really need to know that quickly whether the network has gone
   away, tcp/ip is not a suitable protocol to be using.

You didn't foo him very well, Bill.  Yes, he shouldn't be using TCP/IP.
But he *can* use IP.  It's just a matter of protocol design.  If you
*really* want to know if your network has gone away in a second, you
obviously have to have a network whose packets can make a round trip
in less than a second.

Moreover, we need to communicate in *much* less than a second, because
we have to be able to retry several times.  We also need to be able to
limit traffic on the network, so that we can guarantee a certain probability
that no return packet *really* means dead machines.

And if the protocol is designed well, it could constantly update a
probability measure of connection downness.

So, he can't do it on an arbitrary LAN (or LANs), nor guarantee a
100% correct answer, but he *can* do it over IP.

--
--russ (nelson@clutx [.bitnet | .clarkson.edu])  FAX 315-268-7600
It's better to get mugged than to live a life of fear -- Freeman Dyson
I joined the League for Programming Freedom, and I hope you'll join too.
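
One way to keep the "probability measure of connection downness" suggested
above is sketched below. The post does not say how to compute it; this sketch
assumes an exponentially weighted moving average of probe results, which is a
heuristic loss estimate rather than a true probability. The class name
LinkEstimator and the alpha and threshold parameters are hypothetical.
Sending the probes themselves (for example over UDP, several per second so
that there is time to retry) is left out.

```python
class LinkEstimator:
    """Running estimate of how likely the link is down, from probe results."""

    def __init__(self, alpha=0.3, threshold=0.9):
        self.alpha = alpha          # weight given to the newest probe result
        self.threshold = threshold  # estimate at which the link is declared down
        self.p_down = 0.0           # start by assuming the link is up

    def record(self, replied):
        """Fold in one probe result: True if a reply arrived in time."""
        sample = 0.0 if replied else 1.0
        # Move the estimate a fraction `alpha` of the way toward the sample.
        self.p_down += self.alpha * (sample - self.p_down)
        return self.p_down

    def is_down(self):
        return self.p_down >= self.threshold


if __name__ == "__main__":
    est = LinkEstimator(alpha=0.5, threshold=0.9)
    for replied in (False, False, False, False, True):
        estimate = est.record(replied)
        print(replied, estimate, est.is_down())
```

A larger alpha reacts faster to a dead link but is also fooled more easily by
a few lost packets, which is the trade-off between speed and certainty the
post describes.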