[comp.protocols.iso.dev-environ] Why does ISODE use SO_KEEPALIVE?

PWW@BNR.CA (Peter Whittaker, P.W.) (11/20/90)

The ISODE function start_tcp_server sets the socket option SO_KEEPALIVE.
Over the last few days I have received many notes suggesting that the use
of SO_KEEPALIVE violates the TCP protocol (in essence, this is because the
KEEP_ALIVE traffic can cause an otherwise idle connection to fail:  if the
connection is not in use and some intermediate site goes down and comes back
up, the end points will carry on as if nothing happened.  If the down-and-up
happens while one end is waiting for a KEEP_ALIVE response, that end will
know of the intermediate problem).
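
For readers who have not met the option, here is a minimal sketch of what
turning it on looks like through the sockets interface. It is written in
Python for brevity and is not ISODE's start_tcp_server code; the helper name
enable_keepalive is hypothetical.

```python
import socket


def enable_keepalive(sock):
    """Turn on SO_KEEPALIVE for sock and report whether it is now set.

    enable_keepalive is a hypothetical helper, not an ISODE routine.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # How long the connection must sit idle before a probe goes out, and
    # how long the kernel waits for an answer, are chosen by the system,
    # not by this call.
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


if __name__ == "__main__":
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        print(enable_keepalive(listener))
    finally:
        listener.close()
```

Note that the call only switches probing on; it says nothing about the
timing, which is the implementation-dependent part.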

Furthermore, the KEEP_ALIVE timeout seems unrelated to any useful quantity
(and is generally implementation dependent).  To quote one of the notes I
received recently:

"The keepalive implementations that I have heard about do not use the TCP
round trip time estimator to decide how long to wait for a response to the
keepalive.  This causes perfectly functional connections to die when the
round trip time gets too high."

To quote other notes: "Never predefine a timeout, it will eventually be too
small."  "Who's to say that 30 seconds (or 1 minute or 1 hour) is a
reasonable interval between keepalives? What's reasonable today might be
very unreasonable tomorrow."

If the reason KEEPALIVE is used is to make the server aware of a client
crashing, then the error returns from read() or write() (or the appropriate
return from select()) should be sufficient to indicate that the client
has gone to la-la-land.

So why does ISODE use SO_KEEPALIVE?

Peter Whittaker      [~~~~~~~~~~~~~~~~~~~~~~~~~~]   Open Systems Integration
pww@bnr.ca           [                          ]   Bell Northern Research
Ph: +1 613 765 2064  [                          ]   P.O. Box 3511, Station C
FAX:+1 613 763 3283  [__________________________]   Ottawa, Ontario, K1Y 4H7

j.onions@xtel.co.uk (Julian Onions) (11/20/90)

A couple of cases where KEEP_ALIVEs are useful.

A client opens a connection and asks questions of it. The server replies.
The client goes away and does something with this information intending to
ask the server more questions in the future over this connection it has open.
Meanwhile, the client machine crashes.
The server will not get to know about this, I believe, because it never
actively uses the connection; it is only waiting for input.

Another case is a client that wants to connect to a server to indicate
it wants certain events. Client connects to server, and says "send
me all events of type X down this connection". Both sides sit back and wait.
The server machine crashes or is taken down and rebooted.
The client is unaware of this as no activity is taking place over the
connection. It thinks it still has a registered handler for events of type
X which are now being thrown away. If it detected the server machine had died
it would have tried to actively reregister.

Now both these cases can be solved either by keepalives, or through
higher level NOP operations. The difference is that if I'm using
X.25 I get to know about these events for free. If I'm using TCP I have
to go out and look for them. Any application where I might get into this
situation I have to add NOP operations to the protocol and a suitable
timer. As an applications person, it seems like it's not my job to
write into each application specification a KEEP_ALIVE functionality.
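
To make the "NOP operation plus timer" alternative concrete, here is a
rough sketch in Python. The one-byte NOP message, the rule that the peer
echoes it back, and the names peer_alive and answer_one_nop are all
assumptions made for illustration; they come from no real protocol.

```python
import select
import socket
import threading

NOP = b"\x00"  # hypothetical one-byte no-op message


def peer_alive(sock, reply_timeout):
    """Send a NOP and wait up to reply_timeout seconds for it to be echoed."""
    try:
        sock.sendall(NOP)
    except OSError:
        return False              # the send itself failed
    readable, _, _ = select.select([sock], [], [], reply_timeout)
    if not readable:
        return False              # timer expired with no answer
    try:
        return sock.recv(1) == NOP  # b"" here would mean the peer closed
    except OSError:
        return False


def answer_one_nop(sock):
    """Peer side: echo a single NOP back."""
    if sock.recv(1) == NOP:
        sock.sendall(NOP)


if __name__ == "__main__":
    a, b = socket.socketpair()    # stands in for a TCP connection
    t = threading.Thread(target=answer_one_nop, args=(b,))
    t.start()
    print(peer_alive(a, 1.0))     # the peer answers in time
    t.join()
    print(peer_alive(a, 0.2))     # nobody answers, so the timer expires
    a.close()
    b.close()
```

The reply_timeout here is exactly the kind of predefined timer that has to
be chosen per application, which is the burden described above.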

Session layer for TCP anyone?

Julian.

G.Knight@CS.UCL.AC.UK (Graham Knight) (11/20/90)

Julian,

	Does this boil down to a problem of what a user is entitled to
expect from COTS? I had a look at the service definition to see what
it said about the TS's responsibility to notify connection failure.
The only thing I could find was:

"The TC release TS primitives are used to release a TC. The release
may be performed:


...
b) by the TS provider to release an established TC; all failures to
maintain a TC are indicated in this way;"

On the face of it, this suggests that failures should be notified even
when the connection is idle. However, as it does not say anything
about how soon after failure a notification should be given, there
remains an area of ambiguity.

						Graham

j.onions@xtel.co.uk (Julian Onions) (11/22/90)

> I'm not sure that I follow your examples, Julian.
>  
> > A client opens a connection and asks questions of it. The server replies.
> > The client goes away and does something with this information
>  
> If the server has this connection open, it must occasionally poll
> (i.e. select()) the socket, or it must dedicate a process to perform
> a blocking read() on the socket.  In either case, if the client end
> blows up, the server end will be notified:  either the read() will return
> an error, or the select() will return TRUE with an indication that the
> connection is gone.
No - you have the wrong idea I think - according to my understanding
of TCP anyway.

If you open a TCP connection, the only way you can find if it goes
away is if the remote machine sends a disconnect sequence or you can't
reach it after several retries. However, if the remote machine
crashes it will not send a disconnect - it won't send anything at all.
If you send a packet to the machine after it reboots, it finds no record
of the connection, answers with a reset, and the connection is shut and
everyone's happy. If you never send anything to it though, you don't know
if it's up or down. So, if the remote side crashes, and the connection
was idle at that point and you do not send any more data (e.g. you are
waiting for another request) you don't know if the remote client is up
or down or what. At the unix level, a read will block for ever, and a
select will never select it for reading.
You can see this behaviour with X windows: you have an xterm on a
remote host that crashes, and the window stays up. As soon as you type a
character in though, it attempts to send this, finds the host is dead
or rebooted and the window connection breaks.

Of course if the *application* crashes (and the machine doesn't), the
kernel detects this and initiates a close/abort.
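
The difference can be sketched in a few lines of Python. A local socket
pair stands in for the TCP connection, so this can only imitate a quiet
peer and a peer whose end has been closed, not a crashed host; the helper
name poll_state is hypothetical.

```python
import select
import socket


def poll_state(sock, timeout):
    """Classify sock as "idle", "data" or "closed" after one select()."""
    readable, _, _ = select.select([sock], [], [], timeout)
    if not readable:
        # Nothing arrived. A quiet peer and a crashed host look the same.
        return "idle"
    try:
        data = sock.recv(1, socket.MSG_PEEK)  # look without consuming
    except OSError:
        return "closed"                       # e.g. connection reset
    return "data" if data else "closed"       # b"" means the peer closed


if __name__ == "__main__":
    a, b = socket.socketpair()
    print(poll_state(a, 0.1))   # idle: the peer is up but silent
    b.close()                   # what the kernel does when a process exits
    print(poll_state(a, 0.1))   # closed: select() now reports the socket
    a.close()
```

Only the second case wakes the waiting side up; in the first, select()
simply keeps timing out for as long as nothing is sent.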


Julian.