[comp.protocols.tcp-ip] Warning: Keep-Alive considered harmful

MAP@LCS.MIT.EDU (Michael A. Patton) (11/17/90)

Warning: Keep-Alive considered harmful!!!

Date: 16 Nov 90 16:44:48 GMT
From: van-bc!ubc-cs!news-server.csri.toronto.edu!utgpu!cunews!bnrgate!bwdls61.bnr.ca!usenet@ucbvax.berkeley.edu (Peter Whittaker)

(Boy, what a bogus address. Good thing you put one in your signature.
You may not be able to help the enormous length, but could you try and
get your mail service to put YOU, not your daemon, in the header?)

>So, are there in fact substantive reasons not to use SO_KEEPALIVE?

Yes, indeed there are. The implementation of keep-alives at the TCP
level is just WRONG. Although the actual details are (usually) not a
strict violation of the spec, they do stretch it in interesting ways
and I have seen at least one implementation that would sometimes RESET
a connection in response to a keep-alive packet.
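
For reference, the option under discussion is a per-socket flag. A
minimal sketch in Python of turning it on and reading it back; the two
helper names are illustrative only, and the probe timing and failure
behaviour that follow from the flag depend on the host's TCP
implementation:

```python
import socket


def set_keepalive(sock: socket.socket, enabled: bool) -> None:
    """Turn the SO_KEEPALIVE option on or off for one socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1 if enabled else 0)


def keepalive_enabled(sock: socket.socket) -> bool:
    """Report whether SO_KEEPALIVE is currently set (nonzero means on)."""
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
```

Setting the flag is the whole interface: the application gets no say in
what happens when a probe goes unanswered.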

If you want the functionality that you think you get automagically by
setting this option, you should really think about the specifics more.
Think about what functionality you really want. Do you just want it
to free up the socket or kill a server? Why do you care? Are you
attacking a symptom rather than the real problem?

Usually, the application needs much better control over how it really
works than just setting an option that causes it to crash if something
goes wrong. You should think about the many effects that can
influence this (I'll list a few presently) and consider whether you
want these in your application domain. Beware that you may only
intend to run your application in a limited context, but eventually
someone will try it over a different domain. Please consider the
different cases before going for options like this.

One thing to consider is that there are links in the world where, at
times, only one packet every 5-10 minutes actually gets through. Do
you want to forbid the use of your service over such a
link? Just this morning I had a user complaint that they couldn't FTP
a file between two distant hosts. The problem resolved to a link that
dropped out for several minutes every half hour or so, but the transfer
time for the file they wanted was 45 minutes. When the link dropped
out, they would get punted because of keep-alives, then they had to
start over. Had the FTP server not been running keep-alives, it
would have worked. The person doing the transfer had enough patience;
if only the computer had.

Another thing to consider is that in any application with a person in
control, they are usually a better watch-dog timer than any program
you build. This might suggest building some user-interface feature to
help. The hash marks in some versions of FTP are one example. Or you
might build the high-level timeout, but rather than punting the
operation, merely print out a message to the user explaining how they
could abort it themselves.
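
A minimal sketch in Python of a high-level timeout that informs the
user instead of punting the operation. The class name, method names,
and message text are hypothetical, and the threshold is supplied by the
caller rather than built in:

```python
import time
from typing import Callable, Optional


class StallNotifier:
    """Tracks progress on a transfer and reports stalls without aborting."""

    def __init__(self, warn_after: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.warn_after = warn_after  # seconds of silence before warning
        self.clock = clock            # injectable so it can be tested
        self.last_progress = clock()

    def progress(self) -> None:
        """Call whenever data arrives (where FTP would print a hash mark)."""
        self.last_progress = self.clock()

    def check(self) -> Optional[str]:
        """Return a message for the user if the transfer looks stalled."""
        idle = self.clock() - self.last_progress
        if idle < self.warn_after:
            return None
        # Inform the user; the decision to give up stays with them.
        return (f"No data for {idle:.0f} s; still waiting. "
                "Interrupt to abort the transfer.")
```

The transfer loop would call progress() on each successful read and
check() periodically, printing whatever message comes back. Nothing in
the sketch ever closes the connection.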

There are a few other general things to watch out for when building
any distributed system; here are a few related to the keep-alive
question. Never predefine a timeout: it will eventually be too small.
As an example, I have an FTP script with a global per-transfer timeout
that made it quit and go on to the next file. I discovered that for
one combination of server and file, FIVE HOURS was not enough; upping
it to 10 got that file transferred (but wreaked havoc with my other
assumptions :-).

Well, there are several more things I could bring up, but I seem to
have rambled on for over a page already so I'll cut it short here. I
hope this helps to answer your questions and give you some ideas as to
why keep-alive is considered harmful. Hopefully it will point you in
a useful direction for developing what you really need as well.

__
/| /| /| \ Michael A. Patton, Network Manager
/ | / | /_|__/ Laboratory for Computer Science
/ |/ |/ |atton Massachusetts Institute of Technology

Disclaimer: The opinions expressed above are a figment of the phosphor
on your screen and do not represent the views of MIT, LCS, or MAP. :-)

he@spurv.runit.sintef.no (Havard Eidnes) (11/18/90)

I agree that using TCP keepalives is a bad idea.  I just want to comment on
the specific example Michael A. Patton mentioned, since it lets me point
out, "with Internet Host Requirements in hand", one other area where
traditional implementations of TCP should be changed to improve robustness
in the case of temporary network failures.

In article <9011170344.AA20268@gaak.LCS.MIT.EDU> MAP@LCS.MIT.EDU (Michael A. Patton) writes:
>
>Just this morning I had a user complaint that they couldn't FTP
>a file between two distant hosts.  The problem resolved to a link that
>dropped out for several minutes every half hour or so, but the transfer
>time for the file they wanted was 45 minutes.  When the link dropped
>out, they would get punted because of keep-alives, then they had to
>start over.  Had the FTP server not been running keep-alives, it
>would have worked.  The person doing the transfer had enough patience;
>if only the computer had.

It is of course possible that this connection was blown away by the server
using TCP keepalives. However, one other possibility (perhaps more likely)
is that an intermediate gateway issued ICMP net unreachable messages while
the temporary network outage lasted.  Traditional implementations of TCP
(including the original 4.3BSD version) blow away a live TCP connection
when they receive an ICMP net unreachable message. The Host Requirements state
that a host implementation of TCP MUST NOT do this (specifically: the ICMPs
net unreachable, host unreachable or source route failed should be
considered temporary failures and not permanent conditions). Some gateways
have the ability to turn off the sending of ICMP net unreachables, but this
is just a workaround for "broken" host implementations.
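
The rule can be sketched in a few lines of Python. The code numbers are
the ICMP Destination Unreachable codes; the Connection class and the
handler name are hypothetical stand-ins for a real TCP control block,
and what to do with the codes not named above is deliberately left
open:

```python
from dataclasses import dataclass
from typing import Optional

# ICMP Destination Unreachable (type 3) codes.
NET_UNREACHABLE = 0
HOST_UNREACHABLE = 1
PROTOCOL_UNREACHABLE = 2
PORT_UNREACHABLE = 3
FRAGMENTATION_NEEDED = 4
SOURCE_ROUTE_FAILED = 5

# The three codes to be treated as temporary failures.
SOFT_ERRORS = frozenset({NET_UNREACHABLE, HOST_UNREACHABLE, SOURCE_ROUTE_FAILED})


@dataclass
class Connection:
    """Hypothetical stand-in for per-connection TCP state."""
    open: bool = True
    last_soft_error: Optional[int] = None


def is_temporary_failure(code: int) -> bool:
    """True for the codes that must not abort a live connection."""
    return code in SOFT_ERRORS


def on_dest_unreachable(conn: Connection, code: int) -> bool:
    """Record a soft error and keep the connection; return True if handled."""
    if is_temporary_failure(code):
        conn.last_soft_error = code  # remember it, keep retransmitting
        return True
    return False  # other codes: policy not covered by this sketch
```

A connection that receives a net unreachable stays open and merely
remembers the error, so normal retransmission can carry it through a
temporary outage.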

- Havard

barmar@think.com (Barry Margolin) (11/19/90)

Unfortunately, keep-alives are sometimes needed to work around deficiencies
in application protocols.  For instance, there's no way for a server telnet
to detect when the client host has crashed (it could send an IAC
Are-You-There, but there's no standard for the response, so it would
confuse the process receiving the input).

However, I think the common design of keep-alives is incorrect.  The
connection shouldn't be killed as a result of keep-alive timeouts.
Instead, the purpose of keep-alives should be to elicit RSTs from the other
host.  Timeouts can be due to any number of reasons, but a RST indicates
unambiguously that the connection is unusable, because the other end
rebooted or closed the connection itself (perhaps network problems
prevented the FIN from getting through).  If a host crashes, the keepalive
won't actually notice this until it comes back up, which is probably good
enough.

Yes, this will not catch all half-open connections.  If the host dies for
good, or crashes and is given a new address, the other hosts won't
automatically kill their connections to it.  But it's better to leave some
useless connections open than to close some useful connections.  Pinging
for RSTs is fail-safe.
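
The difference between the two designs can be sketched as two decision
functions in Python. All names here (ProbeResult, Action, max_missed
and so on) are illustrative and not taken from any real TCP
implementation:

```python
from enum import Enum


class ProbeResult(Enum):
    ACK = "ack"          # peer answered normally
    RST = "rst"          # peer says the connection does not exist
    TIMEOUT = "timeout"  # nothing came back


class Action(Enum):
    KEEP = "keep"
    CLOSE = "close"


def conventional_policy(result: ProbeResult, missed_probes: int,
                        max_missed: int) -> Action:
    """Common design: give up once too many probes go unanswered."""
    if result is ProbeResult.RST:
        return Action.CLOSE
    if result is ProbeResult.TIMEOUT and missed_probes >= max_missed:
        return Action.CLOSE
    return Action.KEEP


def rst_only_policy(result: ProbeResult) -> Action:
    """Proposed design: only an explicit RST ends the connection."""
    return Action.CLOSE if result is ProbeResult.RST else Action.KEEP
```

Under rst_only_policy a silent network never costs a working
connection; the price is that some dead connections linger until the
peer comes back and answers with a RST.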

--
Barry Margolin, Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar

louie@SAYSHELL.UMD.EDU (Louis A. Mamakos) (11/19/90)

In article <1990Nov19.063111.21768@Think.COM> you write:
>Unfortunately, keep-alives are sometimes needed to work around deficiencies
>in application protocols.  For instance, there's no way for a server telnet
>to detect when the client host has crashed (it could send an IAC
>Are-You-There, but there's no standard for the response, so it would
>confuse the process receiving the input).

The server telnet could periodically send TELNET NOP commands (IAC NOP) to
the client.  This should not cause any response to be generated, but will
poke the TCP retransmission machinery to make sure that the connection is
intact.
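
A minimal sketch in Python of the probe itself. IAC (255) and NOP (241)
are the standard Telnet byte values; the function name is hypothetical,
and how often to call it is left to the server:

```python
import socket

IAC = 255  # Telnet "Interpret As Command" byte
NOP = 241  # Telnet "No Operation" command

TELNET_NOP = bytes([IAC, NOP])


def send_telnet_nop(sock: socket.socket) -> None:
    """Queue an IAC NOP so TCP has data it must deliver and acknowledge."""
    sock.sendall(TELNET_NOP)
```

The send will normally succeed even if the client is gone, since it
only queues the data. The failure typically shows up on a later send or
receive, once TCP has given up retransmitting or the peer has answered
with a RST.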

louie

henry@zoo.toronto.edu (Henry Spencer) (11/20/90)

In article <1990Nov19.063111.21768@Think.COM> barmar@think.com (Barry Margolin) writes:
>Unfortunately, keep-alives are sometimes needed to work around deficiencies
>in application protocols.  For instance, there's no way for a server telnet
>to detect when the client host has crashed ...

I think this is a confusion of mechanism with policy.  A server telnet needs
a way to tell the TCP layer "ping the other end".  That is not the same as
having a wired-in policy that the TCP layer will ping the other end regularly
and break the connection if there is no response.
-- 
"I don't *want* to be normal!"         | Henry Spencer at U of Toronto Zoology
"Not to worry."                        |  henry@zoo.toronto.edu   utzoo!henry

clynn@BBN.COM (Charles Lynn) (11/20/90)

The old TOPS-20 TCP/IP used the RST method of probing for half-open
connections, by sending an unacceptable "SYN-ACK".