MAP@LCS.MIT.EDU (Michael A. Patton) (11/17/90)
Warning: Keep-Alive considered harmful!!!

   Date: 16 Nov 90 16:44:48 GMT
   From: van-bc!ubc-cs!news-server.csri.toronto.edu!utgpu!cunews!bnrgate!bwdls61.bnr.ca!usenet@ucbvax.berkeley.edu (Peter Whittaker)

(Boy, what a bogus address. Good thing you put one in your signature. You may not be able to help the enormous length, but could you try and get your mail service to put YOU, not your daemon, in the header?)

   So, are there in fact substantive reasons not to use SO_KEEPALIVE?

Yes, indeed there are. The implementation of keep-alives in the TCP level is just WRONG. Although the actual details are (usually) not a strict violation of the spec, they do stretch it in interesting ways and I have seen at least one implementation that would sometimes RESET a connection in response to a keep-alive packet.

If you want the functionality that you think you get automagically by setting this option, you should really think about the specifics more. Think about what functionality you really want. Do you just want it to free up the socket or kill a server? Why do you care? Are you attacking a symptom rather than the real problem?

Usually, the application needs much better control over how it really works than just setting an option that causes it to crash if something goes wrong. You should think about the many effects that can influence this (I'll list a few presently) and consider whether you want these in your application domain. Beware that you may only intend to run your application in a limited context, but eventually someone will try it over a different domain. Please consider the different cases before going for options like this.

One thing to consider is that there are links in the world where you occasionally get only a packet every 5-10 minutes to actually go through. Do you want to forbid the use of your service over such a link? Just this morning I had a user complaint that they couldn't FTP a file between two distant hosts. The problem resolved to a link that dropped out for several minutes every half hour or so, but the transfer time for the file they wanted was 45 minutes. When the link dropped out, they would get punted because of keep-alives, then they had to start over. If only they weren't running keep-alives on the FTP Server it would have worked, the person doing the transfer had enough patience, if only the computer had.

Another thing to consider is that in any application with a person in control, they are usually a better watch-dog timer than any program you build. This might suggest building some user-interface feature to help. The hash marks in some versions of FTP are one example. Or you might build the high-level timeout, but rather than punting the operation, merely print out a message to the user explaining how they could do it.

There are a few other general things to watch out for when building any distributed system, these are a few related to the keep-alive question. Never predefine a timeout, it will eventually be too small. As an example I have an FTP script that had a global timeout on a transfer that caused it to quit and go on to the next. I discovered that for one combination of server and file, FIVE HOURS was not enough, upping it to 10 got that file transferred (but wreaked havoc with my other assumptions :-).

Well, there are several more things I could bring up, but I seem to have rambled on for over a page already so I'll cut it short here. I hope this helps to answer your questions and give you some ideas as to why keep-alive is considered harmful. Hopefully it will point you in a useful direction for developing what you really need as well.

            __
  /|  /|  /|  \         Michael A. Patton, Network Manager
 / | / | /_|__/         Laboratory for Computer Science
/  |/  |/  |atton       Massachusetts Institute of Technology

Disclaimer: The opinions expressed above are a figment of the phosphor on your screen and do not represent the views of MIT, LCS, or MAP. :-)
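
A minimal Python sketch of the "high-level timeout that prints a message rather than punting" idea from the post above. The class name StallAdvisor, its stall_after parameter, and the wording of the message are all hypothetical, chosen only for illustration. The threshold is advisory: when it passes, the user is told about the stall and nothing is aborted, so a value that later proves too small costs only a message.

```python
import time


class StallAdvisor:
    """Tracks progress on a transfer and advises the user when it stalls.

    It never aborts anything: the person at the keyboard decides.
    """

    def __init__(self, stall_after=60.0, clock=time.monotonic):
        self.stall_after = stall_after  # seconds of silence before advising
        self.clock = clock              # injectable so the class can be tested
        self.last_progress = clock()
        self.warned = False

    def progress(self):
        """Call whenever data arrives; this re-arms the advisory."""
        self.last_progress = self.clock()
        self.warned = False

    def check(self):
        """Return an advisory message once per stall, otherwise None."""
        idle = self.clock() - self.last_progress
        if idle >= self.stall_after and not self.warned:
            self.warned = True
            return ("No data for %d seconds; the link may be down. "
                    "Still waiting. Interrupt to abort." % idle)
        return None
```

A transfer loop would call progress() after each block received and check() from its idle path, printing whatever check() returns.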
he@spurv.runit.sintef.no (Havard Eidnes) (11/18/90)
I agree that using TCP keepalives is a bad idea. I just want to comment on the specific example Michael A. Patton mentioned, since it lets me, "with the Internet Host Requirements in hand", point out one other area where traditional implementations of TCP should be changed to improve robustness in the case of temporary network failures.

In article <9011170344.AA20268@gaak.LCS.MIT.EDU> MAP@LCS.MIT.EDU (Michael A. Patton) writes:
>
>Just this morning I had a user complaint that they couldn't FTP
>a file between two distant hosts. The problem resolved to a link that
>dropped out for several minutes every half hour or so, but the transfer
>time for the file they wanted was 45 minutes. When the link dropped
>out, they would get punted because of keep-alives, then they had to
>start over. If only they weren't running keep-alives on the FTP
>Server it would have worked, the person doing the transfer had enough
>patience, if only the computer had.

It is of course possible that this connection was blown away by the server using TCP keepalives. However, one other possibility (perhaps more likely) is that an intermediate gateway issued ICMP net unreachable messages while the temporary network outage lasted. Traditional implementations of TCP (including the original BSD 4.3 version) blow away a live TCP connection when they receive an ICMP net unreachable message.

The Host Requirements state that a host implementation of TCP MUST NOT do this (specifically: the ICMP messages net unreachable, host unreachable, and source route failed should be treated as temporary failures, not permanent conditions). Some gateways have the ability to turn off the sending of ICMP net unreachables, but this is just a workaround for "broken" host implementations.

- Havard
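
The rule described above can be sketched in Python as follows. The Connection class and its method are hypothetical stand-ins for a TCP implementation's error handling, not any real stack's API. The numeric values are the standard ICMP Destination Unreachable codes for the three messages named in the post; other codes are deliberately left outside the sketch.

```python
# ICMP Destination Unreachable (type 3) codes for the messages named above.
NET_UNREACHABLE = 0
HOST_UNREACHABLE = 1
SOURCE_ROUTE_FAILED = 5

SOFT_ERROR_CODES = {NET_UNREACHABLE, HOST_UNREACHABLE, SOURCE_ROUTE_FAILED}


class Connection:
    """Hypothetical stand-in for one TCP connection's error state."""

    def __init__(self):
        self.open = True
        self.soft_error = None  # last temporary error seen, kept for reference

    def on_icmp_unreachable(self, code):
        """Handle an ICMP Destination Unreachable with the given code."""
        if code in SOFT_ERROR_CODES:
            # Temporary condition: remember it and leave the connection
            # alone so normal retransmission can ride out the outage.
            self.soft_error = code
            return "keep"
        # Codes not discussed in the post are not decided by this sketch.
        return "not-handled-here"
```

With this handling, the gateway's net unreachable messages during a few minutes of outage would leave the FTP connection in the example intact.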
barmar@think.com (Barry Margolin) (11/19/90)
Unfortunately, keep-alives are sometimes needed to work around deficiencies in application protocols. For instance, there's no way for a server telnet to detect when the client host has crashed (it could send an IAC Are-You-There, but there's no standard for the response, so it would confuse the process receiving the input).

However, I think the common design of keep-alives is incorrect. The connection shouldn't be killed as a result of keep-alive timeouts. Instead, the purpose of keep-alives should be to elicit RSTs from the other host. Timeouts can be due to any number of reasons, but a RST indicates unambiguously that the connection is unusable, because the other end rebooted or closed the connection itself (perhaps network problems prevented the FIN from getting through). If a host crashes, the keepalive won't actually notice this until it comes back up, which is probably good enough.

Yes, this will not catch all half-open connections. If the host dies for good, or crashes and is given a new address, the other hosts won't automatically kill their connections to it. But it's better to leave some useless connections open than to close some useful connections. Pinging for RSTs is fail-safe.

--
Barry Margolin, Thinking Machines Corp.
barmar@think.com
{uunet,harvard}!think!barmar
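
A small Python sketch of the probe policy argued for above: only a RST closes the connection, and a timeout changes nothing. The names Probe, after_probe, and connection_survives are hypothetical; real keep-alive handling lives inside the TCP implementation, so this only models the decision.

```python
from enum import Enum


class Probe(Enum):
    """Possible outcomes of one keep-alive probe (illustrative names)."""
    ACK = "ack"          # peer answered normally
    TIMEOUT = "timeout"  # nothing came back
    RST = "rst"          # peer says the connection does not exist


def after_probe(result):
    """Decide what to do with the connection after one probe."""
    if result is Probe.RST:
        return "close"   # unambiguous: the other end has no such connection
    return "keep"        # an ACK or mere silence never kills the connection


def connection_survives(probe_results):
    """Apply the policy to a sequence of probe outcomes."""
    for result in probe_results:
        if after_probe(result) == "close":
            return False
    return True
```

Under this policy a link that drops out for several minutes produces only timeouts, so the connection is still there when the link returns.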
louie@SAYSHELL.UMD.EDU (Louis A. Mamakos) (11/19/90)
In article <1990Nov19.063111.21768@Think.COM> you write:
>Unfortunately, keep-alives are sometimes needed to work around deficiencies
>in application protocols. For instance, there's no way for a server telnet
>to detect when the client host has crashed (it could send an IAC
>Are-You-There, but there's no standard for the response, so it would
>confuse the process receiving the input).

The server telnet could periodically send TELNET NOP commands (IAC NOP) to the client. This should not cause any response to be generated, but will poke the TCP retransmission machinery to make sure that the connection is intact.

louie
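
A minimal Python sketch of the IAC NOP idea above. The function name send_telnet_nop is hypothetical; the byte values are the Telnet protocol's IAC (255) and NOP (241) codes. How often to call it is left to the application.

```python
import socket

IAC = 255  # Telnet "Interpret As Command" escape byte
NOP = 241  # Telnet "No Operation" command


def send_telnet_nop(sock):
    """Send one IAC NOP on a connected socket.

    The client should ignore it, but the data forces TCP to get an
    acknowledgment, so a dead peer eventually surfaces as a send error.
    """
    sock.sendall(bytes([IAC, NOP]))
```

A server would call this from a timer on idle sessions and treat an OSError from this or a later send as the sign that the connection has failed.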
henry@zoo.toronto.edu (Henry Spencer) (11/20/90)
In article <1990Nov19.063111.21768@Think.COM> barmar@think.com (Barry Margolin) writes:
>Unfortunately, keep-alives are sometimes needed to work around deficiencies
>in application protocols. For instance, there's no way for a server telnet
>to detect when the client host has crashed ...

I think this is a confusion of mechanism with policy. A server telnet needs a way to tell the TCP layer "ping the other end". That is not the same as having a wired-in policy that the TCP layer will ping the other end regularly and break the connection if there is no response.

--
"I don't *want* to be normal!"         | Henry Spencer at U of Toronto Zoology
"Not to worry."                        | henry@zoo.toronto.edu   utzoo!henry
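
The mechanism/policy split above might look like this in Python. Everything here is hypothetical: ProbingTransport, its ping method, and telnet_server_policy are invented for illustration, and send_probe stands in for whatever the TCP layer would actually do. The point is only that the transport offers the probe and the application owns the decision.

```python
class ProbingTransport:
    """Hypothetical transport: offers a probe mechanism, holds no policy."""

    def __init__(self, send_probe):
        # send_probe is a caller-supplied function returning True if the
        # peer answered; it is a stand-in, not a real TCP interface.
        self._send_probe = send_probe
        self.open = True

    def ping(self):
        """Mechanism only: probe once and report. Never closes anything."""
        return self._send_probe()


def telnet_server_policy(transport, max_missed, missed_so_far):
    """One application's policy, kept outside the transport.

    Returns the updated count of consecutive unanswered probes.
    """
    if transport.ping():
        return 0                  # peer answered; reset the count
    missed = missed_so_far + 1
    if missed >= max_missed:
        transport.open = False    # this application chose to give up
    return missed
```

A different application, such as a long file transfer, could use the same ping() and simply never close on a missed probe.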
clynn@BBN.COM (Charles Lynn) (11/20/90)
The old TOPS-20 TCP/IP used the RST method of probing for half-open connections, by sending an unacceptable "SYN-ACK".