[comp.protocols.tcp-ip] SO_KEEPALIVE considered harmful?

rws@EXPO.LCS.MIT.EDU (05/23/89)

I have a random question that I hope this illustrious audience can answer
definitively for me (or else point me to a definitive source).  Is the BSD
notion of SO_KEEPALIVE on a TCP connection considered kosher with respect to
the TCP specification?  If so, is its use to be encouraged?  Specifically,
it has been suggested that in the X Window System world, X libraries
should automatically be setting SO_KEEPALIVE on connections to X servers.  Is this
a reasonable thing to do?
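
For concreteness, "setting SO_KEEPALIVE" amounts to the standard BSD
sockets call sketched below ("fd" here stands for the library's
connected socket to the server):

	#include <sys/types.h>
	#include <sys/socket.h>

	/* Sketch: enable TCP keepalives on a connected socket. */
	int
	enable_keepalive(int fd)
	{
		int on = 1;

		return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE,
		    (char *)&on, sizeof(on));
	}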

[If this is a totally inappropriate forum for this question, I apologize.]

dcrocker@AHWAHNEE.STANFORD.EDU (Dave Crocker) (05/23/89)

The use of Keepalives is terrible, but sometimes necessary.  The key
word, here, is "sometimes".

The "terrible" is due to the fact that they add traffic to the net.  An
important point to keep in mind, with TCP connections, is that they may
span the globe, over thin wires.  Extra traffic can have a very serious
effect.  Further, they scale poorly.  The incremental traffic from one
connection may not be onerous, but what about 1000 connections?  Lastly,
of course, there is the small fact that there may be a charge for those
extra packets, such as may happen if one of the links along the path
is over a public X.25 network.

If the group proposing the use of Keepalives has already gone through the
exercise of convincing themselves that critical functionality will be
lost if they are not used, then I hope the next question was/is how
to minimize their use.

Dave

craig@NNSC.NSF.NET (Craig Partridge) (05/23/89)

> I have a random question that I hope this illustrious audience can answer
> definitively for me (or else point me to a definitive source).  Is the BSD
> notion of SO_KEEPALIVE on a TCP connection considered kosher with respect to
> the TCP specification?  If so, is its use to be encouraged?  Specifically,
> it has been suggested that in the X Window System world, X libraries
> should automatically be setting SO_KEEPALIVE on connections to X servers.  Is this
> a reasonable thing to do?

Oh what fun!  Keepalive wars return....

Well, I'm a firm hater of keep-alives, although Mike Karels has persuaded
me that in the current world they are a useful tool for catching clients
that go off into hyperspace without telling you.  I have lots of fellow
travellers (actually, I'm probably a fellow traveller with Phil Karn,
president of the "I hate keep-alives" party), witness the current host
requirements text, which is appended.

Craig

	Implementors MAY include "keep-alives" in their TCP           |
	implementations, although this practice is not universally    |
	accepted.  If keep-alives are included, the application MUST  |
	be able to turn them on or off for each TCP connection, and   |
	they MUST default to off.                                     |

	Keep-alive packets MUST NOT be sent when any data or          |
	acknowledgement packets have been received for the            |
	connection within a configurable interval; this interval      |
	MUST default to no less than two hours.                       |

	An implementation SHOULD send a keep-alive segment with no    |
	data; however, it MAY be configurable to send a keep-alive    |
	segment containing one garbage octet, for compatibility       |
	with erroneous TCP implementations.                           |


	DISCUSSION:                                                   |
	     A "keep-alive" mechanism would periodically probe the    |
	     other end of a connection when the connection was        |
	     otherwise idle, even when there was no data to be sent.  |
	     The TCP specification does not include a keep-alive      |
	     mechanism because it could:  (1) cause perfectly good    |
	     connections to break during transient Internet           |
	     failures; (2) consume unnecessary bandwidth ("if no one  |
	     is using the connection, who cares if it is still        |
	     good?"); and (3) cost money for an Internet path that    |
	     charges for packets.                                     |

	     Some TCP implementations, however, have included a       |
	     keep-alive mechanism. To confirm that an idle            |
	     connection is still active, these implementations send   |
	     a probe segment designed to elicit a response from the   |
	     peer TCP.  Such a segment generally contains SEG.SEQ =   |
	     SND.NXT-1.  The segment may or may not contain one       |
	     garbage octet of data.  Note that on a quiet             |
	     connection, SND.NXT = RCV.NXT and SEG.SEQ will be        |
	     outside the window.  Therefore, the probe causes the     |
	     receiver to return an acknowledgment segment,            |
	     confirming that the connection is still live.  If the    |
	     peer has dropped the connection due to a network         |
	     partition or a crash, it will respond with a reset       |
	     instead of an acknowledgement.                           |

	     Unfortunately, some misbehaved TCP implementations fail  |
	     to respond to a segment with SEG.SEQ = SND.NXT-1 unless  |
	     the segment contains data.  Alternatively, an            |
	     implementation could determine whether a peer responded  |
	     correctly to keep-alive packets with no garbage data     |
	     octet.                                                   |

	     A TCP keep-alive mechanism should only be invoked in     |
	     network servers that might otherwise hang indefinitely   |
	     and consume resources unnecessarily if a client crashes  |
	     or aborts a connection during a network partition.       |
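
In code, the probe described in that DISCUSSION comes down to
something like the sketch below; the "tcb" and "seg" types are
hypothetical, with fields named after the spec's variables:

	/* Hypothetical connection block and segment; real stacks differ. */
	struct tcb { unsigned long snd_nxt, rcv_nxt; };
	struct seg { unsigned long seq, ack; int len; };

	/* Build a keep-alive probe.  SEG.SEQ = SND.NXT-1 falls outside
	 * the peer's window on an idle connection, so a live peer must
	 * answer with an ACK, while a peer that has lost the connection
	 * answers with a RST. */
	void
	make_keepalive_probe(struct tcb *tp, struct seg *s, int garbage)
	{
		s->seq = tp->snd_nxt - 1;
		s->ack = tp->rcv_nxt;
		s->len = garbage ? 1 : 0;	/* one octet for broken peers */
	}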

hrp@boring.cray.com (Hal Peterson) (05/24/89)

Here are a couple of relevant extracts from section 4.2.3.5 of the 17
May draft of the Requirements for Internet Hosts RFC:

                 A "keep-alive" mechanism would periodically probe the    |
                 other end of a connection when the connection was        |
                 otherwise idle, even when there was no data to be sent.  |
                 The TCP specification does not include a keep-alive      |
                 mechanism because it could:  (1) cause perfectly good    |
                 connections to break during transient Internet           |
                 failures; (2) consume unnecessary bandwidth ("if no one  |
                 is using the connection, who cares if it is still        |
                 good?"); and (3) cost money for an Internet path that    |
                 charges for packets.                                     |

[ . . . ]

                 A TCP keep-alive mechanism should only be invoked in     |
                 network servers that might otherwise hang indefinitely   |
                 and consume resources unnecessarily if a client crashes  |
                 or aborts a connection during a network partition.       |

Bob Braden points out that one of the design goals of TCP/IP was and
is robustness in the face of errors: even if a few gateways melt down,
the TCP connections that had been using them should pick up where they
left off when new routes materialize.  Keepalives explicitly defeat
this.

The pros and cons, however, are subject to some disagreement.

--
Hal Peterson			Domain:  hrp@cray.com
Cray Research			Old style:  hrp%cray.com@uc.msc.umn.edu
1440 Northland Dr.		UUCP:  uunet!cray!hrp
Mendota Hts, MN  55120  USA	Telephone:  +1 612 681 3145

mo@prisma.UUCP (05/24/89)

If Keepalives are not (judiciously!) used, how does one transparently
discover that the other end of the connection has died a horrible,
sudden death?

One can argue whether this is a transport or session function, but
the ability to lose one end of the connection while the passive
end just hangs forever is NOT a feature.

	-Mike

dcrocker@AHWAHNEE.STANFORD.EDU (Dave Crocker) (05/24/89)

I tried to avoid saying that keepalives should be prohibited, except,
perhaps, from an aesthetic point of view.  Since aesthetics often are
altered by reality, it is no great concession to acknowledge the 
occasional need for the mechanism.

My point was that they are dangerous and therefore should be used VERY
judiciously.  Craig's note puts this point forward in more detail.

It is worth adding that the excessive use of keepalives has removed a
feature that used to be in TCP and has been recently re-documented by
Bob Braden:  TCP used to be remarkably robust against temporary
outages.  If you were willing to wait, so was TCP.  Now, an outage of
a very short time -- on some implementations, as short as 1-2 minutes --
will abort the connection.

Dave

casey@gauss.llnl.gov (Casey Leedom) (05/24/89)

| From: dcrocker@AHWAHNEE.STANFORD.EDU (Dave Crocker)
| 
| If the group proposing the use of Keepalives has already gone through the
| exercise of convincing themselves that critical functionality will be
| lost if they are not used, then I hope the next question was/is how to
| minimize their use.

  I think that the big problem that Robert may be trying to deal with is
server crashes.  (Correct me if I'm totally off the deep end, Robert.)
Currently, when an X.V11R3 server crashes or simply exits for normal
reasons and there are still clients using it, those clients will
(typically) lie around forever because they never try to contact the
server on their own and they never receive anything from the [now
defunct] server.  [One exception to this is the "xperfmon" client which
periodically attempts to update a system statistics display.  When the X
server disappears, xperfmon starts gobbling up reams of CPU time, not
recognizing the closed connection for what it is.  But this is just a
coding error.]

  I would say that any X client which only tries to use its connection to
the server in response to input from the server should run with
keep-alives on the connection.  Otherwise it will never exit.  I'm
constantly having to go around killing off abandoned xterms because some
people just can't remember to terminate all their clients before shutting
down their server.
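
  With keep-alives set, the client-side fix is cheap: a read that
would otherwise block forever eventually returns an error (or EOF)
and the client can exit.  A minimal sketch, with "fd" assumed to be
the server connection:

	#include <errno.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>

	void
	read_loop(int fd)
	{
		char buf[512];
		int n;

		for (;;) {
			n = read(fd, buf, sizeof(buf));
			if (n > 0)
				continue;	/* process server events here */
			if (n < 0 && errno == EINTR)
				continue;
			break;	/* EOF, or a hard error such as ETIMEDOUT
				 * from an unanswered keep-alive */
		}
		fprintf(stderr, "connection to X server lost\n");
		exit(1);
	}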

Casey

rws@EXPO.LCS.MIT.EDU (05/24/89)

Thanks for all the responses so far, I think I get the picture.

braden@VENERA.ISI.EDU (05/24/89)

I don't believe anyone has advocated that keep-alives are a bad thing...
indeed, they appear to be a necessity in an imperfect world.  The
controversy (for the past 10 years, at least!) is whether or not they
belong in TCP.  The decision of the TCP/IP developers was that 
keepalives ought to be in the application layer, not the transport layer.

Each application has its own parameters for keepalive.  Furthermore,
cautious application implementors may already have application-level
keepalives, and economy of protocol mechanism argues for having the
functionality at only one level.  On the other hand, one can (and
some people do) argue that economy of mechanism requires that TCP
provide a keepalive mechanism that may be invoked and parametrized
by an application.  The Host Requirements RFC explicitly allows that. 

Bob Braden

mre@beatnix.UUCP (Mike Eisler) (05/25/89)

In article <8905231205.AA00500@expire.lcs.mit.edu> rws@EXPO.LCS.MIT.EDU writes:
>I have a random question that I hope this illustrious audience can answer
>definitively for me (or else point me to a definitive source).  Is the BSD
>notion of SO_KEEPALIVE on a TCP connection considered kosher with respect to
>the TCP specification?  If so, is its use to be encouraged?  Specifically,
>it has been suggested that in the X Window System world, X libraries
>should automatically be setting SO_KEEPALIVE on connections to X servers.

When we brought up X on our BSD systems we tested it against a Visual Graphics
640 X-term. xterm was set up to be spawned by init. When the Visual was powered
off during a connection, a new x-term wouldn't get respawned. Analysis
of the BSD client showed the old x-term connection intact, and the xterm
process waiting for a message from the Visual which it would never get. We
figured KEEP alives would solve the problem and put them into the X
library. We found that this cured the problem when the Visual was powered
off for a long time; the KEEP alives eventually timed out waiting for a
response.

But for a quick power-off/power-on, KEEPs didn't help. KEEPs are
implemented as 1 byte segments containing rcv_next-1, snd_una-1 as the
ACK and SEQ number values (i.e., a 1 byte segment that the segment's
receiver has already acknowledged, containing an ACK sequence # for a
byte that the segment's sender has already received).  The Visual is
listening for an X connection, and as expected responds with a 0 byte
reset, using rcv_next-1 as the SEQ number value.  After getting the
reset, BSD resets the KEEP alive timer because it has "proof" that the
connection is no longer idle. BSD then proceeds to follow instructions
of section 9.2.15.2 "Reset processing" in MIL-STD-1778 (12 Aug 83):

	" ... A reset is valid if its sequence number is in the connection's
	receive window. ... "

Well, rcv_next-1 is not in the xterm client's window, so the reset is
tossed, *after* the KEEP timer was reset. So the BSD client sends
another KEEP a few seconds later and the process repeats itself.  So
we don't get a connection reset, and we don't even get a connection
timeout as a consolation prize.  I suppose we could have "fixed" the
BSD code to not reset the KEEP timer on resets, but we wanted to have
something that would work in the field on existing versions of our
O/S.  We hacked xterm to send the NOP request of the X protocol to
the server every so often and this has the desired effect (I'm putting
on my asbestos suit now...) of getting the immediate reset from the
Visual, *within* the client's window. The KEEP alive feature doesn't
seem that well thought out. Nor does server crash recovery seem well
thought out in X.
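
The hack is essentially the sketch below (not our actual change: a
real client would drive this off its main event loop rather than
blocking, and the interval is arbitrary):

	#include <X11/Xlib.h>
	#include <unistd.h>

	/* Application-level keep-alive: send the X protocol NOP every
	 * so often.  A server that has dropped the connection answers
	 * with a reset *inside* our window, which TCP honors. */
	void
	x_nop_keepalive(Display *dpy)
	{
		for (;;) {
			sleep(300);	/* "every so often" */
			XNoOp(dpy);
			XFlush(dpy);	/* Xlib I/O error if server is gone */
		}
	}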
	-Mike Eisler {uunet,sun}!elxsi!mre

phil@ux1.cso.uiuc.edu (05/25/89)

>	     A TCP keep-alive mechanism should only be invoked in     |
>	     network servers that might otherwise hang indefinitely   |
>	     and consume resources unnecessarily if a client crashes  |
>	     or aborts a connection during a network partition.       |

Even this should be unnecessary for servers that have a specific timeout,
e.g. FTP or SMTP will drop on you if you are idle too long.
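
The usual shape of such a timeout is a select() with a deadline.  A
sketch (the 15-minute figure is arbitrary):

	#include <sys/types.h>
	#include <sys/time.h>
	#include <unistd.h>

	/* Application-level idle timeout: wait for client input and
	 * hang up if the client says nothing for too long. */
	int
	wait_for_client(int fd)
	{
		fd_set rfds;
		struct timeval tv;

		FD_ZERO(&rfds);
		FD_SET(fd, &rfds);
		tv.tv_sec = 15 * 60;
		tv.tv_usec = 0;
		if (select(fd + 1, &rfds, (fd_set *)0, (fd_set *)0, &tv) <= 0) {
			close(fd);	/* idle too long (or error) */
			return -1;
		}
		return 0;	/* client spoke; go read from it */
	}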


--Phil howard--  <phil@ux1.cso.uiuc.edu>

barmar@think.COM (Barry Margolin) (05/25/89)

In article <8905250638.AA21706@ucbvax.Berkeley.EDU> dcrocker@AHWAHNEE.STANFORD.EDU (Dave Crocker) writes:
>It is worth adding that the excessive use of keepalives has removed a
>feature that used to be in TCP and has been recently re-documented by
>Bob Braden:  TCP used to be remarkably robust against temporary
>outages.  If you were willing to wait, so was TCP.  Now, an outage of
>a very short time -- on some implementations, as short as 1-2 minutes --
>will abort the connection.

I dispute this claim.  TCP is only robust against temporary outages if
you don't try to use the connection during that period.  For instance,
if I'm using telnet, the connection will stay alive during outages if
I don't type anything to the client and the host doesn't try to send
any output.  If either end tries to use the connection, and the outage
is longer than the TCP acknowledgement timeout, then the connection
will die.  If I happen to know that the network is having trouble I
won't type anything, but how often is this the case?  What it mostly
means is that a temporary outage after I go home won't break my
connections.

TCP's robustness is still a good idea.  It's nice to be able to swap
Ethernet cables without causing all the network connections to die.
But in my experience (which, I admit, isn't all that extensive), any
connection that dies for more than a minute or two probably isn't
going to come back.

What I mostly care about, though, is that the other end definitely has
reinitialized, e.g. it has crashed and been rebooted.  If it's a
telnet server that crashed I can do this by typing into the client,
which will provoke a reset, and the client will abort.  But if it's
the telnet client or an X server that died, there's often no way to
force the other end to try to send something so it will get a reset.

I think the right solution is a compromise.  What's needed is a way to
send a segment with infinite (or near-infinite, e.g. hours or a day)
retransmissions and slow retransmit rate (one to two minutes).  This
would allow idle connections to stay up across most network failures,
but they will die within a minute or so of the other end rebooting.
And, of course, it should be optional, so that applications that
perform frequent output of their own need not compound their network
use (although since keepalives need only be sent when there are no
normal packets in the retransmit queue, any application whose output
rate is higher than the keepalive rate will never invoke the keepalive
mechanism).
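
In socket terms the compromise would look something like the sketch
below, assuming per-connection knobs of the TCP_KEEPIDLE /
TCP_KEEPINTVL / TCP_KEEPCNT variety (which not every stack provides):

	#include <sys/socket.h>
	#include <netinet/in.h>
	#include <netinet/tcp.h>

	/* Probe only after a long idle period, retransmit the probe at
	 * a slow rate, and allow a near-infinite number of unanswered
	 * probes (1000 * 90 sec is roughly a day).  A rebooted peer
	 * still kills the connection quickly, because its RST is
	 * honored as soon as it comes back up. */
	int
	slow_persistent_probes(int fd)
	{
		int on = 1;
		int idle = 2 * 60 * 60;	/* seconds before first probe */
		int intvl = 90;		/* one-to-two minute probe rate */
		int cnt = 1000;		/* retries before giving up */

		if (setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on) < 0)
			return -1;
		setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE, &idle, sizeof idle);
		setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl, sizeof intvl);
		setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT, &cnt, sizeof cnt);
		return 0;
	}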

Barry Margolin
Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar

MRC@CAC.WASHINGTON.EDU (Mark Crispin) (05/26/89)

     I like Barry Margolin's suggestion a lot.  I am responsible for several
servers which have autologout timers *solely* to handle the case of a client
getting rebooted with no mechanism for the server to get a reset.  It has been
shown to be virtually impossible to pick a timer value short enough to be of
use (particularly if a resource is locked up while the server lives) yet long
enough to live across some of the delays and temporary outages we see on the
operating network.

-- Mark --

-------

PADLIPSKY@A.ISI.EDU (Michael Padlipsky) (05/26/89)

In the context of crashes/reboots, the TCP Initial Sequence Number magic
is SUPPOSED to save you from "embarrassment" (or so it says here).

In the context of Telnet, periodic phoney traffic is completely counter to
the desire/necessity for certain Hosts to abort inactive connections (lest,
e.g., a terminal be borrowed by an unauthorized user when the authorized
user broke for coffee too long)--unless, of course, such traffic is never "seen"
by the relevant timer-outer.

    cheers, map
    Past President, IHK-A's
    (unless Phil Karn started agitating against 'em before I wrote
    what became p. 151 of The Book)
-------

karn@jupiter (Phil R. Karn) (05/26/89)

>>It is worth adding that the excessive use of keepalives has removed a
>>feature that used to be in TCP and has been recently re-documented by
>>Bob Braden:  TCP used to be remarkably robust against temporary
>>outages. [...]

>I dispute this claim.  TCP is only robust against temporary outages if
>you don't try to use the connection during that period.

TCP becomes quite robust against all outages (whether or not the
connection is idle) once you make a very simple change: get rid of TCP
level timeouts!

I feel very strongly that TCP should *never* just give up of its own
accord; that decision belongs to the application. And, in the event the
application is an interactive one, the decision to abort should be left
to the human user. If he's willing to wait, why shouldn't the system let
him? (The only case when TCP should abort a connection on its own is
when it has clear proof that the other end has crashed, i.e., by
receiving a valid RST.)

Users of my TCP/IP package on amateur packet radio occasionally report
cases of FTP transfers that resume automatically after network outages
lasting for *days* (e.g., those due to crashes of network nodes in
remote locations that require manual resets).  They are most happy to do
without TCP give-up timers, as long as TCP backs off its retransmissions
to avoid channel congestion.
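
The mechanics are simple.  A sketch (the control block, its fields,
and the helper are hypothetical stand-ins, not the actual NOS code):

	/* Hypothetical per-connection state; backoff starts at 1. */
	struct tcb { long srtt, rto; int backoff; };

	extern void send_oldest_unacked(struct tcb *);	/* hypothetical */

	#define MAXBACKOFF 64

	/* Retransmission timeout with no give-up timer: back off so a
	 * dead path is not flooded, then try again -- forever.  Only a
	 * valid RST (proof the other end is gone) or the application
	 * itself ever aborts the connection. */
	void
	retransmit_timeout(struct tcb *tp)
	{
		if (tp->backoff < MAXBACKOFF)
			tp->backoff *= 2;
		tp->rto = tp->srtt * tp->backoff;	/* reschedule */
		send_oldest_unacked(tp);
	}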

Phil

barmar@THINK.COM (Barry Margolin) (05/26/89)

    Date: Thu, 25 May 89 13:32:04 PDT
    From: braden@venera.isi.edu


    Sorry, but Dave Crocker is perfectly correct.  The behaviour that you
    describe is a property of many current-generation LAN-oriented TCP's
    [a transparent euphemism], but not of the original research TCP's that
    were WAN-oriented ... nor even of a TAC.  A host implementation that
    follows the Host Requirements RFC can behave like a TAC for Telnet
    connections: tell the user when it is retransmitting excessively, but DO
    NOT CLOSE the connection.  Let the user decide when to give up.  I
    don't think we users should accept anything less of our communication
    software.

RFC-793, which defines TCP, says, "If data is not successfully delivered
to the destination within the timeout period, the TCP will abort the
connection."  I can believe that the Host Requirements RFC changes
"abort the connection" to "signal an error", but this contradicts your
claim that original TCPs were more forgiving.

Also, how is a TELNET server or xterm client supposed to tell the user
when it is retransmitting excessively?  Its communication path to the
user is the failing connection.  Sure, it could put something in a
system log or write a message to the system console, but how is the
operator (if there is one) supposed to know why the remote machine isn't
responding?

                                                barmar

dcrocker@AHWAHNEE.STANFORD.EDU (Dave Crocker) (05/26/89)

The issue of aborting a connection, due to a retransmission timeout, is
the choice of the application.  Telnet could, as easily, decide to
keep trying.

Dave

davecb@yunexus.UUCP (David Collier-Brown) (05/26/89)

In article <20761@news.Think.COM> barmar@kulla.think.com.UUCP (Barry
Margolin) writes:
| TCP's robustness is still a good idea.  It's nice
| to be able to swap Ethernet cables without causing all the network
| connections to die.  But in my experience (which, I admit, isn't all
| that extensive), any connection that dies for more than a minute or
| two probably isn't going to come back.  [...]

  Actually the connection might well come back: I had a crossbar switch
that timed me out every so often, on the assumption that I wouldn't leave
the terminal for substantial periods without disconnecting.  (This is silly,
but not unreasonable for a device which thinks it's switching telephone
voice lines.)
  After I got back to the terminal controller I could then reconnect
to my process.

  Keepalives would be more secure in such a situation (anyone could
pretend to be me if the tty server disconnected), but would tend to
cause me to lose work-in-process...
  Methinks that a facility for polling a connection makes sense, and
that one to send "reset the poll clock, if any" (keepalives redux)
would also be useful.  As does Barry, I'd propose they be optional.  I'd
also propose that
	1) if one exists, so must its complement
	2) they be composed out of existing facilities, as were
	  keepalives, and
	3) they be distinguishable from any other facility (unambiguous).

--dave
  

dcrocker@AHWAHNEE.STANFORD.EDU (Dave Crocker) (05/26/89)

Phil,

As a test-of-concept:  I assume that you have no objection to a TCP
implementation's being able to do keepalives, under the control of the
application, where both the fact of keepalives AND their periodicity
can be specified; and the effect of a timeout is a signal to the
application, not an abort?
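
In interface terms, something like this (purely hypothetical, not any
existing stack's API):

	/* Hypothetical per-connection keepalive control: the
	 * application picks both the fact and the period of probing,
	 * and a timeout notifies rather than aborts. */
	enum ka_action { KA_ABORT, KA_NOTIFY };

	struct keepalive_opt {
		int		enabled;	/* default: off */
		int		period;		/* seconds between probes */
		enum ka_action	action;		/* KA_NOTIFY: signal the app */
	};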

Dave

barns@GATEWAY.MITRE.ORG (Bill Barns) (05/26/89)

Sigh.  The Tenex TCP of ages ago certainly allowed the user timeout to
be set infinite, by specifying a value of 0.  If there are older ones
than that, I don't think they can be much older!  I think the claims
about the old TCP's having this capability are grounded in fact.

However, it just may be that Bob & Dave fell into a trap here.  I
almost wrote the message they both wrote, but went off to check RFC 793
and I did not locate any text stating that an RFC 793 conformant TCP
necessarily has to provide any particular range of user timeout
settings, except that the default is five minutes.  And I was just SURE
it was there.  Oops.  The designers probably had it so firmly in their
minds from prior discussions that they forgot to write it down explicitly.

THAT sounds like a job for **!!HOST REQUIREMENTS MAN!!**

However**2, TCP is ALSO specified in MIL-STD-1778, and it DOES have an
explicit requirement for the TCP to allow the upper layer to choose
whether a user timeout should result in a notification to the upper
layer or should cause the TCP to abort the connection.  This is
referred to in many places but the most coherent description is in
section 9.2.9.  For your convenience, I've appended it below.

Sad to say, there are (other) inconsistencies within the MIL-STD and
between it and the RFC.  The MIL-STD, section 9.4.4.7, sets the default
timeout as 120 unidentified units.  Obviously 24ths of a minute... etc.

Bill Barns / MITRE-Washington / barns@gateway.mitre.org

-------

[MIL-STD-1778]

9.2.9  ULP timeout and ULP timeout action.  The timeout allows a ULP to
set up a timeout for all data submitted to the TCP entity.  If some
data is not successfully delivered to the destination within the
timeout period, the state of ULP_timeout_action is checked.  If
ULP_timeout_action is 1, the TCP entity will terminate the connection.
If it is 0, the TCP entity informs the ULP that a timeout has occurred,
and then resets the timer.  The timeout appears as an optional
parameter in the open request and the send request.  Upon receiving
either an active open request, or a SYN segment after a passive
request, the TCP entity must maintain a timer set for the interval
specified by the ULP.  As acknowledgments arrive from the remote TCP,
the timer is cancelled and set again for the timeout interval.  As
parameters of the SEND request, timeout and timeout_action can change
during connection lifetime.  If the timeout is reduced below the age of
data waiting to be acknowledged, the event dictated by
ULP_timeout_action will occur.  The implementor may choose to allow
additional options when informing the ULP in case of a timeout; for
example, informing the ULP only on the first timeout.
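
In outline, the required behavior is this (a sketch; the type and the
helper routines are hypothetical):

	struct conn { int ulp_timeout_action; };

	extern void terminate_connection(struct conn *);	/* hypothetical */
	extern void notify_ulp_of_timeout(struct conn *);	/* hypothetical */
	extern void reset_ulp_timer(struct conn *);		/* hypothetical */

	/* ULP timeout expired: abort or notify, per ULP_timeout_action. */
	void
	ulp_timeout_expired(struct conn *cp)
	{
		if (cp->ulp_timeout_action == 1) {
			terminate_connection(cp);
		} else {
			notify_ulp_of_timeout(cp);
			reset_ulp_timer(cp);	/* and keep trying */
		}
	}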

karn@THUMPER.BELLCORE.COM (Phil R. Karn) (05/27/89)

Dave,

Yes, that might be acceptable to me. I'd go a little further, though,
and say that a REMOTE USER (not just the application code) must always
be able to turn off keepalives, even on binary-only systems. It does no
good to say "the application must be able to disable keepalives" when
I'm having problems with a remote server that I have no administrative
control over.

Much of my animosity toward keepalives came from trying to make a Sun
workstation work properly over SLIP links and amateur packet radio. I
finally replaced the TCP object modules provided by Sun with ones
compiled from Van's latest TCP, which I had already edited to disable
keepalives.  Works like a charm.

At the last InterOp, I sat next to Dave Borman in a panel session on TCP
performance. Between us, we represented a "dynamic range" of about 6
orders of magnitude in TCP transfer rates (1200 bps amateur packet radio
to 500 Mbps between Crays). This is an exceptional achievement for a
single networking protocol, but it was possible only because TCP was
designed from the beginning to scale well over a wide network
performance range.

But broken mechanisms like keepalives threaten this. We need a big red
warning light that will flash whenever someone proposes to put a fixed
time interval into a protocol spec, because you can't scale protocols
that have arbitrary timers.

Phil

barr@frog.UUCP (Chris Barr) (05/27/89)

In support of 'no timeouts at TCP layer':

The first explanation I ever heard of TCP/IP was that it was designed 
(for DOD) to survive battle conditions where connections were expected to 
break and later be restored.

I then used someone's Telnet which broke a session after 15 minutes
without keystrokes.

CERF@A.ISI.EDU (05/29/89)

When TCP was first designed, and for all subsequent versions, it was
thought inappropriate to impose any kind of semantics on the logical
connections established by TCP. In particular, no sense of absolute
timeout for the severing of a connection was desired. We thought that
such notions of "impatience" or "time to give up" ought to be the
choice of the upper level protocol using TCP as the basis merely for
reliable delivery.

A part of this view stemmed from the fact that the networks over which
TCP had to function, for the DoD applications we had in mind, were
potentially very unpredictable as to loss and delay. Mobile packet
radio systems had to function under jamming and radio shadow effects,
for instance. TCP never unilaterally severed connections but only
reported failure to achieve positive acknowledgement after a time
which could be controlled by the application or upper-level protocol.
It was up to the application to decide whether to sever the connection
and, even then, the choice to do so gracefully or abruptly was also
left to the application.

The use of a feature (X-level NOP) to test the liveness of a TCP
connection is consonant with the model against which the TCP was
designed. 

Vint Cerf

mo@prisma.UUCP (05/30/89)

I hear you, Bob, but I, for one, don't think it reasonable
for every applications protocol developer to have to
reinvent all the common stuff of doing keep-alives at the
applications level.  According to the advertising copy,
TCP provides reliable virtual circuits.  In my book, knowing
that the other end has croaked is part of the definition
of "reliable."  Since this is a mechanism that is going to
have to be reinvented by lots of protocols, it makes sense
to get it right ONCE so people (1) don't have to reinvent
all the bugs and (2) can just use it for what they really
want to be doing.  The notion that protocols are only
designed by "mavens" is long dead, and rightly so.

	-Mike

jas@proteon.com (John A. Shriver) (06/01/89)

The user (client) Telnet in the MIT UNIX V6 TCP/IP (one of those
pre-Bezerkely WAN TCP/IP's) would periodically print:

	Host not responding, type ^^q to quit

on the user's terminal when (and only when) it had outstanding data to
send, and could not get it acknowledged.  If you had reason to believe
it was right, you aborted the connection.  Otherwise, it sat there
retransmitting at a slow rate until connectivity was regained.
Meanwhile, you would go and fix the broken router, and *would not lose
your current session* on the remote host.

Now, if the server Telnet gets into a pickle, it would probably just
abort and die.  That UNIX lacks any way to preserve a login session is
its problem; MIT AI ITS (on PDP-10's) knew exactly how to preserve
your state when this happened.  Of course, most systems are not in the
habit of generating unsolicited output, so this didn't happen as
often.

jqj@HOGG.CC.UOREGON.EDU (06/02/89)

Seems to me that much of this discussion is missing the point that an
open TCP connection (especially a telnet session) can tie up expensive
resources on the server; most of the recent discussion has focussed on
the problems of a user who may or may not want to abort a connection on
network or remote host failure.  For example, many timesharing systems
charge based on "connect time", and some even enforce a maximum number
of outstanding sessions.  In such cases it is in the interest of the
user and the system to abort a telnet session if there is reason to
believe that loss of connectivity is not just briefly transient.  One
can obviously do this with a (perhaps user settable) timeout, but are
there other heuristics that might usefully be used as well?

Does anyone have any data on the distribution of time-length of network
partitions?  How, for that matter, might we define a network
partition?  Many events (e.g. the TR card in our NSS going bad) yield
obvious network partitions with well defined lengths.  Others, e.g. a
degraded quality line, may imply very short (a few ms or s) partitions,
which increase the errors and retransmissions and ultimately imply an
unusable TCP connection.  Can we come up with an analytic model that
includes both sorts of failures?

stev@VAX.FTP.COM (06/02/89)

*Phil,
*
*As a test-of-concept:  I assume that you have no objection to a TCP
*implementation's being able to do keepalives, under the control of the
*application, where both the fact of keepalives AND their periodicity
*can be specified; and the effect of a timeout is a signal to the
*application, not an abort?
*
*Dave


if an application wants a keep alive mechanism, it should do it
itself; sending a byte of garbage data and abusing the sequence
numbers is not the way to go about this . . . .

and hopefully, the people doing the keepalive mechanism will allow
either end to disable it. if i start up an ftp to run all night
sucking over the latest X distribution, i don't want it being aborted
because a gateway goes down for an hour for PM.



stev knowles
ftp software
stev@ftp.com

MAP@LCS.MIT.EDU (Michael A. Patton) (06/08/89)

   From: prisma!mo@uunet.uu.net
   Date: Tue, 30 May 89 08:07:02 -0600

   [...]  According to the advertising copy, TCP provides reliable
   virtual circuits.  In my book, knowing that the other end has
   croaked is part of the definition of "reliable."

But just because you aren't getting replies does NOT mean the other
end "croaked", just that something did.  If it's internal to the
network, it should recover and you can continue.  The indication that
the other end "croaked" is receiving a RST!  How an application deals
with being temporarily partitioned has to be up to that application.
There are just too many possibilities.
						     Since this is a
   mechanism that is going to have to be reinvented by lots of
   protocols, it makes sense to get it right ONCE [...]

But there isn't one right answer so how can we "get it right ONCE"?
The whole argument here is that the BSD implementation goes against
the design of TCP in that they chose one specific requirement and
implemented a solution to it ONCE, but what I want is NOT what they
provide and what the guy in the next office wants is not what I want.
No strategy that is built into the TCP layer will be right for all
applications, and it can get in the way of applications that want some
other specific type of handling for these cases.

	[...] so people (1) don't have to reinvent all the bugs and
   (2) can just use it for what they really want to be doing.

But you don't have to break TCP (oops, I mean add to it) to prevent
people from reinventing things.  Provide them with a library of
different techniques for handling various network problems.  If I want
one of the standard techniques, I just use it.  If I want something
special, I write it (and if it's of general use, it's an addition to
the library).

	    __
  /|  /|  /|  \		Michael A. Patton, Network Manager
 / | / | /_|__/		Laboratory for Computer Science
/  |/  |/  |atton	Massachusetts Institute of Technology

Disclaimer: The opinions expressed above are a figment of the phosphor
on your screen and do not represent the views of MIT, LCS, or MAP. :-)

frg@jfcl.dec.com (Fred R. Goldstein) (06/09/89)

This is probably a stupid question since I'm not familiar with the way
different systems (ie BSD) implement TCP timeouts.  But wouldn't the
problem of dissimilar systems (ie, AX.25 on one end and Cray on the
other) still be solvable by basing the timeout on the smoothed round
trip time (srtt)?  If the keepalive timer were some significant multiple
of srtt (or longer, if srtt is short) then it would still scale.
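
Something like this, say (a sketch; the multiplier and the floor are
arbitrary, and srtt is assumed to be in seconds):

	#define KA_MULT		240		/* "significant multiple" of srtt */
	#define KA_FLOOR	(2*60*60)	/* never probe more often than this */

	/* Scale the keepalive interval to the smoothed round-trip time,
	 * with a floor so short-srtt paths don't probe constantly. */
	long
	keepalive_interval(long srtt)
	{
		long t = KA_MULT * srtt;

		return (t > KA_FLOOR) ? t : KA_FLOOR;
	}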

Proper behavior, of course, is still open to debate -- whether the
application or TCP should do the teardown.  I'm not joining in...
       fred 

karels@OKEEFFE.BERKELEY.EDU (Mike Karels) (06/09/89)

Sorry, I can't let this go by without commenting on Phil's message
and this discussion, even though the discussion has mostly died down.
(I haven't been reading tcp-ip very often, but noticed this subject
line going by.)

Last time Phil and I talked about keepalives in person, I asked him
whether he had problems with telnet/rlogin servers accumulating on
his systems if they didn't use keepalives.  We certainly accumulate
junk, including xterm programs, waiting for input from a half-open
connection.  Phil told me that he doesn't have problems, because
he runs a "wall" every night to force output to all users, and of
course breaking connections that time out.  In other words, Phil
violently objects to servers requesting keepalives from TCP, but
allows the system manager (himself) to force them above the application
level.  And before people jump up to point out the difference in time
scales, the current BSD code sends no keepalive packets until a connection
has been idle for 2 hr, and that interval is easily changeable.
One proposal for the Host Requirements document was to wait for 12 hr.
I think that's a bit high, but the difference is only a factor of 6.
Compare the number of keepalive packets with the number of packets
exchanged by an xterm and an X server over the course of a week
if used 4 hours a day!

Phil says:
	... I'd go a little further, though,
	and say that a REMOTE USER (not just the application code) must always
	be able to turn off keepalives, even on binary-only systems. It does no
	good to say "the application must be able to disable keepalives" when
	I'm having problems with a remote server that I have no administrative
	control over.

I'm sorry, Phil, but remote users have no more right to override system
management policies than do local users (at least on *our* systems!).
On some of the systems where I have guest accounts, local or remote
users are logged off if they aren't active for two hours.  I don't like
that, either, but I don't claim that the managers of those systems
have no right to enforce such a policy.

		Mike

dcrocker@AHWAHNEE.STANFORD.EDU (Dave Crocker) (06/10/89)

Steve,

Let me try, one last time:

If the application can direct TCP as to the periodicity and the action
to be taken (notify application vs. abort connection) then the application
will not abort your connection unless the application programmer decided
to force that condition.  Under proper design, the programmer will give the
user a switch to set, indicating something about the "persistence" that
is desired.

With respect to having the mechanism in tcp or the application, I agree with
you, philosophically, that the mechanism should be in the application
(although I believe the OSI model would put it into the session layer, but
that seems mostly to be part of the application process, these days).

The major issues, however, are kernel vs. user space, and additional
complexity to the application protocol.

There is a remarkable economy that derives from putting this mechanism
into the kernel/transport system.  It may be an accident that TCP does
not have the mechanism but can be tricked into creating one, but it still
is remarkably simple.

Most application protocols have very simple interaction styles and tend to
be relatively easy to program.  To force time-based generation of action
would complexify these protocols significantly.

Dave