[comp.protocols.tcp-ip] The case of the phantom KEEPALIVEs

earle@POSEUR.JPL.NASA.GOV (Greg Earle - Sun JPL on-site Software Support) (08/15/90)

I don't want to start the religious war over KEEPALIVEs again, but ...

Has anyone ever run across this kind of behavior?  If so, please respond to
millis%sunpeaks.Central@Sun.COM (a.k.a. millis@sunpeaks.Central.Sun.COM).
BTW, the 2 machines in question are running SunOS 4.0.3, which I
think has a 4.3BSD-Tahoe-ish TCP (but not all the latest goodies; they were
added in SunOS 4.1).  Thanks.

------- Begin Included Message

I have a Sybase application which fails occasionally because of TCP resets.
Here are the details:

The application is split across 2 machines - client and server.

1- Client (Sun-4/330) negotiates a TCP connection to the server (Sun-4/280)
   (Sybase TCP port)

2- Client uses the TCP connection to send transactions and receive
   transactions from the server.

3- The connection stays open (by design), even during long periods of idleness.

4- The application sends transactions asynchronously, every few minutes
   typically.

5- The above behavior may continue indefinitely; at least it is SUPPOSED to.

6- TCP KEEPALIVE is turned ON on the server, but OFF on the client.
   Therefore, during idle periods, the server TCP sends a KEEPALIVE every 75
   seconds, which the client dutifully acknowledges.

7- HOWEVER, the server TCP occasionally "misses" sending a KEEPALIVE;
   sometimes 1 or more 75-second intervals will pass with no KEEPALIVEs seen
   from the server.  Sometimes the KEEPALIVEs resume, but sometimes they seem
   to go away permanently.

8- Server TCP seems to believe that KEEPALIVEs always go out, even when they
   really don't.

9- After failing to receive acknowledgements from 8 consecutive "phantom"
   KEEPALIVEs, the TCP port on server is abruptly closed.

10- Sometime later, the client attempts to send another transaction.

11- Server TCP responds immediately with a RESET, since the connection no
    longer exists.

12- Client ack's the RESET, and the Sybase application crashes with a
    "connection reset by peer" error.
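
For reference, step 6 corresponds to the server process setting the
SO_KEEPALIVE socket option.  Below is a minimal sketch using the standard
BSD sockets calls.  The helper names enable_keepalive and keepalive_enabled
are hypothetical, and nothing here is taken from the Sybase code.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: turn on TCP keepalive probes for one socket.
 * Returns 0 on success, -1 on failure (errno is set by setsockopt). */
int enable_keepalive(int fd)
{
    int on = 1;
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on));
}

/* Hypothetical helper: report whether keepalive is enabled on a socket.
 * Returns 1 if on, 0 if off, -1 on failure.
 * Note: socklen_t is the modern type; older systems declare this as int. */
int keepalive_enabled(int fd)
{
    int val = 0;
    socklen_t len = sizeof(val);

    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &val, &len) < 0)
        return -1;
    return val != 0;
}
```

In the setup described above, only the server side would call
enable_keepalive() on its accepted socket; the client leaves the option off.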

Thus, if the application routinely has idle periods of longer than 11 minutes,
15 sec (9*75 sec), the application is at risk of being brought down by this
connection reset problem.
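
The arithmetic behind that figure can be checked with a small sketch.  It
assumes PR_SLOWHZ is 2 (the usual 4.3BSD slow-timer rate, which this message
does not state) and uses the two tcp_timer.h values quoted further down.  The
helper names are hypothetical, and the "1 + 8 intervals" count follows the
reasoning in this message rather than the kernel source.

```c
/* Assumption: slow-timer ticks per second, as in 4.3BSD; not stated here. */
#define PR_SLOWHZ       2

/* Values from tcp_timer.h, expressed in slow-timer ticks. */
#define TCPTV_KEEP      ( 75*PR_SLOWHZ)
#define TCPTV_MAXIDLE   (  8*TCPTV_KEEP)

/* Hypothetical helper: convert slow-timer ticks to seconds. */
int ticks_to_seconds(int ticks)
{
    return ticks / PR_SLOWHZ;
}

/* Hypothetical helper: idle time after which the connection is at risk,
 * following this message's reasoning: one quiet interval before the first
 * probe, then TCPTV_MAXIDLE worth of unanswered probes. */
int risk_window_seconds(void)
{
    return ticks_to_seconds(TCPTV_KEEP + TCPTV_MAXIDLE);
}
```

Under these assumptions risk_window_seconds() returns 675, i.e. 11 minutes
15 seconds, which matches the 9*75 sec figure.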

Both client and server are running SunOS 4.0.3; a 4.1 upgrade will not be
possible for several months, for other reasons.

The KEEPALIVE idle period and the KEEPALIVE connection timeout are set in
sys/netinet/tcp_timer.h (on Suns): (the defaults have not been changed)

#define TCPTV_KEEP      ( 75*PR_SLOWHZ)         /* keep alive - 75 secs */
#define TCPTV_MAXIDLE   (  8*TCPTV_KEEP)        /* maximum allowable idle time */

I was told by someone that I cannot change TCPTV_KEEP or TCPTV_MAXIDLE -
changes made to tcp_timer.h will not affect the kernel that gets built.

Finally, the questions:

1-      Has anyone seen anything like this?  Why would the server occasionally
        fail to send KEEPALIVEs?

2-      I have trouble believing that I cannot change TCPTV_KEEP or
        TCPTV_MAXIDLE (perhaps "cannot" is really "should not"?).
        Is this really true?

3-      I want to increase TCPTV_MAXIDLE, as a Band-aid.  The only
        major side effect that I can think of is that truly dead connections
        (the machine at the other end has died) will take longer to time out.
        Any other bad side effects?

------- End of Included Message

--
	- Greg Earle
	  Sun Microsystems, Inc. - JPL on-site Software Support
	  earle@poseur.JPL.NASA.GOV	(Direct)
	  earle@Sun.COM			(Indirect)