earle@POSEUR.JPL.NASA.GOV (Greg Earle - Sun JPL on-site Software Support) (08/15/90)
I don't want to start the religious war over KEEPALIVEs again, but ...
Has anyone ever run across this kind of behavior? If so, please respond to
millis%sunpeaks.Central@Sun.COM (a.k.a. millis@sunpeaks.Central.Sun.COM).
BTW, the 2 machines in question are running SunOS 4.0.3, which I
think has a 4.3BSD-Tahoe-ish TCP (but not all the latest goodies; they were
added in SunOS 4.1). Thanks.
------- Begin Included Message
I have a Sybase application which fails occasionally because of TCP resets.
Here are the details:
The application is split across 2 machines - client and server.
1- Client (Sun-4/330) negotiates a TCP connection to the server (Sun-4/280)
(Sybase TCP port)
2- Client uses the TCP connection to send transactions and receive
transactions from the server.
3- The connection stays open (by design), even during long periods of idleness.
4- The application sends transactions asynchronously, every few minutes
typically.
5- The above behavior may continue indefinitely; at least it is SUPPOSED to.
6- TCP KEEPALIVE is turned ON on the server, but OFF on the client.
Therefore, during idle periods, the server TCP sends a KEEPALIVE every 75
seconds, which the client dutifully acknowledges.
7- HOWEVER, the server TCP occasionally "misses" sending a KEEPALIVE;
sometimes 1 or more 75-second intervals will pass with no KEEPALIVES seen
from the server. Sometimes the KEEPALIVES resume, but sometimes they seem
to go away permanently.
8- Server TCP seems to believe that KEEPALIVES always go out, even when they
really don't.
9- After failing to receive acknowledgements from 8 consecutive "phantom"
KEEPALIVEs, the TCP port on server is abruptly closed.
10- Sometime later, the client attempts to send another transaction.
11- Server TCP responds immediately with a RESET, since the connection no
longer exists.
12- Client acks the RESET, and the Sybase application crashes with a
"connection reset by peer" error.
Thus, if the application routinely has idle periods of longer than 11 minutes,
15 sec (9*75 sec), the application is at risk of being brought down by this
connection reset problem.
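
The arithmetic behind that figure can be sketched in C. This is only a rough
check, assuming PR_SLOWHZ is 2 (two slow-timeout ticks per second, as in
4.3BSD-derived kernels; the value is not quoted in this message). The helper
names ticks_to_seconds, worst_case_idle_seconds and print_keepalive_budget are
made up for illustration.

```c
#include <stdio.h>

/* ASSUMPTION: PR_SLOWHZ = 2, as in 4.3BSD-derived kernels. */
#define PR_SLOWHZ      2
#define TCPTV_KEEP     (75 * PR_SLOWHZ)   /* keepalive interval, in ticks */
#define TCPTV_MAXIDLE  (8 * TCPTV_KEEP)   /* max idle before drop, in ticks */

/* Convert slow-timer ticks to whole seconds. */
int ticks_to_seconds(int ticks)
{
    return ticks / PR_SLOWHZ;
}

/*
 * Idle time after which the connection is at risk, following the
 * accounting in the message: one keepalive interval before the first
 * probe, plus 8 unanswered probe intervals (9 * 75 sec in total).
 */
int worst_case_idle_seconds(void)
{
    return ticks_to_seconds(TCPTV_KEEP + TCPTV_MAXIDLE);
}

/* Print the three figures in seconds. */
void print_keepalive_budget(void)
{
    printf("keepalive interval: %d sec\n", ticks_to_seconds(TCPTV_KEEP));
    printf("max idle:           %d sec\n", ticks_to_seconds(TCPTV_MAXIDLE));
    printf("at risk after:      %d sec\n", worst_case_idle_seconds());
}
```

Under that assumption, TCPTV_MAXIDLE alone works out to 600 seconds, and the
675 seconds above is that plus the initial 75-second interval.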
Both client and server are running SunOS 4.0.3; a 4.1 upgrade will not be
possible for several months, for other reasons.
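
For reference on item 6 above: keepalive is normally switched on per socket
with the SO_KEEPALIVE option. The sketch below shows the standard BSD socket
calls only; it is not taken from the Sybase code, and the helper names
set_keepalive and get_keepalive are hypothetical.

```c
#include <sys/types.h>
#include <sys/socket.h>

/*
 * Turn SO_KEEPALIVE on (on != 0) or off (on == 0) for a socket.
 * Returns 0 on success, -1 on error (errno is set by setsockopt).
 */
int set_keepalive(int fd, int on)
{
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE,
                      (char *)&on, sizeof(on));
}

/*
 * Report whether SO_KEEPALIVE is enabled on a socket.
 * Returns 1 if enabled, 0 if disabled, -1 on error.
 */
int get_keepalive(int fd)
{
    int on = 0;
    socklen_t len = sizeof(on);

    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, (char *)&on, &len) < 0)
        return -1;
    return on != 0;
}
```

Calling get_keepalive() on each end's socket would be one way to confirm the
ON/OFF settings described in item 6. Note that the option only enables or
disables the probes; the intervals themselves come from the kernel.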
The KEEPALIVE idle period and the KEEPALIVE connection timeout are set in
sys/netinet/tcp_timer.h (on Suns): (the defaults have not been changed)
#define TCPTV_KEEP ( 75*PR_SLOWHZ) /* keep alive - 75 secs */
#define TCPTV_MAXIDLE ( 8*TCPTV_KEEP) /* maximum allowable idle */
I was told by someone that I cannot change TCPTV_KEEP or TCPTV_MAXIDLE -
supposedly, changes made to tcp_timer.h have no effect on the kernel that
gets built.
Finally, the questions:
1- Has anyone seen anything like this? Why would the server occasionally
fail to send KEEPALIVES?
2- I have trouble believing that I cannot change TCPTV_KEEP or
TCPTV_MAXIDLE (perhaps "cannot" is really "should not"?).
Is this really true?
3- I want to increase TCPTV_MAXIDLE, as a Band-aid. The only
major side effect that I can think of is that truly dead connections
(the machine at the other end has died) will take longer to time out.
Any other bad side effects?
------- End of Included Message
--
- Greg Earle
Sun Microsystems, Inc. - JPL on-site Software Support
earle@poseur.JPL.NASA.GOV (Direct)
earle@Sun.COM (Indirect)