earle@POSEUR.JPL.NASA.GOV (Greg Earle - Sun JPL on-site Software Support) (08/15/90)
I don't want to start the religious war over KEEPALIVEs again, but ...

Has anyone ever run across this kind of behavior? If so, please respond to
millis%sunpeaks.Central@Sun.COM (a.k.a. millis@sunpeaks.Central.Sun.COM).

BTW, the 2 machines in question are running SunOS 4.0.3, which I think has a
4.3BSD-Tahoe-ish TCP (but not all the latest goodies; they were added in
SunOS 4.1).

Thanks.

------- Begin Included Message

I have a Sybase application which fails occasionally because of TCP resets.
Here are the details:

The application is split across 2 machines - client and server.

 1- Client (Sun-4/330) negotiates a TCP connection to the server (Sun-4/280)
    (Sybase TCP port).

 2- Client uses the TCP connection to send transactions to and receive
    transactions from the server.

 3- The connection stays open (by design), even during long periods of
    idleness.

 4- The application sends transactions asynchronously, every few minutes
    typically.

 5- The above behavior may continue indefinitely; at least it is SUPPOSED to.

 6- TCP KEEPALIVE is turned ON on the server, but OFF on the client.
    Therefore, during idle periods, the server TCP sends a KEEPALIVE every
    75 seconds, which the client dutifully acknowledges.

 7- HOWEVER, the server TCP occasionally "misses" sending a KEEPALIVE;
    sometimes 1 or more 75-second intervals will pass with no KEEPALIVEs
    seen from the server. Sometimes the KEEPALIVEs resume, but sometimes
    they seem to go away permanently.

 8- Server TCP seems to believe that KEEPALIVEs always go out, even when
    they really don't.

 9- After failing to receive acknowledgements for 8 consecutive "phantom"
    KEEPALIVEs, the TCP port on the server is abruptly closed.

10- Sometime later, the client attempts to send another transaction.

11- Server TCP responds immediately with a RESET, since the connection no
    longer exists.

12- Client acks the RESET, and the Sybase application crashes with a
    "connection reset by peer" error.
Thus, if the application routinely has idle periods longer than 11 minutes,
15 sec (9*75 sec), it is at risk of being brought down by this connection
reset problem.

Both client and server are running SunOS 4.0.3; a 4.1 upgrade will not be
possible for several months, for other reasons.

The KEEPALIVE idle period and the KEEPALIVE connection timeout are set in
sys/netinet/tcp_timer.h (on Suns); the defaults have not been changed:

#define TCPTV_KEEP      ( 75*PR_SLOWHZ)   /* keep alive - 75 secs */
#define TCPTV_MAXIDLE   (  8*TCPTV_KEEP)  /* maximum allowable idle

I was told by someone that I cannot change TCPTV_KEEP or TCPTV_MAXIDLE -
changes made to tcp_timer.h will not affect what the kernel build produces.

Finally, the questions:

 1- Has anyone seen anything like this? Why would the server occasionally
    fail to send KEEPALIVEs?

 2- I have trouble believing that I cannot change TCPTV_KEEP or
    TCPTV_MAXIDLE (perhaps "cannot" is really "should not"?). Is this
    really true?

 3- I want to increase TCPTV_MAXIDLE, as a Band-aid. The only major side
    effect that I can think of is that truly dead connections (the machine
    at the other end has died) will take longer to time out. Any other bad
    side effects?

------- End of Included Message

--
        - Greg Earle
          Sun Microsystems, Inc. - JPL on-site Software Support
          earle@poseur.JPL.NASA.GOV     (Direct)
          earle@Sun.COM                 (Indirect)