jbn@wdl1.UUCP (04/17/85)
It is legitimate for a TCP connection to be in FIN_WAIT_2 forever! There is no idle traffic on a TCP connection. When a connection is in the ESTABLISHED state at both ends and nothing is going on (as with an idle TELNET connection), no packets are exchanged. So if one end goes down in this situation, the other end never hears about it unless some traffic is generated on the connection.

Jon Postel, the ARPANET protocol czar in the days before the Defense Communications Agency took over the responsibility for the standards, has taken the position that no idle traffic should be generated in TCP and that TCP should have no idle timeout; if a host can't handle a large number of half-dead connections, that's tough. He thinks that idle traffic, if any, should be generated in the applications layer.

FIN_WAIT_2 is the state your end is in when your end has closed and the other end has acknowledged your close, but the other end has not closed yet. Closing in TCP is separate for read and write; in some implementations you can close your write pipe and continue to read. (I don't know if 4.2BSD is one of these.)

One way to get into this situation legitimately is to start a long job remotely via TELNET, then close the TELNET connection at your end. This indicates that you want to receive data from your remote job and log out when the remote job finally completes. In this situation, you can still receive output but cannot send any more data. If the remote machine crashes while you are in this situation, you will be hung in FIN_WAIT_2 forever.

This is a tough one. In the hung-in-ESTAB case, a local attempt to send anything will start the retransmission mechanism, which will detect failure within a minute or two. But if you are hung in FIN_WAIT_2, there is nothing that your end can do to probe the connection: you've closed and are forbidden to send anything, so you have to sit there and wait for a FIN that may never come.
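The half-close described above (close the write side, keep reading) can be sketched with the sockets call `shutdown()`. This is a minimal illustration assuming a modern POSIX system with a loopback interface; `SHUT_WR` is the POSIX constant, and `half_close_demo()` is a hypothetical helper, not part of any implementation discussed in this thread. After its `shutdown()`, the "client" end sends a FIN and, once that FIN is acknowledged, sits in FIN_WAIT_2 while still receiving data.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: half-close a TCP connection over loopback.
 * Returns 0 if the client still received the server's data after
 * closing its own write side, -1 on any failure.
 */
int half_close_demo(void)
{
    int lfd = -1, cfd = -1, sfd = -1, rc = -1;
    struct sockaddr_in addr;
    socklen_t alen = sizeof addr;
    char buf[16];
    size_t got = 0;
    ssize_t n;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                          /* kernel picks a port */

    if ((lfd = socket(AF_INET, SOCK_STREAM, 0)) < 0) goto out;
    if (bind(lfd, (struct sockaddr *)&addr, sizeof addr) < 0) goto out;
    if (listen(lfd, 1) < 0) goto out;
    if (getsockname(lfd, (struct sockaddr *)&addr, &alen) < 0) goto out;

    if ((cfd = socket(AF_INET, SOCK_STREAM, 0)) < 0) goto out;
    if (connect(cfd, (struct sockaddr *)&addr, sizeof addr) < 0) goto out;
    if ((sfd = accept(lfd, NULL, NULL)) < 0) goto out;

    /* Client closes its write side only: a FIN goes to the server. */
    if (shutdown(cfd, SHUT_WR) < 0) goto out;

    /* Server sees end-of-file on its read side... */
    if (read(sfd, buf, sizeof buf) != 0) goto out;

    /* ...but can keep sending, and the client can keep reading. */
    if (write(sfd, "output", 6) != 6) goto out;
    while (got < 6) {
        n = read(cfd, buf + got, sizeof buf - got);
        if (n <= 0) goto out;
        got += (size_t)n;
    }
    if (memcmp(buf, "output", 6) != 0) goto out;

    rc = 0;
out:
    if (sfd >= 0) close(sfd);
    if (cfd >= 0) close(cfd);
    if (lfd >= 0) close(lfd);
    return rc;
}
```

If the server end crashed instead of writing, the client's read would simply block: nothing in the half-closed state lets it probe the peer, which is the hang described above.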
(Your end can abort the connection, of course, but the application has to do that; TCP isn't entitled to do so.)

The typical work-around here is to close idle TCP connections after some huge timeout, such as one day. Whatever timeout is chosen must be longer than the longest legitimate idle period; if it were, for example, ten minutes, idle TELNET connections would be logged out after ten minutes. Given the present standards, it's hard to do better than this. 4.2BSD is probably doing something like this.

We have a one-hour idle timeout in our implementation (based on 3COM's UNET) and send an empty ACK every 4 minutes to keep things alive. But this solution won't work generally unless everybody sends an empty ACK every few minutes. Still, it's a clean work-around.

John Nagle
Ford Aerospace and Communications Corp.
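The two work-arounds above (a huge idle timeout, plus periodic empty ACKs so the peer does not time out) can be sketched against a modern sockets API. This is an assumption-laden illustration, not the UNET implementation: `SO_KEEPALIVE` asks the kernel to probe an idle connection, but at a system-defined interval (often two hours by default), not every 4 minutes. `enable_keepalive()`, `keepalive_enabled()`, and `idle_expired()` are hypothetical helpers.

```c
#include <sys/socket.h>
#include <time.h>

/* Ask the kernel to send keepalive probes on an idle connection.
 * Returns 0 on success, -1 on failure. */
int enable_keepalive(int fd)
{
    int on = 1;
    return setsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof on);
}

/* Returns 1 if SO_KEEPALIVE is set on fd, 0 if not, -1 on error. */
int keepalive_enabled(int fd)
{
    int val = 0;
    socklen_t len = sizeof val;

    if (getsockopt(fd, SOL_SOCKET, SO_KEEPALIVE, &val, &len) < 0)
        return -1;
    return val != 0;
}

/*
 * Idle-timeout rule: close only after max_idle seconds with no traffic.
 * max_idle must exceed the longest legitimate idle period (e.g. a day),
 * or idle TELNET sessions get logged out.
 */
int idle_expired(time_t last_activity, time_t now, time_t max_idle)
{
    return now - last_activity >= max_idle;
}
```

With a one-day limit (86400 seconds), a session idle for ten minutes survives, while one idle for a full day is reaped.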
jsq@ut-sally.UUCP (John Quarterman) (04/20/85)
If you have many connections to TOPS-20 hosts, you will accumulate FIN_WAIT_2 state connections rapidly. Here is a kludge that we have used at U. Texas for about two years, in 2.8BSD, 4.1C BSD, and 4.2BSD. It puts a timeout on the FIN_WAIT_2 state only (no effect on ordinary established telnet/ftp/rcp/rlogin connections). It violates the TCP spec in that there is no such timeout in the spec. However, we have never seen any ill effects from it, nor had any complaints, and it does away with the hanging FIN_WAIT_2 connections.

*** tcp_input.c.orig  Wed Apr 17 15:47:38 1985
--- tcp_input.c       Wed Apr 17 15:48:06 1985
***************
*** 530,535
                  if (so->so_state & SS_CANTRCVMORE)
                      soisdisconnected(so);
                  tp->t_state = TCPS_FIN_WAIT_2;
              }
              break;
--- 530,536 -----
                  if (so->so_state & SS_CANTRCVMORE)
                      soisdisconnected(so);
                  tp->t_state = TCPS_FIN_WAIT_2;
+                 tp->t_timer[TCPT_2MSL] = 2 * TCPTV_MSL;
              }
              break;

-- 
John Quarterman, jsq@ut-sally.ARPA, {ihnp4,seismo,ctvax}!ut-sally!jsq
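What the one-line patch does can be modeled as follows. This is a hypothetical, simplified sketch, not the kernel code: `struct conn`, `enter_fin_wait_2()`, and `slow_tick()` are invented names, and the timer constants are assumptions borrowed from later BSD headers (`PR_SLOWHZ` of 2 ticks per second, `TCPTV_MSL` of 30 seconds), so `2 * TCPTV_MSL` comes to 120 ticks, about one minute.

```c
/* Assumed values, following later BSD headers; 4.2BSD may differ. */
#define PR_SLOWHZ 2                       /* slow-timeout ticks per second */
#define TCPTV_MSL (30 * PR_SLOWHZ)        /* 30-second MSL, in ticks */

enum tcp_state { TCPS_CLOSED, TCPS_FIN_WAIT_2 };

struct conn {
    enum tcp_state t_state;
    int t_2msl;            /* stands in for tp->t_timer[TCPT_2MSL]; 0 = off */
};

/* The patch: arm the 2MSL timer when entering FIN_WAIT_2. */
void enter_fin_wait_2(struct conn *tp)
{
    tp->t_state = TCPS_FIN_WAIT_2;
    tp->t_2msl = 2 * TCPTV_MSL;
}

/* One slow-timeout tick: count down, and close when the timer expires. */
void slow_tick(struct conn *tp)
{
    if (tp->t_2msl > 0 && --tp->t_2msl == 0)
        tp->t_state = TCPS_CLOSED;        /* models tcp_close(tp) */
}
```

Nothing in this model resets the timer when data arrives, so the connection is closed one MSL pair after the half-close whether or not the peer is still sending.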
steveh@hammer.UUCP (Stephen Hemminger) (04/21/85)
Don't apply the bug fix which sets a 2MSL timer. This fix is wrong; it breaks programs that use one-way Inet sockets. Example:

    rsh -n longrunningprogram >output.store &

The problem is easily fixed by the change below. What happens is that the code was there to handle the problem, but due to control-flow problems, ti->ti_len is always zero by the time it gets to the test. You should probably make sure you have all the other problems fixed, but that's another story...

(Of course, if the problem is on a non-Unix system, you are out of luck and may have to go back to the original fix [but use a MUCH LONGER timer, say 2 hours].)

------
*** tcp_input.c.orig    Sun Apr 21 12:34:15 1985
--- tcp_input.c.finfix  Sun Apr 21 12:39:09 1985
***************
*** 328,333
              goto dropafterack;
          if (ti->ti_len > 0) {
              m_adj(m, ti->ti_len);
              ti->ti_len = 0;
              ti->ti_flags &= ~(TH_PUSH|TH_FIN);
          }
--- 328,342 -----
              goto dropafterack;
          if (ti->ti_len > 0) {
              m_adj(m, ti->ti_len);
+             /*
+              * If data is received on a connection after the
+              * user processes are gone, then RST the other end.
+              */
+             if ((so->so_state & SS_NOFDREF)
+                 && tp->t_state > TCPS_CLOSE_WAIT) {
+                 tp = tcp_close(tp);
+                 goto dropwithreset;
+             }
              ti->ti_len = 0;
              ti->ti_flags &= ~(TH_PUSH|TH_FIN);
          }
***************
*** 374,389
              ti->ti_len -= todrop;
              ti->ti_flags &= ~(TH_PUSH|TH_FIN);
          }
-     }
- 
-     /*
-      * If data is received on a connection after the
-      * user processes are gone, then RST the other end.
-      */
-     if ((so->so_state & SS_NOFDREF) && tp->t_state > TCPS_CLOSE_WAIT &&
-         ti->ti_len) {
-         tp = tcp_close(tp);
-         goto dropwithreset;
      }
  
      /*
--- 383,388 -----
              ti->ti_len -= todrop;
              ti->ti_flags &= ~(TH_PUSH|TH_FIN);
          }
      }
  
      /*
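The control-flow problem can be reduced to a question of ordering. The sketch below is a hypothetical, simplified model, not the real `tcp_input()`: `struct seg`, `should_reset_original()`, and `should_reset_patched()` are invented names, and the two booleans stand in for `so->so_state & SS_NOFDREF` and `tp->t_state > TCPS_CLOSE_WAIT`. It assumes, as the post says, that the incoming data has already been trimmed away (and `ti_len` zeroed) before the original test is reached.

```c
#include <stdbool.h>

struct seg { int ti_len; };

/* Original order: trim the data first, test ti_len afterwards.
 * ti_len is already 0 at the test, so the reset never happens. */
bool should_reset_original(struct seg *ti, bool nofdref, bool past_close_wait)
{
    if (ti->ti_len > 0)
        ti->ti_len = 0;                   /* data dropped, length cleared */
    return nofdref && past_close_wait && ti->ti_len;
}

/* Patched order: test while ti_len still reflects the received data. */
bool should_reset_patched(struct seg *ti, bool nofdref, bool past_close_wait)
{
    if (ti->ti_len > 0) {
        if (nofdref && past_close_wait)
            return true;                  /* tcp_close(), then dropwithreset */
        ti->ti_len = 0;
    }
    return false;
}
```

Moving the check inside the `ti_len > 0` block is what lets the orphaned connection answer stray data with a RST instead of lingering.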
jbn@wdl1.UUCP (04/23/85)
The timeout in FIN_WAIT_2 does violate the TCP spec, but unless you use some applications that like to close one side of a connection while using the other for a long period thereafter, it shouldn't hurt. Since UNIX lacks an asymmetrical close, this is not much of a problem between UNIX sites.

It is a known bug that most TOPS-20 systems don't perform the TCP close handshake properly. We first noticed this problem several years ago, and it still hasn't been fixed everywhere; the machines at BBN are OK, but expect some minor problems from TOPS-20 sites that got the code via DEC.

On to TP-4.

John Nagle
jds@rlgvax.UUCP (Jack Slingwine) (04/24/85)
> If you have many connections to TOPS-20 hosts, you will accumulate
> FIN_WAIT_2 state connections rapidly. Here is a kludge that we have
> used at U. Texas for about two years, in 2.8BSD, 4.1C BSD, and 4.2BSD.
> It puts a timeout on the FIN_WAIT_2 state only (no effect on ordinary
> established telnet/ftp/rcp/rlogin connections). It violates the TCP
> spec in that there is no such timeout in the spec. However, we have
> never seen any ill effects from it, nor had any complaints, and it
> does away with the hanging FIN_WAIT_2 connections.

Unfortunately, it does cause trouble for the following:

    rsh other 'cd dir;tar cf - stuff' | (cd lcldir;tar xf -)

which allows you to "siphon" stuff off another machine. For some reason, "rsh" shuts down its send side of the stream, putting its end of the connection in FIN_WAIT_2. Depending on the amount of data to be sent, the "2*TCPTV_MSL" timer could expire before all the data is transmitted.

I suppose the least painful solution is to keep the FIN_WAIT_2 timer and NOT shut down the send side of the stream, even though you don't plan to use it.
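The "depending on the amount of data" point is simple arithmetic: a one-shot timer started when the connection enters FIN_WAIT_2 caps how much data can arrive before the connection is torn down. The sketch below assumes a 60-second timer (`2*TCPTV_MSL` with a 30-second MSL, as in later BSD headers) and an illustrative throughput of 960 bytes per second, roughly a 9600-baud line; neither figure comes from the post, and both helpers are hypothetical.

```c
/* Most data that can arrive before a one-shot timer of timer_sec fires,
 * at a steady bytes_per_sec. */
long max_bytes_before_timer(long bytes_per_sec, long timer_sec)
{
    return bytes_per_sec * timer_sec;
}

/* Returns 1 if a transfer of `bytes` would be cut off by the timer. */
int transfer_cut_off(long bytes, long bytes_per_sec, long timer_sec)
{
    return bytes > max_bytes_before_timer(bytes_per_sec, timer_sec);
}
```

Under these assumptions only 960 * 60 = 57,600 bytes get through, so a tar of any sizable directory would be truncated mid-stream.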