[comp.protocols.tcp-ip] FIN_WAIT_2 problem ?

cdjohns@NSWC-G.ARPA.UUCP (06/15/87)

	The problem in question is that of a TCP session left in limbo
due to the unusual termination of the matching socket on the remote host.
This condition is usually characterized by a TCP session dangling in the
FIN_WAIT_2 or LAST_ACK state.  This indicates that the session is awaiting
the arrival of a final FIN or the ACK of its own FIN.  The session will never
see these messages because the matching side of the connection is either dead
or awaiting a similar message.

	We finally decided to look into the problem seriously and have come up
with the following observations and an (apparently) reasonable solution.

When a TCP session is hung, it's usually in FIN_WAIT_2 or LAST_ACK.

Normally, a session in FIN_WAIT_2 will receive a FIN, send an ACK and transition
to TIME_WAIT.  In this state, the session waits a default amount of time
(suggested 1 minute) and then assumes the matching side of the connection
received the ACK of its FIN.  Then the TCP Protocol Control Block (PCB) is
deleted and the session is closed.  (The PCB is the internal representation of
the TCP connection.)

The timer mentioned above is referred to as the 2MSL timer and is apparently
only used in the TIME_WAIT state.  This timer is simple in operation.  The
current value of the timer is stored in the PCB for that session.  (The
structure of a TCP PCB can be found in <netinet/in_pcb.h>.)  The kernel
decrements this counter every 500ms until it reaches zero, then the session
is internally deleted.
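
As a small illustration of the timer arithmetic described above, the sketch below converts a hexadecimal tick count into seconds, assuming only the 500ms tick stated in the text.  The function name ticks_to_seconds is hypothetical and is not part of the original script.

```shell
#!/bin/sh
# ticks_to_seconds: convert a hexadecimal count of 500ms timer ticks
# (the unit in which the 2MSL slot is stored) into seconds.
# Hypothetical helper for illustration only.
ticks_to_seconds() {
    ticks=$((0x$1))                          # hex string -> integer
    # Two ticks per second; an odd count leaves half a second over.
    echo "$((ticks / 2)).$(( (ticks % 2) * 5 ))"
}

ticks_to_seconds 06    # prints 3.0
ticks_to_seconds 78    # prints 60.0 (0x78 = 120 ticks)
```

A value of 06 therefore gives the kernel about three seconds before the PCB is deleted.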

The solution we have found to the dangling session is to fake the kernel into
thinking the session is in the TIME_WAIT state, ready to die.  This is done by
setting the 2MSL timer to a non-zero value.  The kernel then takes over, 
decrementing the timer until it reaches zero and ZAP! the TCP session is gone.
This setting of the timer is easily done with adb.

It is sometimes the case that both sides of the connection are hung, one in
FIN_WAIT_2, the other in LAST_ACK.  It appears that clearing the FIN_WAIT_2
using the above method will also clear the matching LAST_ACK.

If this appears to be a reasonable solution, a permanent fix might be
achieved in the following way:
Upon entering the FIN_WAIT_2 state, set the 2MSL timer to a default time large
enough to give the remote host adequate time to send the FIN.  If the FIN
never shows up, eventually the PCB will be deleted.
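
The proposed behavior can be sketched as a toy model in sh.  This is not kernel code, only a simulation of the idea under stated assumptions: the function name fin_wait_2_outcome, its arguments, and the tick values used in the examples are all hypothetical.

```shell
#!/bin/sh
# Toy model (NOT kernel code) of the proposed FIN_WAIT_2 timeout.
#   $1 = timer value, in 500ms ticks, set on entering FIN_WAIT_2
#   $2 = tick at which the remote FIN arrives, or "never"
# Prints the state the session would move to next.
fin_wait_2_outcome() {
    timer=$1
    fin_at=$2
    tick=0
    while [ "$timer" -gt 0 ]; do
        tick=$((tick + 1))
        if [ "$fin_at" != never ] && [ "$tick" -ge "$fin_at" ]; then
            echo TIME_WAIT      # FIN arrived in time: normal close continues
            return 0
        fi
        timer=$((timer - 1))    # stands in for the kernel's 500ms decrement
    done
    echo CLOSED                 # timer ran out: the PCB would be deleted
}

fin_wait_2_outcome 120 never    # FIN never shows up
fin_wait_2_outcome 120 10       # FIN arrives after 5 seconds
```

The model also makes the trade-off visible: any remote side that legitimately takes longer than the chosen timer to send its FIN ends up in the CLOSED branch.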

We would appreciate any comments about this solution.  The following script
should demonstrate setting the 2MSL timer.

------------------- adb script starts here ------------------

#!/bin/sh
#  This script is designed to aid in eliminating lingering TCP
#  connections without having to reboot the host kernel.
#  It is intended to be used on TCP connections stuck in the
#  FIN_WAIT_2 state.  Elimination of the connection is done by
#  setting the 2MSL timer in the TCP Protocol Control Block (PCB)
#  to a non-zero value.  The kernel then begins to decrement this
#  value until it reaches zero, at which point the kernel forces a
#  close on the socket and deletes the TCP PCB.  If both sides of
#  the connection are hung, clearing one side will possibly clear
#  the other (FIN_WAIT_2 should be cleared as a first try).


# MSLOFFSET is the offset in the tcpcb record for the 2MSL timer.
# The tcpcb record is found in <netinet/in_pcb.h>
# This value is the number of bytes offset, expressed in hexadecimal.

MSLOFFSET=10

# TIMETODEATH is the number of half seconds until the connection is
# closed.  This value is expressed in hexadecimal and must be greater
# than zero.

TIMETODEATH=06

# Display netstat.  Addresses for PCBs are found in the first column.

netstat -A

echo
echo 'PCB address to terminate ? '
read addr
echo

# Perform adb on kernel and display the PCB of the specified address

adb -k /vmunix /dev/mem << SHAR_EOF
$addr\$<tcpcb
\$q
SHAR_EOF

# Check to see if this was the correct address and PCB.  The state should be
# 8 for LAST_ACK, 9 for FIN_WAIT_2.

echo
echo 'Is this the correct PCB (y/n) ?     (state = 9 = FIN_WAIT_2) '
read ans
echo
if test "$ans" != 'y'
  then echo 'No Changes.'
       exit
fi

# Perform adb on kernel and set the 2MSL timer for the PCB

adb -k -w /vmunix /dev/mem << SHAR_EOF
$addr+$MSLOFFSET/w $TIMETODEATH
\$q
SHAR_EOF

# These lines are used in place of the above for testing the script.
#adb -k  /vmunix /dev/mem << SHAR_EOF
#$addr+$MSLOFFSET/x 
#\$q
#SHAR_EOF

echo
echo 'Connection will be terminated in 0x'$TIMETODEATH'/2 seconds.'
echo 
echo 'netmod done.'
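
When checking the state field shown by adb before answering the prompt, a lookup such as the following may help.  Only the values 8 (LAST_ACK) and 9 (FIN_WAIT_2) come from the script above; the remaining numbers are an assumption that the kernel follows the 4.3BSD <netinet/tcp_fsm.h> ordering, and the function name tcp_state_name is hypothetical.

```shell
#!/bin/sh
# tcp_state_name: map a numeric t_state value to its name.
# ASSUMPTION: numbering as in 4.3BSD <netinet/tcp_fsm.h>; only 8 and 9
# are confirmed by the script above.  Verify against your own headers.
tcp_state_name() {
    case "$1" in
        0)      echo CLOSED ;;
        1)      echo LISTEN ;;
        2)      echo SYN_SENT ;;
        3)      echo SYN_RECEIVED ;;
        4)      echo ESTABLISHED ;;
        5)      echo CLOSE_WAIT ;;
        6)      echo FIN_WAIT_1 ;;
        7)      echo CLOSING ;;
        8)      echo LAST_ACK ;;
        9)      echo FIN_WAIT_2 ;;
        10|a|A) echo TIME_WAIT ;;   # 10 in decimal, a if displayed in hex
        *)      echo UNKNOWN ;;
    esac
}

tcp_state_name 9    # prints FIN_WAIT_2
tcp_state_name 8    # prints LAST_ACK
```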



=============================================================================

Chuck     --  Naval Surface Weapons Center
Johnson       Code K33 (Systems Integration and Networking)
              Dahlgren, Va  22448
              DDN Mail -  cdjohns@nswc-g.arpa
		   or     cdjohns@nswc-oas.arpa
              phone - (703) 663 - 7745

cdjohns@NSWC-G.ARPA (06/16/87)

If you were interested in the adb script I sent yesterday, you will
be interested in this correction:
the TCP PCB structure can be found in <netinet/tcp_var.h> rather than
<netinet/in_pcb.h>.
Sorry for the bad dump.

minshall@OPAL.BERKELEY.EDU.UUCP (06/16/87)

Chuck Johnson,

	If you have connections hanging in FIN_WAIT_2 and LAST_ACK
then there is a bug somewhere.  There was a bug in the 4.2 TCP
implementation (in tcp_input.c) which would cause connections to
hang in this state.  The fix, which is in the Mt. Xinu bug list
(among other places), involved repositioning a block of code
(though I can't remember which block where).

Greg Minshall

karn@FALINE.BELLCORE.COM (Phil R. Karn) (06/16/87)

No! FIN_WAIT_2 state is supposed to be a "stable" state in TCP. That is, one
end of the connection may close while the other end may continue to send for
as long as it likes. In this situation, the end that closed first will be
in FIN_WAIT_2 state (assuming its FIN has been ACKed) and the other end will
be in CLOSE_WAIT state.

As for connections getting hung up in LAST_ACK state, this must be a bug
in the TCP implementation (I've noticed it myself on BSD machines). In that
state you are waiting for an ACK to your FIN; since there is "data" (the
FIN is considered to be data for sequence numbering and acknowledgement
purposes) the retransmission timer should be running.  Enough
unsuccessful attempts to get an ACK for the last FIN will eventually cause
TCP to go to the closed state -- if the TCP is working properly.  There
is no need to make incompatible changes to the protocol -- fix the bugs
instead!

Phil
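
To illustrate why a correctly working retransmission timer bounds the time spent in LAST_ACK, the sketch below sums the waits between retransmissions of the FIN.  The doubling backoff, the 64-second cap, and the limit of 12 attempts are assumptions chosen for illustration, not values taken from any particular TCP implementation; the function name is hypothetical.

```shell
#!/bin/sh
# total_give_up_time: seconds spent retransmitting before giving up.
#   $1 = initial retransmission timeout in seconds
#   $2 = cap on the timeout in seconds
#   $3 = number of retransmission attempts allowed
# ASSUMPTION: the timeout doubles after each attempt up to the cap.
total_give_up_time() {
    rto=$1
    max_rto=$2
    retries=$3
    total=0
    i=0
    while [ "$i" -lt "$retries" ]; do
        total=$((total + rto))          # wait one timeout, then retransmit
        rto=$((rto * 2))                # back off
        if [ "$rto" -gt "$max_rto" ]; then
            rto=$max_rto
        fi
        i=$((i + 1))
    done
    echo "$total"
}

total_give_up_time 1 64 12    # prints 447
```

Under these assumed parameters the connection would leave LAST_ACK after roughly seven and a half minutes without any change to the protocol.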

ron@TOPAZ.RUTGERS.EDU (Ron Natalie) (06/17/87)

I should reemphasize to people who would automatically clear FIN_WAIT_2
after some period of time that this may well blow away RSH.  RSH
will close one side of the connection (remember TCP sockets are full-duplex),
and hence waiting only a small amount of time (like 30 seconds) will zap
legitimate connections for commands that take longer than 30 seconds to
return their output.

-Ron
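
Given the concern about RSH, one cautious approach is to look at which FIN_WAIT_2 entries might belong to an rsh session before clearing anything.  The filter below is a sketch only: it assumes that `netstat -A` prints the PCB address first, the local and foreign addresses in the fifth and sixth columns, and the state last, and it treats a port of "shell" or 514 merely as a hint.  The function name and the sample lines are made up for illustration.

```shell
#!/bin/sh
# list_fin_wait_2: read `netstat -A`-style lines on stdin and print the
# PCB address and foreign address of each FIN_WAIT_2 entry.
# ASSUMED column layout (check your own netstat output):
#   PCB  Proto  Recv-Q  Send-Q  Local-Address  Foreign-Address  State
list_fin_wait_2() {
    awk '$NF == "FIN_WAIT_2" {
        note = ""
        # Port "shell" (514) is the rsh service; a match is only a hint.
        if ($5 ~ /\.(shell|514)$/ || $6 ~ /\.(shell|514)$/)
            note = " (possibly rsh, leave alone)"
        print $1, $6 note
    }'
}

# Made-up sample input; on a live system use:  netstat -A | list_fin_wait_2
list_fin_wait_2 <<'EOF'
80a1b00 tcp 0 0 hosta.1023 hostb.shell FIN_WAIT_2
80a1c00 tcp 0 0 hosta.1042 hostc.telnet FIN_WAIT_2
80a1d00 tcp 0 0 hosta.smtp hostd.2211 ESTABLISHED
EOF
```

Entries flagged this way are candidates for the half-closed but still active case and are probably best left to finish on their own.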