hedrick@topaz.rutgers.edu (Charles Hedrick) (08/16/87)
I believe I have found a set of bugs in 4.2 or 4.3 TCP. Some of them are the kind where you say "this can't be true; somebody would have noticed." Thus I'd appreciate it if somebody with a bit more experience with this code would take a look to see whether these are real. The most serious problem is with the FIN processing. The newest version of 4.3 networking, which we installed on the Sun from Van Jacobson's distribution, likes to send a sequence of packets that causes meltdowns. I believe that the meltdown will occur even if one of the two machines is still running the old code. If I'm right, we may be in for some problems unless everybody fixes this problem at the same time.

Here is a summary of the things reported in this message:

 - a problem in uipc_socket that causes out-of-band data to be lost.
   The effect is that when you rlogin to a host and run Emacs, the
   terminal mode does not change, and ^S stops output rather than
   doing a search.

 - a problem in uipc_socket that would cause push to get lost.  This
   may not have any actual effects, since the 4.x TCP implementation
   seems to more or less ignore push.  (Pyramid's version does not,
   and that's where we developed the fix.)

 - a problem where we get into a meltdown caused by misprocessing of
   FINs.

 - unknown but probably not serious problems due to a misdefinition
   of TCPS_HAVERCVDFIN.

Here are the details:

(1) Out-of-band data gets lost. In 4.3 TCP, out-of-band data is removed from the normal data stream; however, the point where it appeared is saved. If you ever read the normal data past that point, the out-of-band data is invalidated. In order to be sure you see all out-of-band data, you have to set up a handler for SIGURG. The assumption is that the out-of-band data will trigger the SIGURG immediately, so that you go look at the out-of-band data before continuing to read the normal data. Unfortunately, if out-of-band data arrives at just the wrong time, SIGURG may not trigger soon enough.
To fix this, I make the socket read code test for a pending SIGURG. If there is a pending SIGURG, you want to process it before going ahead to read the normal data.

(2) Push gets lost (at least on the Pyramid). In the socket read routine, a variable "tomark" gets set if there is a push mark. The read is supposed to stop at this point. Unfortunately, tomark gets set before a loop. Inside that loop, the kernel has to turn on interrupts in order to copy data to the user. At this point it is possible that a packet could arrive and get processed. If it has a push mark, then the next time around the loop you want to use it. So I claim that the calculation of tomark should be moved inside the loop.

(3) Here is a TCP trace showing our meltdown. This trace was taken between two Suns running Van's code:

11:01:27.48 RUTGERS.EDU.nntp > ARAMIS.RUTGERS.EDU.1735: P 1:98(97) ack 1 win 4096
11:01:27.54 ARAMIS.RUTGERS.EDU.1735 > RUTGERS.EDU.nntp: P 1:34(33) ack 98 win 4096
11:01:27.72 RUTGERS.EDU.nntp > ARAMIS.RUTGERS.EDU.1735: . ack 34 win 4096
11:01:27.74 RUTGERS.EDU.nntp > ARAMIS.RUTGERS.EDU.1735: P 98:134(36) ack 34 win 4096
11:01:27.82 ARAMIS.RUTGERS.EDU.1735 > RUTGERS.EDU.nntp: . ack 134 win 4096
11:01:27.82 ARAMIS.RUTGERS.EDU.1735 > RUTGERS.EDU.nntp: P 34:40(6) ack 134 win 4096
11:01:27.82 RUTGERS.EDU.nntp > ARAMIS.RUTGERS.EDU.1735: P 134:189(55) ack 40 win 4096
11:01:27.84 RUTGERS.EDU.nntp > ARAMIS.RUTGERS.EDU.1735: F 189:189(0) ack 40 win 4096
11:01:27.84 ARAMIS.RUTGERS.EDU.1735 > RUTGERS.EDU.nntp: F 40:40(0) ack 189 win 4096
11:01:27.84 ARAMIS.RUTGERS.EDU.1735 > RUTGERS.EDU.nntp: F 40:40(0) ack 190 win 4096
11:01:27.84 RUTGERS.EDU.nntp > ARAMIS.RUTGERS.EDU.1735: F 189:189(0) ack 41 win 4096

For the next several minutes, the two machines sent FINs back and forth at a very high rate. About 200,000 packets were sent during this period. This problem happens to us several times a day, always between the same pair of hosts.
I believe the key to the problem is the fact that Aramis sends two FINs. The first one acks the last data that was sent; the second one acks the other side's FIN. The bug turns out to be that the ack field from a FIN is only used if it is the first FIN received. The problem is that the first FIN received causes rcv_nxt to be incremented. From then on, todrop turns out to be nonzero, and the packet is dropped as if it were a duplicate. That's easy enough to fix. The fix needs to use the macro TCPS_HAVERCVDFIN to figure out whether rcv_nxt has been incremented. As far as I can tell from the code, this macro is intended to indicate that TCP is in a state where we can tell that it has received a FIN. However, the definition appears to be completely off the wall. I have redefined the macro to use the correct list of states. It's unclear what problems the original definition might have caused; possibly none would be observable. However, a correct definition is essential to my fix of the infinite FIN problem. Here are my diffs.

*** tcp_fsm.h.HOLD	Wed Dec  3 21:51:04 1986
--- tcp_fsm.h	Sun Aug 16 01:05:50 1987
***************
*** 23,29
! #define	TCPS_HAVERCVDFIN(s)	((s) >= TCPS_TIME_WAIT)
--- 24,34 -----
! #define finmask(s) (1 << (s))
! /* RU 32 redo HAVERCVDFIN.  Original had no resemblence to reality */
! #define	TCPS_HAVERCVDFIN(s)	(finmask(s) & (finmask(TCPS_CLOSE_WAIT) | \
! 					       finmask(TCPS_CLOSING) | \
! 					       finmask(TCPS_LAST_ACK) | \
! 					       finmask(TCPS_TIME_WAIT)))

*** tcp_input.c.HOLD	Sun Jul 12 02:50:06 1987
--- tcp_input.c	Sun Aug 16 00:46:50 1987
***************
*** 505,510
  	 */
  	todrop = tp->rcv_nxt - ti->ti_seq;
  	if (todrop > 0) {
  		if (tiflags & TH_SYN) {
  			tiflags &= ~TH_SYN;
  			ti->ti_seq++;
--- 505,516 -----
  	 */
  	todrop = tp->rcv_nxt - ti->ti_seq;
  	if (todrop > 0) {
+ 		/*
+ 		 * RU 32 - if fin already processed, rcv_nxt has been
+ 		 * incremented.  So we have to hack todrop to compensate.
+ 		 */
+ 		if ((tiflags & TH_FIN) && (TCPS_HAVERCVDFIN(tp->t_state)))
+ 			todrop--;
  		if (tiflags & TH_SYN) {
  			tiflags &= ~TH_SYN;
  			ti->ti_seq++;

*** uipc_socket.c.ORIG	Thu Sep 25 09:18:58 1986
--- uipc_socket.c	Sat Aug 15 04:42:08 1987
***************
*** 416,421
  	struct protosw *pr = so->so_proto;
  	int moff;
  
  	if (rightsp)
  		*rightsp = 0;
  	if (aname)
--- 416,424 -----
  	struct protosw *pr = so->so_proto;
  	int moff;
  
+ 	/* RU 29 - need procp for test of SIGURG.  see below */
+ 	register struct proc *p = u.u_procp;
+ 
  	if (rightsp)
  		*rightsp = 0;
  	if (aname)
***************
*** 527,534
  	}
  	eor = 0;
  	moff = 0;
! 	tomark = so->so_oobmark;
! 	do {
  		if (uio->uio_resid <= 0)
  			break;
  		len = uio->uio_resid;
--- 530,553 -----
  	}
  	eor = 0;
  	moff = 0;
! 	/*
! 	 * RU 29 - move calculation of tomark into loop, so we exit
! 	 * if push happens to arrive while we are in the loop with
! 	 * priority lowered.  Also, arrange to exit if we get SIGURG,
! 	 * so that we don't read over the urgent data, making it
! 	 * invalid.
! 	 */
! 	if (flags & MSG_PEEK)
! 		tomark = so->so_oobmark;
! 	if (((p)->p_sig & sigmask(SIGURG)) &&
! 	    ((p)->p_flag&STRC ||
! 	     (((p)->p_sig & sigmask(SIGURG)) &~
! 	      ((p)->p_sigignore | (p)->p_sigmask))))
! 		error = EINTR;
! 	else do {
! 		/* RU 29 - move tomark inside loop */
! 		if ((flags & MSG_PEEK) == 0)
! 			tomark = so->so_oobmark;
  		if (uio->uio_resid <= 0)
  			break;
  		len = uio->uio_resid;
***************
*** 573,579
  			if (tomark == 0)
  				break;
  		}
! 	} while (m && error == 0 && !eor);
  	if (flags & MSG_PEEK)
  		goto release;
  	if ((so->so_proto->pr_flags & PR_ATOMIC) && eor == 0)
--- 592,603 -----
  			if (tomark == 0)
  				break;
  		}
! 	/* RU 29 - stop if we get SIGURG */
! 	} while (m && error == 0 && !eor &&
! 		 ! (((p)->p_sig & sigmask(SIGURG)) &&
! 		    ((p)->p_flag&STRC ||
! 		     (((p)->p_sig & sigmask(SIGURG)) &~
! 		      ((p)->p_sigignore | (p)->p_sigmask)))));
  	if (flags & MSG_PEEK)
  		goto release;
  	if ((so->so_proto->pr_flags & PR_ATOMIC) && eor == 0)
hedrick@topaz.rutgers.edu (Charles Hedrick) (08/17/87)
After some additional testing and perusing of packet logs, I have convinced myself that the meltdown described in my previous message will be prevented if either of the hosts involved in the conversation has the fixes to tcp_input.c and tcp_fsm.h posted in that message. In that message I suggested that it might be dangerous to put up Van Jacobson's Sun TCP modifications unless you could make sure that every host in the world got the fix at the same time. Fortunately, this turns out not to be the case. Since the fix only needs to be present on one end, it is safe to put up the new TCP as long as all the hosts you put it up on get the fix.

I have so far been unable to explain what it is about the new TCP that causes the problem to happen. As far as I can see, the bug was present in SunOS 3.2, which means it was almost certainly present in 4.2BSD. As far as I can tell, the only reason it hasn't turned up before is that the timings happened to work out. I should note in passing that this entire business may be unimportant to other sites. Problems that happen every few hours at Rutgers happen every few months elsewhere, and every few months, things happen here that are simply impossible. It must be the air.