hedrick@topaz.rutgers.edu (Charles Hedrick) (08/16/87)
I believe I have found a set of bugs in 4.2 or 4.3 TCP. Some of them are the kind where you say "this can't be true; somebody would have noticed." Thus I'd appreciate it if somebody with a bit more experience with this code would take a look to see whether these are real. The most serious problem is with the FIN processing. The newest version of 4.3 networking, which we installed on the Sun from Van Jacobson's distribution, likes to send a sequence of packets that causes meltdowns. I believe that the meltdown will occur even if one of the two machines is still running the old code. If I'm right, we may be in for some problems unless everybody fixes this problem at the same time.

Here is a summary of the things reported in this message:

 - a problem in uipc_socket that causes out-of-band data to be lost.
   The effect is that when you rlogin to a host and run Emacs, the
   terminal mode does not change, and ^S stops output rather than
   doing a search.

 - a problem in uipc_socket that would cause push to get lost.  This
   may not have any actual effects, since the 4.x TCP implementation
   seems to more or less ignore push.  (Pyramid's version does not,
   and that's where we developed the fix.)

 - a problem where we get into a meltdown caused by misprocessing of
   FINs.

 - unknown but probably not serious problems due to a misdefinition
   of TCPS_HAVERCVDFIN.

Here are the details:

(1) Out-of-band data gets lost. In 4.3 TCP, out-of-band data is removed from the normal data stream; however, the point where it appeared is saved. If you ever read the normal data past that point, the out-of-band data is invalidated. In order to be sure you see all out-of-band data, you have to set up a handler for SIGURG. The assumption is that the out-of-band data will trigger the SIGURG immediately, so that you go look at the out-of-band data before continuing to read the normal data. Unfortunately, if out-of-band data arrives at just the wrong time, SIGURG may not trigger soon enough.
To fix this, I make the socket read code test for a pending SIGURG. If there is a pending SIGURG, you want to process it before going ahead to read the normal data.

(2) Push gets lost (at least on the Pyramid). In the socket read routine, a variable "tomark" gets set if there is a push mark. The read is supposed to stop at this point. Unfortunately, tomark gets set before a loop. Inside that loop, the kernel has to turn on interrupts in order to copy data to the user. At this point it is possible that a packet could arrive and get processed. If it has a push mark, then the next time around the loop you want to use it. So I claim that the calculation of tomark should be moved inside the loop.

(3) Here is a TCP trace showing our meltdown. This trace was taken between two Suns running Van's code:

11:01:27.48 RUTGERS.EDU.nntp > ARAMIS.RUTGERS.EDU.1735: P 1:98(97) ack 1 win 4096
11:01:27.54 ARAMIS.RUTGERS.EDU.1735 > RUTGERS.EDU.nntp: P 1:34(33) ack 98 win 4096
11:01:27.72 RUTGERS.EDU.nntp > ARAMIS.RUTGERS.EDU.1735: . ack 34 win 4096
11:01:27.74 RUTGERS.EDU.nntp > ARAMIS.RUTGERS.EDU.1735: P 98:134(36) ack 34 win 4096
11:01:27.82 ARAMIS.RUTGERS.EDU.1735 > RUTGERS.EDU.nntp: . ack 134 win 4096
11:01:27.82 ARAMIS.RUTGERS.EDU.1735 > RUTGERS.EDU.nntp: P 34:40(6) ack 134 win 4096
11:01:27.82 RUTGERS.EDU.nntp > ARAMIS.RUTGERS.EDU.1735: P 134:189(55) ack 40 win 4096
11:01:27.84 RUTGERS.EDU.nntp > ARAMIS.RUTGERS.EDU.1735: F 189:189(0) ack 40 win 4096
11:01:27.84 ARAMIS.RUTGERS.EDU.1735 > RUTGERS.EDU.nntp: F 40:40(0) ack 189 win 4096
11:01:27.84 ARAMIS.RUTGERS.EDU.1735 > RUTGERS.EDU.nntp: F 40:40(0) ack 190 win 4096
11:01:27.84 RUTGERS.EDU.nntp > ARAMIS.RUTGERS.EDU.1735: F 189:189(0) ack 41 win 4096

For the next several minutes, the two machines sent FINs back and forth at a very high rate. About 200,000 packets were sent during this period. This problem happens to us several times a day, always between the same pair of hosts.
I believe the key to the problem is the fact that Aramis sends two FINs. The first one acks the last data that was sent; the second one acks the other side's FIN. The bug turns out to be that the ack field from a FIN is only used if it is the first FIN received. The problem is that the first FIN received causes rcv_nxt to be incremented. From then on, todrop turns out to be nonzero, and the packet is dropped as if it were a duplicate. That's easy enough to fix. The fix needs to use the macro TCPS_HAVERCVDFIN to figure out whether rcv_nxt has been incremented. As far as I can tell from the code, this macro is intended to indicate that TCP is in a state where we can tell that it has received a FIN. However, the definition appears to be completely off the wall. I have redefined the macro to use the correct list of states. It's unclear what problems the original definition might have caused; possibly none would be observable. However, a correct definition is essential to my fix of the infinite FIN problem. Here are my diffs.

*** tcp_fsm.h.HOLD	Wed Dec  3 21:51:04 1986
--- tcp_fsm.h	Sun Aug 16 01:05:50 1987
***************
*** 23,29
! #define	TCPS_HAVERCVDFIN(s)	((s) >= TCPS_TIME_WAIT)
--- 24,34 -----
! #define finmask(s) (1 << (s))
! /* RU 32 redo HAVERCVDFIN.  Original had no resemblence to reality */
! #define	TCPS_HAVERCVDFIN(s)	(finmask(s) & (finmask(TCPS_CLOSE_WAIT) | \
! 					       finmask(TCPS_CLOSING) | \
! 					       finmask(TCPS_LAST_ACK) | \
! 					       finmask(TCPS_TIME_WAIT)))

*** tcp_input.c.HOLD	Sun Jul 12 02:50:06 1987
--- tcp_input.c	Sun Aug 16 00:46:50 1987
***************
*** 505,510
  	 */
  	todrop = tp->rcv_nxt - ti->ti_seq;
  	if (todrop > 0) {
  		if (tiflags & TH_SYN) {
  			tiflags &= ~TH_SYN;
  			ti->ti_seq++;
--- 505,516 -----
  	 */
  	todrop = tp->rcv_nxt - ti->ti_seq;
  	if (todrop > 0) {
+ 		/*
+ 		 * RU 32 - if fin already processed, rcv_nxt has been
+ 		 * incremented.  So we have to hack todrop to compensate.
+ 		 */
+ 		if ((tiflags & TH_FIN) && (TCPS_HAVERCVDFIN(tp->t_state)))
+ 			todrop--;
  		if (tiflags & TH_SYN) {
  			tiflags &= ~TH_SYN;
  			ti->ti_seq++;

*** uipc_socket.c.ORIG	Thu Sep 25 09:18:58 1986
--- uipc_socket.c	Sat Aug 15 04:42:08 1987
***************
*** 416,421
  	struct protosw *pr = so->so_proto;
  	int moff;
  
  	if (rightsp)
  		*rightsp = 0;
  	if (aname)
--- 416,424 -----
  	struct protosw *pr = so->so_proto;
  	int moff;
  
+ 	/* RU 29 - need procp for test of SIGURG.  see below */
+ 	register struct proc *p = u.u_procp;
+ 
  	if (rightsp)
  		*rightsp = 0;
  	if (aname)
***************
*** 527,534
  	}
  	eor = 0;
  	moff = 0;
! 	tomark = so->so_oobmark;
! 	do {
  		if (uio->uio_resid <= 0)
  			break;
  		len = uio->uio_resid;
--- 530,553 -----
  	}
  	eor = 0;
  	moff = 0;
! 	/*
! 	 * RU 29 - move calculation of tomark into loop, so we exit
! 	 * if push happens to arrive while we are in the loop with
! 	 * priority lowered.  Also, arrange to exit if we get SIGURG,
! 	 * so that we don't read over the urgent data, making it
! 	 * invalid.
! 	 */
! 	if (flags & MSG_PEEK)
! 		tomark = so->so_oobmark;
! 	if (((p)->p_sig & sigmask(SIGURG)) &&
! 	    ((p)->p_flag&STRC ||
! 	     (((p)->p_sig & sigmask(SIGURG)) &~
! 	      ((p)->p_sigignore | (p)->p_sigmask))))
! 		error = EINTR;
! 	else do {
! 		/* RU 29 - move tomark inside loop */
! 		if ((flags & MSG_PEEK) == 0)
! 			tomark = so->so_oobmark;
  		if (uio->uio_resid <= 0)
  			break;
  		len = uio->uio_resid;
***************
*** 573,579
  			if (tomark == 0)
  				break;
  		}
! 	} while (m && error == 0 && !eor);
  	if (flags & MSG_PEEK)
  		goto release;
  	if ((so->so_proto->pr_flags & PR_ATOMIC) && eor == 0)
--- 592,603 -----
  			if (tomark == 0)
  				break;
  		}
! 	/* RU 29 - stop if we get SIGURG */
! 	} while (m && error == 0 && !eor &&
! 		 ! (((p)->p_sig & sigmask(SIGURG)) &&
! 		    ((p)->p_flag&STRC ||
! 		     (((p)->p_sig & sigmask(SIGURG)) &~
! 		      ((p)->p_sigignore | (p)->p_sigmask)))));
  	if (flags & MSG_PEEK)
  		goto release;
  	if ((so->so_proto->pr_flags & PR_ATOMIC) && eor == 0)
hedrick@topaz.rutgers.edu (Charles Hedrick) (08/17/87)
After some additional testing and perusing of packet logs, I have convinced myself that the meltdown described in my previous message will be prevented if either of the hosts involved in the conversation has the fixes to tcp_input.c and tcp_fsm.h posted in that message. In that message I suggested that it might be dangerous to put up Van Jacobson's Sun TCP modifications unless you could make sure that every host in the world got the fix at the same time. Fortunately, this turns out not to be the case. Since the fix only needs to be present on one end, it is safe to put up the new TCP as long as all the hosts you put it up on get the fix.

I have so far been unable to explain what it is about the new TCP that causes the problem to happen. As far as I can see, the bug was present in SunOS 3.2, which means it was almost certainly present in 4.2BSD. As far as I can tell, the only reason it hasn't turned up before is that the timings happened to work out. I should note in passing that this entire business may be unimportant to other sites. Problems that happen every few hours at Rutgers happen every few months elsewhere, and every few months, things happen here that are simply impossible. It must be the air.