larry@hcr.UUCP (Larry Philps) (08/09/89)
Subject: Running out of mbufs can cause a system panic
Index:   sys/uipc_socket2.c 4.3BSD

Description:
	When a 4.3BSD system is accepting a new TCP connection, the
	following scenario can take place.

	i)    sonewconn gets called to allocate a new struct socket for
	      the connection.  It calls soqinsque with the old and new
	      sockets to queue the new one on the list of connections
	      waiting to be accepted by the user.  As well, so_head in the
	      new socket structure is set to point at the original socket.

	ii)   tcp_usrreq is called with the new socket and the cmd
	      PRU_ATTACH.  It in turn calls tcp_attach.

	iii)  tcp_attach calls in_pcballoc to allocate a new internet
	      protocol control block.  It then calls tcp_newtcpcb to
	      allocate a new tcp protocol control block.

	iv)   This is where the problem starts.  Suppose you are out of
	      available mbufs and cannot get another without sleeping, so
	      the call to tcp_newtcpcb fails.  (Very hard to reproduce,
	      but it happened to me!)

	v)    tcp_attach decides to take the connection apart, so it calls
	      in_pcbdetach to undo the in_pcballoc that completed
	      successfully.

	vi)   in_pcbdetach calls sofree to free the data held in the
	      socket.  sofree undoes the effect of the soqinsque done in
	      sonewconn by calling soqremque, then it sets so->so_head to
	      NULL.

	vii)  We return up the call chain to sonewconn, which calls
	      soqremque to undo the effect of its soqinsque.  Oops.
	      sofree has already done this, and the NULL value in so_head
	      will cause a segmentation violation in soqremque.

Repeat-By:
	Good luck!  The code in question was running on a multiprocessor
	version of System V Release 3 with BSD networking, and the number
	of locks involved in acquiring physical pages from shared memory
	made the probability of failing to get an mbuf much higher than on
	most single-processor implementations.  Even so, the code had been
	running for over 6 months before we saw this happen.
	You will probably have to hack tcp_attach to "fail" the
	tcp_newtcpcb call under some conditions if you want to see this
	happen.

Fix:
	The simplest way to prevent this problem is to avoid the second
	call to soqremque.  This is easy to do, since the routine that did
	the first one, sofree, set so_head to NULL afterwards.  So, just
	suppress the second call unless so_head is still non-NULL.  This
	will be the case if the in_pcballoc fails instead of the
	tcp_newtcpcb.

*** uipc_socket2.c.orig	Thu Jul 20 13:22:04 1989
--- uipc_socket2.c	Thu Jul 20 13:22:07 1989
***************
*** 194,200 ****
  	soqinsque(head, so, 0);
  	if ((*so->so_proto->pr_usrreq)(so, PRU_ATTACH,
  	    (struct mbuf *)0, (struct mbuf *)0, (struct mbuf *)0)) {
! 		(void) soqremque(so, 0);
  		MFREE(m, junk);
  		goto bad;
  	}
--- 194,213 ----
  	soqinsque(head, so, 0);
  	if ((*so->so_proto->pr_usrreq)(so, PRU_ATTACH,
  	    (struct mbuf *)0, (struct mbuf *)0, (struct mbuf *)0)) {
! 		/*
! 		 * If we got most of the way through the attach before
! 		 * failing on the tcp control block, in_pcbdetach could
! 		 * have been called, which in turn would have called sofree.
! 		 * sofree does soqremque(so, 0) and soqremque(so, 1), then
! 		 * sets so->so_head = NULL.  Thus another call here will
! 		 * cause a segmentation violation because soqremque does
! 		 * not check for a valid so->so_head field.
! 		 *
! 		 * We fix this here by not calling soqremque again if
! 		 * so->so_head is already NULL.
! 		 */
! 		if (so->so_head)
! 			(void) soqremque(so, 0);
  		MFREE(m, junk);
  		goto bad;
  	}

Larry Philps			HCR Corporation
130 Bloor St. West, 10th floor, Toronto, Ontario.  M5S 1N5
(416) 922-1937			{utzoo,utcsri,uunet}!hcr!larry