[comp.bugs.4bsd] Bug in 4.3bsd networking

larry@hcr.UUCP (Larry Philps) (08/09/89)
Subject: Running out of mbufs can cause a system panic
Index:	sys/uipc_socket2.c 4.3BSD

Description:
	When a 4.3BSD system is accepting a new TCP connection, the following
	scenario can take place.
	    i) sonewconn gets called to allocate a new struct socket for the 
	       connection.  It calls soqinsque with the old and new sockets
	       to queue the new one on the list of connections waiting to
	       be accepted by the user.  As well, so_head in the new socket
	       structure is set to point at the original socket.
	   ii) tcp_usrreq is called with the new socket and the cmd PRU_ATTACH.
	       It in turn calls tcp_attach.
	  iii) tcp_attach calls in_pcballoc to allocate a new internet protocol
	       control block.  It then calls tcp_newtcpcb to allocate a new
	       tcp protocol control block.
	   iv) This is where the problem starts.  Say you are out of available
	       mbufs and can't get another without sleeping so that the call
	       to tcp_newtcpcb fails.  (Very hard to reproduce, but it happened
	       to me!)
	    v) tcp_attach decides to take the connection apart, so it calls
	       in_pcbdetach to undo the in_pcballoc that completed succesfully.
	   vi) in_pcbdetach calls sofree to free the data held in the socket.
	       sofree undoes the the effect of the soqinsque done in sonewconn
	       by calling soqremque, then it sets so->so_head to NULL.
	  vii) We return up the call chain to sonewconn who calls soqremque to
	       undo the effect of its soqinsque.  Ooops.  sofree has already
	       done this, and a NULL value in so_head will cause a segmentation
	       violation in soqremque.

Repeat-By:
	Good Luck!  The code in question was running on a multiprocessor version
	of System V Release 3 with BSD networking, and the number of locks
	involved in acquiring physical pages from the shared memory made the
	probability of failing to get an mbuf much higher than on most single
	processor implementations.  Even so, the code had been running for over
	6 months before we saw this happen.

	You will probably have to hack tcp_attach to "fail" the tcp_newtcpcb
	call under some conditions if you want to see this happen.

Fix:
	The simplest thing to do to prevent this problem is to avoid the second
	call to soqremque.  This is easy to do since the routine that did the
	first one, sofree, set so_head to NULL afterwards.  So, just supress the
	second call unless so_head is still non-NULL.  This will be the case
	if the in_pcballoc fails instead of the tcp_newtcpcb.

*** uipc_socket2.c.orig	Thu Jul 20 13:22:04 1989
--- uipc_socket2.c	Thu Jul 20 13:22:07 1989
***************
*** 194,200 ****
  	soqinsque(head, so, 0);
  	if ((*so->so_proto->pr_usrreq)(so, PRU_ATTACH,
  	    (struct mbuf *)0, (struct mbuf *)0, (struct mbuf *)0)) {
! 		(void) soqremque(so, 0);
  		MFREE(m, junk);
  		goto bad;
  	}
--- 194,213 ----
  	soqinsque(head, so, 0);
  	if ((*so->so_proto->pr_usrreq)(so, PRU_ATTACH,
  	    (struct mbuf *)0, (struct mbuf *)0, (struct mbuf *)0)) {
! 		/*
! 		 * If we got most of the way through the attach before
! 		 * failing on the tcp control block, in_pcbdetach could
! 		 * have been called, which in turn would have called sofree.
! 		 * sofree does soqremque(so, 0) and soqremque(so, 1) then
! 		 * sets so->so_head = NULL.  Thus another call here will cause
! 		 * a segmentation violation becasue soqremque does not check
! 		 * for a valid so->so_head field.
! 		 *
! 		 * We fix this here by not calling soremque again if so->head
! 		 * is already NULL.
! 		 */
! 		if (so->so_head)
! 			(void) soqremque(so, 0);
  		MFREE(m, junk);
  		goto bad;
  	}

Larry Philps                             HCR Corporation
130 Bloor St. West, 10th floor           Toronto, Ontario.  M5S 1N5
(416) 922-1937                           {utzoo,utcsri,uunet}!hcr!larry