[comp.bugs.4bsd] TIME_WAIT sockets clog system

scs@adam.pika.mit.edu (Steve Summit) (07/04/89)

There is an interesting discussion going on in comp.bugs.2bsd
about an out-of-mbufs problem caused by an mget in ftp.  The
problem obviously occurs primarily on a pdp11 with its limited
memory, but the 2.10bsd code is taken directly from the VAX
version, and I have noticed the same problem (and indeed the
original submitter acknowledges the possibility in the excerpt
from his posting I've reproduced below) when doing an mput (as I
recall) on an overloaded MicroVAX being used as a file server.

In article <comp.bugs.2bsd:33132@wlbr.IMSD.CONTEL.COM> sms@wlv.imsd.contel.com (Steven M. Schultz(Y)) writes:
>Subject: TIME_WAIT sockets clog system (part 2)
>Index:	sys/sys/uipc_mbuf.c 2.10BSD
>
>Description:
>	Sockets in a TIME_WAIT state can constipate the networking
>	buffer memory when generated in rapid succession by, for
>	example, an "mget" in an ftp session.  If more than a dozen
>	or so small files are transferred in rapid succession over
>	an ethernet, all the mbufs in the system will be taken up
>	by sockets in a TIME_WAIT state (from the socket opened for
>	each data transfer).
>
>Repeat-By:
>	ftp in to a 2.10.1BSD system, do an "mget *" in a largish
>	directory.  note that the transfer will hang/develop problems
>	after about a dozen to twenty files.  the 2.10.1BSD system
>	has run out of mbufs and will recover in a minute or so (hopefully).
>	It should be noted that even a Vax could be run out of mbufs
>	if the directory were large enough and network memory was full
>	due to other causes.

There is some debate about the efficacy of the proposed fix,
which involves fleshing out the (previously stubbed) tcp_drain
routine.

                                            Steve Summit
                                            scs@adam.pika.mit.edu

sms@wlv.imsd.contel.com (Steven M. Schultz) (07/04/89)

In article <12417@bloom-beacon.MIT.EDU> scs@adam.pika.mit.edu (Steve Summit) writes:

>There is an interesting discussion going on in comp.bugs.2bsd
>about an out-of-mbufs problem caused by an mget in ftp.  The
>problem obviously occurs primarily on a pdp11 with its limited
>memory, but the 2.10bsd code is taken directly from the VAX
>version, and I have noticed the same problem (and indeed the
>original submitter acknowledges the possibility in the excerpt
>from his posting I've reproduced below) when doing an mput (as I
>recall) on an overloaded MicroVAX being used as a file server.

	ahhh, so others have seen the problem on larger machines.  i had
	not seen any other references before, so i thought it only a
	'theoretical' possibility to run a vax out of network memory.

>There is some debate about the efficacy of the proposed fix,
>which involves fleshing out the (previously stubbed) tcp_drain
>routine.

	the pitfalls of my proposed change to the mbuf allocator
	have been made known to me (i really should have known better).
	an alternative solution is-being/has-been prepared.  

	a small change to mbuf.h is made, adding a new 'wait' flag
	and modifying the MGET macro to test whether it is safe
	(i.e. not being at splimp) to manipulate the tcb chain(s).
	the 0340 and 0100 are the processor priority mask and network
	priority (2) level for the pdp-11, but hopefully the idea is clear.
	ideally the appropriate symbolic names should be used, but
	"real work" reared its head ;-)

	the idea is to add another state that will NOT sleep, but WILL
	invoke the drain code if the network code was at splnet.  (thanks
	to Dan Lanciani - ddl@harvard.harvard.edu for pointers in this
	area).
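
	A minimal user-space sketch of that test, assuming the pdp-11
	values quoted above (0340 as the priority field mask, 0100 as
	the network level).  The function name effective_wait() and the
	macro names PSW_PRIMASK and PSW_NETPRI are made up for
	illustration and do not appear in the 2.10BSD sources:

```c
#include <assert.h>

/* flags to m_get, as in the mbuf.h change */
#define M_DONTWAIT      0
#define M_WAIT          1
#define M_DONTWAITLONG  2

/* Assumed pdp-11 values taken from the post; the names are hypothetical. */
#define PSW_PRIMASK     0340    /* processor priority field mask */
#define PSW_NETPRI      0100    /* network priority (level 2) */

/*
 * Given the priority word saved by splimp() and the caller's wait flag,
 * return the flag that would be handed to m_more().  Only M_DONTWAIT is
 * upgraded, and only when the caller was at or below network priority.
 */
static int
effective_wait(int ms, int how)
{
	if ((ms & PSW_PRIMASK) <= PSW_NETPRI && how == M_DONTWAIT)
		return (M_DONTWAITLONG);
	return (how);
}
```

	With the saved priority at or below the network level an
	M_DONTWAIT request becomes M_DONTWAITLONG; at interface
	priority it is left alone, and M_WAIT is never altered.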

	it would be enlightening to know why sockets stay around so long
	in a TIME_WAIT state (especially on a LAN) and what would break
	if the timeout interval were reduced.

	the tcp_drain() modification with the removal
	of the unnecessary splimp call seems adequate.  here's what
	tcp_drain() looks like at the moment:

tcp_drain()
{
	register struct inpcb *ip, *ipnxt;
	register struct tcpcb *tp;

	/*
	 * Search through tcb's and look for TIME_WAIT states to liberate,
	 * these are due to go away soon anyhow and we're short of space or
	 * we wouldn't be here...
	 */
	ip = tcb.inp_next;
	if (ip == 0)
		return;
	for (; ip != &tcb; ip = ipnxt) {
		ipnxt = ip->inp_next;
		tp = intotcpcb(ip);
		if (tp == 0)
			continue;
		if (tp->t_state == TCPS_TIME_WAIT)
			tcp_close(tp);
	}
}
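
	The detail worth noticing in the loop is that the next pointer
	is saved before tcp_close() is called, since closing unlinks and
	frees the current entry.  A user-space sketch of the same walk
	over a circular list, where struct pcb, pcb_add(), pcb_close()
	and the ST_* states are hypothetical stand-ins for the kernel's
	inpcb, tcp_close() and TCPS_* names:

```c
#include <assert.h>
#include <stdlib.h>

enum { ST_ESTABLISHED, ST_TIME_WAIT };	/* stand-ins for TCPS_* states */

struct pcb {				/* hypothetical stand-in for struct inpcb */
	struct pcb *next, *prev;
	int state;
};

/* Make an empty circular list; the head plays the role of `tcb'. */
static void
list_init(struct pcb *head)
{
	head->next = head->prev = head;
}

/* Append a new entry in the given state; returns NULL if malloc fails. */
static struct pcb *
pcb_add(struct pcb *head, int state)
{
	struct pcb *p = malloc(sizeof *p);

	if (p == NULL)
		return (NULL);
	p->state = state;
	p->next = head;
	p->prev = head->prev;
	head->prev->next = p;
	head->prev = p;
	return (p);
}

/* Unlink and free one entry, as tcp_close() eventually does. */
static void
pcb_close(struct pcb *p)
{
	p->prev->next = p->next;
	p->next->prev = p->prev;
	free(p);
}

/* Free every TIME_WAIT entry; returns how many were released. */
static int
drain_time_wait(struct pcb *head)
{
	struct pcb *p, *nxt;
	int freed = 0;

	for (p = head->next; p != head; p = nxt) {
		nxt = p->next;		/* save first: p may be freed below */
		if (p->state == ST_TIME_WAIT) {
			pcb_close(p);
			freed++;
		}
	}
	return (freed);
}

/* Count the entries on the list, excluding the head. */
static int
list_len(const struct pcb *head)
{
	const struct pcb *p;
	int n = 0;

	for (p = head->next; p != head; p = p->next)
		n++;
	return (n);
}
```

	Writing the loop as "p = p->next" after the close would read
	freed memory, which is why the walk carries two pointers.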

	and the change to mbuf.h:

/* flags to m_get */
#define	M_DONTWAIT	0
#define	M_WAIT		1
#define	M_DONTWAITLONG	2		/* THIS IS NEW */
	...
#define	MGET(m, i, t) \
	{ int ms = splimp(); \
	  if ((m)=mfree) \
		{ if ((m)->m_type != MT_FREE) panic("mget"); (m)->m_type = t; \
		  mbstat.m_mtypes[MT_FREE]--; mbstat.m_mtypes[t]++; \
		  mfree = (m)->m_next; (m)->m_next = 0; \
		  (m)->m_off = MMINOFF; } \
	  else \
		(m) = m_more((((ms&0340) <= 0100) && (i==M_DONTWAIT)) ? M_DONTWAITLONG : i, t); \
	  splx(ms); }

dls@mace.cc.purdue.edu (David L Stevens) (07/04/89)

In article <33437@wlbr.IMSD.CONTEL.COM>, sms@wlv.imsd.contel.com (Steven M. Schultz) writes:
> 	it would be enlightening to know why sockets stay around so long
> 	in a TIME_WAIT state (especially on a LAN) and what would break
> 	if the timeout interval were reduced.

	Since TCP works on top of IP, in theory, anyway, it can't tell the
difference between a LAN and a 5-hop internet connection. The timeout is
(perhaps unfortunately) fixed at 2*(Maximum Segment Lifetime) with no
assumptions about the connection.
	One bad thing that can happen if you time out too fast is that, if
the ACK to the remote guy's FIN is dropped, you won't be there to ACK his
retransmission and he'll be stuck in LAST_ACK. On pre-tahoe (maybe pre-4.3--
too many distributions!) systems, that wait is forever; current BSD code
times out of LAST_ACK. Non BSD, RFC-793 conforming code will still wait
forever.
	Another way of handling this is to free all of the resources except
a stub PCB with just the basic info. Pretty much the only function of a
TIME_WAIT state endpoint is to re-ACK retransmitted FINs.
	Yet another way, not to TCP spec but in the right spirit, is to
make the TIME_WAIT interval a function of the srt (smoothed round-trip
time). On fast connections, it'd go away quicker. 2*MSL is the absolute
worst case, which probably never happens.
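
	One way to picture an interval that scales with the smoothed
round-trip time is sketched below. Nothing here comes from the BSD
sources: the 30-second MSL, the multiplier, the floor and the function name
time_wait_ms() are all assumptions chosen for illustration, and the result
is only capped at 2*MSL so it never exceeds the spec value.

```c
#include <assert.h>

#define MSL_MS		30000L	/* assumed maximum segment lifetime: 30 s */
#define SRTT_MULT	8L	/* hypothetical multiplier, not from the thread */
#define MIN_WAIT_MS	1000L	/* hypothetical floor for very fast links */

/*
 * Return a TIME_WAIT interval in milliseconds that grows with the
 * smoothed round-trip time but never exceeds 2*MSL.
 */
static long
time_wait_ms(long srtt_ms)
{
	long cap = 2 * MSL_MS;
	long t = srtt_ms * SRTT_MULT;

	if (t < MIN_WAIT_MS)
		t = MIN_WAIT_MS;
	if (t > cap)
		t = cap;
	return (t);
}
```

	With these made-up constants a LAN connection with a few
milliseconds of round-trip time would wait about a second, while a very
slow path would still get the full 2*MSL.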
-- 
					+-DLS  (dls@mace.cc.purdue.edu)

sms@wlv.imsd.contel.com (Steven M. Schultz) (07/09/89)

	Here are the latest changes in an attempt to alleviate mbuf
	exhaustion from sockets persisting in the TIME_WAIT state
	(caused by, for example, "ftp mget/mput" in a directory with
	many short files).

	There are 3 modules to be changed:  /sys/h/mbuf.h,
	/sys/netinet/tcp_subr.c and /sys/sys/uipc_mbuf.c.  The first
	change is given as a context diff suitable for 'patch'; the
	last two are replacement functions.

	The concept behind the changes is this: when the mbufs are
	exhausted, check whether the current processor priority is at
	or below the NET level (i.e. not running at interface priority)
	and, if so, use the M_DONTWAITLONG state instead of M_DONTWAIT.
	This ensures that we will NOT sleep(), but that it is safe to
	call the drain routines (which manipulate the tcb list among
	other things).

	The change to mbuf.h adds the new 'wait' state and modifies the
	MGET macro.  The "mysterious" numbers 0340 and 0100 are the
	processor priority field mask and SPLNET respectively; the
	appropriate symbolic defines SHOULD have been used, but i didn't
	have the time to futz with the necessary "ifdef/include" sequences
	to incorporate the proper header files.

	At the present time, the change to tcp_drain() only looks for
	sockets in the TIME_WAIT state to remove - this is reasonably
	safe since these are due to expire shortly anyhow.  If other
	suggestions for augmenting the tcp_drain() arrive, they can easily
	be incorporated.

	m_expand() is essentially a 4.3BSD version with an ifdef for the pdp11
	since m_clalloc() isn't implemented.

	
*** mbuf.h.old	Mon Jul  3 11:35:32 1989
--- mbuf.h	Mon Jul  3 13:50:01 1989
***************
*** 83,88 ****
--- 83,89 ----
  /* flags to m_get */
  #define	M_DONTWAIT	0
  #define	M_WAIT		1
+ #define	M_DONTWAITLONG	2
  
  /* flags to m_pgalloc */
  #define	MPG_MBUFS	0		/* put new mbufs on free list */
***************
*** 106,112 ****
  		  mfree = (m)->m_next; (m)->m_next = 0; \
  		  (m)->m_off = MMINOFF; } \
  	  else \
! 		(m) = m_more(i, t); \
  	  splx(ms); }
  /*
   * Mbuf page cluster macros.
--- 107,113 ----
  		  mfree = (m)->m_next; (m)->m_next = 0; \
  		  (m)->m_off = MMINOFF; } \
  	  else \
! 		(m) = m_more((((ms&0340) <= 0100) && (i==M_DONTWAIT)) ? M_DONTWAITLONG : i, t); \
  	  splx(ms); }
  /*
   * Mbuf page cluster macros.

==========================================================================

tcp_drain()
{
	register struct inpcb *ip, *ipnxt;
	register struct tcpcb *tp;

	/*
	 * Search through tcb's and look for TIME_WAIT states to liberate,
	 * these are due to go away soon anyhow and we're short of space or
	 * we wouldn't be here...
	 */
	ip = tcb.inp_next;
	if (ip == 0)
		return;
	for (; ip != &tcb; ip = ipnxt) {
		ipnxt = ip->inp_next;
		tp = intotcpcb(ip);
		if (tp == 0)
			continue;
		if (tp->t_state == TCPS_TIME_WAIT)
			tcp_close(tp);
	}
}

==============================================================================

m_expand(canwait)
	int canwait;
{
	register struct domain *dp;
	register struct protosw *pr;
	register int tries;

	for (tries = 0;; ) {
#ifdef	pdp11
		if (mfree)
			return (1);
#else
		if (m_clalloc(1, MPG_MBUFS, canwait))
			return (1);
#endif
		if (canwait == M_DONTWAIT || tries++)
			return (0);

		/* ask protocols to free space */
		for (dp = domains; dp; dp = dp->dom_next)
			for (pr = dp->dom_protosw; pr < dp->dom_protoswNPROTOSW;
			    pr++)
				if (pr->pr_drain)
					(*pr->pr_drain)();
		mbstat.m_drain++;
	}
}
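
	The effect of the new flag on this loop can be tried out in user
	space.  The sketch below mirrors the pdp11 branch of m_expand()
	only; free_mbufs, reclaimable, drain_protocols() and expand()
	are hypothetical stand-ins for the mfree list, the buffers held
	by TIME_WAIT sockets, the pr_drain calls and m_expand() itself:

```c
#include <assert.h>

/* flags to m_get, as in the mbuf.h change */
#define M_DONTWAIT      0
#define M_WAIT          1
#define M_DONTWAITLONG  2

static int free_mbufs;		/* stand-in for the mfree list */
static int reclaimable;		/* buffers a drain pass could release */
static int drain_calls;		/* plays the role of mbstat.m_drain */

/* Stand-in for calling each protocol's pr_drain routine. */
static void
drain_protocols(void)
{
	free_mbufs += reclaimable;
	reclaimable = 0;
	drain_calls++;
}

/*
 * Same control flow as the pdp11 case of m_expand(): succeed if a
 * buffer is free, give up at once for M_DONTWAIT, otherwise drain
 * once and look again before giving up.
 */
static int
expand(int canwait)
{
	int tries;

	for (tries = 0;; ) {
		if (free_mbufs > 0)
			return (1);
		if (canwait == M_DONTWAIT || tries++)
			return (0);
		drain_protocols();
	}
}
```

	M_DONTWAIT fails without touching the protocols, while
	M_DONTWAITLONG gets exactly one drain pass and a second look
	at the free list, and never sleeps inside this routine.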