[comp.bugs.2bsd] TIME_WAIT sockets clog system

sms@wlv.imsd.contel.com (Steven M. Schultz(Y)) (06/29/89)

Subject: TIME_WAIT sockets clog system (part 1)
Index:	sys/netinet/tcp_subr.c 2.10BSD

Description:
	Sockets in a TIME_WAIT state can constipate the networking
	buffer memory when generated in rapid succession by, for
	example, an "mget" in an ftp session.  If more than a dozen
	or so small files are transferred in rapid succession over
	an ethernet, all the mbufs in the system will be taken up
	by sockets in a TIME_WAIT state (from the socket opened for
	each data transfer).

Repeat-By:
	ftp in to a 2.10.1BSD system, do an "mget *" in a largish
	directory.  note that the transfer will hang/develop problems
	after about a dozen to twenty files.  the 2.10.1BSD system
	has run out of mbufs and will recover in a minute or so (hopefully).

Fix:
	Apply this patch to netinet/tcp_subr.c, this fix implements
	a tcp_drain() function that actually does something (the 
	original 4.3BSD tcp_drain() was a null function).  The other
	half of this fix will follow shortly and consists of a small
	modification to uipc_mbuf to invoke the drain routine at the
	appropriate time.

	The comment in the function pretty much sums the situation up.

*** tcp_subr.c.old	Thu Apr 28 16:22:56 1988
--- tcp_subr.c	Fri Jun  9 13:57:29 1989
***************
*** 251,257 ****
--- 251,279 ----
  
  tcp_drain()
  {
+ 	register struct inpcb *ip, *ipnxt;
+ 	register struct tcpcb *tp;
+ 	int s = splimp();
  
+ 	/*
+ 	 * Search through tcb's and look for TIME_WAIT states to liberate,
+ 	 * these are due to go away soon anyhow and we're short of space or
+  	 * we wouldn't be here...
+ 	 */
+ 	ip = tcb.inp_next;
+ 	if (ip == 0) {
+ 		splx(s);
+ 		return;
+ 	}
+ 	for (; ip != &tcb; ip = ipnxt) {
+ 		ipnxt = ip->inp_next;
+ 		tp = intotcpcb(ip);
+ 		if (tp == 0)
+ 			continue;
+ 		if (tp->t_state == TCPS_TIME_WAIT)
+ 			tcp_close(tp);
+ 	}
+ 	splx(s);
  }
  
  /*

sms@wlv.imsd.contel.com (Steven M. Schultz(Y)) (06/29/89)

Subject: TIME_WAIT sockets clog system (part 2)
Index:	sys/sys/uipc_mbuf.c 2.10BSD

Description:
	Sockets in a TIME_WAIT state can constipate the networking
	buffer memory when generated in rapid succession by, for
	example, an "mget" in an ftp session.  If more than a dozen
	or so small files are transferred in rapid succession over
	an ethernet, all the mbufs in the system will be taken up
	by sockets in a TIME_WAIT state (from the socket opened for
	each data transfer).

Repeat-By:
	ftp in to a 2.10.1BSD system, do an "mget *" in a largish
	directory.  note that the transfer will hang/develop problems
	after about a dozen to twenty files.  the 2.10.1BSD system
	has run out of mbufs and will recover in a minute or so (hopefully).
	It should be noted that even a Vax could be run out of mbufs
	if the directory were large enough and network memory was full
	due to other causes.

Fix:
	Apply this patch to /sys/sys/uipc_mbuf.c.  The effect of this fix
	is to call the drain routines at least once - the earlier version
	of m_more() checked the waitability state and rejected the 
	request immediately rather than attempt non-blocking methods
	of freeing up memory first.

	It is (as far as i can see - and yes, these fixes have been tested)
	safe to call the drain routines at this point for two reasons:  
	1) the tcp_drain() routine that existed prior to part 1 of this 
	fix did nothing AND the NEW tcp_drain() routine is guaranteed not 
	to sleep(), 2) the ip_drain() routine only releases fragments and 
	does not sleep() at any point along the way.
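
	For reference, this is roughly what the 4.3BSD ip_drain() looks
	like (the 2.10BSD copy may differ in minor detail); it simply
	frees the fragment reassembly queues and never sleeps:

ip_drain()
{

	while (ipq.next != &ipq)
		(void) ip_freef(ipq.next);
}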

	With both part 1 and 2 installed a "ftp mget *" of /etc was done
	with no problems and 8 calls to the drain routines (as reported
	by 'netstat -s').

*** uipc_mbuf.old	Tue Jun 27 23:20:46 1989
--- uipc_mbuf.c	Tue Jun 27 23:23:14 1989
***************
*** 164,177 ****
  {
  	register struct mbuf *m;
  
- 	if (canwait == M_DONTWAIT) {
- 		mbstat.m_drops++;
- 		return (NULL);
- 	}
  	for(;;) {
  		m_expand(canwait);
  		if (mfree)
  			break;
  		mbstat.m_wait++;
  		m_want++;
  		SLEEP((caddr_t)&mfree, PZERO - 1);
--- 164,177 ----
  {
  	register struct mbuf *m;
  
  	for(;;) {
  		m_expand(canwait);
  		if (mfree)
  			break;
+ 		if (canwait == M_DONTWAIT) {
+ 			mbstat.m_drops++;
+ 			return (NULL);
+ 		}
  		mbstat.m_wait++;
  		m_want++;
  		SLEEP((caddr_t)&mfree, PZERO - 1);

ddl@husc6.harvard.edu (Dan Lanciani) (06/30/89)

	There is a potential problem with the proposed modifications
to the mbuf allocator.  The "canwait" argument to the allocation
routines is more than just an indication of whether it is ok to sleep.
It is also a subtle hint that the routine was not called from a
(potentially plimp) interrupt routine.  If an allocation routine
is called from, e.g., the ethernet interrupt it would be incorrect
to (re)enter the network at plnet because (1) the network code may
inadvertently lower the pl and (2) the network code may itself have
been interrupted at a critical section.
	Unfortunately, to be useful, most *_drain routines must
ultimately access global structures which are protected only by
splnet's and thus are eventually likely to cause corruption.  It
may be possible to check the current pl on entry to the allocator
(tricky because of the macros) and distinguish three classes of
request (from least to most restrictive):  task-time (may sleep
and/or call the network), interrupt-time where previous pl was
less than plnet (may call the network), and interrupt-time where
previous pl was plnet or higher.  Of course, since you cannot
find the true previous pl, you would have to assume that the current
pl is higher than the previous and work from there.  This scheme
breaks down if any plnet code calls the allocator in the expectation
that it won't reenter, but such cases could be fixed.
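	A minimal sketch of the three classes, assuming the saved status
word returned by splimp() were made available to the allocator (the
function and class names here are made up; 0340 is the pdp-11 priority
field mask and 0100 the network level):

/*
 * Sketch only: classify an allocation request by what the caller
 * said (M_WAIT) and by the priority it held before entering the
 * allocator.
 */
#define	TASKTIME	0	/* may sleep and call the network */
#define	NETTIME		1	/* may call the network, must not sleep */
#define	IMPTIME		2	/* must not touch the network at all */

alloc_class(canwait, ps)
	int canwait, ps;	/* M_WAIT/M_DONTWAIT and saved PSW */
{

	if (canwait == M_WAIT)
		return (TASKTIME);
	if ((ps & 0340) < 0100)
		return (NETTIME);
	return (IMPTIME);
}
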
	There is a somewhat different approach to the mbuf
problem which might also be helpful.  History:  The allocator
in 2.9 ignored the "cantwait" argument--it never slept at all.
Needless to say, problems arose frequently.  My first attempt
to improve the situation was to add appropriate sleeps, making
the system run much the way 4.3 does now, i.e., top-level
code can wait for mbufs but most other code can't.  This helped
very little.  Typically, a send call would block in the
allocator until an mbuf became available and then call the
tcp send routine which would promptly request an mbuf "DONTWAIT"
and fail.  The solution was to make allocation requests which
*could* wait *always* wait until some fraction (say, 50%) of
mbufs were free.  This in effect reserved half of the mbufs
for code that couldn't wait and improved matters significantly.
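	A minimal sketch of that reservation, assuming the free count in
mbstat.m_mtypes[MT_FREE] and the total in mbstat.m_mbufs are the right
things to test (the 2.9 code itself is not reproduced here):

/*
 * Sketch: a caller that is allowed to wait keeps waiting until at
 * least half of the mbufs are free, leaving the other half for
 * callers that cannot wait.
 */
m_may_take(canwait)
	int canwait;
{

	if (canwait != M_WAIT)
		return (mfree != 0);
	return (mbstat.m_mtypes[MT_FREE] >= mbstat.m_mbufs / 2);
}

	An M_WAIT request in m_more() would then SLEEP until m_may_take()
says yes instead of grabbing the first mbuf that comes free.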

				Dan Lanciani
				ddl@harvard.*

sms@WLV.IMSD.CONTEL.COM (Steven M. Schultz(Y)) (07/01/89)

In article <2144@husc6.harvard.edu> ddl@husc6.harvard.edu (Dan Lanciani) writes:

>	There is a potential problem with the proposed modifications
>to the mbuf allocator.  The "canwait" argument to the allocation
>routines is more than just an indication of whether it is ok to sleep.
>It is also a subtle hint that the routine was not called from a
>(potentially plimp) interrupt routine.  If an allocation routine
>is called from, e.g., the ethernet interrupt it would be incorrect
>to (re)enter the network at plnet because (1) the network code may
>inadvertently lower the pl and (2) the network code may itself have
>been interrupted at a critical section.

>	Unfortunately, to be useful, most *_drain routines must
>ultimately access global structures which are protected only by
>splnet's and thus are eventually likely to cause corruption.  

	Thanks for the hints/warning.

	i'd like to "explore"/"discuss" this a bit.

	Inadvertent lowering of the priority is the bugaboo to be
	afraid of, as i understand it; the consequences of recursive
	interrupts, etc. are well known to me (i only blew it twice in
	getting if_ec.c working for 2.10.1BSD).

	The m_more() routine is required to be called at splimp(), not splnet().

	The m_expand() routine which is called by m_more() also is required 
	to be called at splimp().

	The tcp_drain() routine i proposed (actually i'm running it as i type)
	raises spl to splimp() out of paranoia.  Besides which, the *_drain 
	routines are only called from m_expand() which is at splimp already.

	Last i looked, splimp() is higher than splnet() so there is not
	any inadvertent lowering of priority.

	Yes, if in the future, the *_drain routines get "smarter" or more
	complicated, then problems probably will arise.  Given the current
	setup everything seems in order.

	What have i missed?  Please be kind - it was late when i "had to
	do something about the ftp mget" problem ;-)

>The solution was to make allocation requests which
>*could* wait *always* wait until some fraction (say, 50%) of
>mbufs were free.  This in effect reserved half of the mbufs
>for code that couldn't wait and improved matters significantly.

	Hmmmm, that sounds interesting!  Might be worth a shot.  Thanks!

	Steven M. Schultz
	sms@wlv.imsd.contel.com

ddl@husc6.harvard.edu (Dan Lanciani) (07/02/89)

In article <33315@wlbr.IMSD.CONTEL.COM>, sms@WLV.IMSD.CONTEL.COM (Steven M. Schultz(Y)) writes:
| 	Inadvertent lowering of the priority is the bugaboo to be
| 	afraid of, as i understand it; the consequences of recursive
| 	interrupts, etc. are well known to me (i only blew it twice in
| 	getting if_ec.c working for 2.10.1BSD).

	This is no more or less to be feared than corruption of global
data structures.

| 	The m_more() routine is required to be called at splimp(), not splnet().
| 
| 	The m_expand() routine which is called by m_more() also is required 
| 	to be called at splimp().

	This is done so that the allocator can be called from plimp
interrupt routines.  The mbuf allocator is one of the few pieces of
the network that may be called by such routines.  Another is the
arpinput code, but that's easy to get rid of and I digress...

| 	The tcp_drain() routine i proposed (actually i'm running it as i type)
| 	raises spl to splimp() out of paranoia.  Besides which, the *_drain 
| 	routines are only called from m_expand() which is at splimp already.

	This piece of paranoia both does no good and is misleading.  It
protects your tcp_drain routine from being interrupted, but this isn't
the problem.  The problem is that your tcp_drain routine can be *called*
from a plimp interrupt routine.  In any case, the problem is not
with tcp_drain, but with the modification to the memory allocator.

| 	Last i looked, splimp() is higher than splnet() so there is not
| 	any inadvertent lowering of priority.

	But your tcp_drain does not stand by itself.  It calls tcp_close.
Tcp_close calls (among other things) in_pcbdetach.  In_pcbdetach accesses
the linked list of protocol control blocks.  It also calls sofree.
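	For reference, 4.3BSD's in_pcbdetach looks roughly like this
(2.10BSD may differ in detail); note the remque() that unlinks the
block from the pcb chain, protected only by whatever spl the caller
happens to hold:

in_pcbdetach(inp)
	struct inpcb *inp;
{
	struct socket *so = inp->inp_socket;

	so->so_pcb = 0;
	sofree(so);
	if (inp->inp_options)
		(void) m_free(inp->inp_options);
	if (inp->inp_route.ro_rt)
		rtfree(inp->inp_route.ro_rt);
	remque(inp);
	(void) m_free(dtom(inp));
}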

| 	Yes, if in the future, the *_drain routines get "smarter" or more
| 	complicated, then problems probably will arise.  Given the current
| 	setup everything seems in order.

	You can't get much more complicated than calling the tcp_close
function...

| 	What have i missed?  Please be kind - it was late when i "had to
| 	do something about the ftp mget" problem ;-)

	From the structure of your loop in tcp_drain, you obviously
noticed that tcp_close will delete protocol control blocks from the
tcb chain.  How would you feel if you were a happy little plnet
routine walking the tcb chain and then the tcp_drain routine was called
by the ethernet interrupt's asking for an mbuf, thus invalidating
your pointer?
	For a more definite example, consider the tcp_fasttimo routine
and think about what would happen if you deleted the protocol control
block that it was currently looking at.
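	For concreteness, tcp_fasttimo looks roughly like this in 4.3BSD
(2.10BSD should be essentially the same); it walks the tcb chain at
splnet, so a tcp_drain run from a higher-priority interrupt in the
middle of the loop can free the very inpcb that "inp" points to:

tcp_fasttimo()
{
	register struct inpcb *inp;
	register struct tcpcb *tp;
	int s = splnet();

	inp = tcb.inp_next;
	if (inp)
	for (; inp != &tcb; inp = inp->inp_next)
		/*
		 * If an ethernet interrupt runs out of mbufs right here
		 * and the allocator calls tcp_drain, tcp_close can free
		 * the block "inp" points to; the loop then follows a
		 * stale inp_next pointer.
		 */
		if ((tp = (struct tcpcb *)inp->inp_ppcb) &&
		    (tp->t_flags & TF_DELACK)) {
			tp->t_flags &= ~TF_DELACK;
			tp->t_flags |= TF_ACKNOW;
			(void) tcp_output(tp);
		}
	splx(s);
}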

				Dan Lanciani
				ddl@harvard.*

sms@wlv.imsd.contel.com (Steven M. Schultz) (07/04/89)

In article <12417@bloom-beacon.MIT.EDU> scs@adam.pika.mit.edu (Steve Summit) writes:

>There is an interesting discussion going on in comp.bugs.2bsd
>about an out-of-mbufs problem caused by an mget in ftp.  The
>problem obviously occurs primarily on a pdp11 with its limited
>memory, but the 2.10bsd code is taken directly from the VAX
>version, and I have noticed the same problem (and indeed the
>original submittor acknowledges the possibility in the excerpt
>from his posting I've reproduced below) when doing an mput (as I
>recall) on an overloaded MicroVAX being used as a file server.

	ahhh, so others have seen the problem on larger machines.  i had
	not seen any other references before, so i thought it only a
	'theoretical' possibility to run a vax out of network memory.

>There is some debate about the efficacy of the proposed fix,
>which involves fleshing out the (previously stubbed) tcp_drain
>routine.

	the pitfalls of my proposed change to the mbuf allocator
	have been made known to me (i really should have known better).
	an alternative solution is-being/has-been prepared.  

	a small change to mbuf.h is made, adding a new 'wait' flag
	and modifying the MGET macro to test whether it is safe
	(i.e. not being at splimp) to manipulate the tcb chain(s).
	the 0340 and 0100 are the processor priority mask and network
	priority (2) level for the pdp-11, but hopefully the idea is clear.
	ideally the appropriate symbolic names should be used, but
	"real work" reared it's head ;-)

	the idea is to add another state that will NOT sleep, but WILL
	invoke the drain code if the network code was at splnet.  (thanks
	to Dan Lanciani - ddl@harvard.harvard.edu for pointers in this
	area).

	it would be enlightening to know why sockets stay around so long
	in a TIME_WAIT state (especially on a LAN) and what would break
	if the timeout interval were reduced.

	the tcp_drain() modification with the removal
	of the unnecessary splimp call seems adequate.  here's what
	tcp_drain() looks like at the moment:

tcp_drain()
{
	register struct inpcb *ip, *ipnxt;
	register struct tcpcb *tp;

	/*
	 * Search through tcb's and look for TIME_WAIT states to liberate,
	 * these are due to go away soon anyhow and we're short of space or
 	 * we wouldn't be here...
	 */
	ip = tcb.inp_next;
	if (ip == 0)
		return;
	for (; ip != &tcb; ip = ipnxt) {
		ipnxt = ip->inp_next;
		tp = intotcpcb(ip);
		if (tp == 0)
			continue;
		if (tp->t_state == TCPS_TIME_WAIT)
			tcp_close(tp);
	}
}

	and the change to mbuf.h:

/* flags to m_get */
#define	M_DONTWAIT	0
#define	M_WAIT		1
#define	M_DONTWAITLONG	2		/* THIS IS NEW */
	...
#define	MGET(m, i, t) \
	{ int ms = splimp(); \
	  if ((m)=mfree) \
		{ if ((m)->m_type != MT_FREE) panic("mget"); (m)->m_type = t; \
		  mbstat.m_mtypes[MT_FREE]--; mbstat.m_mtypes[t]++; \
		  mfree = (m)->m_next; (m)->m_next = 0; \
		  (m)->m_off = MMINOFF; } \
	  else \
		(m) = m_more((((ms&0340) <= 0100) && (i==M_DONTWAIT)) ? M_DONTWAITLONG : i, t); \
	  splx(ms); }

sms@wlv.imsd.contel.com (Steven M. Schultz) (07/09/89)

	Here are the latest changes in an attempt to alleviate mbuf
	exhaustion from sockets persisting in the TIME_WAIT state
	(caused by, for example, "ftp mget/mput" in a directory with
	many short files).

	There are 3 modules to be changed:  /sys/h/mbuf.h, 
	/sys/netinet/tcp_subr.c and /sys/sys/uipc_mbuf.c.  The first
	change is given as a context diff suitable for 'patch', the
	last two are replacement functions.

	The concept behind the changes is, when the mbufs are exhausted,
	to check whether or not the current processor priority is at
	or below the NET level (not running at interface priority) and,
	if so, to use the M_DONTWAITLONG state instead of M_DONTWAIT.
	This ensures that we will NOT sleep(), but that it is safe to
	call the drain routines (which manipulate the tcb list among
	other things).

	The change to mbuf.h adds the new 'wait' state and modifies the
	MGET macro.  The "mysterious" numbers 0340 and 0100 are the
	processor priority field mask and SPLNET respectively; the
	appropriate symbolic defines SHOULD have been used, but i didn't
	have the time to futz with the necessary "ifdef/include" sequences
	to incorporate the proper header files.
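
	Spelled out (the layout is pdp-11 specific):

	/*
	 * The PSW keeps the processor priority in bits 5-7:
	 *	0340	mask for the priority field
	 *	0100	level 2 (the network level), i.e. 2 << 5
	 * so "(ms & 0340) <= 0100" asks whether the saved priority was
	 * at or below the network level, i.e. that the caller was not
	 * an interface (splimp) interrupt.
	 */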

	At the present time, the change to tcp_drain() only looks for
	sockets in the TIME_WAIT state to remove - this is reasonably
	safe since these are due to expire shortly anyhow.  If other
	suggestions for augmenting the tcp_drain() arrive, they can easily
	be incorporated.

	m_expand() is essentially a 4.3BSD version with an ifdef for the pdp11
	since m_clalloc() isn't implemented.

	
*** mbuf.h.old	Mon Jul  3 11:35:32 1989
--- mbuf.h	Mon Jul  3 13:50:01 1989
***************
*** 83,88 ****
--- 83,89 ----
  /* flags to m_get */
  #define	M_DONTWAIT	0
  #define	M_WAIT		1
+ #define	M_DONTWAITLONG	2
  
  /* flags to m_pgalloc */
  #define	MPG_MBUFS	0		/* put new mbufs on free list */
***************
*** 106,112 ****
  		  mfree = (m)->m_next; (m)->m_next = 0; \
  		  (m)->m_off = MMINOFF; } \
  	  else \
! 		(m) = m_more(i, t); \
  	  splx(ms); }
  /*
   * Mbuf page cluster macros.
--- 107,113 ----
  		  mfree = (m)->m_next; (m)->m_next = 0; \
  		  (m)->m_off = MMINOFF; } \
  	  else \
! 		(m) = m_more((((ms&0340) <= 0100) && (i==M_DONTWAIT)) ? M_DONTWAITLONG : i, t); \
  	  splx(ms); }
  /*
   * Mbuf page cluster macros.

==========================================================================

tcp_drain()
{
	register struct inpcb *ip, *ipnxt;
	register struct tcpcb *tp;

	/*
	 * Search through tcb's and look for TIME_WAIT states to liberate,
	 * these are due to go away soon anyhow and we're short of space or
 	 * we wouldn't be here...
	 */
	ip = tcb.inp_next;
	if (ip == 0)
		return;
	for (; ip != &tcb; ip = ipnxt) {
		ipnxt = ip->inp_next;
		tp = intotcpcb(ip);
		if (tp == 0)
			continue;
		if (tp->t_state == TCPS_TIME_WAIT)
			tcp_close(tp);
	}
}

==============================================================================

m_expand(canwait)
	int canwait;
{
	register struct domain *dp;
	register struct protosw *pr;
	register int tries;

	for (tries = 0;; ) {
#ifdef	pdp11
		if (mfree)
			return (1);
#else
		if (m_clalloc(1, MPG_MBUFS, canwait))
			return (1);
#endif
		if (canwait == M_DONTWAIT || tries++)
			return (0);

		/* ask protocols to free space */
		for (dp = domains; dp; dp = dp->dom_next)
			for (pr = dp->dom_protosw; pr < dp->dom_protoswNPROTOSW;
			    pr++)
				if (pr->pr_drain)
					(*pr->pr_drain)();
		mbstat.m_drain++;
	}
}
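
==============================================================================

	One assumption worth spelling out:  m_more() has to treat
	M_DONTWAITLONG the way it treats M_DONTWAIT when deciding whether
	to sleep, or the "will NOT sleep" property claimed above is lost.
	A minimal sketch of its allocation loop with that folded in (the
	rest of m_more() is as in the part 2 patch):

	for (;;) {
		m_expand(canwait);
		if (mfree)
			break;
		/*
		 * Either flavor of "don't wait" gives up here rather
		 * than sleeping; only M_WAIT callers fall through.
		 */
		if (canwait != M_WAIT) {
			mbstat.m_drops++;
			return (NULL);
		}
		mbstat.m_wait++;
		m_want++;
		SLEEP((caddr_t)&mfree, PZERO - 1);
	}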