sms@wlv.imsd.contel.com (Steven M. Schultz(Y)) (06/29/89)
Subject: TIME_WAIT sockets clog system (part 1)
Index: sys/netinet/tcp_subr.c 2.10BSD
Description:
Sockets in a TIME_WAIT state can constipate the networking
buffer memory when generated in rapid succession by, for
example, an "mget" in an ftp session. If more than a dozen
or so small files are transferred in rapid succession over
an ethernet, all the mbufs in the system will be taken up
by sockets in a TIME_WAIT state (from the socket opened for
each data transfer).
Repeat-By:
ftp in to a 2.10.1BSD system, do an "mget *" in a largish
directory. note that the transfer will hang/develope problems
after about a dozen to twenty files. the 2.10.1BSD system
has run out of mbufs and will recover in a minute or so (hopefully).
Fix:
Apply this patch to netinet/tcp_subr.c, this fix implements
a tcp_drain() function that actually does something (the
original 4.3BSD tcp_drain() was a null function). The other
half of this fix will follow shortly and consists of a small
modification to uipc_mbuf to invoke the drain routine at the
appropriate time.
The comment in the function pretty much sums the situation up.
*** tcp_subr.c.old Thu Apr 28 16:22:56 1988
--- tcp_subr.c Fri Jun 9 13:57:29 1989
***************
*** 251,257 ****
--- 251,279 ----
tcp_drain()
{
+ register struct inpcb *ip, *ipnxt;
+ register struct tcpcb *tp;
+ int s = splimp();
+ /*
+ * Search through tcb's and look for TIME_WAIT states to liberate,
+ * these are due to go away soon anyhow and we're short of space or
+ * we wouldn't be here...
+ */
+ ip = tcb.inp_next;
+ if (ip == 0) {
+ splx(s);
+ return;
+ }
+ for (; ip != &tcb; ip = ipnxt) {
+ ipnxt = ip->inp_next;
+ tp = intotcpcb(ip);
+ if (tp == 0)
+ continue;
+ if (tp->t_state == TCPS_TIME_WAIT)
+ tcp_close(tp);
+ }
+ splx(s);
}
/*
sms@wlv.imsd.contel.com (Steven M. Schultz(Y)) (06/29/89)
Subject: TIME_WAIT sockets clog system (part 2) Index: sys/sys/uipc_mbuf.c 2.10BSD Description: Sockets in a TIME_WAIT state can constipate the networking buffer memory when generated in rapid succession by, for example, an "mget" in an ftp session. If more than a dozen or so small files are transferred in rapid succession over an ethernet, all the mbufs in the system will be taken up by sockets in a TIME_WAIT state (from the socket opened for each data transfer). Repeat-By: ftp in to a 2.10.1BSD system, do an "mget *" in a largish directory. note that the transfer will hang/develope problems after about a dozen to twenty files. the 2.10.1BSD system has run out of mbufs and will recover in a minute or so (hopefully). It should be noted that even a Vax could be run out of mbufs if the directory were large enough and network memory was full due to other causes. Fix: Apply this patch to /sys/sys/uipc_mbuf.c. The effect of this fix is to call the drain routines at least once - the earlier version of m_more() checked the waitability state and rejected the request immediately rather than attempt non-blocking methods of freeing up memory first. It is (as far as i can see - and yes, these fixes have been tested) safe to call the drain routines at this point for two reasons: 1) the tcp_drain() routine that existed prior to part 1 of this fix did nothing AND the NEW tcp_drain() routine is guaranteed not to sleep(), 2) the ip_drain() routine only releases fragments and does not sleep() at any point along the way. With both part 1 and 2 installed a "ftp mget *" of /etc was done with no problems and 8 calls to the drain routines (as reported by 'netstat -s'). *** uipc_mbuf.old Tue Jun 27 23:20:46 1989 --- uipc_mbuf.c Tue Jun 27 23:23:14 1989 *************** *** 164,177 **** { register struct mbuf *m; - if (canwait == M_DONTWAIT) { - mbstat.m_drops++; - return (NULL); - } for(;;) { m_expand(canwait); if (mfree) break; mbstat.m_wait++; m_want++; SLEEP((caddr_t)&mfree, PZERO - 1); --- 164,177 ---- { register struct mbuf *m; for(;;) { m_expand(canwait); if (mfree) break; + if (canwait == M_DONTWAIT) { + mbstat.m_drops++; + return (NULL); + } mbstat.m_wait++; m_want++; SLEEP((caddr_t)&mfree, PZERO - 1);
ddl@husc6.harvard.edu (Dan Lanciani) (06/30/89)
There is a potential problem with the proposed modifications to the mbuf allocator. The "cantwait" argument to the allocation routines is more than just an indication of whether it is ok to sleep. It is also a subtle hint that the routine was not called from a (potentially plimp) interrupt routine. If an allocation routine is called from, e.g., the ethernet interrupt it would be incorrect to (re)enter the network at plnet because (1) the network code may inadvertently lower the pl and (2) the network code may itself have been interrupted at a critical section. Unfortunately, to be useful, most *_drain routines must ultimately access global structures which are protected only by splnet's and thus are eventually likely to cause corruption. It may be possible to check the current pl on entry to the allocator (tricky because of the macros) and distinguish three classes of request (from least to most restrictive): task-time (may sleep and/or call the network), interrupt-time where previous pl was less than plnet (may call the network), and interrupt-time where previous pl was plnet or higher. Of course, since you cannot find the true previous pl, you would have to assume that the current pl is higher than the previous and work from there. This scheme breaks down if any plnet code calls the allocator in the expectation that it won't reenter, but such cases could be fixed. There is a somewhat different approach to the mbuf problem which might also be helpful. History: The allocator in 2.9 ignored the "cantwait" argument--it never slept at all. Needless to say, problems arose frequently. My first attempt to improve the situation was to add appropriate sleeps, making the system run much the way 4.3 does now, i.e., top-level code can wait for mbufs but most other code can't. This helped very little. Typically, a send call would block in the allocator until an mbuf became available and then call the tcp send routine which would promptly request an mbuf "DONTWAIT" and fail. The solution was to make allocation requests which *could* wait *always* wait until some fraction (say, 50%) of mbufs were free. This is effect reserved half of the mbufs for code that couldn't wait and improved matters significantly. Dan Lanciani ddl@harvard.*
sms@WLV.IMSD.CONTEL.COM (Steven M. Schultz(Y)) (07/01/89)
In article <2144@husc6.harvard.edu> ddl@husc6.harvard.edu (Dan Lanciani) writes: > There is a potential problem with the proposed modifications >to the mbuf allocator. The "cantwait" argument to the allocation >routines is more than just an indication of whether it is ok to sleep. >It is also a subtle hint that the routine was not called from a >(potentially plimp) interrupt routine. If an allocation routine >is called from, e.g., the ethernet interrupt it would be incorrect >to (re)enter the network at plnet because (1) the network code may >inadvertently lower the pl and (2) the network code may itself have >been interrupted at a critical section. > Unfortunately, to be useful, most *_drain routines must >ultimately access global structures which are protected only by >splnet's and thus are eventually likely to cause corruption. Thanks for the hints/warning. i'd like to "explore"/"discuss" this a bit. Inadvertent lowering of the priority is the bugaboo to be afraid of as i understand it, the consequences of recursive interrupts, etc are well known to me (i only blew it twice in getting if_ec.c working for 2.10.1BSD). The m_more() routine is required to be called at splimp(), not splnet(). The m_expand() routine which is called by m_more() also is required to be called at splimp(). The tcp_drain() routine i proposed (actually i'm running it as i type) raises spl to splimp() out of paranoia. Besides which, the *_drain routines are only called from m_expand() which is at splimp already. Last i looked, splimp() is higher than splnet() so there is not any inadvertent lowering of priority. Yes, if in the future, the *_drain routines get "smarter" or more complicated, then problems probably will arise. Given the current setup everything seems in order. What have i missed? Please be kind - it was late when i "had to do something about the ftp mget" problem ;-) >The solution was to make allocation requests which >*could* wait *always* wait until some fraction (say, 50%) of >mbufs were free. This is effect reserved half of the mbufs >for code that couldn't wait and improved matters significantly. Hmmmm, that sounds interesting! Might be worth a shot. Thanks! Steven M. Schultz sms@wlv.imsd.contel.com
ddl@husc6.harvard.edu (Dan Lanciani) (07/02/89)
In article <33315@wlbr.IMSD.CONTEL.COM>, sms@WLV.IMSD.CONTEL.COM (Steven M. Schultz(Y)) writes: | Inadvertent lowering of the priority is the bugaboo to be | afraid of as i understand it, the consequences of recursive | interrupts, etc are well known to me (i only blew it twice in | getting if_ec.c working for 2.10.1BSD). This is no more or less to be feared than corruption of global data structures. | The m_more() routine is required to be called at splimp(), not splnet(). | | The m_expand() routine which is called by m_more() also is required | to be called at splimp(). This is done so that the allocator can be called from plimp interrupt routines. The mbuf allocator is one of the few pieces of the network that may be called by such routines. Another is the arpinput code, but that's easy to get rid of and I digress... | The tcp_drain() routine i proposed (actually i'm running it as i type) | raises spl to splimp() out of paranoia. Besides which, the *_drain | routines are only called from m_expand() which is at splimp already. This piece of paranoia both does no good and is misleading. It protects your tcp_drain routine from being interrupted, but this isn't the problem. The problem is that your tcp_drain routine can be *called* from a plimp interrupt routine. In any case, the problem is not with tcp_drain, but with the modification to the memory allocator. | Last i looked, splimp() is higher than splnet() so there is not | any inadvertent lowering of priority. But your tcp_drain does not stand by itself. It calls tcp_close. Tcp_close calls (among other things) in_pcbdetach. In_pcbdetach access the linked list of protocol control blocks. It also calls sofree. | Yes, if in the future, the *_drain routines get "smarter" or more | complicated, then problems probably will arise. Given the current | setup everything seems in order. You can't get much more complicated than calling the tcp_close function... | What have i missed? Please be kind - it was late when i "had to | do something about the ftp mget" problem ;-) From the structure of your loop in tcp_drain, you obviously noticed that tcp_close will delete protocol control blocks from the tcb chain. How would you feel if you were a happy little plnet routine walking the tcb chain and then the tcp_drain routine was called by the ethernet interrupt's asking for an mbuf, thus invalidating your pointer? For a more definite example, consider the tcp_fasttimo routine and think about what would happen if you deleted the protocol control block that it was currently looking at. Dan Lanciani ddl@harvard.*
sms@wlv.imsd.contel.com (Steven M. Schultz) (07/04/89)
In article <12417@bloom-beacon.MIT.EDU> scs@adam.pika.mit.edu (Steve Summit) writes: >There is an interesting discussion going on in comp.bugs.2bsd >about an out-of-mbufs problem caused by an mget in ftp. The >problem obviously occurs primarily on a pdp11 with its limited >memory, but the 2.10bsd code is taken directly from the VAX >version, and I have noticed the same problem (and indeed the >original submittor acknowledges the possibility in the excerpt >from his posting I've reproduced below) when doing an mput (as I >recall) on an overloaded MicroVAX being used as a file server. ahhh, so others have seen the problem on larger machines. i had not seen any other references before, so i thought it only a 'theoretical' possibility to run a vax out of network memory. >There is some debate about the efficacy of the proposed fix, >which involves fleshing out the (previously stubbed) tcp_drain >routine. the pitfalls of my proposed change to the mbuf allocator have been made known to me (i really should have known better). an alternative solution is-being/has-been prepared. a small change to mbuf.h is made, adding a new 'wait' flag and modifying the MGET macro to test whether it is safe (i.e. not being at splimp) to manipulate the tcb chain(s). the 0340 and 0100 are the processor priority mask and network priority (2) level for the pdp-11, but hopefully the idea is clear. ideally the appropriate symbolic names should be used, but "real work" reared it's head ;-) the idea is to add another state that will NOT sleep, but WILL invoke the drain code if the network code was at splnet. (thanks to Dan Lanciani - ddl@harvard.harvard.edu for pointers in this area). it would be enlightening to know why sockets stay around so long in a TIME_WAIT state (especially on a LAN) and what would break if the timeout interval were reduced. the tcp_drain() modification with the removal of the un-necessary splimp call seems adequate. here's what tcp_drain() looks like at the moment: tcp_drain() { register struct inpcb *ip, *ipnxt; register struct tcpcb *tp; /* * Search through tcb's and look for TIME_WAIT states to liberate, * these are due to go away soon anyhow and we're short of space or * we wouldn't be here... */ ip = tcb.inp_next; if (ip == 0) return; for (; ip != &tcb; ip = ipnxt) { ipnxt = ip->inp_next; tp = intotcpcb(ip); if (tp == 0) continue; if (tp->t_state == TCPS_TIME_WAIT) tcp_close(tp); } } and the change to mbuf.h: /* flags to m_get */ #define M_DONTWAIT 0 #define M_WAIT 1 #define M_DONTWAITLONG 2 /* THIS IS NEW */ ... #define MGET(m, i, t) \ { int ms = splimp(); \ if ((m)=mfree) \ { if ((m)->m_type != MT_FREE) panic("mget"); (m)->m_type = t; \ mbstat.m_mtypes[MT_FREE]--; mbstat.m_mtypes[t]++; \ mfree = (m)->m_next; (m)->m_next = 0; \ (m)->m_off = MMINOFF; } \ else \ (m) = m_more((((ms&0340) <= 0100) && (i==M_DONTWAIT)) ? M_DONTWAITLONG : i, t); \ splx(ms); }
sms@wlv.imsd.contel.com (Steven M. Schultz) (07/09/89)
Here are the latest changes in an attempt to alleviate mbuf exhaustion from sockets persisting in the TIME_WAIT state (caused by, for example, "ftp mget/mput" in a directory with many short files). There are 3 modules to be changed: /sys/h/mbuf.h, /sys/netinet/tcp_subr.c and /sys/sys/uipc_mbuf.c. The first change is given as a context diff suitable for 'patch', the last two are replacement functions. The concept behind the changes is when the mbufs are exhausted to check whether or not the current processor priority is at or below the NET level (not running at interface priority) and use the M_DONTWAITLONG state instead of the M_DONTWAIT. This insures that we will NOT sleep(), but that it is safe to call the drain routines (which manipulate the tcb list amoung other things). The change to mbuf.h adds the new 'wait' state and modifies the MGET macro. The "mysterious" numbers 0340 and 0200 are the processor priority field mask and SPLNET respectively, the appropriate symbolic defines SHOULD have been used, but i didn't have the time to futz with the necessary "ifdef/include" sequences to incorporate the proper header files. At the present time, the change to tcp_drain() only looks for sockets in the TIME_WAIT state to remove - this is reasonably safe since these are due to expire shortly anyhow. If other suggestions for augmenting the tcp_drain() arrive, they can easily be incorporated. m_expand() is essentially a 4.3BSD version with an ifdef for the pdp11 since m_clalloc() isn't implemented. *** mbuf.h.old Mon Jul 3 11:35:32 1989 --- mbuf.h Mon Jul 3 13:50:01 1989 *************** *** 83,88 **** --- 83,89 ---- /* flags to m_get */ #define M_DONTWAIT 0 #define M_WAIT 1 + #define M_DONTWAITLONG 2 /* flags to m_pgalloc */ #define MPG_MBUFS 0 /* put new mbufs on free list */ *************** *** 106,112 **** mfree = (m)->m_next; (m)->m_next = 0; \ (m)->m_off = MMINOFF; } \ else \ ! (m) = m_more(i, t); \ splx(ms); } /* * Mbuf page cluster macros. --- 107,113 ---- mfree = (m)->m_next; (m)->m_next = 0; \ (m)->m_off = MMINOFF; } \ else \ ! (m) = m_more((((ms&0340) <= 0100) && (i==M_DONTWAIT)) ? M_DONTWAITLONG : i, t); \ splx(ms); } /* * Mbuf page cluster macros. ========================================================================== tcp_drain() { register struct inpcb *ip, *ipnxt; register struct tcpcb *tp; /* * Search through tcb's and look for TIME_WAIT states to liberate, * these are due to go away soon anyhow and we're short of space or * we wouldn't be here... */ ip = tcb.inp_next; if (ip == 0) return; for (; ip != &tcb; ip = ipnxt) { ipnxt = ip->inp_next; tp = intotcpcb(ip); if (tp == 0) continue; if (tp->t_state == TCPS_TIME_WAIT) tcp_close(tp); } } ============================================================================== m_expand(canwait) int canwait; { register struct domain *dp; register struct protosw *pr; register int tries; for (tries = 0;; ) { #ifdef pdp11 if (mfree) return (1); #else if (m_clalloc(1, MPG_MBUFS, canwait)) return (1); #endif if (canwait == M_DONTWAIT || tries++) return (0); /* ask protocols to free space */ for (dp = domains; dp; dp = dp->dom_next) for (pr = dp->dom_protosw; pr < dp->dom_protoswNPROTOSW; pr++) if (pr->pr_drain) (*pr->pr_drain)(); mbstat.m_drain++; } }