[comp.unix.wizards] panic: out of mbufs: map full

cpw@lanl.GOV (05/28/87)

Our MILNET Gateway is a VAX-11/750 running Berkeley 4.3 UNIX. Since installing 
4.3 we have experienced a number of crashes of the type:

	panic: out of mbufs: map full

for which I have a fix.  Please bear with me.

After much searching and accounting for all the message buffers (mbufs), I
determined that this 'panic', in 2 cases, was precipitated by an uncontrolled
stream of 1 byte TCP packets emanating from (in panic #1) a MILNET host 2500
(plus or minus a thousand) miles away.  The actual configurations were:

	panic #1: (VMS login to NRL )             ( a telnet from NRL )
			 -------       -----------     ------     ----
	          tty---|UB RF?	|     |VAX-11/780 |---|MILNET|   |LANL|
			|connect|---tt|VMS Woll...|   |      |---|    |
			 -------       -----------     ------     ----

	panic #2: (VMS login to AFWL )            ( a telnet from AFWL)
			 -------       -----------     ------     ----
	          ibm---|UB ??	|     |VAX-11/780 |---|MILNET|   |LANL|
		  clone	|connect|---tt|VMS Woll...|   |      |---|    |
			 -------       -----------     ------     ----

Interestingly, the host at NRL crashed at the same time our host did.
I did not check out the AFWL case completely.  I would be pleased to know
why these 1 byte packets are coming our way.  In both cases, each byte was
a hexadecimal '0d'.  Could that stand for 0VERdOSE???

The crash results when a TCP sender begins to pump beaucoup packets out
on the net in ascending sequence, each 1 character in length, the network
conveniently loses some small number of them early on, and the receiver's window
is greater than the total number of mbufs available.  The TCP input routine
appends these 1 character mbufs to a reassembly queue, hoping for the
retransmission of the lost packets so that it can make the list available to a
receiving application.  Needless to say, the offending system never sends
the missing pieces, the kernel links whatever system message buffers
remain onto the reassembly queue, and finally some other process
asks for an mbuf with the M_WAIT option, and 4.3 panics in m_clalloc.

I call this the 'silly assembly syndrome'.
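
The sequence of events above can be modelled in a few lines of user-level C.
This is only a sketch under stated assumptions, not kernel code: the names
rx_state, rx_one_byte and segments_until_exhaustion are made up for
illustration, every out-of-order byte is assumed to cost exactly one mbuf,
and the first segment (sequence 0) is assumed lost and never retransmitted.

```c
/* Hypothetical user-level model of the receiver; not the 4.3 kernel code. */
enum seg_result { SEG_DELIVERED, SEG_QUEUED, SEG_NO_MBUF };

struct rx_state {
    int free_mbufs;          /* mbufs left in the (assumed) system-wide pool */
    int queued;              /* out-of-order segments held, one mbuf each */
    unsigned next_expected;  /* next in-order sequence number */
};

/* Handle one 1-byte segment carrying sequence number seq. */
enum seg_result rx_one_byte(struct rx_state *rx, unsigned seq)
{
    if (seq == rx->next_expected) {
        /* Gap filled: deliver this byte and everything queued behind it.
         * The sketch assumes the queued bytes are contiguous. */
        rx->next_expected += 1u + (unsigned)rx->queued;
        rx->free_mbufs += rx->queued;
        rx->queued = 0;
        return SEG_DELIVERED;
    }
    if (rx->free_mbufs == 0)
        return SEG_NO_MBUF;  /* pool is dry; the next M_WAIT request is in trouble */
    rx->free_mbufs--;
    rx->queued++;
    return SEG_QUEUED;
}

/* Segment 0 is lost; segments 1, 2, 3, ... arrive, limited only by the window.
 * Returns the number of mbufs tied up when the pool runs dry, or -1 if the
 * window closes first. */
int segments_until_exhaustion(int pool_mbufs, unsigned window)
{
    struct rx_state rx;
    unsigned seq;

    rx.free_mbufs = pool_mbufs;
    rx.queued = 0;
    rx.next_expected = 0;
    for (seq = 1; seq < window; seq++) {
        if (rx_one_byte(&rx, seq) == SEG_NO_MBUF)
            return rx.queued;
    }
    return -1;
}
```

With a pool of 64 mbufs and a 4096-byte window the model runs dry with all 64
mbufs sitting on the one queue; with a 32-byte window the window closes first
and the pool survives.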

Thinking I might as well give it a shot, I spent some time spinning up on how
message buffers were managed by the kernel.  This included spinning up on
adb and fixing some of the /usr/lib/adb scripts as well.  Two interesting
things surfaced:

	1.	The sleep/wakeup code was not being exercised in the mbuf
		code.
	2.	If you did turn it on, by causing 'm_clalloc' to return a zero
		instead of the above panic, and having 'm_expand' return
		'mfree' in the hopes that some process had released an mbuf
		or two, the kernel would panic with some kind of privileged
		operation.  This was due to 'm_get' being called with M_WAIT
		rather than M_DONTWAIT at interrupt level.
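
The distinction in item 2 can be sketched as follows.  This is a user-level
illustration only; sk_get, the SK_ flags and the result codes are hypothetical
stand-ins for 'm_get', M_WAIT and M_DONTWAIT (the flag name used in the
diffs), and the "bad context" result stands for whatever the real kernel does
when asked to sleep with no process context.

```c
/* Hypothetical stand-ins for the wait/don't-wait allocation flags. */
enum { SK_DONTWAIT, SK_WAIT };

enum alloc_result {
    ALLOC_OK,           /* an mbuf was available */
    ALLOC_FAILED,       /* none available, caller asked not to wait */
    ALLOC_WOULD_SLEEP,  /* process level: sleep until someone frees one */
    ALLOC_BAD_CONTEXT   /* interrupt level: there is nothing to put to sleep */
};

struct mpool {
    int nfree;          /* mbufs on the free list */
};

enum alloc_result sk_get(struct mpool *p, int canwait, int at_interrupt)
{
    if (p->nfree > 0) {
        p->nfree--;
        return ALLOC_OK;
    }
    if (canwait == SK_DONTWAIT)
        return ALLOC_FAILED;       /* caller must check for and handle failure */
    if (at_interrupt)
        return ALLOC_BAD_CONTEXT;  /* waiting is only legal at process level */
    return ALLOC_WOULD_SLEEP;
}
```

The point of the sketch is that code reachable from an interrupt must pass the
don't-wait flag and test the result, which is what the netinet diffs do.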

So, to make a short story slightly longer and more nauseating, I modified
some modules, and have appended the 'diffs' below.  The fixes involved:

	a few netinet modules
			- changing M_WAIT to M_DONTWAIT in code called at
			  interrupt level.
	netinet/tcp_input.c
			- using a rather heavy-handed approach for controlling
			  the maximum number of mbufs on a particular
			  reassembly queue.  A little tweaking indicates
			  that 8 is a good maximum for our installation.
			  Of course you could probably get a thesis out of
			  how to do it right.
	sys/uipc_mbuf.c
			- returning a zero from m_clalloc when out of space.
			- returning 'mfree' from m_expand as an answer to
			  'NEED SOME WAY TO RELEASE SPACE'.
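
The tcp_input.c change amounts to counting queue entries while looking for the
insertion point and dropping the new segment once the count exceeds a limit.
A minimal user-level sketch of that idea, loosely modelled on the diff below:
struct seg, reass_insert and reass_demo are hypothetical names, the queue is a
singly linked ring rather than the kernel's structure, and sequence numbers
are compared with a plain '>' that ignores wraparound.

```c
struct seg {
    struct seg *next;
    unsigned seq;
};

int reass_drops;  /* counts refused segments, like tcp_reass_drop in the diff */

/* Insert n in sequence order on the ring headed by head.  Entries are counted
 * on the way to the insertion point; past max, the segment is refused.
 * Returns 1 if queued, 0 if dropped. */
int reass_insert(struct seg *head, struct seg *n, int max)
{
    struct seg *prev = head, *q;
    int cnt = 0;

    for (q = head->next; q != head; prev = q, q = q->next, cnt++) {
        if (q->seq > n->seq)
            break;
    }
    if (cnt > max) {
        reass_drops++;
        return 0;
    }
    n->next = q;
    prev->next = n;
    return 1;
}

#define DEMO_SEGS 64

/* Feed nsegs ascending 1-byte segments to an empty queue; return how many
 * were accepted.  Resets reass_drops first. */
int reass_demo(int nsegs, int max)
{
    static struct seg segs[DEMO_SEGS];
    struct seg head;
    int i, accepted = 0;

    head.next = &head;  /* empty ring: the head points at itself */
    head.seq = 0;
    reass_drops = 0;
    if (nsegs > DEMO_SEGS)
        nsegs = DEMO_SEGS;
    for (i = 0; i < nsegs; i++) {
        segs[i].seq = (unsigned)i + 1u;
        accepted += reass_insert(&head, &segs[i], max);
    }
    return accepted;
}
```

Because the test is "count greater than max", a limit of 8 lets the queue grow
to 9 entries before segments are refused.  Since the count stops at the
insertion point, a segment that belongs near the front of a long queue is
still accepted in this sketch.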

Amazingly enough this works for us. I can pound away at the system from
a couple of ethernets, bombarding it with tons of packets (which used to
crash the system almost immediately), the user applications go through
their sleep/wakeup phases, all the mbufs eventually return to the free list,
and the system stays up.  I assume that it will survive the '0d' kind
of attacks.

P.S.	Don't use ULTRIX 2.0, on an 8600, to pound your 750; the 8600 will die.

P.P.S.	As a free gift, I have included a fix to if_ether.c which prevents
	yet another panic, a Protection Fault caused when 'arptfree'
	runs out of arp buckets and returns a zero mbuf pointer (this is a
	'can never happen' kind of bug).
	The important thing is to cast the unsigned variable 'at_timer'
	to int so that the comparisons give the intended result.
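
Why the cast matters can be shown with a small sketch.  It assumes that the
"oldest" search starts from a sentinel of -1, which the diff does not show,
and it uses 'unsigned int' timers so that the effect appears under ANSI
conversion rules; the function names and demo_timers are hypothetical.

```c
const unsigned demo_timers[3] = { 0u, 3u, 1u };

/* Unsigned comparison: the -1 sentinel becomes the largest unsigned value,
 * so no entry ever compares greater and no index is selected. */
int oldest_unsigned_compare(const unsigned *timers, int n)
{
    int oldest = -1, idx = -1, i;

    for (i = 0; i < n; i++) {
        /* Written out: the conversion a plain `timers[i] > oldest` applies. */
        if (timers[i] > (unsigned)oldest) {
            oldest = (int)timers[i];
            idx = i;
        }
    }
    return idx;  /* stays -1: the "can never happen" case */
}

/* Signed comparison: casting the timer to int keeps -1 below every entry. */
int oldest_signed_compare(const unsigned *timers, int n)
{
    int oldest = -1, idx = -1, i;

    for (i = 0; i < n; i++) {
        if ((int)timers[i] > oldest) {
            oldest = (int)timers[i];
            idx = i;
        }
    }
    return idx;
}
```

In the unsigned version the search comes back empty-handed even though every
slot is a candidate, which matches the kind of zero pointer described above.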

*** netinet/ip_input.c	Wed May 20 12:52:09 1987
--- netinet/ip_input.c.orig	Thu Jul 10 10:43:39 1986
***************
*** 6 ****
!  *	@(#)ip_input.c	7.1 (Berkeley) 6/5/86 (modified by cpw@lanl.gov)
--- 6 ----
!  *	@(#)ip_input.c	7.1 (Berkeley) 6/5/86
***************
*** 303 ****
! 		if ((t = m_get(M_DONTWAIT, MT_FTABLE)) == NULL)
--- 303 ----
! 		if ((t = m_get(M_WAIT, MT_FTABLE)) == NULL)
***************
*** 736,737 ****
! 	if( (m = m_get(M_DONTWAIT, MT_SOOPTS)) == 0 )
! 		return ((struct mbuf *)0);
--- 732 ----
! 	m = m_get(M_WAIT, MT_SOOPTS);
*** netinet/tcp_subr.c	Wed May 20 12:22:01 1987
--- netinet/tcp_subr.c.orig	Wed May 20 12:21:58 1987
***************
*** 60 ****
! 		m = m_get(M_DONTWAIT, MT_HEADER);
--- 60 ----
! 		m = m_get(M_WAIT, MT_HEADER);
*** netinet/tcp_input.c	Fri May 22 17:06:31 1987
--- netinet/tcp_input.c.orig	Wed May 20 12:54:34 1987
***************
*** 36 ****
- int	tcpreasscnt = 0;
--- 35 ----
***************
*** 60,62 ****
! #define TCP_REASS_MAX	12
! int	tcp_reass_max = TCP_REASS_MAX;
! int	tcp_reass_drop = 0;
--- 59 ----
! 
***************
*** 68 ****
- 	register short	s_cnt;
--- 64 ----
***************
*** 83,84 ****
! 	tcpreasscnt++;
! 	for (s_cnt = 0, q = tp->seg_next; q != (struct tcpiphdr *)tp;s_cnt++,
--- 79 ----
! 	for (q = tp->seg_next; q != (struct tcpiphdr *)tp;
***************
*** 89,92 ****
- 	if( s_cnt > tcp_reass_max ){
- 		tcp_reass_drop++;
- 		 goto drop;
- 	}
--- 83 ----
*** sys/uipc_mbuf.c	Fri May 22 17:07:46 1987
--- sys/uipc_mbuf.c.orig	Mon May  4 13:48:25 1987
***************
*** 6 ****
!  *	@(#)uipc_mbuf.c	7.1.2 (LANL) 05/04/87
--- 6 ----
!  *	@(#)uipc_mbuf.c	7.1 (Berkeley) 6/5/86
***************
*** 20 ****
! int m_rmalloc_cnt = 0; /* just so we can find out it happened */
--- 20 ----
! 
***************
*** 51 ****
! 		m_rmalloc_cnt++;
--- 51,52 ----
! 		if (canwait == M_WAIT)
! 			panic("out of mbufs: map full");
***************
*** 109 ****
! 	return ((int)mfree); /* someone might have done an m_free! */
--- 110 ----
! 	return (0);
*** netinet/if_ether.c  Thu Apr 30 11:38:33 1987
--- netinet/if_ether.c.orig     Thu Jun  5 01:24:40 1986
***************
*** 6 ****
!  *    @(#)if_ether.c  7.1.1.1 (LANL) 2/27/87
--- 6 ----
!  *    @(#)if_ether.c  7.1 (Berkeley) 6/5/86
***************
*** 75 ****
!               if ((int)(++at->at_timer) < ((at->at_flags&ATF_COM) ?
--- 75 ----
!               if (++at->at_timer < ((at->at_flags&ATF_COM) ?
***************
*** 436 ****
!               if ((int)at->at_timer > oldest) {
--- 432 ----
!               if (at->at_timer > oldest) {