[comp.unix.wizards] 4.3 BSD networking

cpw%sneezy@LANL.GOV (C. Philip Wood) (07/26/87)

INTRODUCTION

Our MILNET host (VAX BSD 4.3) can get pretty busy sometimes.  And, if you
throw in rude hosts with broken network implementations, sending junk, the
result used to be:

	panic: out of mbufs: map full

I have made a number of changes (some posted to news) to the kernel to allow
us to weather this problem, with some good success.  However, since then,
I have had a few crashes which, I assume, resulted from traversing untested
code in the kernel.  I am hoping for some discussion, pointers to other
discussions, fixes, etc., on buffering, congestion control, garbage collection,
recognition and control of rude hosts.  What follows is an attempt to
summarize my experience modifying the kernel.  Familiarity with 'netinet/*.c'
and 'sys/uipc*.c' and 'h/*.h' modules is assumed.

SUMMARY OF CHANGES

To begin with, I noticed there was provision for sleep and
wakeup on the 'mfree' queue.  However, this code was never exercised since
the panic occurred first.  I modified 'm_clalloc' to just return a failure
which would cause 'm_more' to sleep in the M_WAIT state and 'm_expand'
to return 'mfree' on the off chance that some other process had released
some message buffers.

At first this did not work at all!  I found that the numerous calls
to MGET/m_get were not very careful about the wait option.  Consequently,
the kernel would attempt a sleep on an interrupt.  I found all these
babies and changed the canwait flag appropriately.
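
A minimal user-space sketch of the wait/no-wait policy described above.  It
is not the kernel code: every 'toy_' name is hypothetical, and "sleeping" is
modelled as a flag the caller must act on.  The point is only that a caller
running at interrupt level has to pass the no-wait option and tolerate a
NULL return, while a process-level caller may be told to sleep and retry.

```c
#include <stddef.h>

/* Stand-ins for M_DONTWAIT / M_WAIT; the values are assumptions. */
enum { TOY_DONTWAIT = 0, TOY_WAIT = 1 };

struct toy_mbuf { struct toy_mbuf *next; };

struct toy_pool {
    struct toy_mbuf *free;   /* free list, playing the role of 'mfree' */
    int want;                /* a sleeper is waiting, like 'm_want' */
    unsigned long drops;     /* no-wait requests that were refused */
    unsigned long waits;     /* wait requests that had to block */
};

/*
 * Take one buffer.  On an empty pool a no-wait caller gets NULL and is
 * counted as a drop; a wait caller gets NULL with *must_sleep set, and
 * is expected to sleep and retry (never legal at interrupt level).
 */
struct toy_mbuf *toy_get(struct toy_pool *p, int canwait, int *must_sleep)
{
    *must_sleep = 0;
    if (p->free != NULL) {
        struct toy_mbuf *m = p->free;
        p->free = m->next;
        m->next = NULL;
        return m;
    }
    if (canwait == TOY_WAIT) {
        p->want = 1;
        p->waits++;
        *must_sleep = 1;
    } else {
        p->drops++;
    }
    return NULL;
}

/* Return a buffer to the pool; result is 1 if a sleeper should be woken. */
int toy_free(struct toy_pool *p, struct toy_mbuf *m)
{
    int wake = p->want;
    m->next = p->free;
    p->free = m;
    p->want = 0;
    return wake;
}
```

Keeping separate 'drops' and 'waits' counters makes it possible to tell
refused requests apart from requests that blocked.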

This revised system worked very well.  I could prove this by pounding the
system with thousands of packets which used to panic the unrevised system.
The new version stayed up and I thought "Oh boy".  However, my joy was
short-lived (6 days).

The first crash I experienced resulted from a bug in MCLGET which assumed
that on call the mbuf (m)->m_len was not equal to CLBYTES.  So, a failure
return from MCLALLOC would return a success from MCLGET if (on call) m->m_len
was equal to CLBYTES.  Then the calling process would happily fill in the
pseudo cluster with whatever, eventually leading to some awful panic like
a Segmentation Fault, depending on what that cluster space might have been
used for (like a socket structure or someone's data space).
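
The failure mode can be shown with a toy model.  This is a sketch under
assumptions, not the real macro: 'toy_mbuf', 'TOY_CLBYTES' and both functions
are hypothetical, and the "checked" variant is only one possible repair, not
necessarily the fix that was applied.  The fragile version reports nothing,
so a caller that infers success from m_len is fooled whenever m_len already
held the cluster size on entry.

```c
#include <stddef.h>

#define TOY_CLBYTES 1024          /* stand-in for CLBYTES; value assumed */

struct toy_mbuf {
    int   m_len;
    char *m_cluster;              /* NULL when no cluster is attached */
};

/*
 * Fragile form: 'cluster' is the result of the cluster allocation
 * (NULL on failure).  Nothing is reported; callers test
 * m->m_len == TOY_CLBYTES afterwards to decide whether it worked.
 */
void toy_clget_fragile(struct toy_mbuf *m, char *cluster)
{
    if (cluster != NULL) {
        m->m_cluster = cluster;
        m->m_len = TOY_CLBYTES;
    }
}

/*
 * Checked form: success or failure is returned explicitly, so the
 * value m_len happened to hold on entry no longer matters.
 */
int toy_clget_checked(struct toy_mbuf *m, char *cluster)
{
    if (cluster == NULL)
        return 0;                 /* allocation failed, m is untouched */
    m->m_cluster = cluster;
    m->m_len = TOY_CLBYTES;
    return 1;
}
```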

I fixed this one, and thought "Oh boy".  Well, another few days went by and
we restarted the named daemon, and:

	panic: Segmentation fault

By this time I had accumulated a pretty neat set of adb scripts with which
to dump out numerous aspects of the message buffering scheme, and found that:

	1. There were no free mbufs.  The kernel had run out of mbufs
	   2516 times and dropped 2462 M_DONTWAIT requests.  The difference,
	   54, would be the number of times processes had been put to sleep.
	   The 'm_want' flag was zero, so presumably there were no processes
	   waiting for mbufs, or one was about to awake.

	2. There were 232 UDP domain nameserver packets waiting on receive
	   queues on the 'udb' queue.

	3. The kernel was attempting to attach a new tcp socket with the
	   sequence:

		ipintr -> tcp_input -> sonewconn

	   when it encountered a failure from tcp_usrreq and attempted to
	   dequeue the socket via 'soqremque'.  The socket had already
	   been soqremque'd deep in the guts of a sequence something like:

		tcp_usrreq -> in_pcbdetach -> sofree

	   Consequently, the code in soqremque attempted to use a 0 based
	   mbuf and grabbed a weird address for a socket out of low core.

I am trying to figure out how to fix this last one.  One fix would be to
put a silent check for zero in soqremque and just return, maybe bump a
counter or print from where called?  Any suggestions would be appreciated.
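
A sketch of the "silent check plus counter" idea, using a deliberately
simplified toy queue.  None of this is the real 'soqremque': the structure,
the single queue, and the names 'toy_qinsque', 'toy_qremque' and
'toy_qremque_misses' are all assumptions made for illustration.

```c
#include <stddef.h>

struct toy_sock {
    struct toy_sock *head;   /* listening socket; NULL when not queued */
    struct toy_sock *q;      /* next socket on the head's queue */
    int qlen;                /* queue length, meaningful on the head only */
};

/* Counts attempts to dequeue a socket that was not on any queue. */
unsigned long toy_qremque_misses;

/* Put 'so' on the queue owned by 'head'. */
void toy_qinsque(struct toy_sock *head, struct toy_sock *so)
{
    so->head = head;
    so->q = head->q;
    head->q = so;
    head->qlen++;
}

/*
 * Remove 'so' from its head's queue.  Returns 1 on success.  If the
 * socket has no head, or is not found on the queue, bump the counter
 * and return 0 instead of following a null pointer.
 */
int toy_qremque(struct toy_sock *so)
{
    struct toy_sock *head = so->head;
    struct toy_sock *prev;

    if (head == NULL) {
        toy_qremque_misses++;
        return 0;
    }
    prev = head;
    while (prev->q != so) {
        if (prev->q == NULL) {
            toy_qremque_misses++;
            return 0;
        }
        prev = prev->q;
    }
    prev->q = so->q;
    so->q = NULL;
    so->head = NULL;
    head->qlen--;
    return 1;
}
```

A guard like this only hides the double dequeue.  The counter at least shows
how often the path is taken while the real cause is tracked down.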

COMMENTARY

In one sense I am fixing untested kernel code.  But, if I step back just
a tad and take a look at what I'm doing, I see that I am attempting,
haphazardly, to resolve the problem of buffering and congestion control.
It turns out, in all cases (see below) I have investigated, the exhaustion
of the page map happened after all mbufs had been put on one queue or another.
That is to say, I can account for every mbuf in the pool.  None had been
"leaked" or forgotten.

   case 1. The first case came to light when I discovered most of the mbufs
	   were linked on a tcp reassembly queue for some telnet connection
	   from a VMS system over MILNET.  Each mbuf had one character in it.
	   With a receive window of 4K you can run out of mbufs pretty easy.

   case 2. The second case resulted from sending lots of udp packets
	   of trash over an ethernet and swamping the udp queue.

   case 3. The last case I investigated, resulted from many domain name
	   udp packets queueing up on the udp queue.  Similar to case 2,
	   but in this case the packets were 'legitimate'.

AS I SEE IT

The above points to two related items:

1. The 4.3 BSD kernel must be made more robust, to avoid being corrupted
   by rude hosts.  Does anyone have ideas on how to identify resource hogs?
   What to do when you find one?

2. Once a misbehaving host has been identified, who is it we contact
   to get the problem fixed in a timely fashion?  Where is it written
   down who to contact when XYZ vendor's ABC-n/XXX zzzOS operating
   system is doing something wrong, and it is located 2527 miles away
   in a vault operated by QRS, International?  Should this be part of
   the registration process for a particular domain?  Is it already?

Thanks for reading this far.

Phil Wood, cpw@lanl.gov.

jbn@glacier.STANFORD.EDU (John B. Nagle) (07/26/87)

In article <8479@brl-adm.ARPA> cpw%sneezy@LANL.GOV (C. Philip Wood) writes:
>
>   case 1. The first case came to light when I discovered most of the mbufs
>	   were linked on a tcp reassembly queue for some telnet connection
>	   from a VMS system over MILNET.  Each mbuf had one character in it.
>	   With a receive window of 4K you can run out of mbufs pretty easy.

    This is the old "tinygram problem", and appears in many old TCP
    implementations, including 4.2BSD.  I devised a theoretical
    solution to this problem years ago (see RFC896, Jan. 1984), and Mike
    Karels put it in 4.3BSD.  But there are still a lot of broken TCP
    implementations around, especially ones that are derived from 4.2.
    Ordinarily the tinygram problem only results in wasted bandwidth.
    But crashing the system is unreasonable.

    The receiver could protect itself against this situation by limiting
    the number of mbufs on the reassembly queue to (1+(window/max seg size)).
    A sender with the tinygram problem fixed will not exceed this limit.
    When that limit is reached, drop something, preferably the packet with
    the largest sequence number in the window.  This will prevent buffer
    exhaustion due to out of order tinygrams.
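
A toy illustration of the limit described above, assuming a small array kept
sorted by sequence number.  The names are hypothetical, and sequence-number
wraparound is ignored for brevity.  When the queue is at
1 + window/mss entries, the segment with the largest sequence number is the
one dropped, which may be the newcomer itself.

```c
#include <stddef.h>

#define TOY_MAXQ 64                      /* capacity of the toy queue */

struct toy_seg { unsigned long seq; int len; };

struct toy_reass {
    struct toy_seg q[TOY_MAXQ];          /* kept sorted by seq, ascending */
    int n;                               /* segments currently queued */
    int limit;                           /* 1 + window/mss, capped */
    unsigned long dropped;               /* segments discarded at the limit */
};

/* The proposed bound; mss is expected to be positive. */
int toy_reass_limit(int window, int mss)
{
    return mss > 0 ? 1 + window / mss : 1;
}

void toy_reass_init(struct toy_reass *r, int window, int mss)
{
    r->n = 0;
    r->dropped = 0;
    r->limit = toy_reass_limit(window, mss);
    if (r->limit > TOY_MAXQ)
        r->limit = TOY_MAXQ;
}

/* Queue a segment.  Returns 1 if it was kept, 0 if it was the one dropped. */
int toy_reass_insert(struct toy_reass *r, struct toy_seg s)
{
    int i;

    if (r->n == r->limit) {
        if (s.seq >= r->q[r->n - 1].seq) {
            r->dropped++;                /* newcomer has the largest seq */
            return 0;
        }
        r->n--;                          /* discard the current largest */
        r->dropped++;
    }
    /* Shift larger entries up and insert in sorted position. */
    for (i = r->n; i > 0 && r->q[i - 1].seq > s.seq; i--)
        r->q[i] = r->q[i - 1];
    r->q[i] = s;
    r->n++;
    return 1;
}
```

With a 4K window and an assumed 1024-byte maximum segment size the bound is
five queued segments, no matter how small the arriving segments are.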

    Examine the TCP sequence numbers in the queued mbufs and find out if
    there are duplicates.  If many packets are duplicated, the other end
    has a broken retransmission algorithm.

>   case 2. The second case resulted from sending lots of udp packets
>	   of trash over an ethernet and swamping the udp queue.

    The system crashed just because of a transient packet overload?  Strange.  
>
>   case 3. The last case I investigated, resulted from many domain name
>	   udp packets queueing up on the udp queue.  Similar to case 2,
>	   but in this case the packets were 'legitimate'.
>
    

    One problem with a shared dynamic resource such as mbufs is that for
    a system to work reliably, either all requestors for the resource must
    be able to tolerate rejected requests for the resource, or all requestors
    of the resource must have quotas which prevent hogging.   Given the
    way 4.3BSD works, the first solution appears to be partially implemented.
    When out of mbufs, one can discard incoming packets, of course, but
    this can be regarded only as an emergency measure.  On the other hand,
    waiting for an mbuf introduces the possibility of deadlock.
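
The quota alternative can be sketched in a few lines.  This is a toy
under assumptions, not anything in 4.3BSD: 'toy_quota' and both functions are
hypothetical, and the shared pool is reduced to a free-buffer count.

```c
/* Per-requestor accounting against a shared pool of buffers. */
struct toy_quota {
    int held;     /* buffers this requestor currently holds */
    int limit;    /* most it may hold at once */
};

/*
 * Try to take one buffer.  Fails (returns 0) when the pool is empty
 * or the requestor is already at its limit, so no single requestor
 * can hold the entire pool.
 */
int toy_quota_take(struct toy_quota *q, int *pool_free)
{
    if (*pool_free <= 0 || q->held >= q->limit)
        return 0;
    q->held++;
    (*pool_free)--;
    return 1;
}

/* Give one buffer back to the pool. */
void toy_quota_give(struct toy_quota *q, int *pool_free)
{
    if (q->held > 0) {
        q->held--;
        (*pool_free)++;
    }
}
```

If the limits sum to no more than the pool size, a requestor below its limit
can always get a buffer.  The cost is that buffers may sit idle under an
unused quota.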

>AS I SEE IT
>
>The above points to two related items:
>
>1. The 4.3 BSD kernel must be made more robust, to avoid being corrupted
>   by rude hosts.  Does anyone have ideas on how to identify resource hogs?
>   What to do when you find one?
>
>2. Once a misbehaving host has been identified, who is it we contact
>   to get the problem fixed in a timely fashion?  Where is it written
>   down who to contact when XYZ vendor's ABC-n/XXX zzzOS operating
>   system is doing something wrong, and it is located 2527 miles away
>   in a vault operated by QRS, International?  Should this be part of
>   the registration process for a particular domain?  Is it already?
>
     The system manager for each host known to the NIC is in the "whois"
database.  When I was faced with the problem of dealing with faulty hosts,
I used to send letters along the lines of "your MILNET host is causing
network interference due to noncompliance with MIL-STD-1778 (Transmission
Control Protocol) para 9.2.5.5.; see attached annotated packet trace;
copy to DCA code 252.", and followed this up with a phone call.  After
about a year of nagging, most of the worst offenders were fixed.

     Now that there exist decent TCP implementations for most iron, it
is usually sufficient to get sites upgraded to a current revision of
the network software for their machine.  So it is easier than it used to
be to get these problems fixed.

     The stock 4.3BSD kernel doesn't log much useful data to help in 
this task.  It's a real question whether this sort of test instrumentation
belongs in a production system.  I once put heavy logging in a 4.1BSD system
using 3COM's old TCP, and found it immensely useful, but one shouldn't
generalize from this.  This is really a subject for the TCP-IP list.

					John Nagle

forys@sigi.Colorado.EDU (Jeff Forys) (07/27/87)

In article <8479@brl-adm.ARPA> cpw%sneezy@LANL.GOV (C. Philip Wood) writes:
> if you throw in rude hosts with broken network implementations, sending
> junk, the result used to be:
>
>	panic: out of mbufs: map full

I had the same problem here.  Our 4.3BSD 11/785 would run out of mbufs
and crash at *least* once a week.  We've now gone over 20 days without
a crash so I'm betting the problem is no more...

> I am hoping for some discussion, pointers to other discussions, fixes,
> etc., on buffering, congestion control, garbage collection [...]

I spoke with Mike Karels about the problem.  He directed me to a couple
fixes Dave Borman (dab@umn-rei.uc.arpa) added to UNICOS (Cray UNIX).
He had advertised the fixes on the tcp-ip mailing list and they can be
easily ported to a BSD system.

> the exhaustion of the page map happened after all mbufs had been put
> on one queue or another.

Right.  I too, had assumed a leak when this first started happening (when
we put up 4.3BSD in January) but soon discovered this was not the case.

> [...] Each mbuf had one character in it.  With a receive window of 4K
> you can run out of mbufs pretty easy.

Uh huh, by any chance, are these running Wollongong TCP/IP?  After closer
examination, I discovered that's where our `problem' packets were coming
from.  Anyways, what you want to do here is compact the TCP reassembly
queues (i.e. take all the 1 byte packets and merge them together).
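
A sketch of what such compaction could look like, using a toy buffer
chain rather than real mbufs.  This is not Dave Borman's code: 'toy_buf',
'toy_compact' and the 112-byte capacity are assumptions.  A buffer is folded
into its predecessor when the two are contiguous in sequence space and the
combined data fits.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MLEN 112              /* assumed data bytes per small buffer */

struct toy_buf {
    struct toy_buf *next;
    unsigned long seq;            /* sequence number of data[0] */
    int len;                      /* bytes of data in use */
    char data[TOY_MLEN];
};

/*
 * Walk a queue sorted by seq and merge each buffer into the one before
 * it when they are contiguous and the result fits.  Emptied buffers are
 * pushed onto *freed for the caller to release.  Returns the number of
 * buffers reclaimed.
 */
int toy_compact(struct toy_buf *head, struct toy_buf **freed)
{
    struct toy_buf *m = head;
    int n = 0;

    while (m != NULL && m->next != NULL) {
        struct toy_buf *nx = m->next;

        if (m->seq + (unsigned long)m->len == nx->seq &&
            m->len + nx->len <= TOY_MLEN) {
            memcpy(m->data + m->len, nx->data, (size_t)nx->len);
            m->len += nx->len;
            m->next = nx->next;   /* unlink the emptied buffer */
            nx->next = *freed;
            *freed = nx;
            n++;
        } else {
            m = nx;               /* gap or no room: move on */
        }
    }
    return n;
}
```

Under the assumed capacity, a 4K window of contiguous one-byte segments would
shrink from 4096 buffers to 37.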

> [...] sending lots of udp packets of trash over an ethernet and
> swamping the udp queue.

While I have not experienced crashing due to being "swamped" (in fact,
our deuna would drop the packets before UDP ever sees them), the other
fix from Dave asks the protocols to free up mbufs when they run out.
This also takes care of the case where some brain-damaged machine sends
you every *other* packet.  Freeing things in tcp-reassembly queues doesn't
make me "happy", but since no acknowledgments for the stuff have gone
out, it's "safe".  Besides, what else can you do?  It's an unpleasant
situation...
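
The "ask the protocols to free up mbufs" idea can be sketched as a table of
drain hooks that the allocator walks when the pool is empty.  This is a toy
under assumptions, not the UNICOS or BSD code: 'toy_proto', 'toy_reclaim',
and the example TCP hook are hypothetical names.

```c
#include <stddef.h>

/* A drain hook releases what the protocol can spare; returns the count. */
typedef int (*toy_drain_fn)(void);

struct toy_proto {
    const char  *name;
    toy_drain_fn drain;           /* NULL if the protocol has nothing to give */
};

/* Toy stand-in for buffers held on TCP reassembly queues (unacknowledged). */
int toy_tcp_reass_held;

/* Example hook: drop everything on the toy reassembly queues. */
int toy_tcp_drain(void)
{
    int n = toy_tcp_reass_held;
    toy_tcp_reass_held = 0;
    return n;
}

/*
 * Called when allocation fails: give every protocol a chance to release
 * buffers, and report how many came back so the caller can retry.
 */
int toy_reclaim(const struct toy_proto *tab, int n)
{
    int i, freed = 0;

    for (i = 0; i < n; i++)
        if (tab[i].drain != NULL)
            freed += tab[i].drain();
    return freed;
}
```

The allocator would retry once after a reclaim pass that freed something,
and only then fail or sleep.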

This second fix has explicitly *not* been tested, but it looks like it
does the "right thing" (I should throw a log() message in there to see
if it's been used yet).  At any rate, besides the couple fixes to get
Dave's mods working under BSD, I also ifdef'd them so either or both
could be `activated' in the kernel config file.  I added his name to
the mods so I suppose I could be persuaded to pass them out if you don't
wanna fix them for BSD yourself.
---
Jeff Forys @ UC/Boulder Engineering Research Comp Cntr (303-492-4991)
forys@Boulder.Colorado.EDU  -or-  ..!{hao|nbires}!boulder!forys