cpw%sneezy@LANL.GOV.UUCP (05/28/87)
Our MILNET Gateway is a VAX-11/750 running Berkeley 4.3 UNIX. Since installing
4.3 we have experienced a number of crashes of the type:
panic: out of mbufs: map full
After much searching and accounting for all the message buffers (mbufs), I
determined that this 'panic', in two cases, was precipitated by an uncontrolled
stream of 1-byte TCP packets emanating from (in panic #1) a MILNET host 2500
(plus or minus a thousand) miles away. The actual configurations were:
panic #1: (VMS login to NRL)          (a telnet from NRL)

       -------     -----------     ------     ----
tty---|UB RF? |   |VAX-11/780 |---|MILNET|   |LANL|
      |connect|---|VMS Woll...|   |      |---|    |
       -------     -----------     ------     ----

panic #2: (VMS login to AFWL)         (a telnet from AFWL)

       -------     -----------     ------     ----
ibm---|UB ??  |   |VAX-11/780 |---|MILNET|   |LANL|
clone |connect|---|VMS Woll...|   |      |---|    |
       -------     -----------     ------     ----
Interestingly, the host at NRL crashed at the same time ours did.
I did not check out the AFWL case completely. I would like to know
why these 1-byte packets are coming our way. In both panics, each byte was
a hexadecimal '0d' (an ASCII carriage return). Could that stand for 0VERdOSE???
The crash results when a TCP sender begins to pump beaucoup packets out
on the net in ascending sequence, each 1 character in length; the network
conveniently loses some small number of them early on; and the receiver's
window is greater than the total number of mbufs available. The TCP input
routine appends these 1-character mbufs to a reassembly queue, hoping for the
retransmission of the lost packets so that it can make the list available to
the receiving application. Needless to say, the offending system never sends
the missing pieces, the kernel links whatever system message buffers remain
onto the reassembly queue, and finally some other process asks for an mbuf
with the M_WAIT option, and 4.3 panics in m_clalloc.
I call this the 'silly assembly syndrome'.
I have posted a fix to unix-wizards for the 4.3BSD UNIX folks, which prevents
the receiver from crashing. However, the underlying behavior is most likely a
problem in other TCP implementations as well. It appears to be something that
AFWL and NRL, not to mention other installations, might like fixed.
Any comments?
-Phil Wood (cpw@lanl.gov)
Takano%THOR@hplabs.HP.COM (Haruka Takano) (05/28/87)
Could it have something to do with the Wollongong software? We have Wollongong software running on a VAX and some PC/AT clones here, and I've noticed that occasionally, when talking to our DEC-20, the window size will go down to 1 (sounds suspiciously like a version of the silly window syndrome). Has anybody else run into this problem?

--Haruka Takano (Takano@HPLABS.HP.COM)
heker@JVNCA.CSC.ORG (Sergio Heker) (05/29/87)
Haruka: I ran a test over a T1 line connecting a VAX 8600 running ULTRIX and a VAX 750 running Wollongong, both machines with DMR11 interfaces. The MTU for both interfaces is 1248. I noticed that the round-trip delay I was getting was less than 30 ms from 18 bytes up to 1248 bytes; then, as I continued increasing the packet size, it jumped to around 6 seconds. This only happens when one end is a machine running Wollongong. When I tried it from ULTRIX to ULTRIX, the jump is about one order of magnitude instead of 200 times.

-- Sergio Heker, JVNCnet
dab@oliver.cray.COM.UUCP (05/29/87)
Many months ago I ran into problems with running out of mbufs on the Cray computers. Among the problems was the one described by Phil Wood: getting in lots of little (1-byte) packets and having one of the early ones get lost. I did a twofold fix to the Berkeley code (4.2; the same mods apply to 4.3) to keep this problem to a minimum.

1) Compact the TCP reassembly queue. The Berkeley code does not compact the TCP reassembly queue; if you have 500 1-byte packets on your reassembly queue, you are using up 500 mbufs.

2) In uipc_mbuf.c, m_expand() contains the comment
	/* should ask protocols to free code */
Well, I did just that. I wrote a routine called pfdrain(), almost identical to pfctlinput(). Then I added code to the tcp_drain() routine to actually go through all the TCP reassembly queues and free up all the fragments. Since we haven't acknowledged any of it, it's no problem to toss it.

These mods are not very long; it took around 40 lines of code to do the above. I sent them to Mike Karels a while ago (but not in time for the release). Perhaps these fixes might show up on the 4.3 bug list at some point. If you are in urgent need of this code, contact me directly.

Dave Borman
Cray Research, Inc.
dab@umn-rei-uc.arpa
cpw%sneezy@LANL.GOV (C. Philip Wood) (07/01/87)
About a month ago I posted some commentary on a Berkeley UNIX panic caused by an uncontrolled stream of 1-byte TCP packets emanating from some VMS hosts. I had also posted a fix to unix-wizards. Unfortunately, that was not the complete story: we continued to crash occasionally due to a clobbered message buffer pool.

The reason for this has been found. 4.3BSD has a cluster buffer concept which is used in socket creation, among other things. A cluster buffer is a relatively large block of storage (CLBYTES bytes) pointed to in mysterious ways by the normal small message buffers. The routine to get a cluster buffer (MCLGET) did not set up an error condition if it failed to reserve a cluster buffer. Consequently, at some point after running out of message buffers, the system would run across one of the clobbered cluster buffers and panic.

The fix to MCLGET is to set the 'm_len' field in the small message buffer to 0 (or anything other than CLBYTES) before returning to the caller.
About a month ago I posted some commentary on a Berkeley UNIX panic caused by an uncontrolled stream of 1 byte TCP packets emanating from some VMS hosts. I had also posted a fix to UNIX-wizards. Unfortunately, that was not the complete story. We continued to crash occasionally due to a clobbered message buffer pool. The reason for this has been found. 4.3 BSD has a cluster buffer concept which is used in socket creation among other things. A cluster buffer is relatively large block of storage (CLBYTES bytes) pointed to in mysterious ways by the normal small message buffers. The routine to get a cluster buffer (MCLGET) did not set up an error condition if it failed to reserve a cluster buffer. Consequently, at some point after running out of message buffers, the system would run across one of the clobbered cluster buffers and panic. The fix to MCLGET is to set the 'm_len' field in the small message buffer to 0 or anything other than CLBYTES before returning to the caller.