[comp.sys.pyramid] help with mbuf leak problem?

karl_kleinpaste@cis.ohio-state.edu (09/15/90)

Pyramid 98xe, OSx4.4c, nfsd 8, biod 8.

In the last 16 hours or so, one of my Pyrs has developed a nasty case
of mbuf lossage.  Here's netstat -m output from just before its last
reboot, about 10 minutes ago:

	2003/2032 mbufs in use:
	        1877 mbufs allocated to data
	        12 mbufs allocated to packet headers
	        109 mbufs allocated to routing table entries
	        3 mbufs allocated to socket names and addresses
	        2 mbufs allocated to interface addresses
	128/128 mapped pages in use
	510 Kbytes allocated to network (99% in use)
	1 requests for memory denied

Note the excessive data mbuf allocation and 99% utilization.  Compare
the same output from its twin in the next cabinet, which looks quite
normal and has been up for days:

	86/288 mbufs in use:
	        3 mbufs allocated to data
	        4 mbufs allocated to packet headers
	        75 mbufs allocated to routing table entries
	        2 mbufs allocated to socket names and addresses
	        2 mbufs allocated to interface addresses
	28/96 mapped pages in use
	228 Kbytes allocated to network (29% in use)
	0 requests for memory denied

This leakage started happening sometime around 5pm or 6pm last
evening.  I have had to reboot almost hourly just to keep the @#$%
machine alive.  I've experimented with several things, trying to find
the cause.  Killing off assorted network daemons didn't help;
sendmail, nntp, inetd as a whole, routed, pcnfsd were all killed, and
yet the data mbuf allocation keeps ratcheting upward.  I tried
rebooting with 16 nfsd/biod but this was no help either.  Killing off
all nfsd/biod and the portmapper didn't help.  Renicing nfsd and/or
biod didn't help.  As near as I can see, nothing running on the Pyr
itself is the cause of this.

"etherfind -r -n src victim-pyr or dst victim-pyr" run from a nearby
SunOS4.1 Sun3 shows a great deal of NFS traffic, of this form:

	UDP from another-pyr.1023 to victim-pyr.2049  128 bytes
	 RPC Call prog 200000 proc 1 V1 [93dc7]
	UDP from victim-pyr.2049 to another-pyr.1023  104 bytes
	 RPC Reply  [93dc7] AUTH_NULL Success
	UDP from another-pyr.1023 to victim-pyr.2049  172 bytes
	 RPC Call prog 200000 proc 9 V1 [93dc8]
	UDP from victim-pyr.2049 to another-pyr.1023  36 bytes
	 RPC Reply  [93dc8] AUTH_NULL Success
	UDP from another-pyr.1023 to victim-pyr.2049  172 bytes
	 RPC Call prog 200000 proc 9 V1 [93dc9]
	UDP from victim-pyr.2049 to another-pyr.1023  36 bytes
	 RPC Reply  [93dc9] AUTH_NULL Success
	UDP from another-pyr.1023 to victim-pyr.2049  128 bytes
	 RPC Call prog 200000 proc 1 V1 [93dca]
	UDP from victim-pyr.2049 to another-pyr.1023  104 bytes
	 RPC Reply  [93dca] AUTH_NULL Success

But not all of this traffic is coming from another-pyr -- assorted
Pyrs, Suns, and the occasional HP show up.
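
To get a rough idea of which clients account for the bulk of it, the
trace can be summarized per source.  This is only a sketch, run on the
same Sun, assuming the output format shown above (field positions may
need adjusting):

	# let etherfind run for a few minutes, then ^C it
	etherfind -r -n src victim-pyr or dst victim-pyr > /tmp/nfs.trace
	# count packets aimed at the victim's NFS port, per source host.port
	grep 'to victim-pyr.2049' /tmp/nfs.trace |
	        awk '{ print $3 }' | sort | uniq -c | sort -rn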

I'm also getting messages like
	NFS server write failed: (err=13, dev=0xffa610a4, ino=0xffa69bd0).
on the console occasionally.  Errno 13 is EACCES.  ???

The only anomalous thing about this Pyr's configuration is that it's
the departmental /usr/spool/mail NFS server.  But that's been the case
for a couple of years now, nothing new or unusual about that.

As I said, I'm rebooting roughly hourly at this point to keep it
alive.  It performs admirably right up until the end, when the
2032/2032 mbuf condition hits.  A reboot takes 10 minutes, and then
it's fine again for the next hour while the mbuf count climbs back up.
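
To pin down when the climbing resumes after each reboot, the summary
line can be logged periodically; a crude sh sketch (the grep just
matches the first line of the netstat -m output above):

	# append a timestamped mbuf summary every five minutes
	while :; do
	        date
	        netstat -m | grep 'mbufs in use'
	        sleep 300
	done >> /tmp/mbufs.log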

Clues, anyone?  I can't think of anything that would have been started
at 5pm on a Friday evening which might cause this sort of thing.  What
sort of activity on the Pyr or elsewhere on my network should I be
looking for?

--karl

karl_kleinpaste@cis.ohio-state.edu (09/18/90)

On Saturday, I wrote in desperation:
   In the last 16 hours or so, one of my Pyrs has developed a nasty
   case of mbuf lossage.

I received the following extremely helpful, detailed reply from Robert
Ehrlich.  The cause he describes correlates perfectly with our problem
here; we recently upgraded several hundred Sun3/50s under SunOS3.5 to
Sun SLCs under SunOS4.1, and the SLC causing the abuse (finally found
with a combination of etherfind and shutting down Sun subnets in turn)
belongs to a faculty member who is in 9 groups when the Pyrs only
believe in 8.  We are now seeing to it that all users are in no more
than 8 groups.
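
For anyone else hunting for such users, a quick-and-dirty sh sweep
along these lines can flag them.  It's only a sketch: the exact output
of groups(1) varies, and NIS sites would substitute "ypcat passwd" for
/etc/passwd.

	# flag every account belonging to more than 8 groups
	# note: some versions of groups(1) print a "user :" prefix;
	# adjust the threshold if so
	awk -F: '{ print $1 }' /etc/passwd |
	while read u; do
	        n=`groups $u | wc -w`
	        [ $n -gt 8 ] && echo "$u is in $n groups"
	done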

Is the problem Robert describes "fixed in 5.0"?  The Pyr RTOC (or CCSC
or whatever it's called now) folks have been pestering me for a while
to upgrade (4.4c here still), but I've been putting it off due to a
lack of excitement over the prospect.  But if it'll take care of this
problem for me, I'll probably see about upgrading Real Soon Now, as
in, a couple of weekends from now.

Many thanx to Robert.

--karl

    ----------------
UNIX-From: ehrlich@inria.inria.fr  Mon Sep 17 19:41:33 1990
Received: from inria.inria.fr by cheops.cis.ohio-state.edu (5.61-kk/5.900912)
	id AA03385; Mon, 17 Sep 90 19:41:33 -0400
Received: by inria.inria.fr (5.64+/89.0.8)
	via Fnet-EUnet id AA12608; Tue, 18 Sep 90 01:42:17 +0200 (MET)
Date: Tue, 18 Sep 90 01:42:17 +0200
From: ehrlich@inria.inria.fr (Robert Ehrlich)
Message-Id: <9009172342.AA12608@inria.inria.fr>
To: karl_kleinpaste@cis.ohio-state.edu
Subject: Re:  help with mbuf leak problem?
Cc: dupont@inria.inria.fr
Lines: 56

We had a very similar problem on a Pyramid here, and after a long
investigation I discovered the reason.

In fact the problem is in no way specific to Pyramid; it can happen on any
machine acting as an NFS server and running the kernel NFS server code
written by Sun.

There is a bug in this code.  It occurs when an RPC call whose server is in
the kernel aborts early due to an error.  "Early" means after the socket
layer has delivered the packet to the RPC layer, but before the appropriate
dispatcher (in this case the NFS dispatcher) is called.  Such an error can
have three causes:
 1) authentication failure (e.g. a client sending 16 groups to a server
    supporting a maximum of 8 groups, which was our case);
 2) program not found;
 3) version mismatch.

There is a good chance the second case will never happen: since a dedicated
port number is associated with the NFS server, packets for other programs
will be sent to a different port.

But cases 1) and 3) can occur.  In fact the easiest way to see it is simply
to run "rpcinfo -u host nfs" without giving a version number: rpcinfo makes
a first call with a null version number, gets back an error reply carrying
the proper version numbers, and then issues a second, successful call.

In all three cases mentioned above, HERE IS THE BUG, the mbuf containing the
packet is never freed.  In the normal case, freeing it is the job of the
called program, but for these three kinds of error no program is ever
called, so nobody does it.

You can see it perfectly by putting the command "rpcinfo -u host nfs" in a
loop while monitoring the free mbufs with netstat -m.
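For instance, something along these lines (only a sketch; each rpcinfo
invocation's versionless probe draws a version-mismatch error and so leaks
one mbuf on an affected server):

	# from any client (or the server itself): hammer the NFS port
	# with versionless calls
	while :; do
	        rpcinfo -u host nfs > /dev/null
	done &

	# meanwhile, on the server, watch the data mbuf count climb
	while :; do
	        netstat -m | grep 'mbufs in use'
	        sleep 10
	done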

After discovering that, I wondered why it doesn't happen more often.  It
seems that the kinds of errors causing the problem are quite rare, and you
need a lot of them before you notice the loss of mbufs, except when some
piece of code loops on such an error.  That was the case for us: we noticed
the problem when somebody on a Sun started a program doing "df" on each
mounted partition every hour.  Unfortunately the Sun was running SunOS 4.0,
which allows 16 groups, and the user doing that was the only one belonging
to more than 8 groups.  The result was that the Pyramid crashed regularly
2 days after boot.  I think that in your case, too, something is looping
somewhere on the error.

Having source code from Pyramid, I had no difficulty fixing the problem (you
need only add a call to the appropriate routine that frees the mbufs in the
three error cases), but of course I have no right to publish that.  The bug
was reported to Pyramid, along with the fix.  In fact I found the bug in all
the kernels running NFS, except DEC Ultrix.

I used mail rather than news for my answer because I never use news: when
you have never read news before, the first time you try it takes so long
that you always cancel it with ^C.  I know about your problem because a
friend who reads news told me about it.  If you think my answer may be
useful to others, feel free to post any part of it to the appropriate
newsgroup, in better English if you want, since, as you have probably
noticed, I am French.
    ----------------