karl_kleinpaste@cis.ohio-state.edu (09/15/90)
Pyramid 98xe, OSx4.4c, nfsd 8, biod 8.

I've developed a nasty problem with one of my Pyrs in the last 16 hours or so: serious mbuf lossage. Here's a netstat -m output just before its last reboot, about 10 minutes ago:

    2003/2032 mbufs in use:
            1877 mbufs allocated to data
            12 mbufs allocated to packet headers
            109 mbufs allocated to routing table entries
            3 mbufs allocated to socket names and addresses
            2 mbufs allocated to interface addresses
    128/128 mapped pages in use
    510 Kbytes allocated to network (99% in use)
    1 requests for memory denied

Note the excessive data mbuf allocation and the 99% utilization. Compare the same output from its twin in the next cabinet, looking quite normal and running for days:

    86/288 mbufs in use:
            3 mbufs allocated to data
            4 mbufs allocated to packet headers
            75 mbufs allocated to routing table entries
            2 mbufs allocated to socket names and addresses
            2 mbufs allocated to interface addresses
    28/96 mapped pages in use
    228 Kbytes allocated to network (29% in use)
    0 requests for memory denied

This leakage started happening sometime around 5pm or 6pm last evening, and I have had to reboot almost hourly just to keep the @#$% machine alive.

I've experimented with several things, trying to find the cause. Killing off assorted network daemons didn't help: sendmail, nntp, inetd as a whole, routed, and pcnfsd were all killed, and yet the data mbuf allocation keeps ratcheting upward. I tried rebooting with 16 nfsd/biod, but this was no help either. Killing off all nfsd/biod and the portmapper didn't help. Renicing nfsd and/or biod didn't help. As near as I can see, nothing running on the Pyr itself is the cause of this.
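One way to watch the ramp rather than catching it at the end is to sample netstat -m on a timer. A minimal sketch, assuming the BSD-style "N/M mbufs in use" line shown above; the "mbuf_usage" helper name is my own:

```shell
# Print the in-use and total mbuf figures from one netstat -m snapshot.
# Assumes the "2003/2032 mbufs in use:" line format shown above.
mbuf_usage() {
    awk -F'[/ ]' '/mbufs in use/ { print $1, $2; exit }'
}

# Sample every 5 minutes; the first column should climb steadily on the
# sick machine and stay flat on its healthy twin:
#   while sleep 300; do
#       printf '%s %s\n' "$(date)" "$(netstat -m | mbuf_usage)"
#   done >> /tmp/mbuf.log
```

Plotting or eyeballing the log gives the leak rate, which helps correlate it with whatever job started at 5pm.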
"etherfind -r -n src victim-pyr or dst victim-pyr" run from a nearby SunOS4.1 Sun3 shows a great deal of NFS traffic, of this form:

    UDP from another-pyr.1023 to victim-pyr.2049 128 bytes  RPC Call prog 200000 proc 1 V1 [93dc7]
    UDP from victim-pyr.2049 to another-pyr.1023 104 bytes  RPC Reply [93dc7] AUTH_NULL Success
    UDP from another-pyr.1023 to victim-pyr.2049 172 bytes  RPC Call prog 200000 proc 9 V1 [93dc8]
    UDP from victim-pyr.2049 to another-pyr.1023 36 bytes   RPC Reply [93dc8] AUTH_NULL Success
    UDP from another-pyr.1023 to victim-pyr.2049 172 bytes  RPC Call prog 200000 proc 9 V1 [93dc9]
    UDP from victim-pyr.2049 to another-pyr.1023 36 bytes   RPC Reply [93dc9] AUTH_NULL Success
    UDP from another-pyr.1023 to victim-pyr.2049 128 bytes  RPC Call prog 200000 proc 1 V1 [93dca]
    UDP from victim-pyr.2049 to another-pyr.1023 104 bytes  RPC Reply [93dca] AUTH_NULL Success

But not all of this traffic is coming from another-pyr; assorted Pyrs, Suns, and the occasional HP show up.

I'm also occasionally getting messages like this on the console:

    NFS server write failed: (err=13, dev=0xffa610a4, ino=0xffa69bd0).

Errno 13 is EACCES. ???

The only anomalous thing about this Pyr's configuration is that it's the departmental /usr/spool/mail NFS server. But that's been the case for a couple of years now; nothing new or unusual about that.

As I said, I'm rebooting roughly hourly at this point to keep it alive. It seems to perform admirably right up until the end, when the 2032/2032 mbuf condition hits. It reboots in 10 minutes and is fine again for the next hour, while the mbuf count goes up.

Clues, anyone? I can't think of anything that would have been started at 5pm on a Friday evening which might cause this sort of thing. What sort of activity on the Pyr or elsewhere on my network should I be looking for?

--karl
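Since several clients show up in the trace, tallying RPC calls to the NFS port by source host can finger the worst offender. A sketch assuming the exact "UDP from src.port to dst.port ..." line layout shown above; "count_nfs_callers" is my own name:

```shell
# Count RPC calls aimed at the NFS port (2049) by source host, given
# etherfind lines of the form "UDP from src.port to dst.2049 ...".
count_nfs_callers() {
    awk '$1 == "UDP" && $5 ~ /\.2049$/ {
             sub(/\.[0-9]+$/, "", $3)   # strip the source port
             calls[$3]++
         }
         END { for (h in calls) print calls[h], h }' |
    sort -rn
}

# Usage (on a Sun on the same segment):
#   etherfind -r -n dst victim-pyr | count_nfs_callers
```

The busiest caller is not necessarily the leaker, but a host looping on an error tends to dominate the tally.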
karl_kleinpaste@cis.ohio-state.edu (09/18/90)
On Saturday, I wrote in desperation:

   I've developed a nasty problem with one of my Pyrs in the last 16
   hours or so. It has developed a serious problem with mbuf lossage.

I received the following extremely helpful, detailed reply from Robert Ehrlich. The cause he describes correlates perfectly with our problem here: we recently upgraded several hundred Sun3/50s under SunOS3.5 to Sun SLCs under SunOS4.1, and the SLC causing the abuse (finally found with a combination of etherfind and shutting down Sun subnets in turn) belongs to a faculty member who is in 9 groups, when the Pyrs only believe in 8. We are now seeing to it that all users are in no more than 8 groups.

Is the problem Robert describes "fixed in 5.0"? The Pyr RTOC (or CCSC or whatever it's called now) folks have been pestering me for a while to upgrade (4.4c here still), but I've been putting it off due to a lack of excitement over the prospect. But if it'll take care of this problem for me, I'll probably see about upgrading Real Soon Now, as in, a couple of weekends from now.

Many thanx to Robert.

--karl
----------------
UNIX-From: ehrlich@inria.inria.fr Mon Sep 17 19:41:33 1990
Received: from inria.inria.fr by cheops.cis.ohio-state.edu (5.61-kk/5.900912)
	id AA03385; Mon, 17 Sep 90 19:41:33 -0400
Received: by inria.inria.fr (5.64+/89.0.8) via Fnet-EUnet id AA12608;
	Tue, 18 Sep 90 01:42:17 +0200 (MET)
Date: Tue, 18 Sep 90 01:42:17 +0200
From: ehrlich@inria.inria.fr (Robert Ehrlich)
Message-Id: <9009172342.AA12608@inria.inria.fr>
To: karl_kleinpaste@cis.ohio-state.edu
Subject: Re: help with mbuf leak problem?
Cc: dupont@inria.inria.fr
Lines: 56

We had a very similar problem on a Pyramid here, and after a long investigation I discovered the reason. In fact the problem is in no way specific to Pyramid; it can happen on any machine acting as an NFS server and running the kernel NFS server code written by Sun. There is a bug in this code.
It occurs when an RPC call whose server is in the kernel aborts early due to an error. "Early" means after the socket layer has delivered the packet to the RPC layer, but before the appropriate dispatcher (in this case the NFS dispatcher) is called. Such an error may be due to 3 causes: 1) authentication failure (e.g. a client sending 16 groups to a server supporting a maximum of 8 groups, which was our case); 2) program not found; 3) version mismatch.

There is a good chance that the second case will never happen, because there is a port number associated with the NFS server, and packets for other programs will be sent to another port. But cases 1) and 3) can occur; in fact the easiest way to see it is simply to do "rpcinfo -u host nfs", without giving a version number. Rpcinfo makes a first call with a null version number, gets an error reply containing the proper version number, and issues a second, successful call.

In all 3 cases mentioned above, HERE IS THE BUG: the mbuf containing the packet is never freed. In the normal case, freeing it is the job of the called program, but in these 3 error cases, since no program is called, nobody does it. You can see it perfectly by putting the command "rpcinfo -u host nfs" in a loop and at the same time monitoring the free mbufs using netstat -m.

After discovering that, I wondered why it doesn't happen more often. It seems that the kinds of errors causing the problem are quite rare, and you need a lot of them before you notice the loss of mbufs, except when some code loops on such an error. That was our case: we noticed the problem when somebody on a Sun started a program doing "df" on each mounted partition every hour. Unfortunately the Sun was under SunOS 4.0 with 16 groups allowed, and the user doing that was the only one belonging to more than 8 groups. The result was a regular crash of the Pyramid 2 days after each boot. I think that in your case, too, something is looping somewhere on the error.
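Robert's rpcinfo experiment can be scripted. A sketch, with "victim-pyr" as a placeholder host name; note that if the bug is present, each probe leaks one mbuf on the server, so do not aim this at a production machine you care about:

```shell
# Drive the version-mismatch error path N times: each versionless
# "rpcinfo -u host nfs" probe should leak one mbuf on an affected server.
leak_probe() {
    host=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        rpcinfo -u "$host" nfs > /dev/null 2>&1
        i=$((i + 1))
    done
}

# On the client:           leak_probe victim-pyr 100
# On the server, before and after:
#   netstat -m | grep 'mbufs in use'
```

If the in-use figure on the server rises by roughly the probe count and never comes back down, you are looking at this bug.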
Having source code from Pyramid, I had no difficulty fixing the problem (you need only add a call to the appropriate routine that frees the mbufs in the 3 error cases). But of course I have no right to publish that. The bug was reported to Pyramid, with the fix. In fact I found the bug in all kernels running NFS, except DEC Ultrix.

I used mail rather than news for my answer because I never use news: when you have never read news, the first time you try it takes so long that you always cancel it with ^C. I know of your problem because a friend who reads news told me about it. If you think my answer may be useful to others, feel free to send any part of it to the appropriate newsgroup, in better English if you want, since, as you probably noticed, I am French.
----------------