[comp.unix.ultrix] nfs daemon blocks system.

parker@waters.mpr.ca (Ross Parker) (04/20/89)

In article <278@kubix.UUCP> mvw@kubix.UUCP (Maarten van Wijk) writes:
>
>VAX 11/750 with DEUNA controller and Ultrix 3.0.
>
>When I NFS mount a disk of the VAX on a SUN 3/50 and start some compilations
>on that disk, after some time the VAX hangs. The system is completely
>dead and no messages appear on the console.
>After forcing a crashdump and looking at the core I get the impression
>one of the nfs daemons has taken over the CPU. The output from ps -axk
>of the dump looks like this:
>
	...
	...
>
>In a normal state the nfs daemons consume about the same cpu time.
>
>Has anybody experienced something like this, or can anybody suggest a
>solution to the problem?


Yes! We've been having a problem with NFS that appears to be caused by
PCs on our network (using Sun's PC-NFS). Every once in a while the
nfs daemons on one of our microvaxes (Ultrix 2.2 or 2.3) will just go
bananas and eat up most of the CPU. We run 8 nfs daemons, and for the
space of about 5 minutes (sometimes less), each will chew up about
10 percent of the CPU. This drives the load up to a level where everyone
has to sit and wait for this to die down before they can work again.

We think that this is happening when someone prints a file from a PC
(but certainly not every time someone prints a file). It happens to all
four microvaxes that we have PCs connected to.

If anyone has any ideas, I'd certainly like to hear them!! DEC support
is clueless so far.

Ross Parker      uunet!ubc-cs!mpre!parker       |
Microtel Pacific Research Ltd.			| You can't erase the dream,
Burnaby, B.C.,					| you can only wake me up...
Canada, eh?					|

david@ms.uky.edu (David Herron -- One of the vertebrae) (04/24/89)

In article <1659@eric.mpr.ca> parker@waters.UUCP (Ross Parker) writes:
>In article <278@kubix.UUCP> mvw@kubix.UUCP (Maarten van Wijk) writes:
>Yes! We've been having a problem with NFS that appears to be caused by
>PCs on our network (using Sun's PC-NFS). Every once in a while the
>nfs daemons on one of our microvaxes (Ultrix 2.2 or 2.3) will just go
>bananas and eat up most of the CPU. We run 8 nfs daemons, and for the
>space of about 5 minutes (sometimes less), each will chew up about
>10 percent of the CPU. This drives the load up to a level where everyone
>has to sit and wait for this to die down before they can work again.
...
>If anyone has any ideas, I'd certainly like to hear them!! DEC support
>is clueless so far.

We have the same sort of problem: about 5 or 6 times a week one
of our uVaxIIen will lock up as you describe.  We do not run PC-NFS,
but we do have some Suns (v4 of SunOS) and a Sequent (v3.?? of Dynix),
and all these guys share NFS back and forth.  We are at v3 of Ultrix.

DEC support is also clueless.

Something which just occurred to me ... I believe that most of our
uVaxIIen have DEQNAs rather than DELQAs in 'em.
-- 
<- David Herron; an MMDF guy                              <david@ms.uky.edu>
<- ska: David le casse\*'      {rutgers,uunet}!ukma!david, david@UKMA.BITNET
<- By all accounts, Cyprus was covered with trees at one time
<- 		-- Until they discovered Bronze

steved@longs.LANCE.ColoState.Edu (Steve Dempsey) (04/24/89)

> In article <1659@eric.mpr.ca> parker@waters.UUCP (Ross Parker) writes:
> >In article <278@kubix.UUCP> mvw@kubix.UUCP (Maarten van Wijk) writes:
> >Yes! We've been having a problem with NFS that appears to be caused by
> >PCs on our network (using Sun's PC-NFS). Every once in a while the
> >nfs daemons on one of our microvaxes (Ultrix 2.2 or 2.3) will just go
> >bananas and eat up most of the CPU. We run 8 nfs daemons, and for the
> >space of about 5 minutes (sometimes less), each will chew up about
> >10 percent of the CPU. This drives the load up to a level where everyone
> >has to sit and wait for this to die down before they can work again.
> ...
> >If anyone has any ideas, I'd certainly like to hear them!! DEC support
> >is clueless so far.
> 
> We have the same sort of problem: about 5 or 6 times a week one
> of our uVaxIIen will lock up as you describe.  We do not run PC-NFS,
> but we do have some Suns (v4 of SunOS) and a Sequent (v3.?? of Dynix),
> and all these guys share NFS back and forth.  We are at v3 of Ultrix.
> 

All this talk of stuck NFS servers, etc. sounds very familiar.  We
have quite a variety of hardware and software: Vax780s, '730,
uVaxII, '3600s, 3200s, SUN3/50s, and many VS2000s; most running
Ultrix 2.2, the '780s running 4.3BSD+XINU.  Ethernet is DELQA on the
newer machines, along with Proteon P1100s (proNET-10 ring).  Every
machine mounts at least one remote file system, and some make 2 or 3
gateway hops to get there.  Machines on the same physical net do just
fine.  Different gateways seem to cause different problems: lots of
timeouts that return reasonably quickly (a few seconds), some LONG
timeouts, and some that just hang forever.  Usually the client hangs,
but sometimes the server all but locks up as described above.

So what causes all this?  Beats me, but our solution is to fix the
read and write size to something smaller than a packet.  That's
options rsize=xxxx,wsize=xxxx in /etc/fstab.  We chose 1024 because
both proNET and ethernet tcp/ip packets are a few hundred bytes larger
than 1K.  All our NFS problems seem to have disappeared.  Of course
this solution was discovered completely by trial and error (and error
and .... :-)
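
One plausible reading of why the smaller transfer size helps (an
assumption; nothing in this thread establishes the cause): NFS of this
era normally runs over UDP with a default transfer size of 8192 bytes,
so each full-sized read or write is split into several IP fragments,
and losing any one fragment at a gateway forces the whole request to be
retried.  The sketch below only does the arithmetic.  The 1500-byte
Ethernet MTU and the 20-byte IP and 8-byte UDP headers are standard
figures; the 128 bytes of RPC/NFS header overhead is a rough guess, and
nfs_fragments is a hypothetical helper name.

```shell
#!/bin/sh
# Estimate how many IP fragments one NFS-over-UDP transfer needs.
# usage: nfs_fragments TRANSFER_SIZE MTU
nfs_fragments() {
    size=$1            # rsize or wsize in bytes
    mtu=$2             # link MTU in bytes (1500 for Ethernet)
    ip_hdr=20          # IPv4 header without options
    udp_hdr=8          # UDP header
    rpc_overhead=128   # ASSUMPTION: rough RPC + NFS header size

    # Every fragment carries its own IP header, and the data in each
    # fragment except the last must be a multiple of 8 bytes.
    per_frag=$(( (mtu - ip_hdr) / 8 * 8 ))
    total=$(( udp_hdr + rpc_overhead + size ))
    echo $(( (total + per_frag - 1) / per_frag ))
}

nfs_fragments 8192 1500   # default-sized transfer on Ethernet
nfs_fragments 1024 1500   # the 1024-byte setting chosen above
```

Under these assumptions the first call prints 6 and the second prints 1:
a 1024-byte transfer fits in a single unfragmented packet, which matches
the "smaller than a packet" rule of thumb above.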

        Steve Dempsey,  Center for Computer Assisted Engineering
  Colorado State University, Fort Collins, CO  80523    +1 303 491 0630
INET: steved@longs.LANCE.ColoState.Edu, dempsey@handel.CS.ColoState.Edu
UUCP: boulder!ccncsu!longs.LANCE.ColoState.Edu!steved, ...!ncar!handel!dempsey

grr@cbmvax.UUCP (George Robbins) (04/24/89)

In article <11582@s.ms.uky.edu> david@ms.uky.edu (David Herron -- One of the vertebrae) writes:
> In article <1659@eric.mpr.ca> parker@waters.UUCP (Ross Parker) writes:
> >In article <278@kubix.UUCP> mvw@kubix.UUCP (Maarten van Wijk) writes:
> >Yes! We've been having a problem with NFS that appears to be caused by
> >PCs on our network (using Sun's PC-NFS). Every once in a while the
> >nfs daemons on one of our microvaxes (Ultrix 2.2 or 2.3) will just go
> >bananas and eat up most of the CPU. We run 8 nfs daemons...
...
> >If anyone has any ideas, I'd certainly like to hear them!! DEC support
> >is clueless so far.
> 
> We have the same sort of problem .. 
...
> DEC support is also clueless.

Well, I was about to suggest that you blast the offending daemon with
a quit signal and see if you could get a useful "core" file, which
would at least give some clue as to what the daemon thought it was up
to.  Unfortunately, the object is stripped, which makes debugging all
but impossible.  Still, you might give it a shot and send DEC the
data.  See "man signal" for signals that will elicit a dump; they
probably don't try to catch all of them.

Is there any hope of getting DEC to put unstripped objects on the
distribution tape as an optional file?  They seem to be limiting both
the customers and their own ability to diagnose problems by shipping
only the stripped versions...

We run 4 biod daemons here on a 785 running 2.2, and I haven't noticed
the problem you mention.  On the other hand, NFS use here isn't very
intensive and I might not even notice an occasional "lockup" as long as
rn still works good.   8-)  Oh yeah, no PC-NFS (yet); we have Sun-2s
running 3.x, a Sun-4 running that interim release, and a bunch of Amigas
running Ameristar's NFS package.

-- 
George Robbins - now working for,	uucp: {uunet|pyramid|rutgers}!cbmvax!grr
but no way officially representing	arpa: cbmvax!grr@uunet.uu.net
Commodore, Engineering Department	fone: 215-431-9255 (only by moonlite)

abstine@sun.soe.clarkson.edu (Arthur Stine) (04/24/89)

>In article <278@kubix.UUCP> mvw@kubix.UUCP (Maarten van Wijk) writes:
>Yes! We've been having a problem with NFS that appears to be caused by
>PCs on our network (using Sun's PC-NFS). Every once in a while the
>nfs daemons on one of our microvaxes (Ultrix 2.2 or 2.3) will just go
>bananas and eat up most of the CPU. We run 8 nfs daemons...

>If anyone has any ideas, I'd certainly like to hear them!! DEC support
>is clueless so far.

>We have the same sort of problem .. 
> ....
> DEC support is also clueless.
> 

Well, I'm not sure if it's a DEC-specific problem. Our Sun servers exhibit the
same behaviour. Last night one of them went up to a load average of 51(!) before
I shut it down. DEC is probably using basically the same NFS code, so I suspect
that there are just some latent bugs which are brought out by the presence of
things like PC-NFS on the net.

art stine
sr network engineer
clarkson u

jim@maxwell.cs.strath.ac.uk (Jim Reid) (04/25/89)

In article <6677@cbmvax.UUCP> grr@cbmvax.UUCP (George Robbins) writes:
>Well, I was about to suggest that you blast the offending daemon with
>a quit signal and see if you could get a useful "core" file, which
>would at least give some clue as to what the daemon thought it was up
>to.  Unfortunately, the object is stripped, which makes debugging all
>but impossible. 

This would not be much help, even if the daemons were unstripped. Both
nfsd and biod are trivial programs - they execute only kernel code apart
from a small amount of initialisation at start up. All nfsd and biod do
is detach from their control tty and then invoke a system call that
NEVER returns. This puts the process into kernel mode, where it runs the
kernel's NFS client or server code.

Getting a core dump is not much use unless you know how to get hold of
the kernel stack inside the dump's u. area and then map that with the
kernel's symbol table [with the kernel sources by your side for
reference.... :-)]. You could get the daemon's stack backtrace much
more easily using adb on /vmunix and /dev/kmem. Even then, that might be of
little use if the daemon is making repeated function calls and so all
you get is a snapshot of the daemon's kernel activity.

		Jim
ARPA:	jim%cs.strath.ac.uk@ucl-cs.arpa, jim@cs.strath.ac.uk
UUCP:	jim@strath-cs.uucp, ...!uunet!mcvax!ukc!strath-cs!jim
JANET:	jim@uk.ac.strath.cs

"!rof si ver tahw s'taht oS"

eric@pprg.unm.edu (Eric Engquist [CoE]) (04/26/89)

We were seeing a similar problem.  About once a week our VAX would go
crazy with NFS load.  The culprit turned out to be find commands run on
machines that NFS-mounted partitions from the VAX.  Each client would
start its find from, say, /usr.  If all the machines NFS-mount something
such as /usr/man, the finds from the remote machines all bang on the
Ultrix /usr/man partition.  Hence if you run finds, run them carefully.
If you are on a Sun, use the -prune and -fstype options.
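
A sketch of what such a careful find might look like.  The helper name
local_find, the starting directory and the file name are placeholders,
and option spelling varies between find implementations, so check the
local man page first; "-fstype nfs -prune" is the form accepted by
SunOS find and by GNU find.

```shell
#!/bin/sh
# Search local disks only: do not descend into NFS-mounted directories.
# usage: local_find START_DIR NAME_PATTERN   (hypothetical helper name)
local_find() {
    # Anything that lives on an NFS filesystem is pruned; everything
    # else is matched against the name pattern and printed.
    find "$1" -fstype nfs -prune -o -name "$2" -print
}

# Example (commented out so that sourcing this file has no side effects):
# look for core files under /usr without walking NFS mounts such as a
# shared /usr/man.
# local_find /usr core
```

Run this way, each client searches only its own disks, so a nightly find
on many machines no longer lands on the server's exported partition at
the same time.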

						-Eric Engquist
						UNM College of Engr.
						eric@sybil.unm.edu

guy@auspex.auspex.com (Guy Harris) (04/26/89)

>Well, I was about to suggest that you blast the offending daemon with
>a quit signal and see if you could get a useful "core" file,

Unless DEC has put the NFS server into user mode, you won't get a useful
"core" file; NFS daemons in UNIX systems tend to run in the kernel, and
have no user-mode data or stack segments, and thus you won't get a very
interesting "core" file.

parker@waters.mpr.ca (Ross Parker) (05/03/89)

Oops!!!

I was the original poster of the 'nfs daemon blocks system' discussion, and
we lost our news link just as I started seeing some replies. It's back up
now, and I would like to throw myself on the mercy of the net and ask that
either people re-post any helpful replies, or (better still) if some kind
soul could e-mail me the relevant replies, I'd be eternally grateful!!!


Thanks!

Ross Parker      uunet!ubc-cs!mpre!parker       |
Microtel Pacific Research Ltd.			| You can't erase the dream,
Burnaby, B.C.,					| you can only wake me up...
Canada, eh?					|