[comp.sys.sun] NFS daemons freezing

scl@sasha.acc.Virginia.EDU (Steve Losen) (02/23/90)
We are having problems with our nfsd daemons freezing on a Sun 3/160 and a
Sun 3/260.

Symptoms:

The server completely stops servicing NFS requests.  A ps shows all 8 nfsd
daemons in the dreaded "D" (disk wait) state.  They never leave this
state, so they apparently never become runnable again.  A reboot clears up
the problem.  Other network services function normally.  I can rlogin to
the server and all appears normal except for the frozen nfsds.  This
sometimes happens more than once a day, and sometimes we are fine for
several days in a row.
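
For anyone who wants to watch for this condition automatically, here is a
minimal sketch that scans ps output for nfsd processes stuck in the "D"
state.  It is not something we run; it assumes BSD-style ps output whose
header names PID and STAT columns, with the command as the last column
and fields separated by whitespace.  The function name frozen_nfsds is
made up for illustration.

```python
import os


def frozen_nfsds(ps_text):
    """Return (frozen_pids, total_nfsds) from BSD-style ps output.

    Assumption: the first non-blank line is a header containing PID and
    STAT, and the command occupies the last column of every line.
    """
    lines = [ln for ln in ps_text.splitlines() if ln.strip()]
    if not lines:
        return [], 0

    header = lines[0].split()
    pid_col = header.index("PID")
    stat_col = header.index("STAT")
    cmd_col = len(header) - 1  # command is assumed to be the last column

    frozen, total = [], 0
    for ln in lines[1:]:
        fields = ln.split(None, cmd_col)
        if len(fields) <= cmd_col:
            continue  # short or malformed line, skip it
        # ps may show a swapped-out command in parentheses, e.g. "(nfsd)"
        name = os.path.basename(fields[cmd_col].split()[0].strip("()"))
        if name != "nfsd":
            continue
        total += 1
        if fields[stat_col].startswith("D"):
            frozen.append(int(fields[pid_col]))
    return frozen, total
```

Feed it the text of a ps listing taken every minute or so.  A single
sample with every nfsd in "D" can be ordinary disk activity; the same
PIDs still in "D" across several consecutive samples is the symptom
described above.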

Here is our current setup:

Two interdependent clusters of diskless Sun 3/50s and color 3/60s.  The
disk servers are a Sun 3/160 and a Sun 3/260.  Each cluster has 9 diskless
clients.  The 3/50s and 3/60s are evenly distributed over the two
clusters.  All systems have 8 MB of main memory and run SunOS 4.0.3.  Both
servers are gateways between their small LAN of clients and our main
backbone Ethernet.  We are using subnetting and the nameserver (resolver)
shared library.  We are running one server as a YP master and the other as
a YP slave.  All diskless Suns are YP clients.  Both clusters sit in the
same room, so user accounts from both servers are mounted on both
clusters.  We do not run the automounter.  We have a lot of accounts --
over 350.  Disk quotas are enabled on both servers in the partitions that
hold user accounts.

We suspect, but have not proven, that this problem is related to disk
quotas.  One user claims that he hit his disk quota while running a job
and that the whole Sun lab immediately froze up.  He claims that this was
repeatable, although I didn't witness it firsthand.  The user was on a
diskless client at the time.  I have since deliberately exceeded my quota
several times from diskless clients without triggering the problem.  We
turned quotas off for a while and I don't think we had this nfsd problem
during that time, but I can't swear to this.  We have other Sun-3 clusters
that are very similar (except they do not run quotas), and we have never
seen this problem on them.
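
In case someone wants to try reproducing the quota case in a controlled
way, here is a rough sketch of a test that appends to a file until a
write fails.  It is only an illustration, not the user's job and not what
I ran by hand.  The name fill_until_quota, the block size, and the byte
cap are all assumptions chosen for the example; the cap is there so the
test stops on its own if no quota applies.

```python
import os


def fill_until_quota(path, block_size=8192, max_bytes=64 * 1024 * 1024):
    """Append to path until a write fails or max_bytes is reached.

    Returns (bytes_written, err), where err is the errno of the failing
    call, or None if the cap was reached without an error.
    """
    block = b"\0" * block_size
    written = 0
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        while written < max_bytes:
            chunk = block[: max_bytes - written]
            try:
                n = os.write(fd, chunk)
                # Force the data out now, so that an error from the file
                # server is reported here rather than at some later point.
                os.fsync(fd)
            except OSError as exc:
                return written, exc.errno
            written += n
    finally:
        os.close(fd)
    return written, None
```

Run it from a diskless client against a file in a quota-controlled home
directory, then compare the returned err with errno.EDQUOT to confirm
that the quota, rather than a full disk, stopped the writes.  Remove the
file afterwards.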

Are there any NFS gurus out there who can help us out?  Is there indeed
a bug in NFS that is triggered by quotas?

Steve Losen
scl@virginia.edu     University of Virginia Academic Computing Center