scl@sasha.acc.Virginia.EDU (Steve Losen) (02/23/90)
We are having problems with our nfsd daemons freezing on a Sun 3/160 and a Sun 3/260.

Symptoms: The server completely stops servicing NFS requests. A ps shows all 8 nfsd daemons in the dreaded "D" (disk wait) state. They never leave this state, so they apparently never become runnable again. A reboot clears up the problem. Other network services function normally: I can rlogin to the server and all appears normal except for the frozen nfsds. This sometimes happens more than once a day, and sometimes we are fine for several days in a row.

Here is our current setup: Two interdependent clusters of diskless Sun 3/50s and color 3/60s. The disk servers are a Sun 3/160 and a Sun 3/260. Each cluster has 9 diskless clients. The 3/50s and 3/60s are evenly distributed over the two clusters. All systems have 8 MB of main memory and run 4.0.3. Both servers are gateways between their small LAN of clients and our main backbone ethernet. We are using subnetting and the nameserver (resolver) shared library. We are running one server as a yp master and the other as a yp slave. All diskless Suns are yp clients. Both clusters sit in the same room, so user accounts from both servers are mounted on both clusters. We do not run the automounter. We have a lot of accounts -- over 350. Disk quotas are enabled on both servers in the partitions that hold user accounts.

We suspect, but have not proven, that this problem is related to disk quotas. One user claims to have hit his disk quota while running a job, and the whole Sun lab immediately froze up. He claims that this was repeatable, although I didn't witness it firsthand. The user was on a diskless client at the time. I have since deliberately exceeded my quota several times from diskless clients without triggering the problem. We turned quotas off for a while and I don't think we had this nfsd problem during that time, but I can't swear to this.
We have other sun3 clusters that are very similar (except they do not run quotas) and have never seen this problem.

Are there any NFS gurus out there who can help us out? Is there indeed a bug in NFS that is triggered by quotas?

Steve Losen   scl@virginia.edu
University of Virginia
Academic Computing Center
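
The symptom described in this post (every nfsd sitting in the "D" state) is something a small script can watch for, which might help pin down exactly when a freeze begins and whether it lines up with a quota event. What follows is only a rough sketch, not anything taken from the original setup: it assumes a BSD-style `ps ax` listing with PID, STAT and COMMAND columns in the header, the function name `stuck_nfsds` is made up for illustration, and the sample listing is invented rather than captured from a real server.

```python
def stuck_nfsds(ps_output):
    """Return the PIDs of nfsd processes whose state begins with 'D'.

    Assumes a BSD-style listing whose header line names the PID, STAT
    and COMMAND columns; other ps variants would need different parsing.
    """
    lines = ps_output.strip().splitlines()
    if not lines:
        return []

    # Locate the columns from the header instead of hard-coding positions.
    header = lines[0].split()
    pid_col = header.index("PID")
    stat_col = header.index("STAT")
    cmd_col = header.index("COMMAND")

    stuck = []
    for line in lines[1:]:
        # COMMAND is the last column and may contain spaces, so limit splits.
        fields = line.split(None, cmd_col)
        if len(fields) <= cmd_col:
            continue  # malformed or truncated line
        # Take the program name: drop arguments, any path, and parentheses.
        command = fields[cmd_col].split()[0]
        name = command.rsplit("/", 1)[-1].strip("()")
        if name == "nfsd" and fields[stat_col].startswith("D"):
            stuck.append(int(fields[pid_col]))
    return stuck


# Invented sample listing, for illustration only.
SAMPLE = """\
  PID TT STAT  TIME COMMAND
   95 ?  D     0:12 (nfsd)
   96 ?  D     0:11 (nfsd)
  101 ?  S     0:03 inetd
  120 ?  I     0:00 nfsd 8
"""

if __name__ == "__main__":
    print(stuck_nfsds(SAMPLE))
```

To use it against a live machine, the text of `ps ax` could be fed in via `subprocess.run(["ps", "ax"], capture_output=True, text=True).stdout`. A single sample in "D" is normal for a busy server, so a check like this is only meaningful if the same PIDs stay in that state across several samples taken some time apart.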