dwight@uunet.uu.net (Dwight Ernest) (02/19/89)
In v7n148, Rob McMahon <cudcv%warwick.ac.uk@nss.cs.ucl.ac.uk> writes:
> ... a process which is stuck, reported by ps as in `D' "short
> term" wait, and has been for two days.
> ... Last time it happened, earlier
> in the week, all the nfsd's got stuck in this state,
> and the clients froze.

The same thing happened to us once, yesterday (for the first time), while
running tops2d, the TOPS version 2 Sun daemon. It was not killable, and we
had to reboot. Pain in the ***. All of our (Mac) clients froze too.

Do we have a generic problem?

Details: SunOS 3.5, 3/60 with 8 MB, and an excerpt from 'what tops2d'
reveals:
    TOPS Version 2.0 11/12/87 tops.c 1.23 11/12/87 etc...

--Dwight Ernest
Technical Systems Coordinator
The Independent (Newspaper Publishing PLC)
40 City Road, London EC1Y 2DB
Phone: +44 1 956 1633
...ukc!independent!dwight
sch@eeserv.ee.umanitoba.ca (R+C Schneider) (03/02/89)
In v7n161 Dwight Ernest mentions programs getting stuck in a 'D' wait.

We had a similar problem here, and it was traced to the lock daemon,
rpc.lockd, which tended to dump core or otherwise fail. rpc.lockd has a
number of known bugs under 3.5 as well as 4.0, and a patch is available.
We got the patch tapes and have experienced no problems since.

Roland Schneider <sch@eeserv.ee.umanitoba.ca>
University of Manitoba
mjk@fluffy.rice.edu (Mark J. Kilgard) (03/02/89)
We recently experienced a problem similar to the one Rob McMahon
<cudcv%warwick.ac.uk@nss.cs.ucl.ac.uk> (v7n148) and more recently Dwight
Ernest <mcvax!independent!dwight@uunet.uu.net> (v7n161) experienced.

We started getting "NFS file server not responding" errors from one of our
file servers. When I logged into the file server and did a ps, it showed
all the nfsd's in the 'D' "short-term" wait state. Logically the file
server worked fine, but all its clients hung when they tried to access it
over NFS. It was impossible to kill the nfsd's, and attempts to start a new
set failed. We were running 8 nfsd's, by the way, but I don't think that is
important.

We rebooted the machine with 'reboot' and the fsck failed. We did a manual
fsck and 'fix'ed 5-10 inconsistencies. We changed the configuration to
bring the machine up with only 4 nfsd's and brought the machine up
multi-user. Everything was fine until about 27 hours later, when those 4
nfsd's were again found in the D state.

The machine was 'reboot'ed and again the fsck failed. With some inspection
by John Deuel <kink@rice.edu>, an anomaly was found in the /barn/lost+found
directory where the fsck had complained. Link counts were all messed up and
it appeared that there were two copies of the lost+found inode??? John
clri'ed the lost+found inode and did an fsck to fix the resulting mess.

The machine was rebooted and has been running for 36 hours now. It is
presently running with 8 nfsd's. There don't seem to be any more problems.

It seems reasonable to think that the nfsd's might get confused by
anomalies in the file system and hang in a D state. Or could it be that the
nfsd's got screwed up and possibly created the anomaly? I can't explain the
cause of the initial fsck problems; the system had been running for nearly
a week without down time before the occurrence. It seemed that the first
fsck didn't fix the anomaly (or maybe it just reappeared?). Perhaps a small
glitch in fsck?

Have people had similar experiences? If so, what did you guess the cause to
be? Were there fsck problems before?

- Mark
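
[Illustration] The check described above (run ps, look for nfsd's in the
'D' state) could be automated roughly as follows. This is only a sketch
under stated assumptions: it assumes BSD-style ps output with PID, STAT and
COMMAND columns, and the SAMPLE text is a made-up listing for illustration,
not output captured from any of the machines discussed here. The function
name stuck_in_disk_wait is hypothetical.

```python
def stuck_in_disk_wait(ps_text, command="nfsd"):
    """Return PIDs of `command` processes whose STAT field starts with 'D'.

    Assumes a header line naming the PID, STAT and COMMAND columns, with
    COMMAND as the last column (an assumption about the ps format).
    """
    lines = ps_text.strip().splitlines()
    header = lines[0].split()
    pid_col = header.index("PID")
    stat_col = header.index("STAT")
    cmd_col = header.index("COMMAND")

    stuck = []
    for line in lines[1:]:
        # Split into at most cmd_col + 1 fields so the command keeps its arguments.
        fields = line.split(None, cmd_col)
        if len(fields) <= cmd_col:
            continue  # malformed or truncated line
        # Some ps versions print swapped-out commands in parentheses.
        name = fields[cmd_col].split()[0].strip("()")
        if name == command and fields[stat_col].startswith("D"):
            stuck.append(int(fields[pid_col]))
    return stuck


# Hypothetical listing, for illustration only.
SAMPLE = """\
  PID TT STAT  TIME COMMAND
   71 ?  D     0:41 (nfsd)
   72 ?  D     0:39 (nfsd)
   90 ?  S     0:02 rpc.lockd
  123 co R     0:00 ps ax
"""

if __name__ == "__main__":
    print(stuck_in_disk_wait(SAMPLE))
```

A process that shows up here only briefly is normal; the symptom in this
thread is nfsd's that stay in 'D' for hours, so a real check would need to
compare successive samples.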
brent@uunet.uu.net (Brent Chapman) (03/02/89)
There has been a fair amount of discussion of this on the Sun-Nets mailing
list lately (questions about Sun-Nets go to sun-nets-request@brillig.umd.edu).
This is not a problem with TOPS, but with the server-side NFS.
Apparently there is some subtle filesystem inconsistency in an inode which
can cause an NFS daemon to deadlock when trying to access that inode. The
NFS client who originally issued the request never gets a response, so it
issues another request, which is caught by a different NFS server daemon,
which then goes and gets itself deadlocked, and so on, until all your NFS
daemons are hung. There doesn't appear to be any way to unhang them or to
kill them; the only solution anyone has found is to reboot the server
(ugh...).
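
[Illustration] The cascade described above can be sketched as a toy model.
This is not how nfsd or the kernel is implemented; it only counts daemons.
The assumptions are loud ones: a request for the bad inode hangs whichever
daemon picks it up, the client retransmits a fixed number of times, and
each retransmission lands on a different idle daemon. The function name,
the inode numbers and the retry count are all hypothetical.

```python
def simulate_cascade(num_daemons, requests, bad_inode, retries=16):
    """Toy model of NFS daemons hanging one by one on a bad inode.

    Returns (hung, answered): the number of daemons left hung, and one
    boolean per request saying whether the client ever got a reply.
    """
    idle = num_daemons
    answered = []
    for inode in requests:
        if inode != bad_inode:
            # A healthy request is answered if any daemon is still idle;
            # that daemon then returns to the pool.
            answered.append(idle > 0)
            continue
        # The original send plus each retransmission is picked up by a
        # fresh idle daemon, which deadlocks and never returns.
        for _ in range(1 + retries):
            if idle == 0:
                break
            idle -= 1
        answered.append(False)
    return num_daemons - idle, answered


if __name__ == "__main__":
    # Inode 99 stands in for the inconsistent inode (hypothetical number).
    hung, answered = simulate_cascade(8, [1, 2, 99, 3], bad_inode=99)
    print(hung, answered)
```

In this model the number of daemons only changes how many retransmissions
it takes to hang the server, which is consistent with the report earlier
in the thread that dropping from 8 nfsd's to 4 did not help.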
This little nasty bites me (I'm running 3.5) once every few months; it
hits others more often (some folks with multiple servers and lots of disk
activity were complaining of this happening weekly or even daily).
From the accounts I've seen, I suspect it's somehow tied to high disk
load; the few times I've seen it, it's always happened in the middle of a
lot of bashing on the disk. Others have reported running into it after
accidentally starting a 'find' on an NFS partition from 30 clients at the
same time, and while running with quotas enabled (which apparently
increases disk activity).
Someone (I forget who, and I've already deleted the message) said they'd
checked the Sun Online Bugs Database, but didn't find anything relevant.
-Brent
--
Brent Chapman Capital Market Technology, Inc.
Computer Operations Manager 1995 University Ave., Suite 390
brent@capmkt.com Berkeley, CA 94704
{cogsci,lll-tis,uunet}!capmkt!brent Phone: 415/540-6400
jeff@tc.fluke.com (Jeff Stearns) (03/04/89)
mcvax!independent!dwight@uunet.uu.net (Dwight Ernest) writes:
>> ... a process which is stuck, reported by ps as in `D' "short term" wait,
>> and has been for two days. ... Last time it happened, earlier in the week,
>> all the nfsd's got stuck in this state, and the clients froze.
>
>Same thing happened to us once, yesterday, (for the first time) while
>running tops2d, the TOPS version 2 Sun daemon....
>Do we have a generic problem?

On my "generic" 3/60, I've wedged a variety of processes from cp(1) on up.
(Around here, FrameMaker seems to do it a lot.)

The stuck-in-D-wait bug is much more prominent under SunOS 4.0. And it
isn't limited to "daemons".

Jeff Stearns John Fluke Mfg. Co, Inc. (206) 356-5064
jeff@tc.fluke.COM {uw-beaver,microsoft,sun}!fluke!jeff

PS - Calling all users of the Vitalink TransLAN IV Ethernet bridge! Please
drop me a line.