[comp.sys.sun] Daemons stuck in 'D' "short-term" wait state

dwight@uunet.uu.net (Dwight Ernest) (02/19/89)

In v7n148, Rob McMahon <cudcv%warwick.ac.uk@nss.cs.ucl.ac.uk> writes:

> ... a process which is stuck, reported by ps as in `D' "short
> term" wait, and has been for two days.
>                           ... Last time it happened, earlier
> in the week, all the nfsd's got stuck in this state,
> and the clients froze.

Same thing happened to us once, yesterday, (for the first time) while
running tops2d, the TOPS version 2 Sun daemon. Not killable.  Had to
reboot. Pain in the ***. All of our (Mac) clients froze too.

Do we have a generic problem? Details: SunOS 3.5, 3/60 with 8 MB, and an
excerpt from 'what tops2d' reveals:

	TOPS Version 2.0 11/12/87
	tops.c  1.23    11/12/87
etc...
		--Dwight Ernest
		  Technical Systems Coordinator
		  The Independent (Newspaper Publishing PLC)
		  40 City Road, London EC1Y 2DB
			  Phone: +44 1 956 1633
			  ...ukc!independent!dwight

sch@eeserv.ee.umanitoba.ca (R+C Schneider) (03/02/89)

In v7n161 Dwight Ernest mentions programs getting stuck in a 'D' wait.  We
had a similar problem here, and it was traced to the lock daemon,
rpc.lockd, which tended to dump core or otherwise fail.  There are a
number of known bugs (a patch is available) with the rpc.lockd under 3.5,
as well as 4.0.  We got the patch tapes and have experienced no problems
since.

Roland Schneider                       <sch@eeserv.ee.umanitoba.ca>
University of Manitoba

mjk@fluffy.rice.edu (Mark J. Kilgard) (03/02/89)

We recently experienced a problem similar to the one Rob McMahon
<cudcv%warwick.ac.uk@nss.cs.ucl.ac.uk> (v7n148) and more recently Dwight
Ernest <mcvax!independent!dwight@uunet.uu.net> (v7n161) experienced.

We started getting NFS file server not responding errors from one of our
file servers.  When I logged into the file server and did a ps, it showed
all the nfsd's in the 'D' "short-term" wait state.
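
[Illustrative sketch, not part of the original post.  The check described
above (run ps, look for processes in the 'D' state) could be automated
roughly as follows.  The three-column PID/STAT/COMMAND layout and the
sample text are assumptions made for the example; the ps on SunOS 3.5
prints more columns, so the field positions would need adjusting.]

```python
def find_d_state(ps_output):
    """Return (pid, command) pairs for processes whose state starts with 'D'.

    Assumes a header row followed by rows of the form: PID STAT COMMAND.
    """
    stuck = []
    lines = ps_output.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        fields = line.split(None, 2)
        if len(fields) < 3:
            continue  # ignore malformed or truncated rows
        pid, stat, command = fields
        if stat.startswith("D"):
            stuck.append((int(pid), command.strip()))
    return stuck


# Hypothetical sample text, not captured from a real system.
SAMPLE = """\
  PID STAT COMMAND
   85 D    nfsd
   86 D    nfsd
   90 S    inetd
  112 R    ps
"""

if __name__ == "__main__":
    for pid, command in find_d_state(SAMPLE):
        print(pid, command)
```

[A process that is merely doing disk I/O also shows 'D' briefly, so a real
check would presumably sample more than once before raising an alarm.]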

Logically the file server worked fine but all its clients hung when they
tried to access it over NFS.  It was impossible to kill the nfsd's and
attempts to start a new set failed.  We were running 8 by the way but I
don't think that is important.

We rebooted the machine with 'reboot' and the fsck failed.  We did a
manual fsck and 'fix'ed 5-10 inconsistencies.  We changed the
configuration to bring the machine up with only 4 nfsd's and brought the
machine up multi-user.

Everything was fine till about 27 hours later when those 4 nfsd's were
found again in the D state.  The machine was 'reboot'ed and again the fsck
failed.

With some inspection by John Deuel <kink@rice.edu> an anomaly was found in
the /barn/lost+found directory where the fsck had complained.  Link
counts were all messed up and it appeared that there were two copies of
the lost+found inode???

John clri'ed the lost+found inode and did an fsck to fix the resulting
mess.  The machine was rebooted and has been running for 36 hours now.  It
is running with 8 nfsd's presently.  There don't seem to be any more
problems.

It seems reasonable to think that the nfsd's might get confused by
anomalies in the file system and hang in a D state.  Or could it be that
the nfsd's got screwed up and possibly created the anomaly?  I can't
explain the cause of the initial fsck problems - the system had been
running for nearly a week without down time before the occurrence.

It seemed that the first fsck didn't fix the anomaly (or maybe it just
reappeared?).  Perhaps a small glitch in fsck?

Have people had similar experiences?  If so, what did you guess the cause
to be?  Were there fsck problems before?

- Mark

brent@uunet.uu.net (Brent Chapman) (03/02/89)

There has been a fair amount of discussion of this on the Sun-Nets mailing
list lately (questions about Sun-Nets go to sun-nets-request@brillig.umd.edu).
This is not a problem with TOPS, but with the server-side NFS.

Apparently there is some subtle filesystem inconsistency in an inode which
can cause an NFS daemon to deadlock when trying to access that inode.  The
NFS client who originally issued the request never gets a response, so it
issues another request, which is caught by a different NFS server daemon,
which then goes and gets itself deadlocked, and so on, until all your NFS
daemons are hung.  There doesn't appear to be any way to unhang them or to
kill them; the only solution anyone has found is to reboot the server
(ugh...).
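
[Illustrative sketch, not part of the original post.  The cascade described
above can be modelled as a toy simulation: one request touches the bad
inode, the daemon that picks it up never returns, the client times out and
retransmits, and the next idle daemon is consumed in turn.  The function
name and parameters are hypothetical, and the model assumes every
retransmission goes to an idle daemon; it does not reflect the actual nfsd
implementation.]

```python
def simulate_cascade(num_daemons, max_requests):
    """Model retransmissions of one bad request draining a daemon pool.

    Returns (hung_daemons, requests_sent).
    """
    idle = list(range(num_daemons))
    hung = []
    requests_sent = 0
    while idle and requests_sent < max_requests:
        daemon = idle.pop(0)  # the next idle daemon picks up the request
        hung.append(daemon)   # it blocks on the bad inode and never replies
        requests_sent += 1    # the client times out and sends again
    return len(hung), requests_sent


if __name__ == "__main__":
    print(simulate_cascade(8, 100))  # -> (8, 8)
    print(simulate_cascade(4, 2))    # -> (2, 2)
```

[In this model the pool size only changes how many retransmissions it takes
to hang every daemon, not whether the server ends up wedged.]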

This little nasty bites me (I'm running 3.5) once every few months; it
hits others more often (some folks with multiple servers and lots of disk
activity were complaining of this happening weekly or even daily).

From the accounts I've seen, I suspect it's somehow tied to high disk
load; the few times I've seen it, it's always happened in the middle of a
lot of bashing on the disk.  Others have reported running into it after
accidentally starting a 'find' on an NFS partition from 30 clients at the
same time, and while running with quotas enabled (which apparently
increases disk activity). 

Someone (I forget who, and I've already deleted the message) said they'd
checked the Sun Online Bugs Database, but didn't find anything relevant.

-Brent
--
Brent Chapman					Capital Market Technology, Inc.
Computer Operations Manager			1995 University Ave., Suite 390
brent@capmkt.com				Berkeley, CA  94704
{cogsci,lll-tis,uunet}!capmkt!brent		Phone:  415/540-6400

jeff@tc.fluke.com (Jeff Stearns) (03/04/89)

mcvax!independent!dwight@uunet.uu.net (Dwight Ernest) writes:
>> ... a process which is stuck, reported by ps as in `D' "short term" wait,
>> and has been for two days.  ... Last time it happened, earlier in the week,
>> all the nfsd's got stuck in this state, and the clients froze.
>
>Same thing happened to us once, yesterday, (for the first time) while
>running tops2d, the TOPS version 2 Sun daemon....
>Do we have a generic problem?

On my "generic" 3/60, I've wedged a variety of processes from cp(1) on up.
(Around here, FrameMaker seems to do it a lot.)  The stuck-in-D-wait bug
is much more prominent under SunOS 4.0.  And it isn't limited to
"daemons".

    Jeff Stearns        John Fluke Mfg. Co, Inc.               (206) 356-5064
    jeff@tc.fluke.COM   {uw-beaver,microsoft,sun}!fluke!jeff

PS - Calling all users of the Vitalink TransLAN IV Ethernet bridge! Please
     drop me a line.