[comp.sys.sun] mounted machine down => df hangs

kae@ihlpm.att.com (Kenneth A Edwards) (11/07/89)

In article <2652@brazos.Rice.edu> rush@xanadu.llnl.gov (Alan Edwards) writes:
>X-Sun-Spots-Digest: Volume 8, Issue 180, message 14 of 15
>
>When one of our disk servers goes down, doing a 'df' on a machine that has
>one of the disk server's partitions mounted causes the 'df' process
>to hang PERMANENTLY.  The df process cannot be killed by kill -9.  Is
>there anything I can do to prevent this?  The machine that hangs is
>running SunOS 3.5.  Will this be fixed when we upgrade to 4.0.3?
>
>Thanks,
>-Alan

This is not likely to be fixed (and isn't fixed) in 4.0.3, since the
problem is inherent in the definition of how "hard" (the default mount)
NFS works.  There are a couple of things you can do:

1) Mount the filesystems as soft and adjust the retry and timeout values.
This is not recommended by most people, since writes can fail, as opposed
to "hard" mounts, where writes hang until acknowledged (as you have
noticed).  There is also a mount option called "intr".  It is supposed to
allow keyboard interrupts, but I don't know if that works only on mount or
on all NFS operations.

2) Automount the filesystems.  You might have a better chance that they
are not mounted when the remote system goes down.
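
Ed's Note: for option 1, it can help to estimate how long a soft mount
might block before the call fails.  The sketch below is only a rough model
and not a description of any particular SunOS release.  It assumes the
timeout value is given in tenths of a second, that the wait doubles after
each retry up to a cap, and that the request is sent once plus "retrans"
more times.  The function name, the 60-second cap, and the sample values 7
and 3 are illustrative assumptions; check the mount manual page on your
own system.

```python
def worst_case_wait(timeo_tenths, retrans, cap_seconds=60.0):
    """Estimate the seconds a soft mount may block before returning an error.

    Assumptions (not taken from the thread): the first wait is
    timeo_tenths / 10 seconds, every retry doubles the wait up to
    cap_seconds, and the request is sent 1 + retrans times in total.
    """
    wait = timeo_tenths / 10.0
    total = 0.0
    for _ in range(1 + retrans):
        total += min(wait, cap_seconds)
        wait *= 2
    return total


# Sample inputs only: 0.7 + 1.4 + 2.8 + 5.6 seconds.
print(round(worst_case_wait(7, 3), 1))  # 10.5
```

Under this model, raising either value lengthens the hang before the
error; lowering them makes failed writes more likely on a slow server.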

Good Luck ALAN!  From the "OTHER ALAN EDWARDS"!

 -Alan Edwards
  IX 1T-448 x34115
  (ihlpm!kae)

greg@sj.ate.slb.com (Greg Wageman) (11/07/89)

In article <2652@brazos.Rice.edu> rush@xanadu.llnl.gov (Alan Edwards) writes:
>When one of our disk servers goes down, doing a 'df' on a machine that has
>one of the disk server's partitions mounted causes the 'df' process
>to hang PERMANENTLY.  The df process cannot be killed by kill -9.  Is
>there anything I can do to prevent this?  The machine that hangs is
>running SunOS 3.5.  Will this be fixed when we upgrade to 4.0.3?

This isn't a bug, it's a feature.

A process that tries to access a hard-mounted filesystem on a down server
will block in the kernel NFS code.  There it will remain until the NFS
code can complete the access.  Once the server comes up, the operation
completes without any indication of error to the process in question;
this is the purpose of a hard mount.

You cannot "kill" such a process, as it is not running.  The signal is
queued for the process and won't be delivered until it unblocks, which
means when the server comes back up.  At that point it resumes running
anyway, and the operation completes normally.

On the other hand, a soft-mounted filesystem will only block the process
until the timeout and retry counts are exhausted, at which point you'll
get an error on the console and the file access will return an error to
the program.

Since the purpose of NFS is to make remote filesystems appear
indistinguishable from local ones, and since a down server is not
considered a "normal condition", "df" is doing exactly what one would
expect.  Sorry.

Copyright 1989 Greg Wageman	DOMAIN: greg@sj.ate.slb.com
Schlumberger Technologies	UUCP:   {uunet,decwrl,amdahl}!sjsca4!greg
San Jose, CA 95110-1397		BIX: gwage  CIS: 74016,352  GEnie: G.WAGEMAN
        Permission granted for not-for-profit reproduction only.

richard%aiai.edinburgh.ac.uk@nsfnet-relay.ac.uk (Richard Tobin) (11/09/89)

Here's a replacement for df that shouldn't hang if the remote file server
is down (I have seen it hang when the remote machine was in some bizarre
semi-down state).  It's almost completely df compatible; the only
difference I know of is that when a file system is more than 100% full it
prints a negative number of available kbytes rather than zero.

-- Richard

Richard Tobin,                     JANET: R.Tobin@uk.ac.ed             
AI Applications Institute,         ARPA:  R.Tobin%uk.ac.ed@nsfnet-relay.ac.uk
Edinburgh University.              UUCP:  ...!ukc!ed.ac.uk!R.Tobin

Ed's Note:

FTP	HostName : titan.rice.edu (128.42.1.30)
	Directory: sun-source
	FileName : repdf.c

Archive-Server Address: archive-server@rice.edu
Archive-Server Command: get sun-source repdf.c
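
Ed's Note: the thread does not say how repdf.c avoids hanging, so the
following is not a description of it.  It is a minimal sketch of one
common approach: do the filesystem query in a helper process and stop
waiting for it after a deadline.  The names probe and HELPER and the
5-second default are hypothetical.

```python
import json
import subprocess
import sys

# Runs in a separate interpreter, so a hung statvfs() blocks only the helper.
HELPER = (
    "import json, os, sys\n"
    "s = os.statvfs(sys.argv[1])\n"
    "print(json.dumps([s.f_blocks * s.f_frsize, s.f_bavail * s.f_frsize]))\n"
)


def probe(mount_point, timeout=5.0, helper=HELPER):
    """Return (total_bytes, avail_bytes), or None if the helper fails or stalls."""
    child = subprocess.Popen(
        [sys.executable, "-c", helper, mount_point],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
    )
    try:
        out, _ = child.communicate(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Request a kill but do NOT wait for the helper: a process blocked
        # on a dead hard mount may not exit until the server comes back.
        child.kill()
        return None
    if child.returncode != 0:
        return None  # e.g. the path does not exist
    total, avail = json.loads(out)
    return total, avail


print(probe("/"))
```

The caller gets None for an unresponsive mount and can move on to the
next one.  The stuck helper itself is not cured by this; it simply stops
holding up the program that asked.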

jsulliva@killington.prime.com (Jeff Sullivan) (11/13/89)

>When one of our disk servers goes down, doing a 'df' on a machine that has
>one of the disk server's partitions mounted causes the 'df' process
>to hang PERMANENTLY.

I think that the reason WHY this occurs has been well covered in recent
responses to this inquiry. However, no one (yet) has offered a fix or
workaround. Since this has happened to me several times in the past, I'll
offer this workaround:

alias df "(echo ' '; /bin/df \!*) &"

That puts df in the background, just in case it doesn't return for a while.

Jeff Sullivan           Computervision Division
CADDS R&D               Prime Computer, Inc.
                        Bedford, MA     01730

UUCP    : {decvax|linus|sun}!cvbnet!jsulliva
Internet: jsulliva@cvbnet.prime.com

gaynor@busboys.rutgers.edu (Silver) (11/17/89)

> alias df "(echo ' '; /bin/df \!*) &"

What follows is a better solution.  Without further ado, [Ag]

            alias df '/bin/sh -c "/bin/df \!* & wait"'

mikel@decwrl.dec.com (Mikel Lechner) (12/03/89)

kae@ihlpm.att.com (Kenneth A Edwards) writes:

>In article <2652@brazos.Rice.edu> rush@xanadu.llnl.gov (Alan Edwards) writes:
>>
>>When one of our disk servers goes down, doing a 'df' on a machine that has
>>one of the disk server's partitions mounted causes the 'df' process
>>to hang PERMANENTLY.  The df process cannot be killed by kill -9.  Is

>This is not likely to be fixed (and isn't fixed) in 4.0.3, since the
>problem is inherent in the definition of how "hard" (the default mount)
>NFS works.  There are a couple of things you can do:

Actually, release 4.0 introduced a new bug with NFS-mounted filesystems.
Processes that are not accessing the downed system also hang waiting on
the dead system.  Under 3.5 and previous releases it was possible to work
around this problem by mounting all NFS filesystems in a directory under
root, with a separate directory for each server.  For example,
"/hosts/sun1/disk1" would be a mount point for an NFS filesystem under
this scheme.  Then with the library call "getwd()", if your process is in
directory "/hosts/sun2/disk1", your process can safely step up the
directory tree and never touch a dead mount point such as
"/hosts/sun1/disk1".  This worked just fine for us until release 4.0.

With SunOS 4.0 and later releases, Sun introduced a "performance
improvement" to the "getwd()" library call.  The library function ends up
"stat()"ing virtually all your mounted filesystems every time your program
tries to compute its working directory.  This is nearly guaranteed to hang
any process making a "getwd()" call while a hard-mounted NFS server is
down.
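
Ed's Note: the difference between the two behaviors can be illustrated
with a sketch of the older "climb the tree" approach.  This is not the
SunOS 3.5 source, which the thread does not show; slow_getwd is a
hypothetical name, and the sketch simplifies by calling lstat() on every
entry of each ancestor directory.  The point is that it only looks inside
the ancestors of the current directory, so under the "/hosts/<server>/"
layout it has no reason to touch a mount point belonging to another
server.

```python
import os


def slow_getwd(start="."):
    """Rebuild an absolute path for `start` by climbing through '..'.

    Only entries of the ancestor directories are examined, so mount
    points elsewhere in the tree are never touched.
    """
    names = []
    path = start
    here = os.stat(path)
    while True:
        parent = os.path.join(path, "..")
        up = os.stat(parent)
        if (up.st_dev, up.st_ino) == (here.st_dev, here.st_ino):
            break  # '..' is the same directory, so this is the root
        for name in os.listdir(parent):
            try:
                st = os.lstat(os.path.join(parent, name))
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if (st.st_dev, st.st_ino) == (here.st_dev, here.st_ino):
                names.append(name)
                break
        else:
            raise OSError("cannot find %r in its parent" % path)
        path, here = parent, up
    return "/" + "/".join(reversed(names))


print(slow_getwd())
```

A version that instead walks the whole mount table has to stat every
mounted filesystem, including those on a dead server, which matches the
hang described above.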

IMHO a process that is not accessing data on a dead NFS server should not
hang waiting on that server, but it does.  Can we have the slow "getwd()"
call back? :^(.

Mikel Lechner			UUCP:  mikel@teraida.UUCP
Teradyne EDA, Inc.

per@erix.ericsson.se (Per Hedeland) (12/11/89)

In article <2832@brazos.Rice.edu>, greg@sj.ate.slb.com (Greg Wageman) writes:
>This isn't a bug, it's a feature.
[lots of explanation deleted...]

Well, I'm sure most of us by now have a rather fixed opinion on the truth
of that statement, and I don't really intend to comment on it.  However,
something occurred to me while discussing our latest manifestation of that
"feature" (mailtool hanging an entire SunView session because the "mail
server" had become inaccessible :-( ).  Sorry if this has already been
discussed; if so I must have missed it:

As I see it, the purpose of hard mounts is to let programs that don't
handle errors on read/write/etc gracefully (which may in some cases be
hard to do), work "correctly" all the same. On the other hand, most of the
*problems* caused by hard mounts (such as these two examples) seem to be
at *initial* file access, i.e. open/stat/etc. (Or, put another way: Most
programs don't have files open for long periods of time.) Also, most
programs already handle errors at that point, e.g. they are prepared for
an expected file to be missing, inaccessible, etc.

Now, I know next to nothing about how the NFS protocol works, and thus how
feasible this would be, but my conclusion from the above is that it would
be *very* useful to have a soft_open_stat_etc_but_hard_read_write_etc
option when mounting NFS file systems. Anyone got any comments on this?
(Besides "Oh no, not *another* option!":-).
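
Ed's Note: no such mount option is described in this thread, but the idea
can be approximated at user level.  The sketch below is an illustration
under stated assumptions, not an NFS feature: soft_open is a hypothetical
helper that gives up on the initial open after a deadline, while reads
and writes on the returned file block as they normally would.

```python
import threading


def soft_open(path, mode="rb", timeout=5.0):
    """Open `path`, giving up after `timeout` seconds; later I/O blocks as usual."""
    result = {}

    def worker():
        try:
            result["file"] = open(path, mode)
        except OSError as exc:
            result["error"] = exc

    # A daemon thread stuck in open() does not keep the interpreter alive.
    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)
    if t.is_alive():
        # The worker stays blocked; if its open() ever completes, that
        # file object is simply dropped.
        raise TimeoutError("no answer opening %s within %.1f s" % (path, timeout))
    if "error" in result:
        raise result["error"]
    return result["file"]
```

A program would call soft_open() where it already expects a file to be
missing or inaccessible, and treat TimeoutError like any other failure to
open.  The abandoned thread is the cost of doing this outside the kernel.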

Regards
--Per Hedeland
per@erix.ericsson.se  or
per%erix.ericsson.se@uunet.uu.net  or
...uunet!erix.ericsson.se!per