kae@ihlpm.att.com (Kenneth A Edwards) (11/07/89)
In article <2652@brazos.Rice.edu> rush@xanadu.llnl.gov (Alan Edwards) writes:
>X-Sun-Spots-Digest: Volume 8, Issue 180, message 14 of 15
>
>When one of our disk servers goes down, doing a 'df' on a machine that has
>the one of the disk server's partitions mounted, causes the 'df' process
>to hang PERMANENTLY. The df process cannot be killed by kill -9. Is
>there anything I can do to prevent this? The machine that hangs is
>running SunOS 3.5. Will this be fixed when we upgrade to 4.0.3?
>
>Thanks,
>-Alan

This is not likely to be fixed (and isn't fixed) in 4.0.3, since the problem is inherent in the definition of how "hard" (the default mount) NFS works. There are a couple of things you can do:

1) Mount the filesystems as soft and adjust the retry and timeout values. Most people do not recommend this, since writes can fail, as opposed to "hard" mounts, where writes hang until acknowledged (as you have noticed). There is also a mount option called "intr". It is supposed to allow keyboard interrupts, but I don't know whether that works only on mount or on all NFS operations.

2) Automount the filesystems. You might have a better chance that they are not mounted when the remote system goes down.

Good Luck ALAN! From the "OTHER ALAN EDWARDS"!

-Alan Edwards
IX 1T-448 x34115
(ihlpm!kae)
greg@sj.ate.slb.com (Greg Wageman) (11/07/89)
In article <2652@brazos.Rice.edu> rush@xanadu.llnl.gov (Alan Edwards) writes:
>X-Sun-Spots-Digest: Volume 8, Issue 180, message 14 of 15
>
>When one of our disk servers goes down, doing a 'df' on a machine that has
>the one of the disk server's partitions mounted, causes the 'df' process
>to hang PERMANENTLY. The df process cannot be killed by kill -9. Is
>there anything I can do to prevent this? The machine that hangs is
>running SunOS 3.5. Will this be fixed when we upgrade to 4.0.3?

This isn't a bug, it's a feature. A process that tries to perform an access on a hard-mounted filesystem from a down server will block in the kernel NFS code. There it will remain until the NFS code can complete the access. Once the server comes up, the operation completes without any indication of error to the process in question - this is the reason for a hard mount.

You cannot "kill" such a process, as it is not running. The signal is queued for the process and won't be delivered until it unblocks-- which means when the server comes back up, at which time it will resume running anyway, and the operation will complete normally.

On the other hand, a soft-mounted filesystem will only block the process until the timeout and retry counts are exhausted - at which time you'll get an error to the console and the file access will return an error to the program.

Since the purpose of NFS is to make remote filesystems appear indistinguishable from local ones, and since a down server is not considered a "normal condition", "df" is doing exactly what one would expect. Sorry.

Copyright 1989 Greg Wageman     DOMAIN: greg@sj.ate.slb.com
Schlumberger Technologies       UUCP: {uunet,decwrl,amdahl}!sjsca4!greg
San Jose, CA 95110-1397         BIX: gwage CIS: 74016,352 GEnie: G.WAGEMAN
Permission granted for not-for-profit reproduction only.
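Note: the difference between the two mount types described above can be shown with a toy simulation. This is only a sketch of the retry policy, not kernel code: the function name nfs_call, the ServerDown exception, and the way server responses are passed in as a list are all assumptions made for illustration, and real NFS clients also apply timeouts and backoff between retransmissions.

```python
class ServerDown(Exception):
    """Error a soft mount hands back once its retries are used up."""


def nfs_call(responses, mount="hard", retrans=3):
    """Simulate one NFS request (hypothetical helper, not a real API).

    responses: iterable of booleans, one per transmission attempt;
               True means the server answered that attempt.
    mount:     "hard" keeps retrying until the server answers,
               "soft" gives up after `retrans` unanswered attempts.
    Returns the number of attempts that were needed.
    """
    for attempt, answered in enumerate(responses, start=1):
        if answered:
            # The caller sees a normal completion with no sign of the outage.
            return attempt
        if mount == "soft" and attempt >= retrans:
            # Soft mount: the error is returned to the program.
            raise ServerDown(f"gave up after {attempt} attempts")
    # A real hard mount would still be blocked here; the simulation
    # simply ran out of scripted responses.
    raise RuntimeError("simulation ended while the request was still blocked")


if __name__ == "__main__":
    outage = [False] * 5 + [True]      # server is back on the 6th attempt
    print("hard mount needed", nfs_call(outage, "hard"), "attempts")
    try:
        nfs_call(outage, "soft", retrans=3)
    except ServerDown as err:
        print("soft mount:", err)
```

With the hard policy the caller never learns that anything went wrong, which is the property the post defends; with the soft policy the program has to be ready for the error.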
richard%aiai.edinburgh.ac.uk@nsfnet-relay.ac.uk (Richard Tobin) (11/09/89)
Here's a replacement for df that shouldn't hang if the remote file server is down (I have seen it hang when the remote machine was in some bizarre semi-down state).

It's almost completely df compatible - the only difference I know of is that when a file system is more than 100% full it prints out a negative number of available k rather than zero.

-- Richard

Richard Tobin,                   JANET: R.Tobin@uk.ac.ed
AI Applications Institute,       ARPA: R.Tobin%uk.ac.ed@nsfnet-relay.ac.uk
Edinburgh University.            UUCP: ...!ukc!ed.ac.uk!R.Tobin

Ed's Note:
FTP HostName : titan.rice.edu (128.42.1.30)
Directory: sun-source
FileName : repdf.c
Archive-Server Address: archive-server@rice.edu
Archive-Server Command: get sun-source repdf.c
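Note: the source of repdf.c is not reproduced in this thread, so the following is not its method. It is a minimal sketch of one way a df-like tool can avoid hanging: let a child process make the call that may block, and have the parent stop waiting after a deadline. The function name fs_usage and the 5-second default are assumptions for illustration. The parent never does a blocking wait on a stuck child, since a process blocked in the NFS code cannot be killed until the server returns.

```python
import os
import select
import signal
import struct


def fs_usage(path, timeout=5.0):
    """Return (total_kb, avail_kb) for the filesystem holding `path`,
    or None if no answer arrives within `timeout` seconds."""
    rfd, wfd = os.pipe()
    pid = os.fork()
    if pid == 0:
        # Child: the only process that can block on a dead server.
        os.close(rfd)
        try:
            st = os.statvfs(path)
            os.write(wfd, struct.pack("qq",
                                      st.f_blocks * st.f_frsize // 1024,
                                      st.f_bavail * st.f_frsize // 1024))
        finally:
            os._exit(0)          # exit without running parent cleanup code
    # Parent: wait for the answer, but only up to the deadline.
    os.close(wfd)
    try:
        ready, _, _ = select.select([rfd], [], [], timeout)
        data = os.read(rfd, 16) if ready else b""
    finally:
        os.close(rfd)
    if len(data) == 16:
        os.waitpid(pid, 0)       # child has answered and is exiting
        return struct.unpack("qq", data)
    # Timed out, or the child failed (for example, no such path).
    os.kill(pid, signal.SIGKILL)     # takes effect whenever the child unblocks
    os.waitpid(pid, os.WNOHANG)      # never block on a stuck child
    return None


if __name__ == "__main__":
    for mount_point in ("/", "/no/such/mount"):
        print(mount_point, fs_usage(mount_point, timeout=2.0))
```

A stuck child stays around until its server answers, which is acceptable for a short-lived command but would need more care in a long-running program.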
jsulliva@killington.prime.com (Jeff Sullivan) (11/13/89)
>When one of our disk servers goes down, doing a 'df' on a machine that has
>the one of the disk server's partitions mounted, causes the 'df' process
>to hang PERMANENTLY.

I think that the reason WHY this occurs has been well covered in recent responses to this inquiry. However, no one (yet) has offered a fix or workaround. Since this has happened to me several times in the past, I'll offer this workaround:

	alias df "(echo ' '; /bin/df \!*) &"

That puts df in the background, just in case it doesn't return for a while.

Jeff Sullivan
Computervision Division CADDS R&D
Prime Computer, Inc.
Bedford, MA 01730
UUCP : {decvax|linus|sun}!cvbnet!jsulliva
Internet: jsulliva@cvbnet.prime.com
gaynor@busboys.rutgers.edu (Silver) (11/17/89)
> alias df "(echo ' '; /bin/df \!*) &"
What follows is a better solution. Without further ado, [Ag]
alias df '/bin/sh -c "/bin/df \!* & wait"'
mikel@decwrl.dec.com (Mikel Lechner) (12/03/89)
kae@ihlpm.att.com (Kenneth A Edwards) writes:
>In article <2652@brazos.Rice.edu> rush@xanadu.llnl.gov (Alan Edwards) writes:
>>
>>When one of our disk servers goes down, doing a 'df' on a machine that has
>>the one of the disk server's partitions mounted, causes the 'df' process
>>to hang PERMANENTLY. The df process cannot be killed by kill -9. Is
>This is not likely to be fixed (and isn't fixed) in 4.0.3, since the
>problem is inherent in the definition of how "hard" (the default mount)
>NFS works. There are a couple of things you can do:

Actually, there is a new bug introduced with release 4.0 with NFS-mounted filesystems. Processes that are not accessing the downed system also hang waiting on the dead system.

Under 3.5 and previous releases it was possible to work around this problem by mounting all NFS filesystems in a directory under root, with a separate directory for each server. For example, "/hosts/sun1/disk1" would be a mount point for an NFS filesystem under this scheme. Then with the library call "getwd()", if your process is in directory "/hosts/sun2/disk1", your process can safely step up the directory tree and not touch the dead NFS mount point. This worked just fine for us until release 4.0.

With SunOS 4.0 and later releases, Sun introduced a "performance improvement" to the "getwd()" library call. The library function ends up "stat()"ing virtually all your mounted filesystems every time your program tries to compute its working directory. This is nearly guaranteed to hang any process making a "getwd()" call when a hard-mounted NFS filesystem hangs.

IMHO a process that is not accessing data on a dead NFS server should not hang waiting on that server, but it does. Can we have the slow "getwd()" call back? :^(.

Mikel Lechner                   UUCP: mikel@teraida.UUCP
Teradyne EDA, Inc.
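Note: neither SunOS implementation of getwd() is shown in this thread. The sketch below is the classic "climb through .." approach that the description of the pre-4.0 behaviour suggests, written under that assumption; the names walk_up_cwd and _find_name are hypothetical. Each path component is found by comparing inode numbers taken from the parent's directory entries, so entries are only stat'ed when a mount point is crossed. With the /hosts/<server>/<disk> layout, that crossing happens inside /hosts/sun2, where only sun2's own mount points live, so a dead sun1 is never touched.

```python
import os


def _find_name(parent, child, same_fs):
    """Return the entry in `parent` that refers to directory `child`."""
    target = (child.st_dev, child.st_ino)
    # Pass 1 (same filesystem only): trust the inode number stored in the
    # directory entry and stat just the matching candidate.
    # Pass 2: stat every entry. This is required at a mount point, and is
    # a fallback for filesystems whose entry inode numbers differ from
    # st_ino. This is the pass that can block on a dead server.
    for trust_entry_inode in ((True, False) if same_fs else (False,)):
        with os.scandir(parent) as entries:
            for entry in entries:
                if trust_entry_inode and entry.inode() != child.st_ino:
                    continue         # rejected without any stat call
                try:
                    st = os.lstat(entry.path)
                except OSError:
                    continue
                if (st.st_dev, st.st_ino) == target:
                    return entry.name
    raise FileNotFoundError(f"cannot find directory in {parent!r}")


def walk_up_cwd(start="."):
    """Compute the absolute path of `start` by climbing '..' to the root."""
    parts = []
    here, cur = start, os.stat(start)
    while True:
        up = os.path.join(here, "..")
        par = os.stat(up)
        if (par.st_dev, par.st_ino) == (cur.st_dev, cur.st_ino):
            break                    # '..' is the same directory: at the root
        parts.append(_find_name(up, cur, same_fs=(par.st_dev == cur.st_dev)))
        here, cur = up, par
    return "/" + "/".join(reversed(parts))


if __name__ == "__main__":
    print(walk_up_cwd("."))
```

If every NFS filesystem were mounted directly in one shared directory instead, the second pass would stat all of them whenever any one was being resolved, which is the hang the post describes.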
per@erix.ericsson.se (Per Hedeland) (12/11/89)
In article <2832@brazos.Rice.edu>, greg@sj.ate.slb.com (Greg Wageman) writes:
>This isn't a bug, it's a feature.
[lots of explanation deleted...]

Well, I'm sure most of us by now have a rather fixed opinion on the truth of that statement, and I don't really intend to comment on it. However, something occurred to me while discussing our latest manifestation of that "feature" (mailtool hanging an entire SunView session because the "mail server" had become inaccessible:-() - sorry if this has already been discussed, if so I must have missed it:

As I see it, the purpose of hard mounts is to let programs that don't handle errors on read/write/etc gracefully (which may in some cases be hard to do), work "correctly" all the same. On the other hand, most of the *problems* caused by hard mounts (such as these two examples) seem to be at *initial* file access, i.e. open/stat/etc. (Or, put another way: Most programs don't have files open for long periods of time.) Also, most programs already handle errors at that point - e.g. they are prepared for an expected file to be missing/inaccessible etc.

Now, I know next to nothing about how the NFS protocol works, and thus how feasible this would be, but my conclusion from the above is that it would be *very* useful to have a soft_open_stat_etc_but_hard_read_write_etc option when mounting NFS file systems.

Anyone got any comments on this? (Besides "Oh no, not *another* option!":-).

Regards
--Per Hedeland
per@erix.ericsson.se or
per%erix.ericsson.se@uunet.uu.net or
...uunet!erix.ericsson.se!per