G.Eustace@massey.ac.nz (Glen Eustace) (04/15/91)
The following saga is from our Tales of Mystery and Imagination....
Actually it happened this afternoon, and after consulting as many
individuals as possible, all are at a loss to explain the following
behaviour. I am hoping that someone else has observed this behaviour
and, if they don't have a cure, may at least have an explanation.

Environment:

  Pyramid 9815  OSx5.0d
    All user file systems have quotas enabled
    All nfs user file systems are mounted without quotas

  DECStation 3100s  Ultrix 4.1
    All user file systems have quotas enabled
    All nfs user file systems are mounted without quotas

  All user file systems are cross mounted hard on all servers.

What happened:

One of the DECStations developed a fault and died. The /users/u12 file
system was now offline. This file system could not be dismounted from
the Pyramid. We don't have this problem with Ultrix: if the host is
unreachable, the unmount reports the problem but succeeds in unmounting
the file system anyway.

I was attempting to archive some files from one of the Pyramid's local
file systems, /users/u5. The 'find' was getting stuck in a 'D' wait; it
would come up with

  cc-labserver2:/users/u12 not responding...

and would eventually time out with a getattr failure. We tried 'ls'; it
succeeded sometimes but failed with the same error on others. Note that
we were working completely within the local machine. Why is the Pyramid
trying to access a remote file system in this situation?

In addition to the above, the login process is very slow. It would
appear that login is trying to check quotas on the faulty machine. Why?
We have not mounted its file system with quotas.

All in all, the only way to fix this was to reboot the Pyramid. Not a
solution I like, but it did work. I am perplexed as to why such a simple
command as 'ls' or 'find' on a local file system should cause a
'getattr' call on an entirely different remote file system.
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Glen Eustace, Systems Software Manager | EMail: G.Eustace@massey.ac.nz
Computer Centre, Massey University, Palmerston North, New Zealand
Phone: +64 63 69099 x7440, Fax: +64 63 505 607, Timezone: GMT-12
johnk@gordian.com (John Kalucki) (04/17/91)
In article <1991Apr15.045511.6354@massey.ac.nz>, G.Eustace@massey.ac.nz (Glen Eustace) writes:
|> [...]
|> All in all the only way to fix this was to reboot the pyramid. Not a
|> solution I like but it did work. I am perplexed with why such a simple
|> command like 'ls' or 'find' on a local file system should cause a
|> 'getattr' call on an entirely different remote file system.
|> -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
|> Glen Eustace, Systems Software Manager | EMail: G.Eustace@massey.ac.nz
|> Computer Centre, Massey University, Palmerston North, New Zealand
|> Phone: +64 63 69099 x7440, Fax: +64 63 505 607, Timezone: GMT-12

What is probably happening is that getattr is being called on the nfs
mount point as the program walks up the tree. This is exactly the
problem I posted about a few weeks ago on my Mips machines. I've
mounted them as follows and it has helped a bit:

  alecto:/n/alecto /n/alecto nfs bg,rw,soft,timeo=2,intr 0 0

...but what I'd really like is some method to allow me to umount a
dead nfs filesystem.

-John Kalucki
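To make the "walks up the tree" explanation concrete, here is a minimal Python sketch of how an old-style getcwd() can end up touching a mount point that has nothing to do with the files being read. It is an illustration only, not the Pyramid's or Mips' actual code: the function name and the `stop` parameter are invented for this example. At each level it scans the parent directory and lstat()s every entry until it finds the one matching the child, so any sibling that happens to be a hard-mounted NFS directory gets a getattr too.

```python
import os

def entries_statted_while_resolving(start, stop="/"):
    """Rebuild the path of `start` by walking up toward `stop`.

    Hypothetical helper: at each level the parent directory is scanned
    and every entry is lstat()ed until one matches the child's
    (device, inode) pair. Returns the rebuilt path and the list of
    every path that was lstat()ed on the way.
    """
    stop = os.path.realpath(stop)
    current = os.path.realpath(start)
    touched = []
    parts = []
    while current != stop:
        parent = os.path.dirname(current)
        if parent == current:          # reached the real root
            break
        here = os.lstat(current)
        for name in sorted(os.listdir(parent)):
            candidate = os.path.join(parent, name)
            try:
                # If `candidate` is a hard NFS mount point whose server
                # is down, this is the call that would block.
                st = os.lstat(candidate)
            except OSError:
                continue
            touched.append(candidate)
            if (st.st_dev, st.st_ino) == (here.st_dev, here.st_ino):
                parts.append(name)
                break
        current = parent
    return os.path.join(stop, *reversed(parts)), touched
```

With mount points laid out side by side, as /users/u5 and /users/u12 are in the original report, the scan of /users would reach u12 on the way to u5 whenever the directory order puts it first, which would fit the observation that 'ls' only failed some of the time.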
brent@terra.Eng.Sun.COM (Brent Callaghan) (04/17/91)
In article <130@gordius.gordian.com>, johnk@gordian.com (John Kalucki) writes:
> ...but what I'd really like is some method to allow me to umount a
> dead nfs filesystem.

What's the problem? SunOS lets you unmount a dead NFS filesystem. The
same should be true for other implementations: the client's kernel
does not need to communicate with the server in order to do an unmount.
After the unmount is complete, the umount command will do the server a
courtesy and send an unmount message to its mount daemon, but it's no
big deal if this times out.

On the other hand, you can easily end up with a mount that cannot be
unmounted if you have hierarchically related mounts. For example, if
/usr/local and /usr/local/share are two NFS mounts and the server for
/usr/local falls over, then heaven and earth won't shift the mounts.
Why? You can't unmount /usr/local because the sub-mount of
/usr/local/share keeps it EBUSY. You can't unmount /usr/local/share
because the unmount system call will hang in the pathname lookup when
it hits /usr/local.

--
Made in New Zealand --> Brent Callaghan @ Sun Microsystems
Email: brent@Eng.Sun.COM phone: (415) 336 1051
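The two rules in the /usr/local example can be written down as a small Python sketch that works out which mounts could still be unmounted. This is a toy model of the reasoning in the post, not of any real kernel: the function names and the server names in the usage note are invented, and it assumes, as described for SunOS above, that a dead mount with no sub-mounts can be unmounted without contacting its server.

```python
def is_under(path, ancestor):
    """True if `path` lies strictly below the directory `ancestor`."""
    ancestor = ancestor.rstrip("/")
    if ancestor == "":                 # ancestor was "/"
        return path != "/"
    return path.startswith(ancestor + "/")

def plan_unmounts(mounts, dead_servers):
    """Return (order, stuck) for a {mount_point: server} mapping.

    A mount can be removed only if
      * nothing is still mounted below it (otherwise EBUSY), and
      * no mount above it belongs to a dead server (otherwise the
        pathname lookup hangs on the way down).
    """
    remaining = dict(mounts)
    order = []
    progressed = True
    while progressed:
        progressed = False
        # Try the deepest mount points first.
        for point in sorted(remaining, key=len, reverse=True):
            others = [p for p in remaining if p != point]
            has_submount = any(is_under(p, point) for p in others)
            dead_above = any(
                is_under(point, p) and remaining[p] in dead_servers
                for p in others
            )
            if not has_submount and not dead_above:
                order.append(point)
                del remaining[point]
                progressed = True
                break                  # rescan: the mount table changed
    return order, sorted(remaining)
```

For example, with /usr/local on a hypothetical server "a", /usr/local/share on "b" and /home on "a", marking "a" as dead leaves both /usr/local mounts stuck while /home can still go; marking only "b" as dead lets everything be unmounted, deepest first.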
barmar@think.com (Barry Margolin) (04/18/91)
In article <11700@exodus.Eng.Sun.COM> brent@terra.Eng.Sun.COM (Brent Callaghan) writes:
>In article <130@gordius.gordian.com>, johnk@gordian.com (John Kalucki) writes:
>> ...but what I'd really like is some method to allow me to umount a
>> dead nfs filesystem.
>What's the problem ? SunOS lets you unmount a dead NFS filesystem.

This is basically true, but circumstances often prevent it. You
identified the problem with hierarchical mounts; a solution to this is
to avoid hierarchical mounts and use symbolic links to emulate them.

A worse problem is that you can't unmount a file system on which any
process has a file open. You generally don't notice that a file server
is down until you start seeing "NFS server xxx not responding", and
these messages occur when someone is trying to access a file on that
file system. Unless you have the file system mounted soft, the process
won't continue until the file server comes back. This causes a
deadlock: the process can't close its files because the server is down,
and the file system can't be unmounted because the process hasn't
closed all its files.

Another deadlock occurs if the file system configuration on the server
changes in such a way that old handles to the root of the file system
become invalid (I think this can sometimes happen as a result of the
fsck that runs when the server reboots, if the file system needed lots
of fixing). Unmount gets a "stale file handle" error when it tries to
reference the mount point, and it can't unmount it.

--
Barry Margolin, Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar
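One way to avoid joining the deadlock yourself is to probe a suspect path from a throwaway child process, so that only the child blocks if the server is down. The Python sketch below is an assumption-laden illustration, not something proposed in the thread: the function name and the default timeout are invented, and on a hard mount without intr the abandoned child may well stay in 'D' wait until the server returns.

```python
import subprocess
import sys

def path_responds(path, timeout=5.0):
    """Return True if a stat() of `path` completes within `timeout` seconds.

    The stat runs in a separate process, so a hung NFS server stalls
    the child rather than the caller. False means either that the
    probe timed out or that the stat itself failed.
    """
    child = subprocess.Popen(
        [sys.executable, "-c", "import os, sys; os.stat(sys.argv[1])", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        return child.wait(timeout=timeout) == 0
    except subprocess.TimeoutExpired:
        # The kill may not take effect while the child sits in an
        # uninterruptible wait, so we deliberately do not wait for it.
        child.kill()
        return False
```

A script that archives local files could call path_responds() on each NFS mount point first and skip or warn about the ones that do not answer, rather than opening files there and becoming one more process that keeps the file system busy.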