[comp.protocols.nfs] peculiar NFS behaviour

G.Eustace@massey.ac.nz (Glen Eustace) (04/15/91)

The following saga is from our Tales of Mystery and Imagination....

Actually it happened this afternoon and after having consulted as
many individuals as possible, all are at a loss to explain the
following behaviour.  I am hoping that someone else has observed this
behaviour and if they don't have a cure may at least have an
explanation.

Environment:
  Pyramid 9815 OSx5.0d
     All user file systems have quotas enabled
     All nfs user file systems are mounted without quotas
  DECStation 3100s Ultrix 4.1
     All user file systems have quotas enabled
     All nfs user file systems are mounted without quotas

  All user file systems are cross mounted hard on all servers.

What happened.
  One of the DECStations developed a fault and died.  The /users/u12
file system was now offline.  This file system could not be
dismounted from the Pyramid.  We don't have this problem with Ultrix,
if the host is unreachable the unmount reports the problem but
succeeds in unmounting the file system anyway.

  I was attempting to archive some files from one of the Pyramid's
local file systems, /users/u5.  The 'find' kept getting stuck in a 'D'
wait; it would report "cc-labserver2:/users/u12 not responding..."
and would eventually time out with a getattr failure.  We tried 'ls';
it succeeded sometimes but failed with the same error at other times.
Note that we were working completely within the local machine.  Why
is the Pyramid trying to access a remote file system in this
situation?

In addition to the above the login process is very slow.  It would
appear that login is trying to check quotas on the faulty machine.
Why?  We have not mounted its file system with quotas.

All in all the only way to fix this was to reboot the Pyramid.  Not a
solution I like but it did work.  I am perplexed as to why such a simple
command like 'ls' or 'find' on a local file system should cause a
'getattr' call on an entirely different remote file system.

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Glen Eustace, Systems Software Manager | EMail: G.Eustace@massey.ac.nz
 Computer Centre,  Massey University,  Palmerston North,  New Zealand
Phone: +64 63 69099 x7440, Fax: +64 63 505 607,       Timezone: GMT-12

johnk@gordian.com (John Kalucki) (04/17/91)

In article <1991Apr15.045511.6354@massey.ac.nz>, G.Eustace@massey.ac.nz (Glen Eustace) writes:
|> [...]
|> All in all the only way to fix this was to reboot the pyramid.  Not a
|> solution I like but it did work.  I am perplexed with why such a simple
|> command like 'ls' or 'find' on a local file system should cause a
|> 'getattr' call on an entirely different remote file system.

What is probably happening is that getattr is being called on the
NFS mount point as the command walks up the tree.  This is the exact same
problem that I posted a few weeks ago about my Mips machines. I've
mounted them as follows and it has helped a bit:

alecto:/n/alecto        /n/alecto       nfs bg,rw,soft,timeo=2,intr 0 0

...but what I'd really like is some method to allow me to umount a
dead nfs filesystem.
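
The walk up the tree mentioned above can be sketched as follows.  This
is only an illustration of the classic getcwd-style algorithm (find
your own name by stat-ing every entry of the parent directory); whether
'find' or 'ls' on OSx does exactly this is an assumption, and the
function name walk_up_stat_targets is made up for the example.  The
point is that a sibling such as /users/u12 gets stat-ed while looking
for /users/u5, and a stat on a dead hard mount would block.

```python
import os
import tempfile


def walk_up_stat_targets(start, stop):
    """List the paths a getcwd-style walk would lstat between start and stop.

    For each directory on the way up, every entry of its parent is
    lstat-ed until one matches the (device, inode) of the directory.
    """
    touched = []
    current = os.path.realpath(start)
    stop = os.path.realpath(stop)
    # Stop at the requested ancestor, or at the file system root as a guard.
    while current != stop and current != os.path.dirname(current):
        parent = os.path.dirname(current)
        target = os.lstat(current)
        # Sorted only to make the sketch deterministic; real readdir
        # order is arbitrary.
        for name in sorted(os.listdir(parent)):
            candidate = os.path.join(parent, name)
            touched.append(candidate)
            # On a dead hard-mounted NFS mount point this call would hang.
            st = os.lstat(candidate)
            if (st.st_dev, st.st_ino) == (target.st_dev, target.st_ino):
                break
        current = parent
    return touched


def demo():
    """Build a throwaway users/u5 and users/u12 tree and show the walk."""
    root = tempfile.mkdtemp()
    os.makedirs(os.path.join(root, "users", "u5"))
    os.makedirs(os.path.join(root, "users", "u12"))
    for path in walk_up_stat_targets(os.path.join(root, "users", "u5"), root):
        print("lstat", path)


if __name__ == "__main__":
    demo()
```

In the demo the directories are ordinary local ones, so nothing hangs;
it only shows that u12 is touched on the way up from u5.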

		-John Kalucki

brent@terra.Eng.Sun.COM (Brent Callaghan) (04/17/91)

In article <130@gordius.gordian.com>, johnk@gordian.com (John Kalucki) writes:
> ...but what I'd really like is some method to allow me to umount a
> dead nfs filesystem.

What's the problem ?  SunOS lets you unmount a dead NFS filesystem.

The same should be true for other implementations - the client's
kernel does not need to communicate with the server in order to
do an unmount.  After the unmount is complete the umount command
will do the server a courtesy and send an unmount message to its
mount daemon - but it's no big deal if this times out.

On the other hand, you can easily get a mount that cannot be unmounted
if you have hierarchically related mounts, e.g. if /usr/local and
/usr/local/share are two NFS mounts and the server for /usr/local
falls over, then heaven and earth won't shift the mounts.  Why ?

Can't unmount /usr/local because the sub-mount of /usr/local/share
keeps it EBUSY.

Can't unmount /usr/local/share because the unmount system call will
hang in the pathname lookup when it hits /usr/local.
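
The two cases above can be put into a toy model.  This is a sketch
only, not how any kernel decides things: the function names, the server
names "a" and "b", and the "EBUSY"/"HANG"/"OK" strings are all made up
for the illustration, and hard mounts are assumed.

```python
import posixpath


def is_under(path, mount_point):
    """True if path lies strictly below mount_point in the tree."""
    return (path != mount_point
            and posixpath.commonpath([path, mount_point]) == mount_point)


def unmount_outcome(target, mounts, dead_servers):
    """Predict what unmounting `target` does in this toy model.

    mounts maps mount point -> server name; dead_servers is a set of
    server names that are not responding.
    """
    # A mount with another mount below it is busy.
    if any(is_under(other, target) for other in mounts):
        return "EBUSY"
    # The pathname lookup has to cross every mount above the target;
    # if one of those servers is dead, the lookup hangs.
    for other, server in mounts.items():
        if is_under(target, other) and server in dead_servers:
            return "HANG"
    # The target's own server does not have to answer for the unmount.
    return "OK"


if __name__ == "__main__":
    mounts = {"/usr/local": "a", "/usr/local/share": "b"}
    for point in mounts:
        print(point, unmount_outcome(point, mounts, {"a"}))
```

With server "a" dead, neither mount can be removed in the model; with
only "b" dead, /usr/local/share unmounts and /usr/local follows.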
--

Made in New Zealand -->  Brent Callaghan  @ Sun Microsystems
			 Email: brent@Eng.Sun.COM
			 phone: (415) 336 1051

barmar@think.com (Barry Margolin) (04/18/91)

In article <11700@exodus.Eng.Sun.COM> brent@terra.Eng.Sun.COM (Brent Callaghan) writes:
>In article <130@gordius.gordian.com>, johnk@gordian.com (John Kalucki) writes:
>> ...but what I'd really like is some method to allow me to umount a
>> dead nfs filesystem.
>What's the problem ?  SunOS lets you unmount a dead NFS filesystem.

This is basically true, but circumstances often prevent it.  You identified
the problem with hierarchical mounts; a solution to this is to avoid
hierarchical mounts and use symbolic links to emulate them.

A worse problem is that you can't unmount a file system on which any
process has a file open.  You generally don't notice that a file server is
down until you start seeing "NFS server xxx not responding", and these
occur when someone is trying to access a file on that file system.  Unless
you have the file system mounted soft, the process won't continue until the
file server comes back.  This causes a deadlock: the process can't close
files because the server is down, and the file system can't be unmounted
because the process hasn't closed all its files.
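
The difference between hard and soft mounts in the paragraph above can
be sketched as a retry loop.  This is a simplified model, not the real
client code: nfs_request, retrans and max_hard_tries are names chosen
for the illustration, and a real hard mount has no retry cap at all
(the cap here only lets the sketch terminate).

```python
def nfs_request(server_responds, soft, retrans=3, max_hard_tries=10):
    """Toy model of one NFS request.

    server_responds(attempt) -> bool says whether that attempt gets a
    reply.  A soft mount gives up after `retrans` attempts; a hard
    mount keeps retrying.
    """
    attempt = 0
    while True:
        attempt += 1
        if server_responds(attempt):
            return "ok after %d attempt(s)" % attempt
        if soft and attempt >= retrans:
            # Soft mount: the error goes back to the process, which can
            # then close its files and exit.
            return "error: timed out"
        if not soft and attempt >= max_hard_tries:
            # Hard mount: a real client would still be retrying here,
            # with the process blocked and its files still open.
            return "still blocked"


if __name__ == "__main__":
    def dead(attempt):
        return False

    print("soft:", nfs_request(dead, soft=True))
    print("hard:", nfs_request(dead, soft=False))
```

In the hard case the process never gets to close its files, which is
what keeps the file system busy.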

Another deadlock occurs if the file system configuration on the server
changes in such a way that old handles to the root of the file system
become invalid (I think this can sometimes happen as a result of the fsck
that runs when the server reboots, if the file system needed lots of
fixing).  Unmount gets a "stale file handle" error when it tries to
reference the mount point, and it can't unmount it.
--
Barry Margolin, Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar