[comp.protocols.nfs] login hangs when server dies

petri (Stefan Petri) (09/29/89)

Configuration: Targon/35-M50 with TOS3.2 and NFS3.2
	(that's a clone of Pyramid 9810 OSx4.0, I think)
	linked via Ethernet to some Sun3/60s (SunOS 3.5 and 4.0.3)

Problem: when an NFS server dies, `login' on all of its clients
hangs somewhere after displaying /etc/motd (saying nfs-server gargel
not responding, still trying), even if neither the user nor the
system needs (or seems to need?) the files from the dead server
(e.g. remote-mounted man pages).

From the manuals I see that a process is supposed to hang if it tries to
access a file from a dead NFS server, but why does this happen at login time?
The problem occurs on the Suns as well as on the Pyramid clone, all running
Sun NFS.

HELP !

S.P.
--
Stefan Petri					<petri@tubsibr.UUCP>
Technische Universitaet Braunschweig, Institut fuer Betriebssysteme und
Rechnerverbund, 3300 Braunschweig, W. Germany.

geoff@hinode.East.Sun.COM (Geoff Arnold @ Sun BOS - R.H. coast near the top) (09/29/89)

In article <1989Sep28.195215.4656@tubsibr.uucp> petri@tubsibr.UUCP (Stefan Petri) writes:
>
>Configuration: Targon/35-M50 with TOS3.2 and NFS3.2
>	(that's a clone of Pyramid 9810 OSx4.0, I think)
>	linked via Ethernet to some Sun3/60s (SunOS 3.5 and 4.0.3)
>
>Problem: when an NFS server dies, `login' on all of its clients
>hangs somewhere after displaying /etc/motd (saying nfs-server gargel
>not responding, still trying), even if neither the user nor the
>system needs (or seems to need?) the files from the dead server
>(e.g. remote-mounted man pages).

You may not need them, but your shell may! Seriously, if you have
an inaccessible NFS-mounted directory anywhere in your path you are likely to
see this problem. For example, my "/usr/local/`arch`/bin" occurs in
my path ahead of my home directory (possibly dumb, but never mind), and
my ".login" includes a reference to a private script in my home directory.
Obviously the shell had to try to scan "/usr/local/`arch`/bin" to resolve this.
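
A minimal sketch in Python of the search order described above. It is
not how any particular shell is implemented; the function name and the
directory list are made up for illustration. The point is only that
every directory ahead of the match gets touched, so one dead NFS mount
early in the path is enough to stall the lookup.

```python
import os

def resolve_command(name, path_dirs):
    """Return (full_path, probed) for the first executable match.

    `probed` lists every directory examined, in order. Each probe
    touches the file system holding that directory; on a hard-mounted
    NFS directory whose server is down, the probe would block.
    """
    probed = []
    for directory in path_dirs:
        probed.append(directory)
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate, probed
    return None, probed
```

With a path of [remote_bin, home_dir] and the script living in
home_dir, `probed` comes back as both directories: the remote one is
examined first even though nothing is found there.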

For a long time I mounted "local-host:/usr/local" on my "/usr/local",
which led to the problem you described whenever "local-host" was down.
Eventually I switched to using the automounter: my "/usr/local" is
now a symbolic link to "/vol/local", and "auto.vol" defines a couple
of alternative servers for "/usr/local". Not only do I get fault-
tolerance, but if neither server is available the automounter will time
out after a sensible interval.

If the Targon system doesn't support the automounter yet, ask them when
they'll be upgrading to NFS based on NFSSRC4.0.

Geoff Arnold,                              Internet: geoff@East.Sun.COM
PCDS Group, Sun Microsystems Inc.
---------------------------------------------------------------------------
I love standards: de facto, de jure, de gustibus, de lusionary, de credo...

barmar@kulla (Barry Margolin) (10/01/89)

In article <1989Sep28.195215.4656@tubsibr.uucp> petri@tubsibr.UUCP (Stefan Petri) writes:
>Problem: when an NFS server dies, `login' on all of its clients
>hangs somewhere after displaying /etc/motd (saying nfs-server gargel
>not responding, still trying), even if neither the user nor the
>system needs (or seems to need?) the files from the dead server
>(e.g. remote-mounted man pages).

We've seen this on Suns running SunOS 4.0.3, as well as earlier
releases.  And it isn't only the shell that hangs; I've seen "rn" and
"emacsclient" hang, as well.

Someone else already mentioned the possibility that a directory
mounted from the dead server is in your PATH.  Another problem is
NFS-mounted files adjacent to one of the directories between your
working directory and the root.  This can cause the getwd() function
to hang.  Here's why:

In order to find out the name of a directory, getwd() first stat()s
"." to find out its inode#.  Then it scans ".." and stat()s each file
until it finds one with the same inode#.  If one of these files is an
NFS mount point, or a symbolic link to a path on an NFS-mounted file
system, the stat() must contact the NFS server, and will hang if it's
down.
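
The scan just described can be sketched in Python for a single level of
the walk. This is an illustration under my own assumptions, not the
library source: the function name is invented, entries are sorted only
to make the result repeatable, and the device number is compared along
with the inode number.

```python
import os

def name_in_parent(directory, stat_fn=os.stat):
    """Find the name of `directory` by scanning its parent.

    Returns (name, examined). `examined` lists every entry of ".."
    that was passed to stat_fn before the match; any one of those
    calls could block if the entry leads to a dead NFS server.
    """
    here = os.stat(directory)
    parent = os.path.join(directory, os.pardir)
    examined = []
    for entry in sorted(os.listdir(parent)):
        examined.append(entry)
        try:
            info = stat_fn(os.path.join(parent, entry))
        except OSError:
            continue  # e.g. a dangling symbolic link
        if (info.st_dev, info.st_ino) == (here.st_dev, here.st_ino):
            return entry, examined
    return None, examined
```

If the parent holds "a_mount" and "work", asking for the name of
"work" examines "a_mount" first. That sibling is the one that hurts
when it is a mount point for a server that is down.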

When using the automounter, there is an additional behavior caused by
this problem.  When all the NFS servers are up, getwd() may still take
a long time to complete.  This is because many of the file systems it
encounters may not be mounted, and it takes a significant fraction of
a second to mount them.

I think the solution is for getwd() to use lstat() rather than stat(),
and to keep mount points out of the root.  This way, getwd() will only
encounter links, and lstat() shouldn't need to access the target of
the link.
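
A small Python sketch of the difference this relies on. A dangling
symbolic link stands in for a link into an unreachable file system
(an assumption for the demo: a dead server would make stat() block
rather than fail, but in both cases lstat() never looks past the link
itself). The names here are hypothetical.

```python
import os
import tempfile

def probe_link(link_path):
    """Report whether lstat() and stat() succeed on link_path."""
    results = {}
    for label, fn in (("lstat", os.lstat), ("stat", os.stat)):
        try:
            fn(link_path)
            results[label] = True
        except OSError:
            results[label] = False
    return results

def demo():
    with tempfile.TemporaryDirectory() as root:
        link = os.path.join(root, "usr_local")
        # The target does not exist, so anything that follows the link fails.
        os.symlink(os.path.join(root, "no_such_target"), link)
        return probe_link(link)
```

demo() returns {"lstat": True, "stat": False}: lstat() describes the
link without consulting its target, while stat() has to follow it.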

Barry Margolin, Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar

brent%terra@Sun.COM (Brent Callaghan) (10/04/89)

In article <30535@news.Think.COM>, barmar@kulla (Barry Margolin) writes:
> Someone else already mentioned the possibility that a directory
> mounted from the dead server is in your PATH.  Another problem is
> NFS-mounted files adjacent to one of the directories between your
> working directory and the root.  This can cause the getwd() function
> to hang.  Here's why:
> 
> In order to find out the name of a directory, getwd() first stat()s
> "." to find out its inode#.  Then it scans ".." and stat()s each file
> until it finds one with the same inode#.  If one of these files is an
> NFS mount point, or a symbolic link to a path on an NFS-mounted file
> system, the stat() must contact the NFS server, and will hang if it's
> down.

In SunOS 4.0 the getwd() algorithm was changed.  If it crosses a
mountpoint in its walk up the file tree it takes a peek in /etc/mtab.
It looks for a mount entry with the same device id and when it finds
the mountpoint it just steals the path from the /etc/mtab entry
and prepends it to the current path.
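
Roughly, in Python (a sketch of the lookup only, not the libc code; the
function name, the table layout, and the device numbers in the usage
note are invented for illustration):

```python
def path_via_mount_table(device_id, relative_path, mount_table):
    """Build an absolute path from a mount table match.

    mount_table is a list of (mount_point, device_id) pairs, standing
    in for what would be derived from /etc/mtab. relative_path is the
    part of the path already collected below the mount point.
    """
    for mount_point, dev in mount_table:
        if dev == device_id:
            if not relative_path:
                return mount_point
            return mount_point.rstrip("/") + "/" + relative_path
    return None
```

With a table of [("/", 0x700), ("/usr", 0x706), ("/home/server", 0x8001)],
a directory on device 0x8001 that is "petri/src" below its mount point
resolves to "/home/server/petri/src" without walking any further up.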

The good news about this is that you don't have to walk all the way
up to the root and can avoid stat'ing mountpoints in "/" or "/usr".
The bad news is that you stat *all* the mountpoints in /etc/mtab - so
the hanging problem could be worse than before.  To work around this
problem the device ids for all the mountpoints in /etc/mtab are
cached in a file, /tmp/.getwd.  As long as /etc/mtab is not
updated (no mounts or unmounts) the cache can be consulted for
pathname/device-id pairs instead of risking lots of stat'ing of
mountpoints.

A problem with this scheme is that the getwd cache becomes invalid
whenever /etc/mtab is modified.  If you use the automounter,
/etc/mtab can be updated so frequently that the getwd cache is
almost useless.
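
To make the invalidation concrete, here is a Python sketch of one way
such a freshness check could work. Comparing modification times is an
assumption made for this example; it is not a description of the
actual check, and the function name is invented.

```python
import os

def cache_is_usable(cache_path, mtab_path):
    """True if the cache exists and is at least as new as the mount table."""
    try:
        return os.stat(cache_path).st_mtime >= os.stat(mtab_path).st_mtime
    except OSError:
        return False  # a missing cache or missing mount table means no cache
```

Under this scheme every mount or unmount makes the mount table newer
than the cache, so a busy automounter forces a rebuild almost every
time the cache is consulted.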

In SunOS 4.1 (real soon now) the /tmp/.getwd cache is gone.  The
device id for each mountpoint is now kept in the /etc/mtab file
itself.  It lives in the mount options string as a hex number
following the string "dev=".  Commands like mount and automount
that append entries to the /etc/mtab just stat() the new mount
(not much chance of hanging) and insert the "dev=" stuff into
the mount option string.  This is not user-visible - unless
you cat the /etc/mtab.  Since the device id is an invariant for
the lifetime of a mount it makes sense for the device id to
live in the /etc/mtab with all the other mount-invariant
information.  Any program like getwd(), df or find that likes
to stat mountpoints to get a devid can avoid the stat if
the "dev=" is present.
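
For illustration, pulling the device id out of an option string could
look like this in Python. Only the "dev=" key and the hex encoding come
from the description above; the function name and the sample option
strings are made up.

```python
def device_id_from_options(options):
    """Return the integer after "dev=" in a comma-separated option string.

    Returns None when no usable "dev=" entry is present, in which case
    a caller would have to fall back to stat()ing the mountpoint.
    """
    for option in options.split(","):
        key, _, value = option.partition("=")
        if key == "dev" and value:
            try:
                return int(value, 16)  # the id is stored as a hex number
            except ValueError:
                return None
    return None
```

device_id_from_options("rw,dev=0706") gives 0x706, and an option string
without "dev=" gives None.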

Now getwd() doesn't need to stat() mountpoints at all (unless the dev=
is missing for some reason).  This seems to be working well - I 
think we've got that one licked (finally).

	Brent

Made in New Zealand -->  Brent Callaghan  @ Sun Microsystems
			 uucp: sun!bcallaghan
			 phone: (415) 336 1051

dwh@twg-ap.UUCP (Dave Hamaker) (10/06/89)

From article <1989Sep28.195215.4656@tubsibr.uucp>, by petri (Stefan Petri):
> 
> Configuration: Targon/35-M50 with TOS3.2 and NFS3.2
> 	(that's a clone of Pyramid 9810 OSx4.0, I think)
> 	linked via Ethernet to some Sun3/60s (SunOS 3.5 and 4.0.3)
> 
> Problem: when an NFS server dies, `login' on all of its clients
> hangs somewhere after displaying /etc/motd (saying nfs-server gargel
> not responding, still trying), even if neither the user nor the
> system needs (or seems to need?) the files from the dead server
> (e.g. remote-mounted man pages).

You might look at your shell profile file(s).  The default /etc/profile on my
system did a ". /etc/dfspace" just before printing /etc/motd.  /etc/dfspace
is a shell script which edits the output of df -t with awk to tell you about
available disk space.  It wasn't written with NFS in mind, but it turns out to
ignore the mounted NFS filesystems which df reports.  When I changed the dfspace
reference to report specifically on my normal local filesystems (e.g.
". /etc/dfspace" -> "/etc/dfspace / /usr"), login hanging ceased.

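The same idea, sketched in Python rather than as the dfspace script
itself: ask about an explicit list of local file systems instead of
everything that is mounted. The function name is invented for this
example, and the arithmetic is just the usual statvfs calculation.

```python
import os

def free_space(paths):
    """Return {path: free_bytes} for only the listed file systems."""
    report = {}
    for path in paths:
        info = os.statvfs(path)  # touches only the file system holding `path`
        report[path] = info.f_bavail * info.f_frsize
    return report
```

Calling free_space(["/", "/usr"]) never queries an NFS mount, so a dead
server cannot stall it.
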
Dave Hamaker
The Wollongong Group
dwh@twg.com

petri (Stefan Petri) (10/08/89)

In article <1989Sep28.195215.4656@tubsibr.uucp>, I wrote :
[..]
> Problem: when an NFS server dies, `login' on all of its clients
> hangs somewhere after displaying /etc/motd
[..]

Ok, I got the solution to my problem :
( mailed to me by guido%rwthinf%unido@infbs (Guido Bunsen) )

It's NOT a $PATH referencing the nfs-directories.
It's NOT some /etc/dfspace or the like.
It's NOT getwd() stat()-ing around in a directory containing a mountpoint.

It IS the mount option "noquota", which is not documented for NFS filesystems
in my manuals - but turns out to be essential for me.

Thanks to all those who thought about my problem.

S.P.
--
Stefan Petri					<petri@tubsibr.UUCP>
Technische Universitaet Braunschweig, Institut fuer Betriebssysteme und
Rechnerverbund, 3300 Braunschweig, W. Germany.