[comp.sys.sun] /bin/login hangs if an NFS server down?

dank@moc.jpl.nasa.gov (Dan Kegel) (08/28/90)

We run Solbourne's SunOS 4.0.1, without quotas.  Whenever an NFS server on
my network is down, logins seem to take forever even if the filesystem
provided by the server has nothing to do with the user who is logging in.


Here is part of the output of trace -t /bin/login when a server is down:

10:20:11 open ("/etc/motd", 0, 0666) = 3
10:20:11 ioctl (3, 0x40125401, 0xf7fffaa6) = -1 ENOTTY (Inappropriate ioctl for device)
10:20:11 fstat (3, 0xf7fffb18) = 0
10:20:11 read (3, "OS/MP 4.0B (GENERIC/root) #1: Tu".., 8192) = 60
10:20:11 write (1, "OS/MP 4.0B (GENERIC/root) #1: Tu".., 59) = 59
10:20:11 write (1, "\n", 1) = 1
10:20:11 read (3, "", 8192) = 0
10:20:11 close (3) = 0
10:20:11 sigvec (2, 0xf7fffb34, 0xf7fffba0) = 0
10:20:11 stat ("/var/spool/mail/dank", 0xf7fffc88) = 0
10:20:11 write (1, "You have new mail.\n", 19) = 19
10:20:11 vfork () = 3639
10:20:11 wait4 (0, 0, 0, 0) = 0
10:21:27 - SIGCHLD (20)
10:21:27 wait4 (0, 0, 0, 0) = 3639
10:21:27 sigvec (14, 0xf7fffb94, 0xf7fffc00) = 0
10:21:27 sigvec (3, 0xf7fffb94, 0xf7fffc00) = 0
10:21:27 sigvec (2, 0xf7fffb94, 0xf7fffc00) = 0
10:21:27 sigvec (18, 0xf7fffb94, 0xf7fffc00) = 0
10:21:27 execve ("/bin/csh", 0xf7fffc60, 0x8958) = 0

If I unmount the offending file system, the output is the same except that
wait4() completes immediately.

Does anybody know what the heck /bin/login is doing in that child process
that takes 1'16" to terminate?  Is it secretly doing a df?
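The 1'16" figure can be read straight off the trace timestamps. As a rough sketch (the helper name largest_gap is made up for illustration, and it assumes every line starts with an HH:MM:SS stamp from the same day), the stall can be located like this:

```python
from datetime import datetime

def largest_gap(trace_lines):
    """Return (seconds, line_before, line_after) for the longest pause.

    Assumes each line begins with an HH:MM:SS timestamp and that the
    trace does not cross midnight.
    """
    stamped = []
    for line in trace_lines:
        stamp = line.partition(" ")[0]
        try:
            stamped.append((datetime.strptime(stamp, "%H:%M:%S"), line))
        except ValueError:
            continue  # skip lines without a leading timestamp
    best = None
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        gap = (t1 - t0).total_seconds()
        if best is None or gap > best[0]:
            best = (gap, l0, l1)
    return best

# Four lines copied from the trace above.
trace = [
    "10:20:11 vfork () = 3639",
    "10:20:11 wait4 (0, 0, 0, 0) = 0",
    "10:21:27 - SIGCHLD (20)",
    "10:21:27 wait4 (0, 0, 0, 0) = 3639",
]
seconds, before, after = largest_gap(trace)
print(seconds, "seconds between:", before, "|", after)
```

The whole 76 seconds falls between the first wait4() and the SIGCHLD, so the time is spent in the child created by vfork(), not in login itself.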

Please reply by e-mail, and I'll summarize.
- Dan Kegel (dank@moc.jpl.nasa.gov)

jim@cs.strath.ac.uk (Jim Reid) (10/08/90)

In article <1990Aug29.231300.10232@rice.edu> dank@moc.jpl.nasa.gov (Dan Kegel) writes:

|We run Solbourne's SunOS 4.0.1, without quotas.  Whenever an NFS server on
|my network is down, logins seem to take forever even if the filesystem
|provided by the server has nothing to do with the user who is logging in.

|Does anybody know what the heck /bin/login is doing in that child process
|that takes 1'16" to terminate?  Is it secretly doing a df?

No.

When login forks and execs /bin/csh, csh does a getwd() to determine its
current working directory. This involves getting the name and inode number
of every file in the parent directory. Comparing the inode number for '.'
(the current directory) with each name/inode number pair in the parent
directory gives the filename of the current directory. This procedure is
executed in successive parent directories (building the pathname as it
goes along) until it hits '/', where the inode numbers for '.' and '..'
are the same.
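The walk described above can be sketched roughly as follows. This is an illustration of the general technique, not the actual SunOS getwd() source; the function name getwd_sketch is made up, and it compares device and inode numbers together so that it also works across mount points.

```python
import os

def getwd_sketch(start="."):
    """Rebuild the absolute path of `start` by walking up through '..'."""
    path = start
    names = []
    while True:
        here = os.stat(path)
        parent_path = os.path.join(path, "..")
        parent = os.stat(parent_path)
        # At '/', '.' and '..' refer to the same directory: stop.
        if (here.st_dev, here.st_ino) == (parent.st_dev, parent.st_ino):
            break
        for name in os.listdir(parent_path):
            try:
                # This per-entry lookup is the step that would block if
                # `name` were a hard mount from a dead NFS server.
                entry = os.lstat(os.path.join(parent_path, name))
            except OSError:
                continue  # entry vanished or is unreadable; skip it
            if (entry.st_dev, entry.st_ino) == (here.st_dev, here.st_ino):
                names.append(name)
                break
        else:
            raise OSError("directory not found in its parent")
        path = parent_path
    return "/" + "/".join(reversed(names))

print(getwd_sketch("."))
```

Note that every entry in each parent directory gets looked up, including ones that have nothing to do with the user's own home directory.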

Now to get the inode number for an NFS mount point, the kernel has to make
a request to the NFS server. If the server is down and the filesystem is
hard mounted, the kernel will wait forever for a reply to the NFS request
it made. I suppose that there must be a mount point for a dead NFS server
somewhere in one of the directories between the user's home directory and
'/'.

		Jim

guy@uunet.uu.net (Guy Harris) (10/27/90)

>|Does anybody know what the heck /bin/login is doing in that child process
>|that takes 1'16" to terminate?  Is it secretly doing a df?

One hypothesis, presented in the posting to which I'm following up, is
that the user's shell is doing a "getwd()" to find the current directory.

This may well be the case, but there's another possible answer as well.

You may not be running quotas, but unless the client has done all its NFS
mounting with the "noquota" option, "/usr/ucb/quota", which is run by
"/usr/bin/login", will still try to talk to the quota daemons on all the
servers for file systems the client has mounted in order to find out
whether the user logging in is over quota or not.  If the server is down,
this will time out, and take a while to do so.
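To see which mounts would be affected, one could scan the client's fstab for NFS entries that lack "noquota". The following is only a sketch: the helper name and the sample hosts and paths are invented, and it assumes the usual whitespace-separated fstab layout with the options in the fourth field.

```python
def nfs_mounts_without_noquota(fstab_text):
    """Return mount points of NFS entries whose options lack 'noquota'."""
    missing = []
    for line in fstab_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        fields = line.split()
        if len(fields) < 4:
            continue  # not a complete fstab entry
        mount_point, fs_type, options = fields[1], fields[2], fields[3]
        if fs_type == "nfs" and "noquota" not in options.split(","):
            missing.append(mount_point)
    return missing

# Hypothetical fstab contents, for illustration only.
sample = """\
/dev/sd0a        /         4.2  rw           1 1
archive:/export  /archive  nfs  rw,bg,hard   0 0
home:/home       /home     nfs  rw,noquota   0 0
"""
print(nfs_mounts_without_noquota(sample))  # prints ['/archive']
```

Any mount point this reports is one where the quota check at login would still try to contact the server.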

ruck@reef.cis.ufl.edu (John Ruckstuhl) (11/16/90)

In article <1990Oct26.221451.18656@rice.edu> auspex!guy@uunet.uu.net (Guy Harris) writes:
>>|Does anybody know what the heck /bin/login is doing in that child process
>>|that takes 1'16" to terminate?  Is it secretly doing a df?
>
>You may not be running quotas, but unless the client has done all its NFS
>mounting with the "noquota" option, "/usr/ucb/quota", which is run by
>"/usr/bin/login", will still try to talk to the quota daemons on all the
>servers for file systems the client has mounted in order to find out
>whether the user logging in is over quota or not.  If the server is down,
>this will time out, and take a while to do so.

Our SPARCstation 1+'s (SunOS 4.1) NFS-mount auxiliary filesystems, which
consist of archives of various flavors and applications maintained by the
College's computer support department.  When the computer support
department's SPARCstation is unreachable (network failure or actually
down), all of our SPARCstation 1+'s hang (until the remote computer is
again reachable).

We would like to avoid this disruption (since the data on the remote
computer is not necessary for our continued operation).

I would think, from reading fstab(5), that a "bg" OR a "soft" option in the
mounting of the remote filesystems would avoid this hanging, but someone
else explains to me that "these options refer to the initial attempt to
mount, i.e. at boot-time, and are irrelevant to the interrupted
availability of the remote file-system".

"Further, there is no way to avoid your computers' hanging when the
already nfs-mounted remote filesystem becomes unavailable".

Please confirm this or suggest a workaround.

I am following up on Guy Harris' article because it suggests (to me, at
least) that my problem may be related to quota accounting.

John R Ruckstuhl, Jr
University of Florida		ruck@cis.ufl.edu, uflorida!ruck

chris@com50.c2s.mn.org (Chris Johnson) (11/21/90)

In article <1990Oct26.221451.18656@rice.edu> auspex!guy@uunet.uu.net (Guy Harris) writes:
>>|Does anybody know what the heck /bin/login is doing in that child process
>>|that takes 1'16" to terminate?  Is it secretly doing a df?
>
>You may not be running quotas, but unless the client has done all its NFS
>mounting with the "noquota" option, "/usr/ucb/quota", which is run by
>"/usr/bin/login", will still try to talk to the quota daemons on all the
>servers for file systems the client has mounted in order to find out
>whether the user logging in is over quota or not.  If the server is down,
>this will time out, and take a while to do so.

So, if you want to get around this hang, and are not running quotas,
you've got a few choices.  Instead of mounting with the noquota option
(since it's not a default, you'll have to remember it all the time for
manual mounts), we here instead just replaced /usr/ucb/quota with a link
to /usr/bin/true.  Makes things run a lot quicker at bootup and login,
since the quota check turns into a very short program call.

david@cs.uow.edu.au (David E A Wilson) (11/29/90)

chris@com50.c2s.mn.org (Chris Johnson) writes:
>Instead of mounting with the noquota option
>(since it's not a default, you'll have to remember it all the time for
>manual mounts), we here instead just replaced /usr/ucb/quota with a link
>to /usr/bin/true.  Makes things run a lot quicker at bootup and login,
>since the quota check turns into a very short program call.

Alternatively, even if your site is running quotas, you can create
.hushlogin in your home directory and /bin/login will not call
/usr/ucb/quota.  It will also not check for mail, etc., but you can do that
yourself in .profile (or .login, for csh users).
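Creating the file amounts to touching an empty .hushlogin in the home directory. A minimal sketch (the helper name create_hushlogin is made up; the demonstration uses a temporary directory in place of a real home directory):

```python
import os
import tempfile

def create_hushlogin(home_dir):
    """Create an empty .hushlogin in home_dir, leaving any existing one alone."""
    path = os.path.join(home_dir, ".hushlogin")
    with open(path, "a"):
        pass  # opening in append mode creates the file if it is missing
    return path

# Demonstrated on a scratch directory rather than a real home directory.
with tempfile.TemporaryDirectory() as fake_home:
    created = create_hushlogin(fake_home)
    print(os.path.basename(created), os.path.getsize(created))
```

For a real account the call would be create_hushlogin(os.path.expanduser("~")).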

David Wilson	Dept Comp Sci, Uni of Wollongong	david@cs.uow.edu.au