petri (Stefan Petri) (09/29/89)
Configuration: Targon/35-M50 with TOS3.2 and NFS3.2 (that's a clone of the Pyramid 9810 with OSx4.0, I think), linked via Ethernet to some Sun3/60s (SunOS 3.5 and 4.0.3).

Problem: when an NFS server dies, `login' on all of its clients hangs somewhere after displaying /etc/motd (saying "nfs-server gargel not responding, still trying"), even if neither the user nor the system needs (or seems to need?) any files from the dead server (e.g. remote-mounted man pages).

From the manuals I see that a process is supposed to hang if it tries to access a file on a dead NFS server, but why does this happen at login time? The problem occurs on the Suns as well as on the Pyramid clone, all running Sun NFS.

HELP!

S.P.
--
Stefan Petri <petri@tubsibr.UUCP>
Technische Universitaet Braunschweig, Institut fuer Betriebssysteme und Rechnerverbund, 3300 Braunschweig, W. Germany.
geoff@hinode.East.Sun.COM (Geoff Arnold @ Sun BOS - R.H. coast near the top) (09/29/89)
In article <1989Sep28.195215.4656@tubsibr.uucp> petri@tubsibr.UUCP (Stefan Petri) writes:
> Configuration: Targon/35-M50 with TOS3.2 and NFS3.2 (that's a clone of the Pyramid 9810 with OSx4.0, I think), linked via Ethernet to some Sun3/60s (SunOS 3.5 and 4.0.3).
>
> Problem: when an NFS server dies, `login' on all of its clients hangs somewhere after displaying /etc/motd (saying "nfs-server gargel not responding, still trying"), even if neither the user nor the system needs (or seems to need?) any files from the dead server (e.g. remote-mounted man pages).

You may not need them, but your shell may!

Seriously, if you have an inaccessible NFS-mounted directory anywhere in your path, you are likely to see this problem. For example, my "/usr/local/`arch`/bin" occurs in my path ahead of my home directory (possibly dumb, but never mind), and my ".login" includes a reference to a private script in my home directory. Obviously the shell had to try to scan "/usr/local/`arch`/bin" to resolve this.

For a long time I mounted "local-host:/usr/local" on my "/usr/local", which led to the problem you described whenever "local-host" was down. Eventually I switched to using the automounter: my "/usr/local" is now a symbolic link to "/vol/local", and "auto.vol" defines a couple of alternative servers for "/usr/local". Not only do I get fault tolerance, but if neither server is available the automounter will time out after a sensible interval.

If the Targon system doesn't support the automounter yet, ask the vendor when they'll be upgrading to an NFS based on NFSSRC4.0.

Geoff Arnold, Internet: geoff@East.Sun.COM
PCDS Group, Sun Microsystems Inc.
---------------------------------------------------------------------------
I love standards: de facto, de jure, de gustibus, de lusionary, de credo...
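One way to check whether a search path is exposed to this problem is to compare its entries against the list of NFS mount points as plain strings, so that the check itself never touches the (possibly dead) server. The following is only a rough sketch: the function name `path_dirs_on_mounts` and the sample values are made up for illustration, and a purely textual comparison will miss entries that reach an NFS file system through a symbolic link.

```python
import posixpath


def path_dirs_on_mounts(path_value, mount_points):
    """Return the PATH entries that lie at or below one of the mount points.

    Only strings are compared; no entry is stat()ed, so this cannot hang
    on a dead server. Symbolic links are NOT resolved (a known limitation).
    """
    hits = []
    for entry in path_value.split(":"):
        if not entry:
            continue  # an empty entry means "current directory"; skip it
        norm = posixpath.normpath(entry)
        for mount_point in mount_points:
            mp = posixpath.normpath(mount_point)
            # Match the mount point itself or anything underneath it.
            if norm == mp or norm.startswith(mp.rstrip("/") + "/"):
                hits.append(entry)
                break
    return hits


if __name__ == "__main__":
    # Hypothetical PATH and mount point, for illustration only.
    sample_path = "/usr/local/sun3/bin:/usr/ucb:/bin:/usr/bin"
    print(path_dirs_on_mounts(sample_path, ["/usr/local"]))
```

Any entry this reports is a directory the shell may have to scan during command lookup, and so a candidate for moving later in the path or behind the automounter.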
barmar@kulla (Barry Margolin) (10/01/89)
In article <1989Sep28.195215.4656@tubsibr.uucp> petri@tubsibr.UUCP (Stefan Petri) writes:
> Problem: when an NFS server dies, `login' on all of its clients hangs somewhere after displaying /etc/motd (saying "nfs-server gargel not responding, still trying"), even if neither the user nor the system needs (or seems to need?) any files from the dead server (e.g. remote-mounted man pages).

We've seen this on Suns running SunOS 4.0.3, as well as earlier releases. And it isn't only the shell that hangs; I've seen "rn" and "emacsclient" hang, as well.

Someone else already mentioned the possibility that a directory mounted from the dead server is in your PATH. Another problem is NFS-mounted files adjacent to one of the directories between your working directory and the root. This can cause the getwd() function to hang. Here's why:

In order to find out the name of a directory, getwd() first stat()s "." to find out its inode#. Then it scans ".." and stat()s each file until it finds one with the same inode#. If one of these files is an NFS mount point, or a symbolic link to a path on an NFS-mounted file system, the stat() must contact the NFS server, and will hang if it's down.

With the automounter, this causes an additional problem: even when all the NFS servers are up, getwd() may still take a long time to complete, because many of the file systems it encounters may not be mounted yet, and it takes a significant fraction of a second to mount each one.

I think the solution is for getwd() to use lstat() rather than stat(), and to keep mount points out of the root directory. This way, getwd() will encounter only symbolic links there, and lstat() shouldn't need to access the target of a link.

Barry Margolin, Thinking Machines Corp.
barmar@think.com
{uunet,harvard}!think!barmar
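The walk described above, with the proposed lstat() change, can be sketched roughly as follows. This is an illustration of the idea only, not the SunOS library source: the function name `getwd_lstat` is made up, and the sketch compares device and inode numbers together rather than the inode number alone.

```python
import os


def getwd_lstat(start="."):
    """Rebuild the absolute path of `start` by walking up through '..'.

    Entries in each parent directory are examined with lstat(), so a
    symbolic link pointing into an NFS file system is never followed.
    (A real mount point in the parent would still be contacted.)
    """
    parts = []
    here = start
    here_st = os.stat(here)
    while True:
        parent = os.path.join(here, "..")
        parent_st = os.stat(parent)
        here_id = (here_st.st_dev, here_st.st_ino)
        if (parent_st.st_dev, parent_st.st_ino) == here_id:
            break  # '..' is the same directory: we have reached the root
        for name in os.listdir(parent):
            try:
                st = os.lstat(os.path.join(parent, name))
            except OSError:
                continue  # entry vanished or is inaccessible; skip it
            if (st.st_dev, st.st_ino) == here_id:
                parts.append(name)
                break
        else:
            raise FileNotFoundError("no entry for current directory in " + parent)
        here, here_st = parent, parent_st
    return "/" + "/".join(reversed(parts))


if __name__ == "__main__":
    print(getwd_lstat())
```

Because only lstat() is applied to the scanned entries, a symbolic link such as one from "/usr/local" to an automounted directory is recognized as a link and skipped without any traffic to the server behind it.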
brent%terra@Sun.COM (Brent Callaghan) (10/04/89)
In article <30535@news.Think.COM>, barmar@kulla (Barry Margolin) writes:
> Someone else already mentioned the possibility that a directory mounted from the dead server is in your PATH. Another problem is NFS-mounted files adjacent to one of the directories between your working directory and the root. This can cause the getwd() function to hang. Here's why:
>
> In order to find out the name of a directory, getwd() first stat()s "." to find out its inode#. Then it scans ".." and stat()s each file until it finds one with the same inode#. If one of these files is an NFS mount point, or a symbolic link to a path on an NFS-mounted file system, the stat() must contact the NFS server, and will hang if it's down.

In SunOS 4.0 the getwd() algorithm was changed. If it crosses a mountpoint in its walk up the file tree, it takes a peek in /etc/mtab. It looks for a mount entry with the same device id, and when it finds the mountpoint it just steals the path from the /etc/mtab entry and prepends it to the current path.

The good news about this is that you don't have to walk all the way up to the root and can avoid stat'ing mountpoints in "/" or "/usr". The bad news is that you stat *all* the mountpoints in /etc/mtab - so the hanging problem could be worse than before.

To work around this problem, the device ids for all the mountpoints in /etc/mtab are cached in a file, /tmp/.getwd. As long as /etc/mtab is not updated (no mounts or unmounts), the cache can be consulted for pathname/device-id pairs instead of risking lots of stat'ing of mountpoints.

A problem with this scheme is that the getwd cache becomes invalid whenever /etc/mtab is modified. If you use the automounter, /etc/mtab can be updated so frequently that the getwd cache is almost useless.

In SunOS 4.1 (real soon now) the /tmp/.getwd cache is gone. The device id for each mountpoint is now kept in the /etc/mtab file itself. It lives in the mount options string as a hex number following the string "dev=". Commands like mount and automount that append entries to /etc/mtab just stat() the new mount (not much chance of hanging) and insert the "dev=" stuff into the mount option string. This is not user-visible - unless you cat /etc/mtab.

Since the device id is an invariant for the lifetime of a mount, it makes sense for the device id to live in /etc/mtab with all the other mount-invariant information. Any program like getwd(), df or find that likes to stat mountpoints to get a device id can avoid the stat if the "dev=" is present.

Now getwd() doesn't need to stat() mountpoints at all (unless the dev= is missing for some reason). This seems to be working well - I think we've got that one licked (finally).

Brent

Made in New Zealand --> Brent Callaghan @ Sun Microsystems
uucp: sun!bcallaghan phone: (415) 336 1051
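To make the "dev=" scheme concrete, here is a small sketch of how a program could recover device ids from mtab-style text without stat()ing any mountpoint. It assumes the usual whitespace-separated layout (file system, mount point, type, options, ...) with a comma-separated option string; the function name `parse_mtab_devs` and the sample entries are invented for illustration and are not taken from SunOS.

```python
def parse_mtab_devs(mtab_text):
    """Map device id -> mount point for every entry that carries dev=.

    The hex number after "dev=" in the option string is taken as the
    device id. Entries without a usable dev= option are left out, which
    is where a caller would have to fall back to stat().
    """
    table = {}
    for line in mtab_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or malformed line
        mount_point, options = fields[1], fields[3]
        for opt in options.split(","):
            if opt.startswith("dev="):
                try:
                    table[int(opt[4:], 16)] = mount_point
                except ValueError:
                    pass  # unparsable value: treat as if dev= were missing
    return table


if __name__ == "__main__":
    # Hypothetical mtab contents, for illustration only.
    sample = (
        "/dev/sd0a / 4.2 rw,dev=0702 1 1\n"
        "gargel:/usr/man /usr/man nfs ro,hard,dev=8201 0 0\n"
    )
    for dev, mount_point in sorted(parse_mtab_devs(sample).items()):
        print(hex(dev), mount_point)
```

A getwd()-style caller that has just crossed a mountpoint would look up the device id of the directory it is in (already known from stat'ing ".") in this table and prepend the matching mount point, with no request sent to any server.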
dwh@twg-ap.UUCP (Dave Hamaker) (10/06/89)
From article <1989Sep28.195215.4656@tubsibr.uucp>, by petri (Stefan Petri):
> Configuration: Targon/35-M50 with TOS3.2 and NFS3.2 (that's a clone of the Pyramid 9810 with OSx4.0, I think), linked via Ethernet to some Sun3/60s (SunOS 3.5 and 4.0.3).
>
> Problem: when an NFS server dies, `login' on all of its clients hangs somewhere after displaying /etc/motd (saying "nfs-server gargel not responding, still trying"), even if neither the user nor the system needs (or seems to need?) any files from the dead server (e.g. remote-mounted man pages).

You might look at your shell profile file(s). The default /etc/profile on my system did a ". /etc/dfspace" just before printing /etc/motd. /etc/dfspace is a shell script which edits the output of df -t with awk to tell you about available disk space. It wasn't written with NFS in mind, but turns out to ignore the mounted NFS filesystems which df reports. When I changed the dfspace reference to report specifically on my normal local filesystems (e.g. ". /etc/dfspace" -> "/etc/dfspace / /usr"), login hanging ceased.

Dave Hamaker
The Wollongong Group
dwh@twg.com
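The same idea (only ever ask about local filesystems) can be derived from the mount table instead of being hard-coded. The sketch below is an assumption-laden illustration: it presumes an mtab-style layout of "file system, mount point, type, ..." per line and that remote mounts are marked with the type "nfs"; the function name `local_mount_points` and the sample text are hypothetical.

```python
def local_mount_points(mtab_text, remote_types=("nfs",)):
    """Return the mount points whose file system type is not remote.

    Works purely on the text of the mount table, so it does not touch
    any file system and cannot hang on a dead server.
    """
    points = []
    for line in mtab_text.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0].startswith("#"):
            continue  # blank, malformed, or comment line
        if fields[2] not in remote_types:
            points.append(fields[1])
    return points


if __name__ == "__main__":
    # Hypothetical mount table, for illustration only.
    sample = (
        "/dev/sd0a / 4.2 rw 1 1\n"
        "/dev/sd0g /usr 4.2 rw 1 2\n"
        "gargel:/usr/man /usr/man nfs ro,hard 0 0\n"
    )
    print(" ".join(local_mount_points(sample)))
```

The resulting list can then be handed to a disk-space report as explicit arguments, in the same way "/" and "/usr" are passed to /etc/dfspace above.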
petri (Stefan Petri) (10/08/89)
In article <1989Sep28.195215.4656@tubsibr.uucp>, I wrote:
[..]
> Problem: when an NFS server dies, `login' on all of its clients hangs somewhere after displaying /etc/motd
[..]

OK, I got the solution to my problem (mailed to me by guido%rwthinf%unido@infbs (Guido Bunsen)):

It's NOT a $PATH referencing the NFS directories.
It's NOT some /etc/dfspace or the like.
It's NOT getwd() stat()-ing around in a directory containing a mountpoint.

It IS the mount option "noquota", which is not documented for NFS filesystems in my manuals - but turns out to be essential for me.

Thanks to all those thinking about my problem.

S.P.
--
Stefan Petri <petri@tubsibr.UUCP>
Technische Universitaet Braunschweig, Institut fuer Betriebssysteme und Rechnerverbund, 3300 Braunschweig, W. Germany.