[comp.protocols.nfs] hard vs. soft mounts on Suns and Pyramids

tr@dduck.ctt.bellcore.com (tom reingold) (04/28/89)

It says in the SunOS 4.x manual page for mount(8) that a soft mount
"returns an error if the server does not respond" and that a hard mount
"continues the retry request until the server responds".  In my
experience, the choice is worse than it appears:  Soft mounts do not
always return errors on writes; they can just garble the output without
warnings.  And hard mounts are not interruptible, even with the "intr"
option.

Further, it says that "filesystems that are mounted `rw' (read-write)
should use the `hard' option".  The problem with this is that the entire
client and all of its processes can hang when one process is trying to
write on a hard-mounted filesystem when the server is down!  This is a
terrible situation when the client is a large time-sharing system.

What are my real choices, and how do they differ from the manual?  What
else can I read on this?

Tom Reingold                   |INTERNET:       tr@ctt.bellcore.com
Bell Communications Research   |UUCP:           bellcore!ctt!tr
444 Hoes La room 1E225         |PHONE:          (201) 699-7058 [work],
Piscataway, NJ 08854           |                (201) 287-2345 [home]

ed@mtxinu.COM (Ed Gould) (04/29/89)

>And hard mounts are not interruptible, even with the "intr"
>option.

Hard mounts *are* interruptible with the "intr" option.  That's precisely
what the option is for, since soft mounts were always interruptible
in essence.

>Further, it says that "filesystems that are mounted `rw' (read-write)
>should use the `hard' option".  The problem with this is that the entire
>client and all of its processes can hang when one process is trying to
>write on a hard-mounted filesystem when the server is down!  This is a
>terrible situation when the client is a large time-sharing system.

The client shouldn't hang when waiting for a server, only processes
waiting for that server should block.  It's easy to be careless
and depend too heavily on a server, though.

-- 
Ed Gould                    mt Xinu, 2560 Ninth St., Berkeley, CA  94710  USA
ed@mtxinu.COM		    +1 415 644 0146

"I'll fight them as a woman, not a lady.  I'll fight them as an engineer."

mre@beatnix.UUCP (Mike Eisler) (05/02/89)

In article <840@mtxinu.UUCP> ed@garcia.mtxinu.COM (Ed Gould) writes:
>Hard mounts *are* interruptible with the "intr" option.  That's precisely
>what the option is for, since soft mounts were always interruptible
>in essence.

They are certainly designed and coded to be interruptible in SunOS,
but I know that in at least SunOS 3.2, they didn't work, or they took
a loooong time to interrupt. Can't say about SunOS 4.0 though. In
Lachman's NFS for System V.3, hard mounts were always instantly
interruptible.

>>Further, it says that "filesystems that are mounted `rw' (read-write)
>>should use the `hard' option".  The problem with this is that the entire
>>client and all of its processes can hang when one process is trying to
>>write on a hard-mounted filesystem when the server is down!  This is a
>>terrible situation when the client is a large time-sharing system.

Often, NFS file systems are mounted on a directory immediately under
"/". When a process does a getcwd() this will ultimately result in the
directory entries of "/" being searched and stat()ed. So if one of those
directories is a mount point for an NFS filesystem, and the server is
down, getcwd() will hang, and so will the process calling it.  Getcwd()
is called more often than one would think, so most of the interactive
processes on the client end up hanging. This is particularly noticeable
during login. The fix is to never mount NFS filesystems under a
directory that is likely going to be a component of somebody's cwd, and
"/" is in everyone's cwd. We mount all our NFS filesystems under a
directory called /n.
	-Mike Eisler
	{uunet,sun}!elxsi!mre
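
The lookup described above can be sketched roughly as follows. This is a
minimal Python illustration of the classic "walk up through .. and stat
every entry" approach, not the SunOS source. The names name_in_parent and
walk_up_getcwd are made up for this example, and the entries are sorted
only so that the scan order is predictable.

```python
import os

def name_in_parent(directory):
    """Find directory's name by scanning its parent, old-getcwd style.

    Returns (name, entries_statted). Hypothetical helper for illustration.
    """
    target = os.stat(directory)
    parent = os.path.join(directory, "..")
    statted = 0
    for entry in sorted(os.listdir(parent)):
        # Every entry before the match gets an lstat(). If one of them is
        # the mount point of a dead, hard-mounted server, this call blocks.
        st = os.lstat(os.path.join(parent, entry))
        statted += 1
        if (st.st_dev, st.st_ino) == (target.st_dev, target.st_ino):
            return entry, statted
    raise FileNotFoundError(directory)

def walk_up_getcwd(start="."):
    """Rebuild the absolute path of start one component at a time."""
    parts = []
    here = start
    while True:
        s = os.stat(here)
        p = os.stat(os.path.join(here, ".."))
        if (s.st_dev, s.st_ino) == (p.st_dev, p.st_ino):
            break  # "." and ".." are the same directory: we reached "/"
        name, _ = name_in_parent(here)
        parts.append(name)
        here = os.path.join(here, "..")
    return "/" + "/".join(reversed(parts))
```

The last step of the walk always scans "/", which is why a dead server
mounted directly under "/" can stall processes whose cwd is nowhere near it.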

fletcher@cs.utexas.edu (Fletcher Mattox) (05/02/89)

In article <840@mtxinu.UUCP> ed@garcia.mtxinu.COM (Ed Gould) writes:
>Hard mounts *are* interruptible with the "intr" option.  That's precisely
>what the option is for, since soft mounts were always interruptible
>in essence.

I realise that's what the manuals say.  And I've looked at the code
which purports to do this.  But the fact is that the "intr" option
just plain doesn't work much of the time.  

Does anybody know why this is so?

Thanks
Fletcher Mattox

Ps: Oh yeah, I'm *not* complaining about it taking a long, but finite,
amount of time to process the interrupt.

hedrick@geneva.rutgers.edu (Charles Hedrick) (05/03/89)

I have yet to see a case on the Sun where you couldn't ^C out of a
hung NFS connection when intr is set.  However it only checks for
interrupts at one point in the code.  Depending upon how timeouts
are set, it can take a couple of minutes to get out.  I agree that
this is very frustrating.  My three big gripes with NFS are:

  - ^C should happen immediately
  - df should not hang if one of the servers is hung.  It should
	simply print "server not responding" on that line and go
	on to the next.  (Pyramid attempted to make this work,
	by allowing a program to specify that disk access should
	be treated as soft just for this program, even if the
	disk is mounted hard.)
  - pwd, getcwd, etc., should be coded so that they don't hang
	unless your current directory is actually on the hung server

NFS is very useful, but our users have gotten to hate it.  (It doesn't
help that SunOS 4.0 has several bugs that cause NFS to hang.  They're
slowly being fixed, but our users are never going to forgive me for
putting them through this.  Of course *I'm* to blame personally for
all Sun bugs.)

alarson@pavo.SRC.Honeywell.COM (Aaron Larson) (05/03/89)

In article <May.2.15.58.31.1989.5175@geneva.rutgers.edu> hedrick@geneva.rutgers.edu (Charles Hedrick) writes:

    ....
     - pwd, getcwd, etc., should be coded so that they don't hang
	   unless your current directory is actually on the hung server

We've gotten around this particular bug (and in the process several related
ones) by hard mounting our partitions in a rather indirect manner.  On our
machines /n is a directory where we (conceptually) mount all the remote
file systems.  If you ACTUALLY mount remote partitions in one directory you
lose when one of the machines goes down.  This happens because as the
directory is scanned looking for machine foo, stat hangs on the down machine.
We get around this by actually mounting the partitions in another directory
structure, and link to them.  For example our /n dir looks like:

   /n/foo -> mumble/foo/mp
   /n/bar -> mumble/bar/mp

We mount foo's partitions on top of mumble/foo/mp, and reference them
through /n/foo (the actual structure is not important other than you must
be sure that only one machine's partitions are mounted in any one
directory).  Now, when the /n directory is scanned, the process won't hang
because it won't actually stat a directory that has mount points until it
finds the desired link.  Since doing this, we've not had any problems with
hanging processes unless they actually reference the partitions of a down
host (as df and friends do).  It even appears that the time spent following
the links is more than offset by the time spent STATing the remote
partitions!

Aaron Larson  MN65-2100              (612) 782-7308
Honeywell Systems & Research Center  alarson@SRC.Honeywell.COM  (internet)
3660 Technology Drive                alarson@srcsip             (uucp)
Mpls, MN  55418                      {umn-cs,ems,bthpyd}!srcsip!alarson
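
The layout above can be tried out in a scratch directory with a short
Python sketch. The function names and the use of an absolute link target
are assumptions made for this example; "mumble" is just the placeholder
from the post, and no real mounts are involved.

```python
import os
import stat
import tempfile

def build_layout(root, hosts):
    """Create root/n/<host> -> root/mumble/<host>/mp for each host."""
    n = os.path.join(root, "n")
    os.mkdir(n)
    for host in hosts:
        mount_point = os.path.join(root, "mumble", host, "mp")
        os.makedirs(mount_point)  # the real NFS mount would go here
        os.symlink(mount_point, os.path.join(n, host))
    return n

def scan_without_following(directory):
    """lstat() every entry, as a directory scan would; True means symlink."""
    return {
        name: stat.S_ISLNK(os.lstat(os.path.join(directory, name)).st_mode)
        for name in os.listdir(directory)
    }

with tempfile.TemporaryDirectory() as root:
    n = build_layout(root, ["foo", "bar"])
    print(scan_without_following(n))
```

Because lstat() reports on the link itself, scanning /n never touches a
mount point; only resolving /n/foo does, and then only foo's.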

beepy%commuter@Sun.COM (Brian Pawlowski) (05/04/89)

<I wasn't sure if this made it to the newsgroup - I'll try
  again - bjp>
	

Carl Smith and Mark Stein have given me some notes on what intr, hard
and soft mean to an NFS client. This doesn't answer all your questions,
but does put them in context. This information is pretty accurate
for UNIX client implementations of NFS derived from the NFS/ONC
reference port.

The analogies are entirely my own.

An NFS client will timeout a request if the server does not respond
in some (user specifiable) period. This is coupled with a retrans count
and a backoff mechanism on the timeouts to deal with slow servers.
A server must be able to deal with multiple, duplicate requests arising
from retries as a result of his tardy responses.

Mark Stein gave an enlightening talk on the timeout strategy
for a UNIX client during the MVS/NFS server development:

The mount operation allows specifying a timeout value - timeo,
and a retry value - retrans - the number of retransmissions
of the NFS operation, and whether the mount is hard or soft.
The soft option returns an error if the server does not respond (as
described below), whereas hard says continue the retry request until the
server responds.

The intr option may be added to modify the behaviour of hard mounts,
and allows keyboard interrupts to stop the retransmissions.

These are described in the following pictures.

A normal NFS request (which is successful first shot) is processed 
as follows:

        Client	        The Ether                Server
        ------          ---------                ------

        NFS --->        --------->                      |
                        request	                        |
                                                <------ |
                                                server  |  Increasing
                       	                       responds |  Time
                        <---------                      |
                          Response                      |
        Client                        	                |
        Continues                                       |
                	                                V

The following picture shows timeouts (timeo value is entered
in tenths of seconds) up to a retry value (4) against an
unresponsive server:

        Client	        The Ether                Server
        ------          ---------                ------

        NFS --->   -    --------->                      |
                   |    request	                        |
                timeo = 7                               |
                   |                                    |
        NFS --->   -    --------->                      |
                   |    request	                        |
                timeo = 14                              |
                   |                                    |  Increasing
        NFS --->   -    --------->                      |   Time
                   |    request	                        |
                timeo = 28                              |
                   |                                    |
        NFS --->   -    --------->                      |
                   |    request	                        |
                timeo = 56                              |
 retrans = 4       |                                    |
                   -    --------->                      |
                        request                         |
        <-------                                        |
         Timeout returned                               |
         to caller IF SOFT! - A Major Timeout           |
         else if HARD or INTR, double                   |
         timeo and reenter loop.                        |
                	                                V

A TIMEOUT is registered on the client from NFS only after the timeo
interval has elapsed for each of the specified number (retrans) of
retransmissions. The initial timeo value itself may depend on the
type of operation (write vs. getattr vs. read) in a given NFS
client implementation. On each retransmission, the timeo value is doubled.

If a server is mounted soft, the timeout is returned to the calling
procedure or program.

If a server is mounted hard, NFS will back off (double the current timeo
value on each major timeout, up to some maximum) and attempt again with
the new, longer timeo value for the specified retrans count (with the
current timeo value again doubled at each retransmission). The initial
timeo on entry to each retransmission cycle has a maximum value of 30
seconds. The timeo within a retransmission sequence has a maximum value
of 60 seconds. timeo is specified in tenths of seconds.
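
The schedule just described can be worked through with a small Python
sketch. It follows one reading of the text above (the starting timeo of
each cycle doubles after a major timeout), the function names are invented
for this example, and real clients may differ in detail. All values are in
tenths of a second.

```python
MAX_CYCLE_START = 300  # 30 seconds: cap on timeo at the start of a cycle
MAX_TIMEO = 600        # 60 seconds: cap on timeo within a cycle

def retransmit_waits(timeo, retrans):
    """Waits for one cycle: timeo doubles at each retransmission."""
    waits = []
    current = timeo
    for _ in range(retrans):
        waits.append(min(current, MAX_TIMEO))
        current *= 2
    return waits

def hard_mount_cycles(timeo, retrans, cycles):
    """Successive cycles on a hard mount: the starting timeo doubles
    after every major timeout (an assumption about the wording above)."""
    schedule = []
    start = timeo
    for _ in range(cycles):
        schedule.append(retransmit_waits(min(start, MAX_CYCLE_START), retrans))
        start = min(start * 2, MAX_CYCLE_START)
    return schedule

print(retransmit_waits(7, 4))      # the values shown in the picture
print(hard_mount_cycles(7, 4, 3))  # what a hard mount would do next
```

With timeo=7 and retrans=4 the first cycle waits 7, 14, 28 and 56 tenths,
so a soft mount would report its major timeout after 10.5 seconds.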

If the server is mounted intr, this is the same as hard, except that
on major timeouts (current, aged timeo value times retrans count with
backoff) a software interrupt may force an error return of timeout
to the calling procedure or program. In older implementations of NFS,
an interrupt can only slip in on a major timeout, so a request that has
an aged timeo value with even a small retrans count can take a mighty
long time indeed to respond when a server is mounted intr. Later
implementations allow the interrupt to stop retransmissions much sooner.

Sometimes a mount may seem uninterruptible when, in an older NFS
implementation, the client has actually backed off so far that the
window for the interrupt to take effect is a long way off.

Now on soft mounts garbling the data: this is entirely application
dependent. If applications check their errors on write()'s
(mine do :-) then they will see the error and will most likely
abnormally end. Most applications probably do not, so you get intermittent
failures, some successes, and resulting garbled data. Now what
I forget to do is to check the return values on close() - where you
may see an asynchronous error from a previous write() call. This may
be where you see writes returning OK - but if you check your close(),
it will probably fail. Remember - the writes to the server are REALLY
asynchronous to your application given the buffering inherent
in UNIX (which exists between your application and NFS). The fact
that the write returns OK to the application, and may later fail
(soft mounted) is consistent with normal UNIX behaviour for say
a failing disk - where the error is detected at some time after
the write() returns OK to the application, when the system actually
attempts to write the buffer to disk.
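
A minimal sketch of the checking described above, in Python, where a
failing write() or close() surfaces as OSError. The function name
checked_write is made up for this example, and the fsync() call is an
addition not mentioned above: it asks for the data to be pushed out before
close() so that a late error has a chance to show up.

```python
import os

def checked_write(path, data):
    """Write bytes to path. Return None on success, else an error string."""
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    except OSError as exc:
        return f"open failed: {exc.strerror}"
    error = None
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may write less than asked
            view = view[written:]
        os.fsync(fd)  # extra step (assumption): flush before closing
    except OSError as exc:
        error = f"write failed: {exc.strerror}"
    try:
        os.close(fd)  # errors from earlier buffered writes can land here
    except OSError as exc:
        error = error or f"close failed: {exc.strerror}"
    return error
```

A caller that ignores the return value is back where it started, with
writes that appear to succeed and data that may never reach the server.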

That is why the safest bet, for critical data (such as the NFS
files which represent your ROOT PARTITION for diskless clients),
is to hard mount the file system. If you like living on the ragged edge,
specify the intr option on writable partitions - then you have
the control as to whether or not you'll trash your file writes
in process - with the same behaviour as if you've interrupted
write()'s to a local hard disk. An analogy is that mounting hard
with the intr option makes your server most resemble local hard disk
for your applications.

If you mount your active writable filesystems soft,
you might consider taking up skyjumping for a hobby where you use
randomly defective parachutes for that certain extra thrill.

Some work is being done for future NFS releases which implements
a dynamic retransmission algorithm which would affect the above
discussion. This is pretty valid for UNIX clients out there
now.

Brian Pawlowski
Manager ONC Porting


			Brian Pawlowski <beepy@sun.com> <sun!beepy>
			Sun Microsystems, Portable Software Products

paul@morganucodon.cis.ohio-state.edu (Paul Placeway) (05/16/89)

In article <21283@srcsip.UUCP> alarson@pavo.SRC.Honeywell.COM (Aaron Larson) writes:

   In article <May.2.15.58.31.1989.5175@geneva.rutgers.edu> hedrick@geneva.rutgers.edu (Charles Hedrick) writes:

       ....
	- pwd, getcwd, etc., should be coded so that they don't hang
	      unless your current directory is actually on the hung server

   We get around this by actually mounting the partitions in another directory
   structure, and link to them.  For example our /n dir looks like:

      /n/foo -> mumble/foo/mp
      /n/bar -> mumble/bar/mp

   We mount foo's partitions on top of mumble/foo/mp, and reference them
   through /n/foo (the actual structure is not important other than you must
   be sure that only one machine's partitions are mounted in any one
   directory).

This is quite similar to what we do here.  For each client, we have a
local directory /n, and in it a _local_ directory for each machine,
inside of which are the NFS mount points:

	/n/dinosaur/0/paul
	   ^        ^
	   |        NFS mount point
	   LOCAL directory containing all mount points for dinosaur
	   (the staff server)

When pwd comes crawling down, it goes from the nfs mounts into a
directory of mount points for just that server (which must be up to
get this far), and then into /.  We also have the servers mount the
disks in the same places so we don't go out of our minds when on a
server rather than a client.

We generally soft mount everything with timeo=20 and retrans=8.
Despite what anyone from Sun may say, we haven't had any problems with
soft mounting most everything, (and no, none of us skydive :-) ).  I
have my own machine hard mount all of its own server's partitions,
since if the server is down my sun hangs anyway (no local disks here).
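
For a rough idea of what timeo=20 and retrans=8 mean for a process stuck on
a dead server, here is a small Python sketch. It assumes the doubling
schedule with a 60-second cap described earlier in this thread, which a
given client may not follow exactly; the function name is invented for
this example.

```python
def seconds_until_soft_error(timeo, retrans, cap=600):
    """Total wait before a soft mount returns its timeout error.

    timeo and cap are in tenths of a second; the result is in seconds.
    """
    total = 0
    wait = timeo
    for _ in range(retrans):
        total += min(wait, cap)  # each wait doubles, up to the cap
        wait *= 2
    return total / 10

print(seconds_until_soft_error(20, 8))
```

Under those assumptions the error comes back after about four minutes.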

On the other hand, waiting forever and a day for ^C to respond is
_real_ annoying.  I'm surprised that Sun didn't put in a special check
for NFS while they were fattening the kernel with other stuff...

		-- Paul