[comp.protocols.nfs] Is there a reliable way to write a file to an NFS server?

moore@chili.cs.utk.edu (Keith Moore) (03/26/91)

Background:

I'm working on a way to distribute electronic mail delivery, in order
to make it more reliable.  Currently we have about 100 machines of
various sizes that all mount /var/spool/mail from one place via NFS.
One consequence of this is that the mail server is a single
point-of-failure -- if it goes down, mail delivery stops for everyone.
Even if it's only down for an hour or two, this is annoying -- our
users expect e-mail to be as reliable as the telephone.  (Not that
they lose messages; they are just annoyed that they aren't delivered
immediately.)

My current idea for a mail delivery scheme replaces the
/var/spool/mail/$user file with a MESSAGES directory in each
user's home directory.  One or more mail servers (each of which has an
Internet MX record pointing to it) will access a recipient's MESSAGES
directory via NFS, and write a uniquely-named file in that directory
for each message delivered.  We will modify our mail user agents to
read files from this directory rather than from /var/spool/mail/$user.
(A similar scheme is used by CMU's Andrew Message Delivery System, which
is normally layered on top of the Andrew File System.)

There are several reasons for delivering the mail on the same file
system as the user's files.  The most important, but least obvious,
reason is that the user's file system must be present anyway before
mail can be delivered, otherwise the user's .forward file might be
ignored.  Delivery to the user's file system thus minimizes the
probability of failure: if the user needs access to both his own files
and to his incoming mail messages in order to work effectively, it's
best if they fail at the same time rather than independently (assuming
the same failure rate for both).

The Problem:

Mail delivery has to be absolutely reliable.  We would like to avoid
delays when possible, and under no circumstances is it acceptable for
the mail system to lose a message during transfer or delivery.

Our NFS load is distributed over several file servers which are used
almost exclusively for NFS access.  Under normal conditions, response
for NFS clients is quite good.  But during periods of peak load,
response degrades to the point at which a single NFS file operation
may require on the order of a minute to complete.  Under these
conditions, I have seen two instances within the last week in which
a file created on a server with the normal creat(), write(), ...,
close() sequence ended up zero length with no indication of error,
even though the return values on all system calls were checked.  I
have observed this both with "real" applications (like vi) and with a
simple C program I wrote.  My suspicion is that the initial NFS
create RPC was duplicated while waiting for the server to respond, and
the server performed one (or more) of them after the file was written.
My workstation is running SunOS 4.1.1, and the server 4.1, so these
are fairly recent implementations of the software.

Even though we could tune our NFS clients, servers, and workload to
avoid these problems to some degree (not that we haven't already done
this), this would only push back the knee of the curve, instead of
eliminating the problem entirely.  Unfortunately, this is not good
enough.  I realize that it will always be possible for the load on a
server to be so high that it cannot service any more requests, but I
would at least expect an error indication in this case.  My mail
delivery agent already has to cope with various temporary failures
(like when the user's file server is down), and this works fine.

So my question to this newsgroup is:

Is there any way I can reliably create a file on an NFS mounted
directory, write its contents, close it, and know whether the file was
written correctly and completely?

The best idea I've had so far is to issue the creat() syscall, then
wait several seconds (time t) to make sure any duplicate NFS create
RPCs have been processed by the server, then to write out the file and
close it (checking return codes, of course).  This might work if I can
come up with reasonable bounds for t.  (How long can one of these
things stay in a server's input queue, anyway?)
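
A minimal C sketch of the create-then-wait idea described above.  The
helper names (write_all, deliver_after_delay) and the settle_secs
parameter are hypothetical, and the fsync() call is an addition that is
not part of the original description.  It only illustrates the sequence
of calls; it does not show that any particular value of t is safe.

```c
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all len bytes, retrying on short writes and EINTR.
 * Returns 0 on success, -1 on error. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Hypothetical helper: create the file, wait settle_secs so that any
 * retransmitted create request has a chance to reach the server first,
 * then write the contents.  Returns 0 on success, -1 on any reported
 * error.
 */
int deliver_after_delay(const char *path, const char *buf, size_t len,
                        unsigned settle_secs)
{
    int fd = creat(path, 0600);
    if (fd < 0)
        return -1;

    /* The "time t" from the text; the right value is an open question. */
    sleep(settle_secs);

    /* fsync() is an assumption added here: it asks for the data to be
     * pushed to the server before close(). */
    if (write_all(fd, buf, len) < 0 || fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);   /* close() can also report a deferred error */
}
```

A caller would treat a -1 return as a temporary failure and requeue the
message, the same way it already handles a file server being down.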

--
Keith Moore / U.Tenn CS Dept / 107 Ayres Hall / Knoxville TN  37996-1301
Internet: moore@cs.utk.edu      BITNET: moore@utkvx

dhesi%cirrusl@oliveb.ATC.olivetti.com (Rahul Dhesi) (03/30/91)

In <1991Mar25.233245.15209@cs.utk.edu> moore@chili.cs.utk.edu (Keith
Moore) writes:

>Is there any way I can reliably create a file on an NFS mounted
>directory, write its contents, close it, and know whether the file was
>written correctly and completely?

>The best idea I've had so far is to issue the creat() syscall, then
>wait several seconds (time t) to make sure any duplicate NFS create
>RPCs have been processed by the server, then to write out the file and
>close it (checking return codes, of course).  This might work if I can
>come up with reasonable bounds for t.

What you might want to do is use a locking protocol, so that any
creation is atomic.  NFS makes this very hard to do.  The simplest
practical algorithm I could think of is attached.  Please check
carefully for minor bugs.

--------
Date:    22 Jan 91 07:41:25 GMT
From:    dhesi%cirrusl@oliveb.ATC.olivetti.com (Rahul Dhesi)
Newsgroups: comp.lang.perl
Subject: Re: Locking files across NFS
References: <1991Jan15.203815.25561@uvaarpa.Virginia.EDU>
Sender:  cirrusl!news
Organization: Cirrus Logic Inc.

In <1991Jan15.203815.25561@uvaarpa.Virginia.EDU> worley@compass.com
(Dale Worley) writes:

>Beware that Sun's locking daemons don't always work correctly.

Sun's locking daemons have never worked correctly whenever I have tried
them.  I finally decided that it would be better to rely on the
standard reliable UNIX method:  create a lock file.  I used this
successfully for a while.  Then I discovered with a shock that NFS has
no mechanism for ensuring exclusive creation of a file even if the O_EXCL
flag is given to open().  NFS does make symbolic links correctly.
I think it may even make hard links correctly.  The following algorithm
assumes that hard links are correctly created atomically.

So the only reliable mechanism that exists to do file locking over NFS
is the following or its equivalent.  If you want reliable locking that
is reasonably immune to locks being held by dead processes, I see no
way of making this algorithm any simpler.

int get_a_lock()
{
     if (create(symlink called MUTEX that points anywhere) == failed) {
	die("serious problem -- can't create MUTEX");
     }
     /* reach here when gained exclusive access */
     attempts = 0;
     while (++attempts < SOME_LIMIT) {
	if (create(some unique temp file called $TMP) == succeeded) {
	   to $TMP write our host name and pid;
	   break; /* done with while loop */
	} else {
	   sleep (a few seconds);
	}
      }
      if (attempts == SOME_LIMIT) {
	 unlink(MUTEX);
	 die("serious problem -- can't create temp file");
      }
   try_again:
      {
        static int loop_breaker;
	if (++loop_breaker > SOME_OTHER_LIMIT) {
	   loop_breaker = 0;
	   unlink($TMP);
	   unlink(MUTEX);
	   return LOCK_ATTEMPT_FAILED; /* or die here */
	 }
      }
      if (create(link from $TMP to LOCK) == success) {
	 /* we have the lock!! */
	 unlink($TMP);  /* not needed, link is now LOCK */
	 unlink(MUTEX); /* not needed, done its work */
	 return GOT_A_LOCK;
      } 
      /* failed to create link;  see if it's a stray link */
      if (LOCK doesn't exist) {
	 unlink($TMP);
	 unlink(MUTEX);
	 die("serious problem, LOCK nonexistent but can't create");
      }
      if (read(contents of LOCK) == failed) {
	 unlink($TMP);
	 unlink(MUTEX);
	 die("serious problem, can't read existing LOCK");
      }
      lock_host = name of host read from LOCK;
      lock_pid = pid read from LOCK;
      if (lock_host is our current host) {
	 /* see if process still alive */
	 if (kill(lock_pid, SIG_SEE_IF_IT'S_THERE) == ENO_SUCH_PROCESS) {
	    unlink(LOCK); /* must have been stray */
	    goto try_again;
	 } 
      }
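      /* LOCK is already held by existing process on this host
      or is on some other host */
      unlink($TMP);   /* don't leave our temp file behind */
      unlink(MUTEX);  /* or later callers could never get past MUTEX */
      return LOCK_ATTEMPT_FAILED;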
}
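
The core step of the pseudocode above (unique temp file, then a hard
link to LOCK) might look like this in C.  This is only a sketch: the
function name try_link_lock is hypothetical, it leaves out the MUTEX
symlink and the stale-lock check, and the link-count check after a
failed link() is an addition that is not in the algorithm above.  It
relies on the same assumption as the post, namely that the server
creates hard links atomically.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <sys/utsname.h>
#include <unistd.h>

/*
 * Hypothetical helper: try once to take the lock at lock_path.
 * Returns 0 if we now hold the lock, -1 otherwise.
 */
int try_link_lock(const char *lock_path)
{
    struct utsname u;
    struct stat st;
    char tmp[1024], id[512];
    int fd, n, got;

    if (uname(&u) < 0)
        return -1;

    /* Host name plus pid makes the temp name unique across clients. */
    n = snprintf(tmp, sizeof tmp, "%s.%s.%ld",
                 lock_path, u.nodename, (long)getpid());
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;

    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* Record who holds the lock, as in the pseudocode. */
    n = snprintf(id, sizeof id, "%s %ld\n", u.nodename, (long)getpid());
    if (n < 0 || (size_t)n >= sizeof id || write(fd, id, (size_t)n) != n) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }

    /* link() fails if lock_path already exists, so only one wins. */
    got = (link(tmp, lock_path) == 0);

    /* Added check: if the reply was lost but the link was made, the
     * temp file's link count will be 2. */
    if (!got && stat(tmp, &st) == 0 && st.st_nlink == 2)
        got = 1;

    unlink(tmp);   /* the lock, if held, lives on under lock_path */
    return got ? 0 : -1;
}
```

The holder releases the lock with unlink(lock_path).  A second caller
that gets -1 would go on to the stray-lock test in the pseudocode.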
--
Rahul Dhesi <dhesi@cirrus.COM>
UUCP:  oliveb!cirrusl!dhesi

dean@fyvie.cs.wisc.edu (Dean Luick) (03/30/91)

In <1991Mar25.233245.15209@cs.utk.edu> moore@chili.cs.utk.edu (Keith
Moore) writes:

>Is there any way I can reliably create a file on an NFS mounted
>directory, write its contents, close it, and know whether the file was
>written correctly and completely?

Under certain conditions, just stat'ing the file should be enough.
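
A sketch of what such a check could look like in C, assuming the writer
knows how many bytes it wrote.  The name verify_size is hypothetical.
One caveat: NFS clients cache file attributes, so a stat() issued from
the writing client may be answered from that cache rather than by the
server.

```c
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Hypothetical helper: return 0 if path is a regular file of exactly
 * expected_len bytes, -1 otherwise (including when stat() itself fails).
 */
int verify_size(const char *path, size_t expected_len)
{
    struct stat st;

    if (stat(path, &st) != 0)
        return -1;
    if (!S_ISREG(st.st_mode))
        return -1;
    return (st.st_size == (off_t)expected_len) ? 0 : -1;
}
```

This would catch the zero-length case described in the original
question, but only if the size it sees really comes from the server.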

dean

Dean Luick
University of Wisconsin-Madison Computer Sciences Dept.
uucp:	...!{allegra,harvard,seismo,topaz}!uwvax!dream!dean
arpa:	dean@cs.wisc.edu