moore@chili.cs.utk.edu (Keith Moore) (03/26/91)
Background: I'm working on a way to distribute electronic mail delivery in order to make it more reliable.  Currently we have about 100 machines of various sizes that all mount /var/spool/mail from one place via NFS.  One consequence of this is that the mail server is a single point of failure -- if it goes down, mail delivery stops for everyone.  Even if it's only down for an hour or two, this is annoying; our users expect e-mail to be as reliable as the telephone.  (Not that they lose messages -- they are just annoyed that the messages aren't delivered immediately.)

My current idea for a mail delivery scheme replaces the /var/spool/mail/$user directory with a MESSAGES directory in each user's home directory.  One or more mail servers (each of which has an Internet MX record pointing to it) will access a recipient's MESSAGES directory via NFS, and write a uniquely-named file in that directory for each message delivered.  We will modify our mail user agents to read files from this directory rather than from /var/spool/mail/$user.  (A similar scheme is used by CMU's Andrew Message Delivery System, which is normally layered on top of the Andrew File System.)

There are several reasons for delivering the mail on the same file system as the user's files.  The most important, but least obvious, reason is that the user's file system must be present anyway before mail can be delivered; otherwise the user's .forward file might be ignored.  Delivery to the user's file system thus minimizes the probability of failure: if the user needs access both to his own files and to his incoming mail messages in order to work effectively, it's best if they fail at the same time rather than independently (assuming the same failure rate for both).

The Problem:  Mail delivery has to be absolutely reliable.  We would like to avoid delays when possible, and under no circumstances is it acceptable for the mail system to lose a message during transfer or delivery.
Our NFS load is distributed over several file servers which are used almost exclusively for NFS access.  Under normal conditions, response for NFS clients is quite good.  But during periods of peak load, response degrades to the point at which a single NFS file operation may require on the order of a minute to complete.

Under these conditions, I have seen two instances within the last week in which a file created on a server with the normal creat(), write(), ..., close() sequence ended up zero length with no indication of error, even though the return values of all system calls were checked.  I have observed this both with "real" applications (like vi) and with a simple C program I wrote.  My suspicion is that the initial NFS create RPC was duplicated while waiting for the server to respond, and the server performed one (or more) of the duplicates after the file was written.  My workstation is running SunOS 4.1.1 and the server 4.1, so these are fairly recent implementations of the software.

Even though we could tune our NFS clients, servers, and workload to avoid these problems to some degree (not that we haven't already done this), this would only push back the knee of the curve instead of eliminating the problem entirely.  Unfortunately, that is not good enough.  I realize that it will always be possible for the load on a server to be so high that it cannot service any more requests, but I would at least expect an error indication in this case.  My mail delivery agent already has to cope with various temporary failures (such as the user's file server being down), and this works fine.

So my question to this newsgroup is:  Is there any way I can reliably create a file on an NFS-mounted directory, write its contents, close it, and know whether the file was written correctly and completely?
The best idea I've had so far is to issue the creat() syscall, then wait several seconds (time t) to make sure any duplicate NFS create RPCs have been processed by the server, then write out the file and close it (checking return codes, of course).  This might work if I can come up with reasonable bounds for t.  (How long can one of these things stay in a server's input queue, anyway?)
--
Keith Moore / U.Tenn CS Dept / 107 Ayres Hall / Knoxville TN 37996-1301
Internet: moore@cs.utk.edu    BITNET: moore@utkvx
dhesi%cirrusl@oliveb.ATC.olivetti.com (Rahul Dhesi) (03/30/91)
In <1991Mar25.233245.15209@cs.utk.edu> moore@chili.cs.utk.edu (Keith Moore) writes:

>Is there any way I can reliably create a file on an NFS mounted
>directory, write its contents, close it, and know whether the file was
>written correctly and completely?

>The best idea I've had so far is to issue the creat() syscall, then
>wait several seconds (time t) to make sure any duplicate NFS create
>RPCs have been processed by the server, then to write out the file and
>close it (checking return codes, of course).  This might work if I can
>come up with reasonable bounds for t.

What you might want to do is use a locking protocol, so that any creation is atomic.  NFS makes this very hard to do.  The simplest practical algorithm I could think of is attached.  Please check it carefully for minor bugs.

--------
Date: 22 Jan 91 07:41:25 GMT
From: dhesi%cirrusl@oliveb.ATC.olivetti.com (Rahul Dhesi)
Newsgroups: comp.lang.perl
Subject: Re: Locking files across NFS
References: <1991Jan15.203815.25561@uvaarpa.Virginia.EDU>
Sender: cirrusl!news
Organization: Cirrus Logic Inc.

In <1991Jan15.203815.25561@uvaarpa.Virginia.EDU> worley@compass.com (Dale Worley) writes:

>Beware that Sun's locking daemons don't always work correctly.

Sun's locking daemons have never worked correctly whenever I have tried them.  I finally decided that it would be better to rely on the standard reliable UNIX method: create a lock file.  I used this successfully for a while.  Then I discovered, with a shock, that NFS has no mechanism for ensuring exclusive creation of a file, even if the O_EXCL flag is given to open().

NFS does make symbolic links correctly.  I think it may even make hard links correctly.  The following algorithm assumes that hard links are created atomically.  So the only reliable mechanism that exists to do file locking over NFS is the following or its equivalent.
If you want reliable locking that is reasonably immune to locks being held by dead processes, I see no way of making this algorithm any simpler.

    int get_a_lock()
    {
        if (create(symlink called MUTEX that points anywhere) == failed)
            die("serious problem -- can't create MUTEX");

        /* reach here when we have gained exclusive access */
        attempts = 0;
        while (++attempts < SOME_LIMIT) {
            if (create(some unique temp file called $TMP) == succeeded) {
                write our host name and pid to $TMP;
                break;                  /* done with while loop */
            } else {
                sleep(a few seconds);
            }
        }
        if (attempts == SOME_LIMIT)
            die("serious problem -- can't create $TMP");

    try_again:
        {
            static int loop_breaker;
            if (++loop_breaker > SOME_OTHER_LIMIT) {
                loop_breaker = 0;
                unlink($TMP);
                unlink(MUTEX);
                return LOCK_ATTEMPT_FAILED;     /* or die here */
            }
        }

        if (create(link from $TMP to LOCK) == succeeded) {
            /* we have the lock!! */
            unlink($TMP);       /* no longer needed; the link is now LOCK */
            unlink(MUTEX);      /* no longer needed; it has done its work */
            return GOT_A_LOCK;
        }

        /* failed to create the link; see if LOCK is a stray lock */
        if (LOCK doesn't exist) {
            unlink($TMP);
            unlink(MUTEX);
            die("serious problem -- LOCK nonexistent but can't create it");
        }
        if (read(contents of LOCK) == failed) {
            unlink($TMP);
            unlink(MUTEX);
            die("serious problem -- can't read existing LOCK");
        }
        lock_host = host name read from LOCK;
        lock_pid  = pid read from LOCK;
        if (lock_host is our current host) {
            /* see if the owning process is still alive */
            if (kill(lock_pid, 0) == ENO_SUCH_PROCESS) {
                unlink(LOCK);           /* must have been a stray lock */
                goto try_again;
            }
        }

        /* LOCK is held by a live process on this host, or by some
           process on another host */
        unlink($TMP);
        unlink(MUTEX);
        return LOCK_ATTEMPT_FAILED;
    }

--
Rahul Dhesi <dhesi@cirrus.COM>
UUCP:  oliveb!cirrusl!dhesi
dean@fyvie.cs.wisc.edu (Dean Luick) (03/30/91)
In <1991Mar25.233245.15209@cs.utk.edu> moore@chili.cs.utk.edu (Keith Moore) writes:

>Is there any way I can reliably create a file on an NFS mounted
>directory, write its contents, close it, and know whether the file was
>written correctly and completely?

Under certain conditions, just stat'ing the file should be enough.

dean
--
Dean Luick    University of Wisconsin-Madison    Computer Sciences Dept.
uucp: ...!{allegra,harvard,seismo,topaz}!uwvax!dream!dean
arpa: dean@cs.wisc.edu