moore@utkcs2.cs.utk.edu (Keith Moore) (09/10/89)
Briefly, here's the problem:

We have all of our UNIX systems organized so that they appear to share a common filesystem via NFS mounts. Users' home directories are on their own machines where possible, to minimize network traffic and load on the file servers; other users' home directories are spread across four servers. We also share a single /var/spool/mail directory among all of these machines, and the sendmail on each machine is set up to forward local mail to the "mail server" system.

All of this works fine as long as all of the systems are up. But if a system holding someone's home directory goes down and someone tries to send mail to that user, sendmail hangs trying to open that user's .forward file.

What I'd like to do is modify sendmail's forwarding code to detect the case where the user's .forward file is temporarily unavailable, and to mark mail for that user as temporarily undeliverable. Is there any way to do this? Shouldn't there be? This weekend our mail server's sendmail shut down because of the failure of a single machine owned by a user who gets a lot of mail. It's beginning to look as if kernel mods are the cleanest way out...

Our mail server is running sendmail 5.61 with the IDA patches, under Ultrix 3.0.

Keith Moore                      Internet: moore@cs.utk.edu
University of Tenn. CS Dept.     BITNET: moore@utkvx
107 Ayres Hall, UT Campus        UT Decnet: utkcs2::moore
Knoxville Tennessee 37996-1301   Telephone: +1 615 974 0822
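A minimal sketch of one user-space way to keep the delivery agent from hanging forever on the .forward file: arm a timer around the open() and treat an EINTR failure as a timeout. It assumes the blocked open() can actually be interrupted by a signal. On a hard NFS mount without the intr option it typically cannot, so this would not help there. The helper name open_with_timeout and its parameters are hypothetical, not part of sendmail, and the code is written for a present-day POSIX system rather than Ultrix 3.0.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* The handler does nothing; its only job is to make a blocked open()
   return early with EINTR. */
static void on_alarm(int sig)
{
    (void)sig;
}

/* Hypothetical helper: open `path` read-only, giving up after `seconds`
   (which must be nonzero, since alarm(0) arms no timer).
   Returns a file descriptor, or -1 with errno set by open(); EINTR means
   the timer fired, which the caller can treat as a timeout.
   Note: this replaces any alarm() already pending in the process. */
int open_with_timeout(const char *path, unsigned seconds)
{
    struct sigaction sa, old;
    int fd, saved_errno;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    /* sa_flags stays 0, so SA_RESTART is off and an interrupted open()
       fails with EINTR instead of being restarted. */
    if (sigaction(SIGALRM, &sa, &old) < 0)
        return -1;

    alarm(seconds);
    fd = open(path, O_RDONLY);
    saved_errno = errno;
    alarm(0);                          /* cancel the timer */
    sigaction(SIGALRM, &old, NULL);    /* restore the previous handler */

    errno = saved_errno;
    return fd;
}
```

A caller would treat a result of -1 with errno == EINTR as "temporarily undeliverable" and leave the message queued for a later retry.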
cfe+@andrew.cmu.edu (Craig F. Everhart) (09/12/89)
We in the Andrew project at CMU gave up on using sendmail to touch anything but files on the machine's local disk, for reasons much like the ones you outlined. We wound up rewriting the whole local transport mechanism for AFS (the Andrew File System; yes, not NFS) so that it would be sensitive to transient failures. Not only that, but the AFS developers were working in the next-door offices, so we had an ``opportunity'' to make sure that transient errors were distinguishable from persistent ones by returning different values in errno. (Thus, an open()-for-reading that fails with an errno of ENOENT is an authoritative statement that some file or directory does not exist, while other errno values, such as ETIMEDOUT, are returned to indicate a transient problem such as a server or network outage.)

Two things:

(1) We expect the whole local mail delivery system (AMDS, the Andrew Mail Delivery System) to be available on the X11R4 tape under contrib/andrew.

(2) Does NFS have some set of rules for indicating transient vs. persistent failures? What are they? Whatever they are, I'm really interested in finding out, and they could be the way out of Keith Moore's problem, too.

Thanks,
Craig Everhart
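The errno convention described above suggests a simple decision rule for a local delivery agent. The sketch below assumes that convention holds: ENOENT from open() means there is no .forward file, so mail is delivered locally; any other failure is treated as transient and the message is deferred with EX_TEMPFAIL from <sysexits.h>. The function names are hypothetical, and whether NFS reports errors this way is exactly the open question in the post.

```c
#include <errno.h>
#include <fcntl.h>
#include <sysexits.h>
#include <unistd.h>

/* Nonzero if an open() failure with this errno should be treated as
   transient (retry later) rather than as proof that the file is absent.
   Assumption: only ENOENT is authoritative, per the AFS convention. */
int errno_is_transient(int err)
{
    return err != ENOENT;
}

/* Hypothetical per-recipient check of a .forward file.
   On EX_OK, *fd_out is either an open descriptor (apply the .forward,
   then close() it) or -1 (no .forward: deliver locally).
   On EX_TEMPFAIL, the message should stay queued and be retried. */
int check_forward(const char *path, int *fd_out)
{
    int fd = open(path, O_RDONLY);

    *fd_out = fd;
    if (fd >= 0)
        return EX_OK;        /* .forward is readable */
    if (!errno_is_transient(errno))
        return EX_OK;        /* authoritative absence: deliver locally */
    return EX_TEMPFAIL;      /* server down, timeout, etc.: defer */
}
```

Treating every unrecognized errno as transient is the conservative choice: the worst case is a message that waits in the queue until a retry, rather than one delivered locally when the user meant it to be forwarded.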