[comp.mail.sendmail] How do I recover from NFS hangups from within sendmail?

moore@utkcs2.cs.utk.edu (Keith Moore) (09/10/89)

Briefly, here's the problem:

We have all of our UNIX systems organized so that they appear to share a 
common filesystem via NFS mounts.  Users' home directories are on their
own machines if possible in order to minimize net traffic and load on the
file servers.  Other users have home directories spread across four servers.
We also share a single /var/spool/mail directory among all of these machines.
Each sendmail is set up to forward local mail to the "mail server"
system.  All of this works fine as long as all of the systems are up.  But
if a system holding someone's home directory goes down and someone tries
to send mail to that user, sendmail hangs trying to open that user's
.forward file.  What I'd like to do is modify sendmail's forwarding code
to detect when the user's .forward file is temporarily unavailable, and
to mark mail for that user as temporarily undeliverable.  Is there any way
to do this?  Shouldn't there be?

This weekend our mail server's sendmail shut down because of the failure
of a single machine owned by a user who gets a lot of mail. 

It's beginning to look as if kernel mods are the cleanest way out...

Our mail server's sendmail is running 5.61+IDA patches and Ultrix 3.0.

Keith Moore			Internet: moore@cs.utk.edu
University of Tenn. CS Dept.	BITNET: moore@utkvx
107 Ayres Hall, UT Campus	UT Decnet: utkcs2::moore
Knoxville Tennessee 37996-1301	Telephone: +1 615 974 0822

cfe+@andrew.cmu.edu (Craig F. Everhart) (09/12/89)

We in the Andrew project at CMU gave up on using sendmail to touch
anything but files that were on the machine's local disk, for reasons
much like what you outlined.  We wound up re-writing the whole local
transport mechanism for AFS (Andrew File System--yes, not NFS) so that
it would be sensitive to the existence of transient failures.  Not only
that, but the AFS developers were working in the next-door offices, so
we had an ``opportunity'' to make sure that transient errors were
distinguishable from persistent ones by returning different values in
errno.  (Thus, an open()-for-reading that fails with an errno of ENOENT
is an authoritative statement of the absence of some file or directory,
while other errno values, such as ETIMEDOUT, are returned to indicate
some transient problem such as a server or network outage.)
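
The errno convention just described can be sketched in C as follows.
This is an illustration only, not code from AMDS or AFS: the names
fwd_status, classify_open_errno and probe_forward are hypothetical, and
treating every errno other than ENOENT as transient is a deliberately
conservative assumption.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical outcome of looking for a file such as a .forward. */
enum fwd_status { FWD_PRESENT, FWD_ABSENT, FWD_TEMPFAIL };

/* Map the errno from a failed open() onto the two kinds of failure:
 * ENOENT is an authoritative "does not exist"; anything else
 * (ETIMEDOUT, for instance) is assumed to be transient. */
enum fwd_status classify_open_errno(int err)
{
    if (err == ENOENT)
        return FWD_ABSENT;
    return FWD_TEMPFAIL;
}

/* Open `path` for reading and report which of the three cases applies. */
enum fwd_status probe_forward(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd >= 0) {
        close(fd);
        return FWD_PRESENT;
    }
    return classify_open_errno(errno);
}
```

Only FWD_ABSENT would justify delivering without forwarding;
FWD_TEMPFAIL would mean the message should be queued and retried.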

Two things:
(1) We expect that this whole local mail delivery system (AMDS, Andrew
Mail Delivery System) will be available on the X11R4 tape under
contrib/andrew; and
(2) Does NFS have some collection of rules for indicating transient vs.
persistent failures?  What are they?  Whatever they are, I'm real
interested in finding out, and they could be the way out of Keith
Moore's problems, too.

		Thanks,
		Craig Everhart