[comp.mail.headers] trying multiple addresses

MRC@PANDA.PANDA.COM (Mark Crispin) (02/15/88)

     I am sure the advocates of trying multiple mail addresses would
feel quite differently if they had to pay per-packet charges for network
access.  Historically, only a small percentage of network connection
failures -- typically less than 1% -- have been due to a dysfunctional
IP address.  The remaining (= overwhelming majority of) failures have
been due to dysfunctional networks, dysfunctional hosts, or dysfunctional
servers.

     It is possible that trying a different IP address may help in the
dysfunctional network case, although typically the "non-best" IP addresses
all involve the dysfunctional network in some way (look at some network
topology maps some time).  This is a relatively rare case anyway.

     Many times, the "non-best" IP address is substantially inferior to
the point where it should not be used under ANY circumstance.  No site
outside of Stanford should *ever* use SAIL's, Score's, or SUMEX-AIM's
net 36 IP address; the gateway between net 10 and net 36 (as well as the
net 36 subnet from that gateway) is seriously overloaded.

     If I understand JLarson.pa correctly, he's saying that Xerox.COM
will use SUMEX-AIM's net 36 address just because they couldn't connect
to the net 10 address the last time.  If this is common behavior it's
no wonder those of us who must use the net 10/36 gateway find it so
unusable.  Will I have to instruct the servers on multi-homed net 10/36
hosts to refuse connections on net 36 from non-net 36 hosts to get them
to stop?

     What about those guys multi-homed on a "free" and a pay-per-packet
X.25 net?  Do they appreciate this behavior?

     The *correct* solution to this problem is NOT kludgy algorithms in
the mailer.  The correct solution is multi-part, and involves:
1) complete the migration from the host table to the domain system.  The
   NIC simply cannot keep up with the changes in network topology (as the
   Xerox experience showed), and, frankly, it's unreasonable for us to
   expect them to.
2) domain database managers need to keep their name servers updated with
   changes to network topology.  TTL's should not be allowed to be so long
   that topology changes go unnoticed by resolvers for excessive periods
   of time.
3) better support needs to exist in the domain infrastructure for "best"
   IP address selection.

     This last point is important.  Presently, it is up to the local host
to decide upon a "best" IP address, based on quite incomplete information.
Many hosts (all Unix hosts?) simply pick the first IP address listed in
the NIC host table (or returned as A RR's from the domain system).  TOPS-20
selects in priority order: (1) first IP address from a directly connected
net that is "preferred" (e.g. a fast LAN), (2) first IP address from a
directly connected net that is "default" (e.g. a core net such as ARPANET),
(3) first IP address from any other directly connected net, (4) first IP
address.  "First IP address" means first from the address list from the
host table (or a set of A RR's from the domain system).  Note that there
is nothing whatsoever to do with "net 10".
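
     A minimal sketch of the priority order described above, written in
Python rather than anything TOPS-20 actually ran.  The function name, the
three network lists, and the example addresses are assumptions made for
illustration; networks are written as classful /8 prefixes.

```python
from ipaddress import ip_address, ip_network


def pick_best_address(addresses, preferred_nets, default_nets, other_connected_nets):
    """Return one address from `addresses`, which is kept in listed order
    (host table order, or the order of the A RRs).

    Each *_nets argument is a list of directly connected networks, given
    as prefixes such as "10.0.0.0/8".  All names here are hypothetical.
    """
    # Tiers (1) to (3): the first listed address on a "preferred" net,
    # then on a "default" net, then on any other directly connected net.
    for nets in (preferred_nets, default_nets, other_connected_nets):
        for addr in addresses:
            if any(ip_address(addr) in ip_network(net) for net in nets):
                return addr
    # Tier (4): no directly connected net matched, so take the first address.
    return addresses[0] if addresses else None


# Illustrative host with a net 10 and a net 36 address (made-up values).
host = ["10.0.0.11", "36.44.0.6"]
on_net_36 = pick_best_address(host, ["36.0.0.0/8"], ["10.0.0.0/8"], [])
elsewhere = pick_best_address(host, [], [], [])
```

     Under these assumptions a caller whose preferred net is 36 gets the
net 36 address, and a caller attached to neither net falls through to
the first listed address.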

     Almost 100% of the time, this makes the best possible choice of an
IP address.  It's only in those very few cases (which come up perhaps
2 or 3 times a YEAR!!!) where an otherwise highly desirable path breaks
for a long period of time that a problem comes up.  I consider it highly
objectionable to cycle through every other IP address (waiting a minute
or more for an IP retransmission timeout if the network is courteous
enough to tell me the other guy ain't there) every time I attempt to
connect to a dead host.

     JLarson's suggestion is less objectionable, but it involves one
piece of software (the mailer) telling a completely different piece of
software (host table or domain resolver) that the IP address given it
was sick.  Nobody wants to do the work to the host table software to
add such a feature.  It might be doable with the domain resolver (SRA
can comment on this); it certainly wouldn't be hard for the mailer to
pass on the word to the domain resolver.

     The problem is, what does "this IP address was sick" really mean?
How does "retransmission timeout" differ from "host dead" (a type 7
1822 message) differ from "host sent a reset" (refused the connection)
differ from any of the other ways a connection failed?  In which one(s)
of these do you say try another IP address, and in which one(s) do you
assume the host is really down, or really doesn't want to talk now?

     Again, what do you do about those cases when we really shouldn't
be using a particular IP address because of charging, or other
administrative issues?

     The domain system may be able to help; it was always my belief (I
remember suggesting this at the meeting when the domain system concept
was first invented) that nameservers should be allowed to tailor their
responses based on who was asking the question.  A domain query should
be something like: "I am on net 128.43 seeking an SMTP server for
FOO.BAR.COM, which is the best address for me to use?" and later on "I
am on net 128.43 seeking an SMTP server for FOO.BAR.COM and I already
tried 69.105.8.3, is there any other I should try?"

     The point is that a perfectly valid answer may be "if 69.105.8.3
ain't answering, he ain't up; try again later."

     This also gives the remote organization (which presumably knows
the status of their hosts) control over the IP address selection criteria,
based upon their knowledge instead of the local host's educated guesswork.
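
     A toy sketch of the kind of exchange described above.  No such
query exists in the domain system; the function, the shape of the
per-asker table, and the second address are all hypothetical, and only
69.105.8.3 and net 128.43 come from the example in the text.

```python
def answer_query(policy, asker_net, already_tried=()):
    """Answer "which address should I use?" for one service on one host.

    `policy` is a hypothetical table kept by the remote organization:
    it maps an asker's network to an ordered list of addresses, with
    "*" as the entry for everybody else.  Returns None when the answer
    is "there is nothing else to try; try again later".
    """
    candidates = policy.get(asker_net, policy.get("*", []))
    for addr in candidates:
        if addr not in already_tried:
            return addr
    return None


# Hypothetical policy for the SMTP server of FOO.BAR.COM.
smtp_policy = {
    "128.43": ["69.105.8.3"],           # askers on net 128.43 get one address
    "*": ["69.105.8.3", "10.3.0.44"],   # 10.3.0.44 is a made-up alternate
}

first = answer_query(smtp_policy, "128.43")
second = answer_query(smtp_policy, "128.43", already_tried=["69.105.8.3"])
```

     Here the first query from net 128.43 is answered with 69.105.8.3,
and the follow-up query that reports that address as already tried is
answered with None, which stands in for "try again later".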

     Please, no flames.  If you're going to babble on and on about how I
should break my mailer to conform to your fantasy of how the world should
work, send it to *NUL: or /dev/null or whatever you call it.  Furthermore,
I'm not interested in any comments about a host table based means of IP
address selection.  The systems I support do not use host tables (and, for
the record, are currently the only TOPS-20's supporting MX mailing).  I
can't help but feel that if the problem of a sick "best" IP address happens
to a domain-based mailer, that the fault is that of the management of the
nameserver for that organization and not that of the mailer.

     If you have constructive observations, then let's talk.  Remember
that this is not about porting arguably "better" (or "worse") ideas from a
16-year-old operating system to a 19-year-old operating system.  This is
about what's going to be done in the next generation, that maybe will be
ported to the 16 and 19 year-olds.  I think we can do better than any of
the guesswork, and we should, if the threats of pay-per-packet come to
pass.

-- Mark --
-------

WANCHO@SIMTEL20.arpa (Frank J. Wancho) (02/15/88)

For those who may be unaware, Mark's pay-per-packet comments *will*
apply to all MILNET hosts for outgoing traffic and TAC access sometime
in FY 89, unless the plans have been rescinded by some miracle.  And,
from what I've seen of the algorithm DCA intends to impose on us, it
ain't cheap!  When that happens, those sites, such as this one, which
support large mailing lists and anonymous ftp service may be forced to
withdraw those services... (or move to ARPANET where flat fees will
continue - no :).

--Frank

ron@topaz.rutgers.edu (Ron Natalie) (02/16/88)

I'm not sure which packet charges you are referring to.  Surely this presents
a problem on the IP over X.25 PDN, but if we actually had an IP (or the ISO
equivalent) surely some more equitable charging for non-completed calls would
prevail.

Second, one of the major reasons for network failure these days is routing
dysfunction.  Consider RUTGERS, a typical University network: we have two
paths to the outside world.  The first is through the traditional ARPANET
connection.  The other is via our NSF regional net to the NSFNet.  Routing
is such that I usually had a 50-50 chance of getting to the University of
MD by either route (Mimsy had a MILNET interface and a UMDNet->NSFNet i/f).
Some days the ARPANET/MILNET would be so poor we couldn't get connections
through, some days our REGIONAL would be broken, some days there were NSF
problems (the above are in decreasing order of probability by the way).

Routing is a sticky issue.  While there are people spending a lot of time
working on that, introducing this info into the already burdened domain
system is not likely the answer.  Routing technology needs to progress
so that a host has a decent way of finding out which address in the set
of addresses returned by the name server is likely to work.

None of this is actually a mail question (it's just where we notice it
a lot) so I suggest you move this discussion to some other mailing list.

-Ron

ron@topaz.rutgers.edu (Ron Natalie) (02/16/88)

Of course anyone silly enough to want to use a TAC ought to
have to pay for their stupidity.  As for charging for the
MILNET, while I can't say I was for it, it would just mean
that MILNET hosts that provide those services should have
to justify them as they should have been doing in the past.

-Ron

AI.CLIVE@MCC.COM (Clive Dawson) (02/16/88)

Re:	Date: Sun, 14 Feb 88 18:18:09 PST
	From: Mark Crispin <MRC@PANDA.PANDA.COM>
	Subject: trying multiple addresses

Mark--

Your message implies that trying an alternate IP address after the
best one has failed will only win 1% of the time.  I disagree.  As
somebody who still has the job of dealing with the day-to-day hassles
of "keeping the mail flowing", I cannot afford to sit back and
consider what the "correct solution" will ultimately be once the whole
world arrives at some standard.  I will gladly accept any "kludgy
algorithms" that get the job done with no adverse effects.

Not counting the host-table-related problems (I know, you don't want
to hear this!) such as SU's Sierra, whose 10 address still appears
over a year after it died, or the more recent HI-MULTICS/CIM-VAX
address problem at Honeywell, where both addresses work, but SMTP
listens only on one, I have also had to deal with a multitude of other
problems in just the last week which were all solved by using
non-"best" IP addresses.  The following scenario comes up A LOT, at
least at this site:
	Q: Why isn't my mail getting through to host X?
	A: The host must be down.
	Q: Then why can I Telnet to it?
	A: Ah, your workstation is not directly connected to net 10,
	   and is so "dumb" that it doesn't realize that net 10 is the
	   only way out of this place anyway, so it chooses to use host
	   X's 128 address.  Meanwhile, our "smart" mail relay host (and
	   gateway) is directly connected to net 10 and therefore
	   insists on sticking to host X's 10 address regardless.

How am I supposed to tell users that "correct failure" is better
than "kludgy success"?  They couldn't care less about these details,
and rightfully so.

	     If I understand JLarson.pa correctly, he's saying that Xerox.COM
	will use SUMEX-AIM's net 36 address just because they couldn't connect
	to the net 10 address the last time.  If this is common behavior it's
	no wonder those of us who must use the net 10/36 gateway find it so
	unusable.  Will I have to instruct the servers on multi-homed net
	10/36 hosts to refuse connections on net 36 from non-net 36 hosts to
	get them to stop?

Mark,  that's not how I understood JLarson at all.  He said that
this procedure was followed on RETRIES, and then ONLY at the
next retry interval.  This is quite different from adopting the
alternative IP address "permanently" thereafter for all deliveries.

It seems to me that this procedure is almost a complete win.  There is
no extra load on the sending mailer, and non-"best" IP addresses are
used only when they have to be, on a message-by-message basis.  It
might possibly be improved by a heuristic which says that if, on the
first retry, the second address is found to be down too, then 4 out
of every 5 (say) future retries should go back to using the "best" IP
address. This way, if SCORE is really down, the poor 10/36 gateway
won't get so much of a pounding with doomed connection attempts from
everybody.
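
The retry rule above can be sketched as follows.  This is one reading of
the heuristic, not code from any mailer mentioned here: the function
name, the attempt numbering, and the example addresses are assumptions,
and the "4 out of every 5" ratio is the illustrative figure from the
paragraph above.

```python
def address_for_attempt(addresses, attempt, alternate_also_failed, period=5):
    """Pick the address to use for one delivery attempt of one message.

    addresses[0] is the "best" address; the rest are alternates.
    attempt 0 is the first delivery attempt, attempt >= 1 are retries.
    alternate_also_failed says whether an alternate has already been
    tried for this message and found down too.
    """
    best = addresses[0]
    alternates = addresses[1:]
    if attempt == 0 or not alternates:
        return best                    # first attempt, or host is single-homed
    if not alternate_also_failed:
        return alternates[0]           # retry: try a non-"best" address
    if attempt % period == 0:
        # One retry in every `period` still probes an alternate.
        return alternates[(attempt // period) % len(alternates)]
    return best                        # the other retries go back to "best"


# Made-up addresses for a host on net 10 and net 36.
addrs = ["10.0.0.11", "36.44.0.6"]
later_retries = [address_for_attempt(addrs, n, True) for n in range(2, 12)]
```

With two addresses, retries 2 through 11 use the "best" address eight
times and the alternate twice (on retries 5 and 10), so a host that is
really down sees far fewer attempts through the alternate path.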

Cheers,

Clive
-------

MRC@PANDA.PANDA.COM (Mark Crispin) (02/16/88)

Clive -

     The NIC host table is a disaster, and the only way to fix the many
problems associated with it is to abandon it for domains.  Supported
TOPS-20 domain and mail software was announced a while ago; perhaps you
should consider adopting it at your site.  [As a side note, the problem
with Sierra's net 10 address being in the host table will go away very
soon since the plug is being pulled on Sierra.]

     Trying every possible IP address for a host which fails to respond
on its "primary" IP address is a recipe for performance disaster for any
background mailer with a large workload.  We're talking about thousands
of messages/day, friends, not a piddly couple of hundred.

     Furthermore, we all can rattle off the cases where the primary IP
address was wrong.  We can do this because there have been so few of
them!  What's more, most of those few are host-table only problems.  If
you run obsolete software, you should expect to have to put in some
maintenance on your own.

-- Mark --
-------

AI.CLIVE@MCC.COM (Clive Dawson) (02/16/88)

	     Trying every possible IP address for a host which fails to
	respond on its "primary" IP address is a recipe for performance
	disaster for any background mailer with a large workload.  We're
	talking about thousands of messages/day, friends, not a piddly couple
	of hundred.

Yes, Mark, I know what large workloads are.  Considering that 

	a) this discussion deals with multi-homed hosts which
	   comprise only about 5% of the total host population, and
	b) the success rate for first delivery attempts exceeds 95%, and
	c) once a message is in the retry queue the average number of
	   retries is greater than 1,

I would appreciate it if you would elaborate on just how this
"performance disaster" you predict will come about, especially for
sites using the algorithm we've discussed in the last few messages.
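
As a rough worked example of figures (a) and (b): the calculation below
assumes mail is spread evenly across hosts (so about 5% of messages go
to multi-homed hosts), that first-attempt failures are independent of
multi-homing, and a hypothetical load of 5000 messages/day.  None of
these three assumptions is a measured figure.

```python
def affected_messages(messages_per_day, multi_homed_share, first_attempt_failure):
    """Estimate how many messages per day could ever reach the point of
    trying a non-"best" address: they must be addressed to a multi-homed
    host AND fail their first delivery attempt."""
    fraction = multi_homed_share * first_attempt_failure
    return fraction, fraction * messages_per_day


# 5% multi-homed hosts, under 5% first-attempt failures, 5000 messages/day
# (the last number is purely illustrative).
fraction, per_day = affected_messages(5000, 0.05, 0.05)
```

Under those assumptions the fraction is 0.05 x 0.05 = 0.0025, i.e. about
a quarter of one percent of messages, or roughly a dozen messages a day
at that load.  Figure (c) would scale the number of extra connection
attempts per affected message, and is not modelled here.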

I'm glad to see you acknowledge that only "most" of these problems are
host-table related.  OK, let's eliminate these from consideration.
What do you propose that we do about the rest of the problems which
are not host-table related?  You seem to be resigned to the idea that
they should just be ignored because there are so few of them.  If
this is true, then a solution which only affects THESE FEW can hardly
cause a performance disaster.

Clive
-------

craig@NNSC.NSF.NET (Craig Partridge) (02/17/88)

Mark,

The two classic cases I see which justify trying multiple IP
addresses are:

    - network interfaces which fail.  So the host is up but the interface
    is down and people cannot or do not want to remove the old interface
    address from the domain system during its two or three day repair
    period.

    - transient network routing problems which people are having trouble
    tracking.  So one address works sometimes and the other all the time.

Craig

braden@venera.isi.EDU (02/17/88)

Egad!  I thought this question was settled about 4 years ago!  And the
answer was... JLarson is entirely right, and Mark Crispin is confusing 
a settled issue.

Bob Braden

MRC@PANDA.PANDA.COM (Mark Crispin) (02/17/88)

Craig -

     I have never said that trying multiple IP addresses must not be
done.  What I did say is that such a strategy is neither required nor
is it necessarily advisable.  This entire discussion got started because
Jordan Hayes declared that a mailer which does not try multiple IP
addresses is "broken" in the sense that a mailer which tosses out the
RFC821 return path information is broken.

     Now, nobody is going to think about fixing the BITNET mailers which
lose the RFC821 return path, and those BITNET guys are going to continue
to insist we should add this ridiculous "Errors-to:" header line in all
our mailers to work around a deficiency in their mailer.

     Thank you, Jordan, for your contribution towards making electronic
mail work better.

-- Mark --
-------

jordan@ucbarpa.Berkeley.EDU (Jordan Hayes) (02/17/88)

Mark talks all about "host table this" and "obsolete software this"
but the problem that I had (and I think J Larson had) was that,
due to NIC paperwork (i.e., changing *anything* about a host/gateway
that requires an NCD), there are things that will hose you for
sure.  I couldn't get the host table *or the domain database*
changed, so places like sushi.stanford.edu, mcc.com, a.isi.edu (a
*root domain server*!!!) bounced mail headed our way for 5 weeks.

I'm running all the correct software, i'm hip to the "now it's your
namespace, now you have control over it" thang, but I think it's a bit
naive to say that "Ah well, 1% of the time, you'll lose" ...

Summary:  Mark, i'd appreciate it if you'd try a bit harder not to
bounce my mail.  Thanks.

/jordan

hedrick@aramis.rutgers.edu (Charles Hedrick) (02/17/88)

I absolutely agree with Mark Crispin's comments about everybody moving
to domains and not using the NIC host table.  However I thought this
was a reasonable opportunity to mention some bad news.  For the last
couple of months, Rutgers computer science, math, and part of
engineering have been depending entirely upon the domain system.  We
have users who can't receive mail at all unless the sender's machine
can talk to the domain system.  We have been getting serious
complaints from faculty who are unable to get mail back from faculty
at other institutions.  We are now moving back to a system where all
from addresses will involve hosts that appear in the NIC host table.
The problem is that I am unwilling to make innocent bystanders pay the
cost of turkeys elsewhere who haven't implemented domains.  (I checked
the RFC's not long ago.  The deadline was Oct 85.)  I can't even
always get very mad at the system administrators involved.  There are
still vendors who don't support domains.  Unfortunately, nobody is
going to fail to buy a Convex minisupercomputer because their network
software doesn't support domains.  I wish I had a nice solution to put
at the end of this message, but I don't.  We may keep sending some
staff mail from machines that aren't listed in the NIC tables, just so
we still know who to complain about.