[comp.protocols.tcp-ip] trying multiple addresses

MRC@PANDA.PANDA.COM (Mark Crispin) (02/15/88)

     I am sure the advocates of trying multiple mail addresses would
feel quite differently if they had to pay per-packet charges for network
access.  Historically, only a small percentage of network connection
failures -- typically less than 1% -- have been due to a dysfunctional
IP address.  The remaining (= overwhelming majority of) failures have
been due to dysfunctional networks, dysfunctional hosts, or dysfunctional
servers.

     It is possible that trying a different IP address may help in the
dysfunctional network case, although typically the "non-best" IP addresses
all involve the dysfunctional network in some way (look at some network
topology maps some time).  This is a relatively rare case anyway.

     Many times, the "non-best" IP address is substantially inferior to
the point where it should not be used under ANY circumstance.  No site
outside of Stanford should *ever* use SAIL's, Score's, or SUMEX-AIM's
net 36 IP address; the gateway between net 10 and net 36 (as well as the
net 36 subnet from that gateway) is seriously overloaded.

     If I understand JLarson.pa correctly, he's saying that Xerox.COM
will use SUMEX-AIM's net 36 address just because they couldn't connect
to the net 10 address the last time.  If this is common behavior it's
no wonder those of us who must use the net 10/36 gateway find it so
unusable.  Will I have to instruct the servers on multi-homed net 10/36
hosts to refuse connections on net 36 from non-net 36 hosts to get them
to stop?

     What about those guys multi-homed on a "free" and a pay-per-packet
X.25 net?  Do they appreciate this behavior?

     The *correct* solution to this problem is NOT kludgy algorithms in
the mailer.  The correct solution is multi-part, and involves:
1) complete the migration from the host table to the domain system.  The
   NIC simply cannot keep up with the changes in network topology (as the
   Xerox experience showed), and, frankly, it's unreasonable for us to
   expect them to.
2) domain database managers need to keep their name servers updated with
   changes to network topology.  TTL's should not be allowed to be so long
   that topology changes go unnoticed by resolvers for excessive periods
   of time.
3) better support needs to exist in the domain infrastructure for "best"
   IP address selection.

     This last point is important.  Presently, it is up to the local host
to decide upon a "best" IP address, based on quite incomplete information.
Many hosts (all Unix hosts?) simply pick the first IP address listed in
the NIC host table (or returned as A RR's from the domain system).  TOPS-20
selects in priority order: (1) first IP address from a directly connected
net that is "preferred" (e.g. a fast LAN), (2) first IP address from a
directly connected net that is "default" (e.g. a core net such as ARPANET),
(3) first IP address from any other directly connected net, (4) first IP
address.  "First IP address" means first from the address list from the
host table (or a set of A RR's from the domain system).  Note that this
has nothing whatsoever to do with "net 10".
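
     The four-step priority order above can be sketched roughly as
follows.  This is only an illustration in Python, not the TOPS-20 code:
the function name pick_address, its arguments, and the sample addresses
are all made up for the example, and nets are written in modern prefix
notation for convenience.

```python
from ipaddress import ip_address, ip_network

def pick_address(addresses, preferred_nets, default_nets, other_direct_nets):
    """Pick one address using the four-step priority order described above.

    addresses         -- the remote host's addresses, in host-table / A RR order
    preferred_nets    -- directly connected "preferred" nets (e.g. a fast LAN)
    default_nets      -- directly connected "default" nets (e.g. a core net)
    other_direct_nets -- any other directly connected nets
    All names here are hypothetical; only the ordering comes from the text.
    """
    addrs = [ip_address(a) for a in addresses]
    # Steps 1-3: walk the classes of directly connected nets in priority order.
    for nets in (preferred_nets, default_nets, other_direct_nets):
        networks = [ip_network(n) for n in nets]
        for addr in addrs:  # "first" means first in the listed order
            if any(addr in net for net in networks):
                return str(addr)
    # Step 4: nothing is directly reachable, so take the first address listed.
    return str(addrs[0]) if addrs else None


# Illustrative addresses only: a host listed on net 10 first, then net 36.
remote = ["10.0.0.11", "36.44.0.6"]
print(pick_address(remote, ["36.0.0.0/8"], ["10.0.0.0/8"], []))
print(pick_address(remote, [], ["10.0.0.0/8"], []))
```

     A caller on a preferred net 36 gets the net 36 address even though
it is listed second; a caller attached only to net 10 gets the net 10
address.  Nothing in the function singles out any particular net.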

     Almost 100% of the time, this makes the best possible choice of an
IP address.  It's only in those very few cases (which come up perhaps
2 or 3 times a YEAR!!!) where an otherwise highly desirable path breaks
for a long period of time that a problem comes up.  I consider it highly
objectionable to cycle through every other IP address (waiting a minute
or more for an IP retransmission timeout unless the network is courteous
enough to tell me the other guy ain't there) every time I attempt to
connect to a dead host.

     JLarson's suggestion is less objectionable, but it involves one
piece of software (the mailer) telling a completely different piece of
software (host table or domain resolver) that the IP address given it
was sick.  Nobody wants to do the work to the host table software to
add such a feature.  It might be doable with the domain resolver (SRA
can comment on this); it certainly wouldn't be hard for the mailer to
pass on the word to the domain resolver.

     The problem is, what does "this IP address was sick" really mean?
How does "retransmission timeout" differ from "host dead" (a type 7
1822 message) differ from "host sent a reset" (refused the connection)
differ from any of the other ways a connection failed?  In which one(s)
of these do you say try another IP address, and in which one(s) do you
assume the host is really down, or really doesn't want to talk now?

     Again, what do you do about those cases when we really shouldn't
be using a particular IP address because of charging, or other
administrative issues?

     The domain system may be able to help; it was always my belief (I
remember suggesting this at the meeting when the domain system concept
was first invented) that nameservers should be allowed to tailor their
responses based on who was asking the question.  A domain query should
be something like: "I am on net 128.43 seeking an SMTP server for
FOO.BAR.COM, which is the best address for me to use?" and later on "I
am on net 128.43 seeking an SMTP server for FOO.BAR.COM and I already
tried 69.105.8.3, is there any other I should try?"

     The point is that a perfectly valid answer may be "if 69.105.8.3
ain't answering, he ain't up; try again later."
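
     To make the idea concrete, here is a toy sketch in Python of a
server that tailors its answer to the asker.  Everything in it is
hypothetical: the domain protocol has no query of this form, and the
POLICY table, the best_address function, and its parameters are
invented purely to illustrate the two questions quoted above.

```python
# Hypothetical policy table maintained by the remote organization.
# Keys are (host name, service); values map an asking net to an ordered
# list of addresses that net may use, with "default" as the fallback.
POLICY = {
    ("FOO.BAR.COM", "SMTP"): {
        "default": ["69.105.8.3"],
    },
}

def best_address(asker_net, name, service, already_tried=()):
    """Answer "I am on net X seeking service S for host H" (illustrative).

    Returns the next address the asker should try, or None when the
    organization's answer is "he ain't up; try again later".
    """
    entry = POLICY.get((name.upper(), service.upper()), {})
    candidates = entry.get(asker_net, entry.get("default", []))
    for addr in candidates:
        if addr not in already_tried:
            return addr
    return None  # no further address is offered to this asker


print(best_address("128.43", "FOO.BAR.COM", "SMTP"))
print(best_address("128.43", "FOO.BAR.COM", "SMTP", already_tried=("69.105.8.3",)))
```

     The first call returns 69.105.8.3; the second returns None, which
is the "try again later" answer.  An organization with a pay-per-packet
or overloaded path would simply leave that address out of the list it
offers to outside nets.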

     This also gives the remote organization (which presumably knows
the status of their hosts) control over the IP address selection criteria,
based upon their knowledge instead of the local host's educated guesswork.

     Please, no flames.  If you're going to babble on and on about how I
should break my mailer to conform to your fantasy of how the world should
work, send it to *NUL: or /dev/null or whatever you call it.  Furthermore,
I'm not interested in any comments about a host table based means of IP
address selection.  The systems I support do not use host tables (and, for
the record, are currently the only TOPS-20's supporting MX mailing).  I
can't help but feel that if the problem of a sick "best" IP address happens
to a domain-based mailer, the fault is that of the management of the
nameserver for that organization and not that of the mailer.

     If you have constructive observations, then let's talk.  Remember
that this is not about porting arguably "better" (or "worse") ideas from a
16-year-old operating system to a 19-year-old operating system.  This is
about what's going to be done in the next generation, that maybe will be
ported to the 16 and 19 year-olds.  I think we can do better than any of
the guesswork, and we should, if the threats of pay-per-packet come to
pass.

-- Mark --
-------

WANCHO@SIMTEL20.ARPA ("Frank J. Wancho") (02/15/88)

For those who may be unaware, Mark's pay-per-packet comments *will*
apply to all MILNET hosts for outgoing traffic and TAC access sometime
in FY 89, unless the plans have been rescinded by some miracle.  And,
from what I've seen of the algorithm DCA intends to impose on us, it
ain't cheap!  When that happens, those sites, such as this one, which
support large mailing lists and anonymous ftp service may be forced to
withdraw those services... (or move to ARPANET where flat fees will
continue - no :).

--Frank

AI.CLIVE@MCC.COM (Clive Dawson) (02/16/88)

Re:	Date: Sun, 14 Feb 88 18:18:09 PST
	From: Mark Crispin <MRC@PANDA.PANDA.COM>
	Subject: trying multiple addresses

Mark--

Your message implies that trying an alternate IP address after the
best one has failed will only win 1% of the time.  I disagree.  As
somebody who still has the job of dealing with the day-to-day hassles
of "keeping the mail flowing", I cannot afford to sit back and
consider what the "correct solution" will ultimately be once the whole
world arrives at some standard.  I will gladly accept any "kludgy
algorithms" that get the job done with no adverse effects.

Not counting the host-table-related problems (I know, you don't want
to hear this!) such as SU's Sierra, whose 10 address still appears
over a year after it died, or the more recent HI-MULTICS/CIM-VAX
address problem at Honeywell, where both addresses work, but SMTP
listens only on one, I have also had to deal with a multitude of other
problems in just the last week which were all solved by using
non-"best" IP addresses.  The following scenario comes up A LOT, at
least at this site:
	Q: Why isn't my mail getting through to host X?
	A: The host must be down.
	Q: Then why can I Telnet to it?
	A: Ah, your workstation is not directly connected to net 10,
	   and is so "dumb" that it doesn't realize that net 10 is the
	   only way out of this place anyway, so it chooses to use host
	   X's 128 address.  Meanwhile, our "smart" mail relay host (and
	   gateway) is directly connected to net 10 and therefore
	   insists on sticking to host X's 10 address regardless.

How am I supposed to tell users that "correct failure" is better
than "kludgy success"?  They could care less about these details,
and rightfully so.

	     If I understand JLarson.pa correctly, he's saying that Xerox.COM
	will use SUMEX-AIM's net 36 address just because they couldn't connect
	to the net 10 address the last time.  If this is common behavior it's
	no wonder those of us who must use the net 10/36 gateway find it so
	unusable.  Will I have to instruct the servers on multi-homed net
	10/36 hosts to refuse connections on net 36 from non-net 36 hosts to
	get them to stop?

Mark,  that's not how I understood JLarson at all.  He said that
this procedure was followed on RETRIES, and then ONLY at the
next retry interval.  This is quite different from adopting the
alternative IP address "permanently" thereafter for all deliveries.

It seems to me that this procedure is almost a complete win.  There is
no extra load on the sending mailer, and non-"best" IP addresses are
used only when they have to be, on a message-by-message basis.  It
might possibly be improved by a heuristic which says that if, on the
first retry, the second address is found to be down too, then 4 out
of every 5 (say) future retries should go back to using the "best" IP
address. This way, if SCORE is really down, the poor 10/36 gateway
won't get so much of a pounding with doomed connection attempts from
everybody.
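
A rough sketch of that retry schedule, in Python, under some assumptions
of mine: the first delivery attempt uses the "best" address, the first
retry uses an alternate, and if nothing is known to be wrong with the
alternate, later retries just rotate through the list (the rotation is
a guess at the unstated part of the procedure).  The function name and
arguments are invented for illustration.

```python
def address_for_attempt(attempt, addresses, alternate_failed=False, period=5):
    """Choose the address for a given delivery attempt (illustrative sketch).

    attempt          -- 0 for the first delivery, n >= 1 for the nth retry
    addresses        -- candidate addresses, "best" first
    alternate_failed -- True if the first retry, on an alternate, also failed
    period           -- with the heuristic on, 1 retry in `period` uses an alternate
    """
    best, alternates = addresses[0], addresses[1:]
    if attempt == 0 or not alternates:
        return best
    if attempt == 1 or not alternate_failed:
        # Assumed behaviour: successive retries rotate through the list.
        return addresses[attempt % len(addresses)]
    # Heuristic: the host looks really down, so go back to the "best"
    # address for 4 out of every 5 further retries.
    k = attempt - 2  # retries counted from the one after the first retry
    if k % period == period - 1:
        return alternates[(k // period) % len(alternates)]
    return best


hosts = ["net10-address", "net36-address"]  # placeholder names, not real addresses
print([address_for_attempt(n, hosts, alternate_failed=True) for n in range(12)])
```

With two addresses and the heuristic in effect, retries 2 through 11
hit the alternate only twice, so a dead host costs the 10/36 gateway
one doomed attempt in five instead of every other one.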

Cheers,

Clive
-------

MRC@PANDA.PANDA.COM (Mark Crispin) (02/16/88)

Clive -

     The NIC host table is a disaster, and the only way to fix the many
problems associated with it is to abandon it for domains.  Supported
TOPS-20 domain and mail software was announced a while ago; perhaps you
should consider adopting it at your site.  [As a side note, the problem
with Sierra's net 10 address being in the host table will go away very
soon since the plug is being pulled on Sierra.]

     Trying every possible IP address for a host which fails to respond
on its "primary" IP address is a recipe for performance disaster for any
background mailer with a large workload.  We're talking about thousands
of messages/day, friends, not a piddly couple of hundred.

     Furthermore, we all can rattle off the cases where the primary IP
address was wrong.  We can do this because there have been so few of
them!  What's more, most of those few are host-table only problems.  If
you run obsolete software, you should expect to have to put in some
maintenance on your own.

-- Mark --
-------

AI.CLIVE@MCC.COM (Clive Dawson) (02/16/88)

	     Trying every possible IP address for a host which fails to
	respond on its "primary" IP address is a recipe for performance
	disaster for any background mailer with a large workload.  We're
	talking about thousands of messages/day, friends, not a piddly couple
	of hundred.

Yes, Mark, I know what large workloads are.  Considering that 

	a) this discussion deals with multi-homed hosts which
	   comprise only about 5% of the total host population, and
	b) the success rate for first delivery attempts exceeds 95%, and 
	c) once a message is in the retry queue the average number of
	   retries is greater than 1,

I would appreciate it if you would elaborate on just how this
"performance disaster" you predict will come about, especially for
sites using the algorithm we've discussed in the last few messages.
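
For a back-of-the-envelope figure, the numbers in (a) and (b) can be
multiplied out as below.  This is a sketch under assumptions not stated
above: that mail is spread evenly over hosts and that first-attempt
failures are independent of whether the destination is multi-homed.
The 10,000 messages/day workload is an arbitrary example.

```python
def messages_reaching_alternate(messages_per_day,
                                multihomed_fraction=0.05,
                                first_attempt_failure=0.05):
    """Estimate messages per day that could ever use an alternate address.

    Only messages bound for multi-homed hosts (a) whose first delivery
    attempt fails (b) are candidates.  Assumes the two are independent.
    """
    return messages_per_day * multihomed_fraction * first_attempt_failure


daily = 10_000  # hypothetical workload
affected = messages_reaching_alternate(daily)
print(f"{affected:.0f} of {daily} messages ({affected / daily:.2%})")
```

Under those assumptions roughly a quarter of one percent of messages
are even eligible, and since the alternate address is tried in place of
a retry that would have been made anyway, it adds no connection
attempts of its own.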

I'm glad to see you acknowledge that only "most" of these problems are
host-table related.  OK, let's eliminate these from consideration.
What do you propose that we do about the rest of the problems which
are not host-table related?  You seem to be resigned to the idea that
they should just be ignored because there are so few of them.  If
this is true, then a solution which only affects THESE FEW can hardly
cause a performance disaster.

Clive
-------

braden@VENERA.ISI.EDU (02/17/88)

Egad!  I thought this question was settled about 4 years ago!  And the
answer was... JLarson is entirely right, and Mark Crispin is confusing 
a settled issue.

Bob Braden

jordan@UCBARPA.BERKELEY.EDU (Jordan Hayes) (02/17/88)

Mark talks all about "host table this" and "obsolete software this"
but the problem that I had (and I think J Larson had) was that,
due to NIC paperwork (i.e., changing *anything* about a host/gateway
that requires an NCD), there are things that will hose you for
sure.  I couldn't get the host table *or the domain database*
changed, so places like sushi.stanford.edu, mcc.com, a.isi.edu (a
*root domain server*!!!) bounced mail headed our way for 5 weeks.

I'm running all the correct software, i'm hip to the "now it's your
namespace, now you have control over it" thang, but I think it's a bit
naive to say that "Ah well, 1% of the time, you'll lose" ...

Summary:  Mark, i'd appreciate it if you'd try a bit harder not to
bounce my mail.  Thanks.

/jordan