MRC@PANDA.PANDA.COM (Mark Crispin) (02/15/88)
I am sure the advocates of trying multiple mail addresses would feel quite differently if they had to pay per-packet charges for network access.

Historically, only a small percentage of network connection failures -- typically less than 1% -- have been due to a dysfunctional IP address. The remaining (= overwhelming majority of) failures have been due to dysfunctional networks, dysfunctional hosts, or dysfunctional servers. It is possible that trying a different IP address may help in the dysfunctional network case, although typically the "non-best" IP addresses all involve the dysfunctional network in some way (look at some network topology maps some time). This is a relatively rare case anyway.

Many times, the "non-best" IP address is substantially inferior to the point where it should not be used under ANY circumstance. No site outside of Stanford should *ever* use SAIL's, Score's, or SUMEX-AIM's net 36 IP address; the gateway between net 10 and net 36 (as well as the net 36 subnet from that gateway) is seriously overloaded.

If I understand JLarson.pa correctly, he's saying that Xerox.COM will use SUMEX-AIM's net 36 address just because they couldn't connect to the net 10 address the last time. If this is common behavior it's no wonder those of us who must use the net 10/36 gateway find it so unusable. Will I have to instruct the servers on multi-homed net 10/36 hosts to refuse connections on net 36 from non-net 36 hosts to get them to stop?

What about those guys multi-homed on a "free" and a pay-per-packet X.25 net? Do they appreciate this behavior?

The *correct* solution to this problem is NOT kludgy algorithms in the mailer. The correct solution is multi-part, and involves:

1) complete the migration from the host table to the domain system. The NIC simply cannot keep up with the changes in network topology (as the Xerox experience showed), and, frankly, it's unreasonable for us to expect them to.
2) domain database managers need to keep their name servers updated with changes to network topology. TTL's should not be allowed to be so long that topology changes go unnoticed by resolvers for excessive periods of time.

3) better support needs to exist in the domain infrastructure for "best" IP address selection.

This last point is important. Presently, it is up to the local host to decide upon a "best" IP address, based on quite incomplete information. Many hosts (all Unix hosts?) simply pick the first IP address listed in the NIC host table (or returned as A RR's from the domain system). TOPS-20 selects in priority order:

(1) first IP address from a directly connected net that is "preferred" (e.g. a fast LAN),
(2) first IP address from a directly connected net that is "default" (e.g. a core net such as ARPANET),
(3) first IP address from any other directly connected net,
(4) first IP address.

"First IP address" means first from the address list from the host table (or a set of A RR's from the domain system). Note that there is nothing whatsoever to do with "net 10".

Almost 100% of the time, this makes the best possible choice of an IP address. It's only in those very few cases (which come up perhaps 2 or 3 times a YEAR!!!) where an otherwise highly desirable path breaks for a long period of time that a problem comes up.

I consider it highly objectionable to cycle through every other IP address (waiting a minute or more for an IP retransmission timeout if the network is courteous enough to tell me the other guy ain't there) every time I attempt to connect to a dead host.

JLarson's suggestion is less objectionable, but it involves one piece of software (the mailer) telling a completely different piece of software (host table or domain resolver) that the IP address given it was sick. Nobody wants to do the work to the host table software to add such a feature.
It might be doable with the domain resolver (SRA can comment on this); it certainly wouldn't be hard for the mailer to pass on the word to the domain resolver.

The problem is, what does "this IP address was sick" really mean? How does "retransmission timeout" differ from "host dead" (a type 7 1822 message) differ from "host sent a reset" (refused the connection) differ from any of the other ways a connection failed? In which one(s) of these do you say try another IP address, and in which one(s) do you assume the host is really down, or really doesn't want to talk now? Again, what do you do about those cases when we really shouldn't be using a particular IP address because of charging, or other administrative issues?

The domain system may be able to help; it was always my belief (I remember suggesting this at the meeting when the domain system concept was first invented) that nameservers should be allowed to tailor their responses based on who was asking the question. A domain query should be something like:

    "I am on net 128.43 seeking an SMTP server for FOO.BAR.COM, which is the best address for me to use?"

and later on

    "I am on net 128.43 seeking an SMTP server for FOO.BAR.COM and I already tried 69.105.8.3, is there any other I should try?"

The point is that a perfectly valid answer may be "if 69.105.8.3 ain't answering, he ain't up; try again later." This also gives the remote organization (which presumably knows the status of their hosts) control over the IP address selection criteria, based upon their knowledge instead of the local host's educated guesswork.

Please, no flames. If you're going to babble on and on about how I should break my mailer to conform to your fantasy of how the world should work, send it to *NUL: or /dev/null or whatever you call it. Furthermore, I'm not interested in any comments about a host table based means of IP address selection.
The systems I support do not use host tables (and, for the record, are currently the only TOPS-20's supporting MX mailing). I can't help but feel that if the problem of a sick "best" IP address happens to a domain-based mailer, that the fault is that of the management of the nameserver for that organization and not that of the mailer.

If you have constructive observations, then let's talk. Remember that this is not about porting arguably "better" (or "worse") ideas from a 16-year-old operating system to a 19-year-old operating system. This is about what's going to be done in the next generation, that maybe will be ported to the 16 and 19 year-olds. I think we can do better than any of the guesswork, and we should, if the threats of pay-per-packet come to pass.

-- Mark --
-------
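The four-step TOPS-20 priority order described in the message above can be illustrated with a minimal sketch. This is not the TOPS-20 code; the function name `pick_address`, its parameters, and the sample addresses are hypothetical, and the three lists of directly connected networks are assumed to be supplied by local configuration.

```python
from ipaddress import IPv4Address, IPv4Network


def pick_address(addresses, preferred_nets, default_nets, other_direct_nets):
    """Pick one address for a multi-homed host (illustrative sketch only).

    addresses          -- the host's addresses, in host-table / A RR order
    preferred_nets     -- directly connected nets marked "preferred" (e.g. a fast LAN)
    default_nets       -- directly connected nets marked "default" (e.g. a core net)
    other_direct_nets  -- any other directly connected nets
    """
    # Steps (1) to (3): walk the network classes in priority order and,
    # within each class, take the first listed address that falls on it.
    for nets in (preferred_nets, default_nets, other_direct_nets):
        for addr in addresses:
            if any(addr in net for net in nets):
                return addr
    # Step (4): no directly connected net matched; use the first address listed.
    return addresses[0] if addresses else None


# Hypothetical host listed on net 36 first and net 10 second.
host = [IPv4Address("36.0.0.1"), IPv4Address("10.0.0.1")]
net10 = IPv4Network("10.0.0.0/8")
net36 = IPv4Network("36.0.0.0/8")

# A site directly connected only to net 10 (as a "default" net) picks the net 10 address.
choice = pick_address(host, [], [net10], [])
```

Note that nothing in the sketch mentions a particular network number; the outcome depends only on which nets the local host is directly connected to and on the order of the address list.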
WANCHO@SIMTEL20.ARPA ("Frank J. Wancho") (02/15/88)
For those who may be unaware, Mark's pay-per-packet comments *will* apply to all MILNET hosts for outgoing traffic and TAC access sometime in FY 89, unless the plans have been rescinded by some miracle. And, from what I've seen of the algorithm DCA intends to impose on us, it ain't cheap! When that happens, those sites, such as this one, which support large mailing lists and anonymous ftp service may be forced to withdraw those services... (or move to ARPANET where flat fees will continue - no :). --Frank
AI.CLIVE@MCC.COM (Clive Dawson) (02/16/88)
Re: Date: Sun, 14 Feb 88 18:18:09 PST
    From: Mark Crispin <MRC@PANDA.PANDA.COM>
    Subject: trying multiple addresses

Mark--

Your message implies that trying an alternate IP address after the best one has failed will only win 1% of the time. I disagree. As somebody who still has the job of dealing with the day-to-day hassles of "keeping the mail flowing", I cannot afford to sit back and consider what the "correct solution" will ultimately be once the whole world arrives at some standard. I will gladly accept any "kludgy algorithms" that get the job done with no adverse effects.

Not counting the host-table-related problems (I know, you don't want to hear this!) such as SU's Sierra, whose 10 address still appears over a year after it died, or the more recent HI-MULTICS/CIM-VAX address problem at Honeywell, where both addresses work, but SMTP listens only on one, I have also had to deal with a multitude of other problems in just the last week which were all solved by using non-"best" IP addresses.

The following scenario comes up A LOT, at least at this site:

Q: Why isn't my mail getting through to host X?
A: The host must be down.
Q: Then why can I Telnet to it?
A: Ah, your workstation is not directly connected to net 10, and is so "dumb" that it doesn't realize that net 10 is the only way out of this place anyway, so it chooses to use host X's 128 address. Meanwhile, our "smart" mail relay host (and gateway) is directly connected to net 10 and therefore insists on sticking to host X's 10 address regardless.

How am I supposed to tell users that "correct failure" is better than "kludgy success"? They could care less about these details, and rightfully so.

    If I understand JLarson.pa correctly, he's saying that Xerox.COM will use SUMEX-AIM's net 36 address just because they couldn't connect to the net 10 address the last time. If this is common behavior it's no wonder those of us who must use the net 10/36 gateway find it so unusable. Will I have to instruct the servers on multi-homed net 10/36 hosts to refuse connections on net 36 from non-net 36 hosts to get them to stop?

Mark, that's not how I understood JLarson at all. He said that this procedure was followed on RETRIES, and then ONLY at the next retry interval. This is quite different from adopting the alternative IP address "permanently" thereafter for all deliveries. It seems to me that this procedure is almost a complete win. There is no extra load on the sending mailer, and non-"best" IP addresses are used only when they have to be, on a message-by-message basis.

It might possibly be improved by a heuristic which says that if, on the first retry, the second address is found to be down too, then 4 out of every 5 (say) future retries should go back to using the "best" IP address. This way, if SCORE is really down, the poor 10/36 gateway won't get so much of a pounding with doomed connection attempts from everybody.

Cheers,
Clive
-------
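The retry procedure discussed in the message above, together with the suggested "4 out of every 5" refinement, can be sketched roughly as follows. This is only one reading of the description, not anyone's actual mailer code: the function name `address_for_attempt`, the `alternate_failed` flag, and the choice to consider just the first two addresses are assumptions made for illustration.

```python
def address_for_attempt(addresses, attempt, alternate_failed, cycle=5):
    """Choose which address a queued message should try (illustrative sketch).

    addresses        -- the host's addresses, "best" first
    attempt          -- 0 for the first delivery attempt, 1, 2, ... for retries
    alternate_failed -- True once a retry on the alternate address has also failed
    cycle            -- with the heuristic active, 1 retry in `cycle` uses the alternate
    """
    best = addresses[0]
    # The first attempt always uses the "best" address; a host with a
    # single address has nothing else to try.
    if attempt == 0 or len(addresses) < 2:
        return best
    alternate = addresses[1]
    # Retries use the alternate address, per message, until it is known bad.
    if not alternate_failed:
        return alternate
    # Suggested heuristic: once the alternate has failed too, send most retries
    # (4 out of every 5 with the default cycle) back to the "best" address.
    return alternate if attempt % cycle == 0 else best


# Hypothetical labels standing in for a host's two addresses.
addrs = ["net10-address", "net36-address"]
later_retries = [address_for_attempt(addrs, n, alternate_failed=True) for n in range(2, 12)]
```

Because the choice is made per message and only on retries, a host that answers on its "best" address never sees traffic on the alternate one.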
MRC@PANDA.PANDA.COM (Mark Crispin) (02/16/88)
Clive -

The NIC host table is a disaster, and the only way to fix the many problems associated with it is to abandon it for domains. Supported TOPS-20 domain and mail software was announced a while ago; perhaps you should consider adopting it at your site. [As a side note, the problem with Sierra's net 10 address being in the host table will go away very soon since the plug is being pulled on Sierra.]

Trying every possible IP address for a host which fails to respond on its "primary" IP address is a recipe for performance disaster for any background mailer with a large workload. We're talking about thousands of messages/day, friends, not a piddly couple of hundred.

Furthermore, we all can rattle off the cases where the primary IP address was wrong. We can do this because there have been so few of them! What's more, most of those few are host-table only problems. If you run obsolete software, you should expect to have to put in some maintenance on your own.

-- Mark --
-------
AI.CLIVE@MCC.COM (Clive Dawson) (02/16/88)
    Trying every possible IP address for a host which fails to respond on its "primary" IP address is a recipe for performance disaster for any background mailer with a large workload. We're talking about thousands of messages/day, friends, not a piddly couple of hundred.

Yes, Mark, I know what large workloads are. Considering that

a) this discussion deals with multi-homed hosts which comprise only about 5% of the total host population, and
b) the success rate for first delivery attempts exceeds 95%, and
c) once a message is in the retry queue the average number of retries is greater than 1,

I would appreciate it if you would elaborate on just how this "performance disaster" you predict will come about, especially for sites using the algorithm we've discussed in the last few messages.

I'm glad to see you acknowledge that only "most" of these problems are host-table related. OK, let's eliminate these from consideration. What do you propose that we do about the rest of the problems which are not host-table related? You seem to be resigned to the idea that they should just be ignored because there are so few of them. If this is true, then a solution which only affects THESE FEW can hardly cause a performance disaster.

Clive
-------
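As a rough back-of-the-envelope check on figures (a) and (b) above: if mail is assumed to be spread evenly across hosts, and first-attempt failures are assumed to be independent of whether the destination is multi-homed (neither assumption is stated in the thread), then the share of messages that would ever reach an alternate address is about 5% of 5%, roughly a quarter of one percent. The function name below is hypothetical.

```python
def affected_fraction(multi_homed_share, first_attempt_success):
    """Approximate share of messages that both go to a multi-homed host
    and fail on the first attempt, assuming the two are independent."""
    return multi_homed_share * (1.0 - first_attempt_success)


# Figures quoted in the message: about 5% multi-homed hosts, at least
# 95% first-attempt success, so this is an approximate upper bound.
estimate = affected_fraction(0.05, 0.95)
print(f"{estimate:.2%} of messages")
```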
braden@VENERA.ISI.EDU (02/17/88)
Egad! I thought this question was settled about 4 years ago! And the answer was... JLarson is entirely right, and Mark Crispin is confusing a settled issue. Bob Braden
jordan@UCBARPA.BERKELEY.EDU (Jordan Hayes) (02/17/88)
Mark talks all about "host table this" and "obsolete software this" but the problem that I had (and I think J Larson had) was that, due to NIC paperwork (i.e., changing *anything* about a host/gateway that requires an NCD), there are things that will hose you for sure. I couldn't get the host table *or the domain database* changed, so places like sushi.stanford.edu, mcc.com, a.isi.edu (a *root domain server*!!!) bounced mail headed our way for 5 weeks.

I'm running all the correct software, I'm hip to the "now it's your namespace, now you have control over it" thang, but I think it's a bit naive to say that "Ah well, 1% of the time, you'll lose" ...

Summary: Mark, I'd appreciate it if you'd try a bit harder not to bounce my mail.

Thanks.

/jordan