craig@SH.CS.NET.UUCP (05/29/86)
I've noticed that a lot of mailers use timeouts on some or all SMTP commands, to make sure that a defective remote mailer doesn't cause the other mailer to hang permanently waiting for a reply. All the code I have seen uses a fixed timeout period. For example, you always have 30 seconds to reply to a HELO, or some such. It has occurred to me that it might make more sense for such mailers to adapt their timeouts to perceived performance at the remote end. For example, I've seen 10-15 second packet round trip times on parts of the Internet. In such a situation, a 30 second timeout is actually a 15 second timeout for the other mailer. One idea for adapting timeouts is to "ping" the other end of the SMTP link with a NOOP command every so often to find out the round trip time, and also get some vague sense of the remote machine's load (since it must actually recognize the NOOP and compose a reply). Another is to simply do one test at the start of the interaction, using the HELO command as the benchmark. What do other people think of this idea? Anyone got other interesting ways to adjust the timeouts? We're seriously considering putting this feature into MMDF2b. Craig Partridge CSNET Technical Staff
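As a rough illustration of the proposal above, here is a minimal Python sketch, not code from MMDF or any other mailer. It assumes a hypothetical callable `send_probe` that sends a NOOP (or the initial HELO) and blocks until the reply arrives. The measured round trip is simply added to the fixed allowance, so the remote mailer keeps its full 30 seconds even on a slow path.

```python
import time


def measure_round_trip(send_probe, clock=time.monotonic):
    """Time one probe exchange (e.g. NOOP or the opening HELO), in seconds.

    send_probe is a hypothetical callable that sends the command and
    returns only once the reply has been read.
    """
    start = clock()
    send_probe()
    return clock() - start


def adaptive_timeout(base_allowance, round_trip):
    """Add the measured round trip to the fixed allowance.

    With a fixed 30 second timeout and a 15 second round trip, the remote
    mailer effectively gets only 15 seconds; adding the round trip restores
    the intended allowance.
    """
    return base_allowance + round_trip
```

In use, a mailer might call `measure_round_trip` once after HELO, or periodically with NOOP, and pass the result to `adaptive_timeout(30.0, rtt)` before each later command. The `clock` parameter exists only so the measurement can be exercised without real delays.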
dpk@mcvax.UUCP.UUCP (05/29/86)
Sounds cute, but is it really necessary? In all cases there should be a reasonable MAXIMUM hardcoded in. I thought the times I put in were reasonable. If not, we have to ask whether we should really be talking to a host whose round trip time is 15 seconds. My suspicion is that the answer might be no. For example, no reply to a HELO after one minute is ridiculous. Sounds like a lot of work to get right and not necessarily foolproof. If you get a really fast HELO response, are you going to reduce the time on RCPT replies too much (if it needs mucho expansion)? It's very hard to get right in all cases. It may be easier to be able to state definitively what the values are for all cases. Consider them specs for performance. You need hardcoded minimums and maximums in any case. -Doug-
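The hardcoded minimums and maximums Doug asks for can be sketched as a clamp applied to whatever timeout an adaptive scheme proposes. This is an illustrative Python sketch only. The `BOUNDS` table and its numbers are assumptions made for the example, apart from the one-minute HELO ceiling, which follows his remark above. The per-command floor is what keeps a fast HELO reply from shrinking the RCPT timeout too far.

```python
# Hypothetical per-command (floor, ceiling) bounds in seconds.
# Only the 60 second HELO ceiling echoes the message above; the other
# values are placeholders, not recommendations.
BOUNDS = {
    "HELO": (10.0, 60.0),
    "RCPT": (30.0, 300.0),
}


def clamp_timeout(proposed, floor, ceiling):
    """Force a proposed timeout into the closed range [floor, ceiling]."""
    if floor > ceiling:
        raise ValueError("floor must not exceed ceiling")
    return max(floor, min(proposed, ceiling))


def timeout_for(command, proposed):
    """Clamp an adaptively chosen timeout to the bounds for one command."""
    floor, ceiling = BOUNDS[command]
    return clamp_timeout(proposed, floor, ceiling)
```

Here a very quick HELO that suggests a 2 second timeout still leaves RCPT with its 30 second floor, and a pathological measurement cannot push HELO past one minute.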
MILLS@USC-ISID.ARPA (05/29/86)
In response to the message sent Thu, 29 May 86 12:15:38 +0100 from mcvax!dpk@seismo.CSS.GOV Doug, I commonly see delays of up to a minute for replies to SMTP commands with MIT-MULTICS and up to two minutes for certain FORD-WDL hosts. This, however, begs the question, since delays of this magnitude would be considered prima facie evidence of brain lesions by almost everybody except their maintainers. On the other hand, premature abandonment of an SMTP connection may be hazardous to the mental health of the recipients. In a recent case when our local 4.2bsd prematurely abandoned an SMTP connection and repeatedly tried it again hour after hour, the recipient got a (truncated) copy of the message every time. After three days he became quite violent. I think SMTP adaptive timeouts, while cute and possibly even useful in some cases, should not be used to "tune" connections, but to ensure system resources are returned to service when something hangs. They should be set quite long - on the order of several minutes. The present situation clearly indicates some implementations are broken and should be fixed. Having said that, it also is clear that the error-recovery characteristics of many mailers and femailers could be much improved. Dave -------
CERF@USC-ISI.ARPA (05/30/86)
Doug, In the complex internet environment, round-trip variations can be significant, depending on the paths taken, the satellite hops, congestion, various failures, and so on. Under jamming conditions, bandwidths drop to increase S/N and also increase delays (unfortunately). At least with respect to DoD objectives, the system has to work even if its parameters exceed the nominally desired limits. Consequently, many of us have been reluctant to rely on any fixed maxima if we can find ways to measure and adapt instead. I don't disagree that it would be easier to declare some maximum and in some instances (e.g. the 576 octet IP packet size which all subsystems must carry without fragmentation) we have done so. But with respect to dynamic parameters such as round-trip time, we have tried hard not to allow ourselves to be beguiled by the apparent simplicity of declared limits. There may well be some other dissenting opinions on this point out there in the diverse world which makes up this interest group, but I believe I am stating the principal view of those of us concerned for DoD requirements. Vint
dpk@mcvax.UUCP (05/31/86)
For systems that don't have large quantities of mail I have no complaint with large timeouts, but on systems like CSNET-RELAY and BRL which have to process hundreds or thousands of messages a day, waiting minutes can be quite a delay if the number of "slow hosts" becomes more than a few. We already have to run 3-4 channels at BRL just to get rid of the mail we have sent. Thinks a 'bout it. It's not an issue I feel strongly enough about to push further, but I want people to think about it. -Doug-
dpk@mcvax.UUCP (05/31/86)
I am aware of DOD operational requirements, as you know. I also have been responsible for real service at a large mail host that essentially talked to every Internet host at the time. My opinions come from this background. While arbitrary limits are to be avoided, there are certain cases when a decision which may help you in a few cases (say a 5 minute response time) can tie up significant resources at other times when a more reasonable timeout would be, say, 30 seconds. If you can classify hosts or users in one group or the other, fine: you can have different classes of response times. But if preclassification is not possible, you have to be careful you don't create more problems than solutions. We have had times at BRL when we could not get rid of the mail as fast as it was coming in until we ran seven outbound mail processes simultaneously (with what are now believed to be too short time intervals). Think about it. Novel solutions welcome, but waiting forever is not practical. -Doug- PS. We didn't start out with timeouts; they were added for a reason.
CERF@USC-ISI.ARPA (05/31/86)
Doug, these are good points. The line of reasoning which says that timeouts that are too long result in poor resource utilization is on good grounds - the problem I had with an arbitrary maximum is that under some extreme conditions, no service might be provided at all - the usual effect of timeouts that are too short. Vint
Murray.pa@XEROX.COM.UUCP (06/02/86)
An idea that we have found very helpful... Our mailer keeps outgoing mail sorted by host. Hosts are split into two categories: healthy and sick. While there is work to do on the healthy queue, the mailer ignores the sick hosts. Whenever the mailer empties the healthy queue, it tries the host on the front of the sick queue. (If that fails, it gets moved to the end of the sick queue.) The idea is to avoid having the mailer bang its head against hosts that are known to be causing trouble. Occasionally mail to a host that isn't really very sick takes much longer than we would like. This happens when the sick queue is very long and the mailer is busy so the sick queue doesn't turn over very fast. So far, this hasn't bothered us enough to do anything about it. Along the same lines, we also keep mail to a host sorted, but not quite chronologically. Whenever the mailer tries to send a message and fails, that message gets moved to the end of the queue. Occasionally, this lets the rest of the mail get through when one particular message is having/causing troubles.
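The healthy/sick scheme described above might be sketched as follows in Python. This is not the Xerox mailer's code. The class name, the `deliver` callback (returning True on success), and the rule that a healthy host moves to the sick queue on its first failure are all assumptions made for illustration. The message does not say how a host first becomes sick.

```python
from collections import deque


class HostScheduler:
    """Two queues of hosts with pending mail: healthy first, sick only when idle."""

    def __init__(self, hosts):
        self.healthy = deque(hosts)
        self.sick = deque()

    def step(self, deliver):
        """Attempt delivery to one host and return it, or None if nothing is queued.

        deliver(host) is a hypothetical callable returning True on success.
        """
        if self.healthy:
            host = self.healthy.popleft()
            if not deliver(host):
                # Assumption: a failure on the healthy queue marks the host sick.
                self.sick.append(host)
            return host
        if self.sick:
            # Healthy queue is empty, so try the host at the front of the sick queue.
            host = self.sick.popleft()
            if not deliver(host):
                # Still failing: move it to the end of the sick queue.
                self.sick.append(host)
            return host
        return None
```

A host that is delivered to successfully simply leaves the scheduler in this sketch; a real mailer would re-add it to the healthy queue when new mail for it arrives.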
DCP@SCRC-QUABBIN.ARPA.UUCP (06/03/86)
    Date: Mon, 2 Jun 86 12:03:42 PDT
    From: Murray.pa@Xerox.COM

    An idea that we have found very helpful... Our mailer keeps outgoing mail sorted by host. Hosts are split into two categories: healthy and sick. While there is work to do on the healthy queue, the mailer ignores the sick hosts. Whenever the mailer empties the healthy queue, it tries the host on the front of the sick queue. (If that fails, it gets moved to the end of the sick queue.) The idea is to avoid having the mailer bang its head against hosts that are known to be causing trouble.

We do something like this. We keep track of up and down hosts, and when processing mail skip the hosts that are believed down. Therefore, we don't concentrate on one particular host (which might drive that host crazy). I think we do it this way because one message is often destined for many hosts.

    Occasionally mail to a host that isn't really very sick takes much longer than we would like. This happens when the sick queue is very long and the mailer is busy so the sick queue doesn't turn over very fast. So far, this hasn't bothered us enough to do anything about it.

Obvious solution: Periodically declare sick hosts up, or slightly more conservatively, declare the host suitable for a probe. If it really is sick, you'll know soon enough. You only have to do this for one message. If it isn't sick, you can requeue the tardy messages.

    Along the same lines, we also keep mail to a host sorted, but not quite chronologically. Whenever the mailer tries to send a message and fails, that message gets moved to the end of the queue. Occasionally, this lets the rest of the mail get through when one particular message is having/causing troubles.

When a message is causing troubles, how long does it take a human to realize it and take corrective action? If it stayed at the head of the queue, I can imagine a human would notice sooner, either because no mail gets through at all, or because the queue for the troublesome host keeps growing instead of staying at some "respectable" number.
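The "declare the host suitable for a probe" suggestion could look something like the Python sketch below. It is an illustration under stated assumptions, not anyone's actual mailer: `sick_since` is a hypothetical dict mapping each sick host to the time it was last marked sick, `probe_interval` is an arbitrary number of seconds, and `send_one_message` is a hypothetical callable that tries a single message and returns True on success.

```python
def due_for_probe(sick_since, now, probe_interval):
    """Return the sick hosts whose last failure is at least probe_interval old."""
    return [host for host, marked in sick_since.items()
            if now - marked >= probe_interval]


def probe(host, sick_since, send_one_message, now):
    """Try exactly one message to a sick host.

    On success the host is no longer considered sick, and the caller can
    requeue its delayed mail. On failure the sick timestamp is refreshed,
    so the host waits another full interval before the next probe.
    """
    if send_one_message(host):
        del sick_since[host]
        return True
    sick_since[host] = now
    return False
```

Because only one message is risked per probe, a host that really is down costs one timeout per interval rather than one per queued message.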