[mod.protocols.tcp-ip] Adaptive SMTP Timeouts

craig@SH.CS.NET.UUCP (05/29/86)

    I've noticed that a lot of mailers use timeouts on some or
all SMTP commands, to make sure that a defective remote mailer doesn't
cause the other mailer to hang permanently waiting for a reply.  All the
code I have seen uses a fixed timeout period.  For example, you always
have 30 seconds to reply to a HELO, or some such.

    It has occurred to me that it might make more sense for such mailers
to adapt their timeouts to perceived performance at the remote end.
For example, I've seen 10-15 second packet round trip times on parts
of the Internet.  In such a situation, a 30 second timeout is actually
a 15 second timeout for the other mailer.

    One idea for adapting timeouts is to "ping" the other end of the
SMTP link with a NOOP command every so often to find out the round
trip time, and also get some vague sense of the remote machine's
load (since it must actually recognize the NOOP and compose a reply).
Another is to simply do one test at the start of the interaction,
using the HELO command as the benchmark.

    What do other people think of this idea?  Anyone got other
interesting ways to adjust the timeouts?  We're seriously considering
putting this feature into MMDF2b.

Craig Partridge
CSNET Technical Staff
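
A minimal sketch of the idea in the message above: time a HELO (or an
occasional NOOP) exchange and derive the timeout from the measured reply
time. The smoothing by an exponentially weighted average, the class and
function names, and the values of alpha, multiplier, and the 30 second
default are all assumptions of this sketch, not anything taken from MMDF.

```python
import time


class ReplyTimeEstimator:
    """Smoothed estimate of how long the remote mailer takes to reply."""

    def __init__(self, alpha=0.875, multiplier=4.0):
        self.alpha = alpha            # weight kept by the old estimate (assumed value)
        self.multiplier = multiplier  # safety factor on the estimate (assumed value)
        self.smoothed = None          # no sample recorded yet

    def add_sample(self, seconds):
        """Fold one measured command/reply time into the running estimate."""
        if self.smoothed is None:
            self.smoothed = seconds
        else:
            self.smoothed = self.alpha * self.smoothed + (1 - self.alpha) * seconds
        return self.smoothed

    def timeout(self, default=30.0):
        """Suggested timeout; a fixed default is used before any sample exists."""
        if self.smoothed is None:
            return default
        return self.multiplier * self.smoothed


def timed_exchange(send_command, estimator, clock=time.monotonic):
    """Run one command/reply exchange (e.g. HELO or NOOP) and record its duration.

    `send_command` is a hypothetical callable that sends the command,
    blocks until the reply arrives, and returns the reply.
    """
    start = clock()
    reply = send_command()
    estimator.add_sample(clock() - start)
    return reply
```

Calling timed_exchange once on HELO gives the "one test at the start"
variant; calling it again on periodic NOOPs gives the "ping" variant.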

dpk@mcvax.UUCP (05/29/86)

Sounds cute, but is it really necessary?  In all cases there should be
a reasonable MAXIMUM hardcoded in.  I thought the times I put in were
reasonable.  If not, we have to ask: should we really be talking to
a host whose round trip time is 15 seconds?  My suspicion is that the
answer might be no.  For example, no reply to a HELO after one minute
is ridiculous.  Sounds like a lot of work to get right and not necessarily
foolproof.  If you get a really fast HELO response, are you going to reduce
the time on RCPT replies too much (if it needs mucho expansion)?  It's very
hard to get right in all cases.  It may be easier to be able to state
definitively what the values are for all cases.  Consider them specs for
performance.  You need hardcoded minimums and maximums in any case.

-Doug-
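
One way to read the point about hardcoded minimums and maximums is as a
clamp around whatever an adaptive scheme suggests, with a separate floor
per command so that a fast HELO cannot shrink the RCPT timeout too far.
The table below is a sketch under assumptions: the numbers and the names
BOUNDS, DEFAULT_BOUNDS, and clamp_timeout are hypothetical placeholders,
not the values used in any real mailer.

```python
# Hypothetical per-command bounds in seconds: (minimum, maximum).
# The numbers are placeholders, not values from MMDF or any other mailer.
BOUNDS = {
    "HELO": (10.0, 60.0),
    "RCPT": (30.0, 300.0),  # generous floor: address expansion can be slow
}
DEFAULT_BOUNDS = (10.0, 120.0)


def clamp_timeout(command, adaptive_seconds):
    """Keep an adaptively chosen timeout inside the hardcoded limits."""
    low, high = BOUNDS.get(command.upper(), DEFAULT_BOUNDS)
    return max(low, min(high, adaptive_seconds))
```

With these placeholder bounds, an adaptive suggestion of 2 seconds for
RCPT is raised to the 30 second floor, and a suggestion of 500 seconds
for HELO is cut to the 60 second ceiling.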

MILLS@USC-ISID.ARPA (05/29/86)

In response to the message sent  Thu, 29 May 86 12:15:38 +0100 from mcvax!dpk@seismo.CSS.GOV

Doug,

I commonly see delays of up to a minute for replies to SMTP commands
with MIT-MULTICS and up to two minutes for certain FORD-WDL hosts. This,
however, begs the question, since delays of this magnitude would be considered
prima facie evidence of brain lesions by almost everybody except their
maintainers. On the other hand, premature abandonment of an SMTP connection
may be hazardous to the mental health of the recipients. In a recent case
when our local 4.2bsd prematurely abandoned an SMTP connection and repeatedly
tried it again hour after hour, the recipient got a (truncated) copy of the
message every time. After three days he became quite violent.

I think SMTP adaptive timeouts, while cute and possibly even useful in some
cases, should not be used to "tune" connections, but to ensure system
resources are returned to service when something hangs. They should be set
quite long - in the order of several minutes. The present situation clearly
indicates some implementations are broken and should be fixed. Having
said that, it also is clear that the error-recovery characteristics of
many mailers and femailers could be much improved.

Dave
-------

CERF@USC-ISI.ARPA (05/30/86)

Doug,

In the complex internet environment, round-trip variations can be significant,
depending on the paths taken, the satellite hops, congestion, various failures,
and so on. Under jamming conditions, bandwidths drop to increase S/N and also
increase delays (unfortunately). At least with respect to DoD objectives,
the system has to work even if its parameters exceed the nominally desired
limits. Consequently, many of us have been reluctant to rely on any fixed
maxima if we can find ways to measure and adapt instead.

I don't disagree that it would be easier to declare some maximum and in some
instances (e.g. the 576 octet IP packet size which all subsystems must carry
without fragmentation) we have done so. But with respect to dynamic parameters
such as round-trip time, we have tried hard not to allow ourselves to be
beguiled by the apparent simplicity of declared limits.

There may well be some other dissenting opinions on this point out there in
the diverse world which makes up this interest group, but I believe I am
stating the principal view of those of us concerned for DoD requirements.

Vint

dpk@mcvax.UUCP (05/31/86)

For systems that don't have large quantities of mail I have no complaint
with large timeouts, but on systems like CSNET-RELAY and BRL which have
to process hundreds or thousands of messages a day, waiting minutes can
be quite a delay if the number of "slow hosts" becomes more than a few.
We already have to run 3-4 channels at BRL just to get them mail we
have sent.  Thinks a 'bout it.  It's not an issue I feel strongly
enough about to push further, but I want people to think about it.

-Doug-
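
The cost described above can be put into rough numbers. The function
below is a back-of-the-envelope sketch; it assumes one full timed-out
wait per slow host and an even spread of the work over the channels,
and the figures in the usage note are illustrative, not measurements
from BRL or CSNET-RELAY.

```python
def seconds_blocked(slow_hosts, timeout_seconds, channels=1):
    """Worst-case wall-clock time spent waiting on hosts that never reply.

    Assumes one timed-out wait per slow host, spread evenly over
    `channels` outbound mail processes.
    """
    return slow_hosts * timeout_seconds / channels
```

For 20 slow hosts, a 300 second timeout blocks a single channel for
6000 seconds (100 minutes) per pass over the queue, against 600 seconds
(10 minutes) with a 30 second timeout. Four channels bring the 300
second case down to 1500 seconds each.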

dpk@mcvax.UUCP (05/31/86)

I am aware of DOD operational requirements, as you know.

I also have been responsible for real service at a large mail
host that essentially talked to every Internet host at the time.

My opinions come from this background.  While arbitrary limits
are to be avoided, there are certain cases when a decision which
may help you in a few cases (say a 5 minute response time) can
tie up significant resources at other times when a more reasonable
timeout would be say 30 seconds.  If you can classify hosts or
users in one group or the other, fine, you can have different
classes of response times.  But if preclassification is not
possible, you have to be careful you don't create more problems
than solutions.  We have had times at BRL when we could not
get rid of the mail as fast as it was coming in until we ran
seven outbound mail processes simultaneously (with what are now
believed to be too short time intervals).  Think about it.  Novel
solutions welcome, but waiting forever is not practical.

-Doug-

PS.  We didn't start out with timeouts, they were added for a reason.

CERF@USC-ISI.ARPA (05/31/86)

Doug,

These are good points. The line of reasoning which says that timeouts that
are too long result in poor resource utilization is on good grounds - the
problem I had with an arbitrary maximum is that under some extreme conditions,
no service might be provided at all - the usual effect of timeouts that
are too short.

Vint

Murray.pa@XEROX.COM.UUCP (06/02/86)

An idea that we have found very helpful...

Our mailer keeps outgoing mail sorted by host. Hosts are split into two
categories: healthy and sick. While there is work to do on the healthy
queue, the mailer ignores the sick hosts. Whenever the mailer empties
the healthy queue, it tries the host on the front of the sick queue. (If
that fails, it gets moved to the end of the sick queue.) The idea is to
avoid having the mailer bang its head against hosts that are known to be
causing trouble.

Occasionally mail to a host that isn't really very sick takes much
longer than we would like. This happens when the sick queue is very long
and the mailer is busy so the sick queue doesn't turn over very fast. So
far, this hasn't bothered us enough to do anything about it.

Along the same lines, we also keep mail to a host sorted, but not quite
chronologically. Whenever the mailer tries to send a message and fails,
that message gets moved to the end of the queue. Occasionally, this lets
the rest of the mail get through when one particular message is
having/causing troubles.
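
The two queueing rules described in this message can be sketched as
follows. This is an illustration under assumptions, not the Xerox
mailer's code: the names HostScheduler, report, and requeue_failed are
hypothetical, and the rule that a failed healthy host becomes sick is
inferred rather than stated in the message.

```python
from collections import deque


class HostScheduler:
    """Healthy hosts are served first; sick hosts only when nothing else is queued."""

    def __init__(self, hosts):
        self.healthy = deque(hosts)  # hosts with pending mail, believed OK
        self.sick = deque()          # hosts that failed on their last attempt

    def next_host(self):
        """Pick the host to try next, or None if nothing is queued."""
        if self.healthy:
            return self.healthy.popleft()
        if self.sick:
            return self.sick.popleft()  # front of the sick queue
        return None

    def report(self, host, delivered, more_mail=False):
        """Record the outcome of an attempt on `host`."""
        if not delivered:
            self.sick.append(host)      # failure: end of the sick queue
        elif more_mail:
            self.healthy.append(host)   # success with mail still pending


def requeue_failed(messages, delivered):
    """Handle the message at the front of one host's deque.

    The message is dropped on success and moved to the end on failure,
    so one troublesome message does not block the rest.
    """
    msg = messages.popleft()
    if not delivered:
        messages.append(msg)
    return msg
```

A driver loop would call next_host, attempt delivery, and pass the
result to report; requeue_failed applies the same move-to-the-end rule
to the messages queued for a single host.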

DCP@SCRC-QUABBIN.ARPA.UUCP (06/03/86)

    Date: Mon, 2 Jun 86 12:03:42 PDT
    From: Murray.pa@Xerox.COM

    An idea that we have found very helpful...

    Our mailer keeps outgoing mail sorted by host. Hosts are split into two
    categories: healthy and sick. While there is work to do on the healthy
    queue, the mailer ignores the sick hosts. Whenever the mailer empties
    the healthy queue, it tries the host on the front of the sick queue. (If
    that fails, it gets moved to the end of the sick queue.) The idea is to
    avoid having the mailer bang its head against hosts that are known to be
    causing trouble.

We do something like this.  We keep track of up and down hosts, and when
processing mail skip the hosts that are believed down.  Therefore, we
don't concentrate on one particular host (which might drive that host
crazy).  I think we do it this way because one message is often destined
for many hosts.

    Occasionally mail to a host that isn't really very sick takes much
    longer than we would like. This happens when the sick queue is very long
    and the mailer is busy so the sick queue doesn't turn over very fast. So
    far, this hasn't bothered us enough to do anything about it.

Obvious solution: Periodically declare sick hosts up, or slightly more
conservatively, declare the host suitable for a probe.  If it really is
sick, you'll know soon enough.  You only have to do this for one
message.  If it isn't sick, you can requeue the tardy messages.
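
A sketch of the periodic-probe suggestion, under assumptions: the
30 minute interval, the function names, and the `try_send` callable are
all hypothetical, and times are plain seconds supplied by the caller.

```python
def hosts_due_for_probe(last_attempt, now, probe_interval=1800.0):
    """Return sick hosts whose last attempt is at least `probe_interval`
    seconds old, longest-waiting first.

    `last_attempt` maps host name to the time of the last failed attempt.
    """
    due = [h for h, t in last_attempt.items() if now - t >= probe_interval]
    return sorted(due, key=lambda h: last_attempt[h])


def probe_host(messages, try_send):
    """Probe a sick host with a single message.

    `try_send` is a hypothetical callable returning True on delivery.
    On success the delivered message is removed and the caller can treat
    the host as up and requeue the remaining messages.
    """
    if not messages:
        return False
    if try_send(messages[0]):
        messages.pop(0)
        return True
    return False
```

Only one message is risked per probe, so a host that is still sick
costs a single timeout before it goes back to being skipped.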

    Along the same lines, we also keep mail to a host sorted, but not quite
    chronologically. Whenever the mailer tries to send a message and fails,
    that message gets moved to the end of the queue. Occasionally, this lets
    the rest of the mail get through when one particular message is
    having/causing troubles.

When a message is causing troubles, how long does it take a human to
realize it and take corrective action?  If it stayed at the head of the
queue, I can imagine a human would notice sooner, either because no
mail gets through at all or because the queue for the troublesome host
keeps growing instead of staying at some "respectable" number.