[comp.protocols.tcp-ip] Mail Delivery Problems

netcoor@NCS.DND.CA (DRENET Coordinator) (11/24/88)

Over the past several months I have noticed some problems sending mail to
some Internet sites. Initially it seemed to be specific between one host in
the DREnet (for which I am the Network Coordinator) and one other host in
the Internet. That problem was solved by routing mail through an intermediate
host which was able to deliver the message.

Recently, however, other users have brought some newer occurrences of this same
problem to my attention, each one affecting a different host. In each case,
mail from our networks cannot be sent to the affected host, but mail from
the host can reach us. 

The problem is not a routing or a reachability problem. When I monitor
attempts to send mail to the affected hosts, I see that a TCP connection
is successfully established, but little or no data transfer occurs. It
appears to me that the initial SMTP handshaking is occurring, but that
none of the data (i.e., the message following the DATA command) is getting
through (I can't be sure of this; it just makes sense given what
I have seen). I see the send queue (via netstat) grow to some size and then
stay at that size until the connection times out.

The syslog entries show "read: reply error" when the connection breaks, and
sometimes the host at the other end will follow up by sending a message saying
that the receipt of the message failed when a read timed out.

I can telnet to the SMTP port on the affected host and type the message
myself without any problems.

It may be related to packet size. I say this because all the TCP connection
handshake packets and the SMTP handshake are generally small, and the packets
following the DATA command would be significantly larger (given a reasonably
sized message). Also, I can FTP to any of the sites involved and can
retrieve files without difficulty (anonymous ftp). However, attempts to send
files, to the only known affected system that permits it, fail once the
actual transfer of the file starts. Directory listings, cwd's, and ftp
mode commands (ie bin and ascii commands) all work.

I am at a loss to explain this. I can't see why this would happen, given
that a TCP connection is established successfully and some packets can get
through. I don't think it is our systems here that are at fault as they are
able to mail to many other Internet sites without problems. Further, we have
a variety of systems here and all have the same problem with the affected
hosts.

If anyone can provide any clues or suggestions or answers to this problem,
I'd be glad to hear them. I admit that I am stumped. I have included below
a message I received from another DREnet user describing his view of the
problem.

Bob Bradford
DREnet Coordinator

============================================================================

From irwin@red.ipsa.dnd.ca Fri Nov 11 20:16:47 1988
Received: from red.ipsa.dnd.ca (red.ipsa.dnd.ca.ARPA) by ncs.dnd.ca; (4.12/4.7)
        id AA21100; Fri, 11 Nov 88 20:16:19 est
Received: by red.ipsa.dnd.ca; (5.54/4.7)
        id AA02802; Fri, 11 Nov 88 20:17:12 EST
Message-Id: <8811120117.AA02802@red.ipsa.dnd.ca>
To: drenet-problems@ncs.dnd.ca
Cc: dan@red.ipsa.dnd.ca, irwin@red.ipsa.dnd.ca
Subject: TCP packets lost
Date: Fri, 11 Nov 88 20:17:09 EST
From: irwin@red.ipsa.dnd.ca
Status: RO

We are experiencing a problem here in which packets sent from this host do not
reach the destination host. For some unknown reason, this only shows up in
sendmail to certain hosts.

When I try to send a mail message to one host (a BSD system), the following
sequence of events occurs:

1. There seems to be some difficulty in establishing the initial connection.
   Often, the initial SYN packet must be retransmitted a number of times, or
   the connection appears to be dropped and picked up again (I'm not sure about
   this).

2. Eventually, the connection is established, and this host sends 2-4 packets
   that are acknowledged by data and/or ACK packets from the remote host. (The
   normal state of affairs.)

3. This host sends a further packet, and no acknowledgement is received.

4. This host retransmits the packet the maximum number of times (10 in BSD
   4.2), receives no acknowledgement, and drops the connection.

I have turned on the kernel switches that enable the TCP protocol trace
printing code and watched this happen.  There is another switch in the kernel
which tells it to do a steeper exponential retransmit timeout, essentially
  numtries++;
  timeout = clip (timeout << numtries, MIN_TIMEOUT, MAX_TIMEOUT);
I tried this too, thinking that an acknowledgement might arrive if I gave it
more time. This did not work. (It's nice to have kernel source, though :-).)
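
A rough sketch of how the quoted rule behaves next to plain doubling, written
in Python for convenience. The constant values and function names are
assumptions for illustration only (the post does not give the kernel's
values), and the sketch follows one reading of the pseudocode, in which the
already-increased timeout is shifted again on each try.

```python
# Illustrative constants; the real kernel values are not given in the post.
MIN_TIMEOUT = 1    # seconds (assumed)
MAX_TIMEOUT = 64   # seconds (assumed)


def clip(value, low, high):
    """Clamp value into the range [low, high]."""
    return max(low, min(value, high))


def steeper_backoff(initial_timeout, retries):
    """Timeouts produced by the quoted rule: shift by the try count."""
    timeouts = []
    timeout = initial_timeout
    numtries = 0
    for _ in range(retries):
        numtries += 1
        timeout = clip(timeout << numtries, MIN_TIMEOUT, MAX_TIMEOUT)
        timeouts.append(timeout)
    return timeouts


def doubling_backoff(initial_timeout, retries):
    """Timeouts produced by ordinary doubling, shown for comparison."""
    timeouts = []
    timeout = initial_timeout
    for _ in range(retries):
        timeout = clip(timeout << 1, MIN_TIMEOUT, MAX_TIMEOUT)
        timeouts.append(timeout)
    return timeouts


print(steeper_backoff(1, 5))   # [2, 8, 64, 64, 64]
print(doubling_backoff(1, 5))  # [2, 4, 8, 16, 32]
```

With these assumed limits the steeper rule hits the ceiling on the third try,
so it mostly buys a longer wait early on rather than a longer total wait.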

It seems, however, that packets can still be received. When sending mail to
a Tops-20 system, the mailer connection is automatically logged
out because it is idle too long. The packet carrying the autologout message is
received here (sendmail prints it), but I haven't traced this, so I don't know
what acknowledgement the packet carries.

One other point: I tried to send a message by talking directly to one 
mailer with telnet from this host. That worked. I have no idea why this only
happens with sendmail.

As an aside, it seems to me that a steeper exponential retransmit timeout is
not a bad idea for a host like ours with an indirect connection to the
Internet. Any words of wisdom on the subject?

Irwin Meisels
irwin@red.ipsa.dnd.ca

craig@NNSC.NSF.NET (Craig Partridge) (11/28/88)

> It may be related to packet size. I say this because all the TCP connection
> handshake packets and the SMTP handshake are generally small, and the packets
> following the DATA command would be significantly larger (given a reasonably
> sized message). Also, I can FTP to any of the sites involved and can
> retrieve files without difficulty (anonymous ftp). However, attempts to send
> files, to the only known affected system that permits it, fail once the
> actual transfer of the file starts. Directory listings, cwd's, and ftp
> mode commands (ie bin and ascii commands) all work.

> I am at a loss to explain this. I can't see why this would happen, given
> that a TCP connection is established successfully and some packets can get
> through. I don't think it is our systems here that are at fault as they are
> able to mail to many other Internet sites without problems. Further, we have
> a variety of systems here and all have the same problem with the affected
> hosts.

Bob:

    I've encountered this problem a number of times.

    The problem is probably IP fragmentation.  You go from those small
datagrams before the DATA command to a large one that fragments and some
fragments never get through (see the 'always fragments' case in Mogul
and Kent's SIGCOMM '87 paper "Fragmentation Considered Harmful").  Do
the MTUs of the various networks you traverse differ?
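
    To make the fragmentation case concrete, here is a minimal Python sketch
of how one datagram is split when it meets a smaller MTU. The function name
and the example sizes are assumptions for illustration; it assumes a 20-byte
IP header with no options and relies on the rule that every fragment except
the last carries a multiple of 8 bytes.

```python
IP_HEADER_LEN = 20  # bytes, assuming no IP options


def fragment_sizes(payload_len, mtu):
    """Return (offset, length) pairs for the IP payload of each fragment."""
    # Fragment offsets are counted in 8-byte units, so round down.
    max_data = (mtu - IP_HEADER_LEN) // 8 * 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    fragments = []
    offset = 0
    while offset < payload_len:
        length = min(max_data, payload_len - offset)
        fragments.append((offset, length))
        offset += length
    return fragments


# A small SMTP command fits in one datagram on a 576-byte MTU path ...
print(fragment_sizes(60, 576))    # [(0, 60)]
# ... but 1024 bytes of data plus a 20-byte TCP header does not.
print(fragment_sizes(1044, 576))  # [(0, 552), (552, 492)]
```

If any one of the fragments is lost, the whole datagram is lost, which would
match a connection that works until the first large segment is sent.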

    Another possibility is that you are running over a slow link which
is sensitive to packet size.  If your TCP has a poor RTT estimator, your
connection can, in fact, fail on the first large datagram (whose RTT is
a factor of 200 larger than the RTT for small datagrams).  Any slow links
(< 9.6Kbits/sec) in your path?  Do you have the Jacobson RTT code in your TCP?
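
    A back-of-the-envelope Python sketch of the size sensitivity of a slow
link. It only counts the time to clock the bits onto a single link, so it
will not reproduce the full RTT difference mentioned above; the function
name and the sample sizes are assumptions for illustration.

```python
def serialization_delay(size_bytes, link_bps):
    """Seconds needed to clock size_bytes onto a link of link_bps bits/sec."""
    if link_bps <= 0:
        raise ValueError("link speed must be positive")
    return size_bytes * 8 / link_bps


small = serialization_delay(40, 9600)    # bare TCP/IP headers, no data
large = serialization_delay(576, 9600)   # a full 576-byte datagram
print(round(small, 3), round(large, 3))
```

A retransmit timer tuned on the small handshake packets can therefore fire
before the first large datagram has even finished crossing the link.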

    A final possibility I've seen is that various gateways at various
times in their software life cycles have had bugs that caused them to spindle
or discard datagrams over a certain size.  (Two typical problems -- refusing
to fragment -- corrupting a data buffer chain).  To my knowledge, all such
bugs are now gone, but if you can't find a fragmentation problem, look for
a busted gateway.

Craig

tsudik@JERICO.USC.EDU (Gene Tsudik) (11/29/88)

Another reason for packets not getting through may be that the DONT_FRAGMENT
flag is being erroneously set in the source host (or some intervening gw).
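
One way to check for this in a captured packet is to look at the flags and
fragment-offset field of the IP header. The Python sketch below is only an
illustration: the function name is an assumption, and it expects the raw
bytes of an IPv4 header obtained by whatever capture tool is available.

```python
import struct

DF_MASK = 0x4000      # "Don't Fragment" bit
MF_MASK = 0x2000      # "More Fragments" bit
OFFSET_MASK = 0x1FFF  # fragment offset, in 8-byte units


def parse_fragment_field(ip_header):
    """Extract DF, MF and the fragment offset from a raw IPv4 header."""
    if len(ip_header) < 20:
        raise ValueError("an IPv4 header is at least 20 bytes long")
    # Bytes 6 and 7 hold the flags and fragment offset, in network order.
    (flags_and_offset,) = struct.unpack("!H", ip_header[6:8])
    return {
        "dont_fragment": bool(flags_and_offset & DF_MASK),
        "more_fragments": bool(flags_and_offset & MF_MASK),
        "offset_bytes": (flags_and_offset & OFFSET_MASK) * 8,
    }


# Hypothetical header with only the DF bit set; other fields zeroed.
header = bytes(6) + b"\x40\x00" + bytes(12)
print(parse_fragment_field(header))
```

A large datagram with dont_fragment set would be discarded by the first
gateway whose outgoing MTU is too small for it.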

Gene Tsudik

Ata@RADC-MULTICS.ARPA ("John G. Ata") (11/30/88)

We have had problems sending to BSD4.2 and ULTRIX systems.  In our case
a packet gets sent but never acknowledged.  Subsequent retransmissions
are fruitless and the connection eventually times out.  An analysis of
the packets that are affected shows that only certain size packets have
difficulty.  Not counting the TCP or IP headers, a packet with a data
field of 15, 33, 51 ...  is affected (every 18 octets).  A look at a 4.2
BSD system showed that the IP layer was throwing the packets away
because the size field in the IP header wasn't the same size that the
device interface (X.25) thought it was.  You can see this with the
"netstat -s" command line.  Because we can talk with other systems
(including BSD 4.3) with no problem, I don't suspect that our IP is putting
in an incorrect size field.
          Sending data via TELNET doesn't cause any problems because
most TELNETs are 1 octet/packet so the problem doesn't occur.  We can
duplicate this problem in line mode, however, where the packet size will
be the line size.
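
          The size pattern can be written down as a small Python check.
This is only a sketch of the observed sequence (15, 33, 51, ...); the names
are assumptions, and it says nothing about why those sizes fail.

```python
FIRST_AFFECTED = 15  # smallest data size seen to fail, in octets
STEP = 18            # spacing between failing sizes, in octets


def is_affected_size(data_len):
    """True if data_len (excluding TCP/IP headers) is in the failing series."""
    return data_len >= FIRST_AFFECTED and (data_len - FIRST_AFFECTED) % STEP == 0


def affected_sizes(limit):
    """List every failing data size up to and including limit."""
    return list(range(FIRST_AFFECTED, limit + 1, STEP))


print(affected_sizes(60))    # [15, 33, 51]
print(is_affected_size(1))   # False: a 1-octet TELNET packet is safe
```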
          Hope this helps...

                              John G. Ata

AI.CLIVE@MCC.COM (Clive Dawson) (11/30/88)

Bob-
  I don't know if you've solved your problem with hung mail
connections yet, but here's something else you might try.  
Check to see if your mail delivery program is sending out
carriage return/ line feed pairs at the end of each SMTP command,
as specified in the standard, rather than just line feeds.  

Certain SMTP servers, TOPS-20's MAISER for one, will ignore
lone line feeds.
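
A quick way to test for this, sketched in Python: scan the bytes the mailer
actually sends for a line feed that is not preceded by a carriage return.
Both function names are assumptions for illustration, and how the bytes are
captured is left open.

```python
def has_bare_line_feed(data):
    """True if any LF in data (bytes) is not preceded by a CR."""
    previous = None
    for byte in data:
        if byte == 0x0A and previous != 0x0D:
            return True
        previous = byte
    return False


def to_crlf(data):
    """Rewrite every line ending as CRLF without doubling existing CRs."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


sent = b"HELO example\nMAIL FROM:<a@example>\r\n"
print(has_bare_line_feed(sent))           # True
print(has_bare_line_feed(to_crlf(sent)))  # False
```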

Good luck,

Clive
-------