netcoor@NCS.DND.CA (DRENET Coordinator) (11/24/88)
Over the past several months I have noticed some problems sending mail to some Internet sites. Initially it seemed to be specific to traffic between one host in the DREnet (for which I am the Network Coordinator) and one other host in the Internet. That problem was solved by routing mail through an intermediate host which was able to deliver the message. Recently, however, other users have brought newer occurrences of this same problem to my attention, each one affecting a different host. In each case, mail from our networks cannot be sent to the affected host, but mail from the host can reach us.

The problem is not a routing or a reachability problem. When I monitor attempts to send mail to the affected hosts, I see that a TCP connection is successfully established, but little or no data transfer occurs. It appears to me that the initial SMTP handshaking is occurring, but that none of the data (i.e. the message following the DATA command) is getting through (note that I can't be sure of this; it just makes sense given what I have seen). I see the send queue (via netstat) grow to some size and then stay at that size until the connection times out. The syslog entries show "read: reply error" when the connection breaks, and sometimes the host at the other end will follow up by sending a message saying that the receipt of the message failed when a read timed out. I can telnet to the SMTP port on the affected host and type the message myself without any problems.

It may be related to packet size. I say this because all the TCP connection handshake packets and the SMTP handshake are generally small, and the packets following the DATA command would be significantly larger (given a reasonably sized message). Also, I can FTP to any of the sites involved and can retrieve files without difficulty (anonymous ftp). However, attempts to send files, to the only known affected system that permits it, fail once the actual transfer of the file starts.
Directory listings, cwd's, and ftp mode commands (i.e. bin and ascii commands) all work.

I am at a loss to explain this. I can't see why this would happen, given that a TCP connection is established successfully and some packets can get through. I don't think it is our systems here that are at fault as they are able to mail to many other Internet sites without problems. Further, we have a variety of systems here and all have the same problem with the affected hosts.

If anyone can provide any clues or suggestions or answers to this problem, I'd be glad to hear them. I admit that I am stumped. I have included below a message I received from another DREnet user describing his view of the problem.

Bob Bradford
DREnet Coordinator

============================================================================
From irwin@red.ipsa.dnd.ca Fri Nov 11 20:16:47 1988
Received: from red.ipsa.dnd.ca (red.ipsa.dnd.ca.ARPA) by ncs.dnd.ca; (4.12/4.7)
        id AA21100; Fri, 11 Nov 88 20:16:19 est
Received: by red.ipsa.dnd.ca; (5.54/4.7)
        id AA02802; Fri, 11 Nov 88 20:17:12 EST
Message-Id: <8811120117.AA02802@red.ipsa.dnd.ca>
To: drenet-problems@ncs.dnd.ca
Cc: dan@red.ipsa.dnd.ca, irwin@red.ipsa.dnd.ca
Subject: TCP packets lost
Date: Fri, 11 Nov 88 20:17:09 EST
From: irwin@red.ipsa.dnd.ca
Status: RO

We are experiencing a problem here in which packets sent from this host do not reach the destination host. For some unknown reason, this only shows up in sendmail to certain hosts. When I try to send a mail message to one host (a BSD system), the following sequence of events occurs:

1. There seems to be some difficulty in establishing the initial connection. Often, the initial SYN packet must be retransmitted a number of times, or the connection appears to be dropped and picked up again (I'm not sure about this).

2. Eventually, the connection is established, and this host sends 2-4 packets that are acknowledged by data and/or ACK packets from the remote host. (The normal state of affairs.)

3. This host sends a further packet, and no acknowledgement is received.

4. This host retransmits the packet the maximum number of times (10 in BSD 4.2), receives no acknowledgement, and drops the connection.

I have turned on the kernel switches that enable the TCP protocol trace printing code and watched this happen. There is another switch in the kernel which tells it to do a steeper exponential retransmit timeout, essentially

        numtries++;
        timeout = clip (timeout << numtries, MIN_TIMEOUT, MAX_TIMEOUT)

I tried this too, thinking that an acknowledgement might arrive if I gave it more time. This did not work. (It's nice to have kernel source, though :-).)

It seems, however, that packets can still be received. When sending mail to a Tops-20 system, the mailer connection is automatically logged out because it is idle too long. The packet carrying the autologout message is received here (sendmail prints it), but I haven't traced this, so I don't know what acknowledgement the packet carries.

One other point: I tried to send a message by talking directly to one mailer with telnet from this host. That worked. I have no idea why this only happens with sendmail.

As an aside, it seems to me that a steeper exponential retransmit timeout is not a bad idea for a host like ours with an indirect connection to the Internet. Any words of wisdom on the subject?

Irwin Meisels
irwin@red.ipsa.dnd.ca
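
The "steeper" retransmit timeout that Irwin describes can be compared with plain doubling in a few lines. This is only a sketch of the pseudocode as written in his message, not the BSD kernel code: the `MIN_TIMEOUT` and `MAX_TIMEOUT` values, the starting timeout, and the function names are assumptions chosen for illustration.

```python
# Hypothetical bounds in arbitrary timer ticks; not the real BSD constants.
MIN_TIMEOUT = 1
MAX_TIMEOUT = 64


def clip(value, low, high):
    """Clamp value into the range [low, high]."""
    return max(low, min(value, high))


def steep_backoff(initial_timeout, max_tries):
    """Timeouts produced by: numtries++; timeout = clip(timeout << numtries, ...)."""
    timeouts = []
    timeout = initial_timeout
    numtries = 0
    for _ in range(max_tries):
        numtries += 1
        # The shift count grows with each retry, so growth is faster than doubling.
        timeout = clip(timeout << numtries, MIN_TIMEOUT, MAX_TIMEOUT)
        timeouts.append(timeout)
    return timeouts


def plain_backoff(initial_timeout, max_tries):
    """Ordinary exponential backoff (doubling each retry), for comparison."""
    timeouts = []
    timeout = initial_timeout
    for _ in range(max_tries):
        timeout = clip(timeout << 1, MIN_TIMEOUT, MAX_TIMEOUT)
        timeouts.append(timeout)
    return timeouts


if __name__ == "__main__":
    print(steep_backoff(1, 5))  # [2, 8, 64, 64, 64]
    print(plain_backoff(1, 5))  # [2, 4, 8, 16, 32]
```

With these assumed bounds the steeper variant reaches the ceiling on the third retry, so most of the retries are spent at the maximum timeout. That gives a late acknowledgement more time to arrive, but it cannot help if the packet itself is never delivered, which matches what Irwin observed.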
craig@NNSC.NSF.NET (Craig Partridge) (11/28/88)
> It may be related to packet size. I say this because all the TCP connection
> handshake packets and the SMTP handshake are generally small, and the packets
> following the DATA command would be significantly larger (given a reasonably
> sized message). Also, I can FTP to any of the sites involved and can
> retrieve files without difficulty (anonymous ftp). However, attempts to send
> files, to the only known affected system that permits it, fail once the
> actual transfer of the file starts. Directory listings, cwd's, and ftp
> mode commands (ie bin and ascii commands) all work.
> I am at a loss to explain this. I can't see why this would happen, given
> that a TCP connection is established successfully and some packets can get
> through. I don't think it is our systems here that are at fault as they are
> able to mail to many other Internet sites without problems. Further, we have
> a variety of systems here and all have the same problem with the affected
> hosts.

Bob:

I've encountered this problem a number of times.

The problem is probably IP fragmentation. You go from those small datagrams before the DATA command to a large one that fragments, and some fragments never get through (see the 'always fragments' case in Mogul and Kent's SIGCOMM '87 paper "Fragmentation Considered Harmful"). Do the MTUs of the various networks you traverse differ?

Another possibility is that you are running over a slow link which is sensitive to packet size. If your TCP has a poor RTT estimator, your connection can, in fact, fail on the first large datagram (whose RTT is a factor of 200 larger than the RTT for small datagrams). Any slow links (< 9.6Kbits/sec) in your path? Do you have the Jacobson RTT code in your TCP?

A final possibility I've seen is that various gateways at various times in their software life cycles have had bugs that caused them to spindle or discard datagrams over a certain size. (Two typical problems: refusing to fragment, and corrupting a data buffer chain.) To my knowledge, all such bugs are now gone, but if you can't find a fragmentation problem, look for a busted gateway.

Craig
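
To make the fragmentation hypothesis concrete, the sketch below works out how a datagram would be split when it crosses a network with a smaller MTU. It assumes a 20-byte IP header with no options and uses the rule that every fragment except the last carries a multiple of 8 data octets. The function name and the sizes in the usage example are illustrative and do not come from the thread.

```python
def fragment_sizes(total_length, mtu, header_len=20):
    """Return the total length (header + data) of each fragment.

    Assumes a fixed header_len for every fragment (no IP options).
    """
    if total_length < header_len:
        raise ValueError("datagram shorter than its own header")
    if total_length <= mtu:
        return [total_length]  # fits in one piece, no fragmentation

    # Data per fragment must be a multiple of 8 octets (except the last one).
    per_fragment = ((mtu - header_len) // 8) * 8
    if per_fragment <= 0:
        raise ValueError("MTU too small to carry any data")

    sizes = []
    remaining = total_length - header_len
    while remaining > 0:
        chunk = min(per_fragment, remaining)
        sizes.append(chunk + header_len)
        remaining -= chunk
    return sizes


if __name__ == "__main__":
    # A small SMTP command: fits easily, never fragments.
    print(fragment_sizes(60, 576))    # [60]
    # 1024 data octets + 20 TCP + 20 IP over an assumed 576-octet MTU.
    print(fragment_sizes(1064, 576))  # [572, 512]
```

Under these assumptions the handshake and SMTP commands pass through untouched, while the first full-sized segment after DATA becomes two fragments. If either one is lost the whole datagram is lost, and every retransmission fragments the same way.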
tsudik@JERICO.USC.EDU (Gene Tsudik) (11/29/88)
Another reason for packets not getting through may be that the DONT_FRAGMENT flag is being erroneously set in the source host (or some intervening gateway).

Gene Tsudik
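
One way to check Gene's suggestion is to look at the flags field of a captured IP header. The sketch below assumes you already have the raw header octets from a packet trace and decodes the flags and fragment offset word (octets 6 and 7 of the IPv4 header). The function name and the sample header are made up for illustration.

```python
import struct

DF_MASK = 0x4000      # "don't fragment" bit in the flags/offset word
MF_MASK = 0x2000      # "more fragments" bit
OFFSET_MASK = 0x1FFF  # 13-bit fragment offset, in units of 8 octets


def fragment_flags(ip_header: bytes) -> dict:
    """Decode the fragmentation fields from a raw IPv4 header."""
    if len(ip_header) < 20:
        raise ValueError("an IPv4 header is at least 20 octets")
    (field,) = struct.unpack_from("!H", ip_header, 6)
    return {
        "dont_fragment": bool(field & DF_MASK),
        "more_fragments": bool(field & MF_MASK),
        "fragment_offset": (field & OFFSET_MASK) * 8,
    }


if __name__ == "__main__":
    # Hypothetical header: version/IHL 0x45, DF set, everything else zero.
    sample = b"\x45" + b"\x00" * 5 + b"\x40\x00" + b"\x00" * 12
    print(fragment_flags(sample))
```

If large outbound datagrams show `dont_fragment` set and the path includes a network with a smaller MTU, a gateway has to drop them instead of fragmenting, which would look much like the symptoms described in this thread.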
Ata@RADC-MULTICS.ARPA ("John G. Ata") (11/30/88)
We have had problems sending to BSD4.2 and ULTRIX systems. In our case a packet gets sent but never acknowledged. Subsequent retransmissions are fruitless and the connection eventually times out.

An analysis of the affected packets shows that only packets of certain sizes have difficulty. Not counting the TCP or IP headers, a packet with a data field of 15, 33, 51 ... octets is affected (every 18 octets). A look at a 4.2 BSD system showed that the IP layer was throwing the packets away because the size field in the IP header wasn't the same as the size that the device interface (X.25) thought the packet was. You can see this with the "netstat -s" command. Because we can talk with other systems (including BSD 4.3) with no problem, I don't suspect that our IP is putting in an incorrect size field.

Sending data via TELNET doesn't cause any problems because most TELNETs send 1 octet/packet, so the problem doesn't occur. We can duplicate this problem in line mode, however, where the packet size will be the line size.

Hope this helps...

John G. Ata
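
The pattern John reports (data fields of 15, 33, 51 ... octets) can be written as a simple predicate, which is handy when picking test line lengths to reproduce the failure. This is only a restatement of the sizes given in his message. The function name and the sample lengths are illustrative, and the sketch says nothing about why those sizes fail.

```python
def is_affected_size(data_len, first=15, period=18):
    """True if data_len falls on the reported series 15, 33, 51, ..."""
    return data_len >= first and (data_len - first) % period == 0


if __name__ == "__main__":
    # Character-at-a-time TELNET sends 1 octet per packet: never on the series.
    print(is_affected_size(1))
    # Candidate line lengths for a line-mode test.
    print([n for n in range(1, 80) if is_affected_size(n)])
```

Sending lines of exactly these lengths, and of lengths one octet either side, would be a quick way to confirm whether a given remote host shows the same behaviour.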
AI.CLIVE@MCC.COM (Clive Dawson) (11/30/88)
Bob-

I don't know if you've solved your problem with hung mail connections yet, but here's something else you might try. Check to see if your mail delivery program is sending out carriage return/line feed pairs at the end of each SMTP command, as specified in the standard, rather than just line feeds. Certain SMTP servers, TOPS-20's MAISER for one, will ignore lone line feeds.

Good luck,

Clive
-------
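
Clive's check can be automated against a capture of what the mailer actually sends. The sketch below counts line feeds that are not preceded by a carriage return, and shows one way to normalise outgoing text to CRLF. The function names and the sample commands (including the host name `example.org`) are hypothetical and are not taken from sendmail.

```python
def lone_line_feeds(data: bytes) -> int:
    """Count LF octets that are not immediately preceded by CR."""
    count = 0
    for i, octet in enumerate(data):
        if octet == 0x0A and (i == 0 or data[i - 1] != 0x0D):
            count += 1
    return count


def to_crlf(text: str) -> bytes:
    """Rewrite LF and CRLF line endings as CRLF (lone CRs are left alone)."""
    normalized = text.replace("\r\n", "\n").replace("\n", "\r\n")
    return normalized.encode("ascii")


if __name__ == "__main__":
    captured = b"HELO example.org\nDATA\r\n"  # hypothetical capture
    print(lone_line_feeds(captured))                           # 1
    print(lone_line_feeds(to_crlf("HELO example.org\nDATA\n")))  # 0
```

A non-zero count from a real capture would point at the line-ending problem Clive describes. A zero count rules it out and leaves the size-related explanations earlier in the thread.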