[comp.protocols.tcp-ip] IP fragmentation, and how to avoid it

dplatt@teknowledge-vaxc.ARPA (Dave Platt) (05/12/87)

I've run into some problems on a Sun 3/52 workstation running SunOS
3.2 that I've been told may involve IP packet fragmentation.  The
primary symptom is that SMTP mail deliveries "hang up" and abort with
a read timeout.

Background: my Sun is sitting on a 10 Mbit Ethernet with the default
ifconfig for the Ethernet board;  the MTU for the Ethernet interface
is 1500 bytes.  The system is configured so that packets destined for
IP addresses not on our net are sent to our Vax 8650 (Ultrix 1.2),
which ipforwards them to the Internet TIP.  The MTU for the Vax's
"imp0" interface is 1006 bytes.

Problem: if a process on the Sun establishes a TCP connection with a
peer running on a host somewhere on the Internet (e.g. an SMTP
server), and then sends a large burst of data, the Sun will typically
queue up about 4k of data in the TCP buffers at one time.  This
apparently results in the sending of an IP packet that approaches the
Sun's 1500-byte MTU; when the packet passes through the Vax on its way
to the IMP, it is apparently fragmented.  Some system or gateway seems
to drop the fragmented IP packet on the floor.  The Sun's TCP never
receives an acknowledgement for the TCP segment, retries the
transmission periodically, and eventually aborts the connection.

The problem typically occurs in the later stages of an SMTP session.
The Sun's SMTP mailer is able to connect with its peer on another
Internet host, go through the "MAIL FROM" and "RCPT TO" steps, and
receive permission to send the message body.  If the message is short
(< 1k bytes), everything works fine;  if it's too long, then the
timeout occurs.

This problem appears to occur only when the host I'm trying to connect
with lies on a local-area net... and not all LANs are affected.  I've
been told that certain gateways are incapable of reassembling
fragmented IP packets;  other gateways seem to work just fine.

Question for the gurus:  is there any way to reconfigure my Sun's le0
interface so that its MTU doesn't exceed that of the 8650?  If so, how
do I do it?  Or, is there a better solution to the problem?  Or,
finally, have I totally misunderstood the problem?

advTHANKSance,

                Dave Platt

Internet:  dplatt@teknowledge-vaxc.arpa
Usenet: {hplabs|sun|ucbvax|seismo|uw-beaver|decwrl}!teknowledge-vaxc.arpa!dplatt
Voice: (415) 424-0500

hedrick@topaz.RUTGERS.EDU (Charles Hedrick) (05/13/87)

Now and then we run into machines that can't reassemble.  Note that
the 1006 limit on imp0 isn't a problem with the VAX.  It is the
limit allowed by the Arpanet.  There are more elegant solutions,
but if you don't have source, here is a program that will let you
change the MTU on the fly.  We have used it on both Pyramid and
Sun, changing only the name of the kernel variable, i.e. the
string "_il_softc", which is the name appropriate for il0 on
the Pyramid.  I just checked and it looks like _le_softc will work
for a Sun 3/50.  At least this will let you see whether your problem
is really a reassembly problem.  You should try "mtu 1006" or
maybe some slightly smaller number.  (We typically use 900 for
testing.)

#include <sys/types.h>
#include <sys/stat.h>
#include <a.out.h>
#include <stdio.h>

struct nlist nl[2];

short mtu;
int kmem;
struct stat statblock;
char *kernelfile;

main(argc,argv)
char *argv[];
{
	if (argc < 2) {
		fprintf(stderr,"usage: mtu <n> {<kernelfile>}\n");
		exit(2);
	}

	/* mode 2 is read/write; writing /dev/kmem normally requires root */
	if ((kmem = open("/dev/kmem",2))<0) {
		perror("open /dev/kmem");
		exit(1);
	}
	if (argc > 2) {
		kernelfile = argv[2];
	} else {
		kernelfile = "/vmunix";
	}
	if (stat(kernelfile,&statblock)) {
		fprintf(stderr,"%s not found.\n",kernelfile);
		exit(1);
	}
	initnlistvars(atoi(argv[1]));
	exit(0);
}

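/*
 * Look up the interface's softc structure in the kernel namelist and
 * overwrite its MTU in the running kernel with "on".
 */
initnlistvars(on)
register int on;
{
	/* "_il_softc" is il0 on the Pyramid; for a Sun 3/50 try "_le_softc" */
	nl[0].n_un.n_name = "_il_softc";
	nl[1].n_un.n_name = "";		/* empty name terminates the list */
	nlist(kernelfile,nl);
	if (nl[0].n_type == 0) {
		fprintf(stderr, "%s: No namelist\n", kernelfile);
		exit(4);
	}
	/*
	 * The 2-byte MTU is read at offset 6 of the softc.  This assumes
	 * the structure begins with a 4.2BSD-style struct ifnet (4-byte
	 * name pointer, 2-byte unit number, then the short if_mtu).
	 */
	(void) lseek(kmem,(nl[0].n_value)+6,0);
	if (read(kmem,&mtu,2) != 2) {
		perror("read kmem");
		exit(5);
	}
	fprintf(stderr,"mtu was: %d is now: %d\n",mtu,on);
	/* seek back to the same field and store the new value */
	(void) lseek(kmem,(nl[0].n_value)+6,0);
	mtu = on;
	if (write(kmem,&mtu,2) != 2) {
		perror("write kmem");
		exit(6);
	}
}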

jonab@CAM.UNISYS.COM (Jonathan P. Biggar) (05/14/87)

Don't change the MTU on your network interface.  What you want to do
is change tcp to never send segments that are larger than the mtu of
the Arpanet.  If you change the MTU on your interface, you will mess up
any ND or NFS access you may have.

Jon Biggar
jonab@cam.unisys.com

jas@MONK.PROTEON.COM (John A. Shriver) (05/14/87)

The SunOS TCP will choose to put 1024 bytes of data in each packet
unless the socket receive high water mark is lower (so_rcv.sb_hiwat).
This is straight out of the 4.2BSD VAX code, without any change.  (At
least as of SunOS 3.2.)  Indeed, this will result in the IP packets
being fragmented on the ARPANET, which is a lose.  IP fragment
reassembly is far less robust than TCP reassembly.
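
To make the arithmetic concrete, here is a rough sketch (not taken from
any kernel source) of how a datagram would be split when it crosses a
link with a smaller MTU.  The function name fragment_sizes() and the
20-byte IP and TCP header sizes (no options) are assumptions made for
illustration only.

```c
#include <assert.h>

enum { IP_HDR = 20, TCP_HDR = 20 };  /* header sizes without options (assumed) */

/*
 * Split an IP datagram of total_len bytes (header included) for a link
 * with the given MTU.  Stores the total length of each fragment in
 * sizes[] (up to max entries) and returns the number of fragments.
 */
int fragment_sizes(int total_len, int mtu, int sizes[], int max)
{
    assert(total_len > IP_HDR && mtu >= IP_HDR + 8);

    if (total_len <= mtu) {              /* fits: sent unfragmented */
        if (max > 0) sizes[0] = total_len;
        return 1;
    }

    int payload  = total_len - IP_HDR;        /* data to be carried */
    int per_frag = (mtu - IP_HDR) / 8 * 8;    /* offsets count 8-byte units */
    int count = 0;

    for (int off = 0; off < payload; off += per_frag) {
        int chunk = payload - off < per_frag ? payload - off : per_frag;
        if (count < max) sizes[count] = chunk + IP_HDR;  /* each fragment gets its own header */
        count++;
    }
    return count;
}
```

Under those assumptions, a 1024-byte TCP segment makes a 1064-byte
datagram (IP_HDR + TCP_HDR + 1024); calling fragment_sizes() with an
MTU of 1006 gives two fragments, of 1004 and 80 bytes.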

This code is fixed in 4.3BSD, where it sends large packets only to
hosts on the same net (LAN), and otherwise limits itself to 576 byte
packets.  The same code also allows the data to open up beyond 1024
bytes if you have a LAN with large MTU.  This can dramatically
increase local TCP performance.

Bother your Sun technical support contact to encourage them to fix
this.  It involves adding one subroutine (tcp_mss()), and tweaking
tcp_output().
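
The segment-size rule described above might look roughly like the
following.  This is a simplified sketch and NOT the actual 4.3BSD
tcp_mss() code: the name pick_mss(), the 40-byte combined TCP/IP header
and the 512-byte non-local limit are assumptions (512 data bytes plus
40 header bytes is a 552-byte packet, which stays within the 576-byte
figure), and any other adjustments the real routine makes are omitted.

```c
enum {
    TCPIP_HDR    = 40,   /* IP + TCP headers without options (assumed) */
    NONLOCAL_MSS = 512   /* assumed limit for off-net destinations */
};

/*
 * Choose a TCP segment size from the outgoing interface MTU.
 * dest_is_local is nonzero when the peer is on the same network.
 */
int pick_mss(int if_mtu, int dest_is_local)
{
    int link_mss = if_mtu - TCPIP_HDR;   /* largest segment the link carries whole */

    if (dest_is_local)
        return link_mss;                 /* same net: use the full MTU */

    /* through a gateway: stay small so no hop has to fragment */
    return link_mss < NONLOCAL_MSS ? link_mss : NONLOCAL_MSS;
}
```

With a 1500-byte Ethernet MTU this returns 1460 for a local peer and
512 for a peer reached through a gateway.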

As for tweaking the MTU, I don't think that it will hurt NFS, as it is
already sending 8192 byte UDP packets that are being fragmented by the
IP layer.  I have no idea what effect it will have on ND, since ND is
proprietary.  However, better to fix the problem (TCP) than to have to
crock around it (MTU).

rick@SEISMO.CSS.GOV (Rick Adams) (05/15/87)

I can provide you with the source to the 4.3BSD tcp as hacked to run with
the Sun 4.2 IP. It makes a tremendous difference in performance.
It often is the difference between making a connection or not being able to 
connect at all.

Based on the following, I am assuming that you don't even need a source
license. (Right Mike?)

---rick

	From: karels@monet.berkeley.edu (Mike Karels)
	Message-Id: <8605142343.AA09396@monet.Berkeley.EDU>
	To: CERF@usc-isi.arpa
	Cc: tcp-ip@sri-nic.arpa
	Subject: Re: C implementations of TCP/IP 
	In-Reply-To: Your message of 13 May 86 22:13:00 EDT.
	Date: Wed, 14 May 86 16:43:02 PDT

	The Berkeley 4.2/4.3BSD TCP/IP code is written in C.  It's not quite
	public domain (it is copyright by the university), but the only
	restriction on its use is that the University of California be
	credited.

			Mike

brady@MACOM4.ARPA (Sean Brady) (05/15/87)

>I can provide you with the source to the 4.3BSD tcp as hacked to run with
>the Sun 4.2 IP. It makes a tremendous difference in performance.
>It often is the difference between making a connection or not being able to 
>connect at all.

If you do have the source, would you be so kind as to allow me to use it? 
I am currently in need of doing some tcp work on a 4.2 Sun, and I am having
the usual difficulties. A copy of an improved tcp would be most appreciated.

					Sean

swb@DEVVAX.TN.CORNELL.EDU (Scott Brim) (05/17/87)

There's one other thing to check, which is rather simple.  What you
describe sounds exactly like the symptoms we used to get with hosts
trying to send IP trailers through gateways.  Be sure you have
"-trailers" in your ifconfig.
							Scott

dplatt@teknowledge-vaxc.ARPA (Dave Platt) (05/29/87)

About three weeks ago I posted a query concerning an IP-fragmentation
problem that I had encountered on my Sun workstation.  I've received a
really astounding amount of assistance from folks on the net, and have
been able to zap the problem.  Several people have asked me to
summarize my findings and the answers I received from informed
netfolks... so, here goes.

- The original symptom of the problem was that SMTP connections would
  hang, and then abort with a network-read timeout, while sending
  large messages to a few hosts on the Internet.  Other hosts
  (including those of the same type as the affected systems) were not
  affected.

- Several people suggested that I check to ensure that my Ethernet
  interface was configured with the -trailers option (it is).

- The problem was triggered by the fact that the MTU of my Sun's
  Ethernet interface (1500 bytes) was greater than the MTU of our
  ARPANET gateway's IMP interface (1006 bytes).  This situation caused the
  TCP/IP packets sent by my Sun to be fragmented as they passed
  through the gateway.

- The fragmented packets would occasionally fail to be reassembled
  upon reception.  Some hosts apparently don't implement IP-packet
  reassembly (or don't do it reliably).  Also, I'm told that there is
  a bug in BSD 4.2 UNIX (and possibly in 4.3 as well) that prevents
  BSD systems from successfully fragmenting an already-fragmented IP
  packet.  Thus, if a 1006-byte fragment from our net's gateway had to
  be refragmented to fit within the MTU of the destination host's
  network, the new fragments would be malformed and could not be
  successfully reassembled.

- One method for working around the problem is to reduce the Sun's
  Ethernet MTU to <= 1006 bytes, so that our gateway won't have to
  fragment the packets.  I was able to locate the constant 1500 in the
  "ether_attach()" function in /vmunix, and patch it down to 1000
  bytes with adb; booting with the patched /vmunix resolved the
  problem.  Charles Hedrick posted the source for a small program that
  can change the MTU of the interface "on the fly", and it also works
  like a charm;  it's the method I'm now using.

  Reducing the Ethernet MTU increases the number of packets needed to
  complete NFS RPCs, and thus increases the overhead;  NFS continues
  to work just fine.  I've been warned that decreasing the MTU will
  probably break ND, but as I don't use it I don't really care.

- Another method for fixing the problem is persuading TCP to use a
  smaller segment size, so that the packets that it sends will not
  exceed the 1006-byte limit.  I tried patching the 1024-byte MSS in
  tcp_output() to a smaller size (512 bytes), but this did not appear
  to work.  I'm not sure why, as I have no sources for the SunOS 3.2
  version of BSD 4.2 TCP.

  Many people have pointed out that BSD 4.3 TCP makes a better choice
  of MSS, based on the MTU of the interface and on whether the packets
  will be routed through a gateway (a 512-byte MSS is used if the
  packets are sent to any non-local destination).  The BSD 4.3
  enhancements have been incorporated into SunOS 3.4, which is due to
  be shipped Real Soon Now according to our Sun sales-rep.  I FTP'ed
  the BSD 4.3 source for TCP from seismo (thanks, rick!) and can see
  the additional logic;  I haven't tried to retrofit the new TCP into
  SunOS 3.2 or patch in equivalent code due to lack of time and lack
  of urgency.
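
The extra NFS overhead mentioned in the list above can be estimated
with a little arithmetic.  The sketch below is illustrative only: the
name fragments_needed() is made up, and it assumes 8192 bytes of UDP
data, an 8-byte UDP header and a 20-byte IP header without options,
ignoring whatever RPC headers add on top.

```c
#include <assert.h>

enum { IP_HDR = 20, UDP_HDR = 8 };   /* header sizes without options (assumed) */

/*
 * Number of IP fragments needed to carry udp_data bytes of UDP payload
 * over a link with the given MTU.
 */
int fragments_needed(int udp_data, int mtu)
{
    assert(udp_data >= 0 && mtu >= IP_HDR + 8);

    int payload = udp_data + UDP_HDR;         /* what IP has to carry */
    if (payload + IP_HDR <= mtu)
        return 1;                             /* fits in one packet */

    int per_frag = (mtu - IP_HDR) / 8 * 8;    /* offsets count 8-byte units */
    return (payload + per_frag - 1) / per_frag;   /* round up */
}
```

Under these assumptions an 8192-byte transfer needs 6 fragments at an
MTU of 1500 and 9 fragments at an MTU of 1000, so the cost of the
workaround is about half again as many packets per large NFS RPC.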

So... I've got a good workaround for the problem (reducing the MTU),
and the problem will go away once I install SunOS 3.4 with the BSD 4.3
enhancements to TCP.  Happy ending.

MANY thanks to all of the people on the net who have sent suggestions,
hints, and reports of similar problems elsewhere!