[comp.unix.wizards] IP fragmentation, and how to avoid it

dplatt@teknowledge-vaxc.ARPA (Dave Platt) (05/12/87)

I've run into some problems on a Sun 3/52 workstation running SunOS
3.2 that I've been told may involve IP packet fragmentation.  The
primary symptom is that SMTP mail deliveries "hang up" and abort with
a read timeout.

Background: my Sun is sitting on a 10 Mbit Ethernet with the default
ifconfig for the Ethernet board;  the MTU for the Ethernet interface
is 1500 bytes.  The system is configured so that packets destined for
IP addresses not on our net are sent to our Vax 8650 (Ultrix 1.2),
which ipforwards them to the Internet TIP.  The MTU for the Vax's
"imp0" interface is 1006 bytes.

Problem: if a process on the Sun establishes a TCP connection with a
peer running on a host somewhere on the Internet (e.g. an SMTP
server), and then sends a large burst of data, the Sun will typically
queue up about 4k of data in the TCP buffers at one time.  This
apparently results in the sending of an IP packet that approaches the
Sun's 1500-byte MTU; when the packet passes through the Vax on its way
to the IMP, it is apparently fragmented.  Some system or gateway seems
to drop the fragmented IP packet on the floor.  The Sun's TCP never
receives an acknowledgement for the TCP segment, retries the
transmission periodically, and eventually aborts the connection.
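
To make the arithmetic concrete, here is a minimal C sketch (not taken
from any kernel source) of how a full-size datagram would be split at
the smaller MTU.  It assumes a 20-byte IP header with no options, and
the function names are made up for illustration.

```c
#define IP_HDR_LEN 20	/* ASSUMPTION: IP header without options */

/*
 * Data bytes carried by each non-final fragment on a link with the
 * given MTU.  IP requires this to be a multiple of 8.
 */
int frag_payload(int mtu)
{
	return ((mtu - IP_HDR_LEN) / 8) * 8;
}

/*
 * Number of fragments needed to carry a datagram of total length
 * dgram_len (header included) over a link with the given MTU.
 * Returns 1 if it fits unfragmented, 0 if the MTU is too small to
 * carry any data at all.
 */
int frag_count(int dgram_len, int mtu)
{
	int data = dgram_len - IP_HDR_LEN;
	int per_frag = frag_payload(mtu);

	if (dgram_len <= mtu)
		return 1;
	if (per_frag <= 0)
		return 0;
	return (data + per_frag - 1) / per_frag;	/* round up */
}
```

Under these assumptions, frag_count(1500, 1006) is 2: one fragment
carrying 984 data bytes and a second carrying the remaining 496.  A
datagram of 1006 bytes or less passes through in one piece.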

The problem typically occurs in the later stages of an SMTP session.
The Sun's SMTP mailer is able to connect with its peer on another
Internet host, go through the "MAIL FROM" and "RCPT TO" steps, and
receive permission to send the message body.  If the message is short
(< 1k bytes), everything works fine;  if it's too long, then the
timeout occurs.

This problem appears to occur only when the host I'm trying to connect
with lies on a local-area net... and not all LANs are affected.  I've
been told that certain gateways are incapable of reassembling
fragmented IP packets;  other gateways seem to work just fine.

Question for the gurus:  is there any way to reconfigure my Sun's le0
interface so that its MTU doesn't exceed that of the 8650?  If so, how
do I do it?  Or, is there a better solution to the problem?  Or,
finally, have I totally misunderstood the problem?

advTHANKSance,

                Dave Platt

Internet:  dplatt@teknowledge-vaxc.arpa
Usenet: {hplabs|sun|ucbvax|seismo|uw-beaver|decwrl}!teknowledge-vaxc.arpa!dplatt
Voice: (415) 424-0500

hedrick@topaz.RUTGERS.EDU (Charles Hedrick) (05/13/87)

Now and then we run into machines that can't reassemble.  Note that
the 1006 limit on imp0 isn't a problem with the VAX.  It is the
limit allowed by the Arpanet.  There are more elegant solutions,
but if you don't have source, here is a program that will let you
change the MTU on the fly.  We have used it on both Pyramid and
Sun, changing only the name of the kernel variable, i.e.
the string "_il_softc", which is the name appropriate for il0 on
the Pyramid.  I just checked and it looks like _le_softc will work
for a Sun 3/50.  At least this will let you see whether your problem
is really a reassembly problem.  You should try "mtu 1006" or
maybe some slightly smaller number.  (We typically use 900 for
testing.)

#include <sys/types.h>
#include <sys/stat.h>
#include <a.out.h>
#include <stdio.h>

struct nlist nl[2];

short mtu;
int kmem;
struct stat statblock;
char *kernelfile;

main(argc,argv)
char *argv[];
{
	if (argc < 2) {
		fprintf(stderr,"usage: mtu <n> {<kernelfile>}\n");
		exit(2);
	}

	if ((kmem = open("/dev/kmem",2))<0) {
		perror("open /dev/kmem");
		exit(1);
	}
	if (argc > 2) {
		kernelfile = argv[2];
	} else {
		kernelfile = "/vmunix";
	}
	if (stat(kernelfile,&statblock)) {
		fprintf(stderr,"%s not found.\n",kernelfile);
		exit(1);
	}
	initnlistvars(atoi(argv[1]));
	exit(0);
}

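/*
 * Look up the interface's softc structure in the kernel namelist and
 * overwrite its MTU field, in the running kernel, with the value "on".
 */
initnlistvars(on)
register int on;
{
	/* "_il_softc" is for il0 on the Pyramid; try "_le_softc" on a Sun. */
	nl[0].n_un.n_name = "_il_softc";
	nl[1].n_un.n_name = "";		/* empty name ends the list */
	nlist(kernelfile,nl);
	if (nl[0].n_type == 0) {
		/* also reached when the symbol is not in the kernel */
		fprintf(stderr, "%s: No namelist\n", kernelfile);
		exit(4);
	}
	/* The 2-byte MTU field sits 6 bytes into the softc structure. */
	(void) lseek(kmem,(nl[0].n_value)+6,0);
	if (read(kmem,&mtu,2) != 2) {
		perror("read kmem");
		exit(5);
	}
	fprintf(stderr,"mtu was: %d is now: %d\n",mtu,on);
	/* Seek back to the same field and store the new value. */
	(void) lseek(kmem,(nl[0].n_value)+6,0);
	mtu = on;
	if (write(kmem,&mtu,2) != 2) {
		perror("write kmem");
		exit(6);
	}
}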
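
The "+6" in the seeks is presumably the offset of the MTU field inside
the interface structure that starts the softc.  The sketch below mocks
up the first few fields as I believe they are declared in the BSD
struct ifnet (a name pointer, then two shorts), laid out for a 32-bit
machine.  The struct name and the layout are assumptions for
illustration; check net/if.h on your own system before trusting the
number.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * ASSUMPTION: mock of the leading fields of a BSD-style struct ifnet
 * on a machine with 32-bit pointers.  This is not the real kernel
 * header.
 */
struct ifnet_head32 {
	uint32_t if_name;	/* stands in for "char *if_name" (4 bytes) */
	int16_t  if_unit;	/* sub-unit number */
	int16_t  if_mtu;	/* maximum transmission unit */
};

/* Byte offset of the MTU field from the start of the structure. */
size_t mtu_offset(void)
{
	return offsetof(struct ifnet_head32, if_mtu);
}
```

With this layout mtu_offset() comes out to 6, and the field is 2 bytes
wide, which matches the seek offset and the read/write lengths in the
program above.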

mike@BRL.ARPA (Mike Muuss) (05/18/87)

The problem is fixed by a change in the TCP Max Seg Size used on the
connection.  The algorithm for computing this in 4.2 BSD (and thus
the SUN OS's) is rather simplistic, resulting in exactly the sort of
difficulties you reported.

A long time ago, I posted a few lines of code that fix this problem.
Mike Karels improved them some more, and the correct behavior is now
standard in 4.3 BSD UNIX.  I'm certain it is a matter of time until
SUN "integrates" this code into their product.
	-Mike

dplatt@teknowledge-vaxc.ARPA (Dave Platt) (05/29/87)

About three weeks ago I posted a query concerning an IP-fragmentation
problem that I had encountered on my Sun workstation.  I've received a
really astounding amount of assistance from folks on the net, and have
been able to zap the problem.  Several people have asked me to
summarize my findings and the answers I received from informed
netfolks... so, here goes.

- The original symptom of the problem was that SMTP connections would
  hang, and then abort with a network-read timeout, while sending
  large messages to a few hosts on the Internet.  Other hosts
  (including those of the same type as the affected systems) were not
  affected.

- Several people suggested that I check to ensure that my Ethernet
  interface was configured with the -trailers option (it is).

- The problem was triggered by the fact that the MTU of my Sun's
  Ethernet interface (1500 bytes) was greater than the MTU of our
  ARPANET gateway's IMP interface (1006 bytes).  This situation caused
  the TCP/IP packets sent by my Sun to be fragmented as they passed
  through the gateway.

- The fragmented packets would occasionally fail to be reassembled
  upon reception.  Some hosts apparently don't implement IP-packet
  reassembly (or don't do it reliably).  Also, I'm told that there is
  a bug in BSD 4.2 UNIX (and possibly in 4.3 as well) that prevents
  BSD systems from successfully fragmenting an already-fragmented IP
  packet.  Thus, if a 1006-byte fragment from our net's gateway had to
  be refragmented to fit within the MTU of the destination host's
  network, the new fragments would be malformed and could not be
  successfully reassembled.

- One method for working around the problem is to reduce the Sun's
  Ethernet MTU to <= 1006 bytes, so that our gateway won't have to
  fragment the packets.  I was able to locate the constant 1500 in the
  "ether_attach()" function in /vmunix, and patch it down to 1000
  bytes with adb; booting with the patched /vmunix resolved the
  problem.  Charles Hedrick posted the source for a small program that
  can change the MTU of the interface "on the fly", and it also works
  like a charm;  it's the method I'm now using.

  Reducing the Ethernet MTU increases the number of packets needed to
  complete NFS RPCs, and thus increases the overhead, but NFS
  continues to work just fine.  I've been warned that decreasing the
  MTU will probably break ND, but as I don't use it I don't really care.

- Another method for fixing the problem is persuading TCP to use a
  smaller segment size, so that the packets that it sends will not
  exceed the 1006-byte limit.  I tried patching the 1024-byte MSS in
  tcp_output() to a smaller size (512 bytes), but this did not appear
  to work.  I'm not sure why, as I have no sources for the SunOS 3.2
  version of BSD 4.2 TCP.

  Many people have pointed out that BSD 4.3 TCP makes a better choice
  of MSS, based on the MTU of the interface and on whether the packets
  will be routed through a gateway (a 512-byte MSS is used if the
  packets are sent to any non-local destination).  The BSD 4.3
  enhancements have been incorporated into SunOS 3.4, which is due to
  be shipped Real Soon Now according to our Sun sales-rep.  I FTP'ed
  the BSD 4.3 source for TCP from seismo (thanks, rick!) and can see
  the additional logic;  I haven't tried to retrofit the new TCP into
  SunOS 3.2 or patch in equivalent code due to lack of time and lack
  of urgency.
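
The MSS rule described in the last item can be sketched roughly as
follows.  This is a simplified illustration, not the BSD 4.3 code: it
assumes 40 bytes of TCP/IP header with no options, leaves out any
rounding the real code may do, and the function name is hypothetical.

```c
#define TCPIP_HDR_LEN	40	/* ASSUMPTION: 20-byte IP + 20-byte TCP header */
#define NONLOCAL_MSS	512	/* MSS for non-local peers, as described above */

/*
 * Pick a maximum segment size from the outgoing interface's MTU.
 * Local peers get the largest segment that fits in one packet;
 * non-local peers are capped at NONLOCAL_MSS.
 */
int choose_mss(int if_mtu, int peer_is_local)
{
	int mss = if_mtu - TCPIP_HDR_LEN;

	if (!peer_is_local && mss > NONLOCAL_MSS)
		mss = NONLOCAL_MSS;
	return mss;
}
```

For a 1500-byte Ethernet MTU this gives 1460 for a local peer and 512
for a non-local one.  A 512-byte segment plus 40 bytes of header is a
552-byte packet, comfortably inside the 1006-byte limit, so the
gateway has nothing to fragment.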

So... I've got a good workaround for the problem (reducing the MTU),
and the problem will go away once I install SunOS 3.4 with the BSD 4.3
enhancements to TCP.  Happy ending.

MANY thanks to all of the people on the net who have sent suggestions,
hints, and reports of similar problems elsewhere!