[mod.protocols.tcp-ip] ip fragmentation follies

tcp-ip@ucbvax.UUCP (12/29/85)

I've been playing with IP fragmentation/reassembly and have discovered
a major crock in the Berkeley way of doing things.  This may have been
noticed by someone before, but I hadn't really thought about it.

What caused me to notice this was claims by some people (namely Sun)
that using very large IP packets and using IP-level fragmentation
makes protocols like NFS run faster.  This makes some sense (less
context-switching, etc), so we decided to try it.  We quickly noticed
a problem, though: if a fragmented packet has to be retransmitted (e.g.
because one of the fragments was dropped somewhere) the fragments of
the retransmitted packet are not and cannot be merged with those of
the original packet!  Why?  Because the Berkeley code has no notion of
IP-level retransmission, and hence assigns a new IP-level packet
identifier to each and every IP packet it transmits!  And since the
IP-level identifier is the only way the receiver can tell whether two
fragments belong to the same packet, this means that the fragments of
a retransmitted packet can never be combined with those of the
original.
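The reassembly rule above can be sketched in a few lines.  This is a
toy model, not the Berkeley code: fragments are grouped by the
(source, destination, protocol, identifier) tuple, which is how IP
reassembly identifies fragments of the same packet, so a
retransmission carrying a fresh identifier lands in a separate queue
and the two transmissions can never complete each other.

```python
# Simplified model of IP reassembly: fragments are grouped by the
# (source, destination, protocol, identifier) tuple.  All names here
# are illustrative, not the actual 4.2BSD code.

def reassembly_key(frag):
    return (frag["src"], frag["dst"], frag["proto"], frag["id"])

def reassemble(fragments, total_frags):
    """Return the set of keys for which every fragment arrived."""
    queues = {}
    for frag in fragments:
        queues.setdefault(reassembly_key(frag), set()).add(frag["offset"])
    return {k for k, offsets in queues.items()
            if len(offsets) == total_frags}

# First transmission: ID 100, the fragment at offset 1 is lost.
first_try = [{"src": "A", "dst": "B", "proto": 17, "id": 100, "offset": 0}]
# Retransmission: the sender assigns a *new* ID (101), so even though
# the surviving offset-0 fragment of ID 100 is still queued, the two
# transmissions can never be merged into one complete packet.
retry = [{"src": "A", "dst": "B", "proto": 17, "id": 101, "offset": 1}]

complete = reassemble(first_try + retry, total_frags=2)
print(complete)  # set() -- neither queue is complete
```

Had the retransmitted fragment carried ID 100, the second call below
would have yielded one complete packet.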

What all this means in practice is this: for a fragmented IP packet to
get through to its receiver, all the fragments resulting from a single
transmission of that packet must get through.  If a single fragment is
lost, all the other fragments resulting from that transmission of the
packet are useless and will never be recombined with fragments from
past or future transmissions of the same packet.

This explains (or at least partially explains) why people running 4.2
TCP connections across the Arpanet using 1024-byte packets were losing
so badly.  If the probability of fragment lossage
is even moderately high, it will often take three or more tries to get
a fragmented packet across the net.  Meanwhile, of course, the useless
fragments from previous transmissions are sitting on reassembly queues
in the receiver (for 15 seconds, I think?), tying up buffering
resources and increasing the chances that fragments will be dropped in
the future!
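The "three or more tries" claim can be backed up with a little
arithmetic.  Assuming (for illustration -- the loss rates below are
made up, not measured Arpanet figures) each fragment is lost
independently with probability p, a packet split into n fragments
survives one transmission with probability (1 - p)**n, and since
partial deliveries are useless here, the expected number of
transmissions is the reciprocal of that:

```python
# Rough arithmetic for why fragmentation hurts under loss.  With an
# independent per-fragment loss rate p, a packet split into n
# fragments gets through on a given try with probability (1 - p)**n,
# so the expected number of transmissions is 1 / (1 - p)**n.
# The loss rates here are illustrative, not measurements.

def expected_tries(p, n):
    success = (1 - p) ** n
    return 1 / success

# An unfragmented packet vs. one split into 8 fragments, at 10% loss:
print(round(expected_tries(0.10, 1), 2))  # 1.11
print(round(expected_tries(0.10, 8), 2))  # 2.32
```

And this understates the damage, since (as noted above) the dead
fragments sitting on reassembly queues drive the loss rate up further.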

In the current Berkeley code, it's possible to imagine workarounds for
this problem for TCP: because TCP is in the kernel, it could have a
side hook into the IP layer to tell it "this packet is a
retransmission, don't give it a new IP identifier".  For protocols
like UDP, however, the acknowledgment and retransmission functions are
done outside of the kernel, and the only kernel interface that's
available is Berkeley's socket calls (sendto, recvfrom, etc).
Needless to say, the socket interface gives you 1) no way to find out
what IP identifier a packet was sent with, and 2) no way to specify
the IP identifier to use on an outgoing packet.

I don't really have any idea what to do about this problem.  And, it's
not entirely Berkeley's fault; the BBN TCP/IP for 4.1bsd did the same
thing...  In any case, until there's a fix I don't think using IP
fragmentation/reassembly when talking to 4.2bsd systems is a very good
idea.
                                                        -Larry

-------

Greenwald@SCRC-STONY-BROOK.ARPA (Michael Greenwald) (12/29/85)

Yeah, it's been noticed.  I thought Dave had even commented on it in his
"implementation notes", but I can't find my copies, so I couldn't check.
I noticed this in Multics (it hadn't actually happened, but if you
remember, I was trying to decide when you'd rather have *large* IP
packets and when you'd rather restrict IP packets to the max network
size.  IP reassembly was cheaper than TCP reassembly (fewer packet
headers to process).  I thought the only drawback was that in case of a
lost fragment you had to retransmit the entire packet, but when I
thought about it, I made the same realization that you did.)  My
(minimal) solution is mentioned below.

My Multics code had foo$retransmit_packet, foo$forward_packet, and
foo$send_packet for each datagram protocol named "foo".  (IP, UDP, GGP,
and ICMP, as far as I can remember.)  Not only did it keep the same ID,
but it eliminated a certain category of error checking and checksum
generation, where possible.
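The essential point of that send/retransmit split can be sketched as
follows.  This is a toy model of the idea, not the Multics code
itself: send_packet allocates a fresh IP identifier, while
retransmit_packet reuses the one from the original send, so fragments
from both transmissions share one reassembly queue at the receiver.

```python
# Sketch of a send/retransmit split at the datagram layer.  Names and
# structure are illustrative only; the real Multics entry points
# (foo$send_packet, foo$retransmit_packet) are not shown in the post.

import itertools

_next_id = itertools.count(1)  # per-host IP identifier counter

def send_packet(payload):
    """First transmission: allocate a new IP identifier."""
    return {"id": next(_next_id), "payload": payload}

def retransmit_packet(pkt):
    """Retransmission: keep the identifier assigned on first send,
    so the receiver can merge old and new fragments."""
    return {"id": pkt["id"], "payload": pkt["payload"]}

original = send_packet(b"data")
resend = retransmit_packet(original)
print(resend["id"] == original["id"])  # True
```

Under the Berkeley scheme described above, every transmission goes
through the equivalent of send_packet, which is exactly what prevents
fragments of a retransmission from completing the original.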