tcp-ip@ucbvax.UUCP (12/29/85)
I've been playing with IP fragmentation/reassembly and have discovered a major crock in the Berkeley way of doing things. This may have been noticed by someone before, but I hadn't really thought about it. What caused me to notice this was claims by some people (namely Sun) that using very large IP packets and using IP-level fragmentation makes protocols like NFS run faster. This makes some sense (less context-switching, etc.), so we decided to try it. We quickly noticed a problem, though: if a fragmented packet has to be retransmitted (e.g., because one of the fragments was dropped somewhere), the fragments of the retransmitted packet are not and cannot be merged with those of the original packet! Why? Because the Berkeley code has no notion of IP-level retransmission, and hence assigns a new IP-level packet identifier to each and every IP packet it transmits! And since the IP-level identifier is the only way the receiver can tell whether two fragments belong to the same packet, this means that the fragments of a retransmitted packet can never be combined with those of the original.

What all this means in practice is this: for a fragmented IP packet to get through to its receiver, all the fragments resulting from a single transmission of that packet must get through. If a single fragment is lost, all the other fragments resulting from that transmission of the packet are useless and will never be recombined with fragments from past or future transmissions of the same packet.

This explains (or at least partially explains) why people running 4.2 TCP connections across the Arpanet using 1024-byte packets were losing so badly. If the probability of fragment lossage is even moderately high, it will often take three or more tries to get a fragmented packet across the net.
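The failure mode described above can be sketched in a few lines. This is not the 4.2BSD code -- it is a toy Python model (field and function names invented for illustration) of a receiver that keys its reassembly queues on (source, destination, protocol, IP identifier), as the IP specification requires. Because the retransmission arrives under a fresh identifier, its fragments land on a separate queue and can never complete the original packet:

```python
# Toy model of IP reassembly keyed on (src, dst, proto, ip_id).
# Illustrative names only; not from any real stack.

def reassemble(queues, frag):
    """Insert a fragment; return the full packet if reassembly completes."""
    key = (frag["src"], frag["dst"], frag["proto"], frag["ip_id"])
    got = queues.setdefault(key, {})
    got[frag["index"]] = frag["data"]
    if len(got) == frag["total"]:          # all pieces of this IP ID present
        del queues[key]
        return b"".join(got[i] for i in sorted(got))
    return None

queues = {}
# Original transmission, IP ID 100: fragment 1 of 3 is dropped in transit.
for i in (0, 2):
    frag = {"src": "a", "dst": "b", "proto": 17, "ip_id": 100,
            "index": i, "total": 3, "data": b"x"}
    assert reassemble(queues, frag) is None

# Retransmission: the Berkeley code assigns a NEW IP ID (101), so even
# though fragment 1 now arrives intact, it lands on a different queue.
frag = {"src": "a", "dst": "b", "proto": 17, "ip_id": 101,
        "index": 1, "total": 3, "data": b"x"}
assert reassemble(queues, frag) is None    # never combines with ID 100
print(len(queues))                          # 2 stranded reassembly queues
```

The "three or more tries" figure also follows directly: if each fragment is lost independently with probability p and the packet splits into n fragments, a transmission succeeds only with probability (1-p)^n, so the expected number of transmissions is 1/(1-p)^n -- at p = 0.2 and n = 8 that is roughly six.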
Meanwhile, of course, the useless fragments from previous transmissions are sitting on reassembly queues in the receiver (for 15 seconds, I think?), tying up buffering resources and increasing the chances that fragments will be dropped in the future!

In the current Berkeley code, it's possible to imagine workarounds for this problem for TCP: because TCP is in the kernel, it could have a side hook into the IP layer to tell it "this packet is a retransmission, don't give it a new IP identifier". For protocols like UDP, however, the acknowledgment and retransmission functions are done outside of the kernel, and the only kernel interface that's available is Berkeley's socket calls (sendto, recvfrom, etc.). Needless to say, the socket interface gives you 1) no way to find out what IP identifier a packet was sent with, and 2) no way to specify the IP identifier to use on an outgoing packet.

I don't really have any idea what to do about this problem. And it's not entirely Berkeley's fault; the BBN TCP/IP for 4.1bsd did the same thing... In any case, until there's a fix I don't think using IP fragmentation/reassembly when talking to 4.2bsd systems is a very good idea. -Larry -------
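The in-kernel workaround imagined above -- a side hook letting TCP say "this is a retransmission, keep the old identifier" -- amounts to one optional parameter on the IP output path. A hypothetical sketch in Python (the routine name, signature, and ID allocation are invented for illustration; this is not what 4.2BSD provides):

```python
# Hypothetical IP output routine with a retransmission hook.
# All names are invented; the real socket interface exposes no such thing.

_next_id = 0

def ip_output(payload, reuse_id=None):
    """Return (ip_id, payload), allocating a fresh ID unless one is supplied."""
    global _next_id
    if reuse_id is not None:
        return (reuse_id, payload)          # retransmission: keep the old ID
    _next_id = (_next_id + 1) & 0xFFFF      # 16-bit IP identification field
    return (_next_id, payload)

first_id, _ = ip_output(b"segment")                      # original transmission
retry_id, _ = ip_output(b"segment", reuse_id=first_id)   # retransmission
assert retry_id == first_id   # fragments of both copies can now be merged
```

An in-kernel TCP could call such a hook directly; the point of the complaint is that a user-level protocol built on sendto/recvfrom has no equivalent, since the socket interface neither reports nor accepts an IP identifier.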
Greenwald@SCRC-STONY-BROOK.ARPA (Michael Greenwald) (12/29/85)
Yeah, it's been noticed. I thought Dave had even commented on it in his "implementation notes", but I can't find my copies, so I didn't check.

I noticed this on Multics. (It hadn't actually happened, but if you remember, I was trying to decide when you'd rather have *large* IP packets, and when you'd rather restrict IP packets to the max network size. IP reassembly was cheaper than TCP reassembly -- fewer packet headers to process. I thought the only drawback was that in case of a lost fragment you had to retransmit the entire packet, but when I thought about it, I made the same realization that you did.) My (minimal) solution is mentioned below.

My Multics code had foo$retransmit_packet, foo$forward_packet, and foo$send_packet for each datagram protocol named "foo" (IP, UDP, GGP, and ICMP, as far as I can remember). Not only did it keep the same ID, but it eliminated a certain category of error checking and checksum generation, where possible.
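The Multics-style split described above can be sketched loosely: a separate retransmit entry point resends the packet exactly as first built, so the IP identifier is preserved and the checksum need not be regenerated. This is an invented Python illustration, not actual Multics code, and the toy checksum stands in for the real one:

```python
# Loose sketch of per-protocol send/retransmit entry points in the
# spirit of foo$send_packet / foo$retransmit_packet. Invented names.

class DatagramProtocol:
    def __init__(self):
        self._next_id = 0
        self._sent = {}                 # ip_id -> fully built packet

    def send_packet(self, data):
        """Build a fresh packet: new ID, checksum computed once."""
        self._next_id = (self._next_id + 1) & 0xFFFF
        pkt = {"ip_id": self._next_id, "data": data,
               "checksum": sum(data) & 0xFFFF}   # toy checksum
        self._sent[pkt["ip_id"]] = pkt
        return pkt

    def retransmit_packet(self, ip_id):
        """Resend as originally built: same ID, no checksum regeneration."""
        return self._sent[ip_id]

udp = DatagramProtocol()
p1 = udp.send_packet(b"hello")
p2 = udp.retransmit_packet(p1["ip_id"])
assert p2["ip_id"] == p1["ip_id"]       # receiver can merge both copies
```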