art@ACC.ARPA (05/23/87)
I don't recall ever seeing a suggestion such as the following, so I thought I'd throw the idea out for comment.

BACKGROUND: In the internet environment, two end hosts generally don't know the maximum IP datagram size that can traverse the network between them. The current solutions seem to be: 1) negotiate down to the minimum of the datagram limits of the directly connected networks; 2) if using a gateway, use the guaranteed limit of 576; 3) use a fixed TCP segment size and don't think about it. Solution 1 can cause lots of unneeded fragmentation (e.g. host-ethernet-arpanet-ethernet-host). Solution 2 may be unnecessarily suboptimal for the path. Solution 3 may perform as badly as either 1 or 2.

PROPOSAL: Add an entry to the IP routing table which gives the maximum datagram size for sending to each destination network. This entry would be initialized from the directly attached network used to reach that destination. Add a new ICMP message type which a gateway sends back to an originating host when it fragments an IP datagram; the message would identify the destination network and specify the size it had to fragment the datagram to. The originating host would update the limit for that network in its IP routing table and adjust its segment size (either immediately or on a new TCP connection) to optimize IP datagram size. Implementations that ignore the new ICMP message would continue to operate as always.

Any comments?

Art Berggreen
art@acc.arpa
------
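A minimal sketch of the per-destination size table proposed above, in Python. The ICMP "fragmentation report" and all names here are hypothetical, invented for illustration; no real stack has this interface.

```python
# Sketch of the proposed routing-table MTU entry, updated by a hypothetical
# ICMP message a gateway sends when it fragments a datagram.

class MtuCache:
    def __init__(self, first_hop_mtu):
        self.default = first_hop_mtu      # MTU of the directly attached network
        self.table = {}                   # destination network -> learned path MTU

    def mtu_for(self, dest_net):
        return self.table.get(dest_net, self.default)

    def on_fragmentation_report(self, dest_net, frag_size):
        # Only ever lower the estimate; a smaller reported size wins.
        if frag_size < self.mtu_for(dest_net):
            self.table[dest_net] = frag_size

cache = MtuCache(first_hop_mtu=1500)              # e.g. an Ethernet host
cache.on_fragmentation_report("10.0.0.0", 1007)   # an ARPANET hop fragmented us
assert cache.mtu_for("10.0.0.0") == 1007
assert cache.mtu_for("192.5.0.0") == 1500         # other nets keep the default
```

Hosts that never receive (or ignore) the report simply keep using the first-hop default, which matches the backward-compatibility claim above.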
JNC@XX.LCS.MIT.EDU ("J. Noel Chiappa") (05/23/87)
This, or a close variant of it, sounds like a good idea to me. It's been clear for some time that the TCP MaxSegSize negotiation only gives you part of what you want. I'd suggest two minor changes.

First, the message should give 'the maximum datagram size for sending to the destination *host + TOS*'. (It has to work if the destination net is subnetted, we don't need two messages, blah, blah, blah, standard JNC flame.)

It's also not clear whether you'd make it an ICMP message that was returned every time a datagram was fragmented. (In any case, you can simulate that using the existing Don't Fragment flag.) Such a message makes using fragmentation for real almost impossible; the extra network load every time a packet was fragmented would be significant (like hosts that ignore Redirects). I think you'd want a special mechanism which the user has to invoke, sort of like record route, where it goes along the path; it is initialized to the MTU of the outbound link from the host, and each node in the path resets the value to the min of that and the MTU of the next hop link. I don't think you want a special ICMP type, since then all switches would have to examine all packets going through to see if they were an ICMP packet of that type; extra overhead. I think the right thing is an IP option, 'record minimum MTU'.

In general, I think this is a good idea though.

	Noel
-------
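The hop-by-hop update of a 'record minimum MTU' option described above reduces to a running minimum. A sketch under illustrative assumptions (function name and path representation are invented):

```python
def record_min_mtu(path_link_mtus, origin_mtu):
    """Simulate a 'record minimum MTU' IP option: the option starts at the
    origin's first-hop MTU, and each node along the path resets it to the
    min of the carried value and its next-hop link MTU."""
    value = origin_mtu
    for next_hop_mtu in path_link_mtus:
        value = min(value, next_hop_mtu)
    return value

# The host-Ethernet-ARPANET-Ethernet-host path from the first message:
assert record_min_mtu([1007, 1500], origin_mtu=1500) == 1007
```

Because each hop only touches packets the user explicitly marked with the option, switches need not inspect every datagram, which is the point of preferring an IP option over a new ICMP type.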
CLYNN@G.BBN.COM (05/23/87)
Art,

I don't think that we NEED more message types (although I have suggested some). We have a Very Strong Hint already in place - all we need to do is have the IP reassembly code notice the size of the first fragment of a fragmented datagram and pass it up to the higher layers. TCP could then send the appropriate max seg size option to the other end; the routing table could record it for use in subsequent connections (by the time a packet is fragmented, it MAY be too late to help the current connection, depending on the packetization algorithm being used).

This assumes that the IP fragmentation algorithms split a datagram so that the size of the first fragment is determined by the MTU (and not, for example, into n equal pieces). Are there any implementations which do not make the first fragment as large as possible?

Note that this is one of the things that a system may do without the need for cooperation from other systems. Note also that, since the routes going and coming may not be the same, the size a system finds may not be the best one for the datagrams it sends.

Charlie
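The "Very Strong Hint" above can be sketched as a few lines in the reassembly path. The fragment representation here is illustrative (offset in bytes, payload length, more-fragments flag), not any real implementation's:

```python
def path_mtu_hint(fragments, ip_header_len=20):
    """Infer the MTU the fragmenting hop used from the first fragment
    (offset 0 with More Fragments set) of a reassembled datagram.
    Returns None if the datagram arrived unfragmented."""
    for offset, payload_len, more in fragments:
        if offset == 0 and more:
            return ip_header_len + payload_len   # size of the first fragment
    return None

# A 1500-byte datagram split at a 1007-byte-MTU hop: the largest
# multiple-of-8 payload fitting under that MTU is 984 bytes.
frags = [(0, 984, True), (984, 496, False)]
assert path_mtu_hint(frags) == 1004
```

Note the inferred value (1004) can land slightly under the true link MTU (1007) because fragment payloads must be multiples of 8 bytes; as a hint for choosing a max seg size, erring low is harmless.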
jas@MONK.PROTEON.COM (John A. Shriver) (05/24/87)
Two problems with looking at incoming fragments: 1. It only tells you that the other guy is sending packets that are too large; it says nothing about the path in your own sending direction. 2. The TCP MSS option is only valid in SYN packets, which almost always carry no data, so you will find out too late.

Another interesting problem to think about is that the fragmentation picture could shift dynamically as routes change. I'd guess that the first tool would be an ICMP record-MSS type, or some IP option. Of course, not many routers handle source routing yet...
CERF@A.ISI.EDU (05/24/87)
Charlie,

Variation in paths, and the possibility of multiple fragmentation of the first datagram fragment, suggest that your "Strong Hint" may also be very misleading.

Vint
geof@apolling.UUCP (Geof Cooper) (05/24/87)
I like the idea of an IP-level solution to the fragmentation problem, since it has application to UDP protocols (I know that none that exist today could use it, but that's no excuse for ignoring UDP).

Isn't there a destination unreachable message with the reason being "can't fragment and had to" (sorry, my ICMP spec is at the office)? If not, we could certainly add one. In that case, the idea is to always send TCP packets with the "don't fragment" bit set. Use the scheme suggested that keeps track of MTU's in the routing cache. Update the cache based on DU's received (decrease the MTU a bit and try again) -- time out the entry on a long timer to be able to detect new routes.

The obvious improvement is to have the ICMP message also include the MTU restriction that is appropriate -- that requires changing ICMP, of course, but it would probably be a good idea.
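The probe-and-back-off loop above can be sketched as follows. `try_send` stands in for the real network (it returns False when a "had to fragment" destination-unreachable comes back), and the step-down sizes are illustrative common MTUs, not part of any spec:

```python
PLATEAUS = [1500, 1006, 576, 508, 296]   # illustrative sizes to step down through

def next_smaller(mtu):
    """Next plateau strictly below mtu; floor at the smallest plateau."""
    for p in PLATEAUS:
        if p < mtu:
            return p
    return PLATEAUS[-1]

def probe_path_mtu(try_send, start_mtu):
    """Send with Don't Fragment set; on a 'fragmentation needed' error,
    decrease the size a bit and try again."""
    mtu = start_mtu
    while not try_send(mtu) and mtu > PLATEAUS[-1]:
        mtu = next_smaller(mtu)
    return mtu

# A path whose bottleneck hop passes at most 1006 bytes unfragmented:
assert probe_path_mtu(lambda size: size <= 1006, 1500) == 1006
```

The long timer mentioned above would simply expire the cached result so the probe repeats, catching a route change to a larger-MTU path.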
hwb@MCR.UMICH.EDU (Hans-Werner Braun) (05/24/87)
I don't understand what this whole fuss about messages to help negotiating the MSS is all about.

First of all, the assumption that the paths are symmetric in both directions is not valid any more, in particular with the NSFNET, which has now been running for almost a year, the upcoming networks from other agencies (like NASA), and maybe even the frequently quoted Interagency Research Internet. The previous tree structure of the Internet is certainly overtaken by events by now, or at least not guaranteed any more. All new schemes we come up with have to survive in a really meshed net of networks. Most if not all of what I have heard here so far assumes that there are symmetric paths.

Second, and as someone else has pointed out before, we only have influence on the MSS in the first packet exchange, i.e., as seen from the host. Any extension to negotiating the MSS otherwise is therefore non-trivial and needs to be well architected, including that all the host implementations will need to be changed.

Third, the Berkeley folks have changed their MSS attitude considerably with version 4.3bsd. The assumption now is to use local network sizes if you can be reasonably sure that the packets stay on the local physical network, and to use the only at least somewhat guaranteed maximum size of 576 bytes otherwise. This strikes me as an excellent idea.

What are we really talking about? Most of what we are discussing implies the difference between 576 bytes and 1500 bytes, i.e., the maximum packet size on an Ethernet. But 1500 bytes is less than three times the 576 bytes. In the longer run, i.e., a very few years, what we REALLY need are much larger packets than 1500 bytes. This will become imperative with the expected appearance of very high speed networks.
I cannot help thinking that a reasonable thing to do for today, supposing you want to reach other than your local net, is to stick with the 576 byte limit (a limit that is spelled out all over the place) and rather design future networks which allow at least 20K or 40K packets on very high speed networks which might run at multiple hundreds of megabits per second or higher. Even if the local speeds are much lower than this, there could be a higher speed piece in the middle. These short packets are in fact a real problem already at much lower speeds, and they are killing the gateways because of the overhead they impose.

	-- Hans-Werner
geof@imagen.UUCP (Geoffrey Cooper) (05/25/87)
> I don't understand what this whole fuss about messages to help negotiating
> the MSS is all about.
> ...
> Third, the Berkeley folks have changed their MSS attitude considerably with
> the version 4.3bsd. The assumption now is to use local network sizes if
> you can be reasonably sure that the packets stay on the local physical
> network, and to use the only at least somewhat guaranteed maximum size
> of 576 bytes otherwise. This strikes me as an excellent idea.

What about subnets? If I have a cluster of subnets, each of which has an MTU of 1500 bytes, I really want the extra speed. And today's gateways, which generally don't keep up with the full LAN bandwidth, make an excellent case for using large packets when sending to hosts that are off the current network.

- Geof
---
{decwrl,sun,saber}!imagen!geof
hwb@MCR.UMICH.EDU.UUCP (05/26/87)
Put your gateways into promiscuous mode and pretend you have a flat host address space. That assumes, of course, that the ARP caches time out properly. How about the 8K requests Sun uses for NFS, which appear on the wire as UDP fragments?

	-- Hans-Werner
louden@GATEWAY.MITRE.ORG.UUCP (05/26/87)
Note that 576 is not a guaranteed limit for the networks but for the reassembly buffers in the receiving host. The networks can fragment at smaller sizes, and some do, to get through sat-links and such.
yamo@AMES-NAS.ARPA (Michael J. Yamasaki) (05/26/87)
Greetings.

> From: Hans-Werner Braun <hwb@MCR.UMICH.EDU>
>
> What are we really talking about? Most of what we are discussing implies
> the difference from 576 bytes to 1500 bytes, i.e., the maximum record
> size on an Ethernet. But 1500 bytes is less than three times the 576 bytes.
> In the longer run, i.e., a very few years, what we REALLY need are much
> larger packets than 1500 bytes. This will become imperative with the
> expected appearance of very high speed networks.
> [...]

Uh, gee, I was really appreciative that this issue (IP MSS negotiation) was brought up, because at this very moment I've been grappling with the problem of high speed file transfer over an NSC HYPERchannel network. In the short term I developed a simple ACK-NAK protocol so that I could transfer in 56K blocks. (Why 56K is a long story. Why "blocks" instead of "packets" is that in the HYPERchannel world "packets" is not a useful term.) It just seems too boggling to tackle all at once the issues associated with the MSS problem that we face here at NASA/Ames, which would be required to use our normal vehicle for file transfer, namely TCP/IP.

Our MSS for the HYPERchannel network is 4K data + the HYPERchannel header. This stresses the buffer management schemes of our 4.2, 4.3 and TWG/SV versions quite well (can you say crash and burn when too many rcp's happen at once ;-). We have drivers on some of our machines (Cray 2, Amdahl 5840, SGI IRIS) which can handle greater than 4K data (up to 64K).
Consequently, our local net could conceivably have quite a range of MSSs. In addition, all of our local hosts, with the exception of the Cray 2, have Ethernet connections, and there are near-term plans to experiment with a token ring net and FDDI as soon as... Add Vitalinks, ARPAnet, gateways...

Anyway, I just wanted to say that this is not a solution in search of a problem. Selecting 576 for a gateway between Ethernet and HYPERchannel is a losing proposition. Add the additional wrinkle that the host, rather than the network, chooses the maximum size it can accept (HYPERchannel has no theoretical upper bound on segment size, although mileage may differ...). An end to end protocol for MSS negotiation seems very appropriate.

	-Yamo-

Thanks, Art, for bringing up an important issue.
braden@ISI.EDU (Bob Braden) (05/26/87)
HWB:

Your comments are generally right on the mark, especially about the need to dramatically increase packet sizes for high-speed nets in the future. However, I think that in the short term one cannot always ignore the performance difference between 576 and 1006 byte MTU's over typical WAN's. [By the way, this question never occurred to me before... what is the MTU of the NSFnet backbone? How about the regional networks?]

You suggest sticking to 576 byte packets. A better strategy may be to adopt a larger MTU (say, 1500) and let fragmentation fall where it will. Suppose you use 1500 across a path which has the ARPANET in the middle... then each FTP/SMTP packet will be split into 1000 and 500 byte pieces, for an average of 750 bytes per packet. That is 75% efficiency, good enough in many cases. If a particular host has a high percentage of its traffic across a WAN with a 1006-byte MTU, the host administrator can adjust the effective MTU parameter of the interface down to 1006 to get that last 25%. A host needs instrumentation in its IP layer to detect and report a situation of particularly bad statistics. Also, someone should remind us which will beat up ARPANET/MILNET more... 576 byte packets, or (1000,500) byte pairs. Why fuss about fragmentation into small packets, in a community that practices single-character echoplex terminal interactions??

So, how do we go towards 20K packets? What LAN technology will we need to get there? What will this imply for host interfaces? How can we take care of hosts that have not been converted to big buffers? It seems that when some parts of the Internet take very big packets while other parts still take miserable little ones, it will be absolutely necessary for a host to be able to learn the properties of a path it is using. Yes, Vint, paths do change dynamically, but as a practical matter they don't change that fast, and we are probably willing to take a performance hit if the new path makes our MTU choice suboptimal.

Bob Braden
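The split-where-it-falls arithmetic above is easy to check mechanically. A sketch, not any particular gateway's code (real IP fragments carry payloads in multiples of 8 bytes, which shifts the round numbers slightly):

```python
def fragment_sizes(datagram_len, link_mtu, iphdr=20):
    """Split one IP datagram crossing a smaller-MTU link into fragment
    sizes; every fragment payload but the last is a multiple of 8 bytes."""
    max_payload = (link_mtu - iphdr) // 8 * 8
    payload = datagram_len - iphdr
    sizes = []
    while payload > 0:
        take = min(max_payload, payload)
        sizes.append(take + iphdr)   # each fragment re-carries an IP header
        payload -= take
    return sizes

# A 1500-byte datagram crossing a 1006-byte-MTU ARPANET hop:
frags = fragment_sizes(1500, 1006)
assert frags == [1004, 516]          # average ~760 bytes per packet,
                                     # close to the (1000, 500) figure above
```

So the back-of-envelope 75% efficiency estimate survives the 8-byte alignment detail.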
dpk@BRL.ARPA (Doug Kingston) (05/26/87)
Bob,

The main argument against fragmenting is that when the loss rate goes up, as it has lately under conditions of heavy congestion, the total throughput drops dramatically, since the loss of any one fragment kills the whole set of fragments.

					-Doug-
Mills@UDEL.EDU (05/26/87)
Bob,

The NSFNET fuzzballs presently use an MTU of 576. With the latest fuzzware the MTUs can be set higher, while still using the buffer pool efficiently. I chose 576 partly in defense of the buffer pool (tinygrams used to sit in a full packet buffer, but not any more) and partly to keep delays small. In a rash moment, I've even thought about dynamic fragmentation with preemption - a 20K monstergram getting sliced when a high-priority tinygram arrives. For two dollars, I bet you don't remember where and when that suggestion first came up (hint: it was at an overseas meeting). The fuzz now have the hooks for schedule-to-deadline service. Thoughts of stream-type service are prancing through my disheveled mind, but I need to sweep out some other junk first.

Dave
geof@apolling.UUCP.UUCP (05/26/87)
> Ideally, one would look for a "type of service" routing capability which
> could avoid fragmentation - rather than having to construct a path by
> trial and error ....

True, although you'd need more than one bit to describe the service you want. The desired negotiation (repeated constantly, since routes may change) is:

	tcp:	 I want to use N bytes per packet on this connection.
	network: I can send M bytes, M<N, without fragmenting.
	tcp:	 OK, then I'll use M bytes per packet.

If there were an IP "MTU" option that is filled in by BOTH TCPs for EVERY packet, and modified to min(myMTU, packetMTU) by each gateway, the problem would be solved, since the local network layer could cache M on a per-host basis. Hmm... I guess that you could have each TCP generate the option only when it saw packets that were fragmented (although you wouldn't ever find out that the MTU had increased). I wonder how that unbends the gateways?

- Geof
NJG@CORNELLA.BITNET.UUCP (05/27/87)
It would be interesting to know how common fragmentation actually is. Is there some size (greater than 576 or not) that will TYPICALLY not be fragmented? Has anyone measured this? Are there known common gateways, IMPs, etc. that have limits smaller than the typical 1500 or so byte Ethernet limit?
karels%okeeffe@UCBVAX.BERKELEY.EDU.UUCP (05/27/87)
There are a few other points in this discussion that haven't been made. One is that looking at IP fragmentation in received packets doesn't work for one-way connections (FTP) or connections with asymmetrical data flow (Telnet, most everything else). As Doug pointed out, using large packets and depending on IP fragmentation loses badly when the network becomes lossy. I've always resisted paying for 2 different fragmentation/reassembly mechanisms at once, and only the TCP level allows acknowledgement (and progress) when only part of the data gets through. Also, under various circumstances, IP datagrams may be fragmented more than once, resulting in lots of packets of varying sizes, lots of them tiny. When this happens, hardware or software limitations are likely to cause some of these small packets just after larger ones to be lost. (One such bug in the ACC LH/DH and the 4.2BSD driver caused 1024-byte packets to be lost with high probability because of fragmentation.)

There were a few comments about use of larger sizes on the local network(s). 4.3BSD's algorithm is to use a large size under the MTU of the outgoing network interface if the destination is on the same logical network (another subnet of the same network is considered local). This assumes that the MTU on any segment of the network is not unreasonably bad for other parts of the network. If packet size for a path were determined and propagated by the routing protocols, that wouldn't help hosts that don't listen to the routing protocols.

I would very much like to see options for determining the min MTU, min throughput and the hop count of a path. In order to return the information to the sender, this would have to be done as an ICMP message that is reflected and returned by the destination, or hosts/gateways would have to use the convention of preserving such IP options when echoing ICMP messages.
If this were an ICMP message, it could be defined so that each gateway replaced the IP destination address with the IP address of the next-hop gateway, so that gateways need not examine each datagram forwarded. The original IP address would be stored in the ICMP part of the packet or in an IP option.

Mike
narten@PURDUE.EDU (05/27/87)
It has already been pointed out that, if one goes by the spec, MSS negotiation can only be done at connection setup time. One can still dynamically adjust the segment size, though, provided that the negotiated value is high at the outset. Just because the negotiated MSS is large doesn't mean that segments of that size have to be sent. In other words, negotiate a value that is too large rather than too small, and use as large a value as gets through without fragmenting (without exceeding the negotiated value, of course). Such a scheme is also compatible with what is currently implemented: old TCP implementations will negotiate small MSS's.

Thomas
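The rule above is a simple clamp: stay under the MSS negotiated at setup, and under whatever the path is currently known to carry unfragmented. A sketch with an illustrative function name; the 40-byte figure assumes minimum IP + TCP headers:

```python
def segment_size(negotiated_mss, learned_path_mtu, ip_tcp_headers=40):
    """Largest segment that respects both the MSS negotiated at connection
    setup and the path MTU learned later (however it was learned)."""
    return min(negotiated_mss, learned_path_mtu - ip_tcp_headers)

assert segment_size(1460, 576) == 536    # path shrank: back off below the MSS
assert segment_size(536, 1500) == 536    # path grew: never exceed the MSS
```

This is why negotiating high costs nothing: the learned path MTU can always pull the effective size down, but nothing can raise it past the negotiated ceiling.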
hwb@MCR.UMICH.EDU (Hans-Werner Braun) (05/27/87)
Gosh. It really looks as if you are interpreting me wrongly. I am very much in favor of longer packets. What I tried to describe, though maybe I wasn't clear enough, was that a solution will be intrusive for the hosts; the simple scheme originally discussed just won't work. Nobody keeps you from sending fragments for now, and especially nobody is keeping you from doing whatever you please within your local environment. If your Cray-2 sends a graphics object stream to an IRIS, you really don't want to put current generation gateways into the middle anyway, in particular not if you think along the lines of the speed SGI is considering for the fairly near future. You even make a case for what I was saying, namely that we need to develop link/physical level devices which do much better than the current 1500 bytes in use on Ethernets.

	-- Hans-Werner
steve@BRL.ARPA.UUCP (05/27/87)
The NSFNET Management solicitation specifies MTU=1500 for the backbone. -s
alan@mn-at1.UUCP (05/27/87)
In article <8705261555.AA14106@ames-nas.arpa> yamo@AMES-NAS.ARPA (Michael J. Yamasaki) writes:
>
> Selecting 576 for a gateway between ethernet and HYPERchannel is
> a losing proposition.

Yes. In 1985, we did some benchmarks to determine the impact of running screen editors on the Cray-2. The worry was that CPU overhead for context switching would be prohibitively expensive. It turned out that while the CPU impact was acceptable (8% usage of 1 CPU for 32 users running heavy simulated vi), the HYPERchannel was completely saturated. It turns out that a HYPERchannel is limited to approximately 300 blocks/sec, independent of block size. By plotting block size against bandwidth, you get a curve that flattens out to around 8 megabits/sec for a block size of 64-128 kilobits, but it nudges 11 megabits/sec for block sizes larger than 512 kilobits. (Not coincidentally, this is the block size we chose for the Extended File System.) These are special conditions (dedicated LAN, no gateway, low lossage), obviously, but it would be nice if future protocols could be designed not to limit packet sizes to 64K just because of a 16 bit field width. (Shades of Intel :->)

--
Alan Klietz
Minnesota Supercomputer Center (*)
1200 Washington Avenue South
Minneapolis, MN  55415
UUCP:  ..rutgers!meccts!quest!mn-at1!alan    Ph: +1 612 626 1836
       ..ihnp4!dicome!mn-at1!alan
ARPA:  alan@uc.msc.umn.edu    (was umn-rei-uc.arpa)
(*) An affiliate of the University of Minnesota
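The curve described above is what a fixed per-block overhead predicts. A toy model, with parameters fitted by eye to the quoted figures (the ~300 blocks/sec ceiling and an assumed ~11 Mbit/s asymptotic channel rate), not measured values:

```python
def hyperchannel_mbps(block_kbits, overhead_s=1 / 300.0, wire_mbps=11.0):
    """Toy throughput model: each block pays a fixed per-block overhead
    (the ~300 blocks/sec limit) plus serialization time on the channel."""
    block_bits = block_kbits * 1000.0
    t = overhead_s + block_bits / (wire_mbps * 1e6)
    return block_bits / t / 1e6

# Reproduces the shape of the quoted curve:
assert 6.5 < hyperchannel_mbps(64) < 8.5     # roughly flattening near 64-kilobit blocks
assert 9.5 < hyperchannel_mbps(512) < 11.0   # nudging 11 Mbit/s at 512-kilobit blocks
```

Bigger blocks amortize the fixed per-block cost, which is exactly the argument for not capping packet sizes at 64K for field-width reasons.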
jqj@GVAX.CS.CORNELL.EDU.UUCP (05/28/87)
One issue I'd like to see getting more attention in the current discussion of transaction-parameter negotiation is the latency/thruput tradeoff. It's clear that for an ftp-like application all the user really cares about is thruput, but for the rapidly growing rpc-based style of interaction latency becomes critical. My client programs care about the total time between the dispatch of an RPC request and the response; if I'm going to transfer a lot of data I'm willing to negotiate a separate channel for that transfer. No, this is not simply a UDP vs. TCP issue. Under different circumstances I might want to use either protocol to carry my RPC traffic, or for that matter might want a special-purpose protocol tuned for my particular style of RPC. There are still some general optimization issues that the IP community needs to address. Interested readers should see the discussion of special- vs. general- purpose networking protocols in the latest Transactions on Computer Systems (TOCS).
geof@apolling.UUCP (Geof Cooper) (05/28/87)
> In other words, negotiate a value that is too large rather than too
> small, and use as large a value as gets through without fragmenting
> (without exceeding the negotiated value of course). Such a scheme is
> also compatible with what is currently implemented. Old TCP
> implementations will negotiate small MSS's.

Ummm, the sending TCP doesn't know that the packets are being fragmented; the receiving TCP does. As John Wobus states, you have to treat the two directions differently. If a TCP-level solution is really the way of choice (I don't believe it is), then just allow the MSS option to exist on ANY packet. Beyond the first packet, the interpretation is that it is an advisory value relating to fragmentation. I think this is the smallest change to the TCP spec that makes things work. It also should work fine with all existing implementations, since the MSS option should just be ignored past the first packet (there is probably some implementation out there that sends mail to the system maintainer of the other system, flaming at him to fix his TCP...).

- Geof
jon@CS.UCL.AC.UK (Jon Crowcroft) (06/01/87)
Of course, if you use X.25 you don't have this problem :-). The switches can use level 2 to support fragment sizes properly over each hop, and level 3 to get end-to-end packet sizes efficient, and a well known transport protocol over that... Then all you need is to work out how to get CCITT to accept giant window size and packet size options to make it work well over a big network.

The strain of hop-by-hop versus end-to-end arguments starts to show when you try to use an underengineered mil-spec net for heavy duty service. What Vint says sounds good to me - figure out how to get hop-by-hop information back to the end-to-end protocols so they can be taught how to behave properly. The real problem here is for short-lived protocol entities - e.g. query systems running in PCs - which don't get a chance to use previously cached info...

jon
hinden@CCV.BBN.COM (Robert Hinden) (06/01/87)
Nick,

In fact most of the networks that make up the Internet have MTU's smaller than 1500 bytes. Here is a summary:

	Wideband	2048
	Pronet Ring	2040
	Ethernet	1500
	Arpanet		1007
	NSFNET		 576
	PDN		 512
	Suran		 320
	Satnet		 256
	Packet Radio	 226

Hope this helps.

Bob
Mills@UDEL.EDU.UUCP (06/02/87)
Jon,

The X.25 model requires X.75 gateways to handle M-bit repacketizing between two networks with different packet sizes. This requires route-binding between the gateways, which runs completely contrary to the basic Internet architecture. If we were to accept your suggestion, we would have to drop this fundamental cornerstone of the model. Does the increment in performance possibly achieved by route-binding justify the loss in robustness and flexibility inherent in the present model? I'm not looking for a debate, just refining the perspective.

Dave
jon@CS.UCL.AC.UK.UUCP (06/02/87)
Dave,

I agree X.75 goes contrary to the internet philosophy. My suggestion is that something between the current route binding of X.25/X.75 and the fully dynamic routing of IP is required, plus some. I think you need some kind of weak resource reservation scheme, combined with some means for end-to-end protocols to find out what the resources out there are, since pure feedback from the end-to-end protocol will give you information too late to adjust MTUs, retransmission timers etc., and will always give you unstable algorithms.

Jon