[comp.protocols.tcp-ip] IP datagram sizes

art@ACC.ARPA (05/23/87)

I don't recall ever seeing a suggestion such as follows, so I thought
that I'd throw the idea out for comment.

BACKGROUND:

In the internet environment, two end hosts generally don't know the
maximum IP datagram size that can traverse the network between them.  The
current solutions seem to be:

	1) negotiate down to the minimum of the datagram limits of
	   the directly connected networks.

	2) if using a gateway, use guaranteed limit of 576.

	3) use a fixed TCP segment size and don't think about it.

Solution 1 can cause lots of unneeded fragmentation
(i.e. host-ethernet-arpanet-ethernet-host).
Solution 2 may be unnecessarily suboptimal for the path.
Solution 3 may perform as either 1 or 2.

PROPOSAL:

Add an entry to the IP routing table which gives maximum datagram size
for sending to this destination network.  This entry would be initialized
based on the directly attached network used to send to that destination.
Add a new ICMP message type which a gateway sends back to an originating
host when it fragments an IP datagram.  The ICMP message would identify
the destination network and specify what size it had to fragment the
datagram to.  The originating host would update the limits for that network
in its IP routing table.  The originating host should adjust its segment
size (either immediately or on new TCP connection) to optimize IP datagram
size.  If current implementations ignore the new ICMP message, then they
would continue to operate as always.
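
For concreteness, here is a rough sketch in Python of the host side of the
proposal; the ICMP "datagram fragmented" report and its fields (dest_net,
frag_size) are hypothetical, since no such message exists in the current spec:

# Sketch of the proposed per-destination-network datagram-size cache.
# The ICMP "datagram fragmented" report is hypothetical; its fields
# (dest_net, frag_size) are assumptions, not part of any current spec.

class RouteEntry:
    def __init__(self, gateway, max_dgram):
        self.gateway = gateway          # next hop for this destination net
        self.max_dgram = max_dgram      # current best-known datagram limit

class RoutingTable:
    def __init__(self, local_mtu):
        self.local_mtu = local_mtu      # MTU of the directly attached net
        self.entries = {}               # dest_net -> RouteEntry

    def lookup(self, dest_net, gateway):
        # Initialize the limit from the outgoing interface, per the proposal.
        if dest_net not in self.entries:
            self.entries[dest_net] = RouteEntry(gateway, self.local_mtu)
        return self.entries[dest_net]

    def note_fragmentation(self, dest_net, frag_size):
        # Called when the (hypothetical) ICMP report arrives from a gateway.
        entry = self.entries.get(dest_net)
        if entry and frag_size < entry.max_dgram:
            entry.max_dgram = frag_size  # TCP can re-derive its segment size

# Example: an Ethernet-attached host learns the path to net 10 is narrower.
table = RoutingTable(local_mtu=1500)
table.lookup(dest_net="10", gateway="gw1")
table.note_fragmentation(dest_net="10", frag_size=1007)
print(table.entries["10"].max_dgram)    # 1007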

Any Comments?
						Art Berggreen
						art@acc.arpa

------

JNC@XX.LCS.MIT.EDU ("J. Noel Chiappa") (05/23/87)

	This or a close variant of it sounds like a good idea to me.
It's been clear for a time that the TCP MaxSegSize negotiation only
gives you part of what you want.

	I'd suggest two minor changes. First, the message gives 'the
maximum datagram size for sending to the destination *host + TOS*'.
(It has to work if the destination net is subnetted, we don't need two
messages, blah, blah, blah standard JNC flame.)
	It's also not clear whether you'd make it an ICMP message that
was returned every time a message was fragmented. (In any case, you
can simulate that using the existing Don't Fragment flag.) Such a
message makes using fragmentation for real almost impossible; the
extra network load every time a packet was fragmented would be
significant (like hosts that ignore Redirects). I think you'd want a
special mechanism which the user has to invoke, sort of like record
route, where it goes along the path; it is initialized to the MTU of
the outbound link from the host, and each node in the path resets the
value to the min of that and of the MTU on the next hop link. I don't
think you want a special ICMP type, since then all switches would
have to examine all packets going through to see if they were an ICMP
packet of that type; extra overhead. I think the right thing is an IP
option, 'record minimum MTU'.
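
A small Python sketch of how such a 'record minimum MTU' option might behave,
assuming each hop simply replaces the carried value with the minimum of that
value and its next-hop link MTU (the option itself is hypothetical):

# Hypothetical "record minimum MTU" IP option, simulated over a path.
# Each node sets the carried value to min(value, MTU of its next-hop link).

def record_min_mtu(first_hop_mtu, next_hop_mtus):
    value = first_hop_mtu              # initialized to the host's outbound link
    for mtu in next_hop_mtus:
        value = min(value, mtu)        # each switch on the path updates it
    return value                       # destination returns this to the sender

# Example path: Ethernet -> ARPANET -> Ethernet
print(record_min_mtu(1500, [1007, 1500]))   # 1007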
	In general, I think this is a good idea though.

	Noel
-------

CLYNN@G.BBN.COM (05/23/87)

Art,
	I don't think that we NEED more message types (although I have
suggested some).  We have a Very Strong Hint already in place - all we
need to do is to have the IP reassembly code notice the size of the
First fragment of a fragmented datagram and pass it up to the higher
layers.  TCP could then send the appropriate max seg size option to the
other end; the routing table could record it for use in subsequent
connections (by the time a packet is fragmented, it MAY be too late
to help the current connection, depending on the packetization
algorithm being used).
	This assumes that the IP fragmentation algorithms split a
datagram so that the size of the first fragment is determined by the
MTU (and not, for example, into n equal pieces).  Are there any
implementations which do not make the first fragment as large as possible??
	Note that this is one of the things that a system may do without
the need for cooperation from other systems.  Note also that since the
routes going and coming may not be the same, the size a system finds may
not be the best one for the datagrams it sends.
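
A sketch of that hint in Python, assuming reassembly code that simply remembers
the size of the first (offset-zero) fragment and exposes it to the layer above;
the function and field names here are made up for illustration:

# Sketch: note the size of the first fragment of a fragmented datagram and
# record it per source, so TCP can pick an MSS for later connections.
# This only works if the fragmenter makes the first fragment MTU-sized.

first_fragment_hint = {}    # src_addr -> observed first-fragment size

def on_fragment(src_addr, frag_offset, total_len, more_fragments):
    # A first fragment has offset 0 and the More Fragments bit set.
    if frag_offset == 0 and more_fragments:
        prev = first_fragment_hint.get(src_addr)
        if prev is None or total_len < prev:
            first_fragment_hint[src_addr] = total_len

def suggested_mss(src_addr, default=536):
    # 40 bytes of IP + TCP header assumed, no options.
    hint = first_fragment_hint.get(src_addr)
    return (hint - 40) if hint else default

on_fragment("192.5.48.2", frag_offset=0, total_len=1007, more_fragments=True)
print(suggested_mss("192.5.48.2"))   # 967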

Charlie

jas@MONK.PROTEON.COM (John A. Shriver) (05/24/87)

Two problems with looking at incoming fragments.

1. It tells you the other guy is sending packets that are too large.

2. The TCP MSS option is only valid in SYN packets, which almost
always have no data.  You will find out too late.

Another interesting problem to think about is that the fragmentation
issue could shift dynamically as routes change.

I'd guess that the first tool would be an ICMP record MSS type, or
some IP option.  Of course, not many routers handle source routing
yet...

CERF@A.ISI.EDU (05/24/87)

Charlie,

Variations in paths and the possibility of multiple fragmentation on
the first datagram fragment suggest that your "Strong Hint" may also
be very misleading.

Vint

geof@apolling.UUCP (Geof Cooper) (05/24/87)

I like the idea of an IP-level solution to the fragmentation problem
since it has application to UDP protocols (I know that none that exist
today could use it, but that's no excuse for ignoring UDP).

Isn't there a destination unreachable message with the reason being
"can't fragment and had to" (sorry my ICMP spec is at the office)?  If
not, we could certainly add one.

In that case, the idea is to always send TCP packets with the "don't
fragment" bit set.  Use the scheme suggested that keeps track of MTU's
in the routing cache.  Update the cache based on DU's received
(decrease the MTU a bit and try again) -- time out the entry on a long
timer to be able to detect new routes.

The obvious improvement is to have the ICMP message also include the
MTU restriction that is appropriate -- that requires changing ICMP, of
course, but it would probably be a good idea.
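
A sketch of that probing loop in Python, assuming a per-destination routing
cache, the Don't Fragment bit set on every TCP packet, and a "fragmentation
needed" flavor of Destination Unreachable (with the improved form optionally
carrying the gateway's MTU); the plateau table is an assumption:

# Sketch: send with DF set, shrink the cached MTU on "fragmentation needed"
# Destination Unreachables, and age the entry so new (wider) routes are found.
import time

PLATEAUS = [1500, 1006, 576, 508, 296]   # fallback sizes to try (assumed)

class PathMtuCache:
    def __init__(self, ttl=600):
        self.ttl = ttl
        self.cache = {}                  # dest -> (mtu, timestamp)

    def mtu(self, dest, if_mtu):
        mtu, when = self.cache.get(dest, (if_mtu, time.time()))
        if time.time() - when > self.ttl:
            mtu = if_mtu                 # long timer expired: try big again
        return mtu

    def on_dest_unreachable(self, dest, reported_mtu=None):
        mtu, _ = self.cache.get(dest, (PLATEAUS[0], 0))
        if reported_mtu:                 # improved ICMP carries the limit
            new = reported_mtu
        else:                            # otherwise step down and try again
            new = next((p for p in PLATEAUS if p < mtu), PLATEAUS[-1])
        self.cache[dest] = (new, time.time())

cache = PathMtuCache()
cache.on_dest_unreachable("10.0.0.5")            # no MTU in the message
print(cache.mtu("10.0.0.5", if_mtu=1500))        # 1006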

hwb@MCR.UMICH.EDU (Hans-Werner Braun) (05/24/87)

I don't understand what this whole fuss about messages to help negotiate
the MSS is all about.

First of all, the assumption that the paths are symmetric in both
directions is not valid any more, in particular with the NSFNET, which has
now been running for almost a year, and the upcoming networks from other
agencies (like NASA) and maybe even the frequently quoted Interagency
Research Internet. The previous tree structure of the Internet has
certainly been overtaken by events by now, or at least is not guaranteed
any more. All new schemes we come up with have to survive in a real meshed
net of networks. Most if not all of what I have heard here so far assumes
that there are symmetric paths.

Second, and as someone else has pointed out before, we only have influence
on the MSS in the first packet exchange, i.e., as seen from the host. Any
extension to negotiating the MSS otherwise is therefore non-trivial and
needs to be well architected, which means that all the host implementations
will need to be changed.

Third, the Berkeley folks have changed their MSS attitude considerably with
the version 4.3bsd. The assumption now is to use local network sizes if
you can be reasonably sure that the packets stay on the local physical
network, and to use the only at least somewhat guaranteed maximum size
of 576 bytes otherwise. This strikes me as an excellent idea.

What are we really talking about? Most of what we are discussing implies
the difference from 576 bytes to 1500 bytes, i.e., the maximum packet
size on an Ethernet. But 1500 bytes is less than three times 576 bytes.
In the longer run, i.e., a very few years, what we REALLY need are much
larger packets than 1500 bytes. This will become imperative with the
expected appearance of very high speed networks. I cannot help
thinking that a reasonable thing to do for today, supposing you want to
reach other than your local net, is to stick with the 576 byte
limit (a limit that is spelled out all over the place) and rather design
future networks which allow at least 20K or 40K packets on very high speed
networks which might run at multiple hundreds of megabits per second or
higher. Even if the local speeds are much lower than this, there could be
a higher speed piece in the middle. These short packets are in fact a real
problem already at much lower speeds, and they are killing the gateways
because of the overhead they impose.

	-- Hans-Werner

geof@imagen.UUCP (Geoffrey Cooper) (05/25/87)

> I don't understand what this whole fuss about messages to help negotiate
> the MSS is all about.
> ...
> Third, the Berkeley folks have changed their MSS attitude considerably with
> the version 4.3bsd. The assumption now is to use local network sizes if
> you can be reasonably sure that the packets stay on the local physical
> network, and to use the only at least somewhat guaranteed maximum size
> of 576 bytes otherwise. This strikes me as an excellent idea.

What about subnets?  If I have a cluster of subnets, each of which has
an MTU of 1500 bytes, I really want the extra speed.  And today's
gateways, which generally don't keep up with the full LAN bandwidth,
make an excellent case for using large packets when sending to hosts
that are off the current network.

- Geof
---
{decwrl,sun,saber}!imagen!geof

hwb@MCR.UMICH.EDU.UUCP (05/26/87)

Put your gateways into promiscuous mode and pretend you have a flat host
address space. That assumes of course that the ARP caches time out properly.
How about the 8K requests Sun uses for NFS, which appear as UDP fragments?

	-- Hans-Werner

louden@GATEWAY.MITRE.ORG.UUCP (05/26/87)

Note that 576 is not the guaranteed limit for the networks but for the
reassembly buffers in the receiving host.  The networks can fragment
at smaller sizes and some do to get through sat-links and such.

yamo@AMES-NAS.ARPA (Michael J. Yamasaki) (05/26/87)

Greetings.

> From: Hans-Werner Braun <hwb@MCR.UMICH.EDU>
> 
> What are we really talking about? Most of what we are discussing implies
> the difference from 576 bytes to 1500 bytes, i.e., the maximum packet
> size on an Ethernet. But 1500 bytes is less than three times 576 bytes.
> In the longer run, i.e., a very few years, what we REALLY need are much
> larger packets than 1500 bytes. This will become imperative with the
> expected appearance of very high speed networks. I cannot help
> thinking that a reasonable thing to do for today, supposing you want to
> reach other than your local net, is to stick with the 576 byte
> limit (a limit that is spelled out all over the place) and rather design
> future networks which allow at least 20K or 40K packets on very high speed
> networks which might run at multiple hundreds of megabits per second or
> higher. Even if the local speeds are much lower than this, there could be
> [...]

Uh, gee, I was really appreciative that this issue (IP MSS negotiation) was 
brought up because at this very moment I've been grappling with the problem 
of high speed file transfer over a NSC HYPERchannel network.  In the short 
term I developed a simple ACK-NAK protocol so that I could transfer in 56K 
blocks (Why 56K is a long story. Why "blocks" instead of "packets" is that in 
the HYPERchannel world "packets" is not a useful term.).

It just seems too boggling to tackle all at once the issues associated with
the MSS problem that we face here at NASA/Ames, which would be required in
order to use our normal vehicle for file transfer, namely TCP/IP.  Our MSS for
the HYPERchannel network is 4K data + the HYPERchannel header.  This stresses
the buffer management schemes of our 4.2, 4.3 and TWG/SV versions quite
well (can you say crash and burn when too many rcp's happen at once ;-).
We have drivers on some of our machines (Cray 2, Amdahl 5840, SGI IRIS)
which can handle greater than 4K data (up to 64K).  Consequently, our local
net could conceivably have quite a range of MSSs.  In addition, all of our
local hosts with the exception of the Cray 2 have Ethernet connections and
plans are in the near term to experiment with a token ring net and FDDI
as soon as... Add Vitalinks, ARPAnet, gateways... 

Anyway, I just wanted to say that this is not a solution in search of a 
problem.  Selecting 576 for a gateway between ethernet and HYPERchannel is
a losing proposition.  Add the additional wrinkle that the host rather than
the network chooses the maximum size it can accept (HYPERchannel has no
theoretical upper bound on segment size although mileage may differ...).
An end to end protocol for MSS negotiation seems very appropriate.

                                                   -Yamo-

Thanks, Art, for bringing up an important issue.

braden@ISI.EDU (Bob Braden) (05/26/87)

HWB:

Your comments are generally right on the mark, especially about the need
to dramatically increase packet sizes for high-speed nets in the
future.  However, I think that in the short term one cannot
always ignore the performance difference between 576 and 1006 byte MTU's
over typical WAN's.  [By the way, this question never occurred to me
before... what is the MTU of the NSFnet backbone? How about the regional
networks?] 

You suggest sticking to 576 byte packets. A better strategy may be to
adopt a larger MTU (say, 1500) and let fragmentation fall where it will.
Suppose you use 1500 across a path which has the ARPANET in the middle...
then each FTP/SMTP packet will be split into 1000 and 500 byte pieces,
for an average of 750 bytes per packet.  That is 75% efficiency, good
enough in many cases.  If a particular host has a high percentage of its
traffic across a WAN with a 1006-byte MTU, the host administrator can
adjust the effective MTU parameter of the interface down to 1006 to get
that last 25%.  A host needs instrumentation in its IP layer to detect and
report particularly bad fragmentation statistics.  Also, someone should
remind us which will beat up ARPANET/MILNET more... 576-byte packets, or
(1000,500)-byte pairs.
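
A quick back-of-the-envelope check of that figure in Python, using the numbers
from the text and ignoring header overhead to keep the arithmetic simple:

# Rough efficiency of sending 1500-byte datagrams over a ~1000-byte-MTU path:
# each datagram becomes a ~1000-byte and a ~500-byte fragment, so the average
# packet carries ~750 bytes versus the ~1000 the path could have carried.
send_size, path_mtu = 1500, 1000
fragments = [path_mtu, send_size - path_mtu]    # [1000, 500]
avg = sum(fragments) / len(fragments)           # 750 bytes per packet
print(avg / path_mtu)                           # 0.75 -> about 75% efficiency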

   Why fuss about fragmentation into small packets, in a community
   that practices single-character echoplex terminal interactions??

So, how do we go towards 20K packets?  What LAN technology will we need
to get there?  What will this imply for host interfaces?  How can we take
care of hosts that have not been converted to big buffers?  It seems that
when some parts of the Internet take very big packets while other parts
still take miserable little ones, it will be absolutely necessary for a
host to be able to learn the properties of a path it is using.  Yes,
Vint, paths do change dynamically, but as a practical matter they don't
change that fast, and we are probably willing to take a performance hit
if the new path makes our MTU choice suboptimal. 

Bob Braden

dpk@BRL.ARPA (Doug Kingston) (05/26/87)

Bob,
	The main argument against fragmenting is that when
the loss rate goes up, as it has lately under conditions
of heavy congestion, the total throughput drops dramatically
since the loss of any one fragment can kill a whole set of packets.
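
To put a number on that, a tiny Python illustration (the 5% loss rate is an
arbitrary assumption, not a measurement):

# The datagram survives only if every fragment does, so with per-packet loss
# rate p and n fragments the delivery probability is (1-p)**n, and any single
# fragment loss costs a retransmission of the whole datagram.
p = 0.05
for n in (1, 2, 4):
    print(n, (1 - p) ** n)   # roughly 0.95, 0.90, 0.81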

-Doug-

Mills@UDEL.EDU (05/26/87)

Bob,

The NSFNET fuzzballs presently use an MTU of 576. With the latest fuzzware
the MTU can be set higher, while still using the buffer pool efficiently.
I chose 576 partly in defense of the buffer pool (tinygrams still sat
in a full packet buffer, but not any more) and partly to keep delays
small. In a rash moment, I've even thought about dynamic fragmentation
with preemption - a 20K monstergram getting sliced when a high-priority
tinygram arrives. For two dollars, I bet you don't remember where and
when that suggestion first came up (hint: it was at an overseas meeting).
The fuzz now have the hooks for schedule-to-deadline service. Thoughts
of stream-type service are prancing through my disheveled mind, but I
need to sweep out some other junk first.

Dave

geof@apolling.UUCP.UUCP (05/26/87)

 >      Ideally, one would look for a "type of service" routing capability which
 >      could avoid fragmentation - rather than having to construct a path by
 >      trial and error .... 

True, although you'd need more than one bit to describe the service you want.
The desired negotiation (repeated constantly, since routes may change) is:

        tcp: I want to use N bytes per packet on this connection
        network: I can send M bytes, M<N, without fragmenting
        tcp: OK, then I'll use M bytes per packet.

If there were an IP "MTU" option that was filled in by BOTH TCPs for EVERY
packet, and modified to min(myMTU, packetMTU) by each gateway, the problem
would be solved, since the local network layer could cache M on a per-host basis.
Hmm... I guess that you could have each TCP generate the option only when it
saw packets that were fragmented (although you wouldn't ever find out that the
MTU has increased).
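
A sketch of that negotiation in Python, assuming the hypothetical per-packet
"MTU" option just described; every gateway clamps the carried value to its own
next-hop MTU and the receiver's network layer caches it per source host:

# Hypothetical per-packet "MTU" IP option: the sender fills in what it wants,
# each gateway clamps it to its next-hop link, and the receiver caches it.

def forward(option_mtu, next_hop_mtu):
    # What each gateway on the path would do to the option.
    return min(option_mtu, next_hop_mtu)

def deliver(cache, src, option_mtu):
    # Receiver caches the path value; its TCP can echo it back the same way.
    cache[src] = option_mtu

# tcp A wants N=1500; the path clamps it to M=1007; tcp B will then use M.
cache = {}
mtu = 1500
for hop_mtu in (1500, 1007, 1500):
    mtu = forward(mtu, hop_mtu)
deliver(cache, "hostA", mtu)
print(cache["hostA"])    # 1007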

I wonder how that unbends the gateways?

- Geof

NJG@CORNELLA.BITNET.UUCP (05/27/87)

It would be interesting to know how common
fragmentation is. Is there some size (greater than 576 or not) that
will TYPICALLY not be fragmented? Has anyone measured this? Are there
known common gateways, IMPs, etc that have limits smaller than the
typical 1500 or so byte ethernet limit?

karels%okeeffe@UCBVAX.BERKELEY.EDU.UUCP (05/27/87)

There are a few other points in this discussion that haven't been made.

One is that looking at IP fragmentation in received packets
doesn't work for one-way connections (FTP) or connections with
asymmetrical data flow (Telnet, most everything else).

As Doug pointed out, using large packets and depending on IP fragmentation
loses badly when the network becomes lossy.  I've always resisted
paying for 2 different fragmentation/reassembly mechanisms at once,
and only the TCP level allows acknowledgement (and progress) when only
part of the data gets through.  Also, under various circumstances,
IP datagrams may be fragmented more than once, resulting in lots
of packets of varying sizes, lots of them tiny.  When this happens,
hardware or software limitations are likely to cause some of these
small packets just after larger ones to be lost.  (One such bug
in the ACC LH/DH and the 4.2BSD driver caused 1024-byte packets
to be lost with high probability because of fragmentation.)

There were a few comments about use of larger sizes on the local
network(s).  4.3BSD's algorithm is to use a large size under the MTU
of the outgoing network interface if the destination is on the same
logical network (another subnet of the same network is considered local).
This assumes that the MTU on any segment of the network is not unreasonably
bad for other parts of the network.
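
A sketch of that choice in Python; the helper names and the address arithmetic
below are simplifications, not the actual 4.3BSD kernel logic:

# Sketch of the 4.3BSD-style MSS choice described above: use a size derived
# from the outgoing interface MTU when the peer is on the same logical
# network (including other subnets of it), otherwise fall back to 576.

DEFAULT_IP_SIZE = 576
IP_TCP_HEADERS = 40

def same_logical_network(local_addr, peer_addr, netmask_bits=8):
    # Treat "same network number" as the same logical network, so other
    # subnets of our network still count as local.
    shift = 32 - netmask_bits
    return (local_addr >> shift) == (peer_addr >> shift)

def choose_mss(local_addr, peer_addr, if_mtu):
    if same_logical_network(local_addr, peer_addr):
        return if_mtu - IP_TCP_HEADERS
    return DEFAULT_IP_SIZE - IP_TCP_HEADERS

# 10.1.0.5 talking to 10.2.0.9 (another subnet of net 10) -> interface-sized
local, peer = (10 << 24) | (1 << 16) | 5, (10 << 24) | (2 << 16) | 9
print(choose_mss(local, peer, if_mtu=1500))      # 1460
print(choose_mss(local, (26 << 24) | 9, 1500))   # 536: off-network, use 576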

If packet size for a path is determined and propagated by the routing
protocols, that wouldn't help hosts that don't listen to the routing
protocols.

I would very much like to see options for determining the min MTU,
min throughput and the hopcount of a path.  In order to return the
information to the sender, this would have to be done as an ICMP
message that is reflected and returned by the destination, or hosts/
gateways would have to use the convention of preserving such IP options
when echoing ICMP messages.  If this was an ICMP message, it could
be defined so that each gateway replaced the IP destination address
with the IP address of the next-hop gateway so that gateways need not
examine each datagram forwarded.  The original IP address would be
stored in the ICMP part of the packet or in an IP option.
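
A sketch of that reflected-message idea in Python, with hypothetical field
names; the point is only that each gateway is addressed directly (the IP
destination is rewritten to the next hop), so no gateway has to inspect
transit datagrams for a special type:

# Hypothetical reflected path-probe message.  Each gateway updates the
# minimum-MTU and hop-count fields, rewrites the IP destination to the next
# hop, and forwards it; the final destination reflects the result back.

class PathProbe:
    def __init__(self, original_dest, mtu):
        self.original_dest = original_dest   # kept in the ICMP body
        self.ip_dest = None                  # rewritten hop by hop
        self.min_mtu = mtu
        self.hops = 0

def gateway_handle(probe, next_hop_addr, next_hop_mtu):
    probe.ip_dest = next_hop_addr            # next gateway, or final host
    probe.min_mtu = min(probe.min_mtu, next_hop_mtu)
    probe.hops += 1

probe = PathProbe(original_dest="hostB", mtu=1500)
for hop, mtu in [("gw1", 1007), ("gw2", 1500), ("hostB", 1500)]:
    gateway_handle(probe, hop, mtu)
print(probe.min_mtu, probe.hops)             # 1007 3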

		Mike

narten@PURDUE.EDU (05/27/87)

It has already been pointed out that MSS negotiation can only be done
at connection set up time if one goes by the spec. One can still
dynamically adjust the segment size, though, provided that the negotiated
value is high at the outset. Just because the negotiated MSS is large
doesn't mean that segments of that size have to be sent.

In other words, negotiate a value that is too large rather than too
small, and use as large a value as gets through without fragmenting
(without exceeding the negotiated value of course). Such a scheme is
also compatible with what is currently implemented. Old TCP
implementations will negotiate small MSS's. 
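
A sketch of that policy in Python: the MSS advertised at SYN time is only a
ceiling, and the sender tracks a separate, smaller working size it can move
up or down without renegotiating (class and method names are made up):

# Sketch: negotiate a large MSS once, then adapt the size actually sent.

class SegmentSizer:
    def __init__(self, negotiated_mss):
        self.ceiling = negotiated_mss    # fixed at connection setup
        self.current = negotiated_mss    # what we actually send right now

    def on_fragmentation_evidence(self, path_limit):
        # e.g. from any of the mechanisms discussed in this thread
        self.current = min(self.current, path_limit, self.ceiling)

    def on_path_looks_wider(self, candidate):
        # allowed to grow again, but never past what was negotiated
        self.current = min(max(self.current, candidate), self.ceiling)

s = SegmentSizer(negotiated_mss=1460)
s.on_fragmentation_evidence(966)
print(s.current)      # 966
s.on_path_looks_wider(1460)
print(s.current)      # 1460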

Thomas

hwb@MCR.UMICH.EDU (Hans-Werner Braun) (05/27/87)

Gosh. It really looks as if you are misinterpreting me. I am very much
in favor of longer packets. What I tried to describe, but maybe I wasn't
clear enough, was that a solution will be intrusive for the hosts. The
simple scheme originally discussed just won't work. Nobody keeps you from
sending fragments for now and especially nobody is keeping you from doing
within your local environment whatever you please. If your Cray-2 sends
a graphics object stream to an IRIS you really don't want to put current
generation gateways into the middle anyway. In particular not if you 
think along the lines of the speed SGI is considering for the fairly 
near future. You even make a case for what I was saying, namely that we
need to develop link/physical level devices which do much better than
the current 1500 bytes in use on Ethernets.

	-- Hans-Werner

steve@BRL.ARPA.UUCP (05/27/87)

The NSFNET Management solicitation specifies MTU=1500 for the backbone.  -s

alan@mn-at1.UUCP (05/27/87)

In article <8705261555.AA14106@ames-nas.arpa> yamo@AMES-NAS.ARPA (Michael J. Yamasaki) writes:
>
>Selecting 576 for a gateway between ethernet and HYPERchannel is
>a losing proposition. 

Yes.  In 1985, we did some benchmarks to determine the impact of running
screen editors on the Cray-2.   The worry was that CPU overhead for context
switching would be prohibitively expensive. 

It turned out that while the CPU impact was acceptable (8% usage of 1
CPU for 32 users running heavy simulated vi), the Hyperchannel was com-
pletely saturated.   It turns out that a Hyperchannel is limited to ap-
proximately 300 blocks/sec.  This is independent of block size.

By plotting block size against bandwidth, you get a curve that flattens out to
around 8 megabits/sec for a block size of 64-128 kilobits, but it nudges 11
megabits/sec for block sizes larger than 512 kilobits.  (Not coincidentally,
this is the block size we chose for the Extended File System.)

These are special conditions (dedicated LAN, no gateway, low-lossage),
obviously, but it would be nice if future protocols could be designed
not to limit packet sizes to 64K just because of a 16 bit field width.

(Shades of Intel :->)


--
Alan Klietz
Minnesota Supercomputer Center (*)
1200 Washington Avenue South
Minneapolis, MN  55415    UUCP:  ..rutgers!meccts!quest!mn-at1!alan
Ph: +1 612 626 1836              ..ihnp4!dicome!mn-at1!alan
                          ARPA:  alan@uc.msc.umn.edu  (was umn-rei-uc.arpa)

(*) An affiliate of the University of Minnesota

jqj@GVAX.CS.CORNELL.EDU.UUCP (05/28/87)

One issue I'd like to see getting more attention in the current discussion
of transaction-parameter negotiation is the latency/thruput tradeoff.  It's
clear that for an ftp-like application all the user really cares about is
thruput, but for the rapidly growing rpc-based style of interaction latency
becomes critical.  My client programs care about the total time between the
dispatch of an RPC request and the response; if I'm going to transfer a
lot of data I'm willing to negotiate a separate channel for that transfer.

No, this is not simply a UDP vs. TCP issue.  Under different circumstances
I might want to use either protocol to carry my RPC traffic, or for that
matter might want a special-purpose protocol tuned for my particular style
of RPC.  There are still some general optimization issues that the IP
community needs to address.

Interested readers should see the discussion of special- vs. general-
purpose networking protocols in the latest Transactions on Computer
Systems (TOCS).

geof@apolling.UUCP (Geof Cooper) (05/28/87)

 >      In other words, negotiate a value that is too large rather than too
 >      small, and use as large a value as gets through without fragmenting
 >      (without exceeding the negotiated value of course). Such a scheme is
 >      also compatible with what is currently implemented. Old TCP
 >      implementations will negotiate small MSS's. 

Ummm, the sending TCP doesn't know that the packets are being fragmented.
The receiving TCP does.  As John Wobus states, you have to treat the
different directions differently.

If a TCP-level solution is really the way of choice (I don't believe it
is) then just allow the MSS option to exist on ANY packet.  Beyond the
first packet the interpretation is that it is an advisory value, relating
to fragmentation.  I think this is the smallest change to the TCP spec
to make things work.  It also should work fine with all existing
implementations, since the MSS option should just be ignored past the
first packet (there is probably some implementation out there that sends
mail to the system maintainer of the other system, flaming at him to
fix his TCP....).

- Geof

jon@CS.UCL.AC.UK (Jon Crowcroft) (06/01/87)

Of course, if you use X.25 you don't have this problem :-). The
switches can use level 2 to support fragment sizes properly
over each hop, and level 3 to get end to end packet sizes
efficient, and a well known transport protocol over that...

Then all you need is to work out how to get CCITT to accept
giant window size and packet size options to make it work well
over a big network.

The strain of hop by hop versus end to end arguments starts to
show when you try and use an underengineered mil-spec net for
heavy duty service.

What Vint says sounds good to me - figure out how to get hop by
hop information back to the end to end protocols so they can be
taught how to behave properly. The real problem here is for
short-lived protocol entities - e.g. query systems running in PCs -
which don't get a chance to use previously cached info...

jon

hinden@CCV.BBN.COM (Robert Hinden) (06/01/87)

Nick,

In fact most of the networks that make up the Internet have MTU's smaller
than 1500 bytes.  Here is a summary:
     
     Wideband     2048
     Pronet Ring  2040
     Ethernet     1500
     Arpanet      1007
     NSFNET        576
     PDN           512
     Suran         320
     Satnet        256
     Packet Radio  226

Hope this helps.

Bob

Mills@UDEL.EDU.UUCP (06/02/87)

Jon,

The X.25 model requires X.75 gateways to handle M-bit repacketizing between
two networks with different packet sizes. This requires route-binding between
the gateways, which runs completely contrary to the basic Internet architecture.
If we are to accept your suggestion we would have to drop this fundamental
cornerstone of the model. Does the increment in performance possibly achieved
by route-binding justify the loss in robustness and flexibility inherent in
the present model? I'm not looking for a debate, just refining the perspective.

Dave

jon@CS.UCL.AC.UK.UUCP (06/02/87)

Dave,

 I agree X.75 goes contrary to internet philosophy. My
suggestion is that something between the current route binding
in X.25/X.75 and the fully dynamic routing of IP is required,
plus some.

I think you need some kind of weak resource reservation scheme,
combined with some means for end to end protocols to find out what
resources are out there, since pure feedback from the end to
end protocol will give you information too late to adjust MTUs,
retransmission times, etc., and will always give you unstable algorithms.

Jon