[comp.protocols.tcp-ip] maximum Ethernet throughput

mogul@DECWRL.DEC.COM (Jeffrey Mogul) (03/02/88)

Someone recently asked:
    If anyone has figures on maximum through-put of 10 Mbps Ethernet, I could
    make good use of them.

The snide answer is "10 Mbit/sec".  Actually, the theoretical maximum is
about 9.9 Mbit/sec, because of the inter-packet gap, and perhaps closer
to 9.8 or 9.7 Mbit/sec if you count the addresses and CRC as wasted bits.
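
To make the arithmetic concrete, here is a quick back-of-the-envelope
program (only a sketch; the 1500-byte payload, 8-byte preamble, and
9.6-microsecond inter-packet gap are the standard figures for
back-to-back maximum-size frames):

    /* Upper bound on 10 Mbit/sec Ethernet throughput for back-to-back
     * maximum-size frames.  All numbers are the standard Ethernet v2
     * figures, not measurements. */
    #include <stdio.h>

    int main(void)
    {
        double payload  = 1500.0;          /* data bytes per frame       */
        double header   = 6 + 6 + 2;       /* dest + source + type       */
        double crc      = 4.0;
        double preamble = 8.0;
        double gap      = 12.0;            /* 9.6 us gap = 12 byte times */
        double wire     = preamble + header + payload + crc + gap;

        /* only the inter-packet gap counted as waste */
        printf("%.2f Mbit/sec\n", 10.0 * (wire - gap) / wire);
        /* preamble, addresses and CRC also counted as waste */
        printf("%.2f Mbit/sec\n", 10.0 * payload / wire);
        return 0;
    }

The two figures come out at about 9.9 and 9.75 Mbit/sec respectively.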

That's probably not quite what you wanted to know, but the question as
put does not have a single answer.  For example, do you want the
aggregate maximum or the maximum for a single pair of hosts?

For the latter, I believe Bill Nowicki of Sun has obtained 5 Mbit/sec
using TCP between Suns, and the Sprite people at Berkeley have
reached somewhere near 5.6 Mbit/sec using their kernel-to-kernel
RPC.  These numbers are both from memory; perhaps someone can confirm
or improve them.

The best numbers that I am aware of, for communications between a pair
of hosts, come from Dave Boggs.  Using a pair of 15 MIPs processors
with an interface and software of his own design, and without much
of an operating system or any protocol (aside from the Ethernet data
link headers), he can get about 7.5 Mbits/sec obeying the Ethernet
spec, or at least 8.6 Mbits/sec if he "cheats" by sending 4Kbyte
packets.  (He once got >9 Mbits/sec with even larger packets, but
his current code doesn't support that.)

The limiting factor in his measurements seems to be that his interface
can only send one packet at a time; i.e., he must process one interrupt
per transmitted packet, which takes a few hundred microseconds.  The
interface can receive up to 16 packets using only one interrupt.  With
a more elaborate interface design, the theoretical limit should be
attainable.

budden@tetra.NOSC.MIL (Ray A. Buddenberg) (03/07/88)

The question needs a tighter definition to give a straight answer.

Unless you know which reference-model layer you are asking about, you
can't properly reply.

Similarly, the number of nodes on the net makes an awful lot of difference.

OK, to the ethernet problem.  10 Mbit/sec at the physical layer turns
into about 4 Mbits at the network layer (neglecting trivial cases of
only two nodes on the network).  This decrease is due to the carrier
sense multiple access characteristics of ethernets.

At the transport layer, this tends to decrease further to around 1 Mbit
due to transport overhead.  The figure is plastic depending on whose
TCP you are using.  Unless you've got a lot of overhead in the
higher layers, this 1 Mbit figure is pretty much the
application-to-application value.

So what does this all mean?  At the network layer, a '4 Mbit' 802.5
token ring gets just as much data thru as a '10 Mbit' ethernet.  Caveat
emptor when the salesman comes to call.  As higher speed networks become
common -- FDDI starts out at 100 Mbit at the physical and network
layers -- the Transport layer becomes an even more obvious bottleneck.
Indeed,
it appears that the bottleneck may be shifting from the LAN media toward
the DMA hardware.

Rex Buddenberg

hedrick@athos.rutgers.edu (Charles Hedrick) (03/09/88)

I've heard a second-hand report that Van Jacobson recently claimed an
8.2Mbit/sec TCP transfer between two Sun 3/50's using a very
aggressively tuned 4.3 BSD TCP.  If I got that right, it's quite
impressive, considering that a 3/50 is by current standards not that
fast a machine.  This suggests

 - that 10 > 4  (i.e. that you can get an Ethernet transfer that
	is faster than the maximum transfer on a 4Mbit token ring)

 - that you don't necessarily have to abandon robust, long-haul
	protocols such as TCP in order to get good performance

I wonder if we could nominate Van for a Nobel prize?  It seems to me
that he deserves something for all the work he has been doing for
TCP/IP.

alan@mn-at1.UUCP (Alan Klietz) (03/09/88)

In article <8803012359.AA00957@acetes.dec.com> mogul@DECWRL.DEC.COM (Jeffrey Mogul) writes:
<The best numbers that I am aware of, for communications between a pair
<of hosts, comes from Dave Boggs.  Using a pair of 15 MIPs processors
<with software of his own design he can get about 7.5 Mbits/sec obeying
<the Ethernet spec, or at least 8.6 Mbits/sec if he "cheats" by sending
<4Kbyte packets. 
<
<The limiting factor in his measurements seems to be that his interface
<can only send one packet at a time; i.e., he must process one interrupt
<per transmitted packet, which takes a few hundred microseconds.  The
<interface can receive up to 16 packets using only one interrupt.  With
<a more elaborate interface design, the theoretical limit should be
<attainable.

In general the maximum channel utilization possible by a single host,
assuming no errors, no contention, an unlimited output queue, and
no data copy, is

              D = t_f / (t_f + t_p)

where t_f is the time required to wiggle the bits of an average-length
data frame over the wire, and t_p is the processing overhead time to
interrupt the host CPU, switch context, reset the interface hardware,
allocate buffers, and do other (fixed) overhead.

To push the utilization curve toward 1, you have to either make your
data frames big (increase t_f, or equivalently process more frames per
interrupt) or make your processing overhead small (decrease t_p).

Dave Boggs did both things and got good results.  His processing overhead
is 50us, or 20,000 packets per second.  Impressive.  (The results for
the 4K case show a 66us overhead, from which we can infer that one of
our original assumptions, such as no data copy, was probably wrong,
but the results are still valid.)
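
Plugging a couple of illustrative values into that formula (a sketch
only -- the 300 us and 50 us overheads are round numbers, not Boggs's
measured figures):

    /* Utilization D = t_f / (t_f + t_p) on a 10 Mbit/sec wire for a few
     * per-packet overheads.  The overhead values are illustrative. */
    #include <stdio.h>

    int main(void)
    {
        int    frame_bytes[] = { 1500, 4096 };   /* normal and "cheating" */
        double overhead_us[] = { 300.0, 50.0 };  /* per-packet overhead   */
        int i, j;

        for (i = 0; i < 2; i++) {
            double tf = frame_bytes[i] * 8 / 10.0;   /* frame time in us  */
            for (j = 0; j < 2; j++)
                printf("%4d-byte frames, %3.0f us overhead: D = %.2f\n",
                       frame_bytes[i], overhead_us[j],
                       tf / (tf + overhead_us[j]));
        }
        return 0;
    }

With normal-size frames, a few hundred microseconds of per-packet
overhead already costs about 20 percent of the wire.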

But herein lies the problem.  As machines get faster and bigger, with
more pipelining and vectorization, and as the host network software
becomes bigger and more complicated, the per-message processing overhead
gets more expensive.  And yet the data-links are becoming faster and
faster.   FDDI is 100+ mbit/s.  The GaAs version will be 1000+ mbit/s.
The ANSI X3T9 Working Group is developing a spec for an HSC channel rated
at 1600 mbit/s.  Absolutely minimizing the host overhead is, I think,
critical to getting any sort of decent usage of these links (e.g.
buffering multiple messages per host interrupt).

The problem is that many vendors still think the "critical path" for
getting better performance is the wire (make the wire go faster and
you get better results), when in actuality the critical path is reverting
back to the processor.  Chop off the zeros and it's not unlike the olden
days of writing a 9600 baud serial driver for an Intel 8080 processor :-)

--
Alan Klietz
Minnesota Supercomputer Center (*)
1200 Washington Avenue South
Minneapolis, MN  55415    UUCP:  alan@mn-at1.k.mn.org
Ph: +1 612 626 1836       ARPA:  alan@uc.msc.umn.edu  (was umn-rei-uc.arpa)

(*) An affiliate of the University of Minnesota

oconnor@SCCGATE.SCC.COM (Michael J. O'Connor) (03/09/88)

I also have heard "unofficial" reports of TCP transfers at greater than 8Mbps
between a couple of SOS boxes by Van Jacobson.  Stories of crashing Vaxen
accompany the reports.  At least one reporter claimed that this helped to
demonstrate that Sun violates the Ethernet spec, allowing packets to be put
on the wire too close together, effectively denying access to other hosts.

Any chance of an "official" confirmation or denial of these rumors?  Possibly
an IDEA or RFC?  Which leads to another question, what ever happened to IENs?

				Mike

hedrick@ARAMIS.RUTGERS.EDU (Charles Hedrick) (03/10/88)

We know of no evidence that Suns violate the minimum spacing
requirement.  Indeed if they did, things that work for us would fail,
so I don't believe it.  However 8Mb transfers would certainly open us
to a whole range of things that have never been tested before.  While
Ethernet should in theory tolerate multiple simultaneous high-speed
users, as far as I know it isn't being used that way now.  Proper
functioning would depend upon everybody's random backoff working
right.  Given the past history in networking, I am not prepared to bet
that untested features work.  One could imagine failures everywhere
from Ethernet controller microcode to device drivers to protocol
implementations.  It would also let us test whether protocol designs
implicitly assume that an Ethernet can never be congested.  Certainly
DECnet deals very differently with Ethernet than with point to point
lines, and to a certain extent IP does as well.  When designers were
thinking of Ethernet, I rather suspect they might not have considered
the possibility that one host could actually use all 10Mb of
bandwidth.  It is possible that protocols such as ARP and DECnet hello
would have to be rethought in this context.  (For the benefit of the
person asking the original question, let me note that there's no reason
to think that a token ring would help.  Indeed it is slower, so one
is in danger of reaching this unknown realm sooner.)

pogran@CCQ.BBN.COM (Ken Pogran) (03/10/88)

Charles,

Getting "maximum throughput" from a LAN has been a challenge
since the inception of Local Area Network technology back in the
'70s.  Often there's a disparity between what the LAN specs
permit (at the "MAC" layer) and what practical controller
implementations are capable of delivering.  The challenge is to
implement a controller that can deliver something approaching the
maximum throughput without being prohibitively expensive (i.e.,
from a vendor's perspective: "Well, we could build one that would
go that fast, but it would price us out of the market.").  So one
sees compromises in what a SINGLE flow can obtain from a LAN.

In ETHERNET implementations, the compromise usually centers on
how fast one can "turn around" a controller following one
transaction (packet sent or received) to begin another one.  On
transmit, this limits how quickly you can send on an unloaded
net.  There are (or at least there were five years ago) some
controller products that did a lot of "housekeeping" between
outgoing packets, lowering throughput (I used one; I was
disappointed).  On receive, the effect is much more
pernicious: If your controller takes "a long time" to get ready
to receive the next incoming packet, and the controller sending
to you has a faster turn-around than you do, your controller
might be "blind" to the net and miss a packet, requiring
retransmission by a higher-level protocol.

Applications can't bank on getting a significant fraction of a
LAN; after all, what would happen when two (or ten) of 'em happen
to run simultaneously?  On the other hand, software implemented
in an environment that only contains "slow" controllers might
break in embarrassing ways when employed on systems with quick
turn-around controllers and a lightly loaded net!  Which brings
up a good point for developers of software that runs right above
the MAC layer: make sure it'll run on hardware "quicker" than
yours.

>    When designers were thinking of Ethernet, I rather suspect
>    they might not have considered the possibility that one host
>    could actually use all 10Mb of bandwidth.

The mechanisms of the ETHERNET care not a whit whether the next
packet on the line is from the same guy as the last, or from
someone else.  I think it's clear that there were (are?) a number
of CONTROLLER designs that assumed that no individual host would
want to TRY to use all 10 Mb/s.  I haven't looked at the insides
of any recent controllers (or the timing specs of recent controller
chips) but I'd be willing to bet that the choke point is more
likely to be either the controller or its interface to the host,
rather than the specs of the ETHERNET itself.

This all applies to token-ring LANs, too, of course.  And the
faster the aggregate data rate on the LAN, the more these
arguments apply.  (So I might, for example, be able to get a
greater PERCENTAGE of a 4 Mb/s Token Ring than I can of a 10 Mb/s
ETHERNET).

Any comments from designers of LAN controllers?

Ken Pogran

van@LBL-CSAM.ARPA (Van Jacobson) (03/10/88)

I don't know what would constitute an "official" confirmation
but maybe I can put some rumors to bed.  We have done a TCP that
gets 8Mbps between Sun 3/50s (the lowest we've seen is 7Mbps,
the highest 9Mbps -- when using 100% of the wire bandwidth, the
ethernet exponential backoff makes throughput very sensitive to the
competing traffic distribution.)  The throughput limit seemed to
be the Lance chip on the Sun -- the CPU was showing 10-15% idle
time.  I don't believe the idle time number (I want to really
measure the idle time with a microprocessor analyzer) but the
interactive response on the machines was pretty good even while
they were shoving 1MB/s at each other so I know there was some
CPU left over.

Yes, I did crash most of our VMS vaxen while running throughput
tests and no, this has nothing to do with Sun violating
protocols -- the problem was that the DECNET designers failed to
use common sense.  I fired off a 1GB transfer to see if it would
really finish in 20 minutes (it took 18 minutes) and halfway
through I noticed that our VMS 780 was rebooting.  When I later
looked at the crash dump I found that it had run out of non-paged
pool because the DEUNA queue was full of packets.  It seems that
whoever did the protocols used a *linear* backoff on the
retransmit timer.  With 20 DECNET routers trying to babble the
state of the universe every couple of minutes, and my Suns
keeping the wire warm in the interim, any attempt to access the
ether was going to put a host into serious exponential backoff.
Under these circumstances, a linear transport timer just doesn't
cut it.  So I found 25 retransmissions in the outbound queue for
every active DECNET connection.  I know as little about VMS as
possible so I didn't investigate why the machine had crashed
rather than terminating the connections gracefully.  I should
also note that NFS on our other Sun workstations wasn't all that
happy about waiting for the wire:  As I walked around the building,
every Sun screen was filled with "server not responding" messages.
(But no Sun crashed -- I later shut most of them down to keep ND
traffic off the wire while I was looking for the upper bound on
xfer rate.)

I did run two simultaneous 100MB transfers between 4 3/50s and
verified that they were gracious about sharing the wire.  The
total throughput was 7Mbps split roughly 60/40.  The tcpdump
trace of the two conversations has some holes in it (tcpdump
can't quite hack a packet/millisecond, steady state) but the
trace doesn't show anything weird happening.

Quite a bit of the speedup comes from an algorithm that we (`we'
refers to collaborator Mike Karels and myself) are calling
"header prediction".  The idea is that if you're in the middle
of a bulk data transfer and have just seen a packet, you know
what the next packet is going to look like:  It will look just
like the current packet with either the sequence number or ack
number updated (depending on whether you're the sender or
receiver).  Combining this with the "Use hints" epigram from
Butler Lampson's classic "Epigrams for System Designers", you
start to think of the tcp state (rcv.nxt, snd.una, etc.) as
"hints" about what the next packet should look like.

If you arrange those "hints" so they match the layout of a tcp
packet header, it takes a single 14-byte compare to see if your
prediction is correct (3 longword compares to pick up the send &
ack sequence numbers, header length, flags and window, plus a
short compare on the length).  If the prediction is correct,
there's a single test on the length to see if you're the sender
or receiver followed by the appropriate processing.  E.g., if
the length is non-zero (you're the receiver), checksum and
append the data to the socket buffer then wake any process
that's sleeping on the buffer.  Update rcv.nxt by the length of
this packet (this updates your "prediction" of the next packet).
Check if you can handle another packet the same size as the
current one.  If not, set one of the unused flag bits in your
header prediction to guarantee that the prediction will fail on
the next packet and force you to go through full protocol
processing.  Otherwise, you're done with this packet.  So, the
*total* tcp protocol processing, exclusive of checksumming, is
on the order of 6 compares and an add.  The checksumming goes
at whatever the memory bandwidth is so, as long as the effective
memory bandwidth is at least 4 times the ethernet bandwidth, the
cpu isn't a bottleneck.  (Let me make that clearer: we got 8Mbps
with checksumming on).
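
In outline, the receive-side fast path looks something like this (a
rough sketch with invented names and a made-up flag bit; not the actual
code):

    /* Receive-side header prediction fast path, as described above.
     * "pred" holds the header we expect next; all names are invented
     * for illustration. */

    struct conn;                        /* stand-in for the tcpcb/socket */

    struct predicted_hdr {
        unsigned long  seq;             /* expected sequence number      */
        unsigned long  ack;             /* expected ack number           */
        unsigned long  off_flags_win;   /* header length, flags, window  */
        unsigned short len;             /* expected data length          */
    };

    #define FORCE_SLOW_PATH 0x0800      /* an otherwise-unused flag bit  */

    /* stand-ins for the real checksum, socket-buffer and wakeup calls */
    extern int  checksum_ok(char *data, int len);
    extern void append_to_socket(struct conn *c, char *data, int len);
    extern void wakeup_reader(struct conn *c);
    extern int  room_for_another(struct conn *c, int len);

    int
    fast_path(struct conn *c, struct predicted_hdr *pred,
              struct predicted_hdr *pkt, char *data)
    {
        /* the 14-byte compare: three longwords plus a short */
        if (pkt->seq != pred->seq || pkt->ack != pred->ack ||
            pkt->off_flags_win != pred->off_flags_win ||
            pkt->len != pred->len)
            return 0;               /* prediction wrong: full processing */

        if (pkt->len != 0) {        /* non-zero length: we're the receiver */
            if (!checksum_ok(data, pkt->len))
                return 0;
            append_to_socket(c, data, pkt->len);
            wakeup_reader(c);
            pred->seq += pkt->len;  /* predict the next packet           */
            if (!room_for_another(c, pkt->len))
                pred->off_flags_win |= FORCE_SLOW_PATH;
        } else {
            /* zero length: we're the sender and this is a pure ack; drop */
            /* acked data from the send buffer and bump pred->ack (omitted) */
        }
        return 1;                   /* done: ~6 compares and an add      */
    }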

You can apply the same idea to outgoing tcp packets and most
everywhere else in the protocol stack.  I.e., if you're going
fast, it's real likely this packet comes from the same place
the last packet came from so 1-behind caches of pcb's and arp
entries are a big win if you're right and a negligible loss if
you're wrong.
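
The 1-behind cache itself is only a few lines (again, invented names;
a sketch rather than the real lookup code):

    /* 1-behind cache in front of the full pcb lookup, as described
     * above.  Names are invented for illustration. */

    struct pcb;                              /* whatever the real pcb is */
    struct key { unsigned long src, dst; unsigned short sport, dport; };

    extern struct pcb *slow_lookup(struct key *k);  /* walk the full list */
    extern int         key_matches(struct pcb *p, struct key *k);

    static struct pcb *last_pcb;             /* the 1-behind cache        */

    struct pcb *
    pcb_lookup(struct key *k)
    {
        if (last_pcb != 0 && key_matches(last_pcb, k))
            return last_pcb;                 /* hit: negligible cost      */
        last_pcb = slow_lookup(k);           /* miss: one wasted compare  */
        return last_pcb;
    }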

In addition to the header prediction, I put some horrible kluges
in the mbuf handling to get the allocation/deallocations down to
1 per packet.  Mike Karels has been working in the same area and
his clean code is faster than my kluges.  As soon as this
semester is over, I plan to merge Mike's and my versions then
the two of us will probably make a pass at knocking off the
biggest of the rough edges.  Sometime in late spring or early
summer we should be passing this code out to hardy souls for
beta-testing (but ask yourself: do you really want workstations
that routinely use 100% of the ethernet bandwidth?  I'm pretty
sure we don't and we're not running this tcp on any of our
workstations.)

Some of the impetus for this work came from Greg Chesson's
statement at the Phoenix USENIX that `the way TCP and IP are
designed, it's impossible to make them go fast'.  On hearing
this, I lept to my feet to protest but decided that saying "I
think you're wrong" wasn't going to change anybody's mind about
anything.  Now I can say that I'm pretty sure he was wrong about
TCP (but that's *not* a comment on the excellent work he's doing
and has done on the XTP and the Protocol Engine).

The header prediction algorithm evolved during attempts to make
my 2400-baud SLIP dial-up send 4 bytes when I typed a character
rather than 44.  After staring at packet streams for a while, it
was pretty obvious that the receiver could predict everything
about the next packet on a TCP data stream except for the data
bytes.  Thus all the sender had to ship in the usual case was
one bit that said "yes, your prediction is right" plus the data.
I mention this because funding agents looking for high speed,
next-generation networks may forget that research to make slow
things go fast sometimes makes fast things go faster.

 - Van

hedrick@ARAMIS.RUTGERS.EDU (Charles Hedrick) (03/10/88)

I thought it was obvious that I was talking about systems designers,
not the people who made up the Ethernet specs.  As I said, when we get
into untested realms, such as 8Mb single transfers, we run the danger
of finding bugs in the controller, its microcode, device drivers, and
possibly even (though we will hope not) the protocols defining the
Ethernet encapsulation for IP, DECnet, or whatever.  We see very
clearly the fact that many controllers can't handle bursts of packets
with minimum spacing.  Normally however they simply drop packets, not
crash the machine they are running on.  I have no idea why DECnet
machines would crash during bursts of TCP traffic (or even whether
that rumor is true), but I would start looking at the design of the
Ethernet controller and the device driver that is dealing with it.  It
could be something as simple as a bug in the code that handles
situations where you are unable to get the Ethernet for an extended
period of time, or something as complex as implicit assumptions in
some piece of the DECnet protocol design.  Of course it could also be
a bug in the Sun that causes it to fail to defer to someone else when
it is supposed to do so, but I sort of doubt that, since that should
be handled by the Lance chip.

bzs@BU-CS.BU.EDU (Barry Shein) (03/11/88)

Breathtaking, in a word.  Who made that comment about Nobel Prizes
for networking?

At any rate, there is another interesting issue here:

>(but ask yourself: do you really want workstations
>that routinely use 100% of the ethernet bandwidth?  I'm pretty
>sure we don't and we're not running this tcp on any of our
>workstations.)

My temptation is to ask the converse: Do I wish to believe that merely
mediocre algorithms protect me from this problem?

It seems that given what we might call near-capacity algorithms (for
ethernets, of course new wire technologies such as direct fiber
hookups will be interesting again) we need to think about rational
ways to administer such networks.

In the trivial case we could isolate many of these workstations, as we
already do here, to their own ethernets, barely shared, so it is less
of a problem. Perhaps this would spur vendors to provide hardware to
make that even easier and more economical. This would certainly be
useful in the case of networked file systems using a client/server
model (e.g. diskless or disk-poor clients).

Beyond that I have often thought of the idea of a network "throttle",
a settable parameter that might control maximum throughput (packets
output per second, for example) that a machine might limit itself to.

Obviously that requires voluntary compliance (although it could be an
extension of window advertising, that is, making the window behavior
tunable by the system administrator rather than calculated for maximum
throughput always based upon blind assumptions about resources.)
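
A minimal form such a throttle could take (purely a sketch; nothing
like this exists in current drivers, and the parameter name is made up):

    /* Sketch of a per-host output throttle: an administrator-settable
     * cap on packets transmitted per second.  Purely illustrative. */
    #include <sys/time.h>

    static int  max_pkts_per_sec = 500;     /* the settable parameter    */
    static int  sent_this_second;
    static long current_second;

    /* returns 1 if the packet may be sent now, 0 if it must wait */
    int
    throttle_ok(void)
    {
        struct timeval now;

        gettimeofday(&now, (struct timezone *)0);
        if (now.tv_sec != current_second) { /* new second: reset counter */
            current_second = now.tv_sec;
            sent_this_second = 0;
        }
        if (sent_this_second >= max_pkts_per_sec)
            return 0;                       /* over the limit: hold it   */
        sent_this_second++;
        return 1;
    }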

Interesting, at any rate...

	-Barry Shein, Boston University

melohn@SUN.COM (Bill Melohn) (03/11/88)

In article <8803091444.AA24112@sccgate.scc.com> oconnor@SCCGATE.SCC.COM (Michael J. O'Connor) writes:
>...  At least one reporter claimed that this helped to
>demonstrate that Sun violates the Ethernet spec, allowing packets to be put
>on the wire too close together, effectively denying access to other hosts.

We hear these stories at Sun all the time, usually from customers who
have been told by other computer vendors that the Sun somehow "cheats"
on the Ethernet spec by "putting too many packets on the wire", or
"sending too many packets too fast for Ethernet". These are complete
falsehoods. Sun uses standard chips supplied by both Intel and AMD for
our ethernet interfaces, and while they may be the fastest
implementations available on the market, they are completely within
Ethernet specifications. Next time you hear one of these stories from
a sales person for another computer vendor, ask them to back their
claim with facts, and seriously consider the source.

rajaei@ttds.UUCP (Hassan Rajaei) (03/12/88)

In article <412@mn-at1.UUCP> alan@mn-at1.UUCP (0000-Alan Klietz) writes:
>But herein lies the problem.  As machines get faster and bigger, with
>more pipelining and vectorization, and as the host network software
>becomes bigger and more complicated, the per-message processing overhead
>gets more expensive.  And yet the data-links are becoming faster and
>faster.   FDDI is 100+ mbit/s.  The GaAs version will be 1000+ mbit/s.
>The ANSI X3T9 Working Group is developing a spec for a HSC channel rated
>at 1600 mbit/s per second.   The importance of absolutely minimizing
>the host overhead is something that I think is critical to get any sort
>of decent usage of these links (e.g. buffering multiple messages per host
>interrupt). 
>
You are right.  We are reaching a point where a host machine can no
longer cope with the communication load alone.  We have to offload the
host as much as we can by introducing a dedicated communications
subsystem rather than some simple board.  Protocol engines are
something to think about.

Hassan Rajaei
rajaei@ttds.tds.kth.se

jas@MONK.PROTEON.COM (John A. Shriver) (03/12/88)

One way to deal with those violent controller speed mismatches that
are starting to show up is to use a LAN with some form of hardware
ACK.  This is part of all token ring networks (ProNET, Apollo, 802.5,
FDDI), as well as Hyperchannel and DEC's CI.  While these bits cannot
be used to guarantee end-to-end reliability, they do let you know when
you are sending faster than the other guy can receive.

The simplest thing to do is just retransmit until the other guy copies
it, but this is a waste of network bandwidth.  The more sophisticated
thing to do is put the packet back on the front of the output queue,
and search up the queue for a packet to a different hardware
destination.  This way the packet retransmissions are interleaved with
potentially useful transmissions.  You don't, however, want to shuffle
the order of packets to a particular node; this can be pathological to
some protocols.  (The NSP in DECnet-VAX did not accept out-of-order
packets until Version 4.7, some TCP's have the same problem.)
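
In outline, the queue scan might look like this (the queue layout and
names are invented for illustration; this is only a sketch):

    /* Sketch of the retransmit interleaving described above: requeue the
     * un-ACKed packet at the front, then transmit the first packet bound
     * for some OTHER station, so per-destination ordering is preserved. */

    struct opkt {
        struct opkt   *next;
        unsigned char  dst[6];          /* hardware destination address  */
        /* ... packet data ... */
    };

    struct oqueue {
        struct opkt *head;
    };

    static int
    same_addr(unsigned char *a, unsigned char *b)
    {
        int i;

        for (i = 0; i < 6; i++)
            if (a[i] != b[i])
                return 0;
        return 1;
    }

    /* pick the next packet to send, skipping the station that just failed
     * to ACK; returns 0 if everything queued is for that station */
    struct opkt *
    pick_next(struct oqueue *q, unsigned char *busy_dst)
    {
        struct opkt **pp, *p;

        for (pp = &q->head; (p = *pp) != 0; pp = &p->next) {
            if (!same_addr(p->dst, busy_dst)) {
                *pp = p->next;        /* unlink, leaving the rest in order */
                p->next = 0;
                return p;
            }
        }
        return 0;                     /* nothing else to send yet          */
    }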

Good programming interfaces are indeed important.  While our earlier
boards were only single-buffered (one each way), their simple and fast
programming interface (start a transmit in 4 C statements) made up for
it.  (The ACK bit also helps on the single-buffer problem.)

It is very hard to try and get below the interrupt-per-packet
threshold and keep a simple programming interface.  However, even if
you make it, you run into programming interfaces, such as the 4.3bsd
network device interface, that make it very difficult to get a
pipeline going.  By comparison, in VAX/VMS, where pipelining is quite
feasible, when you pipeline two packets deep, you nearly double the
throughput.

As an example of how programming interfaces and interface design
affect performance, note that Van's benchmarks were run on a Sun-3/50,
which uses the AMD LANCE Ethernet chipset.  I doubt he could match
those numbers on a Sun-3/160, which uses an Intel 82586 Ethernet chip.
The '586 has an inferior programming interface, and some nasty
internal "housekeeping" delays that make it a good bit slower.

thomson@oldhub.toronto.edu (Brian Thomson) (03/17/88)

Van Jacobson's results are certainly startling, but I can't help
believing that a significant part of that speedup must be in
changes to the mbuf handling, the socket code, and the LANCE
driver.  My evidence is a test I ran on a 3/50: I defined a
'protocol' whose PRU_SEND action was to checksum each mbuf then
hand it directly to the driver, with a dummy AF_UNSPEC destination so
there would be no ARPing going on.  This exercises vanilla
SUN mbuf, socket, and interface driver code, while replacing all
of TCP/IP by simple checksumming - so no protocols at all.
The data goes nowhere, and there are no acks to deal with.
Even so, this configuration could not source data to the wire faster
than about 3.6Mb/sec.  I could hit 8Mb/sec if I threw the data
away right after checksumming, without passing it to the driver at all.

-- 
		    Brian Thomson,	    CSRI Univ. of Toronto
		    utcsri!uthub!thomson, thomson@hub.toronto.edu