van@HELIOS.EE.LBL.GOV (Van Jacobson) (10/25/88)
Many people have asked for the Ethernet throughput data I showed at Interop, so it's probably easier to post it:

These are some throughput results for an experimental version of the 4BSD (Berkeley Unix) network code running on a couple of different MC68020-based systems: Sun 3/60s (20MHz 68020 with AMD LANCE Ethernet chip) and Sun 3/280s (25MHz 68020 with Intel 82586 Ethernet chip) [note again that the tests were done with Sun hardware but not Sun software -- I'm running 4.?BSD, not Sun OS]. There are lots and lots of interesting things in the data, but the one thing that seems to have attracted people's attention is the big difference in performance between the two Ethernet chips.

The test measured task-to-task data throughput over a TCP connection from a source (e.g., chargen) to a sink (e.g., discard). The tests were done between 2am and 6am on a fairly quiet Ethernet (~100Kb/s average background traffic). The packets were all maximum size (1538 bytes on the wire, or 1460 bytes of user data per packet). The free parameters for the tests were the sender and receiver socket buffer sizes (which control the amount of 'pipelining' possible between the sender, wire and receiver). Each buffer size was independently varied from 1 to 17 packets in 1-packet steps. Four tests were done at each of the 289 combinations. Each test transferred 8MB of data, then recorded the total time for the transfer and the send and receive socket buffer sizes (8MB was chosen so that the worst-case error due to the system clock resolution was ~.1% -- 10ms in 10sec). The 1,156 tests per machine pair were done in random order to prevent any effects from fixed patterns of resource allocation.

In general, the maximum throughput was observed when the sender buffer equaled the receiver buffer (the reason why is complicated but has to do with collisions). The following table gives the task-to-task data throughput (in KBytes/sec) and the throughput on the wire (in MBits/sec) for (a) a 3/60 sending to a 3/60 and (b) a 3/280 sending to a 3/60.

 _________________________________________________
|        3/60 to 3/60         |  3/280 to 3/60   |
|      (LANCE to LANCE)       | (Intel to LANCE) |
|                             |                  |
| socket                      |                  |
| buffer    task to           |  task to         |
|  size      task     wire    |   task    wire   |
|(packets)  (KB/s)   (Mb/s)   |  (KB/s)  (Mb/s)  |
|    1        384      3.4    |    337    3.0    |
|    2        606      5.4    |    575    5.1    |
|    3        690      6.1    |    595    5.3    |
|    4        784      6.9    |    709    6.3    |
|    5        866      7.7    |    712    6.3    |
|    6        904      8.0    |    708    6.3    |
|    7        946      8.4    |    710    6.3    |
|    8        954      8.4    |    718    6.4    |
|    9        974      8.6    |    715    6.3    |
|   10        983      8.7    |    712    6.3    |
|   11        995      8.8    |    714    6.3    |
|   12       1001      8.9    |    715    6.3    |
|_____________________________|__________________|

The theoretical maximum data throughput, after you take into account all the protocol overheads, is 1,104 KB/s (this task-to-task data rate would put 10Mb/s on the wire). You can see that the 3/60s get 91% of the theoretical max. The 3/280, although a much faster processor (CPU performance is really dominated by the speed of the memory system, not the processor clock rate, and the memory system in the 3/280 is almost twice the speed of the 3/60's), gets only 65% of the theoretical max.

The low throughput of the 3/280 seems to be entirely due to the Intel Ethernet chip: at around 6Mb/s, it saturates. (I put the board on an extender and watched the bus handshake lines on the 82586 to see if the chip or the Sun interface logic was pooping out. It was the chip -- it just stopped asking for data. The CPU was loafing along with at least 35% idle time during all these tests, so it wasn't the limit.)
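For readers who want to reproduce this kind of measurement, here is a minimal sketch of the sender side in modern BSD-sockets C -- an illustration of the methodology described above, not the actual 1988 test harness. The host/port arguments, the discard-style sink, and the wire-rate accounting (1538 wire bytes per full-size data packet plus half of an 84-byte ack frame, since TCP acks every second packet) are my assumptions; that accounting approximately reproduces the wire column in the table above, though the exact bookkeeping behind the 1,104 KB/s figure may differ.

/*
 * Hedged sketch of the test described above -- NOT the original
 * harness.  Sets the send socket buffer to a given number of
 * packets, blasts 8MB at a discard-style sink, and reports the
 * task-to-task rate plus an estimated wire rate.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define TOTAL_BYTES (8L * 1024 * 1024)  /* 8MB per test, as in the text     */
#define MSS         1460                /* user data per full-size packet   */
#define WIRE_BYTES  1538                /* preamble+hdrs+data+CRC+gap       */
#define ACK_SHARE   42                  /* 84-byte ack per 2 data packets   */

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s host port sndbuf-packets\n", argv[0]);
        return 1;
    }
    int sndbuf = atoi(argv[3]) * MSS;   /* buffer size in packets, as above */

    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0 || setsockopt(s, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0) {
        perror("socket/setsockopt");
        return 1;
    }

    struct sockaddr_in sin;
    memset(&sin, 0, sizeof(sin));
    sin.sin_family      = AF_INET;
    sin.sin_port        = htons((unsigned short)atoi(argv[2]));
    sin.sin_addr.s_addr = inet_addr(argv[1]);
    if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
        perror("connect");
        return 1;
    }

    static char buf[MSS];               /* payload contents don't matter */
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (long sent = 0; sent < TOTAL_BYTES; ) {
        ssize_t n = write(s, buf, sizeof(buf));
        if (n < 0) {
            perror("write");
            return 1;
        }
        sent += n;
    }
    gettimeofday(&t1, NULL);
    close(s);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    double kbs  = TOTAL_BYTES / 1024.0 / secs;      /* task-to-task KB/s */
    /* Wire estimate: 1538 wire bytes carry each 1460-byte payload, plus
     * half of an 84-byte ack frame per data packet (one ack per two
     * packets).  This approximately matches the table's Mb/s column.   */
    double wire = kbs * 1024.0 * 8.0 * (WIRE_BYTES + ACK_SHARE) / MSS / 1e6;
    printf("%.0f KB/s task-to-task, ~%.1f Mb/s on the wire\n", kbs, wire);
    return 0;
}

The matching receiver would simply set SO_RCVBUF to the other free parameter and read until EOF.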
[Just so you don't get confused: Stuff above was measurements. Stuff below includes opinions and interpretation and should be viewed with appropriate suspicion.]

If you graph the above, you'll see a large notch in the Intel data at 3 packets. This is probably a clue to why it's dying: TCP delivers one ack for every two data packets. At a buffer size of three packets, the collision rate increases dramatically, since the sender's third packet will collide with the receiver's ack for the previous two packets (for buffer sizes of 1 and 2, there are effectively no collisions). My suspicion is that the Intel is taking a long time to recover from collisions (remember that you're 64 bytes into the packet when you find out you've collided, so the chip's bus logic has to back up 64 bytes -- Intel spent their silicon making the chip "programmable"; I doubt they invested as much as AMD in the bus interface). This may or may not be what's going on: life is too short to spend debugging Intel parts, so I really don't care to investigate further.

The one annoyance in all this is that Sun puts the fast Ethernet chip (the AMD LANCE) in their slow machines (3/50s and 3/60s) and the slow Ethernet chip (the Intel 82586) in their fast machines (3/180s, 3/280s and Sun-4s, i.e., all their file servers). [I've had to put delay loops in the Ethernet driver on the 3/50s and 3/60s to slow them down enough for the 3/280 server to keep up.] Sun's not to blame for anything here: it costs a lot to design a new Ethernet interface; they had a design for the 3/180 board set (which was the basis of all the other VME machines -- the [34]/280 and [34]/110); and there was no market pressure to change it. If they hadn't ventured out in a new direction with the 3/[56]0 -- the LANCE -- I probably would have thought 700KB/s was great Ethernet throughput (at least until I saw Dave Boggs' DEC-Titan/Seeq-chip throughput data).

But I think Sun is overdue in offering a high-performance VME Ethernet interface. That may change, though -- VME controllers like the Interphase 4207 Eagle are starting to appear, which should either put pressure on Sun or offer a high-performance third-party alternative (I haven't actually tried an Eagle yet, but from the documentation it looks like they did a lot of things right). I'd sure like to take the delay loops out of my LANCE driver...

 - Van

ps: I have data for Intel-to-Intel and LANCE-to-Intel as well as the Intel-to-LANCE I listed above. Using an Intel chip on the receiver, the results are MUCH worse -- 420KB/s max. I chose the data that put the 82586 in its very best light.

I also have scope pictures taken at the transceivers during all these tests. I'm sure there'll be a chorus of "so-and-so violates the Ethernet spec" but that's a lie -- NONE OF THESE CHIPS OR SYSTEMS VIOLATED THE ETHERNET SPEC IN ANY WAY, SHAPE OR FORM. I looked very carefully for violations and have the pictures to prove there were none.

Finally, all of the above is Copyright (c) 1988 by Van Jacobson. If you want to reproduce any part of it in print, you damn well better ask me first -- I'm getting tired of being misquoted in trade rags.
retrac@RICE.EDU (John Carter) (10/27/88)
Van,

I've made similar measurements on similar machines, and come to roughly the same conclusions. My measurements are in the context of the V operating system's interkernel protocols, but for raw hardware speed comparisons this shouldn't matter.

Between two SUN-3/50s or SUN-3/60s (both with the LANCE interface), I can sustain about 8.2 Mbps user-to-user performance (not chargen-source to discard-sink). From a SUN-3/180 (somewhat slower than the 3/280, but with the same Intel ethernet interface) to a SUN-3/60, I can sustain slightly more than 6.0 Mbps user-to-user performance. All these measurements are for 1024-byte data packets, with 80-byte V interkernel headers (ouch!!!) and 14-byte ethernet headers. Factoring the headers in (a scale factor of (1024+80+14)/1024, about 1.09 -- see the short check after this message), the SUN-3/50 -> SUN-3/50 throughput is 9.0 Mbps and the SUN-3/180 -> SUN-3/50 throughput is 6.5 Mbps. Roughly the same raw numbers... The SUN-2/50s (which use the Intel interface, but are significantly slower) can maintain around 4.7-5.1 Mbps in or out. These are very rough, since I haven't fully debugged the implementation on the 2's.

[ The following is opinion and shouldn't be construed as gospel. ]

I also have put only a little bit of effort into determining the exact cause of the disparity. I had come to the same conclusion you had regarding the 82586's DMA ability, namely that it isn't very good (and can only sustain 60-70% of the network performance). You conjectured that the interface takes a long time to recover from collisions. I hadn't seen too many collisions, so I hadn't thought of that, but it seems to fit with some other observations I've made concerning the interface. The Intel interface tends to drop packets reasonably frequently when receiving large packet bursts (blasts), presumably because of its inability to DMA into memory fast enough. Another problem I have had, which seems to be caused by the interface, is that it takes a relatively long time to interrupt the processor when an event occurs (either a packet reception or a transmission completing). An annoying "feature" of the interface is that you can't have receive buffers start on odd boundaries (I suppose they wanted to simplify the DMA design). Finally, despite the great effort put into designing a "programmable" interface, I don't really think it was much easier to get to do what we needed than the LANCE was. True, it has a few fewer annoying programmability problems -- e.g., a less obscure method of selectively accepting multicast packets, and the decision not to append the useless hardware CRC to the end of each received packet, which the LANCE does and which is particularly painful for optimistic blast, because I don't want the CRC redirected into user memory, sigh -- but the overall decline in raw performance overshadows these issues. Heck, you only have to program the thing once; performance lives forever (and is what counts)!

John

[And for anyone out there reading the cc'd copy of this, I'd like to add my voice to the call for a better ethernet interface design. The existing ones are quite lacking in many ways, and put far too much load on the processor. If you're going to design one, I have some ideas for what would be useful. There were some interesting designs presented at Sigcomm '88 which address some of the problems I have with current designs, though not all of them.]

Finally, since I cc'd this to tcp-ip, all of the above is Copyright (c) 1988 by John Carter.
This way, I don't get misquoted and, more importantly, Van doesn't get misquoted by my references to his work.
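The header factoring above is easy to check. A trivial sketch of the arithmetic (my illustration, not John's code), assuming each 1024-byte data packet carries exactly the quoted 80-byte V header and 14-byte Ethernet header:

/* Trivial check of the header factoring in John's numbers above --
 * my arithmetic, not his code.  Each 1024 bytes of user data carry
 * 80 bytes of V interkernel header and 14 bytes of Ethernet header,
 * so the raw wire rate is the user rate scaled by ~1.09.            */
#include <stdio.h>

int main(void)
{
    const double factor = (1024.0 + 80.0 + 14.0) / 1024.0;   /* ~1.0918 */

    /* 8.2 Mbps user-to-user -> ~9.0 Mbps raw (John quotes 9.0).    */
    printf("LANCE pair:   %.2f Mbps raw\n", 8.2 * factor);
    /* ~6.0 Mbps user-to-user -> ~6.5 Mbps raw (John quotes 6.5;
     * the small difference is rounding in the quoted figures).     */
    printf("Intel->LANCE: %.2f Mbps raw\n", 6.0 * factor);
    return 0;
}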
LYNCH@A.ISI.EDU (Dan Lynch) (11/01/88)
Regarding copyrighting your missives to avoid being misquoted by the "press" (or anyone else): I think you are misguided. In fact, I think you are sadly mistaken (and "taken") by the copyright protection you are seeking. Just taking a sentence or two out of your "article" is permissible, and that is what I think you are trying to avoid. Heck, if you speak out, others will quote you.

Sigh,

Dan
-------
Mills@UDEL.EDU (11/03/88)
Dan,

If one writes well and has the patience
Someone will come from among the runners
And read what one has written quickly
And write out as much as he can remember
In the language of the highway

 - Yeats

Dave
retrac@RICE.EDU (John Carter) (11/07/88)
This is a short followup to a posting I made a while back in which I responded to Van Jacobson's comments on the poor performance of the Intel ethernet interface in the SUN-3/280 and similar workstations. At the time I commented, I had not fully completed my re-implementation of the V bulk data IPC protocols for the Intel interface. I recently completed a version, and want to make a few corrections to my previous posting.

I seem to have been a little hard on the Intel interface. I am able to get *peak* process-to-process throughputs (measured as I laid out in the previous posting) of a little over 8 Mbps (up to 8.2). This corresponds to a 1 Mbyte transfer taking about a second (8.4 million bits in 1.02 secs is about 8.2 Mbps). Unfortunately, it isn't very stable -- it seems to fluctuate between 6.5 Mbps and 8.2 Mbps. It appears that packets are getting dropped quite often, causing timeouts (argh). I'm not sure if it's the interface or a flaky network (our work net isn't a prime example of a well laid out and administered Ethernet...). Oh yeah, the above numbers are SUN-3/50 -> SUN-3/180.

The best that I can get the Intel interface to transmit at is 6.3-6.5 Mbps. I attain that by chaining together 32 packet descriptors for transmission at a time, then waiting until I get an ACK before I chain the next 32 (no "shadow" descriptors, i.e., I don't set up the next batch while awaiting the ACK). This dichotomy (transmit vs. receive) seems quite strange, and I don't have an explanation for it.

The Intel implementation is quite a bit smaller, thanks to its not having as many annoying "features" as the LANCE, particularly not having a hardware CRC trailer appended to every packet (which really made the optimistic blast implementation gross for handling certain error cases).

John Carter
Rice University

P.S. Several people asked for my opinions on interfaces and such. I've been quite busy lately; I'll try to respond soon.
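For readers who haven't driven one of these chips, here is a rough sketch of the transmit strategy John describes: build a chain of 32 descriptors, hand the whole chain to the chip, and only build the next chain after the protocol-level ACK arrives. The tx_desc layout and the two hooks below are hypothetical stand-ins -- NOT the real 82586 descriptor format or the V kernel's code.

/* Hedged sketch of the "chain 32 descriptors, wait for the ACK,
 * chain the next 32" transmit strategy described above.  All names
 * here are hypothetical stand-ins, not the 82586 programming model. */
#include <stddef.h>

#define CHAIN_LEN 32

struct tx_desc {
    struct tx_desc *next;          /* link the chip follows            */
    unsigned char  *data;          /* packet to transmit               */
    size_t          len;
    int             end_of_chain;  /* chip stops and interrupts here   */
};

/* Hypothetical hardware/protocol hooks (stubs so this compiles). */
static void chip_start_tx(struct tx_desc *head) { (void)head; }
static void wait_for_protocol_ack(void) { }

void blast(unsigned char *pkts[], size_t lens[], int npkts)
{
    static struct tx_desc ring[CHAIN_LEN];

    for (int base = 0; base < npkts; base += CHAIN_LEN) {
        int n = npkts - base;
        if (n > CHAIN_LEN)
            n = CHAIN_LEN;

        /* Build one chain of up to 32 transmit descriptors...       */
        for (int i = 0; i < n; i++) {
            ring[i].data         = pkts[base + i];
            ring[i].len          = lens[base + i];
            ring[i].next         = (i + 1 < n) ? &ring[i + 1] : NULL;
            ring[i].end_of_chain = (i + 1 == n);
        }

        /* ...hand the whole chain to the chip in one go...          */
        chip_start_tx(&ring[0]);

        /* ...and, with no "shadow" chain prepared in advance, idle
         * until the receiver's ACK before building the next batch.  */
        wait_for_protocol_ack();
    }
}

The obvious refinement, which John notes he is not doing, would be "shadow" descriptors: building the next chain while waiting for the ACK, overlapping descriptor setup with the protocol round trip.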