[comp.dcom.lans] LANCE vs. Intel

smb@ulysses.homer.nj.att.com (Steven M. Bellovin) (01/06/89)

Here, reposted with permission, is Van Jacobson's article describing his
measurements.  He added, in his note to me, that he's done other tests with
raw Ethernet, UDP, and NFS read/write, and he saw *no* cases where the
Intel chip could outperform the LANCE.


------
Path: ulysses!att!rutgers!mailrus!ames!pasteur!agate!ucbvax!HELIOS.EE.LBL.GOV!van
From: van@HELIOS.EE.LBL.GOV (Van Jacobson)
Newsgroups: comp.protocols.tcp-ip
Subject: 4BSD TCP Ethernet Throughput
Message-ID: <8810242033.AA29183@helios.ee.lbl.gov>
Date: 24 Oct 88 20:33:13 GMT
Sender: daemon@ucbvax.BERKELEY.EDU
Organization: The Internet
Lines: 139

Many people have asked for the Ethernet throughput data I
showed at Interop so it's probably easier to post it:

These are some throughput results for an experimental version of
the 4BSD (Berkeley Unix) network code running on a couple of
different MC68020-based systems: Sun 3/60s (20MHz 68020 with AMD
LANCE Ethernet chip) and Sun 3/280s (25MHz 68020 with Intel
82586 Ethernet chip) [note again the tests were done with Sun
hardware but not Sun software -- I'm running 4.?BSD, not Sun
OS].  There are lots and lots of interesting things in the data
but the one thing that seems to have attracted people's
attention is the big difference in performance between the two
Ethernet chips.

The test measured task-to-task data throughput over a TCP
connection from a source (e.g., chargen) to a sink (e.g.,
discard).  The tests were done between 2am and 6am on a fairly
quiet Ethernet (~100Kb/s average background traffic).  The
packets were all maximum size (1538 bytes on the wire or 1460
bytes of user data per packet).  The free parameters for the
tests were the sender and receiver socket buffer sizes (which
control the amount of 'pipelining' possible between the sender,
wire and receiver).  Each buffer size was independently varied
from 1 to 17 packets in 1 packet steps.  Four tests were done at
each of the 289 combinations.  Each test transferred 8MB of data
then recorded the total time for the transfer and the send and
receive socket buffer sizes (8MB was chosen so that the worst
case error due to the system clock resolution was ~.1% -- 10ms
in 10sec).  The 1,156 tests per machine pair were done in random
order to prevent any effects from fixed patterns of resource
allocation.
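
For concreteness, here is a minimal sketch of what the sender side of
such a test might look like (hypothetical code, not the program
actually used; it assumes the 4.3BSD-style socket interface, uses the
discard service as the sink, and takes the send buffer size in
packets on the command line):

/*
 * Hypothetical sender for a socket-buffer throughput test: set
 * SO_SNDBUF, time an 8MB transfer to a discard-style sink.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define MSS    1460                  /* user data per packet */
#define TOTAL  (8 * 1024 * 1024)     /* 8MB per test */

int main(int argc, char **argv)
{
    char buf[MSS];
    struct sockaddr_in sin;
    struct timeval t0, t1;
    int s, sndbuf;
    long sent = 0;
    double secs;

    if (argc != 3) {
        fprintf(stderr, "usage: %s host sndbuf-packets\n", argv[0]);
        exit(1);
    }
    sndbuf = atoi(argv[2]) * MSS;    /* buffer size given in packets */

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = AF_INET;
    sin.sin_port = htons(9);         /* discard service as the sink */
    sin.sin_addr.s_addr = inet_addr(argv[1]);

    if ((s = socket(AF_INET, SOCK_STREAM, 0)) < 0) {
        perror("socket");
        exit(1);
    }
    setsockopt(s, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));
    if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
        perror("connect");
        exit(1);
    }

    memset(buf, 0, sizeof(buf));
    gettimeofday(&t0, NULL);
    while (sent < TOTAL) {
        if (write(s, buf, MSS) != MSS) {
            perror("write");
            exit(1);
        }
        sent += MSS;
    }
    gettimeofday(&t1, NULL);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    printf("%ld bytes in %.2f sec = %.0f KB/s\n",
           sent, secs, sent / secs / 1024);
    return 0;
}

A matching sink would set SO_RCVBUF the same way and read-and-discard;
varying both knobs from 1 to 17 packets gives the 289 combinations
described above.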

In general, the maximum throughput was observed when the sender
buffer equaled the receiver buffer (the reason why is complicated
but has to do with collisions).  The following table gives the
task-to-task data throughput (in KBytes/sec) and throughput on
the wire (in MBits/sec) for (a) a 3/60 sending to a 3/60 and
(b) a 3/280 sending to a 3/60.

	_________________________________________________
	|              3/60 to 3/60   |  3/280 to 3/60   |
	|            (LANCE to LANCE) | (Intel to LANCE) |
	| socket                      |                  |
	| buffer     task to          | task to          |
	|  size       task      wire  |  task      wire  |
	|(packets)   (KB/s)    (Mb/s) | (KB/s)    (Mb/s) |
	|    1         384      3.4   |   337      3.0   |
	|    2         606      5.4   |   575      5.1   |
	|    3         690      6.1   |   595      5.3   |
	|    4         784      6.9   |   709      6.3   |
	|    5         866      7.7   |   712      6.3   |
	|    6         904      8.0   |   708      6.3   |
	|    7         946      8.4   |   710      6.3   |
	|    8         954      8.4   |   718      6.4   |
	|    9         974      8.6   |   715      6.3   |
	|   10         983      8.7   |   712      6.3   |
	|   11         995      8.8   |   714      6.3   |
	|   12        1001      8.9   |   715      6.3   |
	|_____________________________|__________________|

The theoretical maximum data throughput, after you take into
account all the protocol overheads, is 1,104 KB/s (this
task-to-task data rate would put 10Mb/s on the wire).  You can
see that the 3/60s get 91% of the theoretical max.  The
3/280, although a much faster processor (the CPU performance is
really dominated by the speed of the memory system, not the
processor clock rate, and the memory system in the 3/280 is
almost twice the speed of the 3/60), gets only 65% of
theoretical max.
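
As a back-of-the-envelope check (my framing assumptions, not spelled
out in the article), the wire column above is roughly the
task-to-task rate scaled by the per-packet overhead: 1460 data bytes
ride in 1538 byte-times on the wire (preamble, headers, CRC,
interframe gap), plus about half of a minimum-size ack frame per data
segment:

/*
 * Rough conversion from task-to-task rate to wire rate, under my
 * assumptions: 1538 byte-times per data frame, one ~84-byte-time ack
 * (minimum frame plus preamble and gap) per two data segments.
 */
#include <stdio.h>

#define DATA_BYTES  1460.0   /* user data per segment */
#define WIRE_BYTES  1538.0   /* byte-times per data frame on the wire */
#define ACK_BYTES     84.0   /* byte-times per ack frame, incl. gap */

int main(void)
{
    /* task-to-task rates (KB/s) from the LANCE-to-LANCE column */
    double kbs[] = { 384, 606, 866, 1001 };
    double overhead = (WIRE_BYTES + ACK_BYTES / 2) / DATA_BYTES;
    int i;

    for (i = 0; i < 4; i++) {
        double user_bps = kbs[i] * 1024 * 8;
        printf("%6.0f KB/s task-to-task -> %4.1f Mb/s on the wire\n",
               kbs[i], user_bps * overhead / 1e6);
    }
    return 0;
}

With those numbers the conversion lands within a tenth of a Mb/s of
the table's wire column; the exact 1,104 KB/s theoretical figure
depends on precisely how the ack traffic is charged.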

The low throughput of the 3/280 seems to be entirely due to the
Intel Ethernet chip: at around 6Mb/s, it saturates.  (I put the
board on an extender and watched the bus handshake lines on the
82586 to see if the chip or the Sun interface logic was pooping
out.  It was the chip -- it just stopped asking for data.)  (The
CPU was loafing along with at least 35% idle time during all
these tests, so it wasn't the limit.)

[Just so you don't get confused:  Stuff above was measurements.
 Stuff below includes opinions and interpretation and should
 be viewed with appropriate suspicion.]

If you graph the above, you'll see a large notch in the Intel
data at 3 packets.  This is probably a clue to why it's dying:
TCP delivers one ack for every two data packets.  At a buffer
size of three packets, the collision rate increases dramatically
since the sender's third packet will collide with the receiver's
ack for the previous two packets (for buffer sizes of 1 and 2,
there are effectively no collisions).  My suspicion is that the
Intel is taking a long time to recover from collisions (remember
that you're 64 bytes into the packet when you find out you've
collided so the chip bus logic has to back up 64 bytes -- Intel
spent their silicon making the chip "programmable"; I doubt they
invested as much as AMD in the bus interface).  This may or may
not be what's going on:  life is too short to spend debugging
Intel parts so I really don't care to investigate further.
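
For a rough sense of the timing involved, here is the
back-of-the-envelope version of that argument (my arithmetic and
simplifying assumptions, not Van's):

/*
 * Frame and slot times on a 10 Mb/s wire, assuming negligible host
 * processing delay at both ends.
 */
#include <stdio.h>

int main(void)
{
    double bit_us  = 0.1;                /* one bit-time at 10 Mb/s */
    double data_us = 1538 * 8 * bit_us;  /* max-size data frame on the wire */
    double slot_us = 512 * bit_us;       /* 64-byte slot time */

    printf("data frame: %.1f us on the wire\n", data_us);
    printf("slot time:  %.1f us (64 bytes)\n", slot_us);

    /*
     * With a 3-packet window, the sender's third segment and the
     * receiver's ack for segments 1-2 are both queued at the instant
     * the wire goes idle (about 2 * 1230 us in), so both stations can
     * start within one slot time of each other and collisions become
     * routine.  The colliding transmitter is then up to 64 bytes into
     * the frame and its bus logic has to back up that far to retry.
     */
    printf("collision detected within the first %d bytes = %.1f%% of a "
           "max-size frame\n", 64, 100.0 * 64 / 1538);
    return 0;
}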

The one annoyance in all this is that Sun puts the fast Ethernet
chip (the AMD LANCE) in their slow machines (3/50s and 3/60s)
and the slow Ethernet chip (Intel 82586) in their fast machines
(3/180s, 3/280s and Sun-4s, i.e., all their file servers).
[I've had to put delay loops in the Ethernet driver on the 3/50s
and 3/60s to slow them down enough for the 3/280 server to keep
up.]  Sun's not to blame for anything here:  It costs a lot
to design a new Ethernet interface; they had a design for the
3/180 board set (which was the basis of all the other VME
machines--the [34]/280 and [34]/110); and no market pressure to
change it.  If they hadn't ventured out in a new direction with
the 3/[56]0 -- the LANCE -- I probably would have thought
700KB/s was great Ethernet throughput (at least until I saw
Dave Boggs' DEC-Titan/Seeq-chip throughput data).

But I think Sun is overdue in offering a high-performance VME
Ethernet interface.  That may change though -- VME controllers
like the Interphase 4207 Eagle are starting to appear, which
should put pressure on Sun and/or offer a high-performance
3rd party alternative (I haven't actually tried an
Eagle yet but from the documentation it looks like they did a
lot of things right).  I'd sure like to take the delay loops out
of my LANCE driver...

 - Van

ps: I have data for Intel-to-Intel and LANCE-to-Intel as well as
    the Intel-to-LANCE I listed above.  Using an Intel chip on the
    receiver, the results are MUCH worse -- 420KB/s max.  I chose
    the data that put the 82586 in its very best light.

    I also have scope pictures taken at the transceivers during all
    these tests.  I'm sure there'll be a chorus of "so-and-so violates
    the Ethernet spec" but that's a lie -- NONE OF THESE CHIPS OR
    SYSTEMS VIOLATED THE ETHERNET SPEC IN ANY WAY, SHAPE OR FORM.
    I looked very carefully for violations and have the pictures to
    prove there were none.

djh@feathers.ATT.COM (Dan Hansen--615-5450) (01/06/89)

The Intel 82586 allows the programmer to scatter a frame (for
transmission or reception) in non-contiguous buffers.  In addition,
the number of bytes in each buffer is chosen by the programmer.
This flexibility is why the 82586 is THE most powerful LAN controller
on the market (sez me).  Unfortunately this flexibility can cause
poor performance if the programmer is not careful with his choices.
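
To make that concrete, here is a hypothetical C rendering of an 82586
transmit-buffer-descriptor chain; the field names and bit positions
are from memory and meant only to illustrate the idea, so check the
Intel data sheet before relying on any of it:

/*
 * Illustrative (not data-sheet-accurate) 82586 transmit buffer
 * descriptor: a per-buffer byte count with an end-of-frame bit, a
 * link to the next descriptor, and a 24-bit buffer address.
 */
#include <stdint.h>

struct tbd {
    uint16_t count;      /* bits 0-13: bytes in this buffer; bit 15: EOF */
    uint16_t next;       /* offset of the next TBD, or 0xffff to end */
    uint32_t buf_addr;   /* 24-bit physical address of the buffer */
};

#define TBD_EOF   0x8000
#define TBD_NULL  0xffff

/*
 * One frame split into a small header buffer plus a large data
 * buffer.  Every extra buffer in the chain is another descriptor
 * fetch and DMA setup the chip performs per frame, which is where a
 * careless layout (say, many tiny fragments) hurts.
 */
void build_chain(struct tbd *t, uint32_t hdr, uint32_t data, int len)
{
    t[0].count    = 14;                        /* header fragment */
    t[0].next     = sizeof(struct tbd);        /* illustrative offset */
    t[0].buf_addr = hdr;

    t[1].count    = (len & 0x3fff) | TBD_EOF;  /* last fragment */
    t[1].next     = TBD_NULL;
    t[1].buf_addr = data;
}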

From my reading of Van Jacobson's data, this is probably the case in
his experiment.  The fact that the host CPU is not 100% active is not
significant.  The host CPU was probably idle during most of the wire
time (1.2ms for a 1518-byte frame).  The poor performance was probably
caused by excessive processing prior to transmission and after reception
of frames (the wire would be idle during this time).  Remember that the
amount of chip-related processing is largely dependent on how the
programmer lays out the structures for frames.

I have programmed a variety of other LAN controllers (SEEQ, 8390, 82588)
so I am familiar with different vendors' approaches.  The 82586 can
closely match or surpass the performance of any chip I have seen.  I
have written drivers that transmit and receive data at 9.5Mb/s using the
82586 in 10MHz 80286-based PCs.


Dan Hansen

wsmith@umn-cs.CS.UMN.EDU (Warren Smith [Randy]) (01/11/89)

In article <84@feathers.ATT.COM> djh@feathers.ATT.COM (Dan Hansen--615-5450) writes:
>
>The Intel 82586 allows the programmer to scatter a frame (for
>transmission or reception) in non-contiguous buffers.  In addition
>the number of bytes in each buffer is chosen by the programmer.
>This flexibility is why the 82586 is THE most powerful LAN controller
>on the market (sez me).  [...]

The Lance also has this capability (and debatably (!) the scatter/gather is
better because each buffer has an individual size).
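
For comparison, here is a similarly hypothetical sketch of a LANCE
(Am7990) transmit ring entry (again, bit positions from memory, for
illustration only): each buffer of a frame gets its own descriptor
and byte count, with STP/ENP marking the first and last fragments, so
scatter/gather is not what separates the two chips.

/*
 * Illustrative LANCE transmit ring entry: per-buffer address and
 * byte count, with ownership and start/end-of-packet bits.
 */
#include <stdint.h>

struct lance_tmd {
    uint16_t addr_lo;    /* low 16 bits of the buffer address */
    uint16_t flags;      /* high address bits + OWN/STP/ENP status */
    uint16_t bcnt;       /* two's-complement byte count of this buffer */
    uint16_t misc;       /* error status, TDR */
};

#define TMD_OWN  0x8000   /* descriptor owned by the chip */
#define TMD_STP  0x0200   /* first buffer of the frame */
#define TMD_ENP  0x0100   /* last buffer of the frame */

/* mark a two-buffer frame: headers in the first, payload in the second */
void mark_frame(struct lance_tmd *d)
{
    d[0].flags |= TMD_STP | TMD_OWN;
    d[1].flags |= TMD_ENP | TMD_OWN;
}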

It should be pointed out that Van's 82586 tests all use the Sun Intel controller board.
I am sure the problems with bandwidth all relate to that controller, since
a large+fast dual-ported buffer should let either of these two chips
crank data at near 10Mbps.

Randy
wsmith@umn-cs.cs.umn.edu

-- 
Randy Smith
wsmith@umn-cs.cs.umn.edu
...!rutgers!umn-cs!wsmith