van@HELIOS.EE.LBL.GOV (Van Jacobson) (10/25/88)
Many people have asked for the Ethernet throughput data I showed at Interop, so it's probably easier to post it:

These are some throughput results for an experimental version of the 4BSD (Berkeley Unix) network code running on a couple of different MC68020-based systems: Sun 3/60s (20MHz 68020 with AMD LANCE Ethernet chip) and Sun 3/280s (25MHz 68020 with Intel 82586 Ethernet chip) [note again that the tests were done with Sun hardware but not Sun software -- I'm running 4.?BSD, not Sun OS]. There are lots and lots of interesting things in the data, but the one thing that seems to have attracted people's attention is the big difference in performance between the two Ethernet chips.

The test measured task-to-task data throughput over a TCP connection from a source (e.g., chargen) to a sink (e.g., discard). The tests were done between 2am and 6am on a fairly quiet Ethernet (~100Kb/s average background traffic). The packets were all maximum size (1538 bytes on the wire, or 1460 bytes of user data per packet). The free parameters for the tests were the sender and receiver socket buffer sizes (which control the amount of 'pipelining' possible between the sender, wire and receiver). Each buffer size was independently varied from 1 to 17 packets in 1-packet steps. Four tests were done at each of the 289 combinations. Each test transferred 8MB of data, then recorded the total time for the transfer and the send and receive socket buffer sizes (8MB was chosen so that the worst-case error due to the system clock resolution was ~.1% -- 10ms in 10sec). The 1,156 tests per machine pair were done in random order to prevent any effects from fixed patterns of resource allocation.

In general, the maximum throughput was observed when the sender buffer equaled the receiver buffer (the reason why is complicated but has to do with collisions). The following table gives the task-to-task data throughput (in KBytes/sec) and the throughput on the wire (in MBits/sec) for (a) a 3/60 sending to a 3/60 and (b) a 3/280 sending to a 3/60.

 _________________________________________________
|        3/60 to 3/60         |  3/280 to 3/60   |
|      (LANCE to LANCE)       | (Intel to LANCE) |
|                             |                  |
| socket                      |                  |
| buffer    task to           |  task to         |
|  size      task     wire    |   task    wire   |
|(packets)  (KB/s)   (Mb/s)   |  (KB/s)  (Mb/s)  |
|    1        384      3.4    |    337    3.0    |
|    2        606      5.4    |    575    5.1    |
|    3        690      6.1    |    595    5.3    |
|    4        784      6.9    |    709    6.3    |
|    5        866      7.7    |    712    6.3    |
|    6        904      8.0    |    708    6.3    |
|    7        946      8.4    |    710    6.3    |
|    8        954      8.4    |    718    6.4    |
|    9        974      8.6    |    715    6.3    |
|   10        983      8.7    |    712    6.3    |
|   11        995      8.8    |    714    6.3    |
|   12       1001      8.9    |    715    6.3    |
|_____________________________|__________________|

The theoretical maximum data throughput, after you take into account all the protocol overheads, is 1,104 KB/s (this task-to-task data rate would put 10Mb/s on the wire). You can see that the 3/60s get 91% of the theoretical max. The 3/280, although a much faster processor (CPU performance is really dominated by the speed of the memory system, not the processor clock rate, and the memory system in the 3/280 is almost twice the speed of the 3/60's), gets only 65% of the theoretical max.

The low throughput of the 3/280 seems to be entirely due to the Intel Ethernet chip: at around 6Mb/s, it saturates. (I put the board on an extender and watched the bus handshake lines on the 82586 to see if the chip or the Sun interface logic was pooping out. It was the chip -- it just stopped asking for data. The CPU was loafing along with at least 35% idle time during all these tests, so it wasn't the limit.)
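For readers who want to reproduce this kind of measurement, here is a minimal sketch of the sender side in modern BSD-sockets C -- an illustration of the methodology described above, not the actual 1988 test harness. The host/port arguments, the discard-style sink, and the wire-rate accounting (1538 wire bytes per full-size data packet plus half of an 84-byte ack frame, since TCP acks every second packet) are my assumptions; that accounting approximately reproduces the wire column in the table above, though the exact bookkeeping behind the 1,104 KB/s figure may differ.

/*
 * Hedged sketch of the test described above -- NOT the original
 * harness.  Sets the send socket buffer to a given number of
 * packets, blasts 8MB at a discard-style sink, and reports the
 * task-to-task rate plus an estimated wire rate.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

#define TOTAL_BYTES (8L * 1024 * 1024)  /* 8MB per test, as in the text     */
#define MSS         1460                /* user data per full-size packet   */
#define WIRE_BYTES  1538                /* preamble+hdrs+data+CRC+gap       */
#define ACK_SHARE   42                  /* 84-byte ack per 2 data packets   */

int main(int argc, char **argv)
{
    if (argc != 4) {
        fprintf(stderr, "usage: %s host port sndbuf-packets\n", argv[0]);
        return 1;
    }
    int sndbuf = atoi(argv[3]) * MSS;   /* buffer size in packets, as above */

    int s = socket(AF_INET, SOCK_STREAM, 0);
    if (s < 0 || setsockopt(s, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0) {
        perror("socket/setsockopt");
        return 1;
    }

    struct sockaddr_in sin;
    memset(&sin, 0, sizeof(sin));
    sin.sin_family      = AF_INET;
    sin.sin_port        = htons((unsigned short)atoi(argv[2]));
    sin.sin_addr.s_addr = inet_addr(argv[1]);
    if (connect(s, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
        perror("connect");
        return 1;
    }

    static char buf[MSS];               /* payload contents don't matter */
    struct timeval t0, t1;
    gettimeofday(&t0, NULL);
    for (long sent = 0; sent < TOTAL_BYTES; ) {
        ssize_t n = write(s, buf, sizeof(buf));
        if (n < 0) {
            perror("write");
            return 1;
        }
        sent += n;
    }
    gettimeofday(&t1, NULL);
    close(s);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
    double kbs  = TOTAL_BYTES / 1024.0 / secs;      /* task-to-task KB/s */
    /* Wire estimate: 1538 wire bytes carry each 1460-byte payload, plus
     * half of an 84-byte ack frame per data packet (one ack per two
     * packets).  This approximately matches the table's Mb/s column.   */
    double wire = kbs * 1024.0 * 8.0 * (WIRE_BYTES + ACK_SHARE) / MSS / 1e6;
    printf("%.0f KB/s task-to-task, ~%.1f Mb/s on the wire\n", kbs, wire);
    return 0;
}

The matching receiver would simply set SO_RCVBUF to the other free parameter and read until EOF.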
[Just so you don't get confused: Stuff above was measurements. Stuff below includes opinions and interpretation and should be viewed with appropriate suspicion.]

If you graph the above, you'll see a large notch in the Intel data at 3 packets. This is probably a clue to why it's dying: TCP delivers one ack for every two data packets. At a buffer size of three packets, the collision rate increases dramatically, since the sender's third packet will collide with the receiver's ack for the previous two packets (for buffer sizes of 1 and 2, there are effectively no collisions). My suspicion is that the Intel is taking a long time to recover from collisions (remember that you're 64 bytes into the packet when you find out you've collided, so the chip's bus logic has to back up 64 bytes -- Intel spent their silicon making the chip "programmable"; I doubt they invested as much as AMD in the bus interface). This may or may not be what's going on: life is too short to spend debugging Intel parts, so I really don't care to investigate further.

The one annoyance in all this is that Sun puts the fast Ethernet chip (the AMD LANCE) in their slow machines (3/50s and 3/60s) and the slow Ethernet chip (the Intel 82586) in their fast machines (3/180s, 3/280s and Sun-4s, i.e., all their file servers). [I've had to put delay loops in the Ethernet driver on the 3/50s and 3/60s to slow them down enough for the 3/280 server to keep up.] Sun's not to blame for anything here: it costs a lot to design a new Ethernet interface; they had a design for the 3/180 board set (which was the basis of all the other VME machines -- the [34]/280 and [34]/110); and there was no market pressure to change it. If they hadn't ventured out in a new direction with the 3/[56]0 -- the LANCE -- I probably would have thought 700KB/s was great Ethernet throughput (at least until I saw Dave Boggs' DEC-Titan/Seeq-chip throughput data).

But I think Sun is overdue in offering a high-performance VME Ethernet interface. That may change, though -- VME controllers like the Interphase 4207 Eagle are starting to appear, which should either put pressure on Sun or offer a high-performance third-party alternative (I haven't actually tried an Eagle yet, but from the documentation it looks like they did a lot of things right). I'd sure like to take the delay loops out of my LANCE driver...

 - Van

ps: I have data for Intel-to-Intel and LANCE-to-Intel as well as the Intel-to-LANCE I listed above. Using an Intel chip on the receiver, the results are MUCH worse -- 420KB/s max. I chose the data that put the 82586 in its very best light.

I also have scope pictures taken at the transceivers during all these tests. I'm sure there'll be a chorus of "so-and-so violates the Ethernet spec" but that's a lie -- NONE OF THESE CHIPS OR SYSTEMS VIOLATED THE ETHERNET SPEC IN ANY WAY, SHAPE OR FORM. I looked very carefully for violations and have the pictures to prove there were none.

Finally, all of the above is Copyright (c) 1988 by Van Jacobson. If you want to reproduce any part of it in print, you damn well better ask me first -- I'm getting tired of being misquoted in trade rags.
retrac@RICE.EDU (John Carter) (10/27/88)
Van,

I've made similar measurements on similar machines, and come to roughly the same conclusions. My measurements are in the context of the V operating system's interkernel protocols, but for raw hardware speed comparisons this shouldn't matter.

Between two SUN-3/50s or SUN-3/60s (both with the LANCE interface), I can sustain about 8.2 Mbps user-to-user performance (not chargen-source to discard-sink). From a SUN-3/180 (somewhat slower than the 3/280, but with the same Intel ethernet interface) to a SUN-3/60, I can sustain slightly more than 6.0 Mbps user-to-user performance. All these measurements are for 1024-byte data packets, with 80-byte V interkernel headers (ouch!!!) and 14-byte ethernet headers. Factoring the headers in (a scale factor of (1024+80+14)/1024, about 1.09 -- see the short check after this message), the SUN-3/50 -> SUN-3/50 throughput is 9.0 Mbps and the SUN-3/180 -> SUN-3/50 throughput is 6.5 Mbps. Roughly the same raw numbers... The SUN-2/50s (which use the Intel interface, but are significantly slower) can maintain around 4.7-5.1 Mbps in or out. These are very rough, since I haven't fully debugged the implementation on the 2's.

[ The following is opinion and shouldn't be construed as gospel. ]

I also have put only a little bit of effort into determining the exact cause of the disparity. I had come to the same conclusion you had regarding the 82586's DMA ability, namely that it isn't very good (and can only sustain 60-70% of the network performance). You conjectured that the interface takes a long time to recover from collisions. I hadn't seen too many collisions, so I hadn't thought of that, but it seems to fit with some other observations I've made concerning the interface. The Intel interface tends to drop packets reasonably frequently when receiving large packet bursts (blasts), presumably because of its inability to DMA into memory fast enough. Another problem I have had, which seems to be caused by the interface, is that it takes a relatively long time to interrupt the processor when an event occurs (either a packet reception or a transmission completing). An annoying "feature" of the interface is that you can't have receive buffers start on odd boundaries (I suppose they wanted to simplify the DMA design). Finally, despite the great effort put into designing a "programmable" interface, I don't really think it was much easier to get to do what we needed than the LANCE was. True, it has a few fewer annoying programmability problems -- e.g., a less obscure method of selectively accepting multicast packets, and the decision not to append the useless hardware CRC to the end of each received packet, which the LANCE does and which is particularly painful for optimistic blast, because I don't want the CRC redirected into user memory, sigh -- but the overall decline in raw performance overshadows these issues. Heck, you only have to program the thing once; performance lives forever (and is what counts)!

John

[And for anyone out there reading the cc'd copy of this, I'd like to add my voice to the call for a better ethernet interface design. The existing ones are quite lacking in many ways, and put far too much load on the processor. If you're going to design one, I have some ideas for what would be useful. There were some interesting designs presented at Sigcomm '88 which address some of the problems I have with current designs, though not all of them.]

Finally, since I cc'd this to tcp-ip, all of the above is Copyright (c) 1988 by John Carter.
This way, I don't get misquoted and, more importantly, Van doesn't get misquoted by my references to his work.
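The header factoring above is easy to check. A trivial sketch of the arithmetic (my illustration, not John's code), assuming each 1024-byte data packet carries exactly the quoted 80-byte V header and 14-byte Ethernet header:

/* Trivial check of the header factoring in John's numbers above --
 * my arithmetic, not his code.  Each 1024 bytes of user data carry
 * 80 bytes of V interkernel header and 14 bytes of Ethernet header,
 * so the raw wire rate is the user rate scaled by ~1.09.            */
#include <stdio.h>

int main(void)
{
    const double factor = (1024.0 + 80.0 + 14.0) / 1024.0;   /* ~1.0918 */

    /* 8.2 Mbps user-to-user -> ~9.0 Mbps raw (John quotes 9.0).    */
    printf("LANCE pair:   %.2f Mbps raw\n", 8.2 * factor);
    /* ~6.0 Mbps user-to-user -> ~6.5 Mbps raw (John quotes 6.5;
     * the small difference is rounding in the quoted figures).     */
    printf("Intel->LANCE: %.2f Mbps raw\n", 6.0 * factor);
    return 0;
}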
LYNCH@A.ISI.EDU (Dan Lynch) (11/01/88)
Regarding copyrighting your missives to avoid being misquoted by the "press" (or anyone else): I think you are misguided. In fact, I think you are sadly mistaken (and "taken") by the copyright protection you are seeking. Just taking a sentence or two out of your "article" is permissible, and that is what I think you are trying to avoid. Heck, if you speak out, others will quote you.

Sigh,

Dan
-------
Mills@UDEL.EDU (11/03/88)
Dan,

If one writes well and has the patience
Someone will come from among the runners
And read what one has written quickly
And write out as much as he can remember
In the language of the highway

 - Yeats

Dave
retrac@RICE.EDU (John Carter) (11/07/88)
This is a short followup to a posting I made a while back in which I responded to Van Jacobson's comments on the poor performance of the Intel ethernet interface in the SUN-3/280 and similar workstations. At the time I commented, I had not fully completed my re-implementation of the V bulk data IPC protocols for the Intel interface. I recently completed a version, and want to make a few corrections to my previous posting.

I seem to have been a little hard on the Intel interface. I am able to get *peak* process-to-process throughputs (measured as I laid out in the previous posting) of a little over 8 Mbps (up to 8.2). This corresponds to a 1 Mbyte transfer taking about a second (8.4 million bits in 1.02 secs is about 8.2 Mbps). Unfortunately, it isn't very stable -- it seems to fluctuate between 6.5 Mbps and 8.2 Mbps. It appears that packets are getting dropped quite often, causing timeouts (argh). I'm not sure if it's the interface or a flaky network (our work net isn't a prime example of a well laid out and administered Ethernet...). Oh yeah, the above numbers are SUN-3/50 -> SUN-3/180.

The best that I can get the Intel interface to transmit at is 6.3-6.5 Mbps. I attain that by chaining together 32 packet descriptors for transmission at a time, then waiting until I get an ACK before I chain the next 32 (no "shadow" descriptors, i.e., I don't set up the next batch while awaiting the ACK). This dichotomy (transmit vs. receive) seems quite strange, and I don't have an explanation for it.

The Intel implementation is quite a bit smaller, thanks to its not having as many annoying "features" as the LANCE, particularly not having a hardware CRC trailer appended to every packet (which really made the optimistic blast implementation gross for handling certain error cases).

John Carter
Rice University

P.S. Several people asked for my opinions on interfaces and such. I've been quite busy lately; I'll try to respond soon.
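For readers who haven't driven one of these chips, here is a rough sketch of the transmit strategy John describes: build a chain of 32 descriptors, hand the whole chain to the chip, and only build the next chain after the protocol-level ACK arrives. The tx_desc layout and the two hooks below are hypothetical stand-ins -- NOT the real 82586 descriptor format or the V kernel's code.

/* Hedged sketch of the "chain 32 descriptors, wait for the ACK,
 * chain the next 32" transmit strategy described above.  All names
 * here are hypothetical stand-ins, not the 82586 programming model. */
#include <stddef.h>

#define CHAIN_LEN 32

struct tx_desc {
    struct tx_desc *next;          /* link the chip follows            */
    unsigned char  *data;          /* packet to transmit               */
    size_t          len;
    int             end_of_chain;  /* chip stops and interrupts here   */
};

/* Hypothetical hardware/protocol hooks (stubs so this compiles). */
static void chip_start_tx(struct tx_desc *head) { (void)head; }
static void wait_for_protocol_ack(void) { }

void blast(unsigned char *pkts[], size_t lens[], int npkts)
{
    static struct tx_desc ring[CHAIN_LEN];

    for (int base = 0; base < npkts; base += CHAIN_LEN) {
        int n = npkts - base;
        if (n > CHAIN_LEN)
            n = CHAIN_LEN;

        /* Build one chain of up to 32 transmit descriptors...       */
        for (int i = 0; i < n; i++) {
            ring[i].data         = pkts[base + i];
            ring[i].len          = lens[base + i];
            ring[i].next         = (i + 1 < n) ? &ring[i + 1] : NULL;
            ring[i].end_of_chain = (i + 1 == n);
        }

        /* ...hand the whole chain to the chip in one go...          */
        chip_start_tx(&ring[0]);

        /* ...and, with no "shadow" chain prepared in advance, idle
         * until the receiver's ACK before building the next batch.  */
        wait_for_protocol_ack();
    }
}

The obvious refinement, which John notes he is not doing, would be "shadow" descriptors: building the next chain while waiting for the ACK, overlapping descriptor setup with the protocol round trip.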