mogul@DECWRL.DEC.COM (Jeffrey Mogul) (03/02/88)
Someone recently asked: "If anyone has figures on maximum throughput of 10 Mbps Ethernet, I could make good use of them."

The snide answer is "10 Mbit/sec". Actually, the theoretical maximum is about 9.9 Mbit/sec, because of the inter-packet gap, and perhaps closer to 9.8 or 9.7 Mbit/sec if you count the addresses and CRC as wasted bits. That's probably not quite what you wanted to know, but the question as put does not have a single answer. For example, do you want the aggregate maximum or the maximum for a single pair of hosts?

For the latter, I believe Bill Nowicki of Sun has obtained 5 Mbit/sec using TCP between Suns, and the Sprite people at Berkeley have reached somewhere near 5.6 Mbit/sec using their kernel-to-kernel RPC. These numbers are both from memory; perhaps someone can confirm or improve them.

The best numbers that I am aware of, for communication between a pair of hosts, come from Dave Boggs. Using a pair of 15-MIPS processors with an interface and software of his own design, and without much of an operating system or any protocol (aside from the Ethernet data link headers), he can get about 7.5 Mbits/sec obeying the Ethernet spec, or at least 8.6 Mbits/sec if he "cheats" by sending 4Kbyte packets. (He once got >9 Mbits/sec with even larger packets, but his current code doesn't support that.)

The limiting factor in his measurements seems to be that his interface can only send one packet at a time; i.e., he must process one interrupt per transmitted packet, which takes a few hundred microseconds. The interface can receive up to 16 packets using only one interrupt. With a more elaborate interface design, the theoretical limit should be attainable.
budden@tetra.NOSC.MIL (Ray A. Buddenberg) (03/07/88)
The question needs more definition to give a straight answer. Unless you know which reference-model layer you mean, you can't properly reply. Similarly, the number of nodes on the net makes an awful lot of difference.

OK, to the Ethernet problem. 10 Mbit/sec at the physical layer turns into about 4 Mbit/sec at the network layer (neglecting the trivial case of only two nodes on the network). This decrease is due to the carrier-sense multiple-access characteristics of Ethernets. At the transport layer, this tends to decrease further to around 1 Mbit/sec due to transport overhead. The figure is plastic depending on whose TCP you are using. Unless you've got a lot of overhead in the higher layers, this 1 Mbit/sec figure is pretty much the application-to-application value.

So what does this all mean? At the network layer, a '4 Mbit' 802.5 token ring gets just as much data through as a '10 Mbit' Ethernet. Caveat emptor when the salesman comes to call. As higher-speed networks become common -- FDDI starts out at 100 Mbit/sec at the physical and network layers -- the transport layer becomes an even more obvious bottleneck. Indeed, it appears that the bottleneck may be shifting from the LAN media toward the DMA hardware.

Rex Buddenberg
hedrick@athos.rutgers.edu (Charles Hedrick) (03/09/88)
I've heard a second-hand report that Van Jacobson recently claimed an 8.2 Mbit/sec TCP transfer between two Sun 3/50's using a very aggressively tuned 4.3 BSD TCP. If I got that right, it's quite impressive, considering that a 3/50 is by current standards not that fast a machine. This suggests

  - that 10 > 4 (i.e. that you can get an Ethernet transfer that is faster than the maximum transfer on a 4 Mbit token ring)
  - that you don't necessarily have to abandon robust, long-haul protocols such as TCP in order to get good performance

I wonder if we could nominate Van for a Nobel prize? It seems to me that he deserves something for all the work he has been doing for TCP/IP.
alan@mn-at1.UUCP (Alan Klietz) (03/09/88)
In article <8803012359.AA00957@acetes.dec.com> mogul@DECWRL.DEC.COM (Jeffrey Mogul) writes:
<The best numbers that I am aware of, for communications between a pair
<of hosts, comes from Dave Boggs. Using a pair of 15 MIPs processors
<with software of his own design he can get about 7.5 Mbits/sec obeying
<the Ethernet spec, or at least 8.6 Mbits/sec if he "cheats" by sending
<4Kbyte packets.
<
<The limiting factor in his measurements seems to be that his interface
<can only send one packet at a time; i.e., he must process one interrupt
<per transmitted packet, which takes a few hundred microseconds. The
<interface can receive up to 16 packets using only one interrupt. With
<a more elaborate interface design, the theoretical limit should be
<attainable.
In general the maximum channel utilization possible by a single host,
assuming no errors, no contention, an unlimited output queue, and
no copy is,

    D = t_f / (t_f + t_p)

where t_f is the time required to wiggle the bits of an average-length data frame over the wire, and t_p is the processing overhead time to interrupt the host CPU, switch context, reset the interface hardware, allocate buffers, and do other (fixed) overhead. To push the utilization curve to near 1, you have to either make your data frames big (increase t_f, or equivalently process more frames per interrupt) or make your processing overhead small (decrease t_p).
Dave Boggs did both things and got good results. His processing overhead is 50us, or 20,000 packets per second. Impressive. (The results for the 4K case show a 66us overhead, from which we can infer that one of our original assumptions, such as no data copy, was probably wrong, but the results are still valid.)
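As a rough illustration of the utilization formula above, here is a small C sketch that plugs a few frame sizes and per-packet overheads into D = t_f / (t_f + t_p) for a 10 Mbit/s wire. The particular overhead values are made-up examples, not Boggs's measurements.

    /*
     * Sketch: single-host channel utilization D = t_f / (t_f + t_p)
     * on a 10 Mbit/s Ethernet.  The frame sizes and per-packet overheads
     * below are illustrative guesses, not measured numbers.
     */
    #include <stdio.h>

    #define WIRE_BPS 10.0e6                 /* 10 Mbit/s Ethernet */

    static double utilization(int frame_bytes, double t_p_usec)
    {
        double t_f = frame_bytes * 8.0 / WIRE_BPS * 1.0e6;  /* usec on the wire */
        return t_f / (t_f + t_p_usec);
    }

    int main(void)
    {
        int    sizes[]     = { 512, 1500, 4096 };            /* bytes per frame */
        double overheads[] = { 50.0, 200.0, 400.0 };          /* usec per frame (hypothetical) */

        for (int i = 0; i < 3; i++)
            for (int j = 0; j < 3; j++) {
                double d = utilization(sizes[i], overheads[j]);
                printf("frame %4d bytes, overhead %5.0f us: D = %.2f (%.1f Mbit/s)\n",
                       sizes[i], overheads[j], d, d * 10.0);
            }
        return 0;
    }

Even with full-size 1500-byte frames, a few hundred microseconds of per-packet overhead visibly eats into the 10 Mbit/s, which is exactly the effect described above.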
But herein lies the problem. As machines get faster and bigger, with
more pipelining and vectorization, and as the host network software
becomes bigger and more complicated, the per-message processing overhead
gets more expensive. And yet the data-links are becoming faster and
faster. FDDI is 100+ Mbit/s. The GaAs version will be 1000+ Mbit/s. The ANSI X3T9 Working Group is developing a spec for an HSC channel rated at 1600 Mbit/s. The importance of absolutely minimizing
the host overhead is something that I think is critical to get any sort
of decent usage of these links (e.g. buffering multiple messages per host
interrupt).
The problem is that many vendors still think the "critical path" for getting better performance is the wire (make the wire go faster and you get better results), when in actuality the critical path is reverting back to the processor. Chop off the zeros and it's not unlike the olden days of writing a 9600-baud serial driver for an Intel 8080 processor :-)
--
Alan Klietz
Minnesota Supercomputer Center (*)
1200 Washington Avenue South
Minneapolis, MN 55415 UUCP: alan@mn-at1.k.mn.org
Ph: +1 612 626 1836 ARPA: alan@uc.msc.umn.edu (was umn-rei-uc.arpa)
(*) An affiliate of the University of Minnesota
oconnor@SCCGATE.SCC.COM (Michael J. O'Connor) (03/09/88)
I also have heard "unofficial" reports of TCP transfers at greater than 8Mbps between a couple of SOS boxes by Van Jacobson. Stories of crashing Vaxen accompany the reports. At least one reporter claimed that this helped to demonstrate that Sun violates the Ethernet spec, allowing packets to be put on the wire too close together, effectively denying access to other hosts. Any chance of an "official" confirmation or denial of these rumors? Possibly an IDEA or RFC? Which leads to another question, what ever happened to IENs? Mike
hedrick@ARAMIS.RUTGERS.EDU (Charles Hedrick) (03/10/88)
We know of no evidence that Suns violate the minimum spacing requirement. Indeed, if they did, things that work for us would fail, so I don't believe it. However, 8Mb transfers would certainly open us to a whole range of things that have never been tested before. While Ethernet should in theory tolerate multiple simultaneous high-speed users, as far as I know it isn't being used that way now. Proper functioning would depend upon everybody's random backoff working right. Given the past history in networking, I am not prepared to bet that untested features work. One could imagine failures everywhere from Ethernet controller microcode to device drivers to protocol implementations.

It would also let us test whether protocol designs implicitly assume that an Ethernet can never be congested. Certainly DECnet deals very differently with Ethernet than with point-to-point lines, and to a certain extent IP does as well. When designers were thinking of Ethernet, I rather suspect they might not have considered the possibility that one host could actually use all 10Mb of bandwidth. It is possible that protocols such as ARP and DECnet hello would have to be rethought in this context.

(For the benefit of the person asking the original question, let me note that there's no reason to think that a token ring would help. Indeed it is slower, so one is in danger of reaching this unknown realm sooner.)
pogran@CCQ.BBN.COM (Ken Pogran) (03/10/88)
Charles,

Getting "maximum throughput" from a LAN has been a challenge since the inception of Local Area Network technology back in the '70s. Often there's a disparity between what the LAN specs permit (at the "MAC" layer) and what practical controller implementations are capable of delivering. The challenge is to implement a controller that can deliver something approaching the maximum throughput without being prohibitively expensive (i.e., from a vendor's perspective: "Well, we could build one that would go that fast, but it would price us out of the market."). So one sees compromises in what a SINGLE flow can obtain from a LAN.

In ETHERNET implementations, the compromise usually centers on how fast one can "turn around" a controller following one transaction (packet sent or received) to begin another one. On transmit, this limits how quickly you can send on an unloaded net. There are (or at least there were five years ago) some controller products that did a lot of "housekeeping" between outgoing packets, lowering throughput (I used one; I was disappointed). On receive, the effect is much more pernicious: if your controller takes "a long time" to get ready to receive the next incoming packet, and the controller sending to you has a faster turn-around than you do, your controller might be "blind" to the net and miss a packet, requiring retransmission by a higher-level protocol.

Applications can't bank on getting a significant fraction of a LAN; after all, what would happen when two (or ten) of 'em happen to run simultaneously? On the other hand, software implemented in an environment that only contains "slow" controllers might break in embarrassing ways when employed on systems with quick turn-around controllers and a lightly loaded net! Which brings up a good point for developers of software that runs right above the MAC layer: make sure it'll run on hardware "quicker" than yours.

> When designers were thinking of Ethernet, I rather suspect
> they might not have considered the possibility that one host
> could actually use all 10Mb of bandwidth.

The mechanisms of the ETHERNET care not a whit whether the next packet on the line is from the same guy as the last, or from someone else. I think it's clear that there were (are?) a number of CONTROLLER designs that assumed that no individual host would want to TRY to use all 10 Mb/s. I haven't looked at the insides of any recent controllers (or the timing specs of recent controller chips) but I'd be willing to bet that the choke point is more likely to be either the controller or its interface to the host, rather than the specs of the ETHERNET itself.

This all applies to token-ring LANs, too, of course. And the faster the aggregate data rate on the LAN, the more these arguments apply. (So I might, for example, be able to get a greater PERCENTAGE of a 4 Mb/s Token Ring than I can of a 10 Mb/s ETHERNET.)

Any comments from designers of LAN controllers?

Ken Pogran
van@LBL-CSAM.ARPA (Van Jacobson) (03/10/88)
I don't know what would constitute an "official" confirmation but maybe I can put some rumors to bed. We have done a TCP that gets 8Mbps between Sun 3/50s (the lowest we've seen is 7Mbps, the highest 9Mbps -- when using 100% of the wire bandwidth, the ethernet exponential backoff makes throughput very sensitive to the competing traffic distribution). The throughput limit seemed to be the Lance chip on the Sun -- the CPU was showing 10-15% idle time. I don't believe the idle time number (I want to really measure the idle time with a microprocessor analyzer) but the interactive response on the machines was pretty good even while they were shoving 1MB/s at each other so I know there was some CPU left over.

Yes, I did crash most of our VMS vaxen while running throughput tests and no, this has nothing to do with Sun violating protocols -- the problem was that the DECNET designers failed to use common sense. I fired off a 1GB transfer to see if it would really finish in 20 minutes (it took 18 minutes) and halfway through I noticed that our VMS 780 was rebooting. When I later looked at the crash dump I found that it had run out of non-paged pool because the DEUNA queue was full of packets. It seems that whoever did the protocols used a *linear* backoff on the retransmit timer. With 20 DECNET routers trying to babble the state of the universe every couple of minutes, and my Suns keeping the wire warm in the interim, any attempt to access the ether was going to put a host into serious exponential backoff. Under these circumstances, a linear transport timer just doesn't cut it. So I found 25 retransmissions in the outbound queue for every active DECNET connection. I know as little about VMS as possible so I didn't investigate why the machine had crashed rather than terminating the connections gracefully.

I should also note that NFS on our other Sun workstations wasn't all that happy about waiting for the wire: as I walked around the building, every Sun screen was filled with "server not responding" messages. (But no Sun crashed -- I later shut most of them down to keep ND traffic off the wire while I was looking for the upper bound on xfer rate.) I did run two simultaneous 100MB transfers between 4 3/50s and verified that they were gracious about sharing the wire. The total throughput was 7Mbps split roughly 60/40. The tcpdump trace of the two conversations has some holes in it (tcpdump can't quite hack a packet/millisecond, steady state) but the trace doesn't show anything weird happening.

Quite a bit of the speedup comes from an algorithm that we (`we' refers to collaborator Mike Karels and myself) are calling "header prediction". The idea is that if you're in the middle of a bulk data transfer and have just seen a packet, you know what the next packet is going to look like: it will look just like the current packet with either the sequence number or ack number updated (depending on whether you're the sender or receiver). Combining this with the "Use hints" epigram from Butler Lampson's classic "Epigrams for System Designers", you start to think of the tcp state (rcv.nxt, snd.una, etc.) as "hints" about what the next packet should look like. If you arrange those "hints" so they match the layout of a tcp packet header, it takes a single 14-byte compare to see if your prediction is correct (3 longword compares to pick up the send & ack sequence numbers, header length, flags and window, plus a short compare on the length).
If the prediction is correct, there's a single test on the length to see if you're the sender or receiver, followed by the appropriate processing. E.g., if the length is non-zero (you're the receiver), checksum and append the data to the socket buffer, then wake any process that's sleeping on the buffer. Update rcv.nxt by the length of this packet (this updates your "prediction" of the next packet). Check if you can handle another packet the same size as the current one. If not, set one of the unused flag bits in your header prediction to guarantee that the prediction will fail on the next packet and force you to go through full protocol processing. Otherwise, you're done with this packet.

So, the *total* tcp protocol processing, exclusive of checksumming, is on the order of 6 compares and an add. The checksumming goes at whatever the memory bandwidth is so, as long as the effective memory bandwidth is at least 4 times the ethernet bandwidth, the cpu isn't a bottleneck. (Let me make that clearer: we got 8Mbps with checksumming on.)

You can apply the same idea to outgoing tcp packets and most everywhere else in the protocol stack. I.e., if you're going fast, it's real likely this packet comes from the same place the last packet came from, so 1-behind caches of pcb's and arp entries are a big win if you're right and a negligible loss if you're wrong.

In addition to the header prediction, I put some horrible kluges in the mbuf handling to get the allocation/deallocations down to 1 per packet. Mike Karels has been working in the same area and his clean code is faster than my kluges. As soon as this semester is over, I plan to merge Mike's and my versions, then the two of us will probably make a pass at knocking off the biggest of the rough edges. Sometime in late spring or early summer we should be passing this code out to hardy souls for beta-testing (but ask yourself: do you really want workstations that routinely use 100% of the ethernet bandwidth? I'm pretty sure we don't and we're not running this tcp on any of our workstations).

Some of the impetus for this work came from Greg Chesson's statement at the Phoenix USENIX that `the way TCP and IP are designed, it's impossible to make them go fast'. On hearing this, I leapt to my feet to protest but decided that saying "I think you're wrong" wasn't going to change anybody's mind about anything. Now I can say that I'm pretty sure he was wrong about TCP (but that's *not* a comment on the excellent work he's doing and has done on the XTP and the Protocol Engine).

The header prediction algorithm evolved during attempts to make my 2400-baud SLIP dial-up send 4 bytes when I typed a character rather than 44. After staring at packet streams for a while, it was pretty obvious that the receiver could predict everything about the next packet on a TCP data stream except for the data bytes. Thus all the sender had to ship in the usual case was one bit that said "yes, your prediction is right" plus the data. I mention this because funding agents looking for high speed, next-generation networks may forget that research to make slow things go fast sometimes makes fast things go faster.

 - Van
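To make the header-prediction fast path a little more concrete, here is a rough C sketch of the receive side. The struct layout, field names, and helper are invented for illustration (byte-order handling and checksumming are elided); the actual 4.3BSD code keeps these hints in the protocol control block.

    /*
     * Illustrative sketch of the TCP "header prediction" fast path.
     * 'hdr' is assumed to point at the sequence-number field of the
     * incoming TCP header; names and layout are invented for this
     * example and byte-order handling is omitted.
     */
    #include <string.h>
    #include <stdint.h>

    struct tcp_pred {
        uint32_t seq;            /* expected sequence number (rcv.nxt)      */
        uint32_t ack;            /* expected ack number (snd.una)           */
        uint32_t off_flags_win;  /* expected data offset, flags, and window */
        uint16_t len;            /* expected segment length                 */
    };

    /* Return 1 if the segment matched the prediction and took the fast
     * path, 0 if it must go through full TCP input processing.          */
    static int tcp_fastpath(struct tcp_pred *pred, const uint8_t *hdr, uint16_t len)
    {
        /* Three longword compares plus a short compare: 14 bytes in all. */
        if (memcmp(hdr, pred, 12) != 0 || len != pred->len)
            return 0;

        if (len != 0) {
            /* Receiver: checksum, append data to the socket buffer, wake
             * the sleeping reader, and advance the prediction (rcv.nxt). */
            pred->seq += len;
        } else {
            /* Sender: a pure ack; drop acked data from the send buffer
             * and advance snd.una (not shown).                           */
        }
        return 1;
    }

When the compare fails, the packet simply falls through to ordinary TCP input processing, so a wrong prediction costs almost nothing.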
hedrick@ARAMIS.RUTGERS.EDU (Charles Hedrick) (03/10/88)
I thought it was obvious that I was talking about systems designers, not the people who made up the Ethernet specs. As I said, when we get into untested realms, such as 8Mb single transfers, we run the danger of finding bugs in the controller, its microcode, device drivers, and possibly even (though we will hope not) the protocols defining the Ethernet encapsulation for IP, DECnet, or whatever. We see very clearly the fact that many controllers can't handle bursts of packets with minimum spacing. Normally, however, they simply drop packets rather than crash the machine they are running on. I have no idea why DECnet machines would crash during bursts of TCP traffic (or even whether that rumor is true), but I would start looking at the design of the Ethernet controller and the device driver that is dealing with it. It could be something as simple as a bug in the code that handles situations where you are unable to get the Ethernet for an extended period of time, or something as complex as implicit assumptions in some piece of the DECnet protocol design. Of course it could also be a bug in the Sun that causes it to fail to defer to someone else when it is supposed to do so, but I sort of doubt that, since that should be handled by the Lance chip.
bzs@BU-CS.BU.EDU (Barry Shein) (03/11/88)
Breathtaking, in a word, who made that comment about Nobel Prizes for networking? At any rate, there is another interesting issue here: >(but ask yourself: do you really want workstations >that routinely use 100% of the ethernet bandwidth? I'm pretty >sure we don't and we're not running this tcp on any of our >workstations.) My temptation is to ask the converse: Do I wish to believe that merely mediocre algorithms protect me from this as a problem? It seems that given what we might call near-capacity algorithms (for ethernets, of course new wire technologies such as direct fiber hookups will be interesting again) we need to think about rational ways to administer such networks. In the trivial case we could isolate many of these workstations, as we already do here, to their own ethernets, barely shared, so it is less of a problem. Perhaps this would spur vendors to provide hardware to make that even easier and more economical. This would certainly be useful in the case of networked file systems using a client/server model (eg. diskless or disk-poor clients.) Beyond that I have often thought of the idea of a network "throttle", a settable parameter that might control maximum throughput (packets output per second, for example) that a machine might limit itself to. Obviously that requires voluntary compliance (although it could be an extension of window advertising, that is, making the window behavior tunable by the system administrator rather than calculated for maximum throughput always based upon blind assumptions about resources.) Interesting, at any rate... -Barry Shein, Boston University
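One way to picture the "network throttle" idea above is a token bucket consulted in the output path. The sketch below is purely illustrative: the names and the packets-per-second tunable are hypothetical, not part of any existing stack.

    /*
     * Sketch of a settable output "throttle": a simple token bucket a
     * driver's output routine could consult before queueing a packet.
     * The structure and tunables are hypothetical.
     */
    #include <time.h>

    struct throttle {
        double rate;     /* settable: packets per second allowed    */
        double burst;    /* settable: maximum burst size in packets */
        double tokens;   /* current credit                          */
        double last;     /* time of last refill, in seconds         */
    };

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    /* Return 1 if a packet may be sent now, 0 if it should be delayed. */
    static int throttle_ok(struct throttle *t)
    {
        double now = now_sec();

        t->tokens += (now - t->last) * t->rate;   /* refill credit */
        if (t->tokens > t->burst)
            t->tokens = t->burst;
        t->last = now;

        if (t->tokens < 1.0)
            return 0;
        t->tokens -= 1.0;
        return 1;
    }

As Shein notes, anything like this relies on voluntary compliance unless it is tied into the transport's window advertisements.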
melohn@SUN.COM (Bill Melohn) (03/11/88)
In article <8803091444.AA24112@sccgate.scc.com> oconnor@SCCGATE.SCC.COM (Michael J. O'Connor) writes:
>... At least one reporter claimed that this helped to
>demonstrate that Sun violates the Ethernet spec, allowing packets to be put
>on the wire too close together, effectively denying access to other hosts.

We hear these stories at Sun all the time, usually from customers who have been told by other computer vendors that the Sun somehow "cheats" on the Ethernet spec by "putting too many packets on the wire", or "sending too many packets too fast for Ethernet". These are complete falsehoods. Sun uses standard chips supplied by both Intel and AMD for our ethernet interfaces, and while they may be the fastest implementations available on the market, they are completely within Ethernet specifications. Next time you hear one of these stories from a sales person for another computer vendor, ask them to back their claim with facts, and seriously consider the source.
rajaei@ttds.UUCP (Hassan Rajaei) (03/12/88)
In article <412@mn-at1.UUCP> alan@mn-at1.UUCP (0000-Alan Klietz) writes:
>But herein lies the problem. As machines get faster and bigger, with
>more pipelining and vectorization, and as the host network software
>becomes bigger and more complicated, the per-message processing overhead
>gets more expensive. And yet the data-links are becoming faster and
>faster. FDDI is 100+ mbit/s. The GaAs version will be 1000+ mbit/s.
>The ANSI X3T9 Working Group is developing a spec for a HSC channel rated
>at 1600 mbit/s per second. The importance of absolutely minimizing
>the host overhead is something that I think is critical to get any sort
>of decent usage of these links (e.g. buffering multiple messages per host
>interrupt).
>

You are right. We are reaching a point where a host machine can no longer cope with the communication problems alone. We have to offload the host as much as we can by introducing a dedicated communications system rather than some simple board. Protocol engines are something to think about.

Hassan Rajaei
rajaei@ttds.tds.kth.se
jas@MONK.PROTEON.COM (John A. Shriver) (03/12/88)
One way to deal with those violent controller speed mismatches that are starting to show up is to use a LAN with some form of hardware ACK. This is part of all token ring networks (ProNET, Apollo, 802.5, FDDI), as well as Hyperchannel and DEC's CI. While these bits cannot be used to guarantee end-to-end reliability, they do let you know when you are sending faster than the other guy can receive. The simplest thing to do is just retransmit until the other guy copies it, but this is a waste of network bandwidth. The more sophisticated thing to do is put the packet back on the front of the output queue, and search up the queue for a packet to a different hardware destination. This way the packet retransmissions are interleaved with potentially useful transmissions. You don't, however, want to shuffle the order of packets to a particular node; this can be pathological to some protocols. (The NSP in DECnet-VAX did not accept out-of-order packets until Version 4.7; some TCPs have the same problem.)

Good programming interfaces are indeed important. While our earlier boards were only single-buffered (one each way), their simple and fast programming interface (start a transmit in 4 C statements) made up for it. (The ACK bit also helps on the single-buffer problem.) It is very hard to try and get below the interrupt-per-packet threshold and keep a simple programming interface. However, even if you make it, you run into programming interfaces, such as the 4.3bsd network device interface, that make it very difficult to get a pipeline going. By comparison, in VAX/VMS, where pipelining is quite feasible, when you pipeline two packets deep, you nearly double the throughput.

As an example of how programming interfaces and interface design affect performance, note that Van's benchmarks were run on a Sun-3/50, which uses the AMD LANCE Ethernet chipset. I doubt he could match those numbers on a Sun-3/160, which uses an Intel 82586 Ethernet chip. The '586 has an inferior programming interface, and some nasty internal "housekeeping" delays that make it a good bit slower.
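A minimal sketch of the queue-scan idea described above, assuming a singly linked output queue keyed by hardware destination (the names and queue representation are invented for illustration):

    /*
     * Sketch: after the packet at the head of the output queue is NAKed
     * at the hardware level, leave it at the head and transmit the first
     * queued packet bound for a *different* destination, so that
     * retransmissions interleave with useful work without reordering any
     * single node's traffic.  Illustrative only.
     */
    #include <stddef.h>
    #include <string.h>

    struct opkt {
        struct opkt  *next;
        unsigned char dst[6];    /* hardware destination address */
        /* ... packet data ... */
    };

    /* Pick the next packet to send while the NAKed head packet waits.
     * Returns NULL if every queued packet is for the same destination. */
    static struct opkt *pick_next(struct opkt *head)
    {
        struct opkt *p;

        /* The first packet bound for a different node is safe to send:
         * everything ahead of it in the queue goes to the node that just
         * refused us, so no node's traffic gets reordered.              */
        for (p = head->next; p != NULL; p = p->next)
            if (memcmp(p->dst, head->dst, 6) != 0)
                return p;

        return NULL;             /* nothing eligible; retry the head later */
    }

This keeps per-destination ordering intact, which matters for the protocols Shriver mentions that cannot tolerate out-of-order delivery.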
thomson@oldhub.toronto.edu (Brian Thomson) (03/17/88)
Van Jacobson's results are certainly startling, but I can't help believing that a significant part of that speedup must be in changes to the mbuf handling, the socket code, and the LANCE driver. My evidence is a test I ran on a 3/50: I defined a 'protocol' whose PRU_SEND action was to checksum each mbuf then hand it directly to the driver, with a dummy AF_UNSPEC destination so there would be no ARPing going on. This exercises vanilla SUN mbuf, socket, and interface driver code, while replacing all of TCP/IP by simple checksumming - so no protocols at all. The data goes nowhere, and there are no acks to deal with.

Even so, this configuration could not source data to the wire faster than about 3.6Mb/sec. I could hit 8Mb/sec if I threw the data away right after checksumming, without passing it to the driver at all.
--
Brian Thomson, CSRI Univ. of Toronto
utcsri!uthub!thomson, thomson@hub.toronto.edu
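For reference, the checksumming step being timed in tests like this is just the Internet ones-complement sum. A plain, deliberately unoptimized C version over a flat buffer looks roughly like the sketch below; the real 4.3BSD in_cksum() walks an mbuf chain and is carefully unrolled, but the work is still about one add per 16 bits of data, which is why it runs at roughly memory speed.

    /*
     * Plain C Internet (ones-complement) checksum over a flat buffer.
     * Unoptimized sketch for illustration; not the BSD in_cksum().
     */
    #include <stdint.h>
    #include <stddef.h>

    static uint16_t in_cksum(const uint8_t *buf, size_t len)
    {
        uint32_t sum = 0;

        while (len > 1) {                    /* sum 16 bits at a time */
            sum += (uint32_t)buf[0] << 8 | buf[1];
            buf += 2;
            len -= 2;
        }
        if (len == 1)                        /* odd trailing byte */
            sum += (uint32_t)buf[0] << 8;

        while (sum >> 16)                    /* fold carries back in */
            sum = (sum & 0xffff) + (sum >> 16);

        return (uint16_t)~sum;
    }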