van@EE.LBL.GOV (Van Jacobson) (06/05/91)
A minor milestone in high-speed, long-haul networking happened at the recent Educom National Net '91 conference in Washington, DC: TCP transfers from a Cray YMP at Pittsburgh Supercomputer Center to a Sun Sparcstation-2 on the show floor ran at a sustained rate of 10Mb/s (by `sustained' I mean sustained over hundreds of megabytes of data transferred -- in one of the first tests, we shipped 256MB from the PSC to DC in 3.9 minutes). The rough topology was:

              FDDI          T3 backbone         Ethernet
             100Mb/s          45Mb/s             10Mb/s
     PSC YMP -------> PSC ENSS --....--> DC ENSS ---------> SS-2

The limiting bandwidth was the show ethernet & we ran at that bandwidth. If we could have gotten in & out of the ENSS at >= 45Mb/s, we would have run at the 45Mb/s NSFNet backbone bandwidth. (Dave Borman's Cray TCP has been measured at over 300Mb/s. My TCP on the SS-2 runs 48Mb/s through the loopback interface (i.e., copying all the data twice, doing both sides of the protocol & context switching on every packet) and should easily sink or source >100Mb/s as soon as Sun comes up with a decent high-speed network interface.)

The real milestone is that two completely independent implementations of the RFC1072/RFC1185 "fat pipe" extensions, one by Dave Borman of Cray Research for Unicos on the YMP & one by me for an experimental TCP/IP running on the SS-2, interoperated with no problems.

[This gave me no small measure of joy: Sun demonstrated their normal inability to supply system source to academic sites & I didn't get a copy of 4.1.1 (the only version of Sun OS that would run on a SS-2) until Friday of the week before the show. I spent a very long weekend throwing out Sun's network & IPC code & replacing it with my experimental 4.4BSD stuff and got a working kernel 45 minutes before the movers came to crate up our lone Sparcstation-2 & ship it from California to DC.
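(An aside on what the "fat pipe" extensions buy you: TCP's base header carries only a 16-bit window field, so any window bigger than 64KB -- like the half-megabyte one used here -- has to be advertised through the RFC1072 window-scale option, a shift count agreed on during connection setup. A sketch of the arithmetic, with a function name of my own invention, not code from either implementation:

```python
def window_scale(desired_window):
    """Smallest window-scale shift count whose scaled 16-bit window
    covers the desired window, plus the maximum window actually
    advertisable at that shift."""
    MAX_SHIFT = 14  # cap from the later RFC1323 refinement
    for shift in range(MAX_SHIFT + 1):
        if (0xFFFF << shift) >= desired_window:
            return shift, 0xFFFF << shift
    raise ValueError("window too large to advertise")

# A 512KB (524288-byte) window needs a shift of 4:
# 0xFFFF << 3 = 524280 bytes falls just short of 512KB,
# while 0xFFFF << 4 = 1048560 bytes covers it.
```

With no scaling at all, the best you can do is 0xFFFF = 65535 bytes -- nowhere near enough to keep a transcontinental T3 full.)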
So the very first chance to test my RFC1072 implementation against Dave Borman's was when we got the machine uncrated & connected the SS-2 to the show net in DC. I was all resigned to another 36 hours of sleepless debugging when, to my utter astonishment, the Cray & Sparcstation flawlessly negotiated a 512KB window & timestamps, then happily started exchanging data at a very high rate.

(Several other groups were busy setting up their demos. I started up my usual throughput test programs on the Cray & Sun, then looked down at the `recv' indicator light on the Cabletron transceiver & noticed it was on solid -- I stared at it for at least 30 seconds & it didn't even blink. Right after that I heard one of the Cornell people say ``What happened to the network? Our connection seems to have stopped.'' and I did my best to look innocent while the test finished.)]

The big disappointment: I was worried about what would happen when the TCP congestion avoidance / recovery algorithms started interacting with half a megabyte of stored energy (in-transit data) in the pipe, so I'd spent a bunch of time tuning the Fast Retransmit and Fast Recovery algorithms & had a bunch of instrumentation set up to watch how they'd perform. But the damn T3 NSFNet refused to drop packets -- we shipped slightly more than 30GB (roughly 22 million data packets) during the three days of the show & the kernel network stats said that 6 were dropped. Naturally, I wasn't watching for any of these 6, & a 0.00003% loss rate wasn't enough to test any of the spiffy new algorithms. Oh well, maybe next time I'll bring my wire cutters and chop halfway through the cable to make things a bit more interesting.

[The one interesting piece of behavior had to do with the ethernet controller on the Sun: the LANCE is a half-duplex device, so if it's receiving back-to-back ethernet packets it will never attempt to transmit, even if it has data available to send.
TCP will attempt to ack every other packet but, since new data packets were arriving back-to-back from the ENSS, the ack packets just got queued in the LANCE on the SS-2. Since the acks can't get out, eventually (after half a megabyte is sent) the window on the Cray should close, it should stop sending packets, & the SS-2 should get to dump 180 back-to-back ack packets, re-opening the window & restarting the data flow. This would have resulted in essentially the stop-and-wait performance of a BLAST protocol but, fortunately, there seems to be a 128KB buffer somewhere in the T3 NSFNet, so after 90 data packets there was a 30us pause, allowing the SS-2 LANCE to grab the ethernet, dump 45 back-to-back acks, then start collecting the next 90 data packets. So the window on the Cray never shut & the pipe stayed full -- due, essentially, to an engineering mistake in the network (a transcontinental T3 run at full T3 rate needs at least half a meg of buffer everywhere a queue can form). This, incidentally, is one reason why we used a half-megabyte window instead of the 80KB window needed to fill the 36ms-RTT, 10Mb/s bandwidth-delay product.]

Anyway, I'm writing this mostly to document that it happened & to thank all the people at LBL, PSC, Merit, Cray & Sun who made it happen. I'm particularly grateful to Dave Borman for some great Cray TCP software and to Geoff Baehr at Sun for battling with the lawyers and getting us the system pieces we needed (it's nice to know that at least part of Sun still ranks engineering over bean counting). And I am eternally grateful to Wendy Huntoon of PSC & Elise Gerich of Merit, who put in long, long hours to get the new and almost untested PSC & NSF T3 connections up & running, then kept everything running smoothly and essentially trouble-free for the entire show.

 - Van Jacobson
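[A footnote on the numbers in the LANCE story above: the 180-ack, 90-packet & 45-ack figures are all consistent with a standard 1460-byte ethernet TCP payload -- an assumption on my part, since the post never states the segment size. A quick sketch of the arithmetic:

```python
MSS = 1460             # assumed ethernet TCP payload size (not stated in the post)
WINDOW = 512 * 1024    # the negotiated half-megabyte window
BUFFER = 128 * 1024    # the buffer apparently lurking in the T3 NSFNet

segs_in_window = round(WINDOW / MSS)      # segments a full window holds: 359
acks_in_window = round(WINDOW / MSS / 2)  # ack-every-other-packet: 180 queued acks
segs_per_burst = round(BUFFER / MSS)      # data packets before the pause: 90
acks_per_burst = segs_per_burst // 2      # acks the LANCE dumps each pause: 45
```

All three quoted figures fall out of the same 1460-byte assumption.]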