van@EE.LBL.GOV (Van Jacobson) (06/05/91)
A minor milestone in high-speed, long-haul networking happened at the recent Educom National Net '91 conference in Washington, DC: TCP transfers from a Cray YMP at Pittsburgh Supercomputer Center to a Sun Sparcstation-2 on the show floor ran at a sustained rate of 10Mb/s (by `sustained' I mean sustained over hundreds of megabytes of data transferred -- in one of the first tests, we shipped 256MB from the PSC to DC in 3.9 minutes). The rough topology was:

              FDDI          T3 backbone         Ethernet
             100Mb/s          45Mb/s             10Mb/s
     PSC YMP -------> PSC ENSS --....--> DC ENSS ---------> SS-2

The limiting bandwidth was the show ethernet & we ran at that bandwidth. If we could have gotten in & out of the ENSS at >= 45Mb/s, we would have run at the 45Mb/s NSFNet backbone bandwidth. (Dave Borman's Cray TCP has been measured at over 300Mb/s. My TCP on the SS-2 runs 48Mb/s through the loopback interface (i.e., copying all the data twice, doing both sides of the protocol & context switching on every packet) and should easily sink or source >100Mb/s as soon as Sun comes up with a decent high-speed network interface.)

The real milestone is that two completely independent implementations of the RFC1072/RFC1185 "fat pipe" extensions, one by Dave Borman of Cray Research for Unicos on the YMP & one by me for an experimental TCP/IP running on the SS-2, interoperated with no problems.

[This gave me no small measure of joy: Sun demonstrated their normal inability to supply system source to academic sites & I didn't get a copy of 4.1.1 (the only version of Sun OS that would run on a SS-2) until Friday of the week before the show. I spent a very long weekend throwing out Sun's network & IPC code & replacing it with my experimental 4.4BSD stuff and got a working kernel 45 minutes before the movers came to crate up our lone Sparcstation-2 & ship it from California to DC.
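(An aside on what the "fat pipe" extensions buy you: TCP's base header carries only a 16-bit window field, so any window bigger than 64KB -- like the half-megabyte one used here -- has to be advertised through the RFC1072 window-scale option, a shift count agreed on during connection setup. A sketch of the arithmetic, with a function name of my own invention, not code from either implementation:

```python
def window_scale(desired_window):
    """Smallest window-scale shift count whose scaled 16-bit window
    covers the desired window, plus the maximum window actually
    advertisable at that shift."""
    MAX_SHIFT = 14  # cap from the later RFC1323 refinement
    for shift in range(MAX_SHIFT + 1):
        if (0xFFFF << shift) >= desired_window:
            return shift, 0xFFFF << shift
    raise ValueError("window too large to advertise")

# A 512KB (524288-byte) window needs a shift of 4:
# 0xFFFF << 3 = 524280 bytes falls just short of 512KB,
# while 0xFFFF << 4 = 1048560 bytes covers it.
```

With no scaling at all, the best you can do is 0xFFFF = 65535 bytes -- nowhere near enough to keep a transcontinental T3 full.)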
So the very first chance to test my RFC1072 implementation against Dave Borman's was when we got the machine uncrated & connected the SS-2 to the show net in DC. I was all resigned to another 36 hours of sleepless debugging when, to my utter astonishment, the Cray & Sparcstation flawlessly negotiated a 512KB window & timestamps, then happily started exchanging data at a very high rate.

(Several other groups were busy setting up their demos. I started up my usual throughput test programs on the Cray & Sun, then looked down at the `recv' indicator light on the Cabletron transceiver & noticed it was on solid -- I stared at it for at least 30 seconds & it didn't even blink. Right after that I heard one of the Cornell people say ``What happened to the network? Our connection seems to have stopped.'' and I did my best to look innocent while the test finished.)]

The big disappointment: I was worried about what would happen when the TCP congestion avoidance / recovery algorithms started interacting with half a megabyte of stored energy (in-transit data) in the pipe, so I'd spent a bunch of time tuning the Fast Retransmit and Fast Recovery algorithms & had a bunch of instrumentation set up to watch how they'd perform. But the damn T3 NSFNet refused to drop packets -- we shipped slightly more than 30GB (roughly 22 million data packets) during the three days of the show & the kernel network stats said that 6 were dropped. Naturally, I wasn't watching for any of these 6, & a 0.00003% loss rate wasn't enough to test any of the spiffy new algorithms. Oh well, maybe next time I'll bring my wire cutters and chop halfway through the cable to make things a bit more interesting.

[The one interesting piece of behavior had to do with the ethernet controller on the Sun: the LANCE is a half-duplex device, so if it's receiving back-to-back ethernet packets it will never attempt to transmit, even if it has data available to send.
TCP will attempt to ack every other packet but, since new data packets were arriving back-to-back from the ENSS, the ack packets just got queued in the LANCE on the SS-2. Since the acks can't get out, eventually (after half a megabyte is sent) the window on the Cray should close, it should stop sending packets, & the SS-2 should get to dump 180 back-to-back ack packets, re-opening the window & restarting the data flow. This would have resulted in essentially the stop-and-wait performance of a BLAST protocol but, fortunately, there seems to be a 128KB buffer somewhere in the T3 NSFNet, so after 90 data packets there was a 30us pause, allowing the SS-2 LANCE to grab the ethernet, dump 45 back-to-back acks, then start collecting the next 90 data packets. So the window on the Cray never shut & the pipe stayed full -- due, essentially, to an engineering mistake in the network (a transcontinental T3 run at full T3 rate needs at least half a meg of buffer everywhere a queue can form). This, incidentally, is one reason why we used a half-megabyte window instead of the 80KB window needed to fill the 36ms-RTT, 10Mb/s bandwidth-delay product.]

Anyway, I'm writing this mostly to document that it happened & to thank all the people at LBL, PSC, Merit, Cray & Sun who made it happen. I'm particularly grateful to Dave Borman for some great Cray TCP software and to Geoff Baehr at Sun for battling with the lawyers and getting us the system pieces we needed (it's nice to know that at least part of Sun still ranks engineering over bean counting). And I am eternally grateful to Wendy Huntoon of PSC & Elise Gerich of Merit, who put in long, long hours to get the new and almost untested PSC & NSF T3 connections up & running, then kept everything running smoothly and essentially trouble-free for the entire show.

 - Van Jacobson
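[A footnote on the numbers in the LANCE story above: the 180-ack, 90-packet & 45-ack figures are all consistent with a standard 1460-byte ethernet TCP payload -- an assumption on my part, since the post never states the segment size. A quick sketch of the arithmetic:

```python
MSS = 1460             # assumed ethernet TCP payload size (not stated in the post)
WINDOW = 512 * 1024    # the negotiated half-megabyte window
BUFFER = 128 * 1024    # the buffer apparently lurking in the T3 NSFNet

segs_in_window = round(WINDOW / MSS)      # segments a full window holds: 359
acks_in_window = round(WINDOW / MSS / 2)  # ack-every-other-packet: 180 queued acks
segs_per_burst = round(BUFFER / MSS)      # data packets before the pause: 90
acks_per_burst = segs_per_burst // 2      # acks the LANCE dumps each pause: 45
```

All three quoted figures fall out of the same 1460-byte assumption.]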