[comp.protocols.tcp-ip] some interim notes on the bsd network speedups

van@HELIOS.EE.LBL.GOV (Van Jacobson) (07/20/88)

I told the potential beta-testers for our new 4bsd network code
that I hoped to have a version of the code out by the end of
July.  (BTW, we've got all the beta testers we can handle --
please don't apply.)  It looks like that's going to turn into
the end of August, in part because of SIGCOMM and in part
because Sun puts sending source to academic customers on the
bottom of its priority list.  I thought I'd flame about the
latter and give a bit of a status report on the new code.

I've been trying to put together a beta distribution for the
"header prediction" bsd network code.  This effort has been
stymied because it's impossible to get current source from Sun.
The code involves major changes to the 4.3bsd kernel.  The only
local machines I can use to develop new kernels are Suns --
everything else is either multi-user or has pathetic ethernet
hardware.  But the only Sun kernel source I've got is the doubly
obsolete Sun OS 3.5/4.2 BSD.  It would be a massive waste of
time to upgrade this system to 4.3 BSD just so I can develop
the next BSD -- Bill Nowicki did the upgrade a year ago and
binaries of the new system (totally worthless to me) are
shipping as Sun OS 4.0.  [I'm not the only one suffering this
problem -- I understand Craig Partridge's multicast work is
suffering because he can't get 4.3-compatible Sun source.  I
think he gave up & decided to do all his development on 4.3bsd
Vaxen.  And I think I heard Chuck Hedrick say 4.0 has all the
rlogin, URG and nameserver bugs that we fondly remember fixing
in 3.x.  And he has to get source before the academic year
starts or they won't be able to switch until a semester break.
And Mike Karels is saying "I told you so" and suggesting I buy
some CCIs.  Pity that Sun can't figure out that it's in their
best interest to do TIMELY source distribution to the academic
and research community -- their software development base gets
expanded a few hundred-fold for the cost of making tapes.]

Anyway, now that I've vented my spleen, there are some interim
results to talk about.  While waiting for either useful source
or a better hardware platform to show up, I've been cleaning up
my original mods and backing out changes one and two at a time
to gauge their individual effect.  Because network performance
seems to rest on getting a lot of things happening in parallel,
this leave-one-out testing doesn't give simple good/bad answers
(I had one test case that went like

	Basic system:	600 KB/s
	add feature A:	520 KB/s
	drop A, add B:	530 KB/s
	add both A & B:	700 KB/s

Obviously, any statement of the form "feature A/B is good/bad"
is bogus.)  But, in spite of the ambiguity, some of the network
design folklore I've heard seems to be clearly wrong.

In the process of cleaning things up, the code slowed down.  Task-
to-task data throughput using TCP between two Sun 3/60s dropped
from 1 MB/s (about 8.6 Mb/s on the wire) to 890 KB/s (about 7.6
Mb/s on the wire).  I know where the 11% was lost (an
interaction between "sosend" and the fact that an AMD LANCE chip
requires at least 100 bytes in the first chunk of data if you
ask it to chain -- massive braindamage on AMD's part) and how to
get it back (re-do the way user data gets handed to the
transport protocol) but need to talk with Mike Karels about the
"right" way to do this.

Still, 890 KB/s represents a non-trivial improvement over the
stock Sun/4bsd system:  Under identical test conditions (same
socket & user buffer sizes, same MSS and MTU, same machines),
the best tcp throughput I could get with an out-the-box Sun OS
3.5 was 380 KB/s.  I wish I could say "make these two simple
changes and your throughput will double" but I can't.  There
were lots and lots of fiddley little changes, none made a huge
difference and they all seemed to be necessary.

The biggest single effect was a change to sosend (the routine
between the user "write" syscall and tcp_output).  Its loop
looked something like:

    while there is user data & space in the socket buffer
	copy from user space to socket
    call the protocol "send" routine

After hooking a scope to our ethernet cable & looking at the
packet spacings, I changed this to

    while there is user data & space in the socket buffer
	copy up to 1K (one cluster's worth) from user space to socket
	call the protocol "send" routine

and the throughput jumped from 380 to 456 KB/s (+20%).  There's
one school of thought that says the first loop was better
because it minimized the "boundary crossings", the fixed costs
of routine calls and context changes.  This same school is
always lobbying for "bigger":  bigger packets, bigger windows,
bigger buffers, for essentially the same reason: the bigger
chunks are, the fewer boundary crossings you pay for.  The
correct school, mine :-), says there's always a fixed cost and a
variable cost (e.g., the cost of maintaining tcp state and
tacking a tcp packet header on the front of some data is
independent of the amount of data; the cost of filling in the
checksum field in that header scales linearly with the amount of
data).  If the size is large enough to make the fixed cost small
compared to the variable cost, making things bigger LOWERS
throughput because you throw away opportunities for parallelism.
Looking at the ethernet, I saw a burst of packets, a long dead
time, another burst of packets, ... .  It was clear that while
we were copying 4 KB from the user, the processor in the LANCE
chip and tcp_input on the destination machine were mostly
sitting idle.
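
In C, the new inner loop looks roughly like this (a sketch only:
the helper names are invented, not the actual 4.3bsd sosend
internals):

    while (resid > 0 && sockbuf_has_space(so)) {
        len = MIN(resid, CLBYTES);       /* one cluster's worth (1K here) */
        m = copy_from_user(uio, len);    /* user space -> mbuf cluster */
        append_to_sockbuf(so, m);        /* queue it on the socket buffer */
        resid -= len;
        protocol_send(so);               /* e.g. tcp_output -- once per
                                          * cluster, so the LANCE and the
                                          * receiver get work while we go
                                          * back for the next chunk */
    }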

To get good network performance, where there are guaranteed to
be many processors that could be doing things in parallel, you
want the "units of work" (loop sizes, packet sizes, etc.) to be
the SMALLEST values that amortise the fixed cost.  In Berkeley
Unix, the fixed costs of protocol processing are pretty low and
sizes of 1 - 2 KB on a 68020 are as large as you'd want to get.
(This is easy to determine.  Just do a throughput vs. size test
and look for the knee in the graph.  Best performance is just to
the right of the knee.)  And, obviously, on a faster machine
you'd probably want to do things in even smaller units (if the
overhead stays the same -- Crays are fast but hardware
strangeness drives the fixed costs way up.  Just remember that
if it takes large packets and large buffers to get good
performance on a fast machine, that's because it's broken, not
because it's fast.)
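
If you want the knee written down as a formula (a toy model only --
F is the fixed cost per chunk and c the variable cost per KB, and
they're whatever your own throughput-vs-size test measures, not
numbers from the tests above):

    double
    chunk_rate(size_kb, F, c)           /* KB/sec through this stage */
        double size_kb, F, c;
    {
        return size_kb / (F + c * size_kb);
    }

The curve climbs steeply while F dominates and flattens once c*size
swamps F; "just to the right of the knee" is the smallest size where
the fixed cost has become noise.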

Another big effect (difficult to quantify because several
changes had to be made at once) was to remove lots of
more-or-less hidden data copies from the protocol processing.
It's a truism that you never copy data in network code.  Just
lay the data down in a buffer & pass around pointers to
appropriate places in that buffer.  But there are lots of places
where you need to get from a pointer into the buffer back to a
pointer to the buffer itself (e.g., you've got a pointer to the
packet's IP header, there's a header checksum error, and you
want to free the buffer holding the packet).  The routine "dtom"
converts a data pointer back to a buffer pointer but it only
works for small things that fit in a single mbuf (the basic
storage allocation unit in the bsd network code).  Incoming
packets are never in an mbuf; they're in a "cluster" which the
mbuf points to.  There's no way to go from a pointer into a
cluster to a pointer to the mbuf.  So, anywhere you might need
to do a dtom (anywhere there's a detectable error), there had to
be a call to "m_pullup" to make sure the dtom would work.
(M_pullup works by allocating a fresh mbuf, copying a bunch of
data from the cluster to this mbuf, then chaining the original
cluster behind the new mbuf.)
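
For reference, the reason dtom loses on cluster data (the macro below
is only the flavor of the real one -- see mbuf.h):

    /* dtom rounds a data pointer down to an mbuf boundary, which only
     * finds the owning mbuf if the data really lives inside that mbuf */
    #define dtom(x)   ((struct mbuf *)((int)(x) & ~(MSIZE - 1)))

    /* a pointer into a separately allocated cluster rounds down to
     * garbage, so the old code had to m_pullup() -- i.e. copy -- the
     * headers into a real mbuf anywhere a later dtom might be needed */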

So, we were doing a bunch of copying to expedite error handling.
But errors usually don't happen (if you're worried about
performance, first you make sure there are very, very few
errors), so we were doing a bunch of copying for nothing.  But,
if you're sufficiently insomniac, in about a week you can track
down all the pullup's associated with all the dtom's and change
things to avoid both.  This requires massive recoding of both
the TCP and IP re-assembly code.  But it was worth it:  TCP
throughput climbed from around 600 KB/s to 750 KB/s and IP
forwarding just screamed:  A 3/60 forwarding packets at the 9
Mb/s effective ethernet bandwidth used less than 50% of the CPU.

[BTW, in general I didn't find anything wrong with the BSD
mbuf/cluster model.  In fact, I tried some other models (e.g.,
do everything in packet sized chunks) and they were slower.
There are many cases where knowing that you can grab an mbuf and
chain it onto a chunk of data really simplifies the protocol
code (simplicity == speed).  And the level of indirection and
fast, reference counted allocation of clusters can really be a
win on data transfers (a la kudp_fastsend in Sun OS).  The
biggest problem I saw, other than the m_pullup's, was that
clusters are too small:  They need to be at least big enough for
an ethernet packet (2K) and making them page sized (8K on a Sun)
doesn't hurt and would let you do some quick page swap tricks in
the user-system data copies (I didn't do any of these tricks in
the fast TCP.  I did use 2KB clusters to optimize things for the
ethernet drivers).]

An interesting result of the m_pullup removals was the death of
another piece of folklore.  I'd always heard that the limiting
CPU was the receiver.  Wrong.  After the pullup changes, the
sender would be maxed out at 100% CPU utilization while the
receiver loafed along at 65-70% utilization (utilizations
measured with a microprocessor analyzer; I don't trust the
system's stats).  In hindsight, this seems reasonable.  At the
receiver, a packet comes in, wanders up to the tcp layer, gets
stuck in the socket buffer and an ack is generated (i.e., the
processing terminates with tcp_input at the socket buffer).  At
the sender, the ack comes in, wanders up to the tcp layer, frees
some space, then the higher level socket process has to be woken
up to fill that space (i.e., the processing terminates with
sosend, at the user socket layer).  The way Unix works, this
forces a boundary crossing between the bottom half (interrupt
service) and top half (process context) of the kernel.  On a
Sun, and most of the other Unix boxes I know of, this is an
expensive crossing.  [Of course, the user process on the
receiver side has to eventually wake up and empty the socket
buffer but it gets to do that asynchronously and the dynamics
tend to arrange themselves so it processes several packets on
each wakeup, minimizing the boundary crossings.]

Talking about the bottom half of the kernel reminds me of
another major effect.  There seemed to be a "sound barrier" at
550 KB/s.  No matter what I did, the performance stuck at 550
KB/s.  Finally, I noticed that Sun's LANCE ethernet driver,
if_le.c, would only queue one packet to the LANCE at a time.
Picture what this means:  (1) driver hands packet to LANCE, (2)
LANCE puts packet on wire, (3) end of packet, LANCE interrupts
processor, (4) interrupt dispatched to driver, (5) go back to
(1).  The time involved in (4) is non-trivial, more than 150us,
and represents a lot of idle time for the LANCE and the wire.
So, I rewrote the driver to queue an arbitrary number of packets
to the LANCE, the sound barrier disappeared, and other changes
started making the throughput climb (it's not clear whether this
change had any effect on throughput or just allowed other
changes to have an effect).
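
The shape of the driver change, roughly (this is a sketch, not Sun's
if_le.c; the ring-handling helpers are invented):

    lestart(ifp)
        struct ifnet *ifp;
    {
        register struct mbuf *m;

        /* keep stuffing the LANCE transmit ring instead of handing it
         * one packet and then waiting for the transmit interrupt */
        while (tx_ring_slot_free()) {            /* invented helper */
            IF_DEQUEUE(&ifp->if_snd, m);         /* next queued packet */
            if (m == 0)
                break;
            copy_chain_to_tx_buffer(m);          /* invented helper */
            hand_slot_to_lance();                /* i.e. set the OWN bit */
            m_freem(m);
        }
    }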

[Shortly after making the if_le change, I learned why Sun might
have written the driver the silly way they did:  Before the
change, the 6 back-to-back IP fragments of an NFS write would
each be separated by the 150us interrupt service time.  After
the change, they were really back-to-back, separated by only the
9.6us minimum ethernet spacing (and, no, Sun does NOT violate
the ethernet spec in any way, shape or form.  After my last
message on this stuff, Apollo & DEC people kept telling me Sun
was out of spec.  I've been watching packets on our ethernet for
so long, I'm starting to learn the middle name of every bit.
Sun bits look like DEC bits and Sun, or rather, the LANCE in the
3/50 & 3/60, completely complies with the timings in the blue
book.)  Anyway, the brain-dead Intel 82586 ethernet chip Sun
puts in all its 3/180, 3/280 and Sun-4 file servers can't hack
back-to-back, minimum spacing packets.  Every now and again it
drops some of the frags and wanders off to never-never land
("iebark reset").  Diskless workstations don't work well when
they can't write to their file server and, until I hit on the
idea of inserting "DELAY" macros in kudp_fastsend, it looked
like I could have a fast TCP or a functional workstation but not
both.]
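
The workaround amounts to something like this in the fragment loop (a
sketch: the names and the delay value are made up; only the idea of
pacing the fragments is real):

    register struct mbuf *m, *next;

    for (m = frag_list; m != 0; m = next) {      /* frag_list: invented */
        next = m->m_act;                         /* next fragment in chain */
        send_fragment(ifp, m);                   /* invented wrapper around
                                                  * the driver output routine */
        DELAY(100);                              /* made-up value: pause long
                                                  * enough for the receiving
                                                  * 82586 to keep up */
    }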

Probably 30% of the performance improvements came from fixing
things in the Sun kernel.  I mean like, really, guys:  If the
current task doesn't change, and it doesn't 80% of the time
swtch is called, there's no need to do a full context save and
restore.  Adding the two lines

	cmpl    _masterprocp,a0
	jeq     6f              | restore of current proc is easy

just before the call to "resume" in sun3/vax.s:swtch got me a
quick 70 KB/s performance increase but felt more like a bug fix
than progress.  And a kernel hacker that does 16-bit "movw"s and
"addw"s on a 68020, or writes 2 instruction dbra loops designed
to put a 68010 in loop mode, should be shot.  The ALU takes the
same 3 clocks for a 2 byte add or a 4 byte add so things will
finish a lot quicker if you give it 4 bytes at a time.  And
every branch stalls the pipe, so unrolling that loop to cut down
on branches is a BIG win.
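
Here's the flavor of it in C (illustrative only -- the real work was
in the assembler checksum and copy loops, and a real internet checksum
also has to fold the carries and handle odd trailing bytes):

    register u_long *p = (u_long *)addr;
    register u_long sum = 0;
    register int n = len >> 4;           /* 16 bytes per trip */

    while (--n >= 0) {
        sum += p[0]; sum += p[1];        /* 4-byte adds, and the loop-end */
        sum += p[2]; sum += p[3];        /* branch is paid once per 16    */
        p += 4;                          /* bytes instead of once per add */
    }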

Anyway, I recoded the checksum routine, ocsum.s (now oc_cksum.s
because I found the old calling sequence wasn't the best way to
do things) and its caller, in_cksum.c, and got the checksumming
time down from 490us/KB to 130us/KB.  Unrolling the move loop in
copyin/copyout (the routines that move data user to kernel and
kernel to user), got them down from 200us/KB to 140us/KB.  (BTW,
if you combine the move with the checksum, which I did in the
kludged up, fast code that ran 1 MB/s on a 15MHz 3/50, it costs
200us/KB, not the 300us/KB you'd expect from adding the move
and sum times.  Pipelined processors are strange and wonderful
beasts.)

From these times, you can work out most of the budgets and
processing details:  I was using 1408 data byte packets (don't
ask why 1408).  It takes 193us to copy a packet's worth of data
and 184us to checksum the packet and its 40 byte header.  From
the logic analyzer, the LANCE uses 300us of bus and memory
bandwidth reading or writing the packet (I don't understand why,
it should only take half this).   So, the variable costs are
around 700us per packet.  When you add the 18 byte ethernet
header and 12 byte interpacket gap, to run at 10 Mb/s I'd have
to supply a new packet every 1200us.  I.e., there's a budget of
500us for all the fixed costs (protocol processing, interrupt
dispatch, device setup, etc.).  This is do-able (I've done it,
but not very cleanly) but what I'm getting today is a packet
every 1500us.  I.e., 800us per packet fixed costs.  When I look
with our analyzer, 30% of this is TCP, IP, ARP and ethernet
protocol processing (this was 45% before the "header prediction"
tcp mods), 15% is stalls (idle time that I don't currently
understand but should be able to eventually get rid of) and 55%
is device setup, interrupt service and task handling.  I.e.,
protocol processing is negligible (240us per packet on this
processor and this isn't a fast processor in today's terms).  To
make the network go faster, it seems we just need to fix the
operating system parts we've always needed to fix:  I/O service,
interrupts, task switching and scheduling.  Gee, what a surprise.
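
For reference, that per-packet budget tabulated (same numbers as
above):

    copyin:    1408 bytes @ 140 us/KB  =  ~193 us
    checksum:  1448 bytes @ 130 us/KB  =  ~184 us
    LANCE DMA:                            ~300 us
                                          -------
    variable cost per packet              ~700 us

    wire time: (1408 + 40 + 18) bytes * 0.8 us/byte + 9.6 us gap = ~1180 us,
    call it 1200 us, which leaves ~500 us for all the fixed costs.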

 - Van

hedrick@athos.rutgers.edu (Charles Hedrick) (07/21/88)

>....  And I think I heard Chuck Hedrick say [Sun] 4.0 has all the
>rlogin, URG and nameserver bugs that we fondly remember fixing
>in 3.x.  And he has to get source before the academic year
>starts or they won't be able to switch until a semester break.

It's not enough to get the source before Sept 1.  I need to get it
soon enough that I can fix all the problems and get through an
internal release cycle before Sept. 1.  That time has now passed.
We are committed to 3.2 on our student machines until January 89.  We
are backing out of 4.0 on some of our student test machines, and
praying that we get source soon enough that we don't have to back out
of the rest.  We have 3/60's that require software more recent than
our standard 3.2, but without source we can't bring up 4.0.  So we are
now having to put together a release based on 3.2 with pieces of 3.5
added, all because we can't get the source at the same time as the
binary.

nittmann@rusvx1.rus.uni-stuttgart.dbp.de ("Michael F.H. Nittmann ") (08/09/88)

Bravo!  This is what I would call REAL speedups.

As I understand the result, there is now a Sun 3/60 under SunOS 4
with a 750 KB/s transfer rate and 1.25 MB/s of forwarding capacity.

I do not in any way mean to discredit this work by offering some
remarks on the article, and I do not want anybody to do any tricky
quoting or to imply negative content.  A good artist always merits
some critics, not just applause from the crowd.

One remark on the 50% CPU load while forwarding: I suspect that is
because the ethernet is at its capacity limit, so the interface
imposes idle CPU cycles on the transfer activity.

One remark on loop unrolling: a Sun is a pipelined-processor machine,
BUT the compiler does not seem to provide memory prefetches or
optimized loop structures.  Only in this context does unrolling scalar
loops on a single-processor machine pay off.

One remark on "overhead" : if you have a slow scalar performing machine
with some simple compiler that does not know about advantageous
memory fetch instruction placing, the loop overhead seems to be
small with respect to the instruction completion times within the
loop. But this is not small loop overhead!

A remark on the "broken" Crays: it is absolutely true that on a fast
machine with sophisticated instruction buffering, with compilers that
take full advantage of memory "oddities", and with multiple functional
units working in parallel per processor, the overhead of returning to
the first instruction of a loop becomes prohibitive if the next
instruction lies outside the memory sequence, the instruction buffers
and the like.  But that overhead is only big relative to the time
spent within the loop, which there is already smaller than the sum of
the executed instruction times -- something that does not seem to be
true on the Sun.  (I intentionally leave vectorization aside, since it
does not exist on the Sun.)

So your school is perfectly right -- for your chosen combination of
software (compiler, virtual memory, processor architecture, ...) and
hardware.
And that's it!

The fact that 1 KB segments seem to be the optimum copy size looks to
me more like a Sun-specific resonance effect between memory transfers,
virtual memory management and interface readiness ON THE LOCAL
MACHINE.

As a physicist I learned to look very critically at experiments and
their setups, and I would like to point out that your performance test
between two (I assume equally optimized) Suns certainly shows your
great competence in the field and the very real improvements your code
changes bring to Sun OS 4.  But the experiment strictly holds only for
your configuration of two machines operating at their -- equal!!! --
optimum configuration of packet-munching parameters.  This is a sort
of Formula 1 race.

In a network you have to deal with what you call "broken" partners.
They, too, work at, or at least close to, their respective optimum
parameter configurations -- or would like to.  Isn't it a sort of
racism to declare them broken?

And in a network you have to deal with the network medium itself.
And no doubt: the bigger the chunks are, the better the network
throughput.  Yes, I know that this is true for the monochromatic case
of all-equal packet lengths, but a packet-size distribution primarily
causes disadvantages for the individual transfer (a small packet
waiting for a big one to pass).
Now, if you operate a machine whose interface can deliver 100MB but
the fastest channel you have will only gulp 10MB or 50MB (Ether, NSC
Hyperchannel), then from the point of view of writing networking code
on such a fast machine you have an interest in flooding the net with a
big packet and then continuing with the honest work of letting users
use the CPU, until the 4MB workstation may eventually accept another
packet.  And even under these circumstances the networking software
has to be as fast as possible, to waste as few resources and CPU
cycles as possible, since your CPU does not earn money by servicing
100MB interfaces with a some-GB-bandwidth CPU, nor by pushing around
small memory segments, but by giving CPU cycles to users'
number-crunching codes.
(This is a hit at the people who think "oh, the link is only xxx KB/s,
why bother writing code that could serve yy MB/s".)
So in my opinion there are more viewpoints to consider than the
record-breaking perspective in a clinical and artificial environment.

disclaimers as usual (sometimes I have a bad conscience because
somebody pays the network costs for my private opinions)

Michael.