[comp.sys.sun] ethernet hardware "etiquette"

bernard@pleiades.oz.au (Bernard Wilson) (02/16/90)

We have both Suns and a microVAX which are not as yet on the same Ethernet
network. I have cross-posted the article below in the hope of gaining
useful information.

					Bernard Wilson

In article <17590@megaron.cs.arizona.edu>, leonard@cs.arizona.edu (Aaron Leonard) writes:
---------------------------- START OF TEXT ------------------------------------
I have no wish to start a holy war here.

BUT it is my simple observation that Suns are poor neighbors to have on an
Ethernet.  (Another way of putting this might be: VAXclusters are too
delicate to coexist happily with lusty, rambunctious Suns.)

I have never seen proof of the oft-repeated rumor that Suns "cheat" on the
9.6 us minimum Ethernet interpacket gap.  I do know, however, that if you are
running a Local Area VAXcluster that shares a large, variegated Ethernet
with Suns, you should be wary of the following aspects of reality:

1. A few Suns running NFS can swamp the Ethernet very quickly.  I don't
want to get into the technical details here, but due to some rather
fast-and-loose design decisions in NFS, you will find that colliding NFS
traffic will produce a degenerate situation in a hurry.

2. Sloppy TCP/IP code residing on Suns can cause frequent network
meltdowns.  Specifically, Suns are notorious for behaving poorly when
confronted by IP broadcast addresses different from what they expect.
More precisely, just a handful of Suns which use the 0's broadcast and
which run RIP (which alas seems to be the default) will generate feedback
loops quite capable of destroying the network when confronted by address
queries for the 1's broadcast.  (Note: Suns are not the only machines
which exhibit this problem, merely the most virulent.)
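
For what it's worth, the obvious first aid is to force every host to agree
on the broadcast address explicitly at boot time rather than trusting
defaults.  On a Sun that means something like the following in
/etc/rc.local (the addresses here are invented for illustration):

	# force the 1's broadcast convention on the LANCE interface
	ifconfig le0 128.99.1.2 netmask 255.255.255.0 broadcast 128.99.1.255

A host still wired for the 0's convention would have come up saying
"broadcast 128.99.1.0" instead, and that mismatch is exactly what seeds
these feedback loops.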

On my campus, we have experienced frequent network meltdowns, typically
associated with Suns and their allies caught in a feedback loop of some
sort.  Such events were frequently highlighted by members of Local Area
VAXclusters undergoing "voluntary" CLUEXIT bugchecks when they found that
they couldn't maintain SCS contact with each other.

We have lessened the frequency and severity of these tragic occasions via
the following techniques:

1. Put Suns behind IP routers whenever possible.  In other words, get them
OFF the same Ethernet as your VAXen (and as each other.)

2. Increase the RECNXINTERVAL SYSGEN param on your NI VAXcluster nodes to
something like 120 seconds.  This makes them better able to weather
broadcast storms without committing hara-kiri.
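
For reference, the SYSGEN incantation for technique 2 looks something like
this (RECNXINTERVAL is a dynamic parameter, so it can be changed on the
running system; pick a value to suit your site):

	$ RUN SYS$SYSTEM:SYSGEN
	SYSGEN> USE ACTIVE
	SYSGEN> SET RECNXINTERVAL 120
	SYSGEN> WRITE ACTIVE		! takes effect immediately
	SYSGEN> WRITE CURRENT		! and survives the next reboot
	SYSGEN> EXIT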

I do not intend to raise the issue of the morality of the Suns' behavior,
but only to stress the need for proper prophylaxis when you "share
needles" with them, as it were.

Aaron
------------------------------ END OF TEXT ------------------------------------

steve@umiacs.umd.edu (02/23/90)

Let me repeat once again:  to the absolute best of my knowledge, Suns do
*not* violate the 9.6us interpacket-gap restriction.  If I remember
correctly, during Van Jacobson's experiments (y'all know, the ones where
he got 8-9 megabits per second (!)  over TCP between two Sun-3/60s), he
actually put a scope on the wire and watched to be sure that the 9.6us
interpacket gap was respected.  It was.  Given that SunOS can't ordinarily
deal with pushing bits that fast, I'd say that in any less stressful
situation it's quite unlikely that the interpacket gap restrictions will
be violated.

Another way to come to this conclusion (if you don't have a scope handy)
is to read the sources for the LANCE driver, and to know something about
the LANCE chip.  If the LANCE won't let you crank down the interpacket
spacing, then the Sun driver can shout at it until it's blue in the face,
and the packets will still come out properly spaced.  Even if the LANCE
will, and the Sun driver doesn't set it up to do so, there's no problem.
One should substitute the Intel 82586 for the LANCE and repeat the check
for the ie boards, just to be sure, I suppose.  (We'll all assume that the
ancient ec boards can't go
fast enough to be dangerous, unless a whole lot of them gang up on
someone.)

It seems that having lots of packets flying with minimum interpacket gaps
can be a Bad Thing for some machines.  See Van's message (which I've
fished out of the Sun-Nets archive and included below) for more
information.  It seems that some number of Sun ie0 boards went off the
deep end during his tests.  I don't know whether or not the problem
persists.  I have heard things about the Ethernet hardware of other
vendors, but I won't say them, so as not to feed the "Vendor X doesn't
follow the spec" rumor mill.

About the broadcast problems: if you don't set the broadcast address to be
the same thing on *all* your machines (DEC, Sun, Apollo, Joe's Garage),
you *will* have problems.  I'd like to see the kernel on lots of machines
fixed so that it won't forward things flagged as link-level broadcasts.
No doubt something (probably boot-related, too) depends on forwarding
broadcasts, in spite of it being silly to do so, so this may never fly...
(Once we get our SunOS 4.1 sources -- or if I'm feeling energetic
sometime, and feel like hacking up a 4.0.3 kernel -- I might sit down and
test this out.  I'm pretty sure I know how to make the right hacks.)
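
The change I have in mind is tiny; sketched against a 4.3bsd-shaped
ip_forward(), it's just a guard at the top.  (M_LINKBCAST, and on a SunOS 4
mbuf the m_flags field itself, are the parts that need inventing: the
driver would have to be taught to set the flag on packets that arrived as
hardware broadcasts.)

	/*
	 * Sketch only: refuse to forward any IP packet that arrived
	 * as a link-level broadcast.  M_LINKBCAST is a hypothetical
	 * mbuf flag set by the receiving driver.
	 */
	if (m->m_flags & M_LINKBCAST) {
		ipstat.ips_cantforward++;
		m_freem(m);
		return;
	}
	/* otherwise fall into the normal forwarding path */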

Also, 'a few Suns running NFS' won't always swamp an Ethernet.  A few
(ten-ish, depending on what you use Suns for) diskless Suns running
diskless NFS (or, probably worse, ND) can jack the load up fairly high.
For example, a network I have with seven diskless Sun-3s can heat the
cable up to 40% utilization or so, if I push it just so.  ('Just so'
involved some atypical behavior:  running 'find / -name XXXXX' on each
workstation.  This sort of pounding doesn't usually happen.)  On the other
hand, in the CS Department here, there are about a hundred workstations
sharing the same cable.  Those workstations are all diskful, and with a
few exceptions, the only NFS they do is to get /usr off the file servers.
The CSD cable is busy, but it's less loaded than the diskless-machine
cable I have...

As usual, it's all a matter of configuration and careful control of one's
setup.  If one has everything cleaned up (no broadcast wars or other
obvious bogus behavior, and if necessary you've hit the cable with a TDR
just to be really sure), and problems persist, sure, separate your
network.  It may well have been too busy, anyway.

Spoken: Steve Miller   Domain: steve@umiacs.umd.edu    UUCP: uunet!mimsy!steve
Phone: +1-301-454-1808 USPS: UMIACS, Univ. of Maryland, College Park, MD 20742

*** Included Message ***

X-From: steve@umiacs.UMD.EDU (Steven D. Miller)
X-To: sun-nets@brillig.umd.edu
X-Subject: Interesting new TCP/IP work

This message came across the TCP-IP mailing list a few weeks ago.  I think
it's fascinating, and I think many of you might also be interested.  (My
apologies to those on the tcp-ip list, who will be seeing this for the
second time.)  Van Jacobson and Mike Karels have been doing things to TCP
that people didn't think possible...

On a more general note, how do people feel about my occasionally
forwarding interesting messages from tcp-ip onto this list?  I'd generally
only forward things that I think are of great and general interest, so the
duplicate traffic would hopefully not be too obnoxious.  Anyway, let me
know.

Spoken: Steve Miller   Domain: steve@mimsy.umd.edu    UUCP: uunet!mimsy!steve
Phone: +1-301-454-1808 USPS: UMIACS, Univ. of Maryland, College Park, MD 20742

X-To: tcp-ip@sri-nic.arpa
X-Subject: some interim notes on the bsd network speedups
X-From: Van Jacobson <van@helios.ee.lbl.gov>

I told the potential beta-testers for our new 4bsd network code that I hoped
to have a version of the code out by the end of July.  (BTW, we've got all
the beta testers we can handle -- please don't apply.)  It looks like
that's going to turn into the end of August, in part because of SIGCOMM
and in part because Sun puts sending source to academic customers on the
bottom of its priority list.  I thought I'd flame about the latter and
give a bit of a status report on the new code.

I've been trying to put together a beta distribution for the "header
prediction" bsd network code.  This effort has been stymied because it's
impossible to get current source from Sun.  The code involves major
changes to the 4.3bsd kernel.  The only local machines I can use to
develop new kernels are Suns -- everything else is either multi-user or
has pathetic ethernet hardware.  But the only Sun kernel source I've got
is the doubly obsolete Sun OS 3.5/4.2 BSD.  It would be a massive waste of
time to upgrade this system to 4.3 BSD just so I can develop the next BSD
-- Bill Nowicki did the upgrade a year ago and binaries of the new system
(totally worthless to me) are shipping as Sun OS 4.0.  [I'm not the only
one suffering this problem -- I understand Craig Partridge's multicast
work is suffering because he can't get 4.3-compatible Sun source.  I think
he gave up & decided to do all his development on 4.3bsd Vaxen.  And I
think I heard Chuck Hedrick say 4.0 has all the rlogin, URG and nameserver
bugs that we fondly remember fixing in 3.x.  And he has to get source
before the academic year starts or they won't be able to switch until a
semester break.  And Mike Karels is saying "I told you so" and suggesting
I buy some CCIs.  Pity that Sun can't figure out that it's in their best
interest to do TIMELY source distribution to the academic and research
community -- their software development base gets expanded a few
hundred-fold for the cost of making tapes.]

Anyway, now that I've vented my spleen, there are some interim results to
talk about.  While waiting for either useful source or a better hardware
platform to show up, I've been cleaning up my original mods and backing
out changes one and two at a time to gauge their individual effect.
Because network performance seems to rest on getting a lot of things
happening in parallel, this leave-one-out testing doesn't give simple
good/bad answers (I had one test case that went like

	Basic system:	600 KB/s
	add feature A:	520 KB/s
	drop A, add B:	530 KB/s
	add both A & B:	700 KB/s

Obviously, any statement of the form "feature A/B is good/bad" is bogus.)
But, in spite of the ambiguity, some of the network design folklore I've
heard seems to be clearly wrong.

In the process of cleaning things up, they slowed down.  Task-to-task
data throughput using TCP between two Sun 3/60s dropped from 1 MB/s (about
8.6 Mb/s on the wire) to 890 KB/s (about 7.6 Mb/s on the wire).  I know
where the 11% was lost (an interaction between "sosend" and the fact that
an AMD LANCE chip requires at least 100 bytes in the first chunk of data
if you ask it to chain -- massive braindamage on AMD's part) and how to
get it back (re-do the way user data gets handed to the transport
protocol) but need to talk with Mike Karels about the "right" way to do
this.
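
(The mechanical workaround, until the hand-off gets re-done, is for the
driver to guarantee the first buffer is big enough before it chains; a
sketch, with the constant named by me:

	#define	LANCE_MIN_FIRST	100	/* LANCE wants >= 100 bytes up front */

	if (m->m_next != 0 && m->m_len < LANCE_MIN_FIRST) {
		m = m_pullup(m, LANCE_MIN_FIRST);
		if (m == 0)
			return;		/* no mbufs -- drop the packet */
	}

Of course m_pullup copies data, which is precisely the cost I'm trying to
get rid of; hence the plan to fix the hand-off instead.)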

Still, 890 KB/s represents a non-trivial improvement over the stock
Sun/4bsd system:  Under identical test conditions (same socket & user
buffer sizes, same MSS and MTU, same machines), the best tcp throughput I
could get with an out-the-box Sun OS 3.5 was 380 KB/s.  I wish I could say
"make these two simple changes and your throughput will double" but I
can't.  There were lots and lots of fiddly little changes; none made a
huge difference and they all seemed to be necessary.

The biggest single effect was a change to sosend (the routine between the
user "write" syscall and tcp_output).  Its loop looked something like:

    while there is user data & space in the socket buffer
	copy from user space to socket
    call the protocol "send" routine

After hooking a scope to our ethernet cable & looking at the packet
spacings, I changed this to

    while there is user data & space in the socket buffer
	copy up to 1K (one cluster's worth) from user space to socket
	call the protocol "send" routine

and the throughput jumped from 380 to 456 KB/s (+20%).  There's one school
of thought that says the first loop was better because it minimized the
"boundary crossings", the fixed costs of routine calls and context
changes.  This same school is always lobbying for "bigger":  bigger
packets, bigger windows, bigger buffers, for essentially the same reason:
the bigger chunks are, the fewer boundary crossings you pay for.  The
correct school, mine :-), says there's always a fixed cost and a variable
cost (e.g., the cost of maintaining tcp state and tacking a tcp packet
header on the front of some data is independent of the amount of data; the
cost of filling in the checksum field in that header scales linearly with
the amount of data).  If the size is large enough to make the fixed cost
small compared to the variable cost, making things bigger LOWERS
throughput because you throw away opportunities for parallelism.  Looking
at the ethernet, I saw a burst of packets, a long dead time, another burst
of packets, ... .  It was clear that while we were copying 4 KB from the
user, the processor in the LANCE chip and tcp_input on the destination
machine were mostly sitting idle.
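
In 4.3bsd terms the change amounts to clamping each copy to a cluster and
calling the protocol inside the loop.  A sketch (error handling stripped;
m_uiomove() is a made-up name for "copy the next len bytes of user data
into a cluster mbuf"):

	while (uio->uio_resid > 0 && sbspace(&so->so_snd) > 0) {
		int len = MIN(uio->uio_resid, CLBYTES);	/* one cluster */
		struct mbuf *m = m_uiomove(uio, len);

		if (m == 0)
			break;
		/* start tcp_output now, while we copy the next chunk */
		(void) (*so->so_proto->pr_usrreq)(so, PRU_SEND, m,
		    (struct mbuf *)0, (struct mbuf *)0);
	}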

To get good network performance, where there are guaranteed to be many
processors that could be doing things in parallel, you want the "units of
work" (loop sizes, packet sizes, etc.) to be the SMALLEST values that
amortise the fixed cost.  In Berkeley Unix, the fixed costs of protocol
processing are pretty low and sizes of 1 - 2 KB on a 68020 are as large as
you'd want to get.  (This is easy to determine.  Just do a throughput vs.
size test and look for the knee in the graph.  Best performance is just to
the right of the knee.)  And, obviously, on a faster machine you'd
probably want to do things in even smaller units (if the overhead stays
the same -- Crays are fast but hardware strangeness drives the fixed costs
way up.  Just remember that if it takes large packets and large buffers to
get good performance on a fast machine, that's because it's broken, not
because it's fast.)
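
(The test needn't be fancy; any packet blaster that lets you vary the
write size will do.  With something like the old ttcp you'd run

	ttcp -r -s                      # receiver side
	ttcp -t -s -l 1024 otherhost    # sender; repeat with -l 512, 2048, ...

and plot the reported throughput against the -l value; the knee shows up
by inspection.)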

Another big effect (difficult to quantify because several changes had to
be made at once) was to remove lots of more-or-less hidden data copies
from the protocol processing.  It's a truism that you never copy data in
network code.  Just lay the data down in a buffer & pass around pointers
to appropriate places in that buffer.  But there are lots of places where
you need to get from a pointer into the buffer back to a pointer to the
buffer itself (e.g., you've got a pointer to the packet's IP header,
there's a header checksum error, and you want to free the buffer holding
the packet).  The routine "dtom" converts a data pointer back to a buffer
pointer but it only works for small things that fit in a single mbuf (the
basic storage allocation unit in the bsd network code).  Incoming packets
are never in an mbuf; they're in a "cluster" which the mbuf points to.
There's no way to go from a pointer into a cluster to a pointer to the
mbuf.  So, anywhere you might need to do a dtom (anywhere there's a
detectable error), there had to be a call to "m_pullup" to make sure the
dtom would work.  (M_pullup works by allocating a fresh mbuf, copying a
bunch of data from the cluster to this mbuf, then chaining the original
cluster behind the new mbuf.)
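
(Concretely, the pattern scattered all over the 4.3bsd input path looks
about like this, condensed and slightly rearranged from ipintr:

	/*
	 * Copy the header into the mbuf proper so that dtom() will
	 * work later -- this is the hidden copy being removed.
	 */
	if ((m = m_pullup(m, sizeof (struct ip))) == 0)
		return;				/* no mbufs: drop */
	ip = mtod(m, struct ip *);
	if (in_cksum(m, ip->ip_hl << 2) != 0) {
		m_freem(dtom(ip));	/* works only because of the pullup */
		return;
	}

The week of insomnia consists of hunting down every such pair and recoding
so a pointer to the mbuf is carried along instead.)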

So, we were doing a bunch of copying to expedite error handling.  But
errors usually don't happen (if you're worried about performance, first
you make sure there are very, very few errors), so we were doing a bunch
of copying for nothing.  But, if you're sufficiently insomniac, in about a
week you can track down all the pullup's associated with all the dtom's
and change things to avoid both.  This requires massive recoding of both
the TCP and IP re-assembly code.  But it was worth it:  TCP throughput
climbed from around 600 KB/s to 750 KB/s and IP forwarding just screamed:
A 3/60 forwarding packets at the 9 Mb/s effective ethernet bandwidth used
less than 50% of the CPU.

[BTW, in general I didn't find anything wrong with the BSD mbuf/cluster
model.  In fact, I tried some other models (e.g., do everything in packet
sized chunks) and they were slower.  There are many cases where knowing
that you can grab an mbuf and chain it onto a chunk of data really
simplifies the protocol code (simplicity == speed).  And the level of
indirection and fast, reference counted allocation of clusters can really
be a win on data transfers (a la kudp_fastsend in Sun OS).  The biggest
problem I saw, other than the m_pullup's, was that clusters are too small:
They need to be at least big enough for an ethernet packet (2K) and making
them page sized (8K on a Sun) doesn't hurt and would let you do some quick
page swap tricks in the user-system data copies (I didn't do any of these
tricks in the fast TCP.  I did use 2KB clusters to optimize things for the
ethernet drivers).]

An interesting result of the m_pullup removals was the death of another
piece of folklore.  I'd always heard that the limiting CPU was the
receiver.  Wrong.  After the pullup changes, the sender would be maxed out
at 100% CPU utilization while the receiver loafed along at 65-70%
utilization (utilizations measured with a microprocessor analyzer; I don't
trust the system's stats).  In hindsight, this seems reasonable.  At the
receiver, a packet comes in, wanders up to the tcp layer, gets stuck in
the socket buffer and an ack is generated (i.e., the processing terminates
with tcp_input at the socket buffer).  At the sender, the ack comes in,
wanders up to the tcp layer, frees some space, then the higher level
socket process has to be woken up to fill that space (i.e., the processing
terminates with sosend, at the user socket layer).  The way Unix works,
this forces a boundary crossing between the bottom half (interrupt
service) and top half (process context) of the kernel.  On a Sun, and most
of the other Unix boxes I know of, this is an expensive crossing.  [Of
course, the user process on the receiver side has to eventually wake up
and empty the socket buffer but it gets to do that asynchronously and the
dynamics tend to arrange themselves so it processes several packets on
each wakeup, minimizing the boundary crossings.]

Talking about the bottom half of the kernel reminds me of another major
effect.  There seemed to be a "sound barrier" at 550 KB/s.  No matter what
I did, the performance stuck at 550 KB/s.  Finally, I noticed that Sun's
LANCE ethernet driver, if_le.c, would only queue one packet to the LANCE
at a time.  Picture what this means:  (1) driver hands packet to LANCE,
(2) LANCE puts packet on wire, (3) end of packet, LANCE interrupts
processor, (4) interrupt dispatched to driver, (5) go back to (1).  The
time involved in (4) is non-trivial, more than 150us, and represents a lot
of idle time for the LANCE and the wire.  So, I rewrote the driver to
queue an arbitrary number of packets to the LANCE, the sound barrier
disappeared, and other changes started making the throughput climb (it's
not clear whether this change had any effect on throughput or just allowed
other changes to have an effect).
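
(In outline, the rewrite turns "hand the LANCE one packet per interrupt"
into a loop over the transmit descriptor ring.  The names below are
approximations, not the real driver's:

	/*
	 * Keep handing packets to the LANCE while transmit ring
	 * descriptors are free.  le_put() -- hypothetical -- loads
	 * one packet into a descriptor and sets its OWN bit.
	 */
	struct mbuf *m;

	while (le->le_txfree > 0) {
		IF_DEQUEUE(&le->le_if.if_snd, m);
		if (m == 0)
			break;
		le_put(le, m);
		le->le_txfree--;
	}

Each transmit interrupt then just reclaims completed descriptors and runs
the same loop again.)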

[Shortly after making the if_le change, I learned why Sun might have
written the driver the silly way they did:  Before the change, the 6
back-to-back IP fragments of an NFS write would each be separated by the
150us interrupt service time.  After the change, they were really
back-to-back, separated by only the 9.6us minimum ethernet spacing (and,
no, Sun does NOT violate the ethernet spec in any way, shape or form.
After my last message on this stuff, Apollo & DEC people kept telling me
Sun was out of spec.  I've been watching packets on our ethernet for so
long, I'm starting to learn the middle name of every bit.  Sun bits look
like DEC bits and Sun, or rather, the LANCE in the 3/50 & 3/60, completely
complies with the timings in the blue book.)  Anyway, the brain-dead Intel
82586 ethernet chip Sun puts in all its 3/180, 3/280 and Sun-4 file
servers can't hack back-to-back, minimum spacing packets.  Every now and
again it drops some of the frags and wanders off to never-never land
("iebark reset").  Diskless workstations don't work well when they can't
write to their file server and, until I hit on the idea of inserting
"DELAY" macros in kudp_fastsend, it looked like I could have a fast TCP or
a functional workstation but not both.]

Probably 30% of the performance improvements came from fixing things in
the Sun kernel.  I mean like, really, guys:  If the current task doesn't
change, and it doesn't 80% of the time swtch is called, there's no need to
do a full context save and restore.  Adding the two lines

	cmpl    _masterprocp,a0
	jeq     6f              | restore of current proc is easy

just before the call to "resume" in sun3/vax.s:swtch got me a quick 70
KB/s performance increase but felt more like a bug fix than progress.  And
a kernel hacker that does 16-bit "movw"s and "addw"s on a 68020, or writes
2 instruction dbra loops designed to put a 68010 in loop mode, should be
shot.  The alu takes the same 3 clocks for a 2 byte add or a 4 byte add so
things will finish a lot quicker if you give it 4 bytes at a time.  And
every branch stalls the pipe, so unrolling that loop to cut down on
branches is a BIG win.
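
(Schematically, the unrolled inner loop looks like this -- movl doesn't
touch the X (extend) bit and neither does dbra, so the carries chain
through the addxl's; setup and the final fold to 16 bits are omitted:

	| d0 = running sum, a0 = data pointer, d2 = loop count
	1:	movl	a0@+,d1
		addxl	d1,d0
		movl	a0@+,d1
		addxl	d1,d0
		movl	a0@+,d1
		addxl	d1,d0
		movl	a0@+,d1
		addxl	d1,d0
		dbra	d2,1b		| 16 bytes summed per branch

This is a sketch of the idea, not the real oc_cksum.s.)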

Anyway, I recoded the checksum routine, ocsum.s (now oc_cksum.s because I
found the old calling sequence wasn't the best way to do things) and its
caller, in_cksum.c, and got the checksumming time down from 490us/KB to
130us/KB.  Unrolling the move loop in copyin/copyout (the routines that
move data user to kernel and kernel to user), got them down from 200us/KB
to 140us/KB.  (BTW, if you combine the move with the checksum, which I did
in the kludged up, fast code that ran 1 MB/s on a 15MHz 3/50, it costs
200us/KB, not the 300us/KB you'd expect from adding the move and sum
times.  Pipelined processors are strange and wonderful beasts.)

From these times, you can work out most of the budgets and processing
details:  I was using 1408 data byte packets (don't ask why 1408).  It
takes 193us to copy a packet's worth of data and 184us to checksum the
packet and its 40 byte header.  From the logic analyzer, the LANCE uses
300us of bus and memory bandwidth reading or writing the packet (I don't
understand why; it should only take half this).  So, the variable costs
are around 700us per packet.  When you add the 18 byte ethernet header and
12 byte interpacket gap, to run at 10 Mb/s I'd have to supply a new packet
every 1200us.  I.e., there's a budget of 500us for all the fixed costs
(protocol processing, interrupt dispatch, device setup, etc.).  This is
do-able (I've done it, but not very cleanly) but what I'm getting today is
a packet every 1500us.  I.e., 800us per packet fixed costs.  When I look
with our analyzer, 30% of this is TCP, IP, ARP and ethernet protocol
processing (this was 45% before the "header prediction" tcp mods), 15% is
stalls (idle time that I don't currently understand but should be able to
eventually get rid of) and 55% is device setup, interrupt service and task
handling.  I.e., protocol processing is negligible (240us per packet on
this processor and this isn't a fast processor in today's terms).  To make
the network go faster, it seems we just need to fix the operating system
parts we've always needed to fix:  I/O service, interrupts, task switching
and scheduling.  Gee, what a surprise.
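
(For anyone checking the wire arithmetic: at 10 Mb/s a byte takes 0.8us, so

	1408 data + 40 tcp/ip + 18 ether + 12 gap  =  1478 byte times
	1478 * 0.8us  =  ~1180us per packet

which rounds to the 1200us above once you allow for the preamble, and
1200us minus the 700us of variable costs leaves the 500us fixed-cost
budget.)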

 - Van