[comp.unix.internals] NFS vs communications medium

torek@elf.ee.lbl.gov (Chris Torek) (03/18/91)

In article <thurlow.669179279@convex.convex.com> thurlow@convex.com
(Robert Thurlow) writes:
>... The only difference is the performance bottleneck due to the network.
>If you crippled your I/O subsystem, you'd see similar things.  Until we
>get new networks that are two orders of magnitude faster, this may be
>the case.

(Rob T is at convex, so he may actually have disks with real bandwidth;
then the picture changes.)

The bandwidth of your standard, boring old Ethernet is 10 Mb/s or 1.2
MB/s.  The bandwidth of your standard, boring old SCSI disk without
synchronous mode is around 1.5 MB/s.  The latency on your Ethernet is
typically much *lower* than that on your standard boring SCSI
controller (which probably contains a 4 MHz 8085 running ill-planned
and poorly-written code, whereas your Ethernet chip has a higher clock
rate and shorter microcode paths).

In other words, they are fairly closely matched.  So why does NFS
performance fall so far below local SCSI performance?

There are many different answers to this question, but one of the most
important is one of the easiest to cure.

A typical NFS implementation uses UDP to ship packets from one machine
to another.  Its UDP interface typically squeezes about 500 KB/s out of
the Ethernet (i.e., around 42% of the available bandwidth).  Since UDP
is an `unreliable protocol' (in the sense that UDP is allowed to drop
and reorder packets), the NFS implementation has to duplicate most of
the TCP mechanism to make things reliable.
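
To make `duplicate most of the TCP mechanism' concrete, here is a
minimal sketch of the retransmission loop an RPC-over-UDP layer must
supply for itself (purely illustrative: the names and the fixed
five-try policy are mine, not any real NFS client's):

#include <sys/types.h>
#include <sys/socket.h>
#include <sys/time.h>

#define MAXTRIES 5

/*
 * Send a request and wait for a reply, retransmitting with
 * exponential backoff.  TCP would do all of this (plus RTT
 * estimation) below the socket layer.
 */
ssize_t
rpc_call(int s, const void *req, size_t reqlen, void *rep, size_t replen)
{
	struct timeval tv = { 1, 0 };	/* initial 1-second timeout */
	ssize_t n;
	int try;

	for (try = 0; try < MAXTRIES; try++) {
		if (send(s, req, reqlen, 0) < 0)
			return (-1);
		(void)setsockopt(s, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof tv);
		n = recv(s, rep, replen, 0);
		if (n >= 0)
			return (n);	/* caller must still match xids */
		tv.tv_sec *= 2;		/* back off before the next try */
	}
	return (-1);
}

(A real client must also cope with duplicate replies and reordered
datagrams, which is where most of the rest of TCP gets reinvented.)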

A good TCP implementation, on the other hand, squeezes about 1.1 MB/s
out of the Ethernet even when talking to user code (talking to user code
is inherently at least slightly more expensive than talking to kernel
code, because you must double-check everything so that users cannot
crash the machine).  This is 92% of the available bandwidth.

Thus, one easy way to improve NFS performance (by a factor of less than
2, unfortunately: even though you may halve the time spent talking,
there is plenty of other unimproved time in there) is to replace the
poor TCP implementations with good ones, and then simply call the TCP
transport code.  (To talk to existing NFS implementations, you must
also preserve a UDP interface, so you might as well fix that too.)  The
reason this is easy is that much of the work has already been done for
you---it appears in the current BSD systems.  As a nice side bonus, TCP
NFS works over long-haul and low-speed networks (including 9600 baud
serial links).  A typical UDP NFS does not, because its retransmit
algorithms are wired for local Ethernet speeds.
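
(The `wired for local Ethernet speeds' point is about the retransmit
timer.  For reference, here is the adaptive estimator TCP uses, in the
fixed-point form from Van Jacobson's SIGCOMM '88 paper; the variable
names are mine.  It tracks a 9600 baud link as happily as an Ethernet.)

int srtt;	/* smoothed RTT, scaled by 8, in clock ticks */
int rttvar;	/* smoothed mean deviation, scaled by 4 */
int rto;	/* retransmit timeout, in ticks */

void
rtt_update(int measured)
{
	int delta = measured - (srtt >> 3);

	srtt += delta;			/* srtt = 7/8 srtt + 1/8 measured */
	if (delta < 0)
		delta = -delta;
	rttvar += delta - (rttvar >> 2); /* rttvar = 3/4 rttvar + 1/4 |delta| */
	rto = (srtt >> 3) + rttvar;	/* i.e. rto = srtt + 4 * mean dev */
}

A UDP NFS that instead retransmits on a fixed short timer floods a
slow link with duplicates it can never catch up with.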

Indeed, even if you do go from Ethernet to FDDI, you will find that your
NFS performance is largely unchanged unless you fix the UDP and TCP code.
(When you fix TCP, you will discover that you also need window scaling,
since the amount of data `in flight' over gigabit networks is much more
than an unscaled TCP window can describe.)
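
(To see why: a sender can have at most one window of unacknowledged
data in flight, so throughput is bounded by window/RTT.  With the
unscaled 16-bit window and an illustrative 30 ms cross-country RTT,

	65535 bytes / 0.030 s  ~=  2.2 MB/s

no matter how fast the wire is; keeping a 1 Gb/s path full over that
same RTT requires

	1 Gb/s * 0.030 s  =  30 Mbits  ~=  3.75 MB

in flight, nearly 60 unscaled windows.  The window scale option of
RFC 1072 exists to close exactly this gap.)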

Opening up this bottleneck reveals the next one to be NFS's write-through
cache policy, and now I will stop talking.  (You may infer anything you
like from this phrasing :-) .)
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

david@bacchus.esa.oz.au (David Burren [Athos]) (03/18/91)

In <11030@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:

>The bandwidth of your standard, boring old Ethernet is 10 Mb/s or 1.2
>MB/s.

Say what?  If you can get over 1 Mb/s out of an Ethernet I'd like to hear
about it.

As a simple test, on a barely-loaded Ethernet (5 Sony workstations, with two
people running vi) I ftp'ed a >400k file from one machine to another.
Local SCSI disk to local RAM disk.  No NFS involved.  The transfer rate I
got was 94 kbytes/s (strange, considering the 270 kb/s NFS throughput
shown below).
I know this is a poor test, but it indicates a ballpark figure MUCH less than
1 Mb/s.

On a busy Ethernet I'd expect IP performance to fall _far_ short of 1 Mb/s,
as collisions take their toll.


>The bandwidth of your standard, boring old SCSI disk without
>synchronous mode is around 1.5 MB/s.

Using the bonnie filesystem-benchmark on our local SCSI disks shows writes
ranging from 200 kb/s (for char-by-char) to >600 kb/s (block I/O) and reads
from 150 kb/s (character) to >600 kb/s (block).
This is with Wren-IV's and M9380S's using asynchronous SCSI.  Note that
bonnie measures *through-the-filesystem* performance.

I ran bonnie again, over NFS/Ethernet (onto a workstation with 8 Mb RAM).  By
this stage there were about 5 users on the net, running a mix of vi, nn,
xmahjongg, etc.
Block writes came in at about 50 kb/s (not surprising really) while reads
showed ~150 kb/s (character) and ~270 kb/s (block).  This was for a 40 Mb
file, and no I didn't do the test more than once.


>The latency on your Ethernet is
>typically much *lower* than that on your standard boring SCSI
>controller (which probably contains a 4 MHz 8085 running ill-planned
>and poorly-written code, whereas your Ethernet chip has a higher clock
>rate and shorter microcode paths).

Which distinguishes older SCSI/ST506 implementations from the newer
embedded-SCSI disks.  I wonder which is more prevalent in today's machines?
Also, see my comment above re Ethernet collisions.


>In other words, they are fairly closely matched.

I beg to differ.  Of course, the hardware here may be atypical.
That aside, I agree that NFS performance is probably less than optimal.


>There are many different answers to this question, but one of the most
>important is one of the easiest to cure.

>A good TCP implementation, on the other hand, squeezes about 1.1 MB/s
>out of the Ethernet even when talking to user code (talking to user code
>is inherently at least slightly more expensive than talking to kernel
>code, because you must double-check everything so that users cannot
>crash the machine).  This is 92% of the available bandwidth.

Could you please refer me to such a TCP implementation?
The figures I've quoted above were on Sony NEWS-1750 workstations, running
NEWS-OS 3.3a (basically 4.3BSD-Tahoe, I believe).
_____________________________________________________________________________
David Burren [Athos]                          Email: david@bacchus.esa.oz.au
Software Development Engineer                 Phone: +61 3 819 4554
Expert Solutions Australia, Hawthorn, VIC     Fax:   +61 3 819 5580

[Above opinions and comments are mine, not ESA's.]

goudreau@larrybud.rtp.dg.com (Bob Goudreau) (03/19/91)

In article <2028@bacchus.esa.oz.au>, david@bacchus.esa.oz.au (David Burren [Athos]) writes:
> In <11030@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
> 
> >The bandwidth of your standard, boring old Ethernet is 10 Mb/s or 1.2
> >MB/s.
> 
> Say what?  If you can get over 1 Mb/s out of an Ethernet I'd like to hear
> about it.

Note that Chris said "bandwidth", not "effective throughput".

Also, pay attention to "b" ("bit") vs. "B" (byte).
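
For the record, the arithmetic:

	10 Mb/s / 8 bits-per-byte  =  1.25 MB/s raw Ethernet bandwidth
	less framing and gaps     ~=  1.2 MB/s usable

David's measured 94 kbytes/s is 94 * 8 = 752 kb/s, i.e. about 0.75
Mb/s: under 1 Mb/s as he stated, but that is a statement about his
software, not about the wire.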

----------------------------------------------------------------------
Bob Goudreau				+1 919 248 6231
Data General Corporation		goudreau@dg-rtp.dg.com
62 Alexander Drive			...!mcnc!rti!xyzzy!goudreau
Research Triangle Park, NC  27709, USA

torek@elf.ee.lbl.gov (Chris Torek) (03/19/91)

>In <11030@dog.ee.lbl.gov> I wrote:
>>The bandwidth of your standard, boring old Ethernet is 10 Mb/s or 1.2
>>MB/s.

In article <2028@bacchus.esa.oz.au> david@bacchus.esa.oz.au
(David Burren [Athos]) writes:
>Say what?  If you can get over 1 Mb/s out of an Ethernet I'd like to hear
>about it.

You just did. :-)

Van Jacobson regularly gets around 1 MB/s (8 Mb/s) on Sun-3 (68020) boxes.
4.3BSD-reno (a much less carefully tuned system than Van's) running on a VAX
8250 with a DEUNA, talking to an Encore Multimax running UMax 4.3, receives
data inside FTP at 130 kb/s or just a bit over 1 Mb/s.

(I used `get /vmunix /dev/null' to get this number.  Note that this depends
on the rate at which the remote machine can generate data for you.)

>As a simple test, on a barely-loaded Ethernet (5 Sony workstations, with two
>people running vi) I ftp'ed a >400k file from one machine to another.
>Local SCSI disk to local RAM disk.  No NFS involved.  The transfer rate I
>got was 94 kbytes/s.

(You may have forgotten to use binary mode.)

>>The bandwidth of your standard, boring old SCSI disk without
>>synchronous mode is around 1.5 MB/s.

>Using the bonnie filesystem-benchmark on our local SCSI disks shows writes
>ranging from 200 kb/s (for char-by-char) to >600 kb/s (block I/O) and reads
>from 150 kb/s (character) to >600 kb/s (block).
>This is with Wren-IV's and M9380S's using asynchronous SCSI.  Note that
>bonnie measures *through-the-filesystem* performance.

Yes, these numbers are fairly typical (you lose half the bus performance
in the file system code, which is something else that needs tuning; see
Larry McVoy's paper from the last Usenix for one approach).

>>A good TCP implementation, on the other hand, squeezes about 1.1 MB/s
>>out of the Ethernet even when talking to user code ...

>Could you please refer me to such a TCP implementation?
>The figures I've quoted above were on Sony NEWS-1750 workstations, running
>NEWS-OS 3.3a (basically 4.3BSD-Tahoe, I believe).

4.3-tahoe lacks the `header prediction' code that appears in 4.3-reno.
4.3-reno lacks Van's latest changes (though said changes are likely to
be in 4.4BSD, if/when 4.4BSD exists).

Only those who work on NEWS-OS could say for certain which performance
fixes are in it.  Also, much depends on the bus design and the code for
the Ethernet driver.  It is important to avoid data copies; many
existing implementations copy a packet just so they can insert headers,
even though it is easy to arrange for space for those headers `in
advance'.  It is also important to avoid long code paths for typical
cases (e.g., the `header prediction' stuff that went into 4.3-reno,
and the route caching stuff; I think the latter has been around longer).
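
The `space in advance' trick is simple to sketch (illustrative, in the
spirit of the BSD mbuf code; not the actual 4.3BSD source):

struct buf {
	char	*data;		/* current start of valid data */
	int	 len;		/* bytes of valid data */
	char	 space[2048];	/* storage */
};

#define MAXHDR	64	/* worst-case link + IP + transport headers */

/* Start the payload MAXHDR bytes into the buffer... */
void
buf_init(struct buf *b)
{
	b->data = b->space + MAXHDR;
	b->len = 0;
}

/* ...so each layer prepends its header by moving a pointer,
   never by copying the payload. */
char *
buf_prepend(struct buf *b, int hdrlen)
{
	b->data -= hdrlen;	/* real code checks that hdrlen fits */
	b->len += hdrlen;
	return (b->data);
}
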
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

david@bacchus.esa.oz.au (David Burren [Athos]) (03/19/91)

In <2028@bacchus.esa.oz.au> I wrote:

>In <11030@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:

>>The bandwidth of your standard, boring old Ethernet is 10 Mb/s or 1.2
>>MB/s.

>Say what?  If you can get over 1 Mb/s out of an Ethernet I'd like to hear
>about it.

Bruce Barnett @ GE kindly sent me a copy of a posting to comp.protocols.tcp-ip
by Van Jacobson in October 1988.

In it he described tests using Sun-3s with two types of Ethernet
controller: a LANCE and an i82586.  The LANCE came out best, with
throughputs up to 1000 kbytes/sec, while the Intel part peaked at
720 kbytes/sec.

I stand corrected about what Ethernet can do :-)  Mind you, I suspect
that this optimised code is unfortunately still absent from many shipped
systems.  I do not know if the Sonys here incorporate the Van Jacobson TCP.


So, with Ethernet capable (depending on controller and software) of
sustaining throughputs similar to modern asynch SCSI-1 setups, we're back
to the distinct performance difference between local disks and NFS.
E.g., from my previous posting, filesystem performance (block reads):
	SCSI	600 kb/s
	NFS	270 kb/s

Not that I've added all that much to the discussion :-(  Back to the experts...

- David B.

milburn@me10.lbl.gov (John Milburn) (03/19/91)

In the referenced article torek@elf.ee.lbl.gov (Chris Torek) writes:

>Van Jacobson regularly gets around 1 MB/s (8 Mb/s) on Sun-3 (68020) boxes.
>4.3BSD-reno (a much less carefully tuned system than Van's) running on a VAX
>8250 with a DEUNA, talking to an Encore Multimax running UMax 4.3, receives
>data inside FTP at 130 kb/s or just a bit over 1 Mb/s.

>(I used `get /vmunix /dev/null' to get this number.  Note that this depends
>on the rate at which the remote machine can generate data for you.)

There are commercial implementations using Van's algorithms.  Using an
hp9000s400 (HP/UX 7.03) talking to a locally attached sun4 (SunOS
4.1), and using the same method, "get /vmunix /dev/null", I get a
binary transfer rate of 501 Kbyte/sec, or about 0.5 MByte/s.  The hp is
using header prediction, dynamic window sizing, and Phil Karn's clamped
retransmission algorithm.
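
For those who have not seen it, Karn's rule is easy to sketch (my
paraphrase of Karn & Partridge's SIGCOMM '87 paper, not the HP-UX
source):

extern void rtt_update(int);	/* smoothed RTT estimator, not shown */

struct conn {
	int	rto;		/* base retransmit timeout, in ticks */
	int	backoff;	/* exponential backoff multiplier */
};

/* Retransmit timer fired: next timeout will be rto * backoff. */
void
rexmt_timeout(struct conn *c)
{
	c->backoff *= 2;
}

/*
 * An ACK for a segment that was ever retransmitted is ambiguous (it
 * may answer the original send or the retransmit), so it yields no
 * RTT sample and the backed-off timeout stays clamped.
 */
void
ack_received(struct conn *c, int was_retransmitted, int rtt_sample)
{
	if (was_retransmitted)
		return;
	rtt_update(rtt_sample);	/* clean sample: update the estimator */
	c->backoff = 1;		/* and unclamp the backoff */
}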

If I go to another sun4 on the other side of a cisco router, a Dec LanBridge,
and two FDDI <-> ether bridges, the rate drops to 240 Kbyte/s.


-jem
--
John Milburn             milburn@me10.lbl.gov     (415) 486-6969
"Inconceivable!"
"You use that word a lot.  I don't think it means what you think it does."

lm@slovax.Eng.Sun.COM (Larry McVoy) (03/19/91)

In article <11074@dog.ee.lbl.gov> JEMilburn@lbl.gov (John Milburn) writes:
>In the referenced article torek@elf.ee.lbl.gov (Chris Torek) writes:
>
>>Van Jacobson regularly gets around 1 MB/s (8 Mb/s) on Sun-3 (68020) boxes.
>>4.3BSD-reno (a much less carefully tuned system than Van's) running on a VAX
>>8250 with a DEUNA, talking to an Encore Multimax running UMax 4.3, receives
>>data inside FTP at 130 kb/s or just a bit over 1 Mb/s.
>
>>(I used `get /vmunix /dev/null' to get this number.  Note that this depends
>>on the rate at which the remote machine can generate data for you.)
>
>There are commercial implementations using Van's algorithms.  Using an
>hp9000s400 (HP/UX 7.03) talking to a locally attached sun4 (SunOS
>4.1), and using the same method, "get /vmunix /dev/null", I get a
>binary transfer rate of 501 Kbyte/sec, or about 0.5 MByte/s.  The hp is
>using header prediction, dynamic window sizing, and Phil Karn's clamped
>retransmission algorithm.

The clustering changes give you a bit better performance (both
ends are Sun 4/60's on a local net; the end with /h/XXX has the
clustering changes.  The reason it doesn't get much faster the
second time is that snafu has only 8MB of memory, so much of the
file is reread from disk).  The interesting thing to note is
that the disk bandwidth (~1.2MB/sec) and the ethernet are
closely matched.  What happens when we consider FDDI and ISDN,
the fast and slow futures of networking?

220 snafu FTP server (SunOS 4.1.1) ready.
ftp> bin
200 Type set to I.
ftp> get /h/XXX /dev/null
200 PORT command successful.
150 Binary data connection for /h/XXX (129.144.50.10,1494) (8388608 bytes).
8388608 bytes received in 11 seconds (7.4e+02 Kbytes/s)
ftp> get /h/XXX /dev/null
8388608 bytes received in 11 seconds (7.6e+02 Kbytes/s)
ftp> quit
script done on Mon Mar 18 19:53:19 1991
---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

raj@hpindwa.cup.hp.com (Rick Jones) (03/22/91)

In addition to paying close attention to b's and B's, I have also
decided to take what ftp says as merely a best guess for the transfer
rate.  If you transfer a small enough file on a fast enough system, you
can see ftp report transfer rates of 3-4 MB/s (closely watched b's and
B's ;-) on an *ethernet*  ;-) ;-) ;-)
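
A worked example of how the numbers go wrong (the mechanism is my
guess: some mix of coarse timer granularity and data counted as sent
while it still sits in socket buffers):

	200 KB timed at 0.05 s  ->  ftp reports 4000 KB/s
	Ethernet ceiling         =  10 Mb/s / 8  ~=  1250 KB/s

A multi-megabyte transfer amortizes the error away, which is one more
reason to benchmark with big files.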

Just about any new WS worth its silicon should be able to go memory
to memory at full ethernet speeds using TCP...

rick jones

___   _  ___
|__) /_\  |    Richard Anders Jones   | HP-UX   Networking   Performance
| \_/   \_/    Hewlett-Packard  Co.   | "It's so fast, that _______" ;-)
------------------------------------------------------------------------
Being an employee of a Standards Company, all Standard Disclaimers Apply