[comp.unix.wizards] NFS performance: a question

brown@noao.arizona.edu (Mike Brown) (02/01/88)

Why is the transfer rate when a process writes to a remote NFS file 3-4 times
smaller than the transfer rate when reading a remote NFS file?

	- Is this asymmetry a characteristic of NFS?
	- Do Ultrix(2.0) and 4.3BSD/NFS from Mt. Xinu both have brain damaged
		NFS implementations?
	- Is my test brain damaged?


I am surprised at the difference in the transfer rate for writing a file
compared to reading a file.  I understand that creating/writing a file in
Unix is slower because of extra overhead involved during file allocation.
The difference I see is far greater than the difference in performance for
reads/writes on a local file system in Unix.  The disks I use have effective
transfer rates of about 280-300 Kilobytes/sec.

The test I ran was:

    writing:
	time /bin/cp /local/2_megabytes /remote/2_megabytes
	time /bin/cp /local/2_megabytes /remote/2_megabytes
	time /bin/cp /local/2_megabytes /remote/2_megabytes

    reading:
	time /bin/cp /remote/2_megabytes /local/2_megabytes
	time /bin/cp /remote/2_megabytes /local/2_megabytes
	time /bin/cp /remote/2_megabytes /local/2_megabytes

The test was run between microvaxes running 4.3BSD from Mt. Xinu
and it was also run between microvaxes running Ultrix 2.0.  The systems
were running multiuser with no other user processes active.

The transfer rates were:	reading (110-90 Kbytes/sec)
				writing (20-25 Kbytes/sec)


	Regards,
	Mike Brown		Biomedical Computer Lab.
				Washington University
				700 S. Euclid Ave.
				St. Louis, MO  63110
				(314) 362-2135


uucp:		uunet!wucs1!brown	or	{...}!noao!brown
internet:	brown@noao.arizona.edu


( Please excuse my posting this from arizona. )
( News out of Wash. Univ. is broken. )

hedrick@athos.rutgers.edu (Charles Hedrick) (02/01/88)

You note that reading is faster than writing via NFS.  We see this on
Suns also, though the difference we see is not quite as drastic as
yours.  Some tests I just did showed writing a big file to be about a
factor of 3 slower than reading.  I believe this is because reading
can do readahead, but it is hard to predict what the next byte a user
is going to write will be.  (Note that the machine doing the
reading has a Ciprico controller, so the controller is doing a lot of
readahead also.  Reading across the network was actually twice as fast
as reading locally, because the local machine had a normal Xylogics
controller.)  There are some other things that can affect NFS
performance, though.  Try nfsstat before and after your test.  See if
you are getting retransmissions.  We have run into systems that can
send data faster than they can receive them (Sun 4's).  In that case,
it turned out that not running biod actually improved throughput by a
factor of 3.  For more normal cases, biod does help.  You can
sometimes increase performance by running more copies, e.g. /etc/biod
8 instead of /etc/biod 4.

sxm@philabs.Philips.Com (Sandeep Mehta) (02/01/88)

In article <663@noao.UUCP> brown@noao.arizona.edu (Mike Brown) writes:
>
>Why is the transfer rate when a process writes to a remote NFS file 3-4 times
>smaller than the transfer rate when reading a remote NFS file?
>
> ......

File reading is probably faster because of the read ahead, but when 
you write back blocks to a remote NFS, the server probably waits for
acknowledgement of a successful write, i.e. the server waits until
it receives the ack before it trashes the cached block.

I would like to know if there is a detailed answer to this.

sandeep

-- 
Sandeep Mehta					TEL : (914)-945-6478
Robotics & Flexible Automation			UUCP: uunet!philabs!sxm	
Philips Laboratories				ARPA: sxm@philabs.philips.com

sas@pyrps5 (Scott Schoenthal) (02/02/88)

In article <663@noao.UUCP> brown@noao.arizona.edu (Mike Brown) writes:
>
>Why is the transfer rate when a process writes to a remote NFS file 3-4 times
>smaller than the transfer rate when reading a remote NFS file?

Due to the stateless nature of NFS, all write requests received by a
server must be written synchronously to the disk before they can be
acknowledged to the client.  If this were not true, the following could happen:

Client A sends a write request (block 'n') to Server B.  Server B
acknowledges the write but does not write the block.  Server B crashes
and then comes back up.  Client A then sends block 'n+1', thinking
that 'n' has been written out; block 'n' has been silently lost.

In a UNIX NFS server implementation, the inode and block maps must
also be updated during each synchronous block write.
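
To make the ordering concrete, here is a rough sketch in C of the rule the
server has to follow.  The reply helpers are hypothetical names of my own,
and a real server works on buffer-cache blocks inside the kernel rather
than on a file descriptor; the only point is that nothing gets acknowledged
until the data and metadata are on stable storage.

    #include <sys/types.h>
    #include <unistd.h>
    #include <errno.h>
    #include <stdio.h>

    /* Stand-ins for sending an NFS RPC reply; purely illustrative. */
    static int
    send_error_reply(int err)
    {
            fprintf(stderr, "write request failed: errno %d\n", err);
            return -1;
    }

    static int
    send_ok_reply(void)
    {
            return 0;
    }

    /*
     * Sketch only: handle one NFS write request.  A stateless server that
     * replied before the data was safe could lose the block in a crash
     * without the client ever finding out.
     */
    int
    nfs_write_request(int fd, off_t offset, char *buf, size_t len)
    {
            if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
                    return send_error_reply(errno);
            if (write(fd, buf, len) != (ssize_t)len)
                    return send_error_reply(errno);
            if (fsync(fd) == -1)            /* force data and inode to disk */
                    return send_error_reply(errno);
            return send_ok_reply();         /* only now may the client forget the block */
    }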

sas
----
Scott Schoenthal   sas@pyrps5.pyramid.com	Pyramid Technology Corp.

sxn%ingersoll@Sun.COM (Stephen X. Nahm) (02/02/88)

In article <663@noao.UUCP> brown@noao.arizona.edu (Mike Brown) writes:
>Why is the transfer rate when a process writes to a remote NFS file 3-4 times
>smaller than the transfer rate when reading a remote NFS file?
>
>	- Is this asymmetry a characteristic of NFS?

Yes.  Here's a quote from ``The Sun Network Filesystem: Design,
Implementation and Experience'' by Russel Sandberg:

``Much of the development time of NFS has been spent in improving
performance.  Our goal was to make NFS comparable in speed to a small
local disk.  The speed we were interested in is not raw throughput,
but how long it takes to do normal work.''

There is then a discussion of performance tweaks that were made,
concluding with:

``With these improvements, a diskless Sun-3 (68020 at 16.67 MHz.)
using a Sun-3 server with a Fujitsu Eagle disk, runs the benchmarks
faster than the same Sun-3 with a local Fujitsu 2243AS 84 Megabyte
disk on a SCSI interface.''

Then:

``The 'write' [NFS operation] is slow because it is synchronous on
the server.  Fortunately, the number of write calls in normal use is
very small (about 5% of all calls to the server) so it is not
noticeable unless the client writes a large remote file.''

Finally:

``Since many people base performance estimates on raw transfer speed
we also measured those.  The current numbers on raw transfer speed
are: 250 kilobytes/second for read and 60 kilobytes/second for write
on a Sun-3 with a Sun-3 server.''

The factors you should be aware of are: NFS implementation, CPU
bandwidth (both server and client), and type of disk involved (both
server and client).

>The transfer rates were:	reading (110-90 Kbytes/sec)
>				writing (20-25 Kbytes/sec)

Your ratios match well with what Rusty reported in his paper (his was
about 4.1 read-to-write, and yours is about 4.4).  Perhaps the
CPU bandwidth of your client and server is the limiting factor.

>	- Do Ultrix(2.0) and 4.3BSD/NFS from Mt. Xinu both have brain damaged
>		NFS implementations?

I believe both are based on Sun's reference port.  We port SunOS NFS
to vanilla 4.2 and 4.3BSD.  The licensees use that as a reference in
building their own product.

>	- Is my test brain damaged?

Not if this represents the typical kind of work you do.  However, a
fairer benchmark would be real work, like doing compiles, nroffs,
whatever.

Steve Nahm                              sxn@sun.COM or sun!sxn
Portable ONC/NFS

ps@diab.UUCP (Per Erik Sundberg) (02/02/88)

In article <663@noao.UUCP> brown@noao.arizona.edu (Mike Brown) writes:
>
>Why is the transfer rate when a process writes to a remote NFS file 3-4 times
>smaller than the transfer rate when reading a remote NFS file?
>
>	- Is this asymmetry a characteristic of NFS?

Yes, it is. Due to the statelessness of the NFS protocol, writes are SYNCHRONOUS.
This is to make sure that the data written is on stable storage when the
client call returns. Local file writes are normally asynchronous, which
makes it possible to optimize the writes to disk using delayed writes.
BTW, early Lachman System V NFS implementations have a bug that makes
them use async writes on the server side. It does improve performance a bit,
though :-)
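
To see how much the synchronous requirement by itself costs, here is a rough
local experiment in C.  Everything in it is my own choice (scratch file name,
8K blocks, 2 Megabytes total), it ignores RPC and network costs entirely, and
the delayed-write number mostly measures the buffer cache; it only shows the
disk-side penalty of forcing every block out before going on, which is what
an NFS server has to do.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/time.h>

    #define BLOCKSIZE 8192                  /* typical NFS transfer size */
    #define NBLOCKS   256                   /* 2 Megabytes, like the cp test */

    static double
    seconds(void)
    {
            struct timeval tv;

            gettimeofday(&tv, NULL);
            return tv.tv_sec + tv.tv_usec / 1e6;
    }

    /* Write NBLOCKS blocks; if do_sync is set, force each one to disk. */
    static double
    write_test(char *path, int do_sync)
    {
            char block[BLOCKSIZE];
            double t0;
            int fd, i;

            memset(block, 'x', sizeof block);
            fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0666);
            if (fd < 0) {
                    perror(path);
                    exit(1);
            }
            t0 = seconds();
            for (i = 0; i < NBLOCKS; i++) {
                    if (write(fd, block, sizeof block) != (ssize_t)sizeof block) {
                            perror("write");
                            exit(1);
                    }
                    if (do_sync && fsync(fd) < 0) { /* what an NFS server must do */
                            perror("fsync");
                            exit(1);
                    }
            }
            close(fd);
            return seconds() - t0;
    }

    int
    main(void)
    {
            printf("delayed writes:     %.0f Kbytes/sec\n",
                NBLOCKS * (BLOCKSIZE / 1024.0) / write_test("/tmp/synctest", 0));
            printf("synchronous writes: %.0f Kbytes/sec\n",
                NBLOCKS * (BLOCKSIZE / 1024.0) / write_test("/tmp/synctest", 1));
            unlink("/tmp/synctest");
            return 0;
    }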

-- 
Per-Erik Sundberg,  Diab Data AB
SNAIL: Box 2029, S-183 02 Taby, Sweden
ANALOG: +46 8-7680660
UUCP: mcvax!enea!diab!ps

grandi@noao.arizona.edu (Steve Grandi) (02/02/88)

In article <14163@pyramid.pyramid.com> sas@pyrps5.pyramid.com (Scott Schoenthal) writes:
>Due to the stateless nature of NFS, all write requests received by a
>server must be written synchronously to the disk before they can be
>acknowledged to the client.  If this was not true, the following could happen:
>
>In a UNIX NFS server implementation, the inode and block maps must
>also be updated during each synchronous block write.

At the Sun Users Group meeting I heard about one of the "hacks" added to Sun OS
and NFS to make NFS performance "good enough" to replace the unloved ND
protocol for booting, paging and swapping.  Consider what happens when
an NFS server makes a write on behalf of a client.  Even the simplest case of
a "replacement"  (a write that does not cause the file to be extended) requires
TWO synchronous writes: one to write the data block and one to update time
fields in the inode.  Thus one of the hacks added to Sun OS 4.0 is to
eliminate updating the inode atime and mtime fields for a "non-extending"
write.  Since client swap space will be contained in pre-allocated files on
the server, you have just eliminated half of this source of overhead for
client paging and swapping in NFS.
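
As a sketch of the bookkeeping only (this is not SunOS code), the cost per
client write looks something like this, where skip_time_update models the
Sun OS 4.0 hack:

    /*
     * Sketch only: how many synchronous disk writes one client write
     * costs the server.
     */
    int
    sync_writes_per_block(int extends_file, int skip_time_update)
    {
            int n = 1;                      /* the data block itself */

            if (extends_file)
                    n++;                    /* inode: new size and block pointers */
            else if (!skip_time_update)
                    n++;                    /* inode: atime/mtime only */
            return n;
    }

With the hack, the paging/swapping case (a non-extending write to a
pre-allocated file) drops from two synchronous writes to one.
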
-- 
Steve Grandi, National Optical Astronomy Observatories, Tucson AZ, 602-325-9228
UUCP: {arizona,decvax,hao,ihnp4}!noao!grandi  or  uunet!noao.arizona.edu!grandi 
Internet: grandi@noao.arizona.edu    SPAN/HEPNET: 5356::GRANDI or DRACO::GRANDI

dougs@sequent.UUCP (Doug Schwartz) (02/03/88)

In article <663@noao.UUCP>, brown@noao.arizona.edu (Mike Brown) writes:
> 
> Why is the transfer rate when a process writes to a remote NFS file 3-4 times
> smaller than the transfer rate when reading a remote NFS file?

I may not be able to explain why you are getting what you are getting,
but this seems to be in line with Sun's published statistics.  From the
Summer 1985 USENIX proceedings, page 129, "The current numbers on raw
transfer speed are: 120 kilobytes/second for read (cp bigfile /dev/null)
and 40 kilobytes/second for write."
-- 
Doug Schwartz
Sequent Computer
Beaverton, Oregon
tektronix!ogcvax!sequent!dougs

berliner@convex.UUCP (02/04/88)

At CONVEX, we have a special variable which can be patched on a running
system (or built in during system configuration) which enables the
CONVEX NFS server to operate asynchronously.  That is, to NOT do a
synchronous write, but rather simply queue the write request in the
buffer cache to be written to permanent disk storage at some later
time (marked delayed-write).

I decided to try your "cp 2 MegaBytes from client to server" timings
to see where we stand, and to see how the asynchronous server affects
the timings.  This is between two C-1's.  The results are as follows:

		| Sync.	| Async	|
	+-------+-------+-------+
	| Write	|  60kb	| 186kb	|
	| Read	| 204kb | 204kb	|
	+-------+-------+-------+
	(rates in Kbytes/sec)

Note:  these numbers are completely unofficial and informal.  I'm just
reporting my quick timings.

I know that the "synchronous" nature of NFS relies on the fact that
once the server has responded to a write request, the client can rest
assured that the data is safely written to permanent storage.  That is
why our default behaviour is the "synchronous" one.  We document the
tradeoffs involved with turning on the asynch server option and leave
it to our customers to decide how they wish to use NFS.

I can tell you that internally, we run things with asynch NFS turned
on, due to the much improved performance we see on our writes.  I can
also tell you that you can only get burned if your server crashes --
we have not found running the async NFS server in-house to be a problem.

I might also note that the next revision of the NFS spec (version 3)
seems to have hooks to do just what we've done, but as part of the spec
itself.  Sun may add an NFS request (the spec hasn't been released yet),
WRITECACHE, which basically allows the client to control which data is
written to the server's buffer cache and whether or not it should be
"sync"ed to the disk upon completion.

Regards,

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Brian <nogger> Berliner				What is a nogger, anyway?
Convex Computer Corp.
UUCP: {ihnp4, uiucdcs, sun, rice, allegra}!convex!berliner
ARPA: convex!berliner@rice.arpa

weiser.pa@Xerox.COM (02/06/88)

One reason writing through NFS is slower than reading is that writes commit to
the disk while reads can be buffered.  This is a feature of NFS: when a
'write' returns you know the bits are really magnetically rotating (or the
equivalent) somewhere, but it slows down writes.

-mark

peralta@encore.UUCP (Rick Peralta) (02/09/88)

Why not allow the user to specify SYNC or ASYNC writes and
allow the server to take advantage of caching?

uppal@hpindda.HP.COM (Sanjay Uppal) (02/11/88)

As has been mentioned before, the asynchronous option would
go against the stateless nature of NFS if one wanted to 
recover from crashes correctly. However, I believe that
for the latest revision of the protocol SUN will allow
such a call to be made.

In UNIX, for files too large to be covered by the inode's direct
block pointers, the indirect block also has to be written to the disk
synchronously.  That makes 3 blocks to be written synchronously
(data block, inode, and indirect block) for every data block.  Not
only does this decrease the overlap with processing on the cpu, but
it also increases the average time to write a block, as the blocks
are no longer physically sequential.
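
A back-of-the-envelope illustration (all of the numbers below are
assumptions of mine, not measurements, and network, RPC and cpu costs
are ignored completely):

    #include <stdio.h>

    int
    main(void)
    {
            double block_kbytes = 8.0;      /* assumed NFS transfer size */
            double ms_per_sync  = 30.0;     /* assumed seek + rotation per synchronous write */
            int    nsyncs       = 3;        /* data block + inode + indirect block */

            printf("disk-side ceiling: about %.0f Kbytes/sec\n",
                block_kbytes / (nsyncs * ms_per_sync / 1000.0));
            return 0;
    }

Three scattered synchronous writes per 8K block already pull the write rate
far below what the same disk could do streaming, before the network is even
involved.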


Sanjay Uppal (NN9T, VU2SMT)	phone:	(408) 447-3864
Hewlett-Packard (IND)		uucp: ...!hplabs!hpda!uppal
NS/800			 	arpa: uppal%hpda@hplabs.hp.com

mangler@cit-vax.Caltech.Edu (Don Speck) (02/14/88)

In article <63500011@convex>, berliner@convex.UUCP writes:
> At CONVEX, we have a special variable which can be patched on a running
> system (or built in during system configuration) which enables the
> CONVEX NFS server to operate asynchronously.	That is, to NOT do a
> synchronous write, but rather simply queue the write request in the
> buffer cache to be written to permanent disk storage at some later
> time (marked delayed-write).

This is unnecessarily risky; it could probably be handled better in
the client.  In conventional Unix filesystems, bufs for writing file
blocks are marked B_ASYNC, and nobody is waiting for them to complete;
so if the acknowledgements are slow in coming, that shouldn't inhibit the
client from sending some more to the server.  The client should blast
data at the server until either the server or the client cannot keep
track of more outstanding requests.  Unfortunately, the default limit on
the amount of outstanding unacknowledged data is a mere 6K!  It may
help to raise this, if the server's ethernet board has enough buffering,
but what is really needed is two kinds of acknowledgements, one that says
    "Your request is completed",
and a new one that says
    "I can accept another N kilobytes of requests"
(much like TCP's notion of a "window").  Having the latter would
obviate the need to specify rsize and wsize when mounting an
NFS filesystem.  Anyone who owns 3com's would appreciate that.
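
One way the client-side bookkeeping for such a window could look (this is
not part of any existing NFS implementation, just a sketch of the idea):

    #include <stddef.h>

    /*
     * TCP-like window for NFS writes: keep sending write requests until
     * the unacknowledged data reaches whatever the server last said it
     * was willing to hold.
     */
    struct write_window {
            size_t outstanding;     /* bytes sent but not yet completed */
            size_t granted;         /* unacknowledged data the server will accept */
    };

    int
    can_send(struct write_window *w, size_t len)
    {
            return w->outstanding + len <= w->granted;
    }

    void
    request_sent(struct write_window *w, size_t len)
    {
            w->outstanding += len;
    }

    void
    request_completed(struct write_window *w, size_t len)   /* first kind of ack */
    {
            w->outstanding -= len;
    }

    void
    window_update(struct write_window *w, size_t kbytes)    /* second kind of ack */
    {
            w->granted = kbytes * 1024;
    }

The mount-time wsize would then be no more than an initial guess that the
server's window advertisements immediately override.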

Don Speck   speck@vlsi.caltech.edu  {amdahl,ames!elroy}!cit-vax!speck

guy@gorodish.Sun.COM (Guy Harris) (02/15/88)

> In conventional Unix filesystems, bufs for writing file blocks are marked
> B_ASYNC, and nobody is waiting for them to complete; so if the acknowledgements
> are slow in coming, that shouldn't inhibit the client from sending some more
> to the server.

It doesn't, at least in Sun's implementation; writes are synchronous on the
server, but not on the client.  The "biod" processes handle asynchronous NFS
writes.  Note that this has the disadvantage that write errors are not
synchronously reported back to the program doing the "write"s; when doing NFS
writes, "out of space" errors are not synchronously reported.  A program should
probably do an "fsync" after writing out a lot of data, and check the return
code from "fsync", so that it can find out about such errors and report them to
the user.
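
A sketch of that defensive pattern (the path and sizes are arbitrary; the
point is where the error checks go):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/types.h>

    int
    main(void)
    {
            char buf[8192];
            int fd, i;

            memset(buf, 'x', sizeof buf);
            fd = open("/remote/outputfile", O_WRONLY | O_CREAT | O_TRUNC, 0666);
            if (fd < 0) {
                    perror("open");
                    exit(1);
            }
            for (i = 0; i < 256; i++) {
                    /* With asynchronous client-side NFS writes, a full
                       filesystem on the server may NOT show up here. */
                    if (write(fd, buf, sizeof buf) != (ssize_t)sizeof buf) {
                            perror("write");
                            exit(1);
                    }
            }
            /* This is where a deferred "out of space" error surfaces. */
            if (fsync(fd) < 0) {
                    perror("fsync");
                    exit(1);
            }
            if (close(fd) < 0) {            /* worth checking as well */
                    perror("close");
                    exit(1);
            }
            return 0;
    }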

wesommer@athena.mit.edu (William Sommerfeld) (02/15/88)

In article <41894@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes:
>It doesn't, at least in Sun's implementation; writes are synchronous on the
>server, but not on the client.  The "biod" processes handle asynchronous NFS
>writes.  Note that this has the disadvantage that write errors are not
>synchronously reported back to the program doing the "write"s; when doing NFS
>writes, "out of space" errors are not synchronously reported.  A program
>should probably do an "fsync" after writing out a lot of data, and check
>the return code from "fsync", so that it can find out about such errors
>and report them to the user.

In at least one version of NFS based on the Sun source (the Wisconsin
4.3+NFS for vaxen which we're using here at Athena), 99% of the code
needed to report write errors back to the client process was there.
However, the part which stored the error code from the write RPC call
into the rnode structure (the client side per-file state) was missing,
and thus fsync() and close() of NFS files could never return error
codes.

By the way, the biod's seem to be a very expensive way to implement a
`window'; that is, having one process per outstanding write packet.
If the Sun RPC client and server code could handle more than one
outstanding packet at a time, this could be fixed.

					- Bill

ed@mtxinu.UUCP (Ed Gould) (02/16/88)

In article <41894@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes:
>... when doing NFS writes, "out of space" errors are not synchronously
>reported.  A program should probably do an "fsync" after writing out a
>lot of data, and check the return code from "fsync", so that it can
>find out about such errors and report them to the user.

The return value from close() will also report the error code, so the
fsync() is not necessary in all cases.  It's necessary only if the program
wishes to check for errors but not close the file.

Note that there are errors that can occur on local filesystems that may
not be reported synchronously either.  Usually, these are hardware
errors (not software-only things like "out of space").  In fact, they
may not be reported to the user program at all.

Moral:  Check for and report errors from *all* system calls.

-- 
Ed Gould                    mt Xinu, 2560 Ninth St., Berkeley, CA  94710  USA
{ucbvax,uunet}!mtxinu!ed    +1 415 644 0146

"`She's smart, for a woman, wonder how she got that way'..."