brown@noao.arizona.edu (Mike Brown) (02/01/88)
Why is the transfer rate when a process writes to a remote NFS file 3-4 times smaller than the transfer rate when reading a remote NFS file? - Is this asymmetry a characteristic of NFS? - Do Ultrix(2.0) and 4.3BSD/NFS from Mt. Xinu both have brain damaged NFS implementations? - Is my test brain damages? I am surprized at the difference in the tranfer rate for writing a file compared to reading a file. I understand that creating/writing a file in Unix is slower because of extra overhead involved during file allocation. The difference I see is far greater than the difference in performance for reads/writes on a local file system in Unix. The disks I use have effective transfer rates of about 280-300 Kilobytes/sec. The test I ran was: writing: time /bin/cp /local/2_megabytes /remote/2_megabytes time /bin/cp /local/2_megabytes /remote/2_megabytes time /bin/cp /local/2_megabytes /remote/2_megabytes reading: time /bin/cp /remote/2_megabytes /local/2_megabytes time /bin/cp /remote/2_megabytes /local/2_megabytes time /bin/cp /remote/2_megabytes /local/2_megabytes The test was run between microvaxes running 4.3BSD from Mt. Xinu and it was also run between microvaxes running Ultrix 2.0. The systems were running multiuser with no other user processes active. The transfer rates were: reading (110-90 Kbytes/sec) writing (20-25 Kbytes/sec) Regards, Mike Brown Biomedical Computer Lab. Washington University 700 S. Euclid Ave. St. Louis, MO 63110 (314) 362-2135 uucp: uunet!wucs1!brown or {...}!noao!brown internet: brown@noao.arizona.edu ( Please excuse my posting this from arizona. ) ( News out of Wash. Univ. is broken. )
hedrick@athos.rutgers.edu (Charles Hedrick) (02/01/88)
You note that reading is faster than writing via NFS.  We see this on
Suns also, though the difference we see is not quite as drastic as
yours.  Some tests I just did showed writing a big file to be about a
factor of 3 slower than reading.  I believe this is because reading can
do readahead, but it is hard to predict what the next byte a user is
going to write is going to be.  (Note that the machine doing the reading
has a Ciprico controller, so the controller is doing a lot of readahead
also.  Reading across the network was actually twice as fast as reading
locally, because the local machine had a normal Xylogics controller.)

There are some other things that can affect NFS performance, though.
Try nfsstat before and after your test.  See if you are getting
retransmissions.  We have run into systems that can send data faster
than they can receive it (Sun 4's).  In that case, it turned out that
not running biod actually improved throughput by a factor of 3.  For
more normal cases, biod does help.  You can sometimes increase
performance by running more copies, e.g. /etc/biod 8 instead of
/etc/biod 4.
sxm@philabs.Philips.Com (Sandeep Mehta) (02/01/88)
In article <663@noao.UUCP> brown@noao.arizona.edu (Mike Brown) writes:
>
>Why is the transfer rate when a process writes to a remote NFS file 3-4 times
>smaller than the transfer rate when reading a remote NFS file?
>
> ......

File reading is probably faster because of the read ahead, but when you
write back blocks to a remote NFS server, the server probably waits for
acknowledgement of a successful write, i.e. the server waits until it
receives the ack before it trashes the cached block.  I would like to
know if there is a detailed answer to this.

sandeep
-- 
Sandeep Mehta                     TEL : (914)-945-6478
Robotics & Flexible Automation    UUCP: uunet!philabs!sxm
Philips Laboratories              ARPA: sxm@philabs.philips.com
sas@pyrps5 (Scott Schoenthal) (02/02/88)
In article <663@noao.UUCP> brown@noao.arizona.edu (Mike Brown) writes:
>
>Why is the transfer rate when a process writes to a remote NFS file 3-4 times
>smaller than the transfer rate when reading a remote NFS file?

Due to the stateless nature of NFS, all write requests received by a
server must be written synchronously to the disk before they can be
acknowledged to the client.  If this were not true, the following could
happen:

	Client A sends a write request (block 'n') to Server B.
	Server B acknowledges the write but does not write the block.
	Server B crashes.
	Server B then comes back up.
	Client A will then send block 'n+1' thinking that 'n' has been
	written out, and block 'n' is silently lost.

In a UNIX NFS server implementation, the inode and block maps must also
be updated during each synchronous block write.

sas
----
Scott Schoenthal 	sas@pyrps5.pyramid.com
Pyramid Technology Corp.
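To make the ordering sas describes concrete, here is a minimal
user-space sketch.  This is not NFS server code (real servers commit
inside the kernel); the handle_write() function and the test file are
invented for illustration, with fsync() standing in for the synchronous
commit that must complete before the reply RPC goes out:

	/* Sketch: an NFS-style WRITE handler modeled in user space.
	 * The reply may only be sent after fsync() succeeds, so a
	 * server crash can never lose an acknowledged block. */
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/types.h>

	/* Handle one write request: commit to disk, then acknowledge. */
	int handle_write(int fd, off_t offset, const char *buf, size_t len)
	{
	    if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
	        return -1;
	    if (write(fd, buf, len) != (ssize_t)len)
	        return -1;
	    if (fsync(fd) == -1)    /* force data (and inode) to stable storage */
	        return -1;
	    return 0;               /* only now may the reply RPC be sent */
	}

	int main(void)
	{
	    char block[8192];
	    int fd = open("testfile", O_WRONLY | O_CREAT, 0644);

	    if (fd == -1) { perror("open"); exit(1); }
	    memset(block, 'x', sizeof block);
	    if (handle_write(fd, 0, block, sizeof block) == -1) {
	        perror("handle_write");
	        exit(1);
	    }
	    close(fd);
	    return 0;
	}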
sxn%ingersoll@Sun.COM (Stephen X. Nahm) (02/02/88)
In article <663@noao.UUCP> brown@noao.arizona.edu (Mike Brown) writes:
>Why is the transfer rate when a process writes to a remote NFS file 3-4 times
>smaller than the transfer rate when reading a remote NFS file?
>
> - Is this asymmetry a characteristic of NFS?

Yes.  Here's a quote from ``The Sun Network Filesystem: Design,
Implementation and Experience'' by Russel Sandberg:

    ``Much of the development time of NFS has been spent in improving
    performance.  Our goal was to make NFS comparable in speed to a
    small local disk.  The speed we were interested in is not raw
    throughput, but how long it takes to do normal work.''

There is then a discussion of performance tweaks that were made,
concluding with:

    ``With these improvements, a diskless Sun-3 (68020 at 16.67 MHz.)
    using a Sun-3 server with a Fujitsu Eagle disk, runs the benchmarks
    faster than the same Sun-3 with a local Fujitsu 2243AS 84 Megabyte
    disk on a SCSI interface.''

Then:

    ``The 'write' [NFS operation] is slow because it is synchronous on
    the server.  Fortunately, the number of write calls in normal use
    is very small (about 5% of all calls to the server) so it is not
    noticeable unless the client writes a large remote file.''

Finally:

    ``Since many people base performance estimates on raw transfer
    speed we also measured those.  The current numbers on raw transfer
    speed are: 250 kilobytes/second for read and 60 kilobytes/second
    for write on a Sun-3 with a Sun-3 server.''

The factors you should be aware of are: NFS implementation, CPU
bandwidth (both server and client), and type of disk involved (both
server and client).

>The transfer rates were:  reading (110-90 Kbytes/sec)
>                          writing (20-25 Kbytes/sec)

Your ratios match well with what Rusty reported in his paper (his was
about 4.1 read-to-write, and yours is about 4.4).  Perhaps the CPU
bandwidth of your client and server is the limiting factor.

> - Do Ultrix(2.0) and 4.3BSD/NFS from Mt. Xinu both have brain damaged
>   NFS implementations?

I believe both are based on Sun's reference port.  We port SunOS NFS to
vanilla 4.2 and 4.3BSD.  The licensees use that as a reference in
building their own product.

> - Is my test brain damaged?

Not if this represents the typical kind of work you do.  However, a
fairer benchmark would be real work, like doing compiles, nroffs,
whatever.

Steve Nahm                      sxn@sun.COM or sun!sxn
Portable ONC/NFS
ps@diab.UUCP (Per Erik Sundberg) (02/02/88)
In article <663@noao.UUCP> brown@noao.arizona.edu (Mike Brown) writes:
>
>Why is the transfer rate when a process writes to a remote NFS file 3-4 times
>smaller than the transfer rate when reading a remote NFS file?
>
> - Is this asymmetry a characteristic of NFS?

Yes it is.  Due to the statelessness of the NFS protocol, writes are
SYNCHRONOUS.  This is to make sure that the data written is on stable
storage when the client call returns.  Local file writes are normally
asynchronous, which makes it possible to optimize the writes to disk
using delayed writes.

BTW, early Lachman System V NFS implementations have a bug in them that
makes the server side use async writes.  It improves performance a bit,
though :-)
-- 
Per-Erik Sundberg,  Diab Data AB
SNAIL: Box 2029, S-183 02 Taby, Sweden
ANALOG: +46 8-7680660
UUCP: mcvax!enea!diab!ps
grandi@noao.arizona.edu (Steve Grandi) (02/02/88)
In article <14163@pyramid.pyramid.com> sas@pyrps5.pyramid.com (Scott Schoenthal) writes:
>Due to the stateless nature of NFS, all write requests received by a
>server must be written synchronously to the disk before they can be
>acknowledged to the client.  If this were not true, the following could happen:
>
>In a UNIX NFS server implementation, the inode and block maps must
>also be updated during each synchronous block write.

At the Sun Users Group meeting I heard about one of the "hacks" added to
Sun OS and NFS to make NFS performance "good enough" to replace the
unloved ND protocol for booting, paging and swapping.

Consider what happens when an NFS server makes a write on behalf of a
client.  Even the simplest case of a "replacement" (a write that does
not cause the file to be extended) requires TWO synchronous writes: one
to write the data block and one to update the time fields in the inode.
Thus one of the hacks added to Sun OS 4.0 is to eliminate updating the
inode atime and mtime fields for a "non-extending" write.  Since client
swap space will be contained in pre-allocated files on the server, you
have just eliminated half of this source of overhead for client paging
and swapping in NFS.
-- 
Steve Grandi, National Optical Astronomy Observatories, Tucson AZ, 602-325-9228
UUCP: {arizona,decvax,hao,ihnp4}!noao!grandi  or  uunet!noao.arizona.edu!grandi
Internet: grandi@noao.arizona.edu    SPAN/HEPNET: 5356::GRANDI or DRACO::GRANDI
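A rough way to feel the per-block cost Steve describes from user space
is to time synchronous writes directly.  This is only an illustrative
measurement, not server code; on a real server each data write is paired
with a synchronous inode-time update, roughly doubling the disk
operations per block.  O_SYNC is the POSIX spelling (older BSDs used
O_FSYNC):

	/* Rough illustration: time N synchronous 8K writes.  A real
	 * NFS server pays roughly twice this per logical write once
	 * the synchronous inode update is counted. */
	#include <stdio.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>
	#include <sys/time.h>

	#define NBLOCKS 100

	int main(void)
	{
	    char block[8192];
	    struct timeval t0, t1;
	    double ms;
	    int i, fd = open("synctest",
	                     O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);

	    if (fd == -1) { perror("open"); return 1; }
	    memset(block, 'x', sizeof block);
	    gettimeofday(&t0, NULL);
	    for (i = 0; i < NBLOCKS; i++)       /* each write hits the disk */
	        if (write(fd, block, sizeof block) != sizeof block) {
	            perror("write");
	            return 1;
	        }
	    gettimeofday(&t1, NULL);
	    ms = ((t1.tv_sec - t0.tv_sec) * 1e6 +
	          (t1.tv_usec - t0.tv_usec)) / 1e3 / NBLOCKS;
	    printf("%.2f ms per synchronous 8K write\n", ms);
	    close(fd);
	    return 0;
	}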
dougs@sequent.UUCP (Doug Schwartz) (02/03/88)
In article <663@noao.UUCP>, brown@noao.arizona.edu (Mike Brown) writes:
> 
> Why is the transfer rate when a process writes to a remote NFS file 3-4 times
> smaller than the transfer rate when reading a remote NFS file?

I may not be able to explain why you are getting what you are getting,
but this seems to be in line with Sun's published statistics.  From the
Summer 1985 USENIX proceedings, page 129: "The current numbers on raw
transfer speed are: 120 kilobytes/second for read (cp bigfile /dev/null)
and 40 kilobytes/second for write."
-- 
Doug Schwartz
Sequent Computer
Beaverton, Oregon
tektronix!ogcvax!sequent!dougs
berliner@convex.UUCP (02/04/88)
At CONVEX, we have a special variable which can be patched on a running
system (or built in during system configuration) which enables the
CONVEX NFS server to operate asynchronously.  That is, to NOT do a
synchronous write, but rather simply queue the write request in the
buffer cache to be written to permanent disk storage at some later time
(marked delayed-write).

I decided to try your "cp 2 MegaBytes from client to server" timings to
see where we stand, and to see how the asynchronous server affects the
timings.  This is between two C-1's.  The results are as follows:

	        | Sync. | Async |
	+-------+-------+-------+
	| Write |  60kb | 186kb |
	| Read  | 204kb | 204kb |
	+-------+-------+-------+

Note: these numbers are completely unofficial and informal.  I'm just
reporting my quick timings.

I know that the "synchronous" nature of NFS relies on the fact that once
the server has responded to a write request, the client can rest assured
that the data is safely written to permanent storage.  That is why our
default behaviour is the "synchronous" one.  We document the tradeoffs
involved with turning on the asynch server option and leave it to our
customers to decide how they wish to use NFS.  I can tell you that
internally, we run things with asynch NFS turned on, due to the much
improved performance we see on our writes.  I can also tell you that you
can only get burned if your server crashes -- we have not seen running
the async NFS server in-house to be a problem.

I might also note that the next revision of the NFS spec (version 3)
seems to have some hooks to do just what we've done, but by adding it
into the spec.  There Sun may (it hasn't been released yet) add an NFS
request, WRITECACHE, which basically allows the client to control which
data is written to the server's buffer cache and whether or not it
should be "sync"ed to the disk upon completion.

Regards,
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Brian <nogger> Berliner                    What is a nogger, anyway?
Convex Computer Corp.
UUCP: {ihnp4, uiucdcs, sun, rice, allegra}!convex!berliner
ARPA: convex!berliner@rice.arpa
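A user-space sketch of what such a knob changes (the variable and
handler names are invented, and real servers do this inside the kernel):
in async mode the handler acknowledges as soon as the block is queued in
the cache, modeled here by skipping the fsync():

	/* Sketch of the sync/async server tradeoff, modeled in user
	 * space.  write() without fsync() stands in for queueing the
	 * block as a delayed-write in the buffer cache. */
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>

	static int async_server = 1;   /* models the patchable variable */

	/* Handle one write request; return 0 when it is safe to reply. */
	int handle_write(int fd, const char *buf, size_t len)
	{
	    if (write(fd, buf, len) != (ssize_t)len)
	        return -1;
	    if (!async_server && fsync(fd) == -1)  /* sync mode: commit first */
	        return -1;
	    return 0;   /* async mode: acknowledge with data still cached */
	}

	int main(void)
	{
	    char block[8192];
	    int fd = open("testfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);

	    if (fd == -1) { perror("open"); exit(1); }
	    memset(block, 'x', sizeof block);
	    if (handle_write(fd, block, sizeof block) == -1) {
	        perror("handle_write");
	        exit(1);
	    }
	    close(fd);
	    return 0;
	}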
weiser.pa@Xerox.COM (02/06/88)
One reason writing through NFS is slower than reading is that writes
commit to the disk while reads can be buffered.  It is a feature of NFS
that when a 'write' returns you know the bits are really magnetically
rotating (or equiv) somewhere, but it slows down writes.

-mark
peralta@encore.UUCP (Rick Peralta) (02/09/88)
Why not allow the user to specify SYNC or ASYNC writes and allow the
server to take advantage of caching?
uppal@hpindda.HP.COM (Sanjay Uppal) (02/11/88)
As has been mentioned before, the asynchronous option would go against
the stateless nature of NFS if one wanted to recover from crashes
correctly.  However, I believe that for the latest revision of the
protocol Sun will allow such a call to be made.

In UNIX, for files larger than the number of directly accessed blocks,
the indirect block too has to be written to the disk synchronously.
That makes it 3 blocks to be written synchronously for every data block
(the data block, the inode, and the indirect block).  Not only does this
decrease the overlapped processing with the cpu, but it also increases
the average time to write a block, as the blocks are no longer
physically sequential.

Sanjay Uppal (NN9T, VU2SMT)      phone: (408) 447-3864
Hewlett-Packard (IND)            uucp: ...!hplabs!hpda!uppal
NS/800                           arpa: uppal%hpda@hplabs.hp.com
mangler@cit-vax.Caltech.Edu (Don Speck) (02/14/88)
In article <63500011@convex>, berliner@convex.UUCP writes:
> At CONVEX, we have a special variable which can be patched on a running
> system (or built in during system configuration) which enables the
> CONVEX NFS server to operate asynchronously.  That is, to NOT do a
> synchronous write, but rather simply queue the write request in the
> buffer cache to be written to permanent disk storage at some later
> time (marked delayed-write).

This is unnecessarily risky; it could probably be handled better in the
client.

In conventional Unix filesystems, bufs for writing file blocks are
marked B_ASYNC, and nobody is waiting for them to complete; so if the
acknowledges are slow in coming, that shouldn't inhibit the client from
sending some more to the server.  The client should blast data at the
server until one of server or client cannot keep track of more
outstanding requests.  Unfortunately, the default limit on the amount of
unacknowledged outstanding requests is a mere 6K!  It may help to raise
this, if the server's ethernet board has enough buffering, but what is
really needed is two kinds of acknowledges: one that says "Your request
is completed", and a new one that says "I can accept another N kilobytes
of requests" (much like TCP's notion of a "window").  Having the latter
would obviate the need to specify rsize and wsize when mounting an NFS
filesystem.  Anyone who owns 3com's would appreciate that.

Don Speck   speck@vlsi.caltech.edu   {amdahl,ames!elroy}!cit-vax!speck
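A toy simulation of the windowed scheme Don describes (all names and
numbers are invented, and no real RPC is involved): the client keeps at
most WINDOW bytes of unacknowledged write requests in flight, the way
TCP bounds unacked data:

	/* Sketch of window-limited NFS writes.  send_write() and
	 * await_ack() are stand-ins for real RPC send/ack machinery. */
	#include <stdio.h>

	#define WINDOW  (24 * 1024)   /* server-granted window, bytes */
	#define BLOCK   (8 * 1024)    /* NFS write size */

	static long outstanding;      /* unacked bytes in flight */

	static void send_write(long offset)
	{
	    outstanding += BLOCK;
	    printf("send  offset %6ld  (outstanding %ld)\n",
	           offset, outstanding);
	}

	static void await_ack(void)
	{
	    outstanding -= BLOCK;     /* one completion ack frees one block */
	    printf("ack              (outstanding %ld)\n", outstanding);
	}

	int main(void)
	{
	    long offset, filesize = 64 * 1024;

	    for (offset = 0; offset < filesize; offset += BLOCK) {
	        while (outstanding + BLOCK > WINDOW)  /* window full: wait */
	            await_ack();
	        send_write(offset);
	    }
	    while (outstanding > 0)                   /* drain remaining acks */
	        await_ack();
	    return 0;
	}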
guy@gorodish.Sun.COM (Guy Harris) (02/15/88)
> In conventional Unix filesystems, bufs for writing file blocks are marked
> B_ASYNC, and nobody is waiting for them to complete; so if the acknowledges
> are slow in coming, that shouldn't inhibit the client from sending some more
> to the server.

It doesn't, at least in Sun's implementation; writes are synchronous on
the server, but not on the client.  The "biod" processes handle
asynchronous NFS writes.

Note that this has the disadvantage that write errors are not
synchronously reported back to the program doing the "write"s; when
doing NFS writes, "out of space" errors are not synchronously reported.
A program should probably do an "fsync" after writing out a lot of data,
and check the return code from "fsync", so that it can find out about
such errors and report them to the user.
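A sketch of the pattern Guy suggests (the output path here is made up):
errors deferred by asynchronous NFS writes surface at fsync(), and at
close(), so both return values are worth checking:

	/* Write ~2 MB, then force it out and check for deferred errors. */
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>
	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
	    char buf[8192];
	    int i, fd = open("/remote/outfile",
	                     O_WRONLY | O_CREAT | O_TRUNC, 0644);

	    if (fd == -1) { perror("open"); exit(1); }
	    memset(buf, 'x', sizeof buf);
	    for (i = 0; i < 256; i++)           /* 256 x 8K = ~2 megabytes */
	        if (write(fd, buf, sizeof buf) != sizeof buf) {
	            perror("write");            /* may miss deferred NFS errors */
	            exit(1);
	        }
	    if (fsync(fd) == -1) {              /* deferred errors show up here */
	        perror("fsync");
	        exit(1);
	    }
	    if (close(fd) == -1) {              /* close() can report them too */
	        perror("close");
	        exit(1);
	    }
	    return 0;
	}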
wesommer@athena.mit.edu (William Sommerfeld) (02/15/88)
In article <41894@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes:
>It doesn't, at least in Sun's implementation; writes are synchronous on the
>server, but not on the client.  The "biod" processes handle asynchronous NFS
>writes.  Note that this has the disadvantage that write errors are not
>synchronously reported back to the program doing the "write"s; when doing NFS
>writes, "out of space" errors are not synchronously reported.  A program
>should probably do an "fsync" after writing out a lot of data, and check
>the return code from "fsync", so that it can find out about such errors
>and report them to the user.

In at least one version of NFS based on the Sun source (the Wisconsin
4.3+NFS for vaxen which we're using here at Athena), 99% of the code
needed to report write errors back to the client process was there.
However, the part which stored the error code from the write RPC call
into the rnode structure (the client-side per-file state) was missing,
and thus fsync() and close() of NFS files could never return error
codes.

By the way, the biod's seem to be a very expensive way to implement a
"window"; that is, having one process per outstanding write packet.  If
the Sun RPC client and server code could handle more than one
outstanding packet at a time, this could be fixed.

					- Bill
ed@mtxinu.UUCP (Ed Gould) (02/16/88)
In article <41894@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes:
>... when doing NFS writes, "out of space" errors are not synchronously
>reported.  A program should probably do an "fsync" after writing out a
>lot of data, and check the return code from "fsync", so that it can
>find out about such errors and report them to the user.

The return value from close() will also report the error code, so the
fsync() is not necessary in all cases.  It's necessary only if the
program wishes to check for errors but not close the file.

Note that there are errors that can occur on local filesystems that may
not be reported synchronously either.  Usually, these are hardware
errors (not software-only things like "out of space").  In fact, they
may not be reported to the user program at all.

Moral: Check for and report errors from *all* system calls.
-- 
Ed Gould                    mt Xinu, 2560 Ninth St., Berkeley, CA  94710  USA
{ucbvax,uunet}!mtxinu!ed    +1 415 644 0146

"`She's smart, for a woman, wonder how she got that way'..."