monty@delphi.bsd.uchicago.edu (Monty Mullig) (06/14/89)
The following is a summary of the responses I received to my posting about
rsize and wsize for NFS-mounted partitions.  The first entry is a summary
of my original posting.  Thanks to those who responded.  --monty

Summary of results:  wc and cp were run on a 9.5MB file, with all activity
on this file on the test partition.

trial 1: read/write sizes using default (8k)
    fstab entry for /u1 partition:
        delphi:/u1 /u1 nfs rw 0 0
    average cp:  1:33.6  (93.6s)
    average wc:  1:45.6  (105.6s)

trial 2: read/write sizes of 2048, timeo=100
    fstab entry for /u1 partition:
        delphi:/u1 /u1 nfs rw,rsize=2048,wsize=2048,timeo=100 0 0
    average cp:  4:50.3  (290.3s)   +210.2% over defaults avg
    average wc:  2:09.3  (129.3s)    +22.4% over defaults avg

trial 3: read/write sizes of 1024, timeo=100
    fstab entry for /u1 partition:
        delphi:/u1 /u1 nfs rw,rsize=1024,wsize=1024,timeo=100 0 0
    average cp:  1:48.6  (108.6s)    +16.0% over defaults avg
    average wc:  1:45.0  (105.0s)     -0.6% over defaults avg
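For reference, a trial like these can be repeated by hand along the
following lines.  This is only a sketch: it assumes a SunOS-style mount
command (your system may want an explicit NFS type argument, or an edit
to /etc/fstab followed by "mount /u1"), and the test file name is made up.

    # remount /u1 with the options for trial 3
    umount /u1
    mount -o rw,rsize=1024,wsize=1024,timeo=100 delphi:/u1 /u1

    # time the two operations against the 9.5MB test file
    time cp /u1/testfile /u1/testfile.copy
    time wc /u1/testfile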
>----------------------------------------------<

Date: Thu, 25 May 89 22:28:46 EDT
From: dan@flash.bellcore.com (Daniel Strick)

The default rsize/wsize is 8k.  The recommendation that these parameters
be reduced to 2k or 1k was originally made to preserve the functionality
of old ethernet interfaces with only 2k of buffer space (beyond which
packets must be dropped).  It turns out that in addition to the limitation
in the old interfaces, there are kernel buffer resources that can be
exceeded (this usually happens when a fast machine blasts away at a slower
one), and therefore the rsize/wsize reduction recommendation is
periodically repeated even though the old ethernet interface is history.

If the destination of the nfs data is not overrun, the default 8k
rsize/wsize should be marginally the most efficient.  This is reflected by
your 8k and 1k tests.  I don't know what happened during your 2k tests
(you win a cigar).  Perhaps the maximum ethernet packet size of roughly
1500 bytes is relevant.

>-----------------------------------------<

Date: Thu, 25 May 89 22:28:40 EDT
From: hedrick@geneva.rutgers.edu

There's no reason to decrease rsize and wsize between Sun 3's and 4's on
the same Ethernet.  Rsize and wsize are a hack, for use only with Ethernet
controllers that don't have enough buffering to receive 6 back-to-back
packets.  The 3Com Ethernet cards used on most Sun 2's have this problem,
so you want to reduce wsize on a 3 that has mounted a 2, or rsize on a 2
that has mounted a 3.

Some gateways or bridges have trouble with large numbers of back-to-back
packets also.  This may be load-dependent.  The newest cisco hardware
works fine with default settings, as they are now using controller cards
with lots of on-board buffering.  Older cisco gateways (particularly those
using 3Com controller cards, but sometimes the Interlan cards have trouble
also) need reduced rsize and wsize.  I assume the same may be true of
other vendors.

Finally, if you have a link that tends to lose packets (e.g. a noisy
serial line), reducing the sizes could help too.  If you lose one packet
you have to resend the whole bunch, so reducing the size of the bunch
could help.  But you'd need very high error rates before you'd see this.

If you don't have one of these special situations where you need a smaller
size, then the defaults do better, since they decrease the RPC processing
overhead needed to handle a given amount of data.

Your test had a server that was faster than the client.  If you had the
reverse, e.g. a Sun 4 client and a Sun 3 server, then when the client
writes data to the server, the server may get overrun.  Generally we
suggest reducing the number of biod's rather than using rsize and wsize,
but if you needed to throttle just one particular mount, wsize might be
the way to do it.  We've never seen trouble due to the server being faster
than the client.

>-------------------------------------<

Date: Fri, 26 May 89 11:05:19 EDT
From: jas@proteon.com (John A. Shriver)

The default rsize and wsize are 8192.  The problem is that they send one
giant UDP packet of wsize, and let IP fragmentation make it small enough
to go across the Ethernet.  For 8192, that's six packets.  These packets
are sent as a *very* fast burst.  If any of the fragments get lost, all of
them are useless because of the IP unique ID.
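To spell out the six-packet figure (the header sizes below are my own
rough numbers, not from the posting): an 8k NFS read or write travels as a
single UDP datagram, and each Ethernet frame can carry only about 1480
bytes of IP payload (a 1500-byte MTU minus a 20-byte IP header), so

    8192 bytes / ~1480 bytes per fragment  =  5.5  ->  6 Ethernet frames

and losing any one of the six wastes the other five.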
The message below explains where the problem with the IP unique ID comes
from:

Date: Sat, 28 Dec 85 19:00:04 est
From: Larry Allen <apollo!lwa@uw-beaver.arpa>
Subject: ip fragmentation follies

I've been playing with IP fragmentation/reassembly and have discovered a
major crock in the Berkeley way of doing things.  This may have been
noticed by someone before, but I hadn't really thought about it.

What caused me to notice this was claims by some people (namely Sun) that
using very large IP packets and using IP-level fragmentation makes
protocols like NFS run faster.  This makes some sense (less
context-switching, etc), so we decided to try it.

We quickly noticed a problem, though: if a fragmented packet has to be
retransmitted (e.g. because one of the fragments was dropped somewhere),
the fragments of the retransmitted packet are not and cannot be merged
with those of the original packet!  Why?  Because the Berkeley code has no
notion of IP-level retransmission, and hence assigns a new IP-level packet
identifier to each and every IP packet it transmits!  And since the
IP-level identifier is the only way the receiver can tell whether two
fragments belong to the same packet, this means that the fragments of a
retransmitted packet can never be combined with those of the original.

What all this means in practice is this: for a fragmented IP packet to get
through to its receiver, all the fragments resulting from a single
transmission of that packet must get through.  If a single fragment is
lost, all the other fragments resulting from that transmission of the
packet are useless and will never be recombined with fragments from past
or future transmissions of the same packet.

This all explains (or at least partially explains) why people running 4.2
TCP connections across the Arpanet using 1024-byte packets were losing so
badly.  If the probability of fragment lossage is even moderately high, it
will often take three or more tries to get a fragmented packet across the
net.  Meanwhile, of course, the useless fragments from previous
transmissions are sitting on reassembly queues in the receiver (for 15
seconds, I think?), tying up buffering resources and increasing the
chances that fragments will be dropped in the future!

In the current Berkeley code, it's possible to imagine workarounds for
this problem for TCP: because TCP is in the kernel, it could have a side
hook into the IP layer to tell it "this packet is a retransmission, don't
give it a new IP identifier".  For protocols like UDP, however, the
acknowledgment and retransmission functions are done outside of the
kernel, and the only kernel interface that's available is Berkeley's
socket calls (sendto, recvfrom, etc).  Needless to say, the socket
interface gives you 1) no way to find out what IP identifier a packet was
sent with, and 2) no way to specify the IP identifier to use on an
outgoing packet.

I don't really have any idea what to do about this problem.  And, it's not
entirely Berkeley's fault; the BBN TCP/IP for 4.1bsd did the same thing...
In any case, until there's a fix, I don't think using IP
fragmentation/reassembly when talking to 4.2bsd systems is a very good
idea.
                                        -Larry

Well, the important thing is that this only matters when packets are being
lost.  The least likely time for that to happen is on an idle network at
midnight; the net has to be busy.  Also, the problem is for receiving data
on the slow host (the 3/50 in your case).  Try reading two large files,
from two different file servers, at the same time, with your 3/50.  That
will start causing it to lose packets.  For files the 3/50 mounts, you may
only need to set the rsize lower; the wsize may be fine.

The case is not that smaller rsize/wsize improves performance.  The case
is that if you are losing enough packets to blow performance to hell,
lowering rsize/wsize will save your ass.  Much of this should be greatly
improved when SunOS 4.1 comes out, with adaptive retransmission in NFS.

>----------------------------------------------<

Date: Fri, 26 May 1989 11:25-EDT
From: David.Maynard@K.GP.CS.CMU.EDU

About 2 years ago I did some fairly extensive benchmarks on the rsize,
wsize, and timeo options.  In addition to having machines of different
speeds (Sun-2/120 vs. Sun-3/160), I had to deal with LANbridges and IP
routers on a heavily loaded network.  About 6 months ago I did some more
limited tests using a Sun-3/50 instead of the Sun-2 on a similarly
convoluted network.  These tests were done under 3.X, so things could be
very different under 4.0.  In addition, the client machines had local
disks, so I was not affected by page/swap traffic that might change your
results.

First, to answer your question, the default maximum transfer size is 8192
unless the server is a Sun-2 with the 3Com ethernet board.  This
corresponds to the page size on most of the newer Suns, so you only need
one transfer to get a whole page.  In most cases, the default rsize and
wsize settings should work well.  Problems generally arise if your
combination of hardware and loading prevents one of the machines from
handling a fairly steady stream of large packets.  Two possible sources of
such problems are 1) speed differences between the client and the server,
and 2) limitations in the network itself.

If the server machine is much faster than the client, then what the server
considers a steady stream of packets may be an unmanageable flood to the
client.  With Sun-2's this could be a real problem.  I've also heard of
people having similar problems between Sun-3's and Sun-4's.  In this case,
the load on the client plays a major role in how bad the problem is.

The second source of problems is limitations in the network itself.  On
Sun-2's with the 3Com controller, the network interface doesn't deal well
with packets longer than 4K.  If your network has IP routers or bridges,
these network links can greatly limit your ability to transfer streams of
large packets.  Some IP routers are especially notorious for dropping
things under heavy loads.  The key to minimizing these problems for NFS is
limiting the overhead of having large numbers of small packets while
reducing the number of retransmissions due to dropped or late packets.

To get a feel for how your network behaves, try using 'spray' with various
packet sizes.  It isn't as accurate as NFS tests, but it is easier to do
while others are working.  Be sure to spray both from client to server and
from server to client.  By comparing the percentage of packets dropped in
the two directions you can get an idea of how CPU speed differences might
affect NFS (although only roughly, since spray represents the extreme case
of streaming packets).  Then look at the bandwidth numbers for the
different sizes.  Bandwidth should increase as packet size increases
(reduced overhead); this is why you want rsize and wsize to be as large as
possible.  However, the number of dropped packets also tends to increase
with size.  Unlike spray, NFS has to retransmit dropped packets, so
dropped packets can greatly reduce NFS performance.  If your network has
routers, you will also probably notice a drop-off point where performance
degrades rapidly for larger packets.
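A minimal sketch of that spray exercise, assuming the usual spray flags
(-c for the packet count, -l for the packet length); the host names and
sizes are only placeholders:

    # on the client, spray the server at a few packet sizes and note
    # the percentage dropped and the bandwidth for each size
    spray -c 1000 -l 512  server
    spray -c 1000 -l 1024 server
    spray -c 1000 -l 2048 server

    # then log into the server and spray the client the same way
    spray -c 1000 -l 512  client
    spray -c 1000 -l 1024 client
    spray -c 1000 -l 2048 client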
Once you have an idea of how the network behaves, start doing NFS tests
with 'cp,' 'wc,' or your favorite command.  Adjust rsize and wsize from
8192 down to 1024.  Also, adjust the timeo option from 7 (the default) up
to 20 or so.  For each test look at the elapsed time for the commands AND
the statistics reported by 'nfsstat' on the client; remember to zero the
nfsstat statistics between tests (there is a short sketch of this
bookkeeping at the end of this message).  The 'Client rpc' data reported
by nfsstat will tell you how many (if any) of the calls timed out (i.e.,
were dropped or were too late).

You want to keep the number of retransmissions low to get the best
performance.  One way of reducing retransmissions is to increase the timeo
option.  However, increasing the timeout introduces a delay before dropped
packets are retried.  With timeo=100, it will be 10 seconds before a
dropped packet is retried!  This delay can really hurt NFS performance.
Even on a bad network I have found that keeping timeo at 10 or less gives
me the best overall performance; on the other hand, that extra 3/10 of a
second over the default of 7 greatly reduces the number of timeouts on our
particular network.

To comment on your specific results, I would suspect either that you don't
have a problem and should just use the defaults, or that your tests were
skewed by the large timeo values.  One quick way to tell is to look at the
nfsstat results on a client that has been running for a while under normal
load.  If the percentage of client rpc calls that has timed out is greater
than 1/2% of the total, then you should probably do some more rsize and
wsize tests.  Because of our heavily loaded network and routers, I get the
best performance when around 1% of the packets time out.

I hope you don't mind the long explanation.  I guess it might be more
appropriate for Sun-Spots, where it might help someone who isn't familiar
with the background and hasn't already done a lot of tests.  Anyway, I
hope it helps.
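The nfsstat bookkeeping mentioned above, as a minimal sketch.  It assumes
the usual SunOS flags (-z zeroes the counters and needs root, -c shows the
client-side statistics), and the figures in the last comment are made up
purely for illustration:

    # zero the client's NFS/RPC counters before each test run (as root)
    nfsstat -z

    # ... run the cp/wc test against the NFS-mounted partition ...

    # then look at the 'Client rpc' counters
    nfsstat -c

    # e.g. if calls were 200000 and timeouts 1500, the timeout rate is
    # 1500 / 200000 = 0.75%, over the 1/2% rule of thumb, so further
    # rsize/wsize testing would be worthwhile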