anselmo-ed@CS.YALE.EDU (Ed Anselmo) (02/22/91)
Is there a magic formula for determining how many nfsd's to run?

"System Performance Tuning" by Loukides says "...4 is appropriate for
most situations. The effects of changing this number are hard to
predict and are often counterintuitive. For various reasons that are
beyond the scope of this book, increasing the number of daemons can
often degrade network performance."

The SunOS 4.1.1 man page suggests 8 nfsd's.

What happens when you run too many nfsd's? Too few?

In our case, we're running Sun 4/390 servers, typically with two 1 GB
IPI disks on one controller. Each server serves around 20
"sorta-standalone" Sun 4/60 clients (/, swap, and /usr are on the
client, /home is on the server).
--
Ed Anselmo   anselmo-ed@cs.yale.edu   {harvard,cmcl2}!yale!anselmo-ed
thurlow@convex.com (Robert Thurlow) (02/22/91)
In <28975@cs.yale.edu> anselmo-ed@CS.YALE.EDU (Ed Anselmo) writes:
>Is there a magic formula for determining how many nfsd's to run?

Surely a question for the FAQ, except that there's no definitive answer :-)

Someone I was talking to got a rule-of-thumb from a Sun tech support
person that said, "Run one nfsd for each local disk controller plus
one." If your nfsd's can keep all the local disk arms going, that's
pretty good, and then you add another one. Fewer nfsd's mean you lose
potential bandwidth, while many more mean you just have a lot of
nfsd's waiting on disk.

Rob T
--
Rob Thurlow, thurlow@convex.com
An employee and not a spokesman for Convex Computer Corp., Dallas, TX
gl8f@astsun7.astro.Virginia.EDU (Greg Lindahl) (02/22/91)
In article <28975@cs.yale.edu> anselmo-ed@CS.YALE.EDU (Ed Anselmo) writes:
>Is there a magic formula for determining how many nfsd's to run?

I was once told that the mystical figure was "one per network
interface and one per disk with exported filesystems."

>What happens when you run too many nfsd's? Too few?

If you have too many processes competing for the limited slots in the
hardware context cache, your machine will roll over and die. You can
look up this number in your hardware manuals somewhere. For low-end
sun4's the number is 8. I run 4 nfsd's on such machines. The same
problem can bite you with too many biods.

Another good thing to check is NFS timeouts... nfsstat(8C) will tell
you if the server is responding sufficiently quickly. I raised my
timeout until I saw zero timeouts.

Disclaimer: I'm not an expert, but I once asked for help on this very
topic here and nobody answered ;-)
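Greg's nfsstat check can be reduced to a single number. Below is a hedged sketch of that idea; the counter values and column layout are hypothetical, modelled loosely on SunOS-style `nfsstat -rc` client RPC output, and on a real client you would pipe in the live command instead of the here-string:

```shell
# Hypothetical sample of nfsstat -rc style client RPC counters; the
# real numbers would come from your own client.
sample='calls    badcalls retrans  badxid   timeout  wait
10000    12       800      3        800      0'

# Percentage of calls that were retransmitted (field 3 over field 1);
# more than a few percent suggests the server is responding too slowly.
retrans_pct=$(echo "$sample" | awk 'NR==2 { printf "%d", ($3 * 100) / $1 }')
echo "retransmitted: ${retrans_pct}% of calls"
```

With these made-up counters the sketch reports 8%, which by Greg's "raise the timeout until I saw zero timeouts" standard would be well worth chasing.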
richard@aiai.ed.ac.uk (Richard Tobin) (02/23/91)
In article <1991Feb22.012532.26075@murdoch.acc.Virginia.EDU> gl8f@astsun7.astro.Virginia.EDU (Greg Lindahl) writes:
>If you have too many processes competing for the limited slots in the
>hardware context cache, your machine will roll over and die. You can
>look up this number in your hardware manuals somewhere. For low-end
>sun4's the number is 8. I run 4 nfsd's on such machines. The same
>problem can bite you with too many biods.

Given that nfsd runs in kernel mode inside nfssvc(), is this statement
about contexts correct? If so, why is the default number of nfsds for
Sun 3s 8?

-- Richard
--
Richard Tobin,             JANET: R.Tobin@uk.ac.ed
AI Applications Institute,  ARPA: R.Tobin%uk.ac.ed@nsfnet-relay.ac.uk
Edinburgh University.       UUCP: ...!ukc!ed.ac.uk!R.Tobin
john@iastate.edu (Hascall John Paul) (02/24/91)
In article <28975@cs.yale.edu> anselmo-ed@CS.YALE.EDU (Ed Anselmo) writes:
}Is there a magic formula for determining how many nfsd's to run?

So far, we've seen four answers:

   -- 4
   -- 8
   -- (1 * n_disk_interfaces) + 1
   -- (1 * n_disks_exported) + (1 * n_network_interfaces)

Now, consider our situation (DEC 5000, 2 SCSI x 4 x 1GB, 1 net); that gives:

   -- 4
   -- 8
   -- 3 = (1 * 2) + 1
   -- 9 = (1 * 8) + (1 * 1)

We are running 20, which is almost assuredly too many.

Does anyone have a solid answer for this question? Also, does the
number of clients to be served have any impact? Perhaps a true answer
can only be obtained empirically? Is there such a thing as a
test-client which makes a series of nfs rpc calls to determine
response time stats? I am sure many could benefit from such
information [FAQ?].

Thanks,
John Hascall <john@iastate.edu>
--
John Hascall          An ill-chosen word is the fool's messenger.
Project Vincent
Iowa State University Computation Center   john@iastate.edu
Ames, IA 50011                             (515) 294-9551
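The four rules of thumb John tabulates are simple enough to restate as shell arithmetic. This is just a sketch of the thread's formulas, fed with John's DEC 5000 numbers (2 disk controllers, 8 exported disks, 1 network interface); it is not a recommendation:

```shell
# John's DEC 5000 configuration (hypothetical stand-ins for your own).
controllers=2
disks=8
interfaces=1

echo "flat 4:                     4"
echo "flat 8:                     8"
echo "controllers + 1:            $(( controllers + 1 ))"
echo "disks + network interfaces: $(( disks + interfaces ))"
```

The spread of answers (3 to 9 here) is itself the point: the heuristics disagree by a factor of three on the same machine.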
thurlow@convex.com (Robert Thurlow) (02/24/91)
In <1991Feb24.025821.11354@news.iastate.edu> john@iastate.edu (Hascall John Paul) writes:
>Does anyone have a solid answer for this question?

I think that part of the problem is that nobody has done a really good
job at finding an answer, and part is that there just isn't _one_
answer; there's one for each job mix, machine pair, and user
community.

I was reminded of a reason to keep the numbers down that I had
forgotten - SunOS will wake up all of the processes sleeping on the
NFS input request socket, and the first one will get in. That could be
a lot of process jostling. Since the nfsd's on a Convex sleep on a
queued semaphore, this isn't an issue on our machines.

Rob T
--
Rob Thurlow, thurlow@convex.com
An employee and not a spokesman for Convex Computer Corp., Dallas, TX
chucka@cup.portal.com (Charles - Anderson) (02/24/91)
> Is there a magic formula for determining how many nfsd's to run?
>
> "System Performance Tuning" by Loukides says "...4 is appropriate for
> most situations. The effects of changing this number are hard to
> predict and are often counterintuitive. For various reasons that are
> beyond the scope of this book, increasing the number of daemons can
> often degrade network performance."
>
> The SunOS 4.1.1 man page suggests 8 nfsd's.
>
> What happens when you run too many nfsd's? Too few?
>
> In our case, we're running Sun 4/390 servers, typically with 2 1 GB
> IPI disks on one controller. Each server serves around 20
> "sorta-standalone" Sun 4/60 clients (/, swap, and /usr are on the
> client, /home is on the server).
> --
> Ed Anselmo   anselmo-ed@cs.yale.edu   {harvard,cmcl2}!yale!anselmo-ed

We are running 50-60 clients using DOS PCNFS and 4-6 mounts each. We
picked 16. It seems to be working pretty well. We have 16 Meg memory
and 1 G disk. I think we could get by with 12, but no one is
complaining about performance, which can go down due to swapping.

Chuck Anderson
davecb@yunexus.YorkU.CA (David Collier-Brown) (02/25/91)
In <1991Feb24.025821.11354@news.iastate.edu> john@iastate.edu (Hascall John Paul) writes:
| Does anyone have a solid answer for this question?

thurlow@convex.com (Robert Thurlow) writes:
| I think that part of the problem is that nobody has done a really
| good job at finding an answer, and that part is that there just
| isn't _one_ answer; that there's one for each job mix, machine
| pair, and user community.

This can be determined empirically with fair ease on a machine which
is swapping/paging via nfs: log in and start X windows and some
applications. We did it with three people (on three workstations) in a
couple of hours without even needing a quiescent net.

--dave
--
David Collier-Brown,   | davecb@Nexus.YorkU.CA | lethe!dave
72 Abitibi Ave.,       |
Willowdale, Ontario,   | Even cannibals don't usually eat their
CANADA. 416-223-8968   | friends.
G.Eustace@massey.ac.nz (Glen Eustace) (02/25/91)
The appropriate number of nfsds is definitely environment-dependent
and, for that matter, application (client) dependent. We followed the
recommendations of our SE when our Pyramid 9815 was installed, 3 disks
and 2 ethernets. Whatever formula was applied (a guess is as good as
any :-) ), the number arrived at was 8.

After running this way for over a year and having continual problems
with slow network response and insufficient CPU available on the
server, a performance analysis was conducted. Now for the
counter-intuitive part!! We have reduced the number of nfsds to 4 and
most of our problems have gone away. The network performance is better
(but could still be improved) and we have sufficient CPU left over to
do some other useful activities like servicing mount requests, pcnfs
authentication, print requests, etc. :-;

From this experience we have stuck to 4 nfsds and, where appropriate,
4 biods on all other machines, DEC3100s and a DEC5000. We have found
this to be satisfactory.

The moral of the story: in calculating the number of nfsds, the other
processor activity needs to be considered.
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Glen Eustace, Software Manager, Computer Centre, Massey University,
Palmerston North, New Zealand.  EMail: G.Eustace@massey.ac.nz
Phone: +64 63 69099 x7440, Fax: +64 63 505 607, Timezone: GMT-12
liam@cs.qmw.ac.uk (William Roberts;) (02/26/91)
In <4218@skye.ed.ac.uk> richard@aiai.ed.ac.uk (Richard Tobin) writes:
>In article <1991Feb22.012532.26075@murdoch.acc.Virginia.EDU> gl8f@astsun7.astro.Virginia.EDU (Greg Lindahl) writes:
>>If you have too many processes competing for the limited slots in the
>>hardware context cache, your machine will roll over and die.

>Given that nfsd runs in kernel mode inside nfssvc(), is this statement
>about contexts correct? If so, why is the default number of nfsds for
>Sun 3s 8?

1) Hardware contexts are a feature of the MMU/instruction cache, and
so Greg's comment is specific to Sun4 machines.

2) The nfssvc() system call never returns, but the process slot of the
caller is used as a cheap way to implement multi-threading in the
kernel. When an nfsd runs out of work, it does a sleep() waiting for
requests to come in on the socket bound to UDP port 2049. When it
handles a request involving disk I/O, it sleeps waiting for the
result. All nice straightforward kernel stuff provided that you have a
process table entry to work with, but impossible if you don't. Using a
kernel like Mach or Chorus it would be done with true multi-threading
of the code, so lightweight process switches would be the only
overhead; under most UNIX systems it takes a full context switch and
hence can thrash the hardware contexts (if applicable).

Historical note: prior to SunOS 3.2, there was no wakeupone() routine
and so ALL sleeping nfsd processes would be made runnable when a
request came in. One would get to handle the request, and the others
would needlessly run, find nothing to do and go back to sleep. Under
this scheme it was possible to find that the file server got slower at
night, and needless to say it was impossible to choose a "good number"
since there were penalties for having too many for the current load. I
once tried 30 nfsds on a WhiteChapel MG1 server: it stopped.

Nowadays only one nfsd will wake up for an incoming request, hence the
change in "good number" from "4" to "4 or more". One plausible way to
determine how many nfsds you need is to ask for lots, then see how
many use up CPU time (NB. this depends on sleep queues being managed
unfairly, so it might not work these days).
--
William Roberts                  ARPA: liam@cs.qmw.ac.uk
Queen Mary & Westfield College   UUCP: liam@qmw-cs.UUCP
Mile End Road                    AppleLink: UK0087
LONDON, E1 4NS, UK               Tel: 071-975 5250 (Fax: 081-980 6533)
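William's "ask for lots, then see how many use up CPU time" idea can be sketched as a one-liner over process listings. The ps-style lines below are hypothetical; on a live server you would substitute something like `ps ax | grep nfsd` for the here-string, and the column positions may differ on your system:

```shell
# Hypothetical ps-style lines for a bank of nfsds: PID, TT, STAT, TIME,
# COMMAND.  Only the first two daemons have ever accumulated CPU time.
sample='  101 ?  D  12:34 nfsd
  102 ?  D   8:05 nfsd
  103 ?  S   0:00 nfsd
  104 ?  S   0:00 nfsd'

# Count daemons whose TIME column (field 4) is not 0:00; those are the
# ones that have actually done work since boot.
busy=$(echo "$sample" | awk '$4 != "0:00" { n++ } END { print n + 0 }')
echo "nfsds that ever used CPU time: $busy"
```

Here 2 of the 4 daemons have done any work, which by William's heuristic would argue the other two are surplus (with his own caveat that fair sleep-queue scheduling defeats the trick).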
jim@cs.strath.ac.uk (Jim Reid) (02/27/91)
In article <thurlow.667370019@convex.convex.com> thurlow@convex.com (Robert Thurlow) writes:
......................... I was reminded of a reason to keep the
numbers down that I had forgotten - SunOS will wake up all of the
processes sleeping on the NFS input request socket, and the first
one will get in. That could be a lot of process jostling.
This was certainly true for the early NFS implementations: say around
the time of SunOS2.0. Later implementations included a new kernel
function - wakeup_one() - which will only wake up one process waiting
on an event instead of every process waiting on that event. This is
intended to save the overheads of scheduling N nfsd processes only to
see N-1 of them immediately go back to sleep. The routine is still in
SunOS 4.[01].
Jim
ckollars@deitrick.East.Sun.COM (Chuck Kollars - Sun Technical Marketing - Boston) (02/27/91)
Both Mike Loukides' book "System Performance Tuning" and Hal Stern's
presentation at SUG have raised interest in the question of how many
nfsd's/biod's. Here's my summary of public and private answers:

In article <28975@cs.yale.edu> anselmo-ed@CS.YALE.EDU (Ed Anselmo) writes:
>Is there a magic formula for determining how many nfsd's to run?

There's a magic procedure for adjusting the number of nfsd's, and
several magic formulas for picking a starting value.

Q: So what's the magic procedure?

Run 'netstat -s' a few times, and look at the number of "socket
overflows" in the "udp:" category. If it's growing, run more nfsd's.
If it's zero, you're okay. (If it's large but not changing, probably
somebody ran 'spray' on your network yesterday.)

Q: So what are the magic formulas for a starting value?

Variation 1-- #(disk controllers) + 1
Variation 2-- #(disk controllers) + #(ethernet interfaces)
Variation 3-- #(disk controllers) * min(4, #(spindles/controller))
Variation 4-- 4 for a small server, 8 for a midsize server, 16 for a large server

Q: Does the number of nfsd's have to be a multiple of 2?

No, there can be any number of nfsd's.

>"System Performance Tuning" by Loukides says "...4 is appropriate for
>most situations. The effects of changing this number are hard to
>predict and are often counterintuitive. For various reasons that are
>beyond the scope of this book, increasing the number of daemons can
>often degrade network performance."
>
>The SunOS 4.1.1 man page suggests 8 nfsd's.
>
>What happens when you run too many nfsd's?

Server performance may suffer. The amount of degradation depends on
your hardware, your software, and just how many "extra" nfsd's you
have. In most cases the penalty for erring on the high side is small.

On machines with hardware support for multiple contexts (ex:
Sun4/SPARC), running more nfsd's than there are hardware contexts will
force context switching into software.

On machines without software support for signaling semaphores, an NFS
request will wake up _all_ the sleeping nfsd's, which will stumble all
over each other, resulting in a lot of unnecessary context switches
and degraded server performance.

>Too few?

The RPC queue of NFS requests will overflow, resulting in timeouts,
retransmissions, backoffs, and hence noticeably degraded client
performance. This serious effect can make a large expensive server act
like a boat anchor.

Q: How important is it to get the number of nfsd's exactly right?

Not very. To quote Hal Stern, "This fine-tuning may yield minor
improvements, but major performance problems can usually be traced to
other bottlenecks or problems."

Q: Why do some people run very large numbers of nfsd's, 48 or even 64?

'nhfsstone' benchmark results can often be improved by running a very
large number of nfsd's on the server under test. Doing this proves
that it's possible. It does not prove that it's helpful on a
production server with a mixed or bursty workload.

Q: What if the NFS requests are very bursty?

Running enough nfsd's to handle the peaks seems to imply running way
too many nfsd's most of the time. NFS requests are seldom so bursty
that this is a problem. If it is, you could increase "udp_recvspace"
in sys/netinet/in_proto.c and rebuild your kernel.

Q: Does the number of nfsd's being run affect other things?

Yes; be sure MAXUSERS in your kernel config allocates enough kernel
resources to run all your nfsd's. Despite what its name seems to
imply, MAXUSERS is a "large knob" that controls the size of statically
allocated tables in the kernel. On large servers with lots of RAM,
turn the knob "way up", otherwise those nfsd's may overflow some
kernel table.

Q: Is there a similar magic formula or procedure for biod's?

Yes: always run the default 4 biod's per client. Changing the number
of biod's is usually counterproductive. Reducing the number could
cripple client performance. Increasing the number could cripple
network or server performance. Changing the number of biod's on the
clients either way will require adjusting the number of nfsd's on the
server.
---
chuck kollars <ckollars@East.Sun.COM>
Sun Technical Marketing, located in Sun's Boston Development Center
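Chuck's "socket overflows" procedure lends itself to a small script. This is a hedged sketch: the two snapshots below are hypothetical stand-ins for the "udp:" section of two `netstat -s` runs taken a few minutes apart, and the line format is modelled on SunOS-style output:

```shell
# Two hypothetical snapshots of the udp: section of netstat -s; on a
# live server you would capture `netstat -s` itself, some minutes apart.
before='udp:
        0 incomplete headers
        17 socket overflows'
after='udp:
        0 incomplete headers
        42 socket overflows'

# Pull the counter off the "socket overflows" line.
overflows() { echo "$1" | awk '/socket overflows/ { print $1 }'; }

if [ "$(overflows "$after")" -gt "$(overflows "$before")" ]; then
    echo "socket overflows growing - consider running more nfsd's"
else
    echo "no overflow growth - nfsd count looks adequate"
fi
```

With these made-up counters the overflow count grew from 17 to 42 between samples, which by Chuck's rule says the nfsd pool is too small (a large but static count would instead suggest yesterday's 'spray').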
jstampfl@iliad.West.Sun.COM (John Stampfl - Sun Silicon Valley SE) (03/01/91)
>Q: So what are the magic formulas for a starting value?
>
>Variation 1-- #(disk controllers) + 1
>Variation 2-- #(disk controllers) + #(ethernet interfaces)
>Variation 3-- #(disk controllers) * min(4, #(spindles/controller))
>Variation 4-- 4 for a small server, 8 for a midsize server, 16 for a large server

Chuck, I have found several problems when there are not enough nfsd's,
in particular "server not responding" and slow copies from a fast
client to a slow server. Also, running with presto, or presto and
OMNI, usually requires fewer nfsd's or OMNI_nfsd's. I like 24 nfsd's
as a starting value on all pure nfs servers; if you are doing compute
serving also, or comms or other cpu-intensive jobs, you may want a
smaller number.

>>What happens when you run too many nfsd's?
>Server performance may suffer.

I haven't seen nfs server performance suffer, but have heard of mixed
cpu/nfs servers suffering.

>On machines with hardware support of multiple contexts (ex:
>Sun4/SPARC), running more nfsd's than there are hardware contexts will
>force context switching into software.

I believe that nfsd's run in kernel space and thus do not do hardware
context switches.

>Q: Why do some people run very large numbers of nfsd's, 48 or even 64?
>
>'nhfsstone' benchmark results can often be improved by running a very
>large number of nfsd's on the server under test. Doing this proves
>that it's possible. Doing this does not prove that it's helpful on a
>production server with a mixed or bursty workload.

I have not seen much improvement in running over 24 nfsd's; the curve
has flattened out. With omni and presto, 16 may be enough. One thing:
if you have a system with nfs performance problems and are running
fewer than 24 nfsd's, try 8 or so more; you may also need to increase
maxusers. In my experience going to 24 nfsd's or so doesn't hurt much,
except for small-memory cpu/nfs servers. If more nfsd's don't help,
then the lack of nfsd's wasn't the problem and you can reduce the
number.

Keep up the good work; maybe someday I will understand what is going on.
cornell@csl.dl.nec.com (Cornell Kinderknecht) (03/01/91)
>> Is there a magic formula for determining how many nfsd's to run?
I was talking to a Sun software support person yesterday about a different
subject but she made the comment that based on Sun's tests, after 22 nfsd's
running, there was no increase in performance. So she had recommended a
maximum of 22.
I suppose there is no problem with running more, besides that they
aren't needed. Running too few, I would guess, might cause problems
when a lot of nfs requests come in at the same time.
--- Cornell Kinderknecht
pcg@cs.aber.ac.uk (Piercarlo Grandi) (03/01/91)
I have crossposted to comp.arch, because this is really a system/network
architecture question. NFS is almost incidental :-).
On 22 Feb 91 16:14:12 GMT, richard@aiai.ed.ac.uk (Richard Tobin) said:
richard> In article <1991Feb22.012532.26075@murdoch.acc.Virginia.EDU>
richard> gl8f@astsun7.astro.Virginia.EDU (Greg Lindahl) writes:
gl8f> If you have too many processes competing for the limited slots in the
gl8f> hardware context cache, your machine will roll over and die. You can
gl8f> look up this number in you hardware manuals somewhere. For low-end
gl8f> sun4's the number is 8. I run 4 nfsd's on such machines. The same
gl8f> problem can bite you with too many biods.
richard> Given that nfsd runs in kernel mode inside nfssvc(), is this
richard> statement about contexts correct?
Yes and no, depending on who is your vendor, and which OS revision and
machine model you have. For Sun there is some history that may be worth
mentioning. Under SunOS 3 the nfsds were in effect kernel processes, so
that they could access the buffer cache, held in the kernel address
space, without copies. Since all nfsds run in the kernel page table
there was no problem.
Under SunOS 4 the buffer cache went away, so each nfsd was given its own
address space (memory mapped IO), while still being technically a kernel
process. This meant that MMU slot thrashing was virtually guaranteed, as
the nfs daemons are activated more or less FIFO and the MMU has a LIFO
replacement policy. As soon as the number of nfsds is greater or equal
to the number of MMU slots problems happen.
I have seen the same server running the same load under SunOS 3 the day
before with 10-20% system time and 100-200 context switches per second,
and with SunOS 4 the day after with 80-90% system time and 800-900
context switches per second. An MMU slot swap on a Sun 3 will take about
a millisecond, which fits.
Under SunOS 4.1.1 things may well be different, as Sun may have
corrected the problem (by making all the nfsds share a single address
space and giving each of them a section of it in which to map the
relevant files, for example, or by better tuning the MMU cache
replacement policy to the nfsd activation patterns, for another
example). On larger Sun 4s there are many more MMU slots, say 64, so the
problem effectively does not happen for any sensible number of nfsds.
richard> If so, why is the default number of nfsds for Sun 3s 8?
Sun bogosity :-).
As to the general problem of how many NFS daemons, I have already posted
long treatises on the subject. However briefly the argument is:
Each nfsd is synchronous, that is it may carry out only one operation at
a time, in a cycle: read request packet, find out what it means, go to
the IO subsystem to read/write the relevant block, write the result
packet, loop.
Clearly on a server that has X network interfaces, Y CPUs, and Z disks
(if your controller supports overlapping transfers, otherwise it is the
number of controllers) there cannot be more than X+Y+Z nfsds active, as
at most X nfsds can be reading or writing a packet from a network
interface, at most Y nfsds can be running kernel code, and at most Z
nfsds can be waiting for a read or a write from a disk.
The optimum number may be lower than X+Y+Z, because it is damn unlikely
that the maximum multiprogramming level will actually be as high as
that, and there may be other processes that compete with nfsds for the
network interfaces, or the CPUs, or the disks.
It may also be higher, because this would allow multiple IO requests to
be queued waiting for a disk, thus giving the arm movement optimizer a
chance to work (if there is only ever one outstanding request per disk,
this implies a de facto FCFS arm movement policy).
The latter argument is somewhat doubtful as there is contradictory
evidence about the relative merits of FCFS and of elevator style sorting
as used by the Unix kernel.
All in all I think that X+Y+Z is a reasonable estimate, or maybe a
slightly larger number than that if you are persuaded that giving a
chance to the disk request sorter is worthwhile (which may not be true
for a remote file server, as opposed to a timesharing system where it
is almost always worthwhile).
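The X+Y+Z bound above is easy to instantiate. As a sketch, with hypothetical example values for a small server (1 Ethernet interface, 1 CPU, 2 disks on a controller that supports overlapping transfers):

```shell
# Piercarlo's upper bound on simultaneously active nfsds: X network
# interfaces + Y CPUs + Z disks (use controllers for Z if transfers
# can't overlap).  Example values are hypothetical.
X=1   # network interfaces
Y=1   # CPUs
Z=2   # disks with exported filesystems
echo "at most $(( X + Y + Z )) nfsds can be simultaneously active"
```

For this configuration the bound is 4, which happens to coincide with the Loukides default; a larger server with more spindles and interfaces would of course compute higher.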
Naturally this is only the "benefit" side of the equation. As to the
"cost" side, it used to be that nfsds had a very low cost (a proc table
slot each and little more), so slightly overallocating them was not a
big problem. But on some OS/machine combinations the cost becomes very
large over a certain threshold, and this may mean that reducing the
number below the theoretical maximum pays off.
Finally there is the question of the Ethernet bandwidth. In the best of
cases an Ethernet interface can read about 1000 packets/s, and
write 800KB/s (we assume that requests are small, so the number of
packets/s matters, while results are large, so the number of KB/s
matters; stat(2) and read/exec(2) are far more common than write(2)).
Divide that by the number of clients that may be actively requesting
data (usually about a tenth of the total number of machines on a wire
are actively doing remote IO), and you get pretty depressing numbers.
It may be pointless to have, say, 4 2MB/s server disks each capable of
doing 50 transactions per second, each involving say 8-16KB, and
enough nfsds to take advantage of this parallelism and bandwidth, if
the Ethernet wire and interface are the bottleneck.
--
Piercarlo Grandi | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
lm@slovax.Berkeley.EDU (Larry McVoy) (03/02/91)
In article <PCG.91Feb28195347@odin.cs.aber.ac.uk>, pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
|> richard> Given that nfsd runs in kernel mode inside nfssvc(), is this
|> richard> statement about contexts correct?
|>
|> Yes and no, depending on who is your vendor, and which OS revision and
|> machine model you have. For Sun there is some history that may be worth
|> mentioning. Under SunOS 3 the nfsds were in effect kernel processes, so
|> that they could access the buffer cache, held in the kernel address
|> space, without copies. Since all nfsds run in the kernel page table
|> there was no problem.
|>
|> Under SunOS 4 the buffer cache went away, so each nfsd was given its own
|> address space (memory mapped IO), while still being technically a kernel
|> process. This meant that MMU slot thrashing was virtually guaranteed, as
|> the nfs daemons are activated more or less FIFO and the MMU has a LIFO
|> replacement policy. As soon as the number of nfsds is greater or equal
|> to the number of MMU slots problems happen.

More misinformation spoken in an authoritative manner. Ah, well, I've
done the same I guess. From sys/nfs/nfs_server.c:

nfs_svc()
    ....
    /* Now, release client memory; we never return back to user */
    relvm(u.u_procp);

From the SCCS history (note the date):

D 2.83 87/12/15 18:34:42 kepecs 88 87
remove virtual memory in async_daemon and nfs_svc as it's not needed.
Remove pre-2.0 code to set rdir in nfs_svc. make sure these guys exit
if error, since no vm to return to.

In other words, this problem went away 3 years ago, never to return.

|> I have seen the same server running the same load under SunOS 3 the day
|> before with 10-20% system time and 100-200 context switches per second,
|> and with SunOS 4 the day after with 80-90% system time and 800-900
|> context switches per second. An MMU slot swap on a Sun 3 will take about
|> a millisecond, which fits.

You may well have seen this. Jumping to the conclusion that it is
caused by NFS is false; at least the reasons that you list are not
true.

|> richard> If so, why is the default number of nfsds for Sun 3s 8?
|>
|> Sun bogosity :-).

Piercarlo ignorance :-)

|> The latter argument is somewhat doubtful as there is contradictory
|> evidence about the relative merits of FCFS and of elevator style sorting
|> as used by the Unix kernel.

Ahem. References, please? I've looked at this one in detail; this
should be interesting.
---
Larry McVoy, Sun Microsystems     (415) 336-7627
...!sun!lm or lm@sun.com
pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) (03/05/91)
[ this article may have already appeared; I repost it because probably
it did not get out of the local machine; apologies if you see it more
than once ]

[ ... on SUN NFS/MMU sys time bogosity ... ]

pcg> I have seen the same server running the same load under SunOS 3 the
pcg> day before with 10-20% system time and 100-200 context switches per
pcg> second, and with SunOS 4 the day after with 80-90% system time and
pcg> 800-900 context switches per second. An MMU slot swap on a Sun 3
pcg> will take about a millisecond, which fits.

On 1 Mar 91 21:37:30 GMT, Larry McVoy commented:

lm> You may well have seen this. Jumping to the conclusion that it is
lm> caused by NFS is false, at least the reasons that you list are not
lm> true.

This you say after recognizing above that the problem existed and
claiming that in recent SunOS releases it has been obviated. Now you
seem to hint that it is NFS related, but not because of MMU context
switching.

As to me, my educated guesses are: this bogosity appears to be
strictly correlated with the number of NFS transactions processed per
second; the overhead per transaction seems to be about 1ms, and that
1ms seems to be the cost of an MMU swap; the number of context
switches per second reported by vmstat(8) seems to be strongly
correlated with the number of active nfsd processes; and the system
time accumulated by nfsd processes becomes very large when there are
many context switches per second, but not otherwise.

Anybody with this problem (it helps to have servers running both SunOS
3.5 and SunOS 4.0.x) can have a look at the evidence, thanks to the
wonders of nfsstat(8), vmstat(1), ps(1) and pstat(8). In particular
'vmstat 1' (the 'r', 'b', 'cs' and 'sy' columns) and 'ps axv' (the
'TIM' and 'PAGEIN' columns) will be revealing; 'nfsstat -ns' and
'pstat -u <nfsd pid>' will give extra details (sample outputs for both
SunOS 3 and 4 available on request). The inferences that can be drawn
are obvious, even if maybe wrong.

After all I don't spend too much time second-guessing the *whys* of
Sun bogosities, contrary to appearances. I am already overwhelmed by
those in AT&T Sv386 at home... :-). Pray, tell us why the above
observed behaviour is not a bogosity, or at least what was/is the
cause, and how/if it has been obviated three years ago. My explanation
is a best guess, as should be pretty obvious; you need not guess, and
I am sure that enquiring minds want to know.

As to the details:

lm> nfs_svc()
lm>     /* Now, release client memory; we never return back to user */
lm>     relvm(u.u_procp);
lm> From the SCCS history (note the date):

The date is when the file was edited on a machine at Sun R&D... This
is slightly cheating. When did the majority of customers see this?

lm> D 2.83 87/12/15 18:34:42 kepecs 88 87
lm> remove virtual memory in async_daemon and nfs_svc as
lm> it's not needed. Remove pre-2.0 code to set rdir in nfs_svc.
lm> make sure these guys exit if error, since no vm to return to.

Note that this was well known to me, and I did write that the nfsds
are *kernel* based processes, and used to run in the kernel's context.
My suspicion is that they now (SunOS 4), either via 'bread()' or
directly, do VM mapped IO to satisfy remote requests and thus require
page table swaps, which causes problems on machines with few contexts.
For sure they are reported as having a lot of page-ins in SunOS 4,
both by 'pstat -u' and 'ps axv', while they were reported to have a
lot of IO transactions in SunOS 3. It's curious that processes that do
not have an address space are doing page-ins. Maybe some kind of
address space they do have... :-)

lm> In other words, this problem went away 3 years ago, never to return.

Much software here is three years old... Same for a lot of people out
there. Also, if Sun R&D corrected the mistake three years ago on their
internal systems, it may take well over three years before it
percolates to some machines in the field. There are quite a few people
still running SunOS 3 out there (because SunOS 4.0, for this and other
reasons, performed so poorly that they have preferred to stay with an
older release, and are too scared to go on to SunOS 4.1.1 even if
admittedly it is vastly improved).

One amusing note though: one of the servers here has recently been put
on 4.1, which is still not the latest and greatest, and it still shows
appallingly high system time overheads directly proportional to NFS
load, but with an important difference: the number of context switches
per second reported by vmstat(1) is no longer appallingly high, even
though it counts the nfs daemons in the runnable and blocked
categories. What's going on? I have the suspicion that the number of
context switches per second now simply excludes those for the nfsd
processes.

Final note: as usual, I want to remind everybody that I am essentially
just a guest for News and Mail access at this site, and therefore none
of my postings should reflect on the reputation of the research
performed by the Coleg Prifysgol Cymru in any way. I mention my
observations of their systems solely because they are those at hand.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk
dhesi%cirrusl@oliveb.ATC.olivetti.com (Rahul Dhesi) (03/07/91)
Here are some practical results. Each machine listed below had 8 nfsds running originally. Results of increasing the numbers are shown.

Sun-3/180: Increased to 16. Result: system nearly dead of high load on CPU; NFS response much worse and frequent "NFS server ... not responding" messages on clients.

Sun-3/280: Increased to 16. Result: system much more sluggish; CPU response very poor (but not dead); NFS response not noticeably changed.

Sun-4/280: Increased to 32. Result: system very sluggish; CPU response much poorer (nearly dead but not quite); NFS response not noticeably changed.

Sun-4/490: Increased to 32. Result: system more sluggish; CPU response worse but not as bad as with the above machines; NFS response not noticeably changed, but might be slightly better.

One of the adverse effects of a large number of nfsds under SunOS 4.0.3 is that if an executing binary is ever deleted and the server then reboots, or if a mounted filesystem is unmounted and replaced by another while a client was executing from it, the resulting NFS traffic becomes VERY high and adversely affects both server and client. The more nfsds there are, the closer the server is to being completely dead at this point.
--
Rahul Dhesi <dhesi%cirrusl@oliveb.ATC.olivetti.com>
UUCP:  oliveb!cirrusl!dhesi
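[For readers who haven't tuned this themselves: the daemon counts in experiments like the above are set where nfsd is started at boot, in /etc/rc.local on SunOS 4.x. A rough sketch of that fragment (paths and echo style vary by release; treat this as illustrative, not a verbatim copy of any particular rc.local):]

```shell
# Excerpt in the style of a SunOS 4.x /etc/rc.local.  The numeric
# argument is the number of nfsd server processes forked; this is
# the knob being varied in the experiments above.
if [ -f /usr/etc/nfsd ]; then
        nfsd 8;         echo -n ' nfsd'         # default 8; Dhesi's tests raised this
fi
```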
guy@auspex.auspex.com (Guy Harris) (03/09/91)
>>>If you have too many processes competing for the limited slots in the
>>>hardware context cache, your machine will roll over and die.
>
>>Given that nfsd runs in kernel mode inside nfssvc(), is this statement
>>about contexts correct? If so, why is the default number of nfsds for
>>Sun 3s 8?
>
>1) Hardware contexts are a feature of the MMU/Instruction cache, and so
>Greg's comment is specific to Sun4 machines.

Eh? Hardware contexts aren't specific to Sun-4 machines; the *number* of hardware contexts in various Sun machines that have them (not all do; the Sun386i and 68030-based Sun-3's don't) is specific to the machine - Sun-3s that have them have 8 contexts, some SPARC-based Suns have 16, others have 8, others have 64.

(The "instruction cache" hasn't anything to do with it; the only Suns I know of with instruction caches are the Sun-3s, with the on-chip I caches, and *maybe* the 4/110, due to the way the software allocates physical addresses for pages so as not to make the "static column RAM" "cache" thrash. The cache in the Suns I know about is a unified I/D cache.)

>2) The nfssvc() system call never returns, but the process slot of the caller
>is used as a cheap way to implement multi-threading in the kernel. When an
>nfsd runs out of work, it does a sleep() waiting for requests to come in on
>the socket bound to UDP port 2049. When it handles a request involving
>disk I/O, it sleeps waiting for the result. All nice straightforward kernel
>stuff provided that you have a process table entry to work with, but
>impossible if you don't. Using a kernel like Mach or Chorus it would be done
>with true multi-threading of the code, so lightweight process switches would
>be the only overhead; under most UNIX systems it takes the full context switch
>and hence can thrash the hardware contexts (if applicable).

It may not be applicable, *even on machines with hardware contexts*.
As Larry McVoy has already noted, as of SunOS 4.0, the NFS server processes release their address space; this means that, when a context switch is done to such a process, the system doesn't bother switching the hardware context on SPARC-based Suns, or switches to context 0 (the kernel context) on Sun-3s with 68020s. I.e., it won't thrash the hardware contexts....