[comp.protocols.nfs] how many nfsd's should I run?

anselmo-ed@CS.YALE.EDU (Ed Anselmo) (02/22/91)

Is there a magic formula for determining how many nfsd's to run?

"System Performance Tuning" by Loukides says "...4 is appropriate for
most situations.  The effects of changing this number are hard to
predict and are often counterintuitive.  For various reasons that are
beyond the scope of this book, increasing the number of daemons can
often degrade network performance."

The SunOS 4.1.1 man page suggests 8 nfsd's.

What happens when you run too many nfsd's?  Too few?

In our case, we're running Sun 4/390 servers, typically with 2 1 GB
IPI disks on one controller.  Each server serves around 20
"sorta-standalone" Sun 4/60 clients (/, swap, and /usr are on the
client, /home is on the server).
-- 
Ed Anselmo   anselmo-ed@cs.yale.edu   {harvard,cmcl2}!yale!anselmo-ed

thurlow@convex.com (Robert Thurlow) (02/22/91)

In <28975@cs.yale.edu> anselmo-ed@CS.YALE.EDU (Ed Anselmo) writes:

>Is there a magic formula for determining how many nfsd's to run?

Surely a question for the FAQ, except that there's no definitive
answer :-) Someone I was talking to got a rule-of-thumb from a Sun
tech support person that said, "Run one nfsd for each local disk
controller plus one."  If your nfsd's can keep all the local disk
arms going, that's pretty good, and then you add another one.
Fewer nfsd's mean you lose potential bandwidth, while many more
mean you just have a lot of nfsd's waiting on disk.

Rob T
--
Rob Thurlow, thurlow@convex.com
An employee and not a spokesman for Convex Computer Corp., Dallas, TX

gl8f@astsun7.astro.Virginia.EDU (Greg Lindahl) (02/22/91)

In article <28975@cs.yale.edu> anselmo-ed@CS.YALE.EDU (Ed Anselmo) writes:
>Is there a magic formula for determining how many nfsd's to run?

I was once told that the mystical figure was "one per network
interface and one per disk with exported filesystems."

>What happens when you run too many nfsd's?  Too few?

If you have too many processes competing for the limited slots in the
hardware context cache, your machine will roll over and die. You can
look up this number in your hardware manuals somewhere. For low-end
sun4's the number is 8. I run 4 nfsd's on such machines. The same
problem can bite you with too many biods.

Another good thing to check is NFS timeouts... nfsstat(8C) will tell
you if the server is responding sufficiently quickly. I raised my
timeout until I saw zero timeouts.
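
For what it's worth, here is a rough sketch of how I read those client-side
counters (just arithmetic on the "calls", "timeout" and "badxid" figures from
"nfsstat -c"; the 5% threshold and the badxid comparison are rules of thumb
I've seen quoted, not anything official):

    #include <stdio.h>

    int
    main(void)
    {
        long calls, timeouts, badxid;

        printf("calls, timeouts, badxid? ");
        if (scanf("%ld %ld %ld", &calls, &timeouts, &badxid) != 3 ||
            calls <= 0)
            return 1;

        /* more than ~5% of calls timing out is worth worrying about */
        printf("timeout rate: %.2f%%\n", 100.0 * timeouts / calls);
        if (timeouts * 20 > calls) {
            if (badxid * 2 > timeouts)   /* late replies: server is slow */
                printf("server looks slow; try a larger timeo\n");
            else                         /* no late replies: requests lost */
                printf("requests look lost; suspect the network\n");
        } else
            printf("timeout rate looks acceptable\n");
        return 0;
    }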

Disclaimer: I'm not an expert, but I once asked for help on this very
topic here and nobody answered ;-)

richard@aiai.ed.ac.uk (Richard Tobin) (02/23/91)

In article <1991Feb22.012532.26075@murdoch.acc.Virginia.EDU> gl8f@astsun7.astro.Virginia.EDU (Greg Lindahl) writes:
>If you have too many processes competing for the limited slots in the
>hardware context cache, your machine will roll over and die. You can
>look up this number in your hardware manuals somewhere. For low-end
>sun4's the number is 8. I run 4 nfsd's on such machines. The same
>problem can bite you with too many biods.

Given that nfsd runs in kernel mode inside nfssvc(), is this statement
about contexts correct?  If so, why is the default number of nfsds for
Sun 3s 8?

-- Richard
-- 
Richard Tobin,                       JANET: R.Tobin@uk.ac.ed             
AI Applications Institute,           ARPA:  R.Tobin%uk.ac.ed@nsfnet-relay.ac.uk
Edinburgh University.                UUCP:  ...!ukc!ed.ac.uk!R.Tobin

john@iastate.edu (Hascall John Paul) (02/24/91)

In article <28975@cs.yale.edu> anselmo-ed@CS.YALE.EDU (Ed Anselmo) writes:
}Is there a magic formula for determining how many nfsd's to run?

    So far, we've seen four answers:

        -- 4
        -- 8
        -- (1 * n_disk_interfaces) + 1
        -- (1 * n_disks_exported) + (1 * n_network_interfaces)

   Now, consider our situation (DEC 5000, 2 SCSI controllers each with
4 x 1GB disks, 1 net), that gives:

        --  4
        --  8
        --  3 = (1 * 2) + 1
        -- 10 = (1 * 8) + (1 * 1)

   We are running 20, which is almost assuredly too many.  Does anyone
have a solid answer for this question?  Also, does the number of clients
to be served have any impact?  Perhaps a true answer can only be
obtained empirically?  Is there such a thing as a test-client which
makes a series of NFS RPC calls to determine response-time stats?
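
The closest thing I can picture is a little program that just times the NFS
NULL procedure against a server, using the standard ONC RPC library.  Here is
an untested sketch (it measures RPC turnaround at the server, not real disk
traffic, and the call count is arbitrary, so treat the numbers accordingly):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/time.h>
    #include <rpc/rpc.h>

    #define NFS_PROG   ((u_long)100003)   /* NFS program number */
    #define NFS_VERS   ((u_long)2)        /* NFS version 2      */
    #define NCALLS     100

    int
    main(int argc, char **argv)
    {
        CLIENT *cl;
        struct timeval t0, t1, tout;
        double ms, total = 0.0, worst = 0.0;
        int i;

        if (argc != 2) {
            fprintf(stderr, "usage: %s server\n", argv[0]);
            exit(1);
        }
        if ((cl = clnt_create(argv[1], NFS_PROG, NFS_VERS, "udp")) == NULL) {
            clnt_pcreateerror(argv[1]);
            exit(1);
        }
        tout.tv_sec = 10;
        tout.tv_usec = 0;

        for (i = 0; i < NCALLS; i++) {
            gettimeofday(&t0, (struct timezone *)0);
            /* NULLPROC is the do-nothing "ping" procedure that every
             * RPC service, NFS included, is required to provide. */
            if (clnt_call(cl, NULLPROC, (xdrproc_t)xdr_void, (char *)0,
                          (xdrproc_t)xdr_void, (char *)0, tout)
                != RPC_SUCCESS) {
                clnt_perror(cl, argv[1]);
                exit(1);
            }
            gettimeofday(&t1, (struct timezone *)0);
            ms = (t1.tv_sec - t0.tv_sec) * 1000.0 +
                 (t1.tv_usec - t0.tv_usec) / 1000.0;
            total += ms;
            if (ms > worst)
                worst = ms;
        }
        printf("%d NULL calls to %s: average %.2f ms, worst %.2f ms\n",
               NCALLS, argv[1], total / NCALLS, worst);
        return 0;
    }

Running it before and after changing the number of nfsd's would at least give
comparable numbers.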

   I am sure many could benefit from such information [FAQ?].

Thanks,
John Hascall <john@iastate.edu>
--
John Hascall                        An ill-chosen word is the fool's messenger.
Project Vincent
Iowa State University Computation Center                       john@iastate.edu
Ames, IA  50011                                                  (515) 294-9551

thurlow@convex.com (Robert Thurlow) (02/24/91)

In <1991Feb24.025821.11354@news.iastate.edu> john@iastate.edu (Hascall John Paul) writes:

>Does anyone have a solid answer for this question?

I think that part of the problem is that nobody has done a really
good job at finding an answer, and part is that there just
isn't _one_ answer: there's one for each job mix, machine
pair, and user community.  I was reminded of a reason to keep the
numbers down that I had forgotten - SunOS will wake up all of the
processes sleeping on the NFS input request socket, and the first
one will get in.  That could be a lot of process jostling.  Since
the nfsd's on a Convex sleep on a queued semaphore, this isn't an
issue on our machines.

Rob T
--
Rob Thurlow, thurlow@convex.com
An employee and not a spokesman for Convex Computer Corp., Dallas, TX

chucka@cup.portal.com (Charles - Anderson) (02/24/91)

> Is there a magic formula for determining how many nfsd's to run?
> 
> "System Performance Tuning" by Loukides says "...4 is appropriate for
> most situations.  The effects of changing this number are hard to
> predict and are often counterintuitive.  For various reasons that are
> beyond the scope of this book, increasing the number of daemons can
> often degrade network performance."
> 
> The SunOS 4.1.1 man page suggests 8 nfsd's.
> 
> What happens when you run too many nfsd's?  Too few?
> 
> In our case, we're running Sun 4/390 servers, typically with 2 1 GB
> IPI disks on one controller.  Each server serves around 20
> "sorta-standalone" Sun 4/60 clients (/, swap, and /usr are on the
> client, /home is on the server).
> -- 
> Ed Anselmo   anselmo-ed@cs.yale.edu   {harvard,cmcl2}!yale!anselmo-ed

We are running 50-60 clients using DOS PCNFS with 4-6 mounts each.
We picked 16.  Seems to be working pretty well.  We have 16 Meg of memory
and 1 GB of disk.  I think we could get by with 12.  But no one is
complaining about performance, which can go down when the machine
starts swapping.

Chuck Anderson

davecb@yunexus.YorkU.CA (David Collier-Brown) (02/25/91)

In <1991Feb24.025821.11354@news.iastate.edu> john@iastate.edu (Hascall John Paul) writes:
| Does anyone have a solid answer for this question?

thurlow@convex.com (Robert Thurlow) writes:
| I think that part of the problem is that nobody has done a really
| good job at finding an answer, and part is that there just
| isn't _one_ answer: there's one for each job mix, machine
| pair, and user community.

This can be determined empirically with fair ease on a machine which is
swapping/paging via NFS: log in and start X windows and some applications.
We did it with three people (on three workstations) in a couple of hours
without even needing a quiescent net.

--dave
-- 
David Collier-Brown,  | davecb@Nexus.YorkU.CA | lethe!dave
72 Abitibi Ave.,      | 
Willowdale, Ontario,  | Even cannibals don't usually eat their
CANADA. 416-223-8968  | friends. 

G.Eustace@massey.ac.nz (Glen Eustace) (02/25/91)

The appropriate number of nfsds is definitely environment-dependent
and, for that matter, application (client) dependent.  We followed the
recommendations of our SE when our Pyramid 9815 was installed, with 3
disks and 2 ethernets.  Whatever formula was applied (a guess is as
good as any :-) ), the number arrived at was 8.  After running this
way for over a year and having continual problems with slow network
response and insufficient CPU available on the server, a performance
analysis was conducted.  Now for the counter-intuitive part !!  We
have reduced the number of nfsds to 4 and most of our problems have
gone away.  The network performance is better ( but could still be
improved ) and we have sufficient CPU left over to do some other
useful activities like servicing mount requests and pcnfs
authentication and print requests etc :-;.

From this experience we have stuck to 4 nfsds and where appropriate 4
biods on all other machines, DEC3100s and a DEC5000.  We have found
this to be satisfactory.

The moral of the story:  in choosing the number of nfsds, the other
processor activity needs to be considered.

-- 
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
Glen Eustace, Software Manager, Computer Centre, Massey University,
Palmerston North, New Zealand.        EMail: G.Eustace@massey.ac.nz
Phone: +64 63 69099 x7440, Fax: +64 63 505 607,    Timezone: GMT-12

liam@cs.qmw.ac.uk (William Roberts;) (02/26/91)

In <4218@skye.ed.ac.uk> richard@aiai.ed.ac.uk (Richard Tobin) writes:

>In article <1991Feb22.012532.26075@murdoch.acc.Virginia.EDU> 
gl8f@astsun7.astro.Virginia.EDU (Greg Lindahl) writes:
>>If you have too many processes competing for the limited slots in the
>>hardware context cache, your machine will roll over and die.

>Given that nfsd runs in kernel mode inside nfssvc(), is this statement
>about contexts correct?  If so, why is the default number of nfsds for
>Sun 3s 8?

1) Hardware contexts are a feature of the MMU/Instruction cache, and so Greg's 
comment is specific to Sun4 machines.

2) The nfssvc() system call never returns, but the process slot of the caller 
is used as a cheap way to implement multi-threading in the kernel. When an 
nfsd runs out of work, it does a sleep() waiting for requests to come in on 
the socket bound to UDP port 2049. When it handles a request involving 
disk I/O, it sleeps waiting for the result. All nice straightforward kernel 
stuff provided that you have a process table entry to work with, but 
impossible if you don't. Using a kernel like Mach or Chorus it would be done 
with true multi-threading of the code, so lightweight process switches would 
be the only overhead; under most UNIX systems it takes the full context switch 
and hence can thrash the hardware contexts (if applicable).
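
For anyone who wants to see the shape of that arrangement, here is a rough
user-level caricature: several identical forked processes all sleep in
recvfrom() on one UDP socket, and whichever one the kernel wakes first takes
the request.  The port number and the trivial echo "service" are made up for
the demonstration; this is not NFS code.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    #define NWORKERS 4
    #define PORT     10049          /* made-up port, not 2049 */

    int
    main(void)
    {
        struct sockaddr_in sin;
        int s, i;

        if ((s = socket(AF_INET, SOCK_DGRAM, 0)) < 0) {
            perror("socket");
            exit(1);
        }
        memset(&sin, 0, sizeof sin);
        sin.sin_family = AF_INET;
        sin.sin_addr.s_addr = htonl(INADDR_ANY);
        sin.sin_port = htons(PORT);
        if (bind(s, (struct sockaddr *)&sin, sizeof sin) < 0) {
            perror("bind");
            exit(1);
        }

        for (i = 0; i < NWORKERS; i++) {
            if (fork() == 0) {      /* each child is one "daemon" */
                char buf[8192];
                struct sockaddr_in from;
                socklen_t fromlen;
                int n;

                for (;;) {
                    fromlen = sizeof from;
                    n = recvfrom(s, buf, sizeof buf, 0,
                                 (struct sockaddr *)&from, &fromlen);
                    if (n < 0)
                        continue;
                    /* a real nfsd would decode the request and do the
                     * disk I/O here, sleeping again while it waits */
                    printf("worker %d took a %d-byte request\n", i, n);
                    sendto(s, buf, n, 0,
                           (struct sockaddr *)&from, fromlen);
                }
            }
        }
        for (;;)
            pause();                /* parent just sits there */
    }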


Historical note: prior to SunOS 3.2, there was no wakeupone() routine and so 
ALL sleeping nfsd processes would be made runnable when a request came in. One 
would get to handle the request, and the others would needlessly run, find 
nothing to do and go back to sleep. Under this scheme it was possible to find 
that the file server got slower at night, and needless to say it was 
impossible to choose a "good number" since there were penalties for having too 
many for the current load. I once tried 30 nfsds on a WhiteChapel MG1 server: 
it stopped.

Nowadays only one nfsd will wake up for an incoming request, hence the change 
in "good number" from "4" to "4 or more". One plausible way to determine how 
many nfsds you need is to ask for lots, then see how many use up CPU time (NB. 
this depends on sleep queues being managed unfairly, so it might not work 
these days).
--

William Roberts                 ARPA: liam@cs.qmw.ac.uk
Queen Mary & Westfield College  UUCP: liam@qmw-cs.UUCP
Mile End Road                   AppleLink: UK0087
LONDON, E1 4NS, UK              Tel:  071-975 5250 (Fax: 081-980 6533)

jim@cs.strath.ac.uk (Jim Reid) (02/27/91)

In article <thurlow.667370019@convex.convex.com> thurlow@convex.com (Robert Thurlow) writes:

   .........................  I was reminded of a reason to keep the
   numbers down that I had forgotten - SunOS will wake up all of the
   processes sleeping on the NFS input request socket, and the first
   one will get in.  That could be a lot of process jostling. 

This was certainly true for the early NFS implementations: say around
the time of SunOS 2.0. Later implementations included a new kernel
function - wakeup_one() - which will only wake up one process waiting
on an event instead of every process waiting on that event. This is
intended to save the overheads of scheduling N nfsd processes only to
see N-1 of them immediately go back to sleep. The routine is still in
SunOS 4.[01].

		Jim

ckollars@deitrick.East.Sun.COM (Chuck Kollars - Sun Technical Marketing - Boston) (02/27/91)

Both Mike Loukides' book "System Performance Tuning" and Hal Stern's
presentation at SUG have raised interest in the question of how many
nfsd's/biod's.  Here's my summary of public and private answers:

In article <28975@cs.yale.edu> anselmo-ed@CS.YALE.EDU (Ed Anselmo) writes:
>Is there a magic formula for determining how many nfsd's to run?

There's a magic procedure for adjusting the number of nfsd's, and
several magic formulas for picking a starting value.

Q: So what's the magic procedure?

Run 'netstat -s' a few times, and look at the number of "socket
overflows" in the "udp:" category.  If it's growing, run more nfsd's.
If it's zero, you're okay.  (If it's large but not changing, probably
somebody ran 'spray' on your network yesterday.)
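
If you want to automate that check, something along these lines will do (a
rough sketch only: it samples the counter twice through popen() and reports
whether it grew; the string it matches is what SunOS 4.x netstat prints, and
other systems may word it differently):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    /* return the UDP "socket overflows" count from netstat -s, or -1 */
    static long
    socket_overflows(void)
    {
        FILE *fp = popen("netstat -s", "r");
        char line[256];
        long n = -1;

        if (fp == NULL)
            return -1;
        while (fgets(line, sizeof line, fp) != NULL)
            if (strstr(line, "socket overflows") != NULL)
                sscanf(line, " %ld", &n);
        pclose(fp);
        return n;
    }

    int
    main(void)
    {
        long before = socket_overflows(), after;

        if (before < 0) {
            fprintf(stderr, "couldn't read netstat output\n");
            return 1;
        }
        sleep(60);                  /* let some NFS traffic go by */
        after = socket_overflows();
        printf("socket overflows: %ld -> %ld\n", before, after);
        if (after > before)
            printf("still growing: try more nfsd's\n");
        else
            printf("no new overflows: the nfsd count is probably fine\n");
        return 0;
    }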

Q: So what are the magic formulas for a starting value?

Variation 1-- #(disk controllers) + 1
Variation 2-- #(disk controllers) + #(ethernet interfaces)
Variation 3-- #(disk controllers) * min(4, #(spindles/controller))
Variation 4-- 4 for a small server, 8 for a midsize server, 16 for a large server
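
For the arithmetically lazy, here is a toy calculator for those four rules of
thumb.  The controller, spindle and interface counts below are just an
example; plug in whatever your server actually has, and note that variation 4
still leaves "small/midsize/large" to your own judgement:

    #include <stdio.h>

    int
    main(void)
    {
        int controllers = 2;        /* example: 2 disk controllers   */
        int spindles    = 4;        /*          4 spindles on each   */
        int ethernets   = 1;        /*          1 ethernet interface */

        printf("variation 1: %d\n", controllers + 1);
        printf("variation 2: %d\n", controllers + ethernets);
        printf("variation 3: %d\n",
               controllers * (spindles < 4 ? spindles : 4));
        printf("variation 4: 4, 8 or 16, by server size\n");
        return 0;
    }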

Q: Does the number of nfsd's have to be a multiple of 2?

No, there can be any number of nfsd's.  

>"System Performance Tuning" by Loukides says "...4 is appropriate for
>most situations.  The effects of changing this number are hard to
>predict and are often counterintuitive.  For various reasons that are
>beyond the scope of this book, increasing the number of daemons can
>often degrade network performance."
>
>The SunOS 4.1.1 man page suggests 8 nfsd's.
>
>What happens when you run too many nfsd's?  

Server performance may suffer.  

The amount of degradation depends on your hardware, your software, and
just how many "extra" nfsd's you have.  In most cases the penalty for
erring on the high side is small.

On machines with hardware support of multiple contexts (ex:
Sun4/SPARC), running more nfsd's than there are hardware contexts will
force context switching into software.  On machines without software
support for signaling semaphores, an NFS request will wake up _all_ the
sleeping nfsd's, which will stumble all over each other resulting in a
lot of unnecessary context switches and degraded server performance.

>Too few?

The RPC queue of NFS requests will overflow, resulting in timeouts,
retransmissions, backoffs, and hence noticeably degraded client
performance.

This serious effect can make a large expensive server act like a boat
anchor.

Q: How important is it to get the number of nfsd's exactly right?

Not very.  To quote Hal Stern, "This fine-tuning may yield minor
improvements, but major performance problems can usually be traced to
other bottlenecks or problems."

Q: Why do some people run very large numbers of nfsd's, 48 or even 64?

'nhfsstone' benchmark results can often be improved by running a very
large number of nfsd's on the server under test.  Doing this proves
that it's possible.  Doing this does not prove that it's helpful on a
production server with a mixed or bursty workload.

Q: What if the NFS requests are very bursty?  Running enough nfsd's to
   handle the peaks seems to imply running way too many nfsd's most of 
   the time.

NFS requests are seldom so bursty that this is a problem.  If it is,
you could increase "udp_recvspace" in sys/netinet/in_proto.c and
rebuild your kernel.

Q: Does the number of nfsd's being run affect other things?

Yes, be sure MAXUSERS in your kernel config allocates enough kernel
resources to run all your nfsd's.  Despite what its name seems to
imply, MAXUSERS is a "large knob" that controls the size of statically
allocated tables in the kernel.  On large servers with lots of RAM,
turn the knob "way up", otherwise those nfsd's may overflow some kernel
table.

Q: Is there a similar magic formula or procedure for biod's?

Yes, always run the default 4 biod's per client.  Changing the number
of biod's is usually counterproductive.  Reducing the number could
cripple client performance.  Increasing the number could cripple
network or server performance.  Changing the number of biod's on the
clients either way will require adjusting the number of nfsd's on the
server.
---
chuck kollars    <ckollars@East.Sun.COM>
Sun Technical Marketing, located in Sun's Boston Development Center

jstampfl@iliad.West.Sun.COM (John Stampfl - Sun Silicon Valley SE) (03/01/91)

>Q: So what are the magic formulas for a starting value?
>
>Variation 1-- #(disk controllers) + 1
>Variation 2-- #(disk controllers) + #(ethernet interfaces)
>Variation 3-- #(disk controllers) * min(4, #(spindles/controller))
>Variation 4-- 4 for a small server, 8 for a midsize server, 16 for a large server

Chuck,
	I have found several problems when there are not enough nfsd's: in
particular, "server not responding" messages and slow copies from a fast
client to a slow server.

Also, running with presto, or presto and OMNI, usually requires fewer nfsd's
or OMNI_nfsd's.

	I like 24 nfsd's as a starting value on all pure NFS servers; if you
are also doing compute serving, comms, or other CPU-intensive work, you may
want a smaller number.

>>What happens when you run too many nfsd's?  
>Server performance may suffer.  
I haven't seen NFS server performance suffer, but I have heard of mixed
cpu/nfs servers suffering.


>On machines with hardware support of multiple contexts (ex:
>Sun4/SPARC), running more nfsd's than there are hardware contexts will
>force context switching into software.

I believe that nfsd's run in kernel space and thus do not do hardware
context switches.
>
>Q: Why do some people run very large numbers of nfsd's, 48 or even 64?
>
>'nhfsstone' benchmark results can often be improved by running a very
>large number of nfsd's on the server under test.  Doing this proves
>that it's possible.  Doing this does not prove that it's helpful on a
>production server with a mixed or bursty workload.

I have not seen much improvement in running over 24 nfsd's.  The curve has
flattened out.  With omni and presto 16 may be enough.

One thing: if you have a system with NFS performance problems that is running
fewer than 24 nfsd's, try 8 or so more; you may also need to increase maxusers.
In my experience going to 24 nfsd's or so doesn't hurt much, except for small
memory cpu/nfs servers.  If more nfsd's don't help, then the lack of nfsd's
wasn't the problem and you can reduce the number.

Keep up the good work; maybe someday I will understand what is going on.

cornell@csl.dl.nec.com (Cornell Kinderknecht) (03/01/91)

>> Is there a magic formula for determining how many nfsd's to run?

I was talking to a Sun software support person yesterday about a different
subject, but she commented that, based on Sun's tests, there was no increase
in performance beyond 22 nfsd's.  So she had recommended a maximum of 22.

I suppose there is no problem with running more, other than that they aren't
needed.  Running too few, I would guess, might cause problems when a lot
of NFS requests come in at the same time.

--- Cornell Kinderknecht

pcg@cs.aber.ac.uk (Piercarlo Grandi) (03/01/91)

I have crossposted to comp.arch, because this is really a system/network
architecture question. NFS is almost incidental :-).

On 22 Feb 91 16:14:12 GMT, richard@aiai.ed.ac.uk (Richard Tobin) said:

richard> In article <1991Feb22.012532.26075@murdoch.acc.Virginia.EDU>
richard> gl8f@astsun7.astro.Virginia.EDU (Greg Lindahl) writes:

gl8f> If you have too many processes competing for the limited slots in the
gl8f> hardware context cache, your machine will roll over and die. You can
gl8f> look up this number in your hardware manuals somewhere. For low-end
gl8f> sun4's the number is 8. I run 4 nfsd's on such machines. The same
gl8f> problem can bite you with too many biods.

richard> Given that nfsd runs in kernel mode inside nfssvc(), is this
richard> statement about contexts correct?

Yes and no, depending on who your vendor is, and which OS revision and
machine model you have. For Sun there is some history that may be worth
mentioning. Under SunOS 3 the nfsds were in effect kernel processes, so
that they could access the buffer cache, held in the kernel address
space, without copies. Since all nfsds run in the kernel page table
there was no problem.

Under SunOS 4 the buffer cache went away, so each nfsd was given its own
address space (memory mapped IO), while still being technically a kernel
process. This meant that MMU slot thrashing was virtually guaranteed, as
the nfs daemons are activated more or less FIFO and the MMU has a LIFO
replacement policy. As soon as the number of nfsds is greater than or
equal to the number of MMU slots, problems happen.

I have seen the same server running the same load under SunOS 3 the day
before with 10-20% system time and 100-200 context switches per second,
and with SunOS 4 the day after with 80-90% system time and 800-900
context switches per second. An MMU slot swap on a Sun 3 will take about
a millisecond, which fits.

Under SunOS 4.1.1 things may well be different, as Sun may have
corrected the problem (by making all the nfsds share a single address
space and giving each of them a section of it in which to map the
relevant files, for example, or by better tuning the MMU cache
replacement policy to the nfsd activation patterns, for another
example). On larger Sun 4s there are many more MMU slots, say 64, so the
problem effectively does not happen for any sensible number of nfsds.

richard> If so, why is the default number of nfsds for Sun 3s 8?

Sun bogosity :-).


As to the general problem of how many NFS daemons, I have already posted
long treatises on the subject. However briefly the argument is:


Each nfsd is synchronous, that is it may carry out only one operation at
a time, in a cycle: read request packet, find out what it means, go to
the IO subsystem to read/write the relevant block, write the result
packet, loop.

Clearly on a server that has X network interfaces, Y CPUs, and Z disks
(if your controller supports overlapping transfers, otherwise it is the
number of controllers) there cannot be more than X+Y+Z nfsds active, as
at most X nfsds can be reading or writing a packet from a network
interface, at most Y nfsds can be running kernel code, and at most Z
nfsds can be waiting for a read or a write from a disk.

The optimum number may be lower than X+Y+Z, because it is damn unlikely
that the maximum multiprogramming level will actually be as high as
that, and there may be other processes that compete with nfsds for the
network interfaces, or the CPUs, or the disks.

It may also be higher, because this would allow multiple IO requests to
be queued waiting for a disk, thus giving the arm movement optimizer a
chance to work (if there is only ever one outstanding request per disk,
this implies a de facto FCFS arm movement policy).

The latter argument is somewhat doubtful as there is contradictory
evidence about the relative merits of FCFS and of elevator style sorting
as used by the Unix kernel.

All in all I think that X+Y+Z is a reasonable estimate, or maybe a
slightly larger number than that if you are persuaded that giving a
chance to the disk request sorter is worthwhile (which may not be true
for a remote file server, as opposed to a timesharing system where it
is almost always worthwhile).

Naturally this is only the "benefit" side of the equation. As to the
"cost" side, it used to be that nfsds had a very low cost (a proc table
slot each and little more), so slightly overallocating them was not a
big problem.  But on some OS/machine combinations the cost becomes very
large over a certain threshold, and this may mean that reducing the
number below the theoretical maximum pays off.

Finally there is the question of the Ethernet bandwidth. In the best of
cases an Ethernet interface can read about 1000 packets/s, and
write 800KB/s (we assume that requests are small, so the number of
packets/s matters, while results are large, so the number of KB/s
matters; stat(2) and read/exec(2) are far more common than write(2)).

Divide that by the number of clients that may be actively requesting
data (usually about a tenth of the total number of machines on a wire
are actively doing remote IO), and you get pretty depressing numbers.

It may be pointless to have, say, 4 server disks of 2MB/s each, capable
of doing 50 transactions per second each involving say 8-16KB, and to
have enough nfsds to take advantage of this parallelism and bandwidth,
if the Ethernet wire and interface are the bottleneck.
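
Here is a back-of-the-envelope version of the whole argument, in case anyone
wants to plug in their own numbers; the figures below are simply the examples
used above, not measurements:

    #include <stdio.h>

    int
    main(void)
    {
        int x_netifs = 1;       /* network interfaces                  */
        int y_cpus   = 1;       /* CPUs                                */
        int z_disks  = 4;       /* disks (or controllers, if transfers */
                                /* cannot be overlapped)               */
        int machines = 40;      /* machines on the wire                */
        double active   = machines / 10.0; /* ~1 in 10 actively doing IO */
        double wire_kbs = 800.0;           /* ~800 KB/s out of one ether */

        printf("X+Y+Z upper bound on active nfsds: %d\n",
               x_netifs + y_cpus + z_disks);
        printf("ethernet bandwidth per active client: about %.0f KB/s\n",
               wire_kbs / active);
        return 0;
    }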
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

lm@slovax.Berkeley.EDU (Larry McVoy) (03/02/91)

In article <PCG.91Feb28195347@odin.cs.aber.ac.uk>, 
pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
|> richard> Given that nfsd runs in kernel mode inside nfssvc(), is this
|> richard> statement about contexts correct?
|> 
|> Yes and no, depending on who is your vendor, and which OS revision and
|> machine model you have. For Sun there is some history that may be worth
|> mentioning. Under SunOS 3 the nfsds were in effect kernel processes, so
|> that they could access the buffer cache, held in the kernel address
|> space, without copies. Since all nfsds run in the kernel page table
|> there was no problem.
|> 
|> Under SunOS 4 the buffer cache went away, so each nfsd was given its own
|> address space (memory mapped IO), while still being technically a kernel
|> process. This meant that MMU slot thrashing was virtually guaranteed, as
|> the nfs daemons are activated more or less FIFO and the MMU has a LIFO
|> replacement policy. As soon as the number of nfsds is greater or equal
|> to the number of MMU slots problems happen.

More misinformation spoken in an authoritative manner.  Ah, well, I've done 
the same I guess.  From sys/nfs/nfs_server.c:

nfs_svc()
....
        /* Now, release client memory; we never return back to user */
        relvm(u.u_procp);

From the SCCS history (note the date):

	D 2.83 87/12/15 18:34:42 kepecs 88 87
	remove virtual memory in async_daemon and nfs_svc as
	it's not needed. Remove pre-2.0 code to set rdir in nfs_svc.
	make sure these guys exit if error, since no vm to return to.

In other words, this problem went away 3 years ago, never to return.

|> I have seen the same server running the same load under SunOS 3 the day
|> before with 10-20% system time and 100-200 context switches per second,
|> and with SunOS 4 the day after with 80-90% system time and 800-900
|> context switches per second. An MMU slot swap on a Sun 3 will take about
|> a millisecond, which fits.

You may well have seen this.  Jumping to the conclusion that it is caused
by NFS is false; at least, the reasons that you list are not true.

|> richard> If so, why is the default number of nfsds for Sun 3s 8?
|> 
|> Sun bogosity :-).

Piercarlo ignorance :-)

|> The latter argument is somewhat doubtful as there is contradictory
|> evidence about the relative merits of FCFS and of elevator style sorting
|> as used by the Unix kernel.

Ahem.  References, please?  I've looked at this one in detail; this
should be interesting.

---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) (03/05/91)

[ this article may have already appeared; I repost it because probably
it did not get out of the local machine; apologies if you see it more
than once ]

	[ ... on SUN NFS/MMU sys time bogosity ... ]

pcg> I have seen the same server running the same load under SunOS 3 the
pcg> day before with 10-20% system time and 100-200 context switches per
pcg> second, and with SunOS 4 the day after with 80-90% system time and
pcg> 800-900 context switches per second. An MMU slot swap on a Sun 3
pcg> will take about a millisecond, which fits.

On 1 Mar 91 21:37:30 GMT, Larry McVoy commented:

lm> You may well have seen this.  Jumping to the conclusion that it is
lm> caused by NFS is false; at least, the reasons that you list are not
lm> true.

This you say after recognizing above that the problem existed and
claiming that in recent SunOS releases it has been obviated. Now you
seem to hint that it is NFS related, buit not because of MMU context
switching.

As to me, my educated guesses are these: the bogosity appears to be strictly
correlated with the number of NFS transactions processed per second; the
overhead per transaction seems to be about 1ms, and that 1ms seems to
be the cost of an MMU swap; the number of context switches per second
reported by vmstat(8) seems to be strongly correlated with the number of
active nfsd processes; and the system time accumulated by nfsd processes
becomes very large when there are many context switches per second, but
not otherwise.

Anybody with this problem (it helps to have servers running both SunOS
3.5 and SunOS 4.0.x) can have a look at the evidence, thanks to the
wonders of nfsstat(8), vmstat(1), ps(1) and pstat(8). In particular
'vmstat 1' (the 'r', 'b', 'cs' and 'sy' columns) and 'ps axv' (the 'TIME' and
'PAGEIN' columns) will be revealing; 'nfsstat -ns' and 'pstat -u <nfsd
pid>' will give extra details (sample outputs for both SunOS 3 and 4
available on request).

The inferences that can be drawn are obvious, even if maybe wrong. After
all I don't spend too much time second guessing the *whys* of Sun
bogosities, contrary to appearances. I am already overwhelmed by those
in AT&T Sv386 at home...  :-).

Pray, tell us why the above observed behaviour is not a bogosity, or at
least what was/is the cause, and how/if it has been obviated three years
ago.  My explanation is a best guess, as should be pretty obvious; you
need not guess, and I am sure that enquiring minds want to know.


As to the details:

lm> nfs_svc()

lm>        /* Now, release client memory; we never return back to user */
lm>        relvm(u.u_procp);

lm> From the SCCS history (note the date):

The date is when the file was edited on a machine at Sun R&D... This is
slightly cheating. When did the majority of customers see this? 

lm>	D 2.83 87/12/15 18:34:42 kepecs 88 87
lm>	remove virtual memory in async_daemon and nfs_svc as
lm>	it's not needed. Remove pre-2.0 code to set rdir in nfs_svc.
lm>	make sure these guys exit if error, since no vm to return to.

Note that this was well known to me, and I did write that the nfsds are
*kernel* based processes, and used to run in the kernel's context.

My suspicion is that they now (SunOS 4), either via 'bread()' or
directly, do VM mapped IO to satisfy remote requests and thus require
page table swaps, which causes problems on machines with few contexts.
For sure they are reported having a lot of page-ins in SunOS 4, both by
'pstat -u' and 'ps axv', while they were reported to have a lot of IO
transactions in SunOS 3. It's curious that processes that do not have an
address space are doing page ins. Maybe some kind of address space they
do have... :-)

lm> In other words, this problem went away 3 years ago, never to return.

Much software here is three years old... Same for a lot of people out
there. Also, if Sun R&D corrected the mistake three years ago on their
internal systems, it may take well over three years before it percolates
to some machines in the field.

There are quite a few people still running SunOS 3 out there (because
SunOS 4.0, for this and other reasons, performed so poorly that they
have preferred to stay with an older release, and are too scared to go
on to SunOS 4.1.1 even if admittedly it is vastly improved).

One amusing note though: one of the servers here has been recently put
on 4.1, which is still not the latest and greatest, and it still shows
appallingly high system time overheads directly proportional to NFS
load, but with an important difference: the number of context switches
per second reported by vmstat(1) is no longer appallingly high, even if
it counts the nfs daemons in the runnable and blocked categories. What's
going on?  I have the suspicion that the number of context switches per
second now simply excludes those for the nfsd processes.


Final note: as usual, I want to remind everybody that I am essentially
just a guest for News and Mail access at this site, and therefore none
of my postings should reflect on the reputation of the research
performed by the Coleg Prifysgol Cymru, in any way. I mention my
observations of their systems solely because they are those at hand.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk

dhesi%cirrusl@oliveb.ATC.olivetti.com (Rahul Dhesi) (03/07/91)

Here are some practical results.  Each machine listed below had 8 nfsds
running originally.  Results of increasing the numbers are shown.

Sun-3/180: Increased to 16.  Result:  system nearly dead of high load
on CPU; NFS response much worse and frequent "NFS server ... not
responding" messages on clients.

Sun-3/280: Increased to 16.  Result:  system much more sluggish; CPU
response very poor (but not dead); NFS response not noticeably
changed.

Sun-4/280: Increased to 32.  Result:  system very sluggish; CPU
response much poorer (nearly dead but not quite); NFS response not
noticeably changed.

Sun-4/490: Increased to 32.  Result: system more sluggish; CPU response
worse but not as bad as with above machines; NFS response not
noticeably changed, but might be slightly better.

One of the adverse effects of a large number of nfsds under SunOS 4.0.3
is that if an executing binary is ever deleted, and then the server
reboots, or if a mounted filesystem is unmounted and replaced by
another while a client was executing from it, the resulting NFS traffic
becomes VERY high and adversely affects both server and client.  The
more nfsds there are, the closer the server is to being completely dead
at this point.
--
Rahul Dhesi <dhesi%cirrusl@oliveb.ATC.olivetti.com>
UUCP:  oliveb!cirrusl!dhesi

guy@auspex.auspex.com (Guy Harris) (03/09/91)

>>>If you have too many processes competing for the limited slots in the
>>>hardware context cache, your machine will roll over and die.
>
>>Given that nfsd runs in kernel mode inside nfssvc(), is this statement
>>about contexts correct?  If so, why is the default number of nfsds for
>>Sun 3s 8?
>
>1) Hardware contexts are a feature of the MMU/Instruction cache, and so Greg's 
>comment is specific to Sun4 machines.

Eh?  Hardware contexts aren't specific to Sun-4 machines; the *number*
of hardware contexts in various Sun machines that have them (not all do;
the Sun386i and 68030-based Sun-3's don't) is specific to the machine -
Sun-3s that have them have 8 contexts, some SPARC-based Suns have 16,
others have 8, others have 64.  (The "instruction cache" hasn't anything
to do with it; the only Suns I know of with instruction caches are the
Sun-3s, with the on-chip I caches and *maybe* the 4/110, due to the way
the software allocates physical addresses for pages so as not to make
the "static column RAM" "cache" thrash.  The cache in the Suns I
know about is a unified I/D cache.)

>2) The nfssvc() system call never returns, but the process slot of the caller 
>is used as a cheap way to implement multi-threading in the kernel. When an 
>nfsd runs out of work, it does a sleep() waiting for requests to come in on 
>the socket bound to UDP port 2049. When it handles a request involving 
>disk I/O, it sleeps waiting for the result. All nice straightforward kernel 
>stuff provided that you have a process table entry to work with, but 
>impossible if you don't. Using a kernel like Mach or Chorus it would be done 
>with true multi-threading of the code, so lightweight process switches would 
>be the only overhead; under most UNIX systems it takes the full context switch 
>and hence can thrash the hardware contexts (if applicable).

It may not be applicable, *even on machines with hardware contexts*.  As
Larry McVoy has already noted, as of SunOS 4.0, the NFS server processes
release their address space; this means that, when a context switch is
done to such a process, the system doesn't bother switching the hardware
context on SPARC-based Suns, or switches to context 0 (the kernel
context) on Sun-3s with 68020s.  I.e., it won't thrash the hardware
contexts....