[comp.arch] how many nfsd's should I run?

pcg@cs.aber.ac.uk (Piercarlo Grandi) (03/01/91)

I have crossposted to comp.arch, because this is really a system/network
architecture question. NFS is almost incidental :-).

On 22 Feb 91 16:14:12 GMT, richard@aiai.ed.ac.uk (Richard Tobin) said:

richard> In article <1991Feb22.012532.26075@murdoch.acc.Virginia.EDU>
richard> gl8f@astsun7.astro.Virginia.EDU (Greg Lindahl) writes:

gl8f> If you have too many processes competing for the limited slots in the
gl8f> hardware context cache, your machine will roll over and die. You can
gl8f> look up this number in your hardware manuals somewhere. For low-end
gl8f> sun4's the number is 8. I run 4 nfsd's on such machines. The same
gl8f> problem can bite you with too many biods.

richard> Given that nfsd runs in kernel mode inside nfssvc(), is this
richard> statement about contexts correct?

Yes and no, depending on who your vendor is, and which OS revision and
machine model you have. For Sun there is some history that may be worth
mentioning. Under SunOS 3 the nfsds were in effect kernel processes, so
that they could access the buffer cache, held in the kernel address
space, without copies. Since all nfsds run in the kernel page table
there was no problem.

Under SunOS 4 the buffer cache went away, so each nfsd was given its own
address space (memory mapped IO), while still being technically a kernel
process. This meant that MMU slot thrashing was virtually guaranteed, as
the nfs daemons are activated more or less FIFO and the MMU has a LIFO
replacement policy. As soon as the number of nfsds is greater than or
equal to the number of MMU slots, problems happen.

I have seen the same server running the same load under SunOS 3 the day
before with 10-20% system time and 100-200 context switches per second,
and with SunOS 4 the day after with 80-90% system time and 800-900
context switches per second. An MMU slot swap on a Sun 3 will take about
a millisecond, which fits: 800-900 swaps at a millisecond each account
for 80-90% of every second.

Under SunOS 4.1.1 things may well be different, as Sun may have
corrected the problem (by making all the nfsds share a single address
space and giving each of them a section of it in which to map the
relevant files, for example, or by better tuning the MMU cache
replacement policy to the nfsd activation patterns, for another
example). On larger Sun 4s there are many more MMU slots, say 64, so the
problem effectively does not happen for any sensible number of nfsds.

richard> If so, why is the default number of nfsds for Sun 3s 8?

Sun bogosity :-).


As to the general problem of how many NFS daemons, I have already posted
long treatises on the subject. However, briefly, the argument is:


Each nfsd is synchronous, that is, it may carry out only one operation at
a time, in a cycle: read request packet, find out what it means, go to
the IO subsystem to read/write the relevant block, write the result
packet, loop.
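
Just to make the cycle concrete, here is a toy sketch in C of such a
synchronous service loop. It is emphatically not Sun's nfsd and not the
NFS protocol: the toy_request format, the port number and the 8KB
transfer size are all invented for the illustration, and byte order and
error replies are ignored to keep it short.

/* svc_loop.c -- a toy, synchronous, one-request-at-a-time service loop.
 * This is NOT Sun's nfsd and NOT the NFS protocol; it only illustrates
 * the read-request / decode / do-the-IO / write-reply cycle described
 * above.  The wire format (an offset and a length) is invented.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>

#define SVC_PORT  10049         /* arbitrary port, not the NFS port (2049) */
#define MAX_REPLY 8192          /* pretend 8KB transfer size */

struct toy_request {            /* invented request format */
    long offset;                /* where to read in the served file */
    long length;                /* how much to read, at most MAX_REPLY */
};

int main(int argc, char **argv)
{
    int sock, fd;
    struct sockaddr_in addr, client;
    socklen_t clen;
    struct toy_request req;
    char reply[MAX_REPLY];

    if (argc != 2) {
        fprintf(stderr, "usage: %s file-to-serve\n", argv[0]);
        exit(1);
    }
    if ((fd = open(argv[1], O_RDONLY)) < 0) {
        perror("open"); exit(1);
    }
    if ((sock = socket(AF_INET, SOCK_DGRAM, 0)) < 0) {
        perror("socket"); exit(1);
    }
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(SVC_PORT);
    if (bind(sock, (struct sockaddr *) &addr, sizeof addr) < 0) {
        perror("bind"); exit(1);
    }

    for (;;) {
        ssize_t n;

        /* 1. read a request packet (blocks on the network interface) */
        clen = sizeof client;
        if (recvfrom(sock, &req, sizeof req, 0,
                     (struct sockaddr *) &client, &clen) < 0)
            continue;

        /* 2. find out what it means (CPU work) */
        if (req.length < 0 || req.length > MAX_REPLY)
            req.length = MAX_REPLY;

        /* 3. go to the IO subsystem for the block (blocks on the disk) */
        n = pread(fd, reply, (size_t) req.length, (off_t) req.offset);
        if (n < 0)
            n = 0;

        /* 4. write the result packet (blocks on the interface again) */
        sendto(sock, reply, (size_t) n, 0,
               (struct sockaddr *) &client, clen);

        /* 5. loop; at no point can this daemon overlap two requests */
    }
}

The point is simply that steps 1, 3 and 4 each block the daemon in turn,
which is what the argument below rests on.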

Clearly on a server that has X network interfaces, Y CPUs, and Z disks
(if your controller supports overlapping transfers, otherwise it is the
number of controllers) there cannot be more than X+Y+Z nfsds active, as
at most X nfsds can be reading or writing a packet from a network
interface, at most Y nfsds can be running kernel code, and at most Z
nfsds can be waiting for a read or a write from a disk.

The optimum number may be lower than X+Y+Z, because it is damn unlikely
that the maximum multiprogramming level will actually be as high as
that, and there may be other processes that compete with nfsds for the
network interfaces, or the CPUs, or the disks.

It may also be higher, because this would allow multiple IO requests to
be queued waiting for a disk, thus giving the arm movement optimizer a
chance to work (if there is only ever one outstanding request per disk,
this implies a de facto FCFS arm movement policy).

The latter argument is somewhat doubtful as there is contradictory
evidence about the relative merits of FCFS and of elevator style sorting
as used by the Unix kernel.

All in all I think that X+Y+Z is a reasonable estimate, or maybe a
slightly larger number than that if you are persuaded that giving a
chance to the disk request sorter is worthwhile (which may not be true
for a remote file server, as opposed to a timesharing system where it
is almost always worthwhile).
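
For the arithmetically inclined, the rule of thumb amounts to no more
than the little calculator below; the figures in main() (one Ethernet
interface, one CPU, two disks on a controller that overlaps transfers)
are only an example, and the per-disk headroom factor is just my way of
expressing the "slightly larger number" above, not a recommendation.

/* nfsd_estimate.c -- back-of-the-envelope nfsd count per the X+Y+Z
 * argument above.  The headroom parameter allows a few extra requests
 * to queue per disk so the arm movement sorter has something to sort.
 */
#include <stdio.h>

static int nfsd_estimate(int interfaces, int cpus, int spindles,
                         int headroom_per_disk)
{
    /* At most one nfsd per interface moving packets, one per CPU
     * executing, one per independent disk (or controller) doing IO. */
    int peak = interfaces + cpus + spindles;

    return peak + spindles * headroom_per_disk;
}

int main(void)
{
    printf("no headroom:   %d nfsds\n", nfsd_estimate(1, 1, 2, 0));
    printf("with headroom: %d nfsds\n", nfsd_estimate(1, 1, 2, 2));
    return 0;
}

That prints 4 and 8 for this particular example.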

Naturally this is only the "benefit" side of the equation. As to the
"cost" side, it used to be that nfsds had a very low cost (a proc table
slot each and little more), so slightly overallocating them was not a
big problem.  But on some OS/machine combinations the cost becomes very
large over a certain threshold, and this may mean that reducing the
number below the theoretical maximum pays off.

Finally there is the question of the Ethernet bandwidth. In the best of
cases an Ethernet interface can read about 1000 packets/s, and
write 800KB/s (we assume that requests are small, so the number of
packets/s matters, while results are large, so the number of KB/s
matters; stat(2) and read/exec(2) are far more common than write(2)).

Divide that by the number of clients that may be actively requesting
data (usually about a tenth of the total number of machines on a wire
are actively doing remote IO), and you get pretty depressing numbers.
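
To put rough figures on it: take the numbers above, say 100 machines on
the wire of which a tenth are actively doing remote IO, and divide. The
constants in the snippet below are only the approximate ones quoted
here, not measurements.

/* ether_budget.c -- divide the rough Ethernet interface figures among
 * the clients actively doing remote IO.
 */
#include <stdio.h>

int main(void)
{
    double requests_per_sec = 1000.0;   /* small request packets in */
    double reply_kb_per_sec = 800.0;    /* bulk reply data out */
    int machines_on_wire    = 100;      /* example population */
    int active_fraction     = 10;       /* about 1 in 10 active at once */

    int active_clients = machines_on_wire / active_fraction;

    printf("active clients:        %d\n", active_clients);
    printf("requests per client:   %.0f/s\n",
           requests_per_sec / active_clients);
    printf("reply data per client: %.0f KB/s\n",
           reply_kb_per_sec / active_clients);
    return 0;
}

That works out to about 100 requests/s and 80KB/s per active client,
which is not much at all.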

It may be pointless to have, say, 4 2MB/s server disks, each capable of
50 transactions per second involving say 8-16KB each, and so have
enough nfsds to take advantage of this parallelism and bandwidth, if the
Ethernet wire and interface are the bottleneck.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

lm@slovax.Berkeley.EDU (Larry McVoy) (03/02/91)

In article <PCG.91Feb28195347@odin.cs.aber.ac.uk>, 
pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
|> richard> Given that nfsd runs in kernel mode inside nfssvc(), is this
|> richard> statement about contexts correct?
|> 
|> Yes and no, depending on who your vendor is, and which OS revision and
|> machine model you have. For Sun there is some history that may be worth
|> mentioning. Under SunOS 3 the nfsds were in effect kernel processes, so
|> that they could access the buffer cache, held in the kernel address
|> space, without copies. Since all nfsds run in the kernel page table
|> there was no problem.
|> 
|> Under SunOS 4 the buffer cache went away, so each nfsd was given its own
|> address space (memory mapped IO), while still being technically a kernel
|> process. This meant that MMU slot thrashing was virtually guaranteed, as
|> the nfs daemons are activated more or less FIFO and the MMU has a LIFO
|> replacement policy. As soon as the number of nfsds is greater than or
|> equal to the number of MMU slots, problems happen.

More misinformation spoken in an authoritative manner.  Ah, well, I've done 
the same I guess.  From sys/nfs/nfs_server.c:

nfs_svc()
....
        /* Now, release client memory; we never return back to user */
        relvm(u.u_procp);

From the SCCS history (note the date):

	D 2.83 87/12/15 18:34:42 kepecs 88 87
	remove virtual memory in async_daemon and nfs_svc as
	it's not needed. Remove pre-2.0 code to set rdir in nfs_svc.
	make sure these guys exit if error, since no vm to return to.

In other words, this problem went away 3 years ago, never to return.

|> I have seen the same server running the same load under SunOS 3 the day
|> before with 10-20% system time and 100-200 context switches per second,
|> and with SunOS 4 the day after with 80-90% system time and 800-900
|> context switches per second. An MMU slot swap on a Sun 3 will take about
|> a millisecond, which fits.

You may well have seen this.  Jumping to the conclusion that it is caused
by NFS is false, at least the reasons that you list are not true.

|> richard> If so, why is the default number of nfsds for Sun 3s 8?
|> 
|> Sun bogosity :-).

Piercarlo ignorance :-)

|> The latter argument is somewhat doubtful as there is contradictory
|> evidence about the relative merits of FCFS and of elevator style sorting
|> as used by the Unix kernel.

Ahem.  References, please?  I've looked at this one in detail; this
should be interesting.

---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

pcg@cs.aber.ac.uk (Piercarlo Antonio Grandi) (03/05/91)

[ this article may have already appeared; I repost it because probably
it did not get out of the local machine; apologies if you see it more
than once ]

	[ ... on SUN NFS/MMU sys time bogosity ... ]

pcg> I have seen the same server running the same load under SunOS 3 the
pcg> day before with 10-20% system time and 100-200 context switches per
pcg> second, and with SunOS 4 the day after with 80-90% system time and
pcg> 800-900 context switches per second. An MMU slot swap on a Sun 3
pcg> will take about a millisecond, which fits.

On 1 Mar 91 21:37:30 GMT, Larry McVoy commented:

lm> You may well have seen this.  Jumping to the conclusion that it is
lm> caused by NFS is false, at least the reasons that you list are not
lm> true.

This you say after recognizing above that the problem existed and
claiming that in recent SunOS releases it has been obviated. Now you
seem to hint that it is NFS related, but not because of MMU context
switching.

As to me, my educated guesses are these: the bogosity appears to be
strictly correlated with the number of NFS transactions processed per
second; the overhead per transaction seems to be about 1ms, and that
1ms seems to be the cost of an MMU swap; the number of context switches
per second reported by vmstat(8) seems to be strongly correlated with
the number of active nfsd processes; and the system time accumulated by
the nfsd processes becomes very large when there are many context
switches per second, but not otherwise.

Anybody with this problem (it helps to have servers running both SunOS
3.5 and SunOS 4.0.x) can have a look at the evidence, thanks to the
wonders of nfsstat(8), vmstat(1), ps(1) and pstat(8). In particular
'vmstat 1' (the 'r', 'b', 'cs' and 'sy' columns) and 'ps axv' (the 'TIME'
and 'PAGEIN' columns) will be revealing; 'nfsstat -ns' and 'pstat -u <nfsd
pid>' will give extra details (sample outputs for both SunOS 3 and 4
available on request).

The inferences that can be drawn are obvious, even if maybe wrong. After
all I don't spend too much time second guessing the *whys* of Sun
bogosities, contrary to appearances. I am already overwhelmed by those
in AT&T Sv386 at home...  :-).

Pray, tell us why the above observed behaviour is not a bogosity, or at
least what was/is the cause, and how/if it has been obviated three years
ago.  My explanation is a best guess, as should be pretty obvious; you
need not guess, and I am sure that enquiring minds want to know.


As to the details:

lm> nfs_svc()

lm>        /* Now, release client memory; we never return back to user */
lm>        relvm(u.u_procp);

lm> From the SCCS history (note the date):

The date is when the file was edited on a machine at Sun R&D... This is
slightly cheating. When did the majority of customers see this? 

lm>	D 2.83 87/12/15 18:34:42 kepecs 88 87
lm>	remove virtual memory in async_daemon and nfs_svc as
lm>	it's not needed. Remove pre-2.0 code to set rdir in nfs_svc.
lm>	make sure these guys exit if error, since no vm to return to.

Note that this was well known to me, and I did write that the nfsds are
*kernel* based processes, and used to run in the kernel's context.

My suspicion is that they now (SunOS 4), either via 'bread()' or
directly, do VM mapped IO to satisfy remote requests and thus require
page table swaps, which causes problems on machines with few contexts.
For sure they are reported as having a lot of page-ins in SunOS 4, both
by 'pstat -u' and 'ps axv', while they were reported to have a lot of IO
transactions in SunOS 3. It's curious that processes that do not have an
address space are doing page-ins. Maybe some kind of address space they
do have... :-)

lm> In other words, this problem went away 3 years ago, never to return.

Much software here is three years old... Same for a lot of people out
there. Also, if Sun R&D corrected the mistake three years ago on their
internal systems, it may take well over three years before it percolates
to some machines in the field.

There are quite a few people still running SunOS 3 out there (because
SunOS 4.0, for this and other reasons, performed so poorly that they
have preferred to stay with an older release, and are too scared to go
on to SunOS 4.1.1 even if admittedly it is vastly improved).

One amusing note though: one of the servers here has been recently put
on 4.1, which is still not the latest and greatest, and it still shows
appallingly high system time overheads directly proportional to NFS
load, but with an important difference: the number of context switches
per second reported by vmstat(1) is no longer appallingly high, even if
it counts the nfs daemons in the runnable and blocked categories. What's
going on?  I have the suspicion that the number of context switches per
second now simply excludes those for the nfsd processes.


Final note: as usual, I want to remind everybody that I am essentially
just a guest for News and Mail access at this site, and therefore none
of my postings should reflect on the reputation of the research
performed by the Coleg Prifysgol Cymru, in any way. I mention my
observations of their systems solely because they are those at hand.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@aber.ac.uk