[comp.protocols.nfs] mountd Performance under Stress

ntm1569@dsacg3.UUCP (Jeff Roth) (08/24/89)

The mount server appears to be becoming a bottleneck for an application in
which we've a large number of PC clients accessing data on a minicomputer
server. On occasion we can have quite a few users issuing multiple mount
requests simultaneously. When this happens we see some of the requests time
out, while users accessing already mounted files continue to receive good
service.

To be precise, the server is a Gould PowerNode 9050 (uniprocessor) with
a full complement of disk and ethernet controllers (four of each), and 
little running on it other than the various network services (next to
no interactive users). PCs start up by mounting and unmounting a file
system to download the application binaries, then mount data(base) files.
With around forty clients accessing the server, additional clients' mount
attempts begin to fail (though retries may succeed). (At this point we 
might have a dozen or so new clients trying to mount).

The mount server has to read /etc/exports, and to do the host name to IP
address translation would also have to access /etc/hosts (or the name
server), and it writes /etc/rmtab. So we thought mountd might be having
trouble getting to /etc. But ps "snapshots" showed mountd rarely waiting
on disk.

The mount server obviously also needs CPU cycles, and must compete for them,
mostly with the many NFS server daemons we run. At peaks we see mid-20s
load averages, and with mountd reniced to increase its scheduling priority
we are able to get seventy clients "on" before we again begin to see the
mount requests time out.

Our conclusion's been that we have a CPU bottleneck, with mountd getting
the worst of it. We're a little surprised, though, by the extreme
insensitivity of the already-mounted clients to the bottleneck (remember
they continue to see good response time even at peak loads).

I'd be interested in hearing if anyone else has run into this particular
wall in building an NFS application, and what if anything you've done
about it. Or if anyone has any other thoughts on what might be happening.

jroth@dsac.dla.mil  U.S. Defense Logistics Agency  (614) 238-9421
-- 

chuq@Apple.COM (Chuq Von Rospach) (08/25/89)

>The mount server appears to be becoming a bottleneck for an application in
>which we've a large number of PC clients accessing data on a minicomputer
>server. On occasion we can have quite a few users issuing multiple mount
>requests simultaneously. When this happens we see some of the requests time
>out, while users accessing already mounted files continue to receive good
>service.

Definitely. For a good time, set up a machine exporting USENET to three or
four hundred machines and then have it crash for 24 hours. All of the NFS
servers jump on it as soon as it comes back up, and I've seen mount requests
sit two hours waiting to happen. 

>The mount server has to read /etc/exports, and to do the host name to IP
>address translation would also have to access /etc/hosts (or the name
>server), and it writes /etc/rmtab. So we thought mountd might be having
>trouble getting to /etc. But ps "snapshots" showed mountd rarely waiting
>on disk.

The disk activity of mountd is fairly trivial. Hostname lookups via Yellow
Pages clear out a good bit of it, since you aren't sequentially searching
the host table.

Imagine, though, what's happening at the network layer. 50-100 (or more)
machines are all trying to create connections to the mountd at once. It's
spinning away, dealing with them as fast as it can, but the ethernet buffers
are all clogged with incoming packets, the mbuf pool is wedged full of
pending requests that are already in the queue (making it tough, sometimes,
for the mountd to get the memory it needs to return an fhandle to the client
so it can finish a given request), packets are being dropped on the floor,
clients are timing out and sending repeat requests -- it gets *really* nasty.

You end up, essentially, thrashing at a couple of layers in the kernel and
sending lots and lots of ethernet packets all over everywhere. It isn't,
really, a CPU bottleneck although a faster CPU will help somewhat. 

The problem, from what I've seen, is that the statelessness of NFS makes it
impossible for the client to tell whether the server has never seen its
request (as opposed to knowing about it and not acting on it yet). So it has
to assume the request disappeared and send it out again when it times out.
This is correct most of the time, but not in this kind of worst-case
scenario. One way to minimize it under the current scheme would be to make
the "mount request timeout" be a sliding scale similar to ethernet packet
collision delays -- every time it times out, the client waits a little
longer (with a randomizing factor tossed in) before sending the request
again. That isn't reducing the mounting load, but simply spreading it out
further in time. Doesn't hurt the normal case, and would reduce some of the
clogging in the worst case scenario.
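
Something along these lines, say (a minimal sketch, not real mount source;
the names and constants are made up, and the actual mount attempt is
abstracted into a callback):

/*
 * Sketch of a randomized, ethernet-style backoff for mount retries.
 * BASE_DELAY, MAX_DELAY, MAX_TRIES and try_mount are all invented
 * for the example.
 */
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BASE_DELAY  5           /* seconds to wait after the first timeout */
#define MAX_DELAY   120         /* cap the backoff */
#define MAX_TRIES   10

int
mount_with_backoff(int (*try_mount)(void))  /* try_mount returns 0 on success */
{
    int attempt, delay = BASE_DELAY;

    srand((unsigned)(time((time_t *)0) ^ getpid()));
    for (attempt = 0; attempt < MAX_TRIES; attempt++) {
        if (try_mount() == 0)
            return (0);                 /* got our fhandle */
        /*
         * Double the delay each time and add 0..delay/2 seconds of
         * random slop, so a building full of clients doesn't retry
         * in lockstep after the server comes back up.
         */
        sleep((unsigned)(delay + rand() % (delay / 2 + 1)));
        delay *= 2;
        if (delay > MAX_DELAY)
            delay = MAX_DELAY;
    }
    return (-1);                        /* give up and report the timeout */
}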

chuq

Chuq Von Rospach      =|=     Editor,OtherRealms     =|=     Member SFWA/ASFA
         chuq@apple.com   =|=  CI$: 73317,635  =|=  AppleLink: CHUQ
      [This is myself speaking. No company can control my thoughts.]

liam@cs.qmc.ac.uk (William Roberts) (09/01/89)

>>On occasion we can have quite a few users issuing multiple mount
>>requests simultaneously. When this happens we see some of the requests time
>>out, while users accessing already mounted files continue to receive good
>>service.

This is a difference between user-level RPC and kernel-level RPC.
The kernel level *knows* that its NFS RPC requests are
idempotent and so it doesn't change the xid when it
sends a retransmission. This means that the first reply is
acceptable no matter how many retransmissions have occurred.

The user-level makes no such guarantee, so there is a new xid
for each retransmission. In particular, this means that the
mount program's RPC requests to the mount daemon *have* to be
answered before the timeout period is up, otherwise the reply
is discarded as out of date. Ultimately this becomes a race
condition, especially as the mount requests are small and the
machine can buffer lots of them. We had an NFS server with 40
clients that was a 0.5 MIP Whitechapel MG1 - when all 40
clients rebooted after a power failure it was taking about 3
minutes from a client sending a request to the mountd sending
the reply, by which time a lot of 25-second timeouts had
gone by. Funny thing is, every mountd response is identical, so
the first one would do and the rest can be discarded....
You are just lucky that your server occasionally gets in there
quick enough!
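
To make the xid point concrete, here's a toy program (mine, not the
actual RPC code) showing the only check the caller can make on a late
reply:

/*
 * Toy illustration: reusing the xid across retransmissions lets a
 * late reply be accepted; generating a fresh xid per retry forces
 * the same reply to be thrown away as out of date.
 */
#include <stdio.h>

static int
reply_usable(unsigned long call_xid, unsigned long reply_xid)
{
    return (reply_xid == call_xid);     /* all the client can check */
}

int
main()
{
    unsigned long xid = 42;             /* xid of the first transmission */
    unsigned long late_reply = 42;      /* server finally answers try #1 */

    /* kernel NFS style: retransmit with the SAME xid */
    printf("same xid: late reply %s\n",
        reply_usable(xid, late_reply) ? "accepted" : "discarded");

    /* user-level mount style: each retransmission gets a NEW xid */
    xid = 43;
    printf("new xid:  late reply %s\n",
        reply_usable(xid, late_reply) ? "accepted" : "discarded");

    return (0);
}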

>>The mount server has to read /etc/exports, and to do the host name to IP
>>address translation would also have to access /etc/hosts (or the name
>>server), and
>>              ***it writes /etc/rmtab***      [ my emphasis ]
>>. So we thought mountd might be having
>>trouble getting to /etc. But ps "snapshots" showed mountd rarely waiting
>>on disk.

To be more specific, it does a linear scan through rmtab
looking to see if this mount request is already there and
adds onto the end if it isn't.

On my main machine /etc/rmtab is 978 lines long.
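
Roughly, the per-request work being described looks like this (my
reconstruction, not the SunOS source):

/*
 * Scan /etc/rmtab line by line for this host:directory pair and
 * append it if it isn't already there.  One full pass over all
 * ~978 lines for every mount request.
 */
#include <stdio.h>
#include <string.h>

void
rmtab_note_mount(char *hostdir)         /* e.g. "pc17:/usr/data" */
{
    FILE *fp;
    char line[256];

    if ((fp = fopen("/etc/rmtab", "a+")) == NULL)
        return;                         /* only bookkeeping; give up quietly */
    rewind(fp);

    while (fgets(line, sizeof(line), fp) != NULL) {
        line[strcspn(line, "\n")] = '\0';
        if (strcmp(line, hostdir) == 0) {
            fclose(fp);
            return;                     /* already recorded */
        }
    }

    fprintf(fp, "%s\n", hostdir);       /* not found: add onto the end */
    fclose(fp);
}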

The reason it is so long is that most clients unmount their
disks by crashing, so the rmtab file never gets cleared by
unmount requests. On our MG1 servers we reniced the mountd to
-15 and removed all the /etc/rmtab nonsense.

I'm sorry, Chuq, but all that stuff about relentless mashing of
mbufs just doesn't sound at all plausible, especially since the
lucky clients who have already mounted are getting good service.

(If it hadn't been from someone who ought to know I would have
 loudly decried it as complete *@*!%*, but perhaps I'm not so
 certain of my ground...)


The Bottom Line:

1) Change mount to use a TCP connection to the mountd, or
   otherwise provide an idempotent RPC
2) Change mountd to use a dbm file or some other means
   of speeding up the search through rmtab (see the sketch below).
3) Encourage people to remove rmtab as part of the boot sequence!
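
For item 2, something like this ndbm-keyed rmtab might do; the function
names and the "host:directory" key format are just my guesses at how it
could be arranged:

/*
 * Keep /etc/rmtab entries keyed by "host:directory" in an ndbm
 * file, so the duplicate check is one hashed fetch instead of a
 * scan through ~1000 lines.
 */
#include <ndbm.h>
#include <fcntl.h>
#include <string.h>

static DBM *rmtab_db;

int
rmtab_open(void)
{
    /* creates /etc/rmtab.dir and /etc/rmtab.pag */
    rmtab_db = dbm_open("/etc/rmtab", O_RDWR | O_CREAT, 0644);
    return (rmtab_db == NULL ? -1 : 0);
}

void
rmtab_record(char *hostdir)             /* e.g. "pc17:/usr/data" */
{
    datum key, val;

    key.dptr = hostdir;
    key.dsize = strlen(hostdir) + 1;

    if (dbm_fetch(rmtab_db, key).dptr != NULL)
        return;                         /* already there: one lookup, no scan */

    val = key;                          /* presence is all that matters */
    (void) dbm_store(rmtab_db, key, val, DBM_INSERT);
}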


Actually, idempotent RPC is an easy and valuable thing to do,
especially as you just say "Buyer beware" and take "idempotent
RPC" to mean "don't increment the xid for each retransmission".
-- 

William Roberts         ARPA: liam@cs.qmc.ac.uk
Queen Mary College      UUCP: liam@qmc-cs.UUCP    AppleLink: UK0087
190 Mile End Road       Tel:  01-975 5250
LONDON, E1 4NS, UK      Fax:  01-980 6533

brent%terra@Sun.COM (Brent Callaghan) (09/06/89)

I made some improvements to the mountd performance for the
SunOS 4.0.3 release.  They were oriented to speeding up mounts
from a client using the automounter's "-hosts" map.  The special
situation here is that you can have a large number of
mount requests coming in from the same client in a short period
of time, but the changes should make the mountd a bit faster for
all mount requests.

- Exports caching.  Previously the mountd had to open the /etc/exports
  file and do a linear search for the exported filesystem for each
  mount request from a client.  I had the file cached as a linked list.
  The list is valid as long as a stat() of /etc/exports shows that it
  hasn't been updated (by exportfs).
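
  A sketch of that validity check (the function names are invented;
  this isn't the actual mountd source):

  #include <sys/types.h>
  #include <sys/stat.h>

  static time_t exports_mtime;   /* mtime when the file was last parsed */
  static int    exports_cached;  /* non-zero once a parsed copy exists */

  /*
   * Returns 1 if the in-core linked list built from /etc/exports is
   * still valid, 0 if the caller must re-read and re-parse the file
   * (and then call exports_cache_mark()).
   */
  int
  exports_cache_valid(void)
  {
      struct stat st;

      if (stat("/etc/exports", &st) < 0)
          return (0);                   /* can't tell: play it safe */
      return (exports_cached && st.st_mtime == exports_mtime);
  }

  void
  exports_cache_mark(void)              /* call after a fresh parse */
  {
      struct stat st;

      if (stat("/etc/exports", &st) == 0) {
          exports_mtime = st.st_mtime;
          exports_cached = 1;
      }
  }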

- Asynchronous /etc/rmtab updating.  The mountd was changed to update
  the /etc/rmtab *after* it had sent the response containing the
  filehandle back to the client.  There's no reason why the client
  should have to wait for this file to be updated.  BTW: the /etc/rmtab
  is already cached as a linked list.  The disk file is read only when
  the mountd starts up (presumably after a crash).
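
  In outline (a toy, not the mountd source; the helpers are hypothetical
  and only the ordering is the point):

  #include <stdio.h>

  static void send_fhandle_reply(char *client) { printf("fhandle -> %s\n", client); }
  static void record_in_rmtab(char *client)    { printf("noted %s\n", client); }

  static void
  handle_mount_request(char *client)
  {
      send_fhandle_reply(client);       /* used to happen second */
      record_in_rmtab(client);          /* used to happen first, with the
                                         * client waiting on the disk write */
  }

  int
  main()
  {
      handle_mount_request("pc17");
      return (0);
  }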

- Change netgroup/hostname checking.  Given a list of
  hostnames/netgroups, there's no easy way to tell whether a name
  represents a hostname or a netgroup.  The old mountd used to take
  each name one at a time, first checking it as a netgroup and then as a
  hostname.  This is about the most inefficient way to do this checking,
  and it could take a huge amount of time if the list was big (I've
  seen exports with 100 or so names in the list).  The new code first
  checks the whole list as if it were all hostnames.  This is just a
  bunch of strcmp's, so it's relatively fast.  If there's no match, it
  assumes that the list is netgroups and checks them with innetgr
  calls.  This is a whole lot faster if the list is just hostnames.
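
  The shape of that two-pass check, roughly (a sketch, not the shipped
  code; only innetgr() is the real library routine, the list layout is
  invented):

  #include <string.h>

  /* innetgr() is the standard C library netgroup lookup */
  extern int innetgr(const char *netgroup, const char *host,
                     const char *user, const char *domain);

  int
  client_allowed(char *client, char **names, int nnames)
  {
      int i;

      /* pass 1: cheap strcmp's, treating every name as a hostname */
      for (i = 0; i < nnames; i++)
          if (strcmp(names[i], client) == 0)
              return (1);

      /* pass 2: only if nothing matched, pay for netgroup lookups */
      for (i = 0; i < nnames; i++)
          if (innetgr(names[i], client, (char *)0, (char *)0))
              return (1);

      return (0);
  }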


	FYI.
		Brent

Made in New Zealand -->  Brent Callaghan  @ Sun Microsystems
			 uucp: sun!bcallaghan
			 phone: (415) 336 1051