[comp.protocols.nfs] disk queues of length zero..... or scaling up hurts

mash@mips.com (John Mashey) (03/06/91)

In article <1991Mar6.003008.9131@bellcore.bellcore.com> mo@bellcore.com (Michael O'Dell) writes:
>I know *I* have seen servers with much longer disk queues.

>For example -

>	Assume you memory-map and create a large file on a machine with lots
>	of free memory.  Say you write 50 megabytes.  You now close
>	the file and hence ask for it to really go to disk.
>	WHAM! 50 megabytes goes on the disk queue.  Yes this does happen,
>	and boy, is the poor dweeb at some other terminal who just
>	typed "ls" on the same filesystem really screwed.

>There are many more anomalies out there when the machine and the memory
>get sufficiently fast and large....

Actually, mo is understating the case ... it can get even worse...
Suppose you permit the disk cache to occupy up to, or close to 100%
of memory outside the kernel.
Then, all of a sudden, not only is there a giant disk queue for
a specific disk [which makes ls not only on the filesystem, but on the
disk, not so good], BUT you have a giant bunch of dirty pages in memory,
and if you're not careful, you may have thrown away clean pages of
read-only code to get there.

Now, I'm ANYBODY else on the system, and I type: glurp,
and discover the kernel has to get enough pages written out,
to get enough pages to page glurp in, and if glurp is big,
it executes a little while, then page faults, waits a long time,
then page faults, because every page fault needs to grab a
dirty page, and to get a dirty page, you need to get it written out, etc.

All of this is just exacerbated by fast CPUs with big memories,
especially since the typical time-sharing quantum has remained
around 1/60th to 1/100th of a second since the days of 1-mips machines;
now we can dirty 50X more pages per quantum....

Our folks had to do a bunch of work in RISC/os to stop this kind of
thing from killing multi-user response time, such as letting the
% of memory allocated as disk cache go up and down, but only up to
a parameter normally set less than 100%.  Of course, you don't need
to have dirtied the disk cache just by one program, a bunch of them
not going so fast can do it also.

As one more example, our folks found a bug in the BSD file system
that's been there forever, and was never noticed until we got 50-mips
machines.  We always find that the fastest machine finds some new race
condition in otherwise solid code that's been running a long time.  sigh.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086

af@spice.cs.cmu.edu (Alessandro Forin) (03/07/91)

In article <743@spim.mips.COM>, mash@mips.com (John Mashey) writes:
> In article <1991Mar6.003008.9131@bellcore.bellcore.com> mo@bellcore.com (Michael O'Dell) writes:
> >For example -
> 
> >	Assume you memory-map and create a large file on a machine with lots
> >	of free memory.  Say you write 50 megabytes.  You now close
> >	the file and hence ask for it to really go to disk.
> >	WHAM! 50 megabytes goes on the disk queue.  Yes this does happen,
> >	and boy, is the poor dweeb at some other terminal who just
> >	typed "ls" on the same filesystem really screwed.
> 
> Actually, mo is understating the case ... it can get even worse...
> Suppose you permit the disk cache to occupy up to, or close to 100%
> of memory outside the kernel.
> Then, all of a sudden, not only is there a giant disk queue for
> a specific disk [which makes ls not only on the filesystem, but on the
> disk, not so good], BUT you have a giant bunch of dirty pages in memory,
> and if you're not careful, you may have thrown away clean pages of
> read-only code to get there.
> 

While I agree with the point made here that disk throughput should
scale up with the rest of the system's throughput and doesn't, IMHO
an obsolete VM system is not a good reference point for the
discussion.

I hate to sound like a sales pitch, but it is a fact that with Mach's
external memory management facilities one has much better leverage
on these issues.  Just think of it as a big cache with user-mode
instructions to manage it ;-))

As to what I would do to the disk subsystem... I'd probably concentrate
on making large xfers very fast, with gather-dma for writes 'course.
A bit more intelligence on the disk controller seems mandatory.

Yes, this is partly futuristic given the current software, but that is
where I think we are going.

sandro-