[comp.os.minix] Performance Tuning

bunnell@udel.edu (H Timothy Bunnell) (08/04/90)

On my 16 MHz '286 system, Minix 1.5.10 performance is generally acceptable,
especially if just one job is running at a time.  However, with one
CPU-intensive job running, the system runs other jobs, especially ones that
are more I/O-intensive, in a balky, start-stop fashion that seems unlike most
of the other multi-tasking systems I've used (e.g., on old PDP-11/34 (TSX+)
and Sun 386i systems).  I wondered if this poor performance was due to the
much-discussed FS bottleneck or if maybe a little system tuning could help.  I
remembered once seeing a comment (by Bruce Evans I think) to the effect that
the default scheduler quantum time was too long for his (386) system.  The
following is a longish discussion of the effects of fiddling with the
scheduler quantum time, and a request for others who have experimented with
similar simple changes to improve performance to post their results or
suggestions.


In clock.c under Minix 1.5.10 MILLISEC is set to allow CPU-bound jobs to run
for at least 100 msec before possibly being shoved to the bottom of the user
queue (if anything else could run).  I looked at the effects on performance of
several (shorter) quantum values.  Specifically, I looked at what happens when
two jobs run simultaneously versus when each runs on an otherwise idle system.
The two jobs were (a) doing a "cat *.c" in /usr/src/kernel (call this the
I/O-job) and (b) training a neural network simulator (call this the CPU-job).
Here are the results of timing how long a single pass of the I/O-job takes
with or without simultaneously running the CPU-job (times are in seconds):

		  system-loaded		  system-idle
Quantum		times(real/user/sys)	times(real/user/sys)
 0.100		  85.0/0.0/6.7		15.0/0.1/8.3
 0.050		  45.0/0.0/6.0		16.0/0.2/8.3
 0.036		  33.0/0.0/5.9		15.0/0.2/8.2
 0.018		  30.0/0.0/5.4		16.0/0.1/8.3


The complementary case is how long a sample of the CPU-job takes with or
without continuously running the I/O-job:

		  system-loaded		  system-idle
Quantum 	times(real/user/sys)	times(real/user/sys)
 0.100		  37.0/30.4/0.2		30.0/28.7/0.2
 0.050		  43.0/32.0/0.9		30.0/28.7/0.2
 0.036		  49.0/33.9/1.9		30.0/28.8/0.2
 0.018		  50.0/34.1/4.0		30.0/28.9/0.3

Note that changing the quantum has little effect in either case when the
system is otherwise idle.  With long quanta the I/O-job showed a huge
performance drop when the CPU-job was running simultaneously.  The performance
drop is less severe with smaller quanta.  With 100 msec quanta, real time
increased by a factor of 5.67, but with 18 msec quanta the increase was only a
factor of 1.875.
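(These factors are just the ratio of loaded to idle real time from the table
above: 85.0/15.0 ~= 5.67 and 30.0/16.0 ~= 1.875.)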

From the perspective of the CPU-job, the effects of changing quantum size are
reversed; smaller quanta produce larger performance drops when the system is
loaded with an I/O-job.  However, the differences are not as great as they
were from the perspective of the I/O-job.  In the worst case (quantum = 18
msec) the execution time increased by a factor of 1.67 versus a best-case
(quantum = 100 msec) factor of 1.23.
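(Again these are ratios of loaded to idle real time: 50.0/30.0 ~= 1.67 and
37.0/30.0 ~= 1.23.)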

For my machine, changing the quantum from 100 msec to 18 msec (about one 60
Hz clock tick) greatly improves the performance of I/O-bound jobs while
producing only a moderate degradation in CPU-bound job performance.  With a
slower machine the results might be different; the system might spend too much
time switching jobs and not enough time running them.  But on faster CPUs the
overall performance improvement is real.  It occurs because I/O-bound jobs
execute for only a very short time before blocking, at which point another
process gets to run.  If that other process is CPU-bound, the I/O job then has
to sit for up to a full quantum before it runs again, even if its I/O completes
much sooner.  With a smaller time slice, I/O-bound jobs get to run sooner after
they unblock, and they quickly block again, giving the CPU back to another
process.
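
For anyone who wants to try this, the whole change is the MILLISEC define
near the top of kernel/clock.c.  From memory the relevant lines look roughly
like this (check them against your own source before editing):

	#define MILLISEC	 100	/* how often to call the scheduler (msec) */
	#define SCHED_RATE	(MILLISEC*HZ/1000)	/* number of ticks per schedule */

With HZ at 60, dropping MILLISEC from 100 to 18 makes the quantum work out to
a single clock tick (18*60/1000 truncates to 1), which is the 18 msec case in
the tables above.  Recompile the kernel and reboot to test a new value.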

From a subjective standpoint the effect of reducing quantum size is fairly
important because I/O-intensive jobs tend to be the ones with which users
are directly interacting.  How much of the perceived slowness of Minix (from a
more casual user's standpoint) is really due to small things like quantum
size that can be adjusted quite easily? If anyone else has results of similar
experiments, or has other ideas, I would certainly like to hear about them or
see them posted. In fact, a little tutorial on system tuning (things that
do not require redesign of the operating system :-) from one of the
real gurus would be just wonderful.
--
Tim Bunnell
<bunnell@udel.edu>

brucee@runxtsa.runx.oz.au (Bruce Evans) (08/05/90)

In article <26628@nigel.ee.udel.edu> bunnell@udel.edu () writes:
>[System works better with a smaller quantum.]

The problem is really that when a user process does a system call, it is
rescheduled to the back of the queue.  So a user doing only system calls, in
competition with a hog doing none, will in the `best' case (no waiting for
physical i/o) run for time user_epsilon + system_epsilon per call while the
hog runs for the full quantum.  Even on an infinitely fast processor, the
user making system calls will still take about (nr_system_calls * quantum).

Another data point (for a 20MHz 386 and everything in the cache): for the hog
process (while :; do :; done), ls -l /bin takes 53 sec with the normal
quantum of 100 msec, and 9 sec with a quantum of 17 msec. Without the hog, it
takes 1 or 2 sec. With the hog and the enclosed patch, it takes 2 or 3 sec.
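
As a rough check of the (nr_system_calls * quantum) model: if ls -l /bin
makes on the order of 500 blocking system calls (a guess, not a measurement),
then

	500 * 0.100 sec ~= 50 sec	(53 sec observed)
	500 * 0.017 sec ~=  8.5 sec	( 9 sec observed)

which is about what the timings show.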

A simple (hasty) fix is enclosed. Terence Holm told me about this a long
time ago, but I was worried about the accompanying bug and didn't put it in
immediately, and then forgot about it. The fix just puts freshly ready users
on the head of the queue instead of the tail.  For a system call that does
not have to wait for the hardware, this restores the old queue order, so it
must be right.  For processes that have been waiting it is weighted towards
a LIFO order, which is back to front.

Note that server tasks are getting in the way again ;-) by obscuring the
original process context. I have made another change (not included) to
proc.c, to remember the last user for better accounting. It might be better
to use the head of the queue only for the last user. This would penalize
users doing physical i/o. But they would have to wait a relatively long time
anyway, and hopefully this case is rare.

This works well with 16 hog processes and ls -l /bin - the ls process gets
slightly more than 1/17 of the time.  It is still nice to have a small quantum.
A response time of 17 * 100 msec feels awful while 17 * 17 msec is passable,
much like a slow terminal. A 10MHz 286 should be able to handle a quantum of
1 msec with an overhead of 2% to 5%, but not with the current clock driver.
(In real mode, saving the registers, loading a new process pointer, and
reloading the registers takes about 2% of a 1 msec quantum.)
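
To put numbers on those figures: with 17 ready processes, 17 * 100 msec is
about 1.7 sec between chances for the ls process to run, versus under 0.3 sec
with 17 msec quanta; and 2% of a 1 msec quantum is about 20 usec, which is the
budget for the register save/reload on each switch.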

The bug is that a race condition with signals is enhanced. It shows up for
the `term' program. This forks into 3 processes. 2 of the children do a
kill(0, SIGINT) to terminate. With the new scheduling, these children get
to run before the main shell. (Processing for the first kill usually handles
the parent first, then the other child. The LIFO scheduling causes the
other child to run first, and it does another kill.) The main shell attempts
to catch SIGINTs. It is scheduled to catch the first, but is terminated by
the second before it has a chance to run.  This is not just the fault of the
new scheduling: a FIFO scheduling order would fail when the parent's process
slot is greater than the 2nd child's, instead of less than it.

>		  system-loaded		  system-idle
>Quantum		times(real/user/sys)	times(real/user/sys)
> 0.100		  85.0/0.0/6.7		15.0/0.1/8.3
> 0.050		  45.0/0.0/6.0		16.0/0.2/8.3

Note the zero user times and reduced system times for the loaded case.  These
are caused by rescheduling in synchronization with the accounting: time is
charged in whole clock ticks to whichever process is current when the clock
interrupts, and here the process is often forced to wait before it has
consumed even 1 tick, so nothing gets charged to it.  The effect is reduced by
the new scheduling.

>see them posted. In fact, a little tutorial on system tuning (things that
>do not require redesign of the operating system :-) from one of the
>real gurus would be just wonderful.

I think the obvious things have mostly been done. There is a barrier marked
`64K' that stops you having to worry about how to partition memory (though
not on the ST - I am surprised not to have seen more on tuning the cache/
RAM disk sizes from ST people).

Some of the program sizes need tuning.  E.g., in fixbin.sh, cron is chmem'ed
to 64000; it should be about 4K.  make, nm and patch sometimes need a lot more
than the defaults.  ps only needs 16K.  Sed can only use about 16K, which is
about all it can get under the 64K barrier anyway, but it is given 60000,
mostly wasted on the ST.  I use 5120 for term and 2560 for update.  Some
programs use malloc a lot and some (like sed) allocate everything in the bss,
so it is hard to tell how much they need or can use.
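
For reference, these adjustments are ordinary chmem invocations, e.g. (the
sizes are the ones mentioned above; adjust the paths to wherever the binaries
live on your system):

	chmem =4096  /usr/bin/cron
	chmem =16384 /bin/ps
	chmem =5120  /usr/bin/term
	chmem =2560  /bin/update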

Better sizes could be obtained by having MM write the final heap size to a
database whenever a process exits. A daemon could convert this to a list of
maximum sizes ever used.

#! /bin/sh
# Contents:  proc.c.cdif
# Wrapped by src@besplex on Sun Aug  5 22:41:50 1990
PATH=/bin:/usr/bin:/usr/ucb ; export PATH
if test -f 'proc.c.cdif' -a "${1}" != "-c" ; then 
  echo shar: Will not clobber existing file \"'proc.c.cdif'\"
else
echo shar: Extracting \"'proc.c.cdif'\" \(604 characters\)
sed "s/^X//" >'proc.c.cdif' <<'END_OF_FILE'
X*** dist/proc.c	Mon Feb 12 03:24:19 1990
X--- proc.c	Sun Aug  5 22:17:53 1990
X***************
X*** 358,363 ****
X--- 406,412 ----
X  	return;
X    }
X  #endif
X+ #if 0
X    if (rdy_head[USER_Q] != NIL_PROC)
X  	rdy_tail[USER_Q]->p_nextready = rp;
X    else
X***************
X*** 364,369 ****
X--- 413,424 ----
X  	rdy_head[USER_Q] = rp;
X    rdy_tail[USER_Q] = rp;
X    rp->p_nextready = NIL_PROC;
X+ #else  
X+   /* Better to add users to the *beginning* of the queue. */
X+   rp->p_nextready = rdy_head[USER_Q];
X+   if (rdy_head[USER_Q] == NIL_PROC) rdy_tail[USER_Q] = rp;
X+   rdy_head[USER_Q] = rp;
X+ #endif  
X  }
X  
X  
END_OF_FILE
if test 604 -ne `wc -c <'proc.c.cdif'`; then
    echo shar: \"'proc.c.cdif'\" unpacked with wrong size!
fi
# end of 'proc.c.cdif'
fi
echo shar: End of shell archive.
exit 0
-- 
Bruce Evans
Internet: brucee@runxtsa.runx.oz.au    UUCP: uunet!runxtsa.runx.oz.au!brucee
(My other address (evans@ditsyda.oz.au) no longer works)