[net.unix-wizards] nice

mike.rice@rand-relay@sri-unix.UUCP (08/12/83)

From:  Mike.Caplinger <mike.rice@rand-relay>

Everybody who sent me constructive comments about the scheduling of
niced jobs in 4.1; thanks very much.  Alas, nobody knows anything
conclusive.  Who WROTE the scheduler?  Come forward and confess,
because nobody else is quite sure how it works.

Everybody who told me to serialize the jobs:  thanks but I think you
missed the point.  If serializing them is the only solution, then the
scheduler isn't much, is it?  Either that, or system resources are very
poorly managed under 4.1.  VMS, much as I dislike it, can run 4 batch
jobs simultaneously one priority tick below interactive on the same
hardware, and interactive users are rarely if ever impacted.

mike@brl-vgr@sri-unix.UUCP (08/15/83)

From:      Mike Muuss <mike@brl-vgr>

This does not answer the exact question posed ("how does the 4.1
scheduler work"), but does discuss an alternative scheduler
for UNIX.  The system under discussion below is BRL/UNIX,
a "High-Performance" V6 derivative, which is still in use today
(and supports 25-30 simultaneous users on an 11/70, and
has TCP/IP support, etc).

It is my intention to implement a somewhat more sophisticated
version of this scheduler for 4.2 BSD.  People who would be interested
in hearing more about this project should contact me in early
September.

- - - - - - - - From the archives - - - - - - - - - - - - - - -

Date:     16 Jun 82 20:30:03-EDT (Wed)
From:     Michael Muuss <mike@brl-bmd>
To:       Harry at Brl-Bmd, Steve at Brl-Bmd, Earl at Brl-Bmd
cc:       Brian Harvey <BH@mit-ai>, Gurus at Brl-Bmd, Software at Brl-Bmd,
          BobS at Brl-Bmd, Rucker at Brl-Bmd
Subject:  Notes on the BRL/UNIX Scheduling Algorithm

Based upon a request for information by Earl, I have put together the
following discussion:

	The UNIX Scheduler that we use in BRL/UNIX is most definitely
NOT time-sliced.  It is a strict priority based scheduler, with
premption.  So, the question then boils down to how is the priority
computed.  Here is some information that I have written about this:

SCHEDULING

Taking first the case where there is no swapping, it is the task of the
scheduler to allocate the CPU to the process with the highest priority
on the run queue.  Whenever an event has occured that a process was 
waiting in the kernel for (ie waiting in psleep()), pwakeup() is called
and the process is placed on the run queue, with the priority specified
when psleep was called.  runrun is incremented, so whenever all the
current interrups have been processed, and any kernel execution
has finished, rather than returning the CPU to the user process,
a context switch occurs (magic in the machine assist).

Since the newly awakened process, by convention, has a negative (ie good)
priority, it will get the CPU (assuming no other processes with a higher
priority).  Whenever the system service is completed (sys-call is finished,
signal posted, whatever), and the process is to resume executing
in user mode, the priority is recomputed using setpri(), so that it
competes fairly with other processes.

SWAPPING

As currently implemented, the swapper exists as a totally separate
entity from the scheduler, and in fact is itself scheduled as a process.
The swapper is process zero, hand crafted by the kernel.
While total separation between swapping and scheduling is probably not
the best approach, it does work quite well, and makes the code much
easier to understand and maintain.  (Note that the routine sched() is
actually the SWAPPER, and that swtch() is the SCHEDULER.  Poor naming.)

The swap algorithm that we use (which is different in many of the
details from the BTL algorithms, but still is similar), goes like this:

IF there are processes swapped out, THEN consider what to do
ELSE wait for somebody to be swapped out (usually by expand() or
xnewproc()) and check again.

Get information on the hightest (best) priority process that is swapped
out.  If sufficient core is availible as things stand now, go right
ahead and swap him in, and go back to the beginning.

No free core.  We know the size of the process to be swapped in, so loop
through the process table looking for "easy core" -- that is, processes
which are in core, but are executing a sys-wait call, or are in
ptrace-stop state.  Swap out "easy core" processes until there are no
more of them, or we have swapped out as much space as the process
comming in requires.  Note that this approach ignores the problem of
memory fragmentation, and merely hopes that things will work out.
However, the estimation approach reduces the proc table search
from order N^2 (old way) to order N, which can be very helpful with
large proc tables.

So, after swapping out the estmated amount of "easy core", try to swap
the best swapped out process on the run queue in (because of the elapsed
time, this may no longer be the same process as in the first time
through).  The code above is repeated.

If "enough" easy core could not be found, then we seek the "worst"
process in core.  If the worst in core is still better than the best
on the swap device, forget it, and wait a while.  ALSO, if the "worst"
process in core has been loaded for less than two seconds, forget it.
This is the only anti-thrashing code that we use.

INTERACTIONS on MEMORY RICH Systems

Where the amount of memory on a system is sufficient to hold most of the
members of the run queue, the swapping algorithm above tends to work
quite well, clearing memory of "deadwood" process just sitting around in
sys-WAIT state, keeping memory utilization at around 75%, which seems about
ideal (there is usually space for forks and expands directly in core).  This
points out a lossage:  the memory allocator and expand() don't have the
ability to check for additional core just off the end of a process, so a
lot of needless core-shuffling happens.  This is on the list of things
to be fixed.  In the memory rich scenario, the scheduler controls the
allocation of the CPU, because it can (almost) always find something to run.

INTERACTIONS on MEMORY POOR Systems

When the memory requirements of the run queue greatly exceed availible
memory (a frequent problem on all but 11/44 and 11/70 systems), the
allocation of the CPU is controled primarily by the swapper.  The
scheduler will run whatever is in core, but keeping the right procs in
core is the real trick here.  Because the swapper is strictly priority
based, it will try to keep the "right" processes in core;  the only
difficult case is when there are several processes exhibiting identical
behavior;  they will tend to "round robin" at the same priority.  Having
the decay ratios for incore and onswap different is important here.

The 2-second anti-thrashing constant prevents the system from going into
a total thrash, causing an orderly "staging" of processes through
memory.  Furthermore, where ever possible, I try to arrange to have
swapping take place on either (a) a high transfer rate disk also used
for filesystems or (b) a moderate speed disk dedicated to swapping
(RK05, etc).  This strategy tends to induce some bandwidth limitations
on swapping (~10 swaps/sec for dedicated RK05) so that core shuffling is
contained.

It is important to note that (in my opinion) the behavior of "staging"
processes though memory is the "proper" approach to this problem, if you
agree with the premise that the intent of the system is to provide good
interactive response, regardless of what happens to "crunchers".  While
this does tend to keep swapping activity fairly high, it also provides
reasonable service to interactive processes, which must be distinguished
from traditional thrashing, where little useful work is accomplished
because of all the memory shuffling.

INTERACTIONS on MEMORY STARVED Systems

If there is not enough room for 2 goodly sized user processes in the
machine (quite possible for 18-bit address space machines with less than
full compliment), and large programs are being run, then there may not
be much hope, and the swapping proceedure degenerates into one of
"popping" procs in and out of core, one at a time, which is the best
that can be done.


In a subsequent letter, I'll discuss how the bounds for the three
scheduling parameters are derived (and what they are), and how they
affect scheduling, swapping, and the interaction between the two.

----------------------------

Briefly, the computation of priority depends on several factors:

core usage:  every 20 blocks (10Kb) in p_size of core costs 1 "nice" point.

block io:  every 16 blocks in p_iocnt costs 1 "nice" point.

User-supplied NICE factor:  1 "nice" point per point, up or down.

CPU Usage:  1 nice point for every 16 ticks (0.256 ms) in p_cpu.

DISCUSSION:  The above formula is complicated by the fact that p_iocnt
and p_cpu are not what they seem.  Rather than representing a measure of
instantaneous loading, they include a decaying "history" factor as well.

For each second that passes, the following computations are made for
every process in the system:

Every time the value in u.u_utime (User CPU usage, not inclusing
u.u_stime, which is the CPU cost of doing I/O) crosses a 1-minute
boundary, the p_nice value is irrevocably incremented by 1, to provide a
long-range decay of "Number Crunchers".

[Every clock tick (0.016 ms) of cpu time is recorded by incrementing
p_cpu by one, and the process in marked with the "SWTCHED" flag]

Every second the process is in core, the p_time field is incremented by
HZ to indicate core-residency time.  Value is peak-limited to 90 seconds.

Now, to compute the p_cpu "decay" factor, the proper RATE must be
selected, as follows:

	rate = INRATIO iff SWTCHED is set, ie the process has received
			CPU time in the previous second,
	rate = NSELRATIO if SWTCHED is not set but SLOAD is, ie the
			process is loaded, but not run,
	rate = OUTRATIO if SWTCHED is not set and SLOAD is not set,
			ie process is swapped out, and got no CPU time.

After selecting the proper rate, p_cpu = p_cpu * rate / 100;
where the rate is expressed as a percentage (ie between 0 and 100).
p_iocnt = p_iocnt * diskratio / 100;

So, the entire scheduler really depends on the settings of the four
parameters inratio, nselratio, outratio, and diskratio.  These factors
control the "memory" of the past "behavior" of every process.

On the BMD70, these factors are presently set to the following values:
inratio=88%, nselratio=70%, outratio=60%, diskratio=40%.  These values
may be interogated and changed dynamically by using the program "schp".

Some important relationships exist between these numbers;  for a
memory-rich system, inratio > nselratio > outratio, so that the memory
of behavior is better for the people in core getting time.  This tends
to give a share of the machine in the near future to those processes
which have not gotten time in the immediate past, helping to level the
load.

The setting of the diskratio value has not been studied much;  the
present setting was done mostly by intuition.  The current values of the
cpu time ratios were originally determined by a simulation of the
algorithm, and then modified slightly under actual loading tests.
I presented a short paper on this at the 1st Toronto USENIX conference;
I'll try to dig up all the details and send them on in (yet) another
message.

The current algorithm begins to have difficulties when the machine is
overcommitted by a factor of four or more, as processes which are marked
as "hogs" never get any CPU time at all.  It is not clear that the
scheduler should handle this case;  a bigger computer is a better
approach.

				Best,
				 -Mike

geoff@utcs.UUCP (Geoff Collyer) (12/04/84)

Index:	bin/nice.c 4.2BSD

Description:
	Despite the claim in nice(1) that the number argument is the
	amount by which ``the priority is incremented'', it is actually
	presented to setpriority(2) as an *absolute* priority, not an
	increment.
Repeat-By:
	As an ordinary (non-super) user, type
		nice -2 nice -1 date
	nice will print
		setpriority: Permission denied
Fix:
	My fix was to use nice(3c) instead of the overkill of getpriority(2).
	Diffs follow:
7,9d4
< #include <sys/time.h>
< #include <sys/resource.h>
< 
24,27c20
< 	if (setpriority(PRIO_PROCESS, 0, nicarg) < 0) {
< 		perror("setpriority");
< 		exit(1);
< 	}
---
> 	nice(nicarg);

preece@ccvaxa.UUCP (12/06/84)

>	My fix was to use nice(3c) instead of the overkill of getpriority(2).
>	Diffs follow:
-------------
Two things bother me about this statement.

Shouldn't we really be trying to avoid calls to compatibility routines?
Just because Berkeley didn't bother to remove all their own uses of them,
shouldn't we try not to introduce any more?  Well, I try, anyway.

In what sense is using getpriority "overkill"?  You must mean that using
it is more work for YOU, since it's noticeably less work for the machine.
If you use nice(3c) you add another call and then do, inside it, the
getpriority call you could have done yourself.  And whoever reads your
code has to try to remember whether that old nice call was relative or
absolute.  If you just used getpriority and setpriority it would at
least be clear exactly what you were doing.

scott preece
ihnp4!uiucdcs!ccvaxa!preece

geoff@utcs.uucp (12/10/84)

Description:
	Despite the claim in nice(1) that the number argument is the
	amount by which ``the priority is incremented'', it is actually
	presented to setpriority(2) as an *absolute* priority, not an
	increment.
Repeat-By:
	As an ordinary (non-super) user, type
		nice -2 nice -1 date
	nice will print
		setpriority: Permission denied
Fix:
	My fix was to use nice(3c) instead of the overkill of getpriority(2).
	Diffs follow:
7,9d4
< #include <sys/time.h>
< #include <sys/resource.h>
< 
24,27c20
< 	if (setpriority(PRIO_PROCESS, 0, nicarg) < 0) {
< 		perror("setpriority");
< 		exit(1);
< 	}
---
> 	nice(nicarg);

guy@rlgvax.UUCP (Guy Harris) (12/12/84)

> >	My fix was to use nice(3c) instead of the overkill of getpriority(2).
> >	Diffs follow:
> -------------
> Shouldn't we really be trying to avoid calls to compatibility routines?
> Just because Berkeley didn't bother to remove all their own uses of them,
> shouldn't we try not to introduce any more?  Well, I try, anyway.

If you're writing a program which should run under V7, and 4.1, and S3, and
S5, and... you should use the compatibility routines.

> In what sense is using getpriority "overkill"?  You must mean that using
> it is more work for YOU, since it's noticeably less work for the machine.

Given the number of instructions (kernel and user) executed by using the
"nice" program, I suspect the extra giftwrapping around "setpriority"
provided by the "nice" routine almost completely disappears.  It's hardly
"noticeable".

> If you use nice(3c) you add another call and then do, inside it, the
> getpriority call you could have done yourself.  And whoever reads your
> code has to try to remember whether that old nice call was relative or
> absolute.  If you just used getpriority and setpriority it would at
> least be clear exactly what you were doing.

No argument there.  I *think* all versions of "nice" after V6 add the
value given as an argument to the current "nice", but the V7 manual's
wording (the priority is "augmented" by the argument) is a bit vague
(I checked the 2.9 code, assuming it was basically the same as V7, and
it does work that way); the fact that it is that confusing indicates
that "setpriority" is nicer (no pun intended).

	Guy Harris
	{seismo,ihnp4,allegra}!rlgvax!guy

geoff@utcs.UUCP (Geoff Collyer) (12/12/84)

Scott Preece wrote in reply to my bug report:

	>	My fix was to use nice(3c) instead of the overkill of getpriority(2).
	>	Diffs follow:
	-------------
	Two things bother me about this statement.

	Shouldn't we really be trying to avoid calls to compatibility routines?
	Just because Berkeley didn't bother to remove all their own uses of them,
	shouldn't we try not to introduce any more?  Well, I try, anyway.

I'm interested in keeping programs compatible with non-4.2BSD UNIXes.
Just because Berkeley isn't and puts nasty words in the manual pages
is not reason to stop using the compatible interfaces.

	In what sense is using getpriority "overkill"?  You must mean that using
	it is more work for YOU, since it's noticeably less work for the machine.
	If you use nice(3c) you add another call and then do, inside it, the
	getpriority call you could have done yourself.  And whoever reads your
	code has to try to remember whether that old nice call was relative or
	absolute.  If you just used getpriority and setpriority it would at
	least be clear exactly what you were doing.

I refer to the unnecessary complexity of the set/getpriority interface for
doing the simple task of changing the nice of the current priority.
setpriority has the ability to request of change of priority for processes
selected by process group, user id or process id and this is sufficiently
confusing that the person who changed nice(1) to use setpriority got it wrong.

And what's wrong with preferring the simpler interface?
Incidentally, setpriority is not significantly cheaper than nice, for this job.
One extra function call, which hides the baroque complexity of getpriority
and setpriority, is certainly worth the cost.

I find the use of nice far clearer than the combination of getpriority
and setpriority.  Anyone who is unfamiliar with nice is unlikely to be
capable of writing portable code.  People who limit themselves to knowledge
of 4.2BSD are severely restricting the portability of any code they write.
N.B. to flamers: this doesn't mean that one must learn the rococo intricacies
of System V; simple understanding of v7 will suffice.

ado@elsie.UUCP (Arthur David Olson) (12/12/84)

> >	My fix was to use nice(3c) instead of the overkill of getpriority(2).

> If you just used getpriority and setpriority it would at
> least be clear exactly what you were doing.

Except, of course, to those of us who lack getpriority and setpriority.
(Take my 4.1bsd system, please.  :-))  The nice call at least has the
advantage of being portable to a larger number of UNIX* (and variant)
systems.
--
Nice is a Mister Rogers trademark.
UNIX is an AT&T Bell Laboratories trademark.
--
	..decvax!seismo!elsie!ado			(301) 496-5688
	DEC, VAX and Elsie are Digital Equipment and Borden trademarks

jonab@sdcrdcf.UUCP (Jonathan Biggar) (12/13/84)

I doubt that Berkely every had plans to remove the nice call in favor of
get/setpriority.  They just provided an upwards compatible feature
and changed the nice system call to a library routine that used
get/setpriority.  To remove nice would be committing suicide.

Jon Biggar
{allegra,burdvax,cbosgd,hplabs,ihnp4,sdccsu3}!sdcrdcf!jonab

geoff@utcs.UUCP (Geoff Collyer) (12/16/84)

Guy Harris wrote:

> > If you use nice(3c) you add another call and then do, inside it, the
> > getpriority call you could have done yourself.  And whoever reads your
> > code has to try to remember whether that old nice call was relative or
> > absolute.  If you just used getpriority and setpriority it would at
> > least be clear exactly what you were doing.

> No argument there.  I *think* all versions of "nice" after V6 add the
> value given as an argument to the current "nice", but the V7 manual's
> wording (the priority is "augmented" by the argument) is a bit vague
> (I checked the 2.9 code, assuming it was basically the same as V7, and
> it does work that way); the fact that it is that confusing indicates
> that "setpriority" is nicer (no pun intended).

The V7 nice(2) page does say ``augmented'', but declares the system call as

	nice(incr)

and refers to the argument as an ``increment'' in the next paragraph.
More importantly, nice(1) talks about its optional argument as an increment.
There's really no ambiguity here.  The nice system call (or function on 4.2)
does exactly what nice(1) wants.

Robert Elz wrote:

> The only thing that bothered me about the suggested fix, was that
> the test for an error was omitted.  That is not a good idea.

My suggested fix is the nice call from the V7 nice command,
which has the advantage that if you ask for an increment that you aren't
allowed, the niced command still runs, unlike the vanilla 4.2 nice command,
which prints a message and doesn't run the niced command.
-- 
8 > 4 + 5.

kre@mulga.OZ (Robert Elz) (12/19/84)

| >	My fix was to use nice(3c) instead of the overkill of getpriority(2).
| >	Diffs follow:
| -------------
| Two things bother me about this statement.
| 
| Shouldn't we really be trying to avoid calls to compatibility routines?
| Just because Berkeley didn't bother to remove all their own uses of them,
| shouldn't we try not to introduce any more?  Well, I try, anyway.
| 
| In what sense is using getpriority "overkill"?  You must mean that using
| it is more work for YOU, since it's noticeably less work for the machine.
| If you use nice(3c) you add another call and then do, inside it, the
| getpriority call you could have done yourself.  And whoever reads your
| code has to try to remember whether that old nice call was relative or
| absolute.  If you just used getpriority and setpriority it would at
| least be clear exactly what you were doing.
| 
| scott preece
| ihnp4!uiucdcs!ccvaxa!preece

Please, NO.  That is the way I used to think before I thought
about it....  Its wrong!

The compatability routines are just that, "compatability" routines.
they are not "transition" routines.

You should use the low level system calls only when you can
demonstrate a real need for them, not just because they happen
to be handy.

Think what happens when you use "setpriority" when you could have used
"nice".  Someone who takes your program to a system that doesn't
have "setpriority" (deficient as that system may be) has to either
understand your code very well, in order to deduce that you were really
just doing what "nice" would have done, or they have to emulate
"setpriority" on their system.  Since setpriority provides facilities
to alter the "niceness" of any process, group of processes, or
user's processes, that is not likely to be easy to do.

So, use the compatability routines (nice, time, alarm, ...) unless
you can demonstrate a REAL NEED.  If the extra 50us or so that it
takes to call the compatibility routine really hurts you, and
you can prove it, then go ahead, use the system call (and remember,
you can save another 50us or so at each call if you perform
the system call in line with asm's :-).  Most demonstrable uses
will be when the extra functionality is required.  This is good,
it means that a program that uses these system calls will be
visibly non-portable to obsolete/backward UNIX operating systems.

So, the nice(1) program should have been left calling nice(3)
(or nice(2), whichever), whereas renice(1) should use setpriority(2).
Nice(1) should work anywhere, renice(1) only works on 4.2
systems (to implement it on others needs something like the
gross butchery that was used to implement it on 4.1).

The only thing that bothered me about the suggested fix, was that
the test for an error was omitted.  That is not a good idea.

Robert Elz				decvax!mulga!kre

* UNIX is UNREGISTERED