[comp.unix.questions] SunOS and SVRn Scheduler Algorithm

david@jpl-devvax.JPL.NASA.GOV (David E. Smyth) (02/16/90)

Can somebody please post the actual scheduling algorithm used by SunOS
4.0.3?  And the one for System V, just so we stay compatible.

Why?  There is some difference of opinion here about how effective
"nice" is in controlling how much various processes execute.  The SunOS
documents are completely vague in the several places where process
priorities are discussed, such as nice(1), getpriority(2), and so on.

In one place there is a statement that processes which use more CPU
are given a lower effective priority by the scheduler, so that
interactive processes are more responsive.  Elsewhere there is a
comment that a nice value of 19 means the process will never run
unless no other process has anything to do.

The actual algorithm should clear up this confusion.  I'm very
surprised it's not in the documentation.

chris@mimsy.umd.edu (Chris Torek) (02/16/90)

In article <7085@jpl-devvax.JPL.NASA.GOV> david@jpl-devvax.JPL.NASA.GOV
(David E. Smyth) writes:
>Can somebody please post the actual scheduling algorithm used by SunOS
>4.0.3?  And the one for System V, just so we stay compatible.

Although the SunOS and System V schedulers probably differ from BSD's
(perhaps even significantly), the BSD scheduling algorithm is relevant
as well.  (Besides, it is the only one I am familiar with :-) .)

In any case, it is not possible to post this algorithm, as it consists
of several hundred lines scattered across a number of different files.
It is hard to analyse, harder to predict, and though it has been
tweaked and tuned until it seems to work fairly well, it probably
ought not to be called an `algorithm' at all.

>Why?  There is some difference of opinion here about how effective
>"nice" is in controlling how much various processes execute.

Very little. :-)

The short version of the BSD scheduler:

A. CPU usage.  The priority depends on CPU usage.  Every second (modulo
   a special case for processes that are asleep, since many are and this
   saves time), p->p_cpu is updated as follows:

	/* NB: all these MIN and MAX calls are mainly to ensure variables
	   stay in `char' ranges */

	p->p_cpu = MAX(MIN((filter(avenrun[0]) * p->p_cpu) + p->p_nice, 255),
			0);

   `filter' is a magic formula defined as follows:

	#define	filter(loadav) ((2 * (loadav)) / (2 * (loadav) + 1))

   The following comment is relevant:

	 * We wish to decay away 90% of p_cpu in (5 * loadavg) seconds.
	 * That is, the system wants to compute a value of decay such
	 * that the following for loop:
	 * 	for (i = 0; i < (5 * loadavg); i++)
	 * 		p_cpu *= decay;
	 * will compute
	 * 	p_cpu *= 0.1;
	 * for all values of loadavg:
	 *
	 * Mathematically this loop can be expressed by saying:
	 * 	decay ** (5 * loadavg) ~= .1
	 *
	 * The system computes decay as:
	 * 	decay = (2 * loadavg) / (2 * loadavg + 1)
	 *
	 * We wish to prove that the system's computation of decay
	 * will always fulfill the equation:
	 * 	decay ** (5 * loadavg) ~= .1
	 *
	 * If we compute b as:
	 * 	b = 2 * loadavg
	 * then
	 * 	decay = b / (b + 1)
	 *
	 * We now need to prove two things:
	 *	1) Given factor ** (5 * loadavg) ~= .1, prove factor == b/(b+1)
	 *	2) Given b/(b+1) ** power ~= .1, prove power == (5 * loadavg)
	 *	
	 * Facts:
	 *         For x close to zero, exp(x) =~ 1 + x, since
	 *              exp(x) = x**0/0! + x**1/1! + x**2/2! + ... .
	 *              therefore exp(-1/b) =~ 1 - (1/b) = (b-1)/b.
	 *         For x close to zero, ln(1+x) =~ x, since
	 *              ln(1+x) = x - x**2/2 + x**3/3 - ...     -1 < x < 1
	 *              therefore ln(b/(b+1)) = ln(1 - 1/(b+1)) =~ -1/(b+1).
	 *         ln(.1) =~ -2.30
	 *
	 * Proof of (1):
	 *    Solve (factor)**(power) =~ .1 given power (5*loadav):
	 *	solving for factor,
	 *      ln(factor) =~ (-2.30/(5*loadav)), or
	 *      factor =~ exp(-1/((5/2.30)*loadav)) =~ exp(-1/(2*loadav)) =
	 *          exp(-1/b) =~ (b-1)/b =~ b/(b+1).                    QED
	 *
	 * Proof of (2):
	 *    Solve (factor)**(power) =~ .1 given factor == (b/(b+1)):
	 *	solving for power,
	 *      power*ln(b/(b+1)) =~ -2.30, or
	 *      power =~ 2.3 * (b + 1) = 4.6*loadav + 2.3 =~ 5*loadav.  QED
	 *
	 * Actual power values for the implemented algorithm are as follows:
	 *      loadav: 1       2       3       4
	 *      power:  5.68    10.32   14.94   19.55
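
   If you want to check the comment's arithmetic, the `power' table
   falls straight out of the decay formula.  Here is a small user-level
   C program (a sketch, not kernel code) that recomputes it:

	#include <stdio.h>
	#include <math.h>

	int
	main(void)
	{
		double loadav, decay, power;

		for (loadav = 1; loadav <= 4; loadav++) {
			decay = (2 * loadav) / (2 * loadav + 1);
			/* solve decay**power = 0.1 for power */
			power = log(0.1) / log(decay);
			printf("loadav %.0f: decay %.4f, power %.2f\n",
			    loadav, decay, power);
		}
		return (0);
	}

   This prints 5.68, 10.32, 14.94, and 19.55 for the power, matching
   the table, and each is indeed roughly 5*loadav.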


B. The CPU usage estimator grows whenever the clock ticks while a process
   is running.  The following comment comes from kern_clock.c:

	 * We adjust the priority of the current process.
	 * The priority of a process gets worse as it accumulates
	 * CPU time.  The cpu usage estimator (p_cpu) is increased here
	 * and the formula for computing priorities (in kern_synch.c)
	 * will compute a different value each time the p_cpu increases
	 * by 4.  The cpu usage estimator ramps up quite quickly when
	 * the process is running (linearly), and decays away exponentially,
	 * at a rate which is proportionally slower when the system is
	 * busy.  The basic principle is that the system will 90% forget
	 * that a process used a lot of CPU time in 5*loadav seconds.
	 * This causes the system to favor processes which haven't run
	 * much recently, and to round-robin among other processes.

   This code, in essence, says

	p->p_cpu = MIN(p->p_cpu + 1, 255);
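
   Putting A and B together: p_cpu ramps up linearly while a process
   runs and decays away exponentially once it stops.  A toy user-level
   simulation of a single process (HZ = 100 and loadav = 1 are
   assumptions for the demo, and the p_nice term is left out):

	#include <stdio.h>

	#define HZ	100			/* assumed clock rate */
	#define MIN(a, b)	((a) < (b) ? (a) : (b))

	int
	main(void)
	{
		double loadav = 1.0;		/* assumed load average */
		double filter = (2 * loadav) / (2 * loadav + 1);
		int p_cpu = 0, sec;

		/* run flat out: +1 per tick (B), decay once a second (A) */
		for (sec = 1; sec <= 5; sec++) {
			p_cpu = MIN(p_cpu + HZ, 255);
			p_cpu = (int)(filter * p_cpu);
			printf("sec %2d (running): p_cpu = %3d\n", sec, p_cpu);
		}
		/* then sleep: watch p_cpu decay away over ~5 seconds */
		for (; sec <= 10; sec++) {
			p_cpu = (int)(filter * p_cpu);
			printf("sec %2d (asleep):  p_cpu = %3d\n", sec, p_cpu);
		}
		return (0);
	}

   After five seconds of running, p_cpu has climbed to about 170; after
   five more seconds asleep it is down to about 22, i.e. roughly 90%
   forgotten, as the comment promises.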

C. Priority.  After adjusting p->p_cpu (in both A and B), each process is
   assigned a `user priority'.  This is calculated as

	p->p_usrpri = MIN(127,
		(p->p_cpu / 4) + PUSER + 2 * p->p_nice +
		(p->p_rssize > p->p_maxrss && freemem < desfree ? 2*4 : 0)
	    );

   The process p is scheduled to run as soon as the kernel returns to user
   code if p has a better (numerically lower) p_usrpri than the `current'
   process.
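
   To see how much `nice' actually buys you, plug some numbers into
   that formula.  A sketch, taking PUSER = 50 (4.3BSD's value) and
   ignoring the memory-pressure term:

	#include <stdio.h>

	#define PUSER	50	/* 4.3BSD's value */
	#define MIN(a, b)	((a) < (b) ? (a) : (b))

	/* p_nice is the user-visible nice value, as in the formula above */
	int
	usrpri(int p_cpu, int p_nice)
	{
		return (MIN(127, p_cpu / 4 + PUSER + 2 * p_nice));
	}

	int
	main(void)
	{
		/* lower number = better priority */
		printf("fresh process, nice  0: %d\n", usrpri(0, 0));	/*  50 */
		printf("CPU hog,       nice  0: %d\n", usrpri(255, 0));	/* 113 */
		printf("fresh process, nice 19: %d\n", usrpri(0, 19));	/*  88 */
		printf("CPU hog,       nice 19: %d\n", usrpri(255, 19));	/* 127 */
		return (0);
	}

   Nice 19 costs at most 2*19 = 38 points, while the CPU estimator
   contributes up to 255/4 = 63; a nice-19 process that has been idle
   for a while (88) therefore still outranks a nice-0 CPU hog (113).
   `nice' biases the estimator, it does not turn the process off, so
   the `will never run' claim in the documentation is at best an
   approximation.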

D. All kernel-mode processes have a better p_pri (< PZERO) than any user
   priority (>= PZERO), and the kernel runs them until they leave kernel
   mode (are ready to return to user code); only then will it choose the
   process with the best (lowest) p_usrpri.
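
   Since kernel priorities are numerically smaller than all user
   priorities, a single lowest-number scan of the run queue implements
   both rules.  Schematically (a sketch, not the real swtch() code; the
   struct is a stand-in, and p_pri is assumed to already hold p_usrpri
   for a process headed back to user mode):

	struct proc {
		int p_pri;		/* < PZERO while in kernel mode */
		struct proc *p_link;	/* run-queue link */
	};

	struct proc *
	pick_next(struct proc *runq)
	{
		struct proc *p, *best = 0;

		for (p = runq; p != 0; p = p->p_link)
			if (best == 0 || p->p_pri < best->p_pri)
				best = p;	/* lowest number wins */
		return (best);
	}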

One notable problem with the above scheme is that no process is charged
for CPU time if the clock does not tick while the process is running.
Conversely, a process is charged for an entire clock tick if it happens
to be the one running when the clock ticks.  Since processes are rescheduled
on clock ticks, this is often not serious; however, it is possible for
a process to arrange to run for 9/10ths of a clock tick, immediately
after each clock tick, hence get 90% of the CPU while being charged for
0% of the CPU.  (Hint: see select(2) and setitimer(2).)
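
For the curious, here is roughly what such a program might look like.
This is an illustrative sketch only: the 100 Hz tick is an assumption,
and it assumes gettimeofday() resolution is much finer than a tick (as
it is on, e.g., Suns).  The trick is to burn CPU for most of a tick,
then be asleep in select() at the instant the tick is charged:

	#include <sys/time.h>

	#define TICK_USEC	10000	/* assumed 100 Hz clock */

	int
	main(void)
	{
		struct timeval start, now, nap;
		long used;

		for (;;) {
			gettimeofday(&start, (struct timezone *)0);
			/* burn ~9/10 of a tick in user mode */
			do {
				gettimeofday(&now, (struct timezone *)0);
				used = (now.tv_sec - start.tv_sec) * 1000000 +
				    (now.tv_usec - start.tv_usec);
			} while (used < TICK_USEC * 9 / 10);
			/* a nonzero timeout is rounded up to a full tick */
			nap.tv_sec = 0;
			nap.tv_usec = 1;
			(void) select(0, (fd_set *)0, (fd_set *)0,
			    (fd_set *)0, &nap);
		}
		/* NOTREACHED */
	}

How close this gets to 90% depends on how well the busy-wait is
calibrated and on what else is runnable, but a process doing this
accumulates almost no p_cpu and so keeps the best user priority on
the system.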
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris