mike.rice@rand-relay@sri-unix.UUCP (08/12/83)
From: Mike.Caplinger <mike.rice@rand-relay> Everybody who sent me constructive comments about the scheduling of niced jobs in 4.1; thanks very much. Alas, nobody knows anything conclusive. Who WROTE the scheduler? Come forward and confess, because nobody else is quite sure how it works. Everybody who told me to serialize the jobs: thanks but I think you missed the point. If serializing them is the only solution, then the scheduler isn't much, is it? Either that, or system resources are very poorly managed under 4.1. VMS, much as I dislike it, can run 4 batch jobs simultaneously one priority tick below interactive on the same hardware, and interactive users are rarely if ever impacted.
mike@brl-vgr@sri-unix.UUCP (08/15/83)
From: Mike Muuss <mike@brl-vgr> This does not answer the exact question posed ("how does the 4.1 scheduler work"), but does discuss an alternative scheduler for UNIX. The system under discussion below is BRL/UNIX, a "High-Performance" V6 derivative, which is still in use today (and supports 25-30 simultaneous users on an 11/70, and has TCP/IP support, etc). It is my intention to implement a somewhat more sophisticated version of this scheduler for 4.2 BSD. People who would be interested in hearing more about this project should contact me in early September. - - - - - - - - From the archives - - - - - - - - - - - - - - - Date: 16 Jun 82 20:30:03-EDT (Wed) From: Michael Muuss <mike@brl-bmd> To: Harry at Brl-Bmd, Steve at Brl-Bmd, Earl at Brl-Bmd cc: Brian Harvey <BH@mit-ai>, Gurus at Brl-Bmd, Software at Brl-Bmd, BobS at Brl-Bmd, Rucker at Brl-Bmd Subject: Notes on the BRL/UNIX Scheduling Algorithm Based upon a request for information by Earl, I have put together the following discussion: The UNIX Scheduler that we use in BRL/UNIX is most definitely NOT time-sliced. It is a strict priority based scheduler, with premption. So, the question then boils down to how is the priority computed. Here is some information that I have written about this: SCHEDULING Taking first the case where there is no swapping, it is the task of the scheduler to allocate the CPU to the process with the highest priority on the run queue. Whenever an event has occured that a process was waiting in the kernel for (ie waiting in psleep()), pwakeup() is called and the process is placed on the run queue, with the priority specified when psleep was called. runrun is incremented, so whenever all the current interrups have been processed, and any kernel execution has finished, rather than returning the CPU to the user process, a context switch occurs (magic in the machine assist). Since the newly awakened process, by convention, has a negative (ie good) priority, it will get the CPU (assuming no other processes with a higher priority). Whenever the system service is completed (sys-call is finished, signal posted, whatever), and the process is to resume executing in user mode, the priority is recomputed using setpri(), so that it competes fairly with other processes. SWAPPING As currently implemented, the swapper exists as a totally separate entity from the scheduler, and in fact is itself scheduled as a process. The swapper is process zero, hand crafted by the kernel. While total separation between swapping and scheduling is probably not the best approach, it does work quite well, and makes the code much easier to understand and maintain. (Note that the routine sched() is actually the SWAPPER, and that swtch() is the SCHEDULER. Poor naming.) The swap algorithm that we use (which is different in many of the details from the BTL algorithms, but still is similar), goes like this: IF there are processes swapped out, THEN consider what to do ELSE wait for somebody to be swapped out (usually by expand() or xnewproc()) and check again. Get information on the hightest (best) priority process that is swapped out. If sufficient core is availible as things stand now, go right ahead and swap him in, and go back to the beginning. No free core. We know the size of the process to be swapped in, so loop through the process table looking for "easy core" -- that is, processes which are in core, but are executing a sys-wait call, or are in ptrace-stop state. Swap out "easy core" processes until there are no more of them, or we have swapped out as much space as the process comming in requires. Note that this approach ignores the problem of memory fragmentation, and merely hopes that things will work out. However, the estimation approach reduces the proc table search from order N^2 (old way) to order N, which can be very helpful with large proc tables. So, after swapping out the estmated amount of "easy core", try to swap the best swapped out process on the run queue in (because of the elapsed time, this may no longer be the same process as in the first time through). The code above is repeated. If "enough" easy core could not be found, then we seek the "worst" process in core. If the worst in core is still better than the best on the swap device, forget it, and wait a while. ALSO, if the "worst" process in core has been loaded for less than two seconds, forget it. This is the only anti-thrashing code that we use. INTERACTIONS on MEMORY RICH Systems Where the amount of memory on a system is sufficient to hold most of the members of the run queue, the swapping algorithm above tends to work quite well, clearing memory of "deadwood" process just sitting around in sys-WAIT state, keeping memory utilization at around 75%, which seems about ideal (there is usually space for forks and expands directly in core). This points out a lossage: the memory allocator and expand() don't have the ability to check for additional core just off the end of a process, so a lot of needless core-shuffling happens. This is on the list of things to be fixed. In the memory rich scenario, the scheduler controls the allocation of the CPU, because it can (almost) always find something to run. INTERACTIONS on MEMORY POOR Systems When the memory requirements of the run queue greatly exceed availible memory (a frequent problem on all but 11/44 and 11/70 systems), the allocation of the CPU is controled primarily by the swapper. The scheduler will run whatever is in core, but keeping the right procs in core is the real trick here. Because the swapper is strictly priority based, it will try to keep the "right" processes in core; the only difficult case is when there are several processes exhibiting identical behavior; they will tend to "round robin" at the same priority. Having the decay ratios for incore and onswap different is important here. The 2-second anti-thrashing constant prevents the system from going into a total thrash, causing an orderly "staging" of processes through memory. Furthermore, where ever possible, I try to arrange to have swapping take place on either (a) a high transfer rate disk also used for filesystems or (b) a moderate speed disk dedicated to swapping (RK05, etc). This strategy tends to induce some bandwidth limitations on swapping (~10 swaps/sec for dedicated RK05) so that core shuffling is contained. It is important to note that (in my opinion) the behavior of "staging" processes though memory is the "proper" approach to this problem, if you agree with the premise that the intent of the system is to provide good interactive response, regardless of what happens to "crunchers". While this does tend to keep swapping activity fairly high, it also provides reasonable service to interactive processes, which must be distinguished from traditional thrashing, where little useful work is accomplished because of all the memory shuffling. INTERACTIONS on MEMORY STARVED Systems If there is not enough room for 2 goodly sized user processes in the machine (quite possible for 18-bit address space machines with less than full compliment), and large programs are being run, then there may not be much hope, and the swapping proceedure degenerates into one of "popping" procs in and out of core, one at a time, which is the best that can be done. In a subsequent letter, I'll discuss how the bounds for the three scheduling parameters are derived (and what they are), and how they affect scheduling, swapping, and the interaction between the two. ---------------------------- Briefly, the computation of priority depends on several factors: core usage: every 20 blocks (10Kb) in p_size of core costs 1 "nice" point. block io: every 16 blocks in p_iocnt costs 1 "nice" point. User-supplied NICE factor: 1 "nice" point per point, up or down. CPU Usage: 1 nice point for every 16 ticks (0.256 ms) in p_cpu. DISCUSSION: The above formula is complicated by the fact that p_iocnt and p_cpu are not what they seem. Rather than representing a measure of instantaneous loading, they include a decaying "history" factor as well. For each second that passes, the following computations are made for every process in the system: Every time the value in u.u_utime (User CPU usage, not inclusing u.u_stime, which is the CPU cost of doing I/O) crosses a 1-minute boundary, the p_nice value is irrevocably incremented by 1, to provide a long-range decay of "Number Crunchers". [Every clock tick (0.016 ms) of cpu time is recorded by incrementing p_cpu by one, and the process in marked with the "SWTCHED" flag] Every second the process is in core, the p_time field is incremented by HZ to indicate core-residency time. Value is peak-limited to 90 seconds. Now, to compute the p_cpu "decay" factor, the proper RATE must be selected, as follows: rate = INRATIO iff SWTCHED is set, ie the process has received CPU time in the previous second, rate = NSELRATIO if SWTCHED is not set but SLOAD is, ie the process is loaded, but not run, rate = OUTRATIO if SWTCHED is not set and SLOAD is not set, ie process is swapped out, and got no CPU time. After selecting the proper rate, p_cpu = p_cpu * rate / 100; where the rate is expressed as a percentage (ie between 0 and 100). p_iocnt = p_iocnt * diskratio / 100; So, the entire scheduler really depends on the settings of the four parameters inratio, nselratio, outratio, and diskratio. These factors control the "memory" of the past "behavior" of every process. On the BMD70, these factors are presently set to the following values: inratio=88%, nselratio=70%, outratio=60%, diskratio=40%. These values may be interogated and changed dynamically by using the program "schp". Some important relationships exist between these numbers; for a memory-rich system, inratio > nselratio > outratio, so that the memory of behavior is better for the people in core getting time. This tends to give a share of the machine in the near future to those processes which have not gotten time in the immediate past, helping to level the load. The setting of the diskratio value has not been studied much; the present setting was done mostly by intuition. The current values of the cpu time ratios were originally determined by a simulation of the algorithm, and then modified slightly under actual loading tests. I presented a short paper on this at the 1st Toronto USENIX conference; I'll try to dig up all the details and send them on in (yet) another message. The current algorithm begins to have difficulties when the machine is overcommitted by a factor of four or more, as processes which are marked as "hogs" never get any CPU time at all. It is not clear that the scheduler should handle this case; a bigger computer is a better approach. Best, -Mike
geoff@utcs.UUCP (Geoff Collyer) (12/04/84)
Index: bin/nice.c 4.2BSD
Description:
Despite the claim in nice(1) that the number argument is the
amount by which ``the priority is incremented'', it is actually
presented to setpriority(2) as an *absolute* priority, not an
increment.
Repeat-By:
As an ordinary (non-super) user, type
nice -2 nice -1 date
nice will print
setpriority: Permission denied
Fix:
My fix was to use nice(3c) instead of the overkill of getpriority(2).
Diffs follow:
7,9d4
< #include <sys/time.h>
< #include <sys/resource.h>
<
24,27c20
< if (setpriority(PRIO_PROCESS, 0, nicarg) < 0) {
< perror("setpriority");
< exit(1);
< }
---
> nice(nicarg);
preece@ccvaxa.UUCP (12/06/84)
> My fix was to use nice(3c) instead of the overkill of getpriority(2). > Diffs follow: ------------- Two things bother me about this statement. Shouldn't we really be trying to avoid calls to compatibility routines? Just because Berkeley didn't bother to remove all their own uses of them, shouldn't we try not to introduce any more? Well, I try, anyway. In what sense is using getpriority "overkill"? You must mean that using it is more work for YOU, since it's noticeably less work for the machine. If you use nice(3c) you add another call and then do, inside it, the getpriority call you could have done yourself. And whoever reads your code has to try to remember whether that old nice call was relative or absolute. If you just used getpriority and setpriority it would at least be clear exactly what you were doing. scott preece ihnp4!uiucdcs!ccvaxa!preece
geoff@utcs.uucp (12/10/84)
Description:
Despite the claim in nice(1) that the number argument is the
amount by which ``the priority is incremented'', it is actually
presented to setpriority(2) as an *absolute* priority, not an
increment.
Repeat-By:
As an ordinary (non-super) user, type
nice -2 nice -1 date
nice will print
setpriority: Permission denied
Fix:
My fix was to use nice(3c) instead of the overkill of getpriority(2).
Diffs follow:
7,9d4
< #include <sys/time.h>
< #include <sys/resource.h>
<
24,27c20
< if (setpriority(PRIO_PROCESS, 0, nicarg) < 0) {
< perror("setpriority");
< exit(1);
< }
---
> nice(nicarg);
guy@rlgvax.UUCP (Guy Harris) (12/12/84)
> > My fix was to use nice(3c) instead of the overkill of getpriority(2). > > Diffs follow: > ------------- > Shouldn't we really be trying to avoid calls to compatibility routines? > Just because Berkeley didn't bother to remove all their own uses of them, > shouldn't we try not to introduce any more? Well, I try, anyway. If you're writing a program which should run under V7, and 4.1, and S3, and S5, and... you should use the compatibility routines. > In what sense is using getpriority "overkill"? You must mean that using > it is more work for YOU, since it's noticeably less work for the machine. Given the number of instructions (kernel and user) executed by using the "nice" program, I suspect the extra giftwrapping around "setpriority" provided by the "nice" routine almost completely disappears. It's hardly "noticeable". > If you use nice(3c) you add another call and then do, inside it, the > getpriority call you could have done yourself. And whoever reads your > code has to try to remember whether that old nice call was relative or > absolute. If you just used getpriority and setpriority it would at > least be clear exactly what you were doing. No argument there. I *think* all versions of "nice" after V6 add the value given as an argument to the current "nice", but the V7 manual's wording (the priority is "augmented" by the argument) is a bit vague (I checked the 2.9 code, assuming it was basically the same as V7, and it does work that way); the fact that it is that confusing indicates that "setpriority" is nicer (no pun intended). Guy Harris {seismo,ihnp4,allegra}!rlgvax!guy
geoff@utcs.UUCP (Geoff Collyer) (12/12/84)
Scott Preece wrote in reply to my bug report: > My fix was to use nice(3c) instead of the overkill of getpriority(2). > Diffs follow: ------------- Two things bother me about this statement. Shouldn't we really be trying to avoid calls to compatibility routines? Just because Berkeley didn't bother to remove all their own uses of them, shouldn't we try not to introduce any more? Well, I try, anyway. I'm interested in keeping programs compatible with non-4.2BSD UNIXes. Just because Berkeley isn't and puts nasty words in the manual pages is not reason to stop using the compatible interfaces. In what sense is using getpriority "overkill"? You must mean that using it is more work for YOU, since it's noticeably less work for the machine. If you use nice(3c) you add another call and then do, inside it, the getpriority call you could have done yourself. And whoever reads your code has to try to remember whether that old nice call was relative or absolute. If you just used getpriority and setpriority it would at least be clear exactly what you were doing. I refer to the unnecessary complexity of the set/getpriority interface for doing the simple task of changing the nice of the current priority. setpriority has the ability to request of change of priority for processes selected by process group, user id or process id and this is sufficiently confusing that the person who changed nice(1) to use setpriority got it wrong. And what's wrong with preferring the simpler interface? Incidentally, setpriority is not significantly cheaper than nice, for this job. One extra function call, which hides the baroque complexity of getpriority and setpriority, is certainly worth the cost. I find the use of nice far clearer than the combination of getpriority and setpriority. Anyone who is unfamiliar with nice is unlikely to be capable of writing portable code. People who limit themselves to knowledge of 4.2BSD are severely restricting the portability of any code they write. N.B. to flamers: this doesn't mean that one must learn the rococo intricacies of System V; simple understanding of v7 will suffice.
ado@elsie.UUCP (Arthur David Olson) (12/12/84)
> > My fix was to use nice(3c) instead of the overkill of getpriority(2). > If you just used getpriority and setpriority it would at > least be clear exactly what you were doing. Except, of course, to those of us who lack getpriority and setpriority. (Take my 4.1bsd system, please. :-)) The nice call at least has the advantage of being portable to a larger number of UNIX* (and variant) systems. -- Nice is a Mister Rogers trademark. UNIX is an AT&T Bell Laboratories trademark. -- ..decvax!seismo!elsie!ado (301) 496-5688 DEC, VAX and Elsie are Digital Equipment and Borden trademarks
jonab@sdcrdcf.UUCP (Jonathan Biggar) (12/13/84)
I doubt that Berkely every had plans to remove the nice call in favor of get/setpriority. They just provided an upwards compatible feature and changed the nice system call to a library routine that used get/setpriority. To remove nice would be committing suicide. Jon Biggar {allegra,burdvax,cbosgd,hplabs,ihnp4,sdccsu3}!sdcrdcf!jonab
geoff@utcs.UUCP (Geoff Collyer) (12/16/84)
Guy Harris wrote: > > If you use nice(3c) you add another call and then do, inside it, the > > getpriority call you could have done yourself. And whoever reads your > > code has to try to remember whether that old nice call was relative or > > absolute. If you just used getpriority and setpriority it would at > > least be clear exactly what you were doing. > No argument there. I *think* all versions of "nice" after V6 add the > value given as an argument to the current "nice", but the V7 manual's > wording (the priority is "augmented" by the argument) is a bit vague > (I checked the 2.9 code, assuming it was basically the same as V7, and > it does work that way); the fact that it is that confusing indicates > that "setpriority" is nicer (no pun intended). The V7 nice(2) page does say ``augmented'', but declares the system call as nice(incr) and refers to the argument as an ``increment'' in the next paragraph. More importantly, nice(1) talks about its optional argument as an increment. There's really no ambiguity here. The nice system call (or function on 4.2) does exactly what nice(1) wants. Robert Elz wrote: > The only thing that bothered me about the suggested fix, was that > the test for an error was omitted. That is not a good idea. My suggested fix is the nice call from the V7 nice command, which has the advantage that if you ask for an increment that you aren't allowed, the niced command still runs, unlike the vanilla 4.2 nice command, which prints a message and doesn't run the niced command. -- 8 > 4 + 5.
kre@mulga.OZ (Robert Elz) (12/19/84)
| > My fix was to use nice(3c) instead of the overkill of getpriority(2). | > Diffs follow: | ------------- | Two things bother me about this statement. | | Shouldn't we really be trying to avoid calls to compatibility routines? | Just because Berkeley didn't bother to remove all their own uses of them, | shouldn't we try not to introduce any more? Well, I try, anyway. | | In what sense is using getpriority "overkill"? You must mean that using | it is more work for YOU, since it's noticeably less work for the machine. | If you use nice(3c) you add another call and then do, inside it, the | getpriority call you could have done yourself. And whoever reads your | code has to try to remember whether that old nice call was relative or | absolute. If you just used getpriority and setpriority it would at | least be clear exactly what you were doing. | | scott preece | ihnp4!uiucdcs!ccvaxa!preece Please, NO. That is the way I used to think before I thought about it.... Its wrong! The compatability routines are just that, "compatability" routines. they are not "transition" routines. You should use the low level system calls only when you can demonstrate a real need for them, not just because they happen to be handy. Think what happens when you use "setpriority" when you could have used "nice". Someone who takes your program to a system that doesn't have "setpriority" (deficient as that system may be) has to either understand your code very well, in order to deduce that you were really just doing what "nice" would have done, or they have to emulate "setpriority" on their system. Since setpriority provides facilities to alter the "niceness" of any process, group of processes, or user's processes, that is not likely to be easy to do. So, use the compatability routines (nice, time, alarm, ...) unless you can demonstrate a REAL NEED. If the extra 50us or so that it takes to call the compatibility routine really hurts you, and you can prove it, then go ahead, use the system call (and remember, you can save another 50us or so at each call if you perform the system call in line with asm's :-). Most demonstrable uses will be when the extra functionality is required. This is good, it means that a program that uses these system calls will be visibly non-portable to obsolete/backward UNIX operating systems. So, the nice(1) program should have been left calling nice(3) (or nice(2), whichever), whereas renice(1) should use setpriority(2). Nice(1) should work anywhere, renice(1) only works on 4.2 systems (to implement it on others needs something like the gross butchery that was used to implement it on 4.1). The only thing that bothered me about the suggested fix, was that the test for an error was omitted. That is not a good idea. Robert Elz decvax!mulga!kre * UNIX is UNREGISTERED