[net.unix-wizards] 4.2bsd kernel auto-nicing, scheduling

speck@vlsi.caltech.edu (Don Speck) (02/18/86)

    When I have to kill infinite-looping processes, I usually find
them at nice 0, while much more deserving processes are at nice 4,
getting nowhere.  This is brought about by the formula that the
4.2bsd softclock() routine uses to determine when to "nice" a
process (names simplified for clarity):

	if (uid != 0 && nice == NZERO && utime > 10*60)
		nice = NZERO+4;

Only processes doing "real work" are penalized.  Runaway sendmail's
and syslog's go unbridled, since they are system overhead; likewise
page-thrashers and processes looping on EOF go un-niced, since these
use only system time, no user time (or nearly so).  Processes already
at nice 1 are left alone, so the malicious user can get an advantage
over the "real work" jobs at nice 4 (and magnify that advantage simply
by starting 20 infinite-loop jobs at nice 20, as one user demonstrated;
the load average was so high that the nice 4 job got no runtime).

    No doubt the intention of this code was to improve keyboard
response, but the side-effects are awful:  no one can get any
heavy-duty computing done.  Thus it seems that the code ought to
be removed, but if nothing gets nice'd, response gets pretty slow.

    One possibility would be to make the test unbiased:

	if ( utime+stime > 10*60 && nice >= NZERO && nice < NZERO+4)
		nice = NZERO+4;

but now we're even more likely to zap long-lived login shells
and editors; the cure is no better than the bug.

    What we really want is to favor processes that someone is
waiting for, and disfavor "overnight" types of processes.  The
problem is, how do you tell which is which?  This is what nice(1)
is for, but we've tried waving a carrot (steeply declining cpu
charges for nice'd processes) at the users, and they still don't
use nice.

    Despite lack of mind-reading hardware, the system managers
can often tell when a job is not being waited for:  when the
owner hasn't typed anything for a long time, we renice(8) his
running processes unmercifully.  Complaints, if any, relate
more to perceived uneven application of this policy than to
the policy itself.  Unfortunately, this type of scheduling is
crude and often too-little-too-late.

    I would prefer that the scheduler itself knew something
about interactive processes, and gave them a higher percentage
of the cpu.  Interactive processes are distinguished by a very
bursty pattern of cpu usage:  compute, I/O wait, compute, I/O
wait, .... i.e., frequent waiting for slow events, while processes
that we want to disfavor do not wait for slow events.

    My proposal is a modification to kern_synch.c such that when
a kernel process calls sleep() with pri > PZERO (waiting on a
slow event), the associated user process is given a short-term
boost, probably by forgiving some of p_cpu (recent cpu usage).

    Shells waking up after a command exits would get a boost.
Processes waking up from a read() on a tty or pipe would get
a boost if they did, indeed, have to wait; processes doing
single-character reads would probably find typeahead waiting
for them, not have to go to sleep, and not get the boost.
Disk I/O and paging would NOT get a boost (they are not slow
events).
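
In rough outline, using the same simplified names as the fragment
above, the change might look like this.  The function name, the
had_to_sleep flag, and the fraction forgiven are all placeholders,
not a worked-out patch:

	#define PZERO	25	/* 4.2bsd: sleeps at pri > PZERO are "slow" events */

	struct proc {
		int	p_cpu;		/* recent cpu usage, decayed by schedcpu() */
	};

	/*
	 * Hypothetical hook: run when a process returns from sleep(chan, pri)
	 * and actually had to block.  Waits on slow events earn a short-term
	 * boost by forgiving part of the recent-usage figure.
	 */
	void
	slow_event_boost(struct proc *p, int pri, int had_to_sleep)
	{
		if (pri > PZERO && had_to_sleep)
			p->p_cpu -= p->p_cpu / 2;	/* forgive half; amount is a guess */
	}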

    Does anyone see a way that this could be abused?  I am
reminded of VMS, which gives a short-term priority boost for
doing I/O; doing single-character writes to a tty would make
one's priority soar, so much that no one else could get any cpu.
That wouldn't happen here, since tty output usually just stuffs
another character in the output buffer, with no sleep required,
hence no priority boost granted.  Are there any other ways that
this could go wrong?  Will I cause lots of context switches from
rapidly varying priorities?  Will pipe readers and writers start
preempting each other in a battle of priorities?

    Comment, analysis, suggestions, flames to:

Don Speck	seismo!cit-vax!speck  or  speck@vlsi.caltech.edu

mike@BRL.ARPA (Mike Muuss) (02/18/86)

A long time ago, for V6, I did a pretty snazzy scheduler that
incorporated short-term and medium-term behavior into its
assessment of priorities.  The result was (and still is) PDP-11/70
systems with nearly instant (<200ms) response for "simple" interactions.

These days, I am firmly convinced that the proper way to implement
this would be with a user-mode daemon that collected medium-term
and long-term behavior patterns of processes, folded in issues
such as time of day, number of people on, "demo in progress" flags, etc,
and issued commands to the kernel adjusting various parameters
that would control the kernel's short-term scheduling behavior,
plus also providing per-process control (much like NICE currently,
but not user-changeable, and separate).

Creation of the necessary kernel mechanisms is nearly trivial.
The user-mode "Scheduling Advisor" might well grow into a full
Expert System, but would probably be 90% satisfactory if it just used
half a dozen heuristics, most of which I believe I can articulate.

BRL currently has a low-level effort to implement this type of mechanism,
with two of our GITs (Gurus-in-training) working the problem.  If you
would like to contribute ideas, etc, please send them to <Jcst@BRL>
(Joint CS Team).  If we achieve any kind of success, we will notify
the Unix-Wizards list, so please don't just send "tell me when you are
done" messages.
		Best,
		 -Mike Muuss

thomas@utah-gr.UUCP (Spencer W. Thomas) (02/19/86)

We have disabled this feature here in the past, since we have a tendency
to start an emacs in the morning and use it all day.  With subprocesses
throwing output into windows as fast as they can, it's really easy to
get more than 10 minutes of CPU in a day of work.  All of a sudden your
response slows to a crawl:  "D**m it, my emacs just got auto-niced".  I
also experimented with code that looked at the ratio between system &
user time.  This prevented auto-nicing of emacs, but I'm really not sure
how effective it was otherwise.  It certainly wouldn't have slowed down a
run away sendmail (which seems to be your problem).  The long-crunching
jobs would eventually get niced, which they should be anyway.  In fact,
we ask people with jobs that are going to take a LOOONNGGG time to run
them at nice 19.  This can backfire if a process with a large RSS
(bigger than the physical memory) is running -- you get a very nasty
swapping/thrashing behaviour.  Some kernel constant governing how often
a process can be swapped is apparently too small.
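
(The ratio test mentioned above was along these lines -- a
reconstruction in the simplified style of the fragment at the top of
the thread, with an invented 4:1 threshold, not the actual code:)

	if (uid != 0 && nice == NZERO
	    && utime + stime > 10*60 && utime > 4*stime)
		nice = NZERO+4;		/* only penalize mostly-user-mode compute jobs */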

-- 
=Spencer   ({ihnp4,decvax}!utah-cs!thomas, thomas@utah-cs.ARPA)

sandel@milano.UUCP (02/19/86)

I have implemented auto-renicing as part of the
"assassin" daemon which kills idle logins.  The
renicing actions are configurable as to threshold,
amount of nice, etc.  It is also configurable to
re-nice procs only during certain times of the day
and on certain days.
If anyone is interested, please let me know.
Currently, the assassin only runs on VAXen, but it
shouldn't be too difficult to make it run on other
machines.

-- 
Charles Sandel
    arpa: sandel@mcc.arpa
    uucp: *!ut-sally!im4u!milano!sandel (or  *!ut-sally!charles)
An endangered species: native Austinite

boyd@inset.UUCP (Boyd Roberts) (02/20/86)

    In article <1014@brl-smoke.ARPA> speck@vlsi.caltech.edu writes:
    >
    >    When I have to kill infinite-looping processes, I usually find
    >them at nice 0, while much more deserving processes are at nice 4,
    >getting nowhere...
    >
    >	if (uid != 0 && nice == NZERO && utime > 10*60)
    >		nice = NZERO+4;
    >
    >Only processes doing "real work" are penalized.  Runaway sendmail's
    >and syslog's go unbridled, since they are system overhead; likewise
    >page-thrashers and processes looping on EOF go un-niced, since these
    >use only system time, no user time (or nearly so).  Processes already
    >at nice 1 are left alone, so the malicious user can get an advantage
    >over the "real work" jobs at nice 4 (and magnify that advantage simply
    >by starting 20 infinite-loop jobs at nice 20, as one user demonstrated;
    >the load average was so high that the nice 4 job got no runtime).

It's good to see the "hack it till it works" approach is alive and well.
Given that the Berkeley paging code was written that way, it must be right.

We've also got some really keen heuristics:

    "real work" == large user time

So, from them we can make a few rash generalisations.  With these we can
code up some unworkable mess.

Some fairly viable work has been done on SHARE scheduling on UNIX
in Australia.  You should check it out.  It actually uses some
algorithms, and was even designed.

    "Don't diddle the code.  Choose a better algorithm."


Boyd Roberts

+++
+   ..!mcvax!ukc!inset!boyd
+   boyd@inset.co.uk
+   boyd@basser.oz
+
+++ "Isn't that kind of severe?"

john@frog.UUCP (John Woods, Software) (02/20/86)

> 
>     When I have to kill infinite-looping processes, I usually find
> them at nice 0, while much more deserving processes are at nice 4,
> getting nowhere...
>     What we really want is to favor processes that someone is
> waiting for, and disfavor "overnight" types of processes...
>     I would prefer that the scheduler itself knew something
> about interactive processes, and gave them a higher percentage
> of the cpu....
>     My proposal is a modification to kern_synch.c such that when
> a kernel process calls sleep() with pri > PZERO (waiting on a
> slow event), the associated user process is given a short-term
> boost, probably by forgiving some of p_cpu (recent cpu usage)....

Well, here is (roughly) the scheme UNOS uses.  Processes have three numbers
for priority, their ceiling, floor, and current values (makes sense).  The
following things happen to the current priority value:  when you are stuck
in the ready queue for longer than a certain time, it is incremented some
amount; when you use up your time quantum*, it is decremented. *(The time
quantum is a function of the priority, with low priority numbers getting long
quanta, thus allowing compute jobs to get lots of work done as long as no
high priority jobs hit the ready queue, which will interrupt the low priority
job, whether or not its quantum is up).  Additionally, certain interactions
cause priority boosts, and among them are blocked teletype read or write,
pipe read or write (this boost is currently disabled, since it had some
disagreeable results; try it if you want, but I think you won't like it),
and a tunable credit for disk interaction (0 by default in our boot-time
configuration file, for similar reasons).  In general, the boost for tty
interaction works quite nicely, with the slightly disagreeable result that
uucp tends to become quite a hog (not surprising, it is a hog anywhere it
lives...); this could be eased by dropping its ceiling a few notches.
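
In outline, the bookkeeping described above might look like the
following.  The names, types, and constants are invented for
illustration and are not UNOS source:

	#define MAXPRI		31	/* highest priority number (invented)        */
	#define WAIT_BOOST	2	/* seconds in the ready queue before a boost */

	struct uproc {
		int	pri_ceiling;	/* never raised above this                   */
		int	pri_floor;	/* never lowered below this                  */
		int	pri_cur;	/* current value, between floor and ceiling  */
		long	ready_since;	/* when it last entered the ready queue      */
	};

	/* Raise the priority of a process stuck too long in the ready queue. */
	void
	age_ready(struct uproc *p, long now)
	{
		if (now - p->ready_since > WAIT_BOOST && p->pri_cur < p->pri_ceiling)
			p->pri_cur++;
	}

	/* Lower the priority when a process uses up its whole quantum. */
	void
	quantum_expired(struct uproc *p)
	{
		if (p->pri_cur > p->pri_floor)
			p->pri_cur--;
	}

	/* Quantum length as a function of priority: low priority, long quantum. */
	int
	quantum_for(int pri)
	{
		return 2 + 4 * (MAXPRI - pri);	/* ticks; shape of the curve is a guess */
	}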


--
John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw%mit-ccc@MIT-XX.ARPA

This space dedicated to Challenger and her crew,
Francis R. Scobee, Michael J. Smith, Ellison S. Onizuka, Judith Resnik,
Ronald E. McNair, Gregory B. Jarvis, and Christa McAuliffe.

"...and slipped the surly bonds of Earth to touch the face of God."

j@utah-cs.UUCP (J Lepreau) (02/26/86)

In the Denver Usenix proceedings is a paper by Jeffrey Straathof et al
of the Univ of Maryland which describes a new scheduler for BSD Unix
which they claim is both much superior and much smaller than the 4.2 or
4.3 ones.  I haven't yet read it, and didn't hear the talk.  The address
is Dept of Computer Science, Univ of Maryland, College Park, MD  20742.
Now why haven't we heard from Chris Torek about this?

henry@utzoo.UUCP (Henry Spencer) (02/26/86)

> Some fairly viable work has been done on SHARE scheduling on UNIX
> in Australia.  You should check it out.  It actually uses some
> algorithms, and was even designed.

Sounds very interesting.  How about some >>references<<?  To published
papers, not unobtainable internal tech reports!
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

chris@umcp-cs.UUCP (Chris Torek) (02/27/86)

In article <3697@utah-cs.UUCP> j@utah-cs.UUCP (Jay Lepreau) writes:

>In the Denver Usenix proceedings is a paper by Jeffrey Straathof et al
>of the Univ of Maryland which describes a new scheduler for BSD Unix
>[...] Now why haven't we heard from Chris Torek about this?

Because I am not involved in the project.  I would rather let Jeff
speak for himself.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1415)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu

jeff@umcp-cs.UUCP (Jeff Straathof) (02/27/86)

In article <3375@umcp-cs.UUCP> chris@umcp-cs.UUCP (Chris Torek) writes:
>...  I would rather let Jeff speak for himself.

Here I speak.  Let me give a brief description of how Maryland's new UNIX
scheduler works and then tell what I've done with it since the USENIX
proceedings paper.

The new scheduler employs a multilevel feedback run queue with round-robin
scheduling in each level.  Quantum sizes vary with each level; a quantum
starts when a process gets control of the cpu.  Processes get preempted by
higher-priority processes leaving blocked states.  The priority of a process
directly determines the level of the run queue in which it should reside;
the one with the highest priority gets serviced first.

The priority of a process is determined by what it does.  Priority boosts are
given for socket input, terminal input (specific to mode) and disk input,
among other minor things.  Priority bumps are given for quantum expirations.
Each process has a priority maximum and priority minimum limiting the range
of its priority; both values are inherited from its parent.

The quantum sizes, boost amounts, and bump amount can be tuned per
configuration.  The priority maximum and priority minimum of any running
process can be changed, thus providing external resource control.
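
As a rough illustration of that structure (not the Maryland code
itself; the level count, quantum sizes, and boost/bump amounts below
are all invented):

	#define NLEVEL	8		/* run-queue levels; count is assumed          */

	struct mproc {
		int	pri;		/* current level, 0..NLEVEL-1, high runs first */
		int	pri_max;	/* per-process limits, inherited from the      */
		int	pri_min;	/*   parent, changeable for resource control   */
	};

	/* Tunables, one set per configuration (values and shape are guesses). */
	int	quantum[NLEVEL] = { 12, 10, 8, 7, 6, 5, 4, 2 };	/* ticks per level */
	int	tty_boost   = 3;	/* boost on terminal input                     */
	int	disk_boost  = 1;	/* boost on disk input                         */
	int	expire_bump = 1;	/* demotion on quantum expiration              */

	static int
	clamp(struct mproc *p, int pri)
	{
		if (pri > p->pri_max)
			pri = p->pri_max;
		if (pri < p->pri_min)
			pri = p->pri_min;
		return pri;
	}

	void
	boost(struct mproc *p, int amount)	/* on leaving a blocked state */
	{
		p->pri = clamp(p, p->pri + amount);
	}

	void
	expire(struct mproc *p)			/* quantum ran out */
	{
		p->pri = clamp(p, p->pri - expire_bump);
	}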

The new scheduler has not been subjected to any rigorous performance
testing.  Its code is much cleaner than that of the previous scheduler, so
its throughput should be better.  It is obvious to users of a heavily
loaded machine that the scheduler distinguishes between interactive and
cpu-bound processes extremely well.  As a matter of fact, I am writing this
on a heavily loaded machine not running my scheduler and am getting pretty
annoyed.  The persons running their troffs of course don't have the same
feelings.

Since the last USENIX conference, I've fixed up some of the code and prepared
it for distribution.  I even dug up our old 4.2 code and reinstalled the
scheduler in that.  To tune my scheduler and to find out the real performance
differences between it and the standard scheduler, I have developed a pretty
snazzy remote terminal emulator to run some benchmarking.  A description of
the emulator and the results of the testing will hopefully make it to the next
USENIX conference.

The scheduler code will be available very soon for both 4.2 and beta 4.3 BSD.
It's great for those who don't think the standard scheduler is what they
need.  It's a must for those doing their own scheduling work who want to avoid
learning from scratch.  Send me mail if you want more information.

-- 
Spoken: Jeff Straathof 	ARPA:	jeff@mimsy.umd.edu	Phone: +1-301-454-7690
CSNet:	jeff@umcp-cs 	UUCP:	seismo!umcp-cs!jeff
USPS: Comp Sci Dept, Lab for Parallel Comp, Univ of Md, College Park, MD 20742

guy@sun.uucp (Guy Harris) (03/01/86)

> Priority boosts are given for socket input, terminal input (specific to
> mode)...

Unfortunately, the terminal mode doesn't tell you as much as you might
think.  The distinction between screen editors and programs like "more" is
*not* that the former run in RAW mode and the latter run in CBREAK mode. The
bulk of "full-screen" programs run in CBREAK mode on systems running
V7-descended terminal drivers; there is no way to distinguish between "vi"
and "more" simply on the basis of the mode it puts the terminal into.
(EMACS may put the terminal into RAW mode, but that's because it was the
only way to get an 8-bit data path to use with terminals with META keys that
turn the 8th bit on.  If it merely wanted to disable XON/XOFF flow control,
it could have set the start and stop characters to '\377'.)  It would also
be interesting to see what effect treating programs running in RAW mode as
screen editors would have on programs like "uucico" (which are the sort of
programs RAW mode was intended for - programs which want a raw 8-bit data
path for data transfer, not screen editors).

Furthermore, the System V driver has a cleaner "ioctl" interface; you don't
have all-inclusive mode bits like RAW and CBREAK, you have particular
processing operations which you turn on and off.  It's not clear how you'd
distinguish between RAW and CBREAK mode in this context.  (The scheduler
work in that paper is not just of interest to 4.[23]BSD users; all UNIX
variants since V7 use similar schedulers with similar deficiencies.)

It would be better to attempt to come up with a classification of processes
as "extremely interactive", "very interactive", and "interactive" based on
some measure independent of the details of the terminal driver's "ioctl"s,
rather than an *ad hoc* classification based on the mode the author of the
program the process is running chose to use.  (The current scheme gives an
incentive to rewrite EMACS to use CBREAK mode wherever possible, since you
get a bigger priority boost when a read completes than you do in RAW mode.)

A first cut at this might be based on the real time, or process virtual
time, interval between blocking reads from the terminal, with longer
intervals giving rise to greater priority boosts.  This seems to match the
intent of the RAW/CBREAK/cooked split, as described in the paper: "extremely
interactive" processes, which do only a little work between keystrokes (e.g.
a screen editor updating the current line) get small boosts; "very
interactive" processes, which do more work between keystrokes (e.g. "more",
or "ed", or even "vi" or EMACS, when used as a file browser) getting larger
boosts; and "interactive" processes, which do large amounts of work between
keystrokes (e.g., a shell) getting still larger boosts.  (Note that a
process can provide a different sort of interactive response at different
times, depending on what sort of work it's doing in response to keystrokes;
a screen editor can get different priority boosts depending on whether the
user is typing text in or doing an interactive global search and replace.)
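
Expressed as code, that first cut might look like the following; HZ,
the thresholds, and the boost amounts are invented for illustration:

	#define HZ	100	/* clock ticks per second (assumed) */

	/*
	 * Boost earned when a blocking terminal read completes, as a function
	 * of how long (real or process-virtual time, in ticks) the process ran
	 * since its previous blocking read: the more work between keystrokes,
	 * the larger the boost.
	 */
	int
	tty_read_boost(long ticks_since_last_read)
	{
		if (ticks_since_last_read < HZ / 10)
			return 1;	/* "extremely interactive": screen editor   */
		if (ticks_since_last_read < HZ)
			return 2;	/* "very interactive": more, file browsing  */
		return 3;		/* "interactive": shell running commands    */
	}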

This area of the new scheduler needs more work.  The requirements weren't
really stated; what was the rationale for the three-way classification of
processes, and why give the most interactive processes the least boost?
(For instance, is it intended that an interactive process should always get
placed at approximately the same priority relative to its "priority-max"
when it wakes up from a terminal I/O wait, so that a process which does
little work between terminal I/O waits needs a smaller boost than a process
which does a fair bit of work, since it has had fewer quantum expirations
and has had its priority lowered fewer times?)  The choices sound
reasonable, but I'm curious why those were the choices made.

Also, is there a danger of starvation with this scheduler?  Could several
people banging away at screen editors keep their process' priorities at
their "priority-max", with the risk that there could always be at least one
of those processes eligible to run, thus locking out all other processes?
On a large system with many users logged in, it seems possible that enough
users could be typing at any given time that one might always be eligible to
run.  This might actually be what is desired, but if there are situations
where it is not desired some mechanism (perhaps some form of "aging") will
have to be provided to prevent it.
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.arpa	(yes, really)

jeff@umcp-cs.UUCP (Jeff Straathof) (03/03/86)

In article <3303@sun.uucp> guy@sun.UUCP writes:
>... there is no way to distinguish between "vi"
>and "more" simply on the basis of the mode it puts the terminal into.

True that they both operate in CBREAK mode, but "vi" does a lot more input
than "more" does, that's how the difference in priorities arises.

>(EMACS may put the terminal into RAW mode, but that's because ...

Good point about EMACS.  Dividing interactive processes on the basis of mode
was just a choice.  With the current implementation of the scheduler, it is
simple to disregard the mode by making the various boost values for terminal
input equal.

>It would also be interesting to see what effect treating programs running in
>RAW mode as screen editors would have on programs like "uucico"...

"Uucico" can definitely be a problem.  What will probably be required of
it is a voluntary call to setminmax() to lower its priority.  I haven't done
it yet but probably will.  Thanks to Chris Torek for originally pointing it
out to me and thanks to Guy for pointing it out to everyone else.

>Furthermore, the System V driver has a cleaner "ioctl" interface; you don't
>have all-inclusive mode bits like RAW and CBREAK, you have particular
>processing operations which you turn on and off.  It's not clear how you'd
>distinguish between RAW and CBREAK mode in this context.

I have never worked with System V so don't know the answer.  I might not
know the answer even if I had worked with it.  The key with this scheduler
is that some of the boosting has been moved to the terminal driver.  The
implementation makes it easy to change the criteria for receiving terminal
input boosts.  I considered several different ones and chose just one to
implement.  I've got others in mind though.

>(The scheduler
>work in that paper is not just of interest to 4.[23]BSD users; all UNIX
>variants since V7 use similar schedulers with similar deficiencies.)

Glad to hear others agree.

>It would be better to attempt to come up with a classification of processes
>as "extremely interactive", "very interactive", and "interactive" based on
>some measure independent of the details of the terminal driver's "ioctl"s,
>rather than an *ad hoc* classification based on the mode the author of the
>program the process is running chose to use.

Perhaps it would, but remember that whatever mode the author chooses, the
process gets no boost unless it actually does the input.  That fact is more
important than the ad-hoc mode classifications.

>(The current scheme gives an
>incentive to rewrite EMACS to use CBREAK mode wherever possible, since you
>get a bigger priority boost when a read completes than you do in RAW mode.)

Any full screen editor, regardless of the mode in which it operates, does
so much input that it stays in the top levels of the run queue in the current
implementation.

>A first cut at this might be based on the real time, or process virtual
>time, interval between blocking reads from the terminal, with longer
>intervals giving rise to greater priority boosts.  This seems to match the
>intent of the RAW/CBREAK/cooked split, as described in the paper...

Not really.  A scheduler-buster could then compile the file between inputs,
something we wanted to avoid.

>... "extremely
>interactive" processes, which do only a little work between keystrokes (e.g.
>a screen editor updating the current line) get small boosts; "very
>interactive" processes, which do more work between keystrokes (e.g. "more",
>or "ed", or even "vi" or EMACS, when used as a file browser) getting larger
>boosts; and "interactive" processes, which do large amounts of work between
>keystrokes (e.g., a shell) getting still larger boosts.  (Note that a
>process can provide a different sort of interactive response at different
>times, depending on what sort of work it's doing in response to keystrokes;
>a screen editor can get different priority boosts depending on whether the
>user is typing text in or doing an interactive global search and replace.)

Once an editor begins to do a global search-and-replace, it becomes less
of an interactive process by my classifications.  The other editor users
should get quicker service if they are just doing inserts when you are doing
your global replace.  You should still get much quicker service than the
remaining processes because you started out in that top level and won't
drop many levels before requiring input again.  For all of this to work,
the scheduler should be tuned properly.

Boosts for "extremely" interactive processes are small because they get so
many of them.  It probably doesn't matter what the amount is as long as it
is bigger than 0.  Boosts for the "very" interactive processes that do things
like file browsing and news reading are larger because fewer of them occur and
more cpu service is required to get the next batch (perhaps a screenful) of
output prepared.  If they didn't get the larger boost, they might fall into the
lower levels with the compilers and not give the user the small response time
for which we were looking.  Boosts for interactive processes like shells are
larger to give the command it might fork a chance to show its interactive
nature.  If the child, or the process itself, isn't interactive it will drop
through the levels quickly.

My intentions might be met by just giving an equal boost for any type of
terminal input and letting the quantum sizes do the rest.

>This area of the new scheduler needs more work.  The requirements weren't
>really stated; what was the rationale for the three-way classification of
>processes, ...

I just wanted to try to properly assign priorities among the interactive
processes themselves as the paper stated.  There are several different methods
of doing this each with tradeoffs between effectiveness and cost.  I chose
the 'mode method' because I thought it would be fairly effective and because
it would incur very little cost.

>...and why give the most interactive processes the least boost?
>(For instance, is it intended that an interactive process should always get
>placed at approximately the same priority relative to its "priority-max"
>when it wakes up from a terminal I/O wait, so that a process which does
>little work between terminal I/O waits needs a smaller boost than a process
>which does a fair bit of work, since it has had fewer quantum expirations
>and has had its priority lowered fewer times?)  The choices sound
>reasonable, but I'm curious why those were the choices made.

Correct.  Perhaps my explanation of the differing boost amounts should go
here.  Sorry about leaving all of it out of the paper.

>...is there a danger of starvation (by many editors) with this scheduler...

Definitely.  We thought about it beforehand but didn't know the answer.  We
firmly believed the effort had enough potential merit to at least try out.
The scheduler does maintain a small and stable response time for interactive
processes regardless of system load (determined by use only).  Obviously
something has to give during peak hours, and it's the troffs and compilers
that do.  Though they won't get starved if the machine is powerful enough,
they will if it isn't.

>...(having editors lock out others)... might actually be what is desired...

It was what was desired by us.  The goals of the new scheduler placed
interactive responsiveness as the most important.  Not having that particular
responsiveness during peak hours was a major motivation for the research.
I suspect that users will learn about the best times to compile and about
the best times to look their code over twice before compiling.  As it stood
before, the best time to compile was always now!  That policy was getting
quite aggravating.

>... but if there are situations where it is not desired some mechanism
>(perhaps some form of "aging") will have to be provided to prevent it.

If you want to avoid that situation, tune the scheduler that way.  Make the
priority boosts smaller for terminal input, make the quantum sizes at the top
of the queue smaller, and make the quantum expiration bump larger.  All of that
might not work if you have so many editors running, but then your machine
might not be powerful enough to handle the workload people are placing on it.
Interactive responsiveness was our major goal, and avoiding a form of "aging"
in BSD was what provoked this good discussion some time ago.

The design of the scheduler is quite simple.  I imagine that it is discussed
in most introductory OS books.  The simplicity unveils cases like uucico
that break the intent, but the simplicity also provides opportunities for
experimentation and success.  Many ideas people have about scheduling can
be tried out with simple modifications to this implementation.  One I will
try next is that of the "base+boost" method.

Good tools for measuring the performance of such schedulers are a must, and
they are what I am working on now (along with preparing everything for
distribution, argh).  Once I get everything done, the results will be made
available.

Thanks to Guy Harris for his comments.

-- 
Spoken: Jeff Straathof 	ARPA:	jeff@mimsy.umd.edu	Phone: +1-301-454-7690
CSNet:	jeff@umcp-cs 	UUCP:	seismo!umcp-cs!jeff
USPS: Comp Sci Dept, Lab for Parallel Comp, Univ of Md, College Park, MD 20742

guy@sun.uucp (Guy Harris) (03/03/86)

> Some fairly viable work has been done on SHARE scheduling on UNIX
> in Australia.  You should check it out.  It actually uses some
> algorithms, and was even designed.

Given that I don't speak English or Australian, just whatever we speak here
in the USofA (no, it's not American, considering Canadians don't speak
exactly the same language either, and they're (North) Americans as well),
what do you mean by "algorithms" here?  Over here, we tend to think
"algorithm" as meaning any procedure of the sort executed by a computer,
whether it's well thought-out or specified or not.  You may think of
auto-nicing as a hack (I certainly do), but by my definition the procedure
that implements it certainly qualifies as an algorithm....
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.arpa	(yes, really)

root@topaz.RUTGERS.EDU (Charles Root) (03/03/86)

You might be interested in our experience with our DEC-20's.  The
default DEC-20 scheduler sounds a lot like what you have done.  In
particular, it has heuristic boosts for terminal input, and other
things designed to favor interactive jobs.  Like many other
universities, we have a heavily loaded timesharing system used for
students.  What we found was that under heavy load Emacs ran just
great, but the first time you tried to compile, you might as well go
out for dinner (multi-course, in your favorite New York restaurant).
Since all of our students eventually did a compilation, the results
were a disaster.  We finally ended up using the "class scheduler".
This is a scheduler that keeps track of what average share of the CPU
is going to each user.  It gives boosts or penalties depending upon
whether you are getting more or less than your fair share.  Emacs ran
a bit slower (though there was still some interactive boost), but
people got their work done.  Eventually, we ended up moving our two
research machines to the same scheduler, because we started running
into the same thing there, too.  I concluded that what most people
really expect to get out of a timesharing system is 1/N of the
machine, where N is the load average.
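
The core of such a class scheduler is a per-user fair-share adjustment
along these lines (just the shape of the idea, not the TOPS-20 code;
thresholds and amounts are invented):

	struct ushare {
		double	recent_cpu;	/* decayed average of the user's recent cpu use */
		double	fair_share;	/* e.g. 1.0 / (number of active users)          */
	};

	/*
	 * Positive return: boost this user's processes; negative: penalize them.
	 */
	int
	share_adjust(struct ushare *u)
	{
		double ratio = u->recent_cpu / u->fair_share;

		if (ratio < 0.5)
			return 2;	/* well under fair share */
		if (ratio < 1.0)
			return 1;
		if (ratio < 2.0)
			return -1;	/* over fair share       */
		return -2;
	}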

chris@umcp-cs.UUCP (Chris Torek) (03/03/86)

Well, I guess it is time for me to step into the fray at last.

In article <4516@topaz.RUTGERS.EDU> root@topaz.UUCP (probably Chuck Hedrick)
writes:

>... The default DEC-20 scheduler sounds a lot like what [Jeff et
>al.] have done. ... we found was that under heavy load Emacs ran
>just great, but the first time you tried to compile, you might as
>well go out for dinner....

Funny thing, just a few hours ago I was talking to one of the local
grad student hackers here, and mentioned that I was afraid that we
might start to see the same thing.

>We finally ended up using the "class scheduler".  [It] keeps track
>of what average share of the CPU is going to each user.  It gives
>boosts or penalties depending upon whether you are getting more or
>less than your fair share.
>
>I concluded that what most people really expect to get out of a
>timesharing system is 1/N of the machine, where N is the load
>average.

That was what I said *I* wanted, at any rate.  If the load is 10,
I expect to be able to get ~1/10th of the machine myself.  If I
choose to spend it on compilations, I *still* expect to get `my'
share of the CPU power of the machine.  Of course, if I start
editing at the same time, I would rather my editor get most of my
1/10th, not my compile, which indicates that event-based scheduling
is still a good idea.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1415)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu

wagner@utcs.uucp (Michael Wagner) (03/06/86)

I've just been catching up on unix-wizards, which I haven't looked at in
a loooong time, and I found this discussion of 4.2 scheduling.  
Interestingly, the Maryland stuff sounds *very* close to the 
(rather heuristic) scheduler in CP (the 'monitor', if you will) of VM.
It (basically) has:
1) a starting priority for every user
2) no explicit upper/lower limits per user, but certain other constructs
   generate something like a probability cloud around that starting 
   priority
3) a number of queues that users reside on.  Aside from the ones that 
   designate resource unavailability (notably storage overcommitment),
   there are basically 3 queues for 'runable' users, called (with typical
   lack of imagination) Q1, Q2 and Q3.  They correspond to very interactive,
   somewhat interactive, and batch (by batch I mean it runs to completion
   once started, not that you send it off elsewhere).  In the HPO
   (High Performance Option) variant of the product, the issue is clouded
   (but vastly improved, I'm told...I'll shortly be able to report on this)
   by the splitting of each of these into waiting-for-CPU-dispatch and
   waiting-for-other-quick-services (page fault, etc) queues.  They are
   informally called the real-run-list and the run-list, respectively.
   I can't recall the formal names right now.  Interrupts move users
   between the run-list and the real-run-list, and only the real-run-list
   need be scanned on most redispatches.  
4) A transaction can be loosely defined as the time between pressing the
   enter key, and getting the last of the results.  (for those who don't
   know, S/370 moves terminal interaction (irrevocably) into front-end
   hardware, so there is no raw-mode equivalent.  The entire input is
   presented to the mainframe in one interrupt when it is completed.
   Completion is signalled by a special key (enter).  Some (limited)
   local editing is possible before hitting enter.)  When a transaction
   is started, it is given a 'deadline' based on the basic priority of
   the user, some acknowledgement of the load average (called the expansion
   factor in CP), and an initial (recent-history-based, I think) 
   characterization of the transaction.  After that, the run lists are
   sorted and serviced in 'deadline' order.  This effectively prevents
   indefinite postponement (but, as pointed out in earlier postings, 
   strategies against indefinite postponement are rendered ineffective when 
   faced with inadequate hardware).
5) While a transaction is 'runable', it is thrown into the pool and gets
   time-slices in rough proportion to its deadline priority.  The short-
   term scheduler strategy is a modified round-robin (I think someone once
   called it a round-turkey :-) ).  If a transaction sucks up enough 
   resource, the scheduler decides it was wrong to characterize this 
   transaction as interactive, and it moves it to Q2 (which also has the
   effect of recalculating its deadline, because it has now shown itself
   to be less interactive).  A similar shift can occur Q2->Q3.  Living in
   'lower' queues has the effect of getting you larger slices of resource,
   but less frequently.
There's more, but you get the point (and this is getting long).  One of
the things I found amazing (coming from a background with more 
sophisticated schedulers) was how well this all works, in the face of
tremendous loads.  We used to run 3 large UTS systems (a 370 UNIX-like
system) and a smattering of little CMS users (mostly development and
maintenance) on our system.  Even when the machine ran flat out, editing
response for CMS users was unaffected by the load.
(There were other problems, but they basically weren't CP's fault, and
are really another story).
As another example, we recently disconnected one channel path to DASD
(in order to provide a test path for a new system being built concurrently
with production).  That cuts our access to disk in half.
No one seems to notice in their editing response.  I can see it in the
performance reports, but it's minimal.  I can also now flog the system
with artificially constructed workloads and actually affect other users
(something I basically couldn't do when both channels were connected).
I think we're seeing the effect of inadequate hardware there, though, and
not bad scheduling decisions.

One place where this all falls down is that the scheduler basically makes
only CPU dispatching decisions. It has no influence on I/O queue decisions,
nor paging queue decisions.  This all works if you have overdesigned your
I/O configuration relative to the rest of the machine, but would fail 
badly otherwise.  This is relatively easy to do with 370 arch. I/O gear,
but (seemingly) much harder on VAXEN.  I'm curious how this is compensated
for in scheduling for 4.2.

Well, enough for now.  I don't know how interested people on this list are
in hearing about work being done on other systems (especially blue ones :-)).


Michael Wagner (wagner@utcs)

jbuck@epimass.UUCP (Joe Buck) (03/08/86)

In article <3311@sun.uucp> guy@sun.uucp (Guy Harris) writes:
>> Some fairly viable work has been done on SHARE scheduling on UNIX
>> in Australia.  You should check it out.  It actually uses some
>> algorithms, and was even designed.
>what do you mean by "algorithms" here?  Over here, we tend to think
>"algorithm" as meaning any procedure of the sort executed by a computer,
>whether it's well thought-out or specified or not.  You may think of
>auto-nicing as a hack (I certainly do), but by my definition the procedure
>that implements it certainly qualifies as an algorithm....

I don't believe it!  I, a mere mortal, get to correct Guy Harris :-).
Though you frequently see the word "algorithm" associated with schedulers,
they are not algorithms, because they don't terminate (at least they aren't
supposed to).  Knuth (who is "over here") lists the following properties of
algorithms:

Finiteness: 	always terminates in a finite number of steps
Definiteness:	each step is precisely defined
Input:		zero or more inputs
Output:		one or more outputs
Effectiveness:	"... all of the operations to be performed ... can in
		 principle be done exactly and in a finite length of
		 time ... using pencil and paper"

Knuth uses the term "computational method" to describe things that lack
finiteness.

Other writers besides Knuth also give finiteness as a criterion.  No one
seems to require that the algorithm be particularly efficient or
well-designed.  But I'd go along with the original writer (I don't know
who he was) up to a point: OSes that have been hacked until they work,
sort of, may lack definiteness.  Sure, there is a definite sequence of
instructions being executed, but who knows what it is.

Of course, the function that decides the next process to be executed at
a given time is (I hope!) an algorithm.
-- 
- Joe Buck <ihnp4!pesnta!epimass!jbuck>

sbs@valid.UUCP (Steven Brian McKechnie Sargent) (03/09/86)

> > Some fairly viable work has been done on SHARE scheduling on UNIX
> > in Australia.  You should check it out.  It actually uses some
> > algorithms, and was even designed.
> 
> Given that I don't speak English or Australian, just whatever we speak here
> in the USofA (no, it's not American, considering Canadians don't speak
> exactly the same language either, and they're (North) Americans as well),
> what do you mean by "algorithms" here?  Over here, we tend to think
> "algorithm" as meaning any procedure of the sort executed by a computer,
> whether it's well thought-out or specified or not.  You may think of
> auto-nicing as a hack (I certainly do), but by my definition the procedure
> that implements it certainly qualifies as an algorithm....
> -- 
> 	Guy Harris
> 	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
> 	guy@sun.arpa	(yes, really)

*** REPLACE THIS LINE WITH YOUR MESSAGE ***

You twit.

jack@boring.uucp (Jack Jansen) (03/09/86)

This discussion brings a question to mind: Has anyone
ever tried to do I/O scheduling?

It seems to me that it would be nice if I/O for interactive
processes (reading editor buffers, paging) were done at a higher
priority.

Did anyone do any experiments in this direction? If so, what
were the results?
-- 
	Jack Jansen, jack@mcvax.UUCP
	The shell is my oyster.

boyd@inset.UUCP (Boyd Roberts) (03/10/86)

  >> Some fairly viable work has been done on SHARE scheduling on UNIX
  >> in Australia.  You should check it out.  It actually uses some
  >> algorithms, and was even designed.
  >
  >Given that I don't speak English or Australian, just whatever we speak here
  >in the USofA (no, it's not American, considering Canadians don't speak
  >exactly the same language either, and they're (North) Americans as well),
  >what do you mean by "algorithms" here?

By "algorithm" I mean something that has been designed.  It was not
used in the definitive sense.  Or,

	Hack != Algorithm

Yes, there is a paper.  And, there should be a new one.  Ask "piers@basser.oz".

A "hack" is never a "solution", it's a "rabid mess".


Boyd Roberts

+++
+   ..!mcvax!ukc!inset!boyd
+   boyd@inset.co.uk
+   boyd@basser.oz
+
+++ "Isn't that kind of severe?"

wagner@utcs.uucp (Michael Wagner) (03/10/86)

In article <172@epimass.UUCP> jbuck@epimass.UUCP (Joe Buck) writes:
>I don't believe it!  I, a mere mortal, get to correct Guy Harris :-).

I don't think so.  At least not this time.

>Though you frequently see the word "algorithm" associated with schedulers,
>they are not algorithms, because they don't terminate (at least they aren't
>supposed to).

I certainly think schedulers terminate.  There might be some confusion about
either what a scheduler is or what time scale to examine their behaviour
on.  Schedulers that I am familiar with are entered to decide who to run
next, and return (i.e. terminate) with a recommendation.  If it were 
otherwise, you'd never get any useful work done.  The scheduler does,
usually, keep some sort of historical data around to assist in its
decisions, but that doesn't change the fact that it doesn't run all the
time.

>
>Of course, the function that decides the next process to be executed at
>a given time is (I hope!) an algorithm.
>-- 
>- Joe Buck <ihnp4!pesnta!epimass!jbuck>


The function that decides the next process to be executed at a given time
is (I hope) a scheduler!

Michael Wagner (wagner@utcs)