[comp.sys.mips] Memory utilization & inter-process contention

lgy@blake.acs.washington.edu (Laurence Yaffe) (08/22/89)

     After all the talk in comp.arch about cache and memory system design
at the hardware level, how about some discussion on higher level issues
relating to memory utilization, process scheduling, or resource contention?

    A lot of current machines, such as my MIPS box, have pretty standard
virtual memory systems: demand paging is used to allocate real memory to
individual processes, with some sort of least-recently-used algorithm
employed to select pages to write out (or forget) when virtual memory use
exceeds real memory.  This seems to work reasonably well if your typical
job mix consists of numerous small processes.  But severe, and unnecessary,
performance degradation can occur if you have several processes which have
very poor locality of reference when accessing their data, are sufficiently
small that any single process can fit in real memory, and yet sufficiently
large that no two processes will fit in real memory.

    As a concrete example (typical of what I've been trying to run lately),
suppose you have a 32 Mbyte system and that there are only two processes
running, each of which uses 25 Mbytes of virtual memory (almost all data space)
with rapid, random access patterns.  (I.e., normal behavior for a program like
Mathematica.)  If only one job were present, it would fit in real memory and
the machine would run at virtually 100% cpu utilization.  With two such jobs
in a demand-paged environment what appears to happen is that each process
typically gets enough real memory so that it denies the other process 
sufficient memory to run effectively - i.e., both processes continually
page fault and cpu utilization plummets.
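
    To put rough numbers on it, here's a little back-of-envelope model
(my own simplification, not a measurement - it assumes 4 Kbyte pages,
uniformly random data references, and an even split of frames under
global replacement):

#include <stdio.h>

int main(void)
{
    long page   = 4096;                         /* bytes per page (assumed) */
    long frames = 32L * 1024 * 1024 / page;     /* ~8192 physical frames */
    long job    = 25L * 1024 * 1024 / page;     /* 6400 pages per process */

    /* One job alone: it fits, so random references essentially never
     * fault once the job is paged in. */
    double alone = (job <= frames) ? 0.0 : 1.0 - (double)frames / job;

    /* Two jobs under global replacement: each ends up with roughly half
     * the frames, and a uniformly random reference misses with
     * probability 1 - resident/total. */
    double shared = 1.0 - (double)(frames / 2) / job;

    printf("one job : fault probability per data reference ~ %.2f\n", alone);
    printf("two jobs: fault probability per data reference ~ %.2f\n", shared);
    /* With two jobs, roughly a third of all references fault; at a few
     * milliseconds per fault the CPU is idle almost all the time. */
    return 0;
}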

    This sort of contention could obviously be prevented by a smarter
scheduling technique - something which would entirely swap out one
process for a period of time in order to let the other process run.

Questions:

    Which (if any) high-performance Unix systems are capable of avoiding
this type of inter-process contention?  What works best - actual process
swapping layered on top of demand paging, better paging algorithms,
clever scheduling based on adaptive process priorities, ... ?

    How common is this type of problem?  Anyone else share my feeling that
many Unix systems could do a lot better at handling large processes?
Anyone else share my feeling that this is important (and sadly neglected)?

-- 
Laurence G. Yaffe		Internet: lgy@newton.phys.washington.edu
University of Washington	  Bitnet: yaffe@uwaphast.bitnet

henry@utzoo.uucp (Henry Spencer) (08/22/89)

In article <3332@blake.acs.washington.edu> lgy@newton.phys.washington.edu (Laurence Yaffe) writes:
>suppose you have a 32 Mbyte system and that there are only two processes
>running, each of which uses 25 Mbytes of virtual memory (almost all data space)
>with rapid, random access patterns...
>    This sort of contention could obviously be prevented by a smarter
>scheduling technique - something which would entirely swap out one
>process for a period of time in order to let the other process run.

Have you ever figured out how long it takes to swap out a 25MB process?
Or to swap it back in again?  The fact is, this example system simply
does not have enough main memory for its workload.  The fix is, either
double the memory or run the jobs in sequence rather than simultaneously.

"Real memory for real performance." -- Mike Tilson
-- 
V7 /bin/mail source: 554 lines.|     Henry Spencer at U of Toronto Zoology
1989 X.400 specs: 2200+ pages. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

lgy@blake.acs.washington.edu (Laurence Yaffe) (08/23/89)

In article <1989Aug22.163100.25540@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <3332@blake.acs.washington.edu> lgy@newton.phys.washington.edu (Laurence Yaffe) writes:
>>suppose you have a 32 Mbyte system and that there are only two processes
>>running, each of which uses 25 Mbytes of virtual memory (almost all data space)
>>with rapid, random access patterns...
>>    This sort of contention could obviously be prevented by a smarter
>>scheduling technique - something which would entirely swap out one
>>process for a period of time in order to let the other process run.
>
>Have you ever figured out how long it takes to swap out a 25MB process?
>Or to swap it back in again?  The fact is, this example system simply
>does not have enough main memory for its workload.  The fix is, either
>double the memory or run the jobs in sequence rather than simultaneously.
>
>V7 /bin/mail source: 554 lines.|     Henry Spencer at U of Toronto Zoology
>1989 X.400 specs: 2200+ pages. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

    This response ("buy more memory, or sequence jobs by hand"), while common,
completely misses the point.  The real issue here is how to maximize overall
performance with a given physical configuration.  The problem I raised is
"scale invariant" - no matter how much memory one has, there will always be 
important jobs which fill all of memory.  A bigger memory just means that
I'll try to run bigger jobs.  This is why I deliberately phrased the initial
description of this problem in terms of ratios of process size to physical
memory.  Part of the intended purpose of my posting was to emphasize that 
system performance in the presence of large processes (large relative to
however much memory you can afford, but not so large that the system
shouldn't be able to run them efficiently) IS important!

    As to your specific question - yes, I have considered how long it takes
to swap out a 25 Mb process.  On my system, perhaps 10-20 seconds.  This is
utterly negligible for jobs which require many cpu-hours or cpu-days.  
The real issue here is also "scale invariant": as process size grows
certainly the time to swap the entire process in or out grows.  But so
does the time required to page the same amount of data in from disk
and, I submit, so does the typical cpu time.  I see no reason to frown
on swapping strategies provided that the frequency of process swaps is
intelligently adjusted as a function of process size.

    Finally, while running multiple large jobs in sequence is a valid
suggestion and will certainly avoid inter-process contention, to me
this is an admission of failure.  (Why do we like Unix?  Multiprocessing?)
If a single person is starting multiple large jobs, then scheduling by hand
may be acceptable.  If different users are trying to run big jobs then
strict sequential scheduling is a real pain.  One should be able to do better.
While I may have high expectations, I do think that minimizing the type of
contention I discussed is a reasonable user desire and should be a relevant
operating system design goal.

-- 
Laurence G. Yaffe		Internet: lgy@newton.phys.washington.edu
University of Washington	  Bitnet: yaffe@uwaphast.bitnet

frazier@oahu.cs.ucla.edu (Greg Frazier) (08/23/89)

In article <1989Aug22.163100.25540@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
=>In article <3332@blake.acs.washington.edu> lgy@newton.phys.washington.edu (Laurence Yaffe) writes:
=>>suppose you have a 32 Mbyte system and that there are only two processes
=>>running, each of which uses 25 Mbytes of virtual memory (almost all data space)
=>
=>Have you ever figured out how long it takes to swap out a 25MB process?
=>Or to swap it back in again?  The fact is, this example system simply
=>does not have enough main memory for its workload.  The fix is, either
=>double the memory or run the jobs in sequence rather than simultaneously.

The original posting was asking if there was an OS smart enough
to run the jobs in sequence for him.  That's the point - most
machines experience a wide range of workloads, and cannot be
customized for each one of them.  Any heavily-used scientific
machine is going to experience the described situation at some
point in its career.  Now, obviously, if you have somebody
monitoring the machine's performance, he can "manually" prevent
the two jobs from running simultaneously.  What would be more
desirable, however, is for the OS to realize what is going on,
and for _it_ to cause the jobs to run sequentially.  I suspect
that this would require too much intelligence on the part of the
OS, but then, what do I know - I'm just an architect! :-)

Greg Frazier
&&&&&&&&&&&&&&&&________________________@@@@@@@@@@@@@@@@@@@@
Greg Frazier	    o	Internet: frazier@CS.UCLA.EDU
CS dept., UCLA	   /\	UUCP: ...!{ucbvax,rutgers}!ucla-cs!frazier
	       ----^/----
		   /

lgy@blake.acs.washington.edu (Laurence Yaffe) (08/23/89)

    In article <26642@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
    >In article <1989Aug22.163100.25540@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
    >=>In article <3332@blake.acs.washington.edu> lgy@newton.phys.washington.edu (Laurence Yaffe) writes:
    >=>>	[ comments about excessive inter-process memory contention ]
    >=>
    >=   [ comments about avoiding the whole situation ]
    >
    >The original posting was asking if there was an OS smart enough
    >to run the jobs in sequence for him.  That's the point - most
    >machines experience a wide range of workloads, and cannot be
    >customized for each one of them.  Any heavily-used scientific
    >machine is going to experience the described situation at some
    >point in its career.  Now, obviously, if you have somebody

Absolutely correct.

    >monitoring the machine's performance, he can "manually" prevent
    >the two jobs from running simultaneously.  What would be more
    >desirable, however, is for the OS to realize what is going on,
    >and for _it_ to cause the jobs to run sequentially.  I suspect
    >that this would require too much intelligence on the part of the
    >OS, but then, what do I know - I'm just an architect! :-)
    >
    >Greg Frazier

I'd like to emphasize that it should be rather easy for an OS to recognize
when excessive page faulting is causing the system to thrash.
And once that happens, there's plenty of unused cpu cycles which could
be used by the OS in order to decide what to do about it!  So adding
some intelligence to the OS to better cope with this type of problem
need not have significant negative impact on ordinary interactive behavior.
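
A sketch of what "recognize thrashing" might mean in practice
(illustrative C only - the sample_*() routines stand in for whatever
counters a real kernel keeps, and the thresholds are guesses):

/* Illustrative only: flag system-wide thrashing from two symptoms --
 * a high page-in rate and a mostly idle CPU.  The sample_*() routines
 * below are stubbed stand-ins for kernel counters; nothing here is an
 * existing Unix interface. */
#include <stdio.h>

static long sample_pageins_per_sec(void)  { return 350; }  /* stub value */
static int  sample_cpu_busy_percent(void) { return 20;  }  /* stub value */

#define PAGEIN_THRESHOLD  200   /* tune per machine and paging disk */
#define CPU_BUSY_FLOOR     50   /* below this, the CPU is mostly waiting */

static int system_is_thrashing(void)
{
    /* Heavy page-in traffic while the CPU sits idle means the machine
     * is waiting on the paging disk rather than getting work done. */
    return sample_pageins_per_sec() > PAGEIN_THRESHOLD
        && sample_cpu_busy_percent() < CPU_BUSY_FLOOR;
}

int main(void)
{
    printf("thrashing? %s\n", system_is_thrashing() ? "yes" : "no");
    return 0;
}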

-- 
Laurence G. Yaffe		Internet: lgy@newton.phys.washington.edu
University of Washington	  Bitnet: yaffe@uwaphast.bitnet

khb@road.Sun.COM (Keith Bierman - Advanced Languages - Floating Point Group ) (08/23/89)

In article <26642@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
...deleted stuff

>machines experience a wide range of workloads, and cannot be
>customized for each one of them.  Any heavily-used scientific
>machine is going to experience the described situation at some
>point in its career.  Now, obviously, if you have somebody
>monitoring the machine's performance, he can "manually" prevent
>the two jobs from running simultaneously.  What would be more
>desirable, however, is for the OS to realize what is going on,
>and for _it_ to cause the jobs to run sequentially.  I suspect
>that this would require too much intelligence on the part of the
>OS, but then, what do I know - I'm just an architect! :-)

These are reasons why some folks actually like batch OS ... in the old
days we had ways of informing the system of our expected needs (cpu,
io, "cards", etc.) and this was used by some fairly clever scheduling
algorithms. As Unix machines grow up, I personally expect something
fancier than nice level to allow the savvy programmer to inform the OS
of expected needs (for these scientific codes, we should be able to
assert, long running, not very interactive .. this would enable long
timeslices but at low priority ... and perhaps only one mongo job at a
time). 

There are some "batchqueue" type products which third parties and
national labs have cobbled up.

Keith H. Bierman    |*My thoughts are my own. !! kbierman@sun.com
It's Not My Fault   |	MTS --Only my work belongs to Sun* 
I Voted for Bill &  | Advanced Languages/Floating Point Group            
Opus                | "When the going gets Weird .. the Weird turn PRO"

davecb@yunexus.UUCP (David Collier-Brown) (08/23/89)

[In a discussion of excessive inter-process memory contention ]
lgy@blake.acs.washington.edu (Laurence Yaffe) writes:
| I'd like to emphasize that it should be rather easy for an OS to recognize
| when excessive page faulting is causing the system to thrash.
| And once that happens, there's plenty of unused cpu cycles which could
| be used by the OS in order to decide what to do about it!  So adding
| some intelligence to the OS to better cope with this type of problem
| need not have significant negative impact on ordinary interactive behavior.

	The execution of this policy should be fairly cheap: deciding on
when to carry it out might be harder.
	I'll argue that a daemon which detects processes thrashing the
system can find all the processes which should be scheduled sequentially,
but will have to do a large amount of work ensuring that they're really
non-interactive, able to be arbitrarily delayed, etc.  Once you do the
decision making, the kernel can smash their priorities down "below the
flashing lights", or make them the idle process itself.
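
	The checks themselves might look something like this (an
illustrative sketch only; the proc_info structure, its fields, and the
thresholds are invented, not any real kernel's):

/* Sketch of the "is it safe to delay this process?" decision.  The
 * proc_info structure and its fields are invented for illustration;
 * a real daemon would pull equivalents out of the kernel. */

struct proc_info {
    int   has_controlling_tty;   /* attached to a terminal? */
    long  secs_since_tty_io;     /* time since it last touched that tty */
    long  cpu_seconds;           /* accumulated CPU time */
    long  resident_pages;
    long  faults_per_sec;        /* recent page-fault rate */
};

/* A process is a candidate for being put "on hold" only if it looks
 * like a long-running, non-interactive memory hog. */
int safe_to_delay(const struct proc_info *p)
{
    int looks_batch = !p->has_controlling_tty || p->secs_since_tty_io > 300;
    int long_runner = p->cpu_seconds > 600;        /* 10+ CPU minutes */
    int memory_hog  = p->resident_pages > 2000 && p->faults_per_sec > 50;

    return looks_batch && long_runner && memory_hog;
}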

	I really wouldn't care to have the policy-maker in a kernel.
I think the risc/cisc complexity/cost arguments apply here (and that's
the only thing about architecture in this whole posting).

	For practical purposes, the SysV printer spooling system is really
quite a general-purpose creature, and can be turned into a mechanism for
simulating a batch queue (source: Drew Sullivan, personal communication).
	As we have a machine (Mips M2000) with such a spooler sitting there
unused, I'll have to see about using it as a batch queue for our scientists'
number-crunching jobs.
					--dave
-- 
David Collier-Brown,  | davecb@yunexus, ...!yunexus!davecb or
72 Abitibi Ave.,      | {toronto area...}lethe!dave 
Willowdale, Ontario,  | Joyce C-B:
CANADA. 223-8968      |    He's so smart he's dumb.

adam@castle.ed.ac.uk (Adam Hamilton) (08/23/89)

In article <3342@blake.acs.washington.edu> lgy@blake.acs.washington.edu (Laurence Yaffe) writes:

	We have an ongoing discussion about scheduling of large jobs.
I suggest that a good batch system would be better at this than trying to
do everything in the kernel.  It can monitor the various requirements of
the separate jobs (store utilisation, i/o and system call patterns etc)
and tune accordingly.  What it needs from the operating system is this
information plus some more (like interactive usage).
	Summary - if it doesn't need to be in the kernel, don't put it there.

henry@utzoo.uucp (Henry Spencer) (08/24/89)

In article <3342@blake.acs.washington.edu> lgy@blake.acs.washington.edu (Laurence Yaffe) writes:
>...This response ("buy more memory, or sequence jobs by hand"), while common,
>completely misses the point.  The real issue here is how to maximize overall
>performance with a given physical configuration...

Well, but I don't think it *entirely* misses the point.  One should be
aware that there are circumstances in which *no* amount of fiddling will
get reasonable performance out of a given configuration and a given
workload, as the two are simply incompatible.

>    As to your specific question - yes, I have considered how long it takes
>to swap out a 25 Mb process.  On my system, perhaps 10-20 seconds.  This is
>utterly negligible for jobs which require many cpu-hours or cpu-days.  

It would have been nice if you'd specified that these are background compute
jobs, not interactive applications; your example (Mathematica) strongly
suggested otherwise.  Timesharing machines and batch computing servers
are very different applications of the hardware.  For interactive processes,
I stand by my original comment:  the system is simply too small.  For batch
work, the idea is reasonably practical but requires very different system
tuning, which Unix (traditionally an interactive system) hasn't given a lot
of attention to.

Also, have you *timed* that swapping, or is that just maximum disk transfer
rate?  Remember that your system probably can't drive your disks at full
speed; disk transfer rates are "guaranteed not to exceed" rates, not often
achievable in practice.  Your point is still valid, in that the time is
negligible for long-running batch jobs, but a dedicated batch machine is
clearly needed, as that kind of disk activity will ruin interactive use.
-- 
V7 /bin/mail source: 554 lines.|     Henry Spencer at U of Toronto Zoology
1989 X.400 specs: 2200+ pages. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

cander@unisoft.UUCP (Charles Anderson) (08/24/89)

From article <3332@blake.acs.washington.edu>, by lgy@blake.acs.washington.edu (Laurence Yaffe):
> Questions:
>     Which (if any) high-performance Unix systems are capable of avoiding
> this type of inter-process contention?  What works best - actual process
> swapping layered on top of demand paging, better paging algorithms,
> clever scheduling based on adaptive process priorities, ... ?

Straight out of school  I worked on a proprietary, non-Unix system that
handled this problem fairly well (even though the company went out of
business).  In addition to the usual queues for runnable procs, I/O waiting
procs, etc, the operating system had a "trouble maker" queue.  This was
for processes that were making too many demands on the system, i.e.
encouraging thrashing.  Processes on the TM queue were stripped of
their resources (essentially swapped out, I think).  After a while they
were allowed to return and were given some extra grace because the
first thing they're going to do is page-fault in their working set
which would otherwise land them right back in the pokey.

In the example that Mr. Yaffe gives, there would be two processes, A &
B, competing with one another and thus driving the system towards
thrashing.  The system sees this, selects A as the trouble maker and
puts him in the penalty box.  B runs fine without A getting in his
way.  After a while A is allowed to return and he starts slamming
against B.  A is graced, having just returned from the TM queue, so B
gets selected for the TM queue and A runs well until B returns when the
sequence repeats again.  I don't know how well this would work on a real 
Unix system, but it seemed to work well for this particular system.  
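
My reconstruction of that policy, in illustrative C (all names and
thresholds invented - this is not the actual system's code):

/* Reconstruction of the "trouble maker" policy described above.  All
 * names and thresholds are invented; swap_out(), swap_in(), and
 * tm_enqueue() are placeholders for whatever the real kernel did. */

struct proc {
    int          pid;
    long         faults_per_sec;   /* recent page-fault rate */
    int          grace_ticks;      /* immunity after leaving the TM queue */
    struct proc *next;
};

#define FAULT_LIMIT   100          /* "too many demands" threshold */
#define TM_PENALTY     30          /* ticks spent stripped of resources */
#define GRACE_PERIOD   20          /* ticks of grace after returning */

extern void swap_out(struct proc *p);              /* placeholder */
extern void swap_in(struct proc *p);               /* placeholder */
extern void tm_enqueue(struct proc *p, int ticks); /* placeholder */

/* Called when the pager sees the system heading toward thrashing: pick
 * one offender that is not currently graced and bench it. */
void penalize_one_trouble_maker(struct proc *runq)
{
    struct proc *p, *worst = 0;

    for (p = runq; p != 0; p = p->next) {
        if (p->grace_ticks > 0)               /* just came back: graced */
            continue;
        if (p->faults_per_sec < FAULT_LIMIT)  /* behaving itself */
            continue;
        if (worst == 0 || p->faults_per_sec > worst->faults_per_sec)
            worst = p;
    }
    if (worst != 0) {
        swap_out(worst);                      /* strip its resources */
        tm_enqueue(worst, TM_PENALTY);        /* into the penalty box */
    }
}

/* When a process leaves the TM queue it gets GRACE_PERIOD ticks in which
 * to fault its working set back in without being benched again. */
void release_trouble_maker(struct proc *p)
{
    p->grace_ticks = GRACE_PERIOD;
    swap_in(p);
}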

-- 

Charles.
{sun, amdahl, ucbvax, pyramid, uunet}!unisoft!cander

ingoldsb@ctycal.COM (Terry Ingoldsby) (08/24/89)

In article <3342@blake.acs.washington.edu>, lgy@blake.acs.washington.edu (Laurence Yaffe) writes:
> In article <1989Aug22.163100.25540@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
> >In article <3332@blake.acs.washington.edu> lgy@newton.phys.washington.edu (Laurence Yaffe) writes:
> >>suppose you have a 32 Mbyte system and that there are only two processes
> >>running, each of which uses 25 Mbytes of virtual memory (almost all data space)
> >>with rapid, random access patterns...
> >>    This sort of contention could obviously be prevented by a smarter
> >>scheduling technique - something which would entirely swap out one
> >>process for a period of time in order to let the other process run.
> >
> >Have you ever figured out how long it takes to swap out a 25MB process?
> >Or to swap it back in again?  The fact is, this example system simply
> >does not have enough main memory for its workload.  The fix is, either
> >double the memory or run the jobs in sequence rather than simultaneously.
> >
> >V7 /bin/mail source: 554 lines.|     Henry Spencer at U of Toronto Zoology
> >1989 X.400 specs: 2200+ pages. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
> 
>     This response ("buy more memory, or sequence jobs by hand"), while common,
> completely misses the point.  The real issue here is how to maximize overall

But surely this is a discussion of the classic tradeoff of response time vs
throughput.  It is quite common that tuning a system for maximum throughput
degrades interactive response to the point that users threaten to lynch the
system manager.  Note that most supercomputers, whose raison d'etre is to go
real fast, don't timeshare at all.

Finally, I don't understand why your performance is so poor, but maybe there
is something I don't understand about UNIX scheduling, or else your task is
*so* degenerate that no memory management/scheduling scheme will work.  In
normal circumstances if a process tries to access a non-resident page, the OS
blocks that process until the I/O completes and lets the other job(s) that
do have the necessary info in memory continue to run.  If you mean that your
jobs are so degenerate that each process does only a handful of memory accesses
before page faulting, then obviously you will always be waiting for the
page to be swapped in.  If your accesses are truly random, then *no* memory
scheme will ever be able to give you a good hit rate.  If they are not
truly random, then organize your data so that elements which are accessed
close together in time reside on the same page.

In summary, either reorganize your data, or do what Henry says.
Although interesting, this discussion is probably moving away from the
charter of comp.arch.
-- 
  Terry Ingoldsby                       ctycal!ingoldsb@calgary.UUCP
  Land Information Systems                           or
  The City of Calgary         ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb

lgy@blake.acs.washington.edu (Laurence Yaffe) (08/24/89)

In article <2394@unisoft.UUCP> cander@unisoft.UUCP (Charles Anderson) writes:
-From article <3332@blake.acs.washington.edu>, by lgy@newton.phys.washington.edu (Laurence Yaffe):
-> Questions:
->     Which (if any) high-performance Unix systems are capable of avoiding
-> this type of inter-process contention?  What works best - actual process
-> swapping layered on top of demand paging, better paging algorithms,
-> clever scheduling based on adaptive process priorities, ... ?
-
-Straight out of school  I worked on a proprietary, non-Unix system that
-handled this problem fairly well (even though the company went out of
-business).  In addition to the usual queues for runnable procs, I/O waiting
-procs, etc, the operating system had a "trouble maker" queue.  This was
-for processes that were making too many demands on the system, i.e.
-encouraging thrashing.  Processes on the TM queue were stripped of
-their resources (essentially swapped out, I think).  After a while they
-were allowed to return and were given some extra grace because the
-first thing they're going to do is page-fault in their working set
-which would otherwise land them right back in the pokey.
-
-In the example that Mr. Yaffe gives, there would be two processes, A &
-B, competing with one another and thus driving the system towards
-thrashing.  The system sees this, selects A as the trouble maker and
-puts him in the penalty box.  B runs fine without A getting in his
-way.  After a while A is allowed to return and he starts slamming
-against B.  A is graced, having just returned from the TM queue, so B
-gets selected for the TM queue and A runs well until B returns when the
-sequence repeats again.  I don't know how well this would work on a real 
-Unix system, but it seemed to work well for this particular system.  
-
--- 
-Charles.

    Of the various responses to my initial query, this note by Charles
Anderson comes closest to the sort of approach I've been wishing for.
It has the virtues of:

a) being dynamic - responding to what a process actually does, not what
   some user said it might do.  This is clearly desirable.  I take it
   for granted that static, rigidly enforced job queues ala IBM are a
   prime case of the "cure" being as bad as the "disease".
   ("I'm sorry, your job ran for 2 weeks before exceeding the page fault
   limit you initially guessed.  You lose.  Try running it again with a
   larger limit...")

b) approximating what a human system manager typically does:

	notice that the system is starting to thrash, 

	look at the active processes and see which ones
	are generating high page fault rates,

	find out the effective "working set" size of these processes
	and figure out which processes can effectively run concurrently,

	try putting one or more of the processes "on hold"
	in order to allow the remaining processes to run effectively,

	periodically reevaluate the situation and readjust which (if any)
	processes should be put "on hold".

    Whether "on hold" means explicitly swapped out, or suspended, or
    simply restricted to a small, but non-zero number of physical memory
    pages is an implementation issue (about which I was hoping to hear
    some interesting discussion).  I do know that simply manipulating
    "nice" values is not adequate (at least on the systems I use) for
    preventing memory contention.  I don't know the "best" way to
    implement this sort of approach - I'm a full time user and part
    time system manager, not an OS guru.

The important point is the need for some sort of intelligently selected,
dynamic response which prevents system resources from being allocated
excessively "democratically" at times when "equal shares for all"
turns into "everyone starves".
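
For concreteness, the whole cycle sketched above might look something
like this (illustrative pseudo-C; every routine named here is a stand-in
for information and mechanisms a real system would have to provide, not
an existing Unix interface):

#include <unistd.h>

extern int  system_looks_thrashy(void);     /* high fault rate, idle CPU */
extern int  worst_page_faulter(void);       /* pid of the biggest offender */
extern long working_set_pages(int pid);
extern long free_page_frames(void);
extern void put_on_hold(int pid);           /* swap out, suspend, or shrink */
extern void release_hold(int pid);
extern int  process_on_hold(void);          /* pid currently held, or -1 */

void memory_scheduler(void)
{
    for (;;) {
        int held = process_on_hold();

        if (system_looks_thrashy()) {
            int pid = worst_page_faulter();
            if (held == -1 && pid > 0)
                put_on_hold(pid);           /* let the others run for a while */
        } else if (held != -1 &&
                   working_set_pages(held) <= free_page_frames()) {
            release_hold(held);             /* room again: let it back in */
        }
        sleep(10);                          /* periodically reevaluate */
    }
}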

-- 
Laurence G. Yaffe		Internet: lgy@newton.phys.washington.edu
University of Washington	  Bitnet: yaffe@uwaphast.bitnet

bin@primate.wisc.edu (Brain in Neutral) (08/24/89)

This thread now really has little to do with MIPS systems in
particular.  Please, edit the newsgroups lines to say comp.arch,
not comp.arch,comp.sys.mips.  Thanks.

Paul DuBois
dubois@primate.wisc.edu

rwc%ilmarin.c3@lanl.gov (Robert W. Cox) (08/24/89)

From article <440@ctycal.UUCP>, by ingoldsb@ctycal.COM (Terry Ingoldsby):

> Note that most supercomputers, whose raison d'etre is to go
> real fast, don't timeshare at all.

The majority of supercomputers are Crays.  The most common OS's on
Crays are CTSS and UNICOS, both of which are timesharing.

Bob Cox

ram@wb1.cs.cmu.edu (Rob MacLachlan) (08/24/89)

Being a Lisp programmer, I have also observed that Unix (and many other
O.S.'s also) tends to lose when running large programs that don't fit a
"typical" VM behavior pattern.

One problem is that many O.S.'s use a purely global page replacement
algorithm: no consideration is given to which process a page belongs to,
or to the history of that process.

Many classic VM algorithms (such as Working Set) do consider processes and
degree of thrashing, etc., but these algorithms seem to be little used in
recently designed o.s.'es.  [Since this is comp.arch, I will note in passing
that this may be partly due to brain-dead VM hardware, such as the VAX, which
doesn't even have a reference bit.]  This is probably because people
discovered that with lower degrees of time-sharing and larger memories,
global strategies worked just as well for the "typical" workload.
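
For reference, the textbook Working Set computation is short (the sketch
below assumes a per-page reference bit and is not any particular
kernel's code):

/* Textbook working-set computation: a page is in the working set if it
 * was referenced within the last TAU units of the process's own virtual
 * time.  The page_entry array and its fields are hypothetical. */

#define TAU 100000L          /* window, in process-virtual-time ticks */

struct page_entry {
    int  valid;              /* resident? */
    int  referenced;         /* hardware reference bit (absent on a VAX) */
    long last_use;           /* virtual time of the last observed reference */
};

/* Run periodically in the process's virtual time: sample and clear the
 * reference bits, and return the current working-set size. */
long working_set_size(struct page_entry page[], long npages, long vtime)
{
    long i, in_set = 0;

    for (i = 0; i < npages; i++) {
        if (!page[i].valid)
            continue;
        if (page[i].referenced) {       /* touched since the last scan */
            page[i].last_use = vtime;
            page[i].referenced = 0;     /* clear it for the next interval */
        }
        if (vtime - page[i].last_use <= TAU)
            in_set++;                   /* still in the working set */
        /* else a working-set policy would release this frame */
    }
    return in_set;
}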

As to how an O.S. could address these problems, I strongly agree that a
dynamic strategy is preferable to a batch-oriented one.  A Lisp process
generally persists for an entire interactive session, and may exhibit a wide
range of different behaviors during that period.  The O.S. must not only
recognise when the process is antisocial, it must also recognize when it
becomes social again.

As to exactly what to do when you discover that memory is over-committed:
>    Whether "on hold" means explicitly swapped out, or suspended, or
>    simply restricted to a small, but non-zero number of physical memory
>    pages is an implementation issue (about which I was hoping to hear
>    some interesting discussion).

It's hard to know exactly what you mean by "explicitly swapped out" in a
paged VM system.  Unlike in a true swapping system, you are never going to
write out clean pages.  The question is only what you are going to do to the
process's resident pages at the time you decide to "swap" it.  You can:
 -- Clean all the dirty pages (perhaps freeing them once cleaned), and
 -- free all of the clean pages.

Scheduling writes of all the dirty pages is probably worthwhile; it probably
doesn't make any difference whether you free the clean pages or just allow
them to be replaced when you page the other process in.

"suspending" the process presumably means stopping the process, but doing
neither of the above actions on its pages.  This would work pretty well,
especially if the VM system cleans dirty pages efficiently (many at a time).

Restricting a thrashing process to even less memory is a bad idea.  Although
it will allow other processes to get the memory they need, the restricted
process will thrash even more, soaking up disk bandwidth and O.S. runtime
that could be used by other processes.  There's no point in trying to let
the thrashing process run, since it's not going to be able to get anything
done without the memory it needs.

  Rob

aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (08/25/89)

..> L. Yaffe talks about thrashing (two 25 Mbyte processes on a 32 Mbyte system), asks:
>    Which (if any) high-performance Unix systems are capable of avoiding
>this type of inter-process contention?  What works best - actual process
>swapping layered on top of demand paging, better paging algorithms,
>clever scheduling based on adaptive process priorities, ... ?

Gould NPL had a "working set" scheduler for memory; a working set was
determined by looking at the number of pages touched by a process in a
given interval of process virtual time (not system time); a process would
not run if the working set could not be loaded. Pervase Akhtar wrote a
paper on it for a past UNIX conference.

Working set schedulers are quite common in large (non-UNIX) systems.
See any recent issue of ACM SIGMETRICS' Conference Proceedings.
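
The rule itself is short - roughly (my paraphrase, in illustrative C;
the struct is invented):

/* Rough paraphrase of a working-set ("load control") scheduler: admit a
 * process to the run set only if its estimated working set fits in the
 * frames left over.  Illustrative C; the struct is invented. */

struct job {
    int  pid;
    long working_set_pages;   /* estimated over process virtual time */
    int  admitted;
};

void admit_jobs(struct job jobs[], int njobs, long free_frames)
{
    int i;

    for (i = 0; i < njobs; i++) {
        if (jobs[i].admitted)
            continue;
        if (jobs[i].working_set_pages <= free_frames) {
            jobs[i].admitted = 1;                 /* its working set fits */
            free_frames -= jobs[i].working_set_pages;
        }
        /* Otherwise it waits: better one job running at full speed than
         * two thrashing, which is exactly the complaint that started
         * this thread. */
    }
}
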
--
Andy "Krazy" Glew,  Motorola MCD,    	    	    aglew@urbana.mcd.mot.com
1101 E. University, Urbana, IL 61801, USA.          {uunet!,}uiucuxc!udc!aglew
   
My opinions are my own; I indicate my company only so that the reader
may account for any possible bias I may have towards our products.

ttwang@polyslo.CalPoly.EDU (Thomas Wang) (08/25/89)

lgy@newton.phys.washington.edu (Laurence Yaffe) writes:

>    A lot of current machines, such as my MIPS box, have pretty standard
>virtual memory systems: demand paging is used to allocate real memory to
>individual processes, with some sort of least-recently-used algorithm
>employed to select pages to write out.
>Severe, and unnecessary, performance degradation can occur if you have several
>processes which have very poor locality of reference when accessing their
>data, are sufficiently small that any single process can fit in real memory,
>and yet sufficiently large that no two processes will fit in real memory.

To avoid an excessive page fault rate, a good algorithm is to load in multiple
pages per fault, depending on system load.  Exactly how many pages to load in
is still a research question.  In HP's terms, loading in extra pages is
called 'pre-fetching'.

HP's MPE/XL has pre-fetching.  I don't know about Unix, but my guess is
that Unix probably does not.
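
Roughly the idea (a sketch with made-up thresholds, not MPE/XL's actual
policy):

/* Sketch of load-dependent pre-fetch: on a fault for page n, decide how
 * many of the following pages to read in the same disk operation.  The
 * thresholds are made up; this is not MPE/XL's actual policy. */

int prefetch_count(double load_average)
{
    if (load_average < 1.0)
        return 8;             /* idle machine: read ahead aggressively */
    if (load_average < 4.0)
        return 4;
    if (load_average < 8.0)
        return 2;
    return 0;                 /* busy machine: pure demand paging */
}

/* A fault on page n then brings in pages n through n + prefetch_count(),
 * which helps sequential access patterns but not truly random ones. */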


 -Thomas Wang ("I am, therefore I am."
                 - Akira               )

                                                     ttwang@polyslo.calpoly.edu

aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (08/25/89)

>Have you ever figured out how long it takes to swap out a 25MB process?
>Or to swap it back in again?  The fact is, this example system simply
>does not have enough main memory for its workload.  The fix is, either
>double the memory or run the jobs in sequence rather than simultaneously.

Isn't that what the scheduler is supposed to figure out by itself?
--
Andy "Krazy" Glew,  Motorola MCD,    	    	    aglew@urbana.mcd.mot.com
1101 E. University, Urbana, IL 61801, USA.          {uunet!,}uiucuxc!udc!aglew
   
My opinions are my own; I indicate my company only so that the reader
may account for any possible bias I may have towards our products.

hammondr@sunroof.crd.ge.com (richard a hammond) (08/25/89)

lgy@newton.phys.washington.edu (Laurence Yaffe) writes:
>
>>...
>>Severe, and unnecessary, performance degradation can occur if you have several
>>processes which have very poor locality of reference when accessing their
>>data, are sufficiently small that any single process can fit in real memory,
>>and yet sufficiently large that no two processes will fit in real memory.

In article <13773@polyslo.CalPoly.EDU> ttwang@polyslo.CalPoly.EDU (Thomas Wang) comments:

>To avoid excessive page fault rate, a good algorithm is to load in multiple
>pages depending on system load.  Actually how many pages should you load in
>is still a research question.  In HP's term, loading in extra pages is
>called 'pre-fetching'.
>
>HP's MPE/XL have pre-fetching.  I don't know about Unix, but my guess is
>that Unix probably does not.

Note that in the case postulated by L. Yaffe, pre-fetching makes the thrashing
more severe, because the memory accesses have "poor locality of reference",
i.e. the working set is very close to the total memory required by the
process.  Fetching pages n through n+m because of a page fault on n
would increase the I/O overhead without increasing the probability
that you would avoid the next page fault.

Historical commentary:
The early multi-user UNIX machines, DEC PDP 11/45 and 11/70, could support
more physical memory than the largest process could use, so the scheduler
didn't worry about this sort of thrashing.

VM works fine if you have good locality of reference, while it fails miserably
if you don't.  With poor locality of reference, you need to have enough
physical memory to keep the job in memory.

Practical commentary:
Yes, the problem is interesting and can be solved, but I never observed
it except on our large number crunching machines, i.e. the Convex and Alliant.
On those machines I scheduled the jobs by hand; after all, if a job is going
to take 10 days of CPU time, you can just arrange for one to run overnight
and the other during the day.  So, I conclude that the problem
isn't that interesting for the vast majority of UNIX systems, i.e. all
our Suns and Vaxen didn't have the problem.

Rich Hammond

slackey@bbn.com (Stan Lackey) (08/25/89)

In article <5967@pt.cs.cmu.edu> ram@wb1.cs.cmu.edu (Rob MacLachlan) writes:

>Many classic VM algorithms (such as Working Set) do consider processes and
>degree of thrashing, etc., but these algorithms seem to be little used in
>recently designed o.s.'es.  [Since this is comp.arch, I will note in passing
>that is may be partly due to brain-dead VM hardware, such as the VAX, which
>doesn't even have a reference bit.]

Instead of examining a reference bit, an OS could check to see if the
page table entry is in the translation cache; this can be better than
a simple reference bit, which only tells you if a page has been read.
If a translation is in the cache, it indicates that the page is likely
being read often.
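
Roughly, in code (illustrative only; the tlb_holds_translation() probe
is hypothetical, and whether the hardware exposes anything like it
varies by machine):

/* Illustrative sketch of using TLB residence as a reference hint.
 * tlb_holds_translation() is a hypothetical probe. */

extern int tlb_holds_translation(int pid, long vpage);   /* hypothetical */

struct frame_info {
    int  pid;                 /* owner of the page in this frame */
    long vpage;               /* its virtual page number */
    int  recently_used;       /* software-maintained "reference" hint */
};

/* Run periodically: a page whose translation is still in the TLB has
 * very likely been touched since the last scan, which says more than a
 * plain reference bit does. */
void scan_frames(struct frame_info frames[], int nframes)
{
    int i;

    for (i = 0; i < nframes; i++)
        frames[i].recently_used =
            tlb_holds_translation(frames[i].pid, frames[i].vpage);
}
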
-Stan

leichter@CS.YALE.EDU (Jerry Leichter) (08/26/89)

In article <5967@pt.cs.cmu.edu>, ram@wb1.cs.cmu.edu (Rob MacLachlan) writes...
>...Many classic VM algorithms (such as Working Set) do consider processes and
>degree of thrashing, etc., but these algorithms seem to be little used in
>recently designed o.s.'es.  [Since this is comp.arch, I will note in passing
>that is may be partly due to brain-dead VM hardware, such as the VAX, which
>doesn't even have a reference bit.]...

VMS and the VAX architecture were designed at the same time by a group of
people who worked together.  VMS has implemented a Working Set algorithm
since it first shipped.

I've heard (from reasonably reliable sources) that the hardware people were
ready to include a reference bit, but the software people's work showed they
didn't need it, given appropriate algorithms.  The fact that early Unix
implementations for the VAX didn't handle it correctly is a problem with the
developers of those early versions, not with the VAX architecture.

I have never understood this need people have to use the VAX as the standard
bad example of EVERYTHING.  In fact, the designers of the VAX worked in ways
very similar to the way good RISC designers work today:  Hardware and software
people cooperated, and there was an attempt to balance the complexity each
had to deal with.  It may seem hard to believe, almost 15 years later, that
the compiler writers heavily influenced the design of the ISA, asking for
things like 3-address arithmetic instructions in addition to 2-address and
indexed addressing modes, but given the compiler technology of the day, this
is what they thought would be useful to them.  Compiler technology has changed
greatly since then, as has hardware technology.

The main difference between the VAX design team and current design teams is
probably in the inclusion of the OS designers.  The process structure of the
VAX, the support in hardware for AST's, the security structure - all of these
are direct reflections of the OS being developed.  These days, OS design is
rapidly becoming a lost art - Unix is the answer, what was the question?  So
there is little room for interplay any more between OS designers and hardware
designers - the hardware designers provide what the well-understood Unix
kernel needs and are finished.  (To a certain extent, language design is
approaching the same point:  Support C and figure you are done.  The VAX was
viewed from the start as a multi-lingual machine, and its ISA contains
elements that the COBOL developers found important, as well as elements the
FORTRAN developers wanted.  Most of the early RISC papers used C examples,
exclusively.  More recently, we've seen "Smalltalk RISC's" and "LISP RISC's".
I'm certainly not up on all the literature, but I know of no RISC design,
paper or otherwise, specifically intended to simultaneously support two quite
different languages efficiently.  As a language designer, I find it very
distressing to think that the machines of the future may be "refined" to the
point where doing anything but Unix and C will require extreme effort.  Take a
look at the history of LISP on CDC machines to see the possible results.)

							-- Jerry

henry@utzoo.uucp (Henry Spencer) (08/27/89)

In article <70663@yale-celray.yale.UUCP> leichter@CS.YALE.EDU (Jerry Leichter) writes:
>I've heard (from reasonably reliable sources) that the hardware people were
>ready to include a reference bit, but the software people's work showed they
>didn't need it, given appropriate algorithms...

That *they* didn't need it, note, not that other software developers might
not need it.  (Of course, as we all know, there was only going to be *one*
operating system for the VAX...)

>The main difference between the VAX design team and current design teams is
>probably in the inclusion of the OS designers.  The process structure of the
>VAX, the support in hardware for AST's, the security structure - all of these
>are direct reflections of the OS being developed...

The same is true of the architectural tweaks on many modern machines.  For
example, according to John Mashey, the MIPS machines make a considerable
effort to do precise exceptions because the OS people got quite upset about
the alternative.  The difference is that modern OS people don't see any need
to reinvent the square wheel, so the issue is how to *support* the system,
not what it should look like.

>...To a certain extent, language design is
>approaching the same point:  Support C and figure you are done.  The VAX was
>viewed from the start as a multi-lingual machine, and its ISA contains
>elements that the COBOL developers found important, as well as elements the
>FORTRAN developers wanted.  Most of the early RISC papers used C examples,
>exclusively.  More recently, we've seen "Smalltalk RISC's" and "LISP RISC's".
>I'm certainly not up on all the literature, but I know of no RISC design,
>paper or otherwise, specifically intended to simultaneously support two quite
>different languages efficiently...

Might one ask how many RISC designs you are acquainted with?  The SPARC,
the MIPS architecture, and the HP Precision Architecture, to name three,
were all designed with considerable attention to multi-language support.
The fact is, supporting C well is very nearly enough to support the other
"conventional" languages well.  (Note, "very nearly" -- some attention to
the fine points is still desirable.)  Compiler designers today understand
that it may be better to convert COBOL numbers to binary for arithmetic
than to mess up the hardware with decimal instructions, for example.
COBOL programs are seldom arithmetic-bound anyway.

>distressing to think that the machines of the future may be "refined" to the
>point where doing anything but Unix and C will require extreme effort.  Take a
>look at the history of LISP on CDC machines to see the possible results.)

Take a look at the history of Lisp on Lisp machines, whose time has come
*and gone* -- those awful "C-only" RISC machines run Lisp faster than the
custom-cooked Lisp machines do.
-- 
V7 /bin/mail source: 554 lines.|     Henry Spencer at U of Toronto Zoology
1989 X.400 specs: 2200+ pages. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

plogan@mentor.com (Patrick Logan) (08/30/89)

In article <1989Aug26.232710.27174@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <70663@yale-celray.yale.UUCP> leichter@CS.YALE.EDU (Jerry Leichter) writes:
>>distressing to think that the machines of the future may be "refined" to the
>>point where doing anything but Unix and C will require extreme effort.  Take a
>>look at the history of LISP on CDC machines to see the possible results.)
>
>Take a look at the history of Lisp on Lisp machines, whose time has come
>*and gone* -- those awful "C-only" RISC machines run Lisp faster than the
>custom-cooked Lisp machines do.
>-- 
>V7 /bin/mail source: 554 lines.|     Henry Spencer at U of Toronto Zoology
>1989 X.400 specs: 2200+ pages. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

Saying that C machines run Lisp faster than Lisp machines is
simplistic. To run Lisp faster on a C machine requires more
declarations in the code (or, as with T, more use of "type-specific"
procedures such as fx+ instead of generic +). This goes against the
grain of Lisp programming.

Lisp machines provide a better environment (e.g. protection, ability
to write fast, generic code) per performance unit. I'd guess the only
reasons their time is gone are market based. They're sure to return in
one form or another as more and more C machine programs are written in
Lisp-like languages. Many C machine architectures (e.g. SPARC) already
incorporate some support for Lisp.
-- 
Patrick Logan                | ...!{decwrl,sequent,tessi}!mntgfx!plogan 
Mentor Graphics Corporation  | plogan@pdx.MENTOR.COM                    
Beaverton, Oregon            | "My other computer is a Lisp machine."

mash@mips.COM (John Mashey) (08/31/89)

In article <1989Aug30.152155.9613@mentor.com> plogan@mentor.com (Patrick Logan) writes:
>In article <1989Aug26.232710.27174@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>>Take a look at the history of Lisp on Lisp machines, whose time has come
>>*and gone* -- those awful "C-only" RISC machines run Lisp faster than the
>>custom-cooked Lisp machines do.

>Saying that C machines run Lisp faster than Lisp machines is
>simplistic. To run Lisp faster on a C machine requires more
>declarations in the code (or, as with T, more use of "type-specific"
>procedures such as fx+ instead of generic +). This goes against the
>grain of Lisp programming.
>
>Lisp machines provide a better environment (e.g. protection, ability
>to write fast, generic code) per performance unit. I'd guess the only
>reasons their time is gone are market based. They're sure to return in
>one form or another as more and more C machine programs are written in
>Lisp-like languages. Many C machine architectures (e.g. SPARC) already
>incorporate some support for Lisp.

There are many good ideas in LISP, and more of it will filter over.
The issue (and it has been discussed before in comp.arch) is that
the economics of the computer business end up with the following:
	a) If something is a low-volume, special-purpose architecture,
	it must be MUCH better at what it does than the high-volume,
	cheap competition.  If it isn't, it will end up going away,
	because the competition gets to keep tracking the technology
	curve at speed, rather than getting behind it.  Also,
	if something has high-volume in VLSI, it gets to be cheap.
	b) The current RISC chips are getting fast and cheap enough
	that this is happening.
	c) There is still some fine-tuning to be done, but most of
	them are more like tweaks added to some of the existing
	architectures (tagged adds, for example, by themselves, do
	not necessarily solve the problem: consider the exception-
	handling sequences also needed - the sketch below shows the flavor).
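
As a flavor of the problem, here is an invented 2-bit tag scheme in C
(not any real system's representation): the fast path is one test and an
add, but the cost hides in the fallback.

/* Invented 2-bit tag scheme, for flavor only: low bits 00 mean fixnum.
 * The fast path is a single tag test plus an add; everything else --
 * overflow, non-fixnum operands -- falls into the slow generic path,
 * and arranging that fallback cheaply is the hard part. */

typedef long lispobj;
#define TAG_MASK   0x3L
#define FIXNUM_TAG 0x0L
#define SIGN_BIT   (~(~0UL >> 1))

lispobj slow_generic_add(lispobj a, lispobj b);   /* boxed/bignum path */

lispobj generic_add(lispobj a, lispobj b)
{
    if (((a | b) & TAG_MASK) == FIXNUM_TAG) {
        unsigned long sum = (unsigned long)a + (unsigned long)b;
        /* Overflow iff both operands have the same sign and the sum's
         * sign differs; otherwise the result is still a valid fixnum. */
        if (!(~((unsigned long)a ^ (unsigned long)b)
              & ((unsigned long)a ^ sum) & SIGN_BIT))
            return (lispobj)sum;
    }
    return slow_generic_add(a, b);    /* the expensive exception path */
}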

Useful datapoints might be, now and in future:
	a) Symbolics, Inc. history, where it seems like they're
	getting out of the hardware-design business in favor of (fine) software.
	b) TI: Ivory chip versus SPARCs

CONJECTURE/PREDICTION: fast VLSI RISCs will end up capturing the high
end of the LISP market over the next few years.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

baum@Apple.COM (Allen J. Baum) (09/12/89)

[]
>In article <70663@yale-celray.yale.UUCP> leichter@CS.YALE.EDU (Jerry Leichter) writes:
>I have never understood this need people have to use the VAX as the standard
>bad example of EVERYTHING.  In fact, the designers of the VAX worked in ways
>very similar to the way good RISC designers work today:  Hardware and software
>people cooperated, and there was an attempt to balance the complexity each
>had to deal with.
.....
>the compiler writers heavily influenced the design of the ISA, asking for
>things like 3-address arithmetic instructions in addition to 2-address and
>indexed addressing modes, but given the compiler technology of the day, this
>is what they thought would be useful to them.  Compiler technology has changed
>greatly since then, as has hardware technology.
>
>The main difference between the VAX design team and current design teams is
>probably in the inclusion of the OS designers.  The process structure of the
>VAX, the support in hardware for AST's, the security structure - all of these
>are direct reflections of the OS being developed.

Hear, hear! I'm not particularly a fan of the VAX, but it's getting a lot of bad
press from the RISC people because of its performance vs. the new chips.
This is NOT because it is a bad design, it's because it's an old design, and
the nature of the tradeoffs has changed. This includes not just compiler,
and OS technology, but lots of the hardware technology: memories, I/O, logic,
and packaging. These make a big difference on how a system is designed. 

The major fault of VAX proponents is that they haven't recognized that these
tradeoffs have made a difference, and they still think they can compete (on
the basis of performance, as opposed to just running old code).

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum