[comp.arch] Memory utilization & inter-process contention

lgy@blake.acs.washington.edu (Laurence Yaffe) (08/22/89)

     After all the talk in comp.arch about cache and memory system design
at the hardware level, how about some discussion on higher level issues
relating to memory utilization, process scheduling, or resource contention?

    A lot of current machines, such as my MIPS box, have pretty standard
virtual memory systems: demand paging is used to allocate real memory to
individual processes, with some sort of least-recently-used algorithm
employed to select pages to write out (or forget) when virtual memory use
exceeds real memory.  This seems to work reasonably well if your typical
job mix consists of numerous small processes.  But severe, and unnecessary,
performance degradation can occur if you have several processes which have
very poor locality of reference when accessing their data, are sufficiently
small that any single process can fit in real memory, and yet sufficiently
large that no two processes will fit in real memory.

    As a concrete example (typical of what I've been trying to run lately),
suppose you have a 32 Mbyte system and that there are only two processes
running, each of which uses 25 Mbytes of virtual memory (almost all data space)
with rapid, random access patterns.  (I.e., normal behavior for a program like
Mathematica.)  If only one job were present, it would fit in real memory and
the machine would run at virtually 100% cpu utilization.  With two such jobs
in a demand-paged environment what appears to happen is that each process
typically gets enough real memory so that it denies the other process 
sufficient memory to run effectively - i.e., both processes continually
page fault and cpu utilization plummets.
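
For a rough feel for why utilization collapses, the following little model may
help.  The numbers in it (4 Kbyte pages, ~20 ms to service a fault, ~1 us of
useful work per data reference) are assumed round figures, not measurements of
my system; only the 32/25 Mbyte sizes come from the example above.

#include <stdio.h>

#define PAGE_KB   4
#define MEM_MB    32
#define JOB_MB    25
#define FAULT_US  20000.0     /* assumed disk service time per fault */
#define REF_US    1.0         /* assumed cpu work per data reference */

/* With uniform random access and global LRU, the resident pages are just
   the most recently touched ones, so once the combined data of the running
   jobs (P pages) exceeds real memory (M frames), the chance that the next
   reference faults is roughly (P - M) / P. */
static void report(int njobs)
{
    double M = MEM_MB * 1024.0 / PAGE_KB;          /* physical page frames */
    double P = njobs * JOB_MB * 1024.0 / PAGE_KB;  /* pages being touched  */
    double pfault = (P > M) ? (P - M) / P : 0.0;
    double util = REF_US / (REF_US + pfault * FAULT_US);

    printf("%d job(s): fault on ~%4.1f%% of references, "
           "~%6.2f%% cpu utilization\n",
           njobs, 100.0 * pfault, 100.0 * util);
}

int main(void)
{
    report(1);   /* 25 MB in 32 MB: fits, essentially no steady-state faults */
    report(2);   /* 50 MB in 32 MB: the jobs continually evict each other    */
    return 0;
}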

    This sort of contention could obviously be prevented by a smarter
scheduling technique - something which would entirely swap out one
process for a period of time in order to let the other process run.

Questions:

    Which (if any) high-performance Unix systems are capable of avoiding
this type of inter-process contention?  What works best - actual process
swapping layered on top of demand paging, better paging algorithms,
clever scheduling based on adaptive process priorities, ... ?

    How common is this type of problem?  Anyone else share my feeling that
many Unix systems could do a lot better at handling large processes?
Anyone else share my feeling that this is important (and sadly neglected)?

-- 
Laurence G. Yaffe		Internet: lgy@newton.phys.washington.edu
University of Washington	  Bitnet: yaffe@uwaphast.bitnet

henry@utzoo.uucp (Henry Spencer) (08/22/89)

In article <3332@blake.acs.washington.edu> lgy@newton.phys.washington.edu (Laurence Yaffe) writes:
>suppose you have a 32 Mbyte system and that there are only two processes
>running, each of which uses 25 Mbytes of virtual memory (almost all data space)
>with rapid, random access patterns...
>    This sort of contention could obviously be prevented by a smarter
>scheduling technique - something which would entirely swap out one
>process for a period of time in order to let the other process run.

Have you ever figured out how long it takes to swap out a 25MB process?
Or to swap it back in again?  The fact is, this example system simply
does not have enough main memory for its workload.  The fix is, either
double the memory or run the jobs in sequence rather than simultaneously.

"Real memory for real performance." -- Mike Tilson
-- 
V7 /bin/mail source: 554 lines.|     Henry Spencer at U of Toronto Zoology
1989 X.400 specs: 2200+ pages. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

lgy@blake.acs.washington.edu (Laurence Yaffe) (08/23/89)

In article <1989Aug22.163100.25540@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <3332@blake.acs.washington.edu> lgy@newton.phys.washington.edu (Laurence Yaffe) writes:
>>suppose you have a 32 Mbyte system and that there are only two processes
>>running, each of which uses 25 Mbytes of virtual memory (almost all data space)
>>with rapid, random access patterns...
>>    This sort of contention could obviously be prevented by a smarter
>>scheduling technique - something which would entirely swap out one
>>process for a period of time in order to let the other process run.
>
>Have you ever figured out how long it takes to swap out a 25MB process?
>Or to swap it back in again?  The fact is, this example system simply
>does not have enough main memory for its workload.  The fix is, either
>double the memory or run the jobs in sequence rather than simultaneously.
>
>V7 /bin/mail source: 554 lines.|     Henry Spencer at U of Toronto Zoology
>1989 X.400 specs: 2200+ pages. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

    This response ("buy more memory, or sequence jobs by hand"), while common,
completely misses the point.  The real issue here is how to maximize overall
performance with a given physical configuration.  The problem I raised is
"scale invariant" - no matter how much memory one has, there will always be 
important jobs which fill all of memory.  A bigger memory just means that
I'll try to run bigger jobs.  This is why I deliberately phrased the initial
description of this problem in terms of ratios of process size to physical
memory.  Part of the intended purpose of my posting was to emphasize that
system performance in the presence of large processes (large relative to
however much memory you can afford, but not so large that the system
shouldn't be able to run them efficiently) IS important!

    As to your specific question - yes, I have considered how long it takes
to swap out a 25 Mb process.  On my system, perhaps 10-20 seconds.  This is
utterly negligible for jobs which require many cpu-hours or cpu-days.  
The real issue here is also "scale invariant": as process size grows,
the time to swap the entire process in or out certainly grows.  But so
does the time required to page the same amount of data in from disk
and, I submit, so does the typical cpu time.  I see no reason to frown
on swapping strategies provided that the frequency of process swaps is
intelligently adjusted as a function of process size.
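
A little arithmetic makes the same point.  The effective transfer rate and
the overhead target below are assumptions, not measurements of any
particular disk or system:

#include <stdio.h>

#define XFER_MB_PER_SEC  1.5    /* assumed effective swap device rate     */
#define OVERHEAD_TARGET  0.05   /* tolerate 5% of the time spent swapping */

/* For a process of a given size, how long a "swap quantum" keeps the cost
   of swapping it out and back in down to the overhead target? */
int main(void)
{
    int mb;

    for (mb = 25; mb <= 400; mb *= 2) {
        double cycle_sec = 2.0 * mb / XFER_MB_PER_SEC;  /* out plus back in */
        double quantum   = cycle_sec * (1.0 - OVERHEAD_TARGET) / OVERHEAD_TARGET;

        printf("%3d MB job: ~%3.0f s to swap out and back, "
               "quantum of ~%5.1f min keeps overhead at %2.0f%%\n",
               mb, cycle_sec, quantum / 60.0, 100.0 * OVERHEAD_TARGET);
    }
    return 0;
}

For jobs that need cpu-hours or cpu-days, quanta of tens of minutes are no
hardship at all.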

    Finally, while running multiple large jobs in sequence is a valid
suggestion and will certainly avoid inter-process contention, to me
this is an admission of failure.  (Why do we like Unix?  Multiprocessing?)
If a single person is starting multiple large jobs, then scheduling by hand
may be acceptable.  If different users are trying to run big jobs then
strict sequential scheduling is a real pain.  One should be able to do better.
While I may have high expectations, I do think that minimizing the type of
contention I discussed is a reasonable user desire and should be a relevant
operating system design goal.

-- 
Laurence G. Yaffe		Internet: lgy@newton.phys.washington.edu
University of Washington	  Bitnet: yaffe@uwaphast.bitnet

frazier@oahu.cs.ucla.edu (Greg Frazier) (08/23/89)

In article <1989Aug22.163100.25540@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
=>In article <3332@blake.acs.washington.edu> lgy@newton.phys.washington.edu (Laurence Yaffe) writes:
=>>suppose you have a 32 Mbyte system and that there are only two processes
=>>running, each of which uses 25 Mbytes of virtual memory (almost all data space)
=>
=>Have you ever figured out how long it takes to swap out a 25MB process?
=>Or to swap it back in again?  The fact is, this example system simply
=>does not have enough main memory for its workload.  The fix is, either
=>double the memory or run the jobs in sequence rather than simultaneously.

The original posting was asking if there was an OS smart enough
to run the jobs in sequence for him.  That's the point - most
machines experience a wide range of workloads, and cannot be
customized for each one of them.  Any heavily-used scientific
machine is going to experience the described situation at some
point in its career.  Now, obviously, if you have somebody
monitoring the machine's performance, he can "manually" prevent
the two jobs from running simultaneously.  What would be more
desirable, however, is for the OS to realize what is going on,
and for _it_ to cause the jobs to run sequentially.  I suspect
that this would require too much intelligence on the part of the
OS, but then, what do I know - I'm just an architect! :-)

Greg Frazier
&&&&&&&&&&&&&&&&________________________@@@@@@@@@@@@@@@@@@@@
Greg Frazier	    o	Internet: frazier@CS.UCLA.EDU
CS dept., UCLA	   /\	UUCP: ...!{ucbvax,rutgers}!ucla-cs!frazier
	       ----^/----
		   /

lgy@blake.acs.washington.edu (Laurence Yaffe) (08/23/89)

    In article <26642@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
    >In article <1989Aug22.163100.25540@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
    >=>In article <3332@blake.acs.washington.edu> lgy@newton.phys.washington.edu (Laurence Yaffe) writes:
    >=>>	[ comments about excessive inter-process memory contention ]
    >=>
    >=   [ comments about avoiding the whole situation ]
    >
    >The original posting was asking if there was an OS smart enough
    >to run the jobs in sequence for him.  That's the point - most
    >machines experience a wide range of workloads, and cannot be
    >customized for each one of them.  Any heavily-used scientific
    >machine is going to experience the described situation at some
    >point in its career.  Now, obviously, if you have somebody

Absolutely correct.

    >monitoring the machine's performance, he can "manually" prevent
    >the two jobs from running simultaneously.  What would be more
    >desirable, however, is for the OS to realize what is going on,
    >and for _it_ to cause the jobs to run sequentially.  I suspect
    >that this would require too much intelligence on the part of the
    >OS, but then, what do I know - I'm just an architect! :-)
    >
    >Greg Frazier

I'd like to emphasize that it should be rather easy for an OS to recognize
when excessive page faulting is causing the system to thrash.
And once that happens, there's plenty of unused cpu cycles which could
be used by the OS in order to decide what to do about it!  So adding
some intelligence to the OS to better cope with this type of problem
need not have significant negative impact on ordinary interactive behavior.
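
The detection half really is simple.  Here's a sketch of the policy only --
the thresholds are guesses, and the sampling interface is an assumption (a
real daemon would read whatever vmstat-style paging and cpu counters its
particular kernel exposes):

#include <stdio.h>

struct sys_sample {
    double pageins_per_sec;    /* pages read in from disk per second    */
    double cpu_busy_fraction;  /* fraction of time spent on useful work */
};

#define PAGEIN_THRESH  50.0    /* assumed: "a lot" of paging            */
#define BUSY_THRESH    0.5     /* assumed: cpu mostly waiting on paging */
#define N_BAD          3       /* consecutive bad samples before acting */

/* Returns nonzero once the system has looked thrashy for N_BAD samples
   in a row: heavy paging combined with little useful cpu work. */
static int thrashing(const struct sys_sample *s)
{
    static int bad = 0;

    if (s->pageins_per_sec > PAGEIN_THRESH &&
        s->cpu_busy_fraction < BUSY_THRESH)
        bad++;
    else
        bad = 0;
    return bad >= N_BAD;
}

int main(void)
{
    struct sys_sample calm   = {  5.0, 0.95 };   /* synthetic samples */
    struct sys_sample loaded = { 90.0, 0.15 };
    int i;

    for (i = 0; i < 6; i++)
        printf("sample %d: thrashing = %d\n", i,
               thrashing(i < 2 ? &calm : &loaded));
    return 0;
}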

-- 
Laurence G. Yaffe		Internet: lgy@newton.phys.washington.edu
University of Washington	  Bitnet: yaffe@uwaphast.bitnet

khb@road.Sun.COM (Keith Bierman - Advanced Languages - Floating Point Group ) (08/23/89)

In article <26642@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
...deleted stuff

>machines experience a wide range of workloads, and cannot be
>customized for each one of them.  Any heavily-used scientific
>machine is going to experience the described situation at some
>point in its career.  Now, obviously, if you have somebody
>monitoring the machine's performance, he can "manually" prevent
>the two jobs from running simultaneously.  What would be more
>desirable, however, is for the OS to realize what is going on,
>and for _it_ to cause the jobs to run sequentially.  I suspect
>that this would require too much intelligence on the part of the
>OS, but then, what do I know - I'm just an architect! :-)

These are reasons why some folks actually like batch OS ... in the old
days we had ways of informing the system of our expected needs (cpu,
io, "cards", etc.) and this was used by some fairly clever scheduling
algorithms. As Unix machines grow up, I personally expect something
fancier than the nice level to allow the savvy programmer to inform the OS
of expected needs (for these scientific codes, we should be able to
assert "long running, not very interactive" .. this would enable long
timeslices but at low priority ... and perhaps only one mongo job at a
time). 

There are some "batchqueue" type products which third parties and
national labs have cobbled up.

Keith H. Bierman    |*My thoughts are my own. !! kbierman@sun.com
It's Not My Fault   |	MTS --Only my work belongs to Sun* 
I Voted for Bill &  | Advanced Languages/Floating Point Group            
Opus                | "When the going gets Weird .. the Weird turn PRO"

sbf10@uts.amdahl.com (Samuel Fuller) (08/23/89)

In article <26642@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
>
>The original posting was asking if there was an OS smart enough
>to run the jobs in sequence for him.  That's the point - most
>machines experience a wide range of workloads, and cannot be
>customized for each one of them.  Any heavily-used scientific
>machine is going to experience the described situation at some
>point in its career.  Now, obviously, if you have somebody
>monitoring the machine's performance, he can "manually" prevent
>the two jobs from running simultaneously.  What would be more
>desirable, however, is for the OS to realize what is going on,
>and for _it_ to cause the jobs to run sequentially.  I suspect
>that this would require too much intelligence on the part of the
>OS, but then, what do I know - I'm just an architect! :-)

This type of job scheduling is what IBM's MVS operating system does
very well and what Unix does very poorly.  As far as I know the
Unix operating system has no concept of job classes.
The dreaded JCL requires that you tell the system how long your job
will run, how much memory it will need, and any resources like tape
drives that you will be using.  Your job is then classified based on
these parameters.  An MVS system will only allow a certain number of
jobs to be run in each class.  If your job runs longer than you said
it would, it is terminated  ungracefully.

MVS is willing to trade batch throughput for interactive throughput.
A very heavily loaded MVS system still has excellent interactive response.
But any big programs, like simulations of future processors, stay
swapped out of the system until late in the night.

Unix has the opposite problem.  When a Unix system is very heavily loaded
everybody notices. Programs run slowly and response time goes to hell.

Is there a happy middle ground?
-- 
---------------------------------------------------------------------------
Sam Fuller / Amdahl System Performance Architecture

I speak for myself, from the brown hills of San Jose.

UUCP: {ames,decwrl,uunet}!amdahl!sbf10 | USPS: 1250 E. Arques Ave (M/S 139)
INTERNET: sbf10@amdahl.com             |       P.O. Box 3470
PHONE: (408) 746-8927                  |       Sunnyvale, CA 94088-3470
---------------------------------------------------------------------------

martin@minster.york.ac.uk (08/23/89)

This seems to be one area in which ``progress'' in the last few years
(especially on Unix systems) appears to have been going backwards!!

Several systems in the past have had very good (although still very simple)
paging systems which minimise the kind of effects that have been mentioned.
A good example that springs to mind was the Edinburgh EMAS system.

Given existing work, there is no excuse for a system to EVER thrash.
If a process gets too big, it might slow down, but it should not affect
other processes (to first order, anyway).

Recently too much effort has (in my opinion :-) been put into ``feature-full''
paging systems, and not enough into paging systems that work. Dare I say it,
Berkeley and SunOS leap to mind here.

Martin
usenet: ...!mcvax!ukc!minster!martin
JANET:  martin@uk.ac.york.minster
surface:
	Martin C. Atkins
	Dept. of Computer Science
	University of York
	Heslington
	York YO1 5DD
	ENGLAND

davecb@yunexus.UUCP (David Collier-Brown) (08/23/89)

[In a discussion of excessive inter-process memory contention ]
lgy@blake.acs.washington.edu (Laurence Yaffe) writes:
| I'd like to emphasize that it should be rather easy for an OS to recognize
| when excessive page faulting is causing the system to thrash.
| And once that happens, there's plenty of unused cpu cycles which could
| be used by the OS in order to decide what to do about it!  So adding
| some intelligence to the OS to better cope with this type of problem
| need not have significant negative impact on ordinary interactive behavior.

	The execution of this policy should be fairly cheap: deciding on
when to carry it out might be harder.
	I'll argue that a daemon which detects processes thrashing the
system can find all the processes which should be scheduled sequentially,
but will have to do a large amount of work ensuring that they're really
non-interactive, able to be arbitrarily delayed, etc.  Once you do the
decision making, the kernel can smash their priorities down "below the
flashing lights", or make them the idle process itself.

	I really wouldn't care to have the policy-maker in a kernel.
I think the risc/cisc complexity/cost arguments apply here (and that's
the only thing about architecture in this whole posting).

	For practical purposes, the SysV printer spooling system is really
quite a general-purpose creature, and can be turned into a mechanism for
simulating a batch queue (source: Drew Sullivan, personal communication).
	As we have a machine (Mips M2000) with such a spooler sitting there
unused, I'll have to see about using it as a batch queue for our scientists'
number-crunching jobs.
					--dave
-- 
David Collier-Brown,  | davecb@yunexus, ...!yunexus!davecb or
72 Abitibi Ave.,      | {toronto area...}lethe!dave 
Willowdale, Ontario,  | Joyce C-B:
CANADA. 223-8968      |    He's so smart he's dumb.

adam@castle.ed.ac.uk (Adam Hamilton) (08/23/89)

In article <3342@blake.acs.washington.edu> lgy@blake.acs.washington.edu (Laurence Yaffe) writes:

	We have an ongoing discussion about scheduling of large jobs.
I suggest that a good batch system would be better at this than trying to
do everything in the kernel.  It can monitor the various requirements of
the separate jobs (store utilisation, i/o and system call patterns etc)
and tune accordingly.  What it needs from the operating system is this
information plus some more (like interactive usage).
	Summary - if it doesn't need to be in the kernel, don't put it there.

mat@uts.amdahl.com (Mike Taylor) (08/23/89)

In article <9aid02rf4dNn01@amdahl.uts.amdahl.com>, sbf10@uts.amdahl.com (Samuel Fuller) writes:
> 
> This type of job scheduling is what IBM's MVS operating system does
> very well and what Unix does very poorly.  As far as I know the
> Unix operating system has no concept of job classes.
> 
> Unix has the opposite problem.  When a Unix system is very heavily loaded
> everybody notices. Programs run slowly and response time goes to hell.
> 
> Is there a happy middle ground?

There is NQS, which adds some of the batch queuing ideas to Unix systems.
What doesn't exist, as far as I know, is much control of dynamic resource
allocation.  MVS went through an evolution from manual control to
automated control (SRM - System Resource Manager) which Unix has yet to
follow.  But even SRM cannot provide a free lunch - it attempts to avoid
overcommitting resources by shutting off or reducing service to some users
according to rules that you provide.
-- 
Mike Taylor                               ...!{hplabs,amdcad,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's.  ]

slackey@bbn.com (Stan Lackey) (08/23/89)

In article <26642@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
>In article <1989Aug22.163100.25540@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>=>In article <3332@blake.acs.washington.edu> lgy@newton.phys.washington.edu (Laurence Yaffe) writes:
>=>Have you ever figured out how long it takes to swap out a 25MB process?
>=>Or to swap it back in again?

>The original posting was asking if there was an OS smart enough
>to run the jobs in sequence for him.  That's the point - most
>machines experience a wide range of workloads, and cannot be
>customized for each one of them.  Any heavily-used scientific ...

That was one of the best things (I thought) about VMS: the ability
to set up batch queues of different priorities, and set the number of
jobs in progress at each level.  On two large programs I worked on,
there were some large simulations and many small ones.  Interactive
jobs were set at level 4, so they got done first.  We set up level 3
(lower priority) with a job limit of 2 or 3 for the small jobs, and
level 2 (even lower) for the huge ones that ran for, oh, 6 hours and
used megabytes, with a job limit of 1.  The only time the machine was
idle was right after a reboot.

We found that there should be more than one job available, so
something useful happens during paging (assuming your machine pages).
-Stan

henry@utzoo.uucp (Henry Spencer) (08/24/89)

In article <3342@blake.acs.washington.edu> lgy@blake.acs.washington.edu (Laurence Yaffe) writes:
>...This response ("buy more memory, or sequence jobs by hand"), while common,
>completely misses the point.  The real issue here is how to maximize overall
>performance with a given physical configuration...

Well, but I don't think it *entirely* misses the point.  One should be
aware that there are circumstances in which *no* amount of fiddling will
get reasonable performance out of a given configuration and a given
workload, as the two are simply incompatible.

>    As to your specific question - yes, I have considered how long it takes
>to swap out a 25 Mb process.  On my system, perhaps 10-20 seconds.  This is
>utterly negligible for jobs which require many cpu-hours or cpu-days.  

It would have been nice if you'd specified that these are background compute
jobs, not interactive applications; your example (Mathematica) strongly
suggested otherwise.  Timesharing machines and batch computing servers
are very different applications of the hardware.  For interactive processes,
I stand by my original comment:  the system is simply too small.  For batch
work, the idea is reasonably practical but requires very different system
tuning, which Unix (traditionally an interactive system) hasn't given a lot
of attention to.

Also, have you *timed* that swapping, or is that just maximum disk transfer
rate?  Remember that your system probably can't drive your disks at full
speed; disk transfer rates are "guaranteed not to exceed" rates, not often
achievable in practice.  Your point is still valid, in that the time is
negligible for long-running batch jobs, but a dedicated batch machine is
clearly needed, as that kind of disk activity will ruin interactive use.
-- 
V7 /bin/mail source: 554 lines.|     Henry Spencer at U of Toronto Zoology
1989 X.400 specs: 2200+ pages. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

henry@utzoo.uucp (Henry Spencer) (08/24/89)

In article <9aid02rf4dNn01@amdahl.uts.amdahl.com> sbf10@amdahl.uts.amdahl.com (Samuel Fuller) writes:
>Unix has the opposite problem.  When a Unix system is very heavily loaded
>everybody notices. Programs run slowly and response time goes to hell.
>
>Is there a happy middle ground?

A better scheduler is reported to help a whole lot.  The existing Unix one
was meant for small timesharing systems with no significant background load.
-- 
V7 /bin/mail source: 554 lines.|     Henry Spencer at U of Toronto Zoology
1989 X.400 specs: 2200+ pages. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

cander@unisoft.UUCP (Charles Anderson) (08/24/89)

From article <3332@blake.acs.washington.edu>, by lgy@blake.acs.washington.edu (Laurence Yaffe):
> Questions:
>     Which (if any) high-performance Unix systems are capable of avoiding
> this type of inter-process contention?  What works best - actual process
> swapping layered on top of demand paging, better paging algorithms,
> clever scheduling based on adaptive process priorities, ... ?

Straight out of school  I worked on a proprietary, non-Unix system that
handled this problem fairly well (even though the company went out of
business).  In addition to the usual queues for runnable procs, I/O waiting
procs, etc, the operating system had a "trouble maker" queue.  This was
for processes that were making too many demands on the system, i.e.
encouraging thrashing.  Processes on the TM queue were stripped of
their resources (essentially swapped out, I think).  After a while they
were allowed to return and were given some extra grace because the
first thing they're going to do is page-fault in their working set
which would otherwise land them right back in the pokey.

In the example that Mr. Yaffe gives, there would be two processes, A &
B, competing with one another and thus driving the system towards
thrashing.  The system sees this, selects A as the trouble maker and
puts him in the penalty box.  B runs fine without A getting in his
way.  After a while A is allowed to return and he starts slamming
against B.  A is graced, having just returned from the TM queue, so B
gets selected for the TM queue and A runs well until B returns when the
sequence repeats again.  I don't know how well this would work on a real 
Unix system, but it seemed to work well for this particular system.  
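
A sketch of that selection rule, for concreteness.  The bookkeeping structure
and the fault counts are made up (this is not the proprietary system's code);
only the policy is the point: banish the heaviest faulter that is not still
in its post-penalty grace period.

#include <stdio.h>

struct proc_info {
    int  pid;
    long recent_faults;  /* page faults over the last sampling interval    */
    int  grace;          /* intervals of grace left after leaving TM queue */
};

/* Pick the next trouble maker: the heaviest faulter without grace. */
static int pick_trouble_maker(struct proc_info *p, int n)
{
    int i, victim = -1;

    for (i = 0; i < n; i++) {
        if (p[i].grace > 0) {        /* just back from the penalty box */
            p[i].grace--;
            continue;
        }
        if (victim < 0 || p[i].recent_faults > p[victim].recent_faults)
            victim = i;
    }
    return victim;
}

int main(void)
{
    struct proc_info p[2] = { { 100, 900, 0 }, { 200, 850, 0 } };
    int round;

    for (round = 0; round < 4; round++) {     /* A and B alternate in the box */
        int v = pick_trouble_maker(p, 2);
        printf("round %d: put pid %d on the TM queue\n", round, p[v].pid);
        p[v].grace = 1;                       /* one round of grace on return */
    }
    return 0;
}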

-- 

Charles.
{sun, amdahl, ucbvax, pyramid, uunet}!unisoft!cander

ingoldsb@ctycal.COM (Terry Ingoldsby) (08/24/89)

In article <3342@blake.acs.washington.edu>, lgy@blake.acs.washington.edu (Laurence Yaffe) writes:
> In article <1989Aug22.163100.25540@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
> >In article <3332@blake.acs.washington.edu> lgy@newton.phys.washington.edu (Laurence Yaffe) writes:
> >>suppose you have a 32 Mbyte system and that there are only two processes
> >>running, each of which uses 25 Mbytes of virtual memory (almost all data space)
> >>with rapid, random access patterns...
> >>    This sort of contention could obviously be prevented by a smarter
> >>scheduling technique - something which would entirely swap out one
> >>process for a period of time in order to let the other process run.
> >
> >Have you ever figured out how long it takes to swap out a 25MB process?
> >Or to swap it back in again?  The fact is, this example system simply
> >does not have enough main memory for its workload.  The fix is, either
> >double the memory or run the jobs in sequence rather than simultaneously.
> >
> >V7 /bin/mail source: 554 lines.|     Henry Spencer at U of Toronto Zoology
> >1989 X.400 specs: 2200+ pages. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
> 
>     This response ("buy more memory, or sequence jobs by hand"), while common,
> completely misses the point.  The real issue here is how to maximize overall

But surely this is a discussion of the classic tradeoff of response time vs
throughput.  It is quite common that tuning a system for maximum throughput
degrades overall response time to the point that users threaten to lynch the
system manager.  Note that most supercomputers, whose raison d'etre is to go
real fast, don't timeshare at all.

Finally, I don't understand why your performance is so poor, but maybe there
is something I don't understand about UNIX scheduling, or else your task is
*so* degenerate that no memory management/scheduling scheme will work.  In
normal circumstances if a process tries to access a non-resident page, the OS
blocks that process until the I/O completes and lets the other job(s) that
do have the necessary info in memory continue to run.  If you mean that your
jobs are so degenerate that each process does only a handful of memory accesses
before page faulting, then obviously you will always be waiting for the
page to be swapped in.  If your accesses are truly random, then *no* memory
scheme will ever be able to give you a good hit rate.  If they are not
truly random, then organize your data so that chronologically accessed
elements reside on the same page.

In summary, either reorganize your data, or do what Henry says.
Although interesting, this discussion is probably moving away from the
charter of comp.arch.
-- 
  Terry Ingoldsby                       ctycal!ingoldsb@calgary.UUCP
  Land Information Systems                           or
  The City of Calgary         ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb

lgy@blake.acs.washington.edu (Laurence Yaffe) (08/24/89)

In article <2394@unisoft.UUCP> cander@unisoft.UUCP (Charles Anderson) writes:
-From article <3332@blake.acs.washington.edu>, by lgy@newton.phys.washington.edu (Laurence Yaffe):
-> Questions:
->     Which (if any) high-performance Unix systems are capable of avoiding
-> this type of inter-process contention?  What works best - actual process
-> swapping layered on top of demand paging, better paging algorithms,
-> clever scheduling based on adaptive process priorities, ... ?
-
-Straight out of school  I worked on a proprietary, non-Unix system that
-handled this problem fairly well (even though the company went out of
-business).  In addition to the usual queues for runnable procs, I/O waiting
-procs, etc, the operating system had a "trouble maker" queue.  This was
-for processes that were making too many demands on the system, i.e.
-encouraging thrashing.  Processes on the TM queue were stripped of
-their resources (essentially swapped out, I think).  After a while they
-were allowed to return and were given some extra grace because the
-first thing they're going to do is page-fault in their working set
-which would otherwise land them right back in the pokey.
-
-In the example that Mr. Yaffe gives, there would be two processes, A &
-B, competing with one another and thus driving the system towards
-thrashing.  The system sees this, selects A as the trouble maker and
-puts him in the penalty box.  B runs fine without A getting in his
-way.  After a while A is allowed to return and he starts slamming
-against B.  A is graced, having just returned from the TM queue, so B
-gets selected for the TM queue and A runs well until B returns when the
-sequence repeats again.  I don't know how well this would work on a real 
-Unix system, but it seemed to work well for this particular system.  
-
--- 
-Charles.

    Of the various responses to my initial query, this note by Charles
Anderson comes closest to the sort of approach I've been wishing for.
It has the virtues of:

a) being dynamic - responding to what a process actually does, not what
   some user said it might do.  This is clearly desirable.  I take it
   for granted that static, rigidly enforced job queues a la IBM are a
   prime case of the "cure" being as bad as the "disease".
   ("I'm sorry, your job ran for 2 weeks before exceeding the page fault
   limit you initially guessed.  You lose.  Try running it again with a
   larger limit...")

b) approximating what a human system manager typically does:

	notice that the system is starting to thrash, 

	look at the active processes and see which ones
	are generating high page fault rates,

	find out the effective "working set" size of these processes
	and figure out which processes can effectively run concurrently,

	try putting one or more of the processes "on hold"
	in order to allow the remaining processes to run effectively,

	periodically reevaluate the situation and readjust which (if any)
	processes should be put "on hold".

    Whether "on hold" means explicitly swapped out, or suspended, or
    simply restricted to a small, but non-zero number of physical memory
    pages is an implementation issue (about which I was hoping to hear
    some interesting discussion).  I do know that simply manipulating
    "nice" values is not adequate (at least on the systems I use) for
    preventing memory contention.  I don't know the "best" way to
    implement this sort of approach - I'm a full time user and part
    time system manager, not an OS guru.

The important point is the need for some sort of intelligently selected,
dynamic response which prevents system resources from being allocated
excessively "democratically" at times when "equal shares for all"
turns into "everyone starves".

-- 
Laurence G. Yaffe		Internet: lgy@newton.phys.washington.edu
University of Washington	  Bitnet: yaffe@uwaphast.bitnet

levisonm@qucis.queensu.CA (Mark Levison) (08/24/89)

  I am not sure how Mr Yaffe's MIPS box or any other UNIX system handles
thrashing, but I do remember the solutions that we discussed in my O.S.
course a few years ago.  Let's assume we have two jobs, both of which need
approx 20 MB of memory to avoid thrashing on this system (which has 32 MB),
and that these are both jobs with long run times.  Two different users start
them up and don't know about each other, so the thrashing sets in.  The O.S.
should be able to detect the thrashing by observing the very high paging
and reduced CPU utilization.  It selects the process with the most
pages in memory and stomps on it (or swaps it out, if you want the technical
term).  Now the thrashing goes away for a while ... when the CPU and paging
are in the right state again, the OS swaps the stomped process back in and
notes, with all its other information, that it has recently been stomped.
Now if the second job has not finished, the thrashing problem occurs
all over again, but this time the second job is stomped because the first
job carries that stomp note with it.  This will continue until one of the
processes has finished.  I am surprised that most UNIX boxes don't already
support this, and my system manager probably won't appreciate me experimenting
on our system.

Mark Levison
levisonm@qucis.queensu.ca

chris@yarra.oz.au (Chris Jankowski) (08/24/89)

In article <3332@blake.acs.washington.edu> lgy@newton.phys.washington.edu (Laurence Yaffe) writes:
>     As a concrete example (typical of what I've been trying to run lately),
> suppose you have a 32 Mbyte system and that there are only two processes
> running, each of which uses 25 Mbytes of virtual memory (almost all data space)
> with rapid, random access patterns.  (I.e., normal behavior for a program like
> Mathematica.)  If only one job were present, it would fit in real memory and
> the machine would run at virtually 100% cpu utilization.  With two such jobs
> in a demand-paged environment what appears to happen is that each process
> typically gets enough real memory so that it denies the other process 
> sufficient memory to run effectively - i.e., both processes continually
> page fault and cpu utilization plummets.
>
>     This sort of contention could obviously be prevented by a smarter
> scheduling technique - something which would entirely swap out one
> process for a period of time in order to let the other process run.
>
> Questions:
>
  .....
>     How common is this type of problem?  Anyone else share my feeling that
> many Unix systems could do a lot better at handling large processes?
> Anyone else share my feeling that this is important (and sadly neglected)?

I believe that the main design objective for the UNIX scheduler was to achieve
good *response* time for interactive users, and it does that unless there is
severe resource depletion in the system due to excessive load.

In your particular problem you seem to be trying to maximise *throughput*.
There is an obvious tradeoff here and there are no free lunches.
The best way to do it is to create a batch environment in the 1960s mainframe
style.  UNIX System V has a ready mechanism for you to run a single-stream
batch queue.  There are also many third party products which implement
fancy multistream batch queues with even fancier management.

I think that no scheduler can solve the problem in a real-life situation,
as opposed to such an artificially simple example.  In reality you would need
the scheduler to predict all *future* requirements of all processes.
Even if the processes could signal their future requirements to the scheduler,
resolving the schedule would require a lot of processing time, and at the
current level of technology not many people can afford to have one of their
processors spend all its time calculating future savings.  But the processes
do not know their future requirements either.


      -m-------   Chris Jankowski - Senior Systems Engineer chris@yarra.oz{.au}
    ---mmm-----   Pyramid  Technology  Australia	    fax  +61 3 820 0536
  -----mmmmm---   11th Floor, 14 Queens Road                tel. +61 3 820 0711
-------mmmmmmm-   Melbourne, Victoria, 3004       AUSTRALIA       (03) 820 0711

"Knowing how things work is the basis for appreciation,
and is thus a source of civilized delight."  -- William Safire

bin@primate.wisc.edu (Brain in Neutral) (08/24/89)

This thread now really has little to do with MIPS systems in
particular.  Please, edit the newsgroups lines to say comp.arch,
not comp.arch,comp.sys.mips.  Thanks.

Paul DuBois
dubois@primate.wisc.edu

chrisp@hcr.UUCP (Chris Phillips) (08/24/89)

In article <d0e502oG4d0e01@amdahl.uts.amdahl.com> mat@amdahl.UUCP writes:
>In article <9aid02rf4dNn01@amdahl.uts.amdahl.com>, sbf10@uts.amdahl.com (Samuel Fuller) writes:
>> 
>> This type of job scheduling is what IBM's MVS operating system does
>> [...]
>> Is there a happy middle ground?
>[...]
>allocation.  MVS went through an evolution from manual control to
>automated control (SRM - System Resource Manager) which Unix has yet to
>follow.  But even SRM cannot provide a free lunch - it attempts to avoid
>overcommitting resources by shutting off or reducing service to some users
>according to rules that you provide.

Ten years back when I was involved with various parts of MVS(3.6+?) I
believe we could show a 30% cpu load on a 43xx class machine with no
active work. A good portion of this was the SRM. It (the SRM) was also
larger than the UNIX(tm) of that time (1976-1979) (And mostly written
in PL/S II). This is not to say that any form of intelligent resource
management is bad, but rather to emphasize that it will cost. I think
IBM is now providing an AI expert system to generate the parameters
for the SRM. This (at least early in MVS life) used to be nearly a
Black Art. I can imagine an interface based on NeXTStep would be quite
impressive!
-- 
Chris Phillips, Hcr Corp.,1001  "A wise old owl sat in an oak,
130 Bloor St W,Toronto Ontario    The more he heard the less he spoke,
Bell: (416) 922-1937   M5S 1N5   The less he spoke the more he heard,
CIS : 75026,3673       CANADA     Why aren't we all like that wise old bird?"
UUCP: {utzoo,utcsri,ihnp4}!hcr!chrisp         <Traditional childrens verse>

rwc%ilmarin.c3@lanl.gov (Robert W. Cox) (08/24/89)

From article <440@ctycal.UUCP>, by ingoldsb@ctycal.COM (Terry Ingoldsby):

> Note that most supercomputers, whose raison d'etre is to go
> real fast, don't timeshare at all.

The majority of supercomputers are Crays.  The most common OS's on
Crays are CTSS and UNICOS, both of which are timesharing.

Bob Cox

ram@wb1.cs.cmu.edu (Rob MacLachlan) (08/24/89)

Being a Lisp programmer, I have also observed that Unix (and many other
O.S.'s also) tends to lose when running large programs that don't fit a
"typical" VM behavior pattern.

One problem is that many O.S.'s use a purely global page replacement
algorithm: no consideration is given to what process that page belongs to,
and to what the history of that process is.  

Many classic VM algorithms (such as Working Set) do consider processes and
degree of thrashing, etc., but these algorithms seem to be little used in
recently designed o.s.'es.  [Since this is comp.arch, I will note in passing
that this may be partly due to brain-dead VM hardware, such as the VAX, which
doesn't even have a reference bit.]  This is probably because people
discovered that with lower degrees of time-sharing and larger memories,
global strategies worked just as well for the "typical" workload.

As to how an O.S. could address these problems, I strongly agree that a
dynamic strategy is preferable to a batch-oriented one.  A Lisp process
generally persists for an entire interactive session, and may exhibit a wide
range of different behaviors during that period.  The O.S. must not only
recognise when the process is antisocial, it must also recognize when it
becomes social again.

As to exactly what to do when you discover that memory is over-committed:
>    Whether "on hold" means explicitly swapped out, or suspended, or
>    simply restricted to a small, but non-zero number of physical memory
>    pages is an implementation issue (about which I was hoping to hear
>    some interesting discussion).

It's hard to know exactly what you mean by "explicitly swapped out" in a
paged VM system.  Unlike in a true swapping system, you are never going to
write out clean pages.  The question is only what you are going to do to the
process's resident pages at the time you decide to "swap" it.  You can:
 -- Clean all the dirty pages (perhaps freeing them once cleaned), and
 -- free all of the clean pages.

Scheduling writes of all the dirty pages is probably worthwhile; it probably
doesn't make any difference whether you free the clean pages or just allow
them to be replaced when you page the other process in.
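
In outline, "swapping out" a process in a paged VM then amounts to something
like the sketch below.  struct vpage and the two helpers are made-up
stand-ins for whatever a real kernel provides; the clean/dirty distinction is
the only real content here.

#include <stdio.h>

struct vpage {
    int resident;   /* currently holds a physical frame                    */
    int dirty;      /* modified since it was last written to backing store */
};

/* Stubs standing in for real kernel primitives. */
static void queue_page_cleaning(int i)
{
    printf("queue page %d for writing\n", i);
}
static void free_frame(int i)
{
    printf("free frame of page %d\n", i);
}

static void swap_out(struct vpage *page, int npages)
{
    int i;

    for (i = 0; i < npages; i++) {
        if (!page[i].resident)
            continue;
        if (page[i].dirty)
            queue_page_cleaning(i);  /* frame reclaimed once the write is done  */
        else
            free_frame(i);           /* a copy already exists on disk: drop it  */
        page[i].resident = 0;
    }
}

int main(void)
{
    struct vpage p[4] = { {1, 1}, {1, 0}, {0, 0}, {1, 1} };

    swap_out(p, 4);
    return 0;
}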

"suspending" the process presumably means stopping the process, but doing
neither of the above actions on its pages.  This would work pretty well,
especially if the VM system cleans dirty pages efficiently (many at a time).

Restricting a thrashing process to even less memory is a bad idea.  Although
it will allow other processes to get the memory they need, the restricted
process will thrash even more, soaking up disk bandwidth and O.S. runtime
that could be used by other processes.  There's no point in trying to let
the thrashing process run, since it's not going to be able to get anything
done without the memory it needs.

  Rob

rpeglar@csinc.UUCP (Rob Peglar x615) (08/25/89)

In article <9aid02rf4dNn01@amdahl.uts.amdahl.com>, sbf10@uts.amdahl.com (Samuel Fuller) writes:
> 
> In article <26642@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
> >
> >The original posting was asking if there was an OS smart enough
> >to run the jobs in sequence for him.  That's the point - most
> >machines experience a wide range of workloads, and cannot be
> >customized for each one of them.  Any heavily-used scientific
> >machine is going to experience the described situation at some
> >point in its career. 

(verbiage deleted)

I don't know of any, offhand.  There are a few batch-oriented OSes
out there that allow the user to sequence a series of jobs by informing
the OS (a priori) of the sequence desired.  No funny file games needed.
But, again, the point was the OS doing "automatic sequencing".  Can't
think of one.

> 
> This type of job scheduling is what IBM's MVS operating system does
> very well and what Unix does very poorly.  As far as I know the
> Unix operating system has no concept of job classes.
> The dreaded JCL requires that you tell the system how long your job
> will run, how much memory it will need, and any resources like tape
> drives that you will be using.  Your job is then classified based on
> these parameters.  An MVS system will only allow a certain number of
> jobs to be run in each class.  If your job runs longer than you said
> it would, it is terminated  ungracefully.
> 
> MVS is willing to trade batch throughput for interactive throughput.
> A very heavily loaded MVS system still has excellent interactive response.
> But any big programs, like simulations of future processors, stay
> swapped out of the system until late in the night.

Missed the point.  Many OSes, again, allow a priori resource information
and schedule/control individual jobs (without sequencing as above) and/or
system load, based on the estimates.  No OS (that this addled brain can
think of) solve the optimization problem of maximizing resource utilization
vs. minimizing system throughput, even with knowledge of (N) jobs' initial
conditions, by job sequencing, in a dynamic state (non a priori).

> 
> Unix has the opposite problem.  When a Unix system is very heavily loaded
> everybody notices. Programs run slowly and response time goes to hell.
> 
> Is there a happy middle ground?

Yup.  For example, VSOS has both a priori resource information-based scheduling,
and also dynamic alteration of process state, based on admin-selected memory
utilization criteria.  Both processes and queues can be affected.  It's not
a job sequencer, however.

But it ain't Unix.

Rob

Rob Peglar	...!uunet!csinc!rpeglar
Manager, Software R&D, Workstation Group
Control Systems, Inc., St. Paul MN

aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (08/25/89)

..> L. Yaffe talks about thrashing (2 25M processes on a 32M system), asks:
>    Which (if any) high-performance Unix systems are capable of avoiding
>this type of inter-process contention?  What works best - actual process
>swapping layered on top of demand paging, better paging algorithms,
>clever scheduling based on adaptive process priorities, ... ?

Gould NPL had a "working set" scheduler for memory; a working set was
determined by looking at the number of pages touched by a process in a
given interval of process virtual time (not system time); a process would
not run if the working set could not be loaded. Pervase Akhtar wrote a
paper on it for a past UNIX conference.

Working set schedulers are quite common in large (non-UNIX) systems.
See any recent issue of ACM SIGMETRICS' Conference Proceedings.
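
A sketch of the general bookkeeping, for the curious (this is not Gould's
code -- the page-table layout and reference-bit handling here are invented;
the policy is the part that matters: scan and clear reference bits every tau
ticks of the process's *own* cpu time, and don't schedule a process whose
last measured working set won't fit):

#include <stdio.h>

#define NPAGES  64    /* toy page table */

struct proc_vm {
    int ref[NPAGES];   /* per-page reference bit, set by the "hardware"        */
    int working_set;   /* pages referenced during the last virtual-time window */
};

/* Called each time the process has used another tau ticks of cpu time
   (process virtual time, not wall-clock time). */
static void end_of_window(struct proc_vm *p)
{
    int i, ws = 0;

    for (i = 0; i < NPAGES; i++) {
        if (p->ref[i])
            ws++;
        p->ref[i] = 0;          /* clear for the next window */
    }
    p->working_set = ws;
}

/* Scheduler check: only run the process if its working set can be resident. */
static int ok_to_run(const struct proc_vm *p, int free_frames)
{
    return p->working_set <= free_frames;
}

int main(void)
{
    struct proc_vm p = { {0}, 0 };
    int i;

    for (i = 0; i < 40; i++)    /* pretend the last window touched 40 pages */
        p.ref[i] = 1;
    end_of_window(&p);
    printf("working set = %d pages; run with 32 free frames? %s\n",
           p.working_set, ok_to_run(&p, 32) ? "yes" : "no");
    return 0;
}
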
--
Andy "Krazy" Glew,  Motorola MCD,    	    	    aglew@urbana.mcd.mot.com
1101 E. University, Urbana, IL 61801, USA.          {uunet!,}uiucuxc!udc!aglew
   
My opinions are my own; I indicate my company only so that the reader
may account for any possible bias I may have towards our products.

paulsc@radio_flyer.WV.TEK.COM (Paul Scherf;685-2734;61-028;692-4142;orca) (08/25/89)

Much of my computing, during the time I was an undergraduate student, was on a
batch system, with job size queues, etc. ....  The job sizes were based on the
maximum number of CPU seconds, maximum memory size, ... requested on the "job
card" (the first line/card in the input file).  Every time the computing center
published a change to the scheduling algorithm or the job size categories, some
of my friends and I would come up with an "un-scheduling" algorithm for the
best job size, based on the current job mix.  (There were several terminals
around campus, solely for checking on the status of a particular job or the
current job mix.)  We found that often we could get better service than our
classmates, by forcing our jobs into larger job sizes, by telling the system we
were going to use more CPU time, memory, ... than we would really use.  Some of
the most complex scheduling algorithms let us come up with the most effective
"un-scheduling" algorithms.

Sure made it easier to learn queueing theory,
Paul Scherf, Tektronix, Box 1000, MS 61-028, Wilsonville, OR, USA
paulsc@orca.WV.Tek.com     503-685-2734     tektronix!orca!paulsc

mdeale@cosmos.acs.calpoly.edu (Myron Deale) (08/25/89)

   There's a paper by Jerry C. Yan, "Post-Game Analysis - A Heuristic
Resource Management Framework for Concurrent Systems." Tech. Rpt. CSL-
TR-88-374, Stanford U., where the idea is "program-machine mapping
is improved 'in between' program executions."

   Not exactly dynamic, doesn't necessarily catch the thrasher in the
act, but the next time it's run, gotcha. Love the name, "Post-Game
Analysis."  ["Ho, what about these Dodgers! 22nd inning, the catcher
calls for a jsr to hi-mem and Dempsey swaps it out of the ball-park." :]
Would be nice to bring it into real-time though.


-Myron
// mdeale@cosmos.acs.calpoly.edu

ttwang@polyslo.CalPoly.EDU (Thomas Wang) (08/25/89)

lgy@newton.phys.washington.edu (Laurence Yaffe) writes:

>    A lot of current machines, such as my MIPS box, have pretty standard
>virtual memory systems: demand paging is used to allocate real memory to
>individual processes, with some sort of least-recently-used algorithm
>employed to select pages to write out.
>Severe, and unnecessary, performance degradation can occur if you have several
>processes which have very poor locality of reference when accessing their
>data, are sufficiently small that any single process can fit in real memory,
>and yet sufficiently large that no two processes will fit in real memory.

To avoid an excessive page fault rate, a good algorithm is to load in multiple
pages, depending on system load.  Actually, how many pages you should load in
is still a research question.  In HP's terminology, loading in extra pages is
called 'pre-fetching'.

HP's MPE/XL has pre-fetching.  I don't know about Unix, but my guess is
that Unix probably does not.


 -Thomas Wang ("I am, therefore I am."
                 - Akira               )

                                                     ttwang@polyslo.calpoly.edu

aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (08/25/89)

>Have you ever figured out how long it takes to swap out a 25MB process?
>Or to swap it back in again?  The fact is, this example system simply
>does not have enough main memory for its workload.  The fix is, either
>double the memory or run the jobs in sequence rather than simultaneously.

Isn't that what the scheduler is supposed to figure out by itself?
--
Andy "Krazy" Glew,  Motorola MCD,    	    	    aglew@urbana.mcd.mot.com
1101 E. University, Urbana, IL 61801, USA.          {uunet!,}uiucuxc!udc!aglew
   
My opinions are my own; I indicate my company only so that the reader
may account for any possible bias I may have towards our products.

hammondr@sunroof.crd.ge.com (richard a hammond) (08/25/89)

lgy@newton.phys.washington.edu (Laurence Yaffe) writes:
>
>>...
>>Severe, and unnecessary, performance degradation can occur if you have several
>>processes which have very poor locality of reference when accessing their
>>data, are sufficiently small that any single process can fit in real memory,
>>and yet sufficiently large that no two processes will fit in real memory.

In article <13773@polyslo.CalPoly.EDU> ttwang@polyslo.CalPoly.EDU (Thomas Wang) comments:

>To avoid excessive page fault rate, a good algorithm is to load in multiple
>pages depending on system load.  Actually how many pages should you load in
>is still a research question.  In HP's term, loading in extra pages is
>called 'pre-fetching'.
>
>HP's MPE/XL have pre-fetching.  I don't know about Unix, but my guess is
>that Unix probably does not.

Note that in the case postulated by L. Yaffe, pre-fetching makes the thrashing
more severe, because the memory accesses have "poor locality of reference",
i.e. the working set is very close to the total memory required by the
process.  Fetching pages n through n+m because of a page fault on n
would increase the I/O overhead without increasing the probability
that you would avoid the next page fault.

Historical commentary:
The early multi-user UNIX machines, DEC PDP 11/45 and 11/70, could support
more physical memory than the largest process could use, so the scheduler
didn't worry about this sort of thrashing.

VM works fine if you have good locality of reference, while it fails miserably
if you don't.  With poor locality of reference, you need to have enough
physical memory to keep the job in memory.

Practical commentary:
Yes, the problem is interesting and can be solved, but I never observed
it except on our large number crunching machines, i.e. the Convex and Alliant.
On those machines I scheduled the jobs by hand, after all, if the job is going
to take 10 days of CPU time, you can just adjust the jobs so one runs over
night and the other one runs during the day.  So, I conclude that the problem
isn't that interesting for the vast majority of UNIX systems, i.e. all
our Suns and Vaxen didn't have the problem.

Rich Hammond

slackey@bbn.com (Stan Lackey) (08/25/89)

In article <5967@pt.cs.cmu.edu> ram@wb1.cs.cmu.edu (Rob MacLachlan) writes:

>Many classic VM algorithms (such as Working Set) do consider processes and
>degree of thrashing, etc., but these algorithms seem to be little used in
>recently designed o.s.'es.  [Since this is comp.arch, I will note in passing
>that this may be partly due to brain-dead VM hardware, such as the VAX, which
>doesn't even have a reference bit.]

Instead of examining a reference bit, an OS could check to see if the
page table entry is in the translation cache; this can be better than
a simple reference bit, which only tells you that a page has been touched
at some point.  If a translation is in the cache, it indicates that the page
is likely being referenced often.
-Stan

leichter@CS.YALE.EDU (Jerry Leichter) (08/26/89)

In article <5967@pt.cs.cmu.edu>, ram@wb1.cs.cmu.edu (Rob MacLachlan) writes...
>...Many classic VM algorithms (such as Working Set) do consider processes and
>degree of thrashing, etc., but these algorithms seem to be little used in
>recently designed o.s.'es.  [Since this is comp.arch, I will note in passing
>that this may be partly due to brain-dead VM hardware, such as the VAX, which
>doesn't even have a reference bit.]...

VMS and the VAX architecture were designed at the same time by a group of
people who worked together.  VMS has implemented a Working Set algorithm
since it first shipped.

I've heard (from reasonably reliable sources) that the hardware people were
ready to include a reference bit, but the software people's work showed they
didn't need it, given appropriate algorithms.  The fact that early Unix
implementations for the VAX didn't handle it correctly is a problem with the
developers of those early versions, not with the VAX architecture.

I have never understood this need people have to use the VAX as the standard
bad example of EVERYTHING.  In fact, the designers of the VAX worked in ways
very similar to the way good RISC designers work today:  Hardware and software
people cooperated, and there was an attempt to balance the complexity each
had to deal with.  It may seem hard to believe, almost 15 years later, that
the compiler writers heavily influenced the design of the ISA, asking for
things like 3-address arithmetic instructions in addition to 2-address and
indexed addressing modes, but given the compiler technology of the day, this
is what they thought would be useful to them.  Compiler technology has changed
greatly since then, as has hardware technology.

The main difference between the VAX design team and current design teams is
probably in the inclusion of the OS designers.  The process structure of the
VAX, the support in hardware for AST's, the security structure - all of these
are direct reflections of the OS being developed.  These days, OS design is
rapidly becoming a lost art - Unix is the answer, what was the question?  So
there is little room for interplay any more between OS designers and hardware
designers - the hardware designers provide what the well-understood Unix
kernel needs and are finished.  (To a certain extent, language design is
approaching the same point:  Support C and figure you are done.  The VAX was
viewed from the start as a multi-lingual machine, and its ISA contains
elements that the COBOL developers found important, as well as elements the
FORTRAN developers wanted.  Most of the early RISC papers used C examples,
exclusively.  More recently, we've seen "Smalltalk RISC's" and "LISP RISC's".
I'm certainly not up on all the literature, but I know of no RISC design,
paper or otherwise, specifically intended to simultaneously support two quite
different languages efficiently.  As a language designer, I find it very
distressing to think that the machines of the future may be "refined" to the
point where doing anything but Unix and C will require extreme effort.  Take a
look at the history of LISP on CDC machines to see the possible results.)

							-- Jerry

bruce@blender.UUCP (Bruce Thompson) (08/26/89)

In article <9aid02rf4dNn01@amdahl.uts.amdahl.com>, sbf10@uts.amdahl.com (Samuel Fuller) writes:
> 
> In article <26642@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
> >
> >The original posting was asking if there was an OS smart enough
> >to run the jobs in sequence for him.
	/* much deleted */

> The dreaded JCL requires that you tell the system how long your job
> will run, how much memory it will need, and any resources like tape
> drives that you will be using.  Your job is then classified based on
> these parameters.
	/* more deleted */

    I think that what has just been demonstrated is that memory and
processor time are resources which must be managed by the OS on the same
level as it manages tape units, disk drives, etc. etc. I think that it 
would be fairly widely agreed that generally it is not known either how
much memory will be required, nor how long a program will run. While in
some cases there may be advanced knowledge of the characteristics of a
program this will not be the case in general.

    I think perhaps the issue is whether or not there is a general method
which can be used to determine, on an `on demand' basis, how to schedule
memory and the processor(s).  This is essentially a summary of what the
entire area of operating system development has been about.  As I recall,
one of the reasons operating systems have tended to move away from
pre-identification of resource requirements is that the requirements were
simply not known accurately.

    In the particular case of processor time requirements, one of the
more successful strategies for processor scheduling involves multiple
priority queues. Perhaps something along these lines can be developed for
managing memory resources as well.
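
To make the multiple-queue idea concrete, here is a minimal sketch in C
(the queue count, names, and the re-queueing rule are invented for
illustration; no particular scheduler is being quoted):

    #include <stddef.h>

    #define NQUEUES 4

    struct proc {
        struct proc *next;
        int pid;
    };

    static struct proc *head[NQUEUES], *tail[NQUEUES];  /* 0 = highest priority */

    static void enqueue(int q, struct proc *p)
    {
        p->next = NULL;
        if (tail[q]) tail[q]->next = p; else head[q] = p;
        tail[q] = p;
    }

    static struct proc *dequeue(int q)
    {
        struct proc *p = head[q];
        if (p) {
            head[q] = p->next;
            if (head[q] == NULL) tail[q] = NULL;
        }
        return p;
    }

    /* Dispatcher: always run the highest-priority non-empty queue. */
    struct proc *pick_next(void)
    {
        int q;
        for (q = 0; q < NQUEUES; q++)
            if (head[q])
                return dequeue(q);
        return NULL;                    /* nothing runnable: idle */
    }

    /* A job that burns its whole quantum drifts toward the lower queues
     * (which would get longer quanta); an interactive job that blocks
     * early would simply be re-queued where it was. */
    void quantum_expired(struct proc *p, int q)
    {
        enqueue(q + 1 < NQUEUES ? q + 1 : q, p);
    }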

    Certainly, it occurs to me that there is no simple solution.

	Bruce Thompson

henry@utzoo.uucp (Henry Spencer) (08/27/89)

In article <70663@yale-celray.yale.UUCP> leichter@CS.YALE.EDU (Jerry Leichter) writes:
>I've heard (from reasonably reliable sources) that the hardware people were
>ready to include a reference bit, but the software people's work showed they
>didn't need it, given appropriate algorithms...

That *they* didn't need it, note, not that other software developers might
not need it.  (Of course, as we all know, there was only going to be *one*
operating system for the VAX...)

>The main difference between the VAX design team and current design teams is
>probably in the inclusion of the OS designers.  The process structure of the
>VAX, the support in hardware for AST's, the security structure - all of these
>are direct reflections of the OS being developed...

The same is true of the architectural tweaks on many modern machines.  For
example, according to John Mashey, the MIPS machines make a considerable
effort to do precise exceptions because the OS people got quite upset about
the alternative.  The difference is that modern OS people don't see any need
to reinvent the square wheel, so the issue is how to *support* the system,
not what it should look like.

>...To a certain extent, language design is
>approaching the same point:  Support C and figure you are done.  The VAX was
>viewed from the start as a multi-lingual machine, and its ISA contains
>elements that the COBOL developers found important, as well as elements the
>FORTRAN developers wanted.  Most of the early RISC papers used C examples,
>exclusively.  More recently, we've seen "Smalltalk RISC's" and "LISP RISC's".
>I'm certainly not up on all the literature, but I know of no RISC design,
>paper or otherwise, specifically intended to simultaneously support two quite
>different languages efficiently...

Might one ask how many RISC designs you are acquainted with?  The SPARC,
the MIPS architecture, and the HP Precision Architecture, to name three,
were all designed with considerable attention to multi-language support.
The fact is, supporting C well is very nearly enough to support the other
"conventional" languages well.  (Note, "very nearly" -- some attention to
the fine points is still desirable.)  Compiler designers today understand
that it may be better to convert COBOL numbers to binary for arithmetic
than to mess up the hardware with decimal instructions, for example.
COBOL programs are seldom arithmetic-bound anyway.

>distressing to think that the machines of the future may be "refined" to the
>point where doing anything but Unix and C will require extreme effort.  Take a
>look at the history of LISP on CDC machines to see the possible results.)

Take a look at the history of Lisp on Lisp machines, whose time has come
*and gone* -- those awful "C-only" RISC machines run Lisp faster than the
custom-cooked Lisp machines do.
-- 
V7 /bin/mail source: 554 lines.|     Henry Spencer at U of Toronto Zoology
1989 X.400 specs: 2200+ pages. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

GPWRDCS@gp.govt.nz (Don Stokes, GPO) (08/28/89)

In article <261500008@S34.Prime.COM>, BEAR@S34.Prime.COM writes:
>     I believe what you're looking for is a working set scheduler. In a
> nutshell, this type of scheduler pages a job against itself rather than
> against the system (i.e. the pool of available pages belongs to the process
> (rather than the system) and is of a size set by the system when then process
> is created). Unfortunately, I can't think of an available OS that uses a WS
> scheduler off the top of my head (that doesn't mean they don't exist!). Good
> luck.

I must be lucky then.  VMS has had one since day one.  Working set 
parameters can be set from authorization information, as parameters 
to the $CREPRC system service (which creates a process), or explicitly 
during process execution (SET WORKING_SET command, or $ADJWSL system 
service).  There are three parameters - default working set, working set 
quota (the amount of WS that a process can "expect" to have when running) and 
working set extent (the maximum WS that a process can get if there is free 
memory on the system).  There are current and authorized values for these 
parameters.  I tend to run processes with lowish WSdefault and WSquota, 
and a high WSextent - this means that processes that need memory can get it 
if it is available, but don't hog memory.

Don Stokes, Systems Programmer    /  /   Domain:                  don@gp.govt.nz
Government Printing Office,      /GP/   PSImail:          PSI%0530147000028::DON
Wellington, New Zealand         /  /   Bang:    ...!uunet!vuwcomp!windy!gpwd!don
--------------------------------------------------------------------------------
A shortcut is the longest distance between two points. 

guy@auspex.auspex.com (Guy Harris) (08/29/89)

>I must be lucky then.  VMS has had one since day one.  Working set 
>parameters can be set by either authorization information, as parameters 
>to the $CREPRC system service (which creates a process) or explicitly 
>during process execution (SET WORKING_SET command, or $ADJWSL system 
>service).

OK, so which of the OSes that support working set scheduling do so
without having to be told by some external agent what the working set of
a process at some given time is?  (Do the VMS calls even tell the OS
that, or do they just tell it how big the working set is?)

Jerry Leichter claimed that the VMS software people showed that a
reference bit is not necessary, given the appropriate algorithms; does
VMS now have appropriate algorithms to determine the contents of the
working set without requiring a reference bit, or did the designers
decide that the appropriate algorithm is to believe what the user or the
application tells you and not try to figure it out for yourself?  (Not a
rhetorical question - I'm willing to accept that the latter is the
appropriate algorithm, given sufficient evidence to demonstrate that
claim.)

dwc@cbnewsh.ATT.COM (Malaclypse the Elder) (08/29/89)

i've been doing work on demand paging for unix system v for quite
some time and one of the more interesting papers i came across was
a paper by some dec people on a "segmented fifo" approach to local
page replacement.  i don't know if it is what they use in vms but
the gist of the approach is that fifo is not too bad since a replacement
page can be found very quickly.  it has been labeled "bad" by academia
because of belady's "fifo anomaly" which showed that with fifo, there
exist "some reference strings" in which having more memory will lead
to more faults than with smaller allocations.

now this is only my opinion, but i believe that it does not pay to
worry about such situations as long as, in general, giving more memory
leads to less page faults.  and fifo does exhibit this trend.  however,
the other "problem" is that based on some simulations with some reference
strings, fifo seems not to do as well as lru.  the "segmented fifo" attempts
to approximate lru performance by setting the page table entries for a
"small number" of pages at the head of the fifo to fault on reference.
this fault results in the page moving to the back of the queue (ie no
longer a candidate for stealing) and thus approximates lru.
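
A rough sketch of that segmented-fifo idea, in C (every structure, name,
and constant below is invented for illustration; this is not the DEC
paper's code, nor anything shipping):

    #include <stddef.h>

    #define PROTECTED_HEAD 8    /* pages at the head fault on reference */

    struct page { struct page *next; };

    static struct page *fifo;   /* head = oldest = next replacement victim */

    /* Placeholder for the hardware hook: a real pager would set or clear
     * the PTE valid bit so that touching the page traps (or not). */
    static void set_fault_on_reference(struct page *p, int on)
    {
        (void)p; (void)on;
    }

    static void append(struct page *p)
    {
        struct page **pp = &fifo;
        while (*pp) pp = &(*pp)->next;
        p->next = NULL;
        *pp = p;
    }

    static void unlink_page(struct page *p)
    {
        struct page **pp = &fifo;
        while (*pp && *pp != p) pp = &(*pp)->next;
        if (*pp) *pp = p->next;
    }

    /* Re-mark the first few pages so that referencing them traps. */
    static void reprotect_head(void)
    {
        int i = 0;
        struct page *p;
        for (p = fifo; p != NULL; p = p->next)
            set_fault_on_reference(p, i++ < PROTECTED_HEAD);
    }

    /* Reference ("soft") fault on a protected page: the page was touched,
     * so move it to the back of the queue - no longer a stealing candidate. */
    void reference_fault(struct page *p)
    {
        unlink_page(p);
        append(p);
        reprotect_head();
    }

    /* Steal the oldest page when a free frame is needed. */
    struct page *steal_page(void)
    {
        struct page *victim = fifo;
        if (victim != NULL) {
            fifo = victim->next;
            reprotect_head();
        }
        return victim;
    }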

anyway, in my studies, i have also come to the conclusion that a reference
bit may not be that desirable anyway since you don't necessarily want to do
a linear scan of all the page table entries of either a process (local
stealing) or the system (global stealing) in order to find a page to steal.
and there are people here who were looking at working set schedulers
for system v that attempt to address the issues presented by the original
poster.  but we have to re-engineer those ideas for sun's VM architecture
which makes its system v debut in release 4 and it remains to be seen if
those ideas are compatible with the VM architecture.

of course, this is only my view of things and i can't speak for any
of the people who make the actual decisions about what gets in.

danny chen
att!hocus!dwc

leichter@CS.YALE.EDU (Jerry Leichter) (08/29/89)

In article <2389@auspex.auspex.com>, guy@auspex.auspex.com (Guy Harris) writes...
>>I must be lucky then.  VMS has had one since day one.  Working set 
>>parameters can be set by either authorization information, as parameters 
>>to the $CREPRC system service (which creates a process) or explicitly 
>>during process execution (SET WORKING_SET command, or $ADJWSL system 
>>service).
> 
>OK, so which of the OSes that support working set scheduling do so
>without having to be told by some external agent what the working set of
>a process at some given time is?  (Do the VMS calls even tell the OS
>that, or do they just tell it how big the working set is?)

They only specify the size.  A process in VMS comes equipped with three
significant numbers controlling working set allocation:  The limit, quota,
and extent.

The working set limit is essentially the "zero point" for working set
adjustment:  when an image exits, the process's working set is set back to
its working set limit.  (Recall that VMS doesn't create a new process per
executed image; rather, it re-uses the existing context.  This is done by
having two threads of control within each process.  User code runs in the
user-level thread; DCL, the equivalent of the Shell, runs in the
supervisor-level thread.  (Actually, there are two more threads, executive
and kernel, but they aren't at issue here.)  When user-level code exits,
all user-level structures are discarded, all user-level channels are
closed, and all pages of virtual memory owned by user mode are freed.  DCL
then goes on to the next command.  In overall effect, this is like the
Shell's use of fork/exec except with light-weight threads (shared memory
space).  It's not EXACTLY like light-weight threads as they are normally
pictured, because the hardware protects the supervisor-mode thread from
the user-mode thread.)

Anyhow, returning to working sets:  The second parameter, the working set
quota, is the amount of VM that the process is guaranteed to have available
to it.  On a busy system, it is the upper bound on the process's working set
size.

If the system has plenty of spare memory, a process is allowed to exceed its
working set quota.  The third parameter, working set extent, is a hard limit
on the size of a process's working set.  When a process with a working set
size below its quota needs pages, and none are available, the system will
grab them back from processes that have "borrowed" pages and are over quota.
(If there are no "borrowed" pages to be found, the system takes a number of
other actions; ultimately, it swaps entire processes out, making all of their
pages available to satisfy the demand for "guaranteed" pages.)

The net effect is that when the system is short of memory, processes page
against themselves, while when the system has memory to spare, they page
against the system-wide pool of pages, but are still subject to an upper
limit on size.
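
A minimal sketch of how those three numbers might interact, in C (the
names, the threshold, and the helper routine are invented; this is not VMS
source, just the policy as described above):

    struct ws {
        long limit;   /* "zero point": reset to this at image exit */
        long quota;   /* guaranteed size on a busy system */
        long extent;  /* hard upper bound when memory is plentiful */
        long size;    /* current working set size, in pages */
    };

    /* Placeholder system state; a real kernel tracks this precisely. */
    static long free_pages = 5000;
    #define BORROW_THRESHOLD 1000   /* free pages needed before "borrowing" */

    static void reclaim_from_borrowers(void)
    {
        /* Here the system would shrink processes that are over quota,
         * returning their "borrowed" pages to the free pool. */
        free_pages += 100;
    }

    /* May this process grow its working set by one page right now? */
    int may_grow(struct ws *w)
    {
        if (w->size < w->quota) {
            /* Below quota: the page is guaranteed.  If none are free,
             * take them back from processes that are over quota. */
            if (free_pages == 0)
                reclaim_from_borrowers();
            return 1;
        }
        /* Above quota: borrow only while memory is plentiful, and never
         * beyond the hard extent. */
        return w->size < w->extent && free_pages > BORROW_THRESHOLD;
    }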

There are various parameters that control things like how many pages the
system must have free before it will allow processes to exceed their working
set quotas.  In fact, there are probably 10 or 20 user-settable parameters
to let you tune the performance of the system.  Some can be changed on a
running system; others require a re-boot.  Setting them is something of a
black art; fortunately, standard VMS comes with programs to set them
automatically, based on actual workload.

>Jerry Leichter claimed that the VMS software people showed that a
>reference bit is not necessary, given the appropriate algorithms; does
>VMS now have appropriate algorithms to determine the contents of the
>working set without requiring a reference bit, or did the designers
>decide that the appropriate algorithm is to believe what the user or the
>application tells you and not try to figure it out for yourself?  (Not a
>rhetorical question - I'm willing to accept that the latter is the
>appropriate algorithm, given sufficient evidence to demonstrate that
>claim.)

The basic algorithm is simple:  Pages lost from a process working set go to
the end of a free list.  If a fault occurs for a page on the free list, it
can be given back to the process; its contents will be unchanged.

The lists are not actually searched, since the structures defining virtual
memory contain sufficient information to find a page still on the free list
with little effort.  Such "soft faults" are quite fast - not as fast as not
having incurred the fault at all, of course, but perhaps a few tens of
instruction times.

There's more to this, of course.  For example, a "dirty" page can't go to the
free list; instead, it goes to a dirty list.  When the dirty list gets long
enough, the modified page writer writes it out and eventually moves the page
to the free list.  The page remains eligible for return to the process via a
"soft" fault until it is actually given to some other process.  Under normal
conditions, pages actually within the working set of a process are almost
certain to be found on a free list even if grabbed.
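
The fault path just described might be sketched like this in C (all of the
structures and helper names are invented; this is only an illustration of
the free-list/modified-list idea, not VMS code):

    enum where { IN_WORKING_SET, ON_FREE_LIST, ON_MODIFIED_LIST, ON_DISK };

    struct page {
        enum where where;
        int dirty;
    };

    /* Placeholders for machinery described in the text: the VM structures
     * let the pager find a page on the secondary lists "with little effort". */
    static void unlink_from_secondary_list(struct page *p) { (void)p; }
    static void add_to_working_set(struct page *p)         { (void)p; }
    static void read_from_backing_store(struct page *p)    { (void)p; }

    void page_fault(struct page *p)
    {
        switch (p->where) {
        case ON_FREE_LIST:
        case ON_MODIFIED_LIST:
            /* "Soft" fault: the contents are still in memory, so just
             * hand the frame back - a few tens of instructions, no I/O. */
            unlink_from_secondary_list(p);
            add_to_working_set(p);
            break;
        case ON_DISK:
            /* "Hard" fault: real I/O is needed. */
            read_from_backing_store(p);
            add_to_working_set(p);
            break;
        case IN_WORKING_SET:
            break;      /* already resident; nothing to do */
        }
        p->where = IN_WORKING_SET;
    }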

There are all sorts of additional twists.  For example, some pages can be
locked in the working set.  Any program can do this, though it's rare that
user-mode programs do.  Rather, things like pages containing page tables get
locked in the working set.  (Some pages HAVE to be locked in the working set
to avoid deadlocks in the pager.)  Also, when VMS scans for pages to "steal"
from a process, it keeps enough context to try to avoid recently-faulted-in
pages.  (I don't recall the details.)

Does this all work?  Well, both VMS and (these days) Unix seem to support
virtual memory on VAXes with a great deal of success.  I haven't seen anyone
claim to have an architecture of roughly equivalent CPU power/memory and I/O
bandwidth/etc. but WITH page reference bits which supports more processes/
does measurably better at VM management/etc.  The VAX architecture has been
pretty heavily studied - there are a number of published papers - and none
of them that I have read has found a bottleneck in the page replacement code.

							-- Jerry

aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (08/29/89)

>OK, so which of the OSes that support working set scheduling do so
>without having to be told by some external agent what the working set of
>a process at some given time is?  (Do the VMS calls even tell the OS
>that, or do they just tell it how big the working set is?)

It may have gotten lost: Gould NP1 UNIX.
The NP1 hardware supports reference bits.


>Jerry Leichter claimed that the VMS software people showed that a
>reference bit is not necessary, given the appropriate algorithms; does
>VMS now have appropriate algorithms to determine the contents of the
>working set without requiring a reference bit, or did the designers
>decide that the appropriate algorithm is to believe what the user or the
>application tells you and not try to figure it out for yourself?  (Not a
>rhetorical question - I'm willing to accept that the latter is the
>appropriate algorithm, given sufficient evidence to demonstrate that
>claim.)

Simulate reference bits by making the page non-RW, and faulting when you
try to read - voila, a reference, and then you just make the page readable.
Similarly, simulate modify bits by making the page non-W...
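
A minimal sketch of that trick in C (the PTE layout and field names are
invented; a real port keeps these bits wherever its page tables allow):

    /* Software-simulated reference and modify bits. */
    struct pte {
        unsigned valid    : 1;   /* hardware valid bit */
        unsigned writable : 1;   /* hardware write-enable bit */
        unsigned sw_ref   : 1;   /* software "referenced" bit */
        unsigned sw_mod   : 1;   /* software "modified" bit */
        unsigned frame    : 20;
    };

    /* Called from the trap handler when an access hits a page whose
     * hardware permissions were deliberately turned off. */
    void simulated_bit_fault(struct pte *pte, int was_write)
    {
        if (!pte->valid) {
            pte->sw_ref = 1;     /* a reference happened */
            pte->valid = 1;      /* let further reads proceed untrapped */
        }
        if (was_write && !pte->writable) {
            pte->sw_mod = 1;     /* a modification happened */
            pte->writable = 1;   /* let further writes proceed untrapped */
        }
        /* The pageout code later clears valid/writable again whenever it
         * wants fresh usage information, and reads sw_ref/sw_mod instead
         * of hardware bits. */
    }
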
    I posted a reference to a SPUR paper that discusses the HW/SW tradeoffs
involved, yesterday.
--
Andy "Krazy" Glew,  Motorola MCD,    	    	    aglew@urbana.mcd.mot.com
1101 E. University, Urbana, IL 61801, USA.          {uunet!,}uiucuxc!udc!aglew
   
My opinions are my own; I indicate my company only so that the reader
may account for any possible bias I may have towards our products.

dricejb@drilex.UUCP (Craig Jackson drilex1) (08/29/89)

In article <2389@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
>>I must be lucky then.  VMS has had one since day one.  Working set 
>>parameters can be set by either authorization information, as parameters 
>>to the $CREPRC system service (which creates a process) or explicitly 
>>during process execution (SET WORKING_SET command, or $ADJWSL system 
>>service).
>
>OK, so which of the OSes that support working set scheduling do so
>without having to be told by some external agent what the working set of
>a process at some given time is?  (Do the VMS calls even tell the OS
>that, or do they just tell it how big the working set is?)

The older operating system, called MCP, on the Unisys A-Series did
classic working set maintenance, without either a referenced bit
or a modified bit.  It did use some form of a clock algorithm to
figure out what to throw out, but maintained working sets to determine
how much of each task's memory to throw out.  The working set calculations
were based on things like the number of segments faulted in during a given
period, and their size.

Interestingly, the operating system for their latest architecture, called
MCP/AS, punts all of the working set stuff, replacing it with a simple
clock algorithm with two hands.  There is only one memory pool, consisting
of the entire machine.  This allows a low-priority user task to suck up
lots of memory, until both it and high-priority tasks begin to page.  The
low-priority task doesn't mind, but everybody else suffers.  (Much of the
code that is the equivalent of the Unix kernel is pageable on this system.)
The vendor response was to buy more memory; we're up to 144MB and still
hurting occasionally.  Their next response was "Don't run big-memory jobs".

This issue points up one other thing: this operating system runs fine as
a transaction-processing monitor, where there are relatively few tasks
to be run, and each task performs a well-defined function.  This is the
direction that mainframes are going.  In practice, most mainframes have
already conceded the time sharing realm to other processors (Unix, PCs, etc.).
The main refuge of mainframes today is serious bean-counting, and servicing
the "operators standing by to take your order".

(This gratuitous comment brought to you through fear of the
inews line-counter.)

>Jerry Leichter claimed that the VMS software people showed that a
>reference bit is not necessary, given the appropriate algorithms; does
>VMS now have appropriate algorithms to determine the contents of the
>working set without requiring a reference bit, or did the designers
>decide that the appropriate algorithm is to believe what the user or the
>application tells you and not try to figure it out for yourself?  (Not a
>rhetorical question - I'm willing to accept that the latter is the
>appropriate algorithm, given sufficient evidence to demonstrate that
>claim.)

I think that the VMS guys thought that the ability to recover pages off
the free list gave them many of the features of reference bits, without
either the memory traffic to update them or the software effort to read
them.

P.S. The largest processors of the Unisys A-Series still don't have
modified bits.  The designers decided that the memory traffic would be
too high.  However, code and read-only arrays (declared at compile time)
have been marked specially for years, and need not be paged out.
-- 
Craig Jackson
dricejb@drilex.dri.mgh.com
{bbn,ll-xn,axiom,redsox,atexnet,ka3ovk}!drilex!{dricej,dricejb}

mash@mips.COM (John Mashey) (08/30/89)

In article <4176@drilex.UUCP> dricejb@drilex.UUCP (Craig Jackson drilex1) writes:
......
>I think that the VMS guys thought that the ability to recover pages off
>the free list gave them many of the features of reference bits, without
>either the memory traffic to update them or the software effort to read
>them.
>
>P.S. The largest processors of the Unisys A-Series still don't have
>modified bits.  The designers decided that the memory traffic would be
>too high.  However, code and read-only arrays (declared at compile time)
>have been marked specially for years, and need not be paged out.

Say some more.  I'd assume there's a trap-on-write bit or equivalent
somewhere, but no modified bit set by hardware?

Note that on a paged machine running UNIX, with typically-sized pages
(i.e., fairly different from A-series, I think), hardware-set modified
bits are unlikely to consume much bandwidth, unless done by blindly
writing a page table entry upon every write (as opposed to state
changes). This is why it is easy to get away with trapping writes to
set modify bits instead of having the hardware do it.

In UNIX, the frequency of the state change:
	writable, but not yet modified -> modified
is:
	a) far less than references, TLBmisses, or similar actions.
	b) Occurs in the following cases:
		1) Write to a page that is demand-fill-zero (like a
			stack section).
		2) Read to a demand-fill-zero page, later followed by
			a write.
		3) Write to a demand-load page (.data).
		4) Read a demand-load page, later followed by a write.
		5) Write to a copy-on-write page inherited from fork,
			or any other reason to use copy-on-write.

Now, in cases 1) and 3), you know you were doing a write, so you can
mark the page modified in the first place, as soon as you've zeroed or
loaded it.  In case 2), you might as well do that also, as reasonable
code sequences are unlikely to leave a .bss or sbrk'd hunk of memory
clean for very long.  In case 5), everybody turns off the ability of
hardware to do a write without taking an interrupt, anyway, because the
OS has significant work to do.
That leaves 4), where you reference some .data, but go for a long time
without doing a write to the page.  (Note that this is more likely than
the same sequence for a DFZ page, since you might have large tables of
data that are effectively read-only, but could be modified.)
Item 4) is clearly bounded by the rate of demand-load pageins/second of
data (not code), which varies from system to system, but is usually
much less than the number of disk I/Os.  As an example, I took a quick
look at one of our M/2000s, which had 300 processes, and was doing 500
context-switches a second (I have no idea what it was doing). A bound on the
number for 4) is the pgfil number from vsar (plus some of the swpin).
These numbers were usually = 0 (this is a big machine), but I did see
them get as high as 5/second after watching for 5 minutes.  Thus,
hardware-set modify bits would save you about 5 exceptions/second.
On a smaller machine, the number might be higher, of course.
The machine was doing about 1000 syscalls/sec, for context.

BOTTOM LINE: either you want to trap a write and do something complicated
anyway, in which case you turn off any hardware's ability to set a
modified bit, or you're saving at most one exception per page of
paged/swapped-in data.
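
A sketch of that case analysis in C, assuming software-maintained dirty
bits (the names and the fill-type enumeration are invented; this is an
illustration of the argument, not MIPS or UNIX source):

    enum fill { FILL_ZERO, FILL_FROM_FILE, FILL_COPY_ON_WRITE };

    struct pte {
        unsigned writable : 1;   /* hardware write-enable */
        unsigned sw_mod   : 1;   /* software "modified" bit */
    };

    /* Placeholders for the actual page-fill work. */
    static void zero_fill(struct pte *p)  { (void)p; /* clear the new frame       */ }
    static void read_in(struct pte *p)    { (void)p; /* read from the object file */ }
    static void copy_frame(struct pte *p) { (void)p; /* copy the shared frame     */ }

    void fill_page(struct pte *pte, enum fill how, int fault_was_write)
    {
        switch (how) {
        case FILL_ZERO:          zero_fill(pte);  break;  /* cases 1 and 2 */
        case FILL_FROM_FILE:     read_in(pte);    break;  /* cases 3 and 4 */
        case FILL_COPY_ON_WRITE: copy_frame(pte); break;  /* case 5        */
        }

        if (fault_was_write || how == FILL_ZERO || how == FILL_COPY_ON_WRITE) {
            /* Cases 1, 2, 3 and 5: the page is (or will almost surely be)
             * dirty, so record that now - no later exception needed. */
            pte->sw_mod = 1;
            pte->writable = 1;
        } else {
            /* Case 4: a read of a demand-loaded data page.  Leave it
             * write-protected; a later write traps once and sets sw_mod. */
            pte->writable = 0;
        }
    }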

I can't dig up a reference, but I'd swear I've seen at least once,
a CPU which had at least one bug in setting modify bits somewhere,
i.e., the hardware to do this is not always trivial....
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

barry@PRC.Unisys.COM (Barry Traylor) (08/30/89)

In article <4176@drilex.UUCP> dricejb@drilex.UUCP (Craig Jackson drilex1) writes:
>The older operating system, called MCP, on the Unisys A-Series did
>classic working set maintenance, without either a referenced bit
>or a modified bit.  It did use some form of a clock algorithm to
>figure out what to throw out, but maintained working sets to determine
>how much of each task's memory to throw out.  The working set calculations
>were based on things like the number of segments faulted-in in a given
>period and their size.

While I'm glad you like the older algorithms, you fail to mention that
workingsets were maintained by stepping linearly through memory and forcing 
data and code segments out to disk.  While stats were kept for each
process's memory usage, no intelligence was applied to the selection of
segments to overlay.  You also failed to note that on split code/data machines
(B7900s and ASN A15s), code was not subjected to the workingset mechanism at
all.  At any rate, to come up with those nifty statistics, the entire
memory space of the process was forced out of memory.

>Interestingly, the operating system for their latest architecture, called
>MCP/AS, punts all of the working set stuff, replacing it with a simple
>clock algorithm with two hands.  There is only one memory pool, consisting
>of the entire machine.  This allows a low-priority user task to suck up
>lots of memory, until both it and high-priority tasks begin to page.  The
>low-priority task doesn't mind, but everybody else suffers.  (Much of the
>code that is the equivalent of the Unix kernel is pageable on this system.)
>The vendor response was to buy more memory; we're up to 144MB and still
>hurting occasionally.  Their next response was "Don't run big-memory jobs".

First, I was solely responsible for the defective MCP/AS workingset mechanism, 
which was intended to be fool-proof.  Unfortunately it spent most of its time 
fooling itself, alternating between doing too much and doing too little.  The 
mechanism has been largely reimplemented (by me again, so send complaints to 
the same address) on the 38 release and is much more predictable and
reliably tunable.  You failed to mention that the MCP/AS mechanism is based 
on an LRU concept, and had it worked as I had expected (which, of course, you 
could not have known), it would have been much better in terms of overall 
throughput than the ASN (old) mechanism.  My experimentation with the 38 
version shows that it truly controls workingsets while minimally consuming 
system resources.  Your mileage will, of course, vary.  Also, I'm not sure,
but I think that SUSPENDER can still be run.  SUSPENDER forces low priority
tasks to stop executing if thrashing is detected (I think).

>This issue points up one other thing: this operating system runs fine as
>a transaction-processing monitor, where there are relatively few tasks
>to be run, and each task performs a well-defined function.  

The MCP/AS release 38 workingset mechanism works well for any task that hangs
around for more than a couple of minutes.  Workingset control is largely only 
an issue on long running tasks that, over the course of the run, consume large  
quantities of memory, but at any given period of time use a relatively minor 
subset of that total.  The workingset mechanism will remove from memory those 
segments that are not indicating any use WITHOUT FORCING THEM OUT TO DISK.
Those segments in use will remain in memory unless DEMAND overlay activity 
kicks them out.  Of course for the MCP/AS workingset mechanism, there are a 
couple of tunable parameters and the whole thing can be turned off, if desired.

The fundamental philosophy behind the mechanism is that memory allocations
are faster if there is available memory, and processes run just as well
accessing the current workingset with just the workingset in memory as they
do when all of the memory segments are resident.

I'm not sure what you mean by "few tasks".  Most of the time our in-house
A17 shows more than 200 active tasks, not all of which are well behaved.

>P.S. The largest processors of the Unisys A-Series still don't have
>modified bits.  The designers decided that the memory traffic would be
>too high.  However, code and read-only arrays (declared at compile time)
>have been marked specially for years, and need not be paged out.

The modified bit seemed like a good idea at the time.  We have never been
able to gather statistics about the effectiveness of it.  Part of the
problem is that the data must be written out to disk at least once and once
written, all that's saved are subsequent writes IF the data is not modified.
In the meantime, the processor flows are fouled up on every memory write
operation, since the modified bit must be checked.  On a highly concurrent
processor such as is used with the A15/A17, the extra state required to be
kept with every segment reference seemed too high a price to pay for an
unknown (and still unproven) return.

Barry Traylor (Yup, that's right, my opinions are my own, NOT UNISYS's)
Unisys A Series Engineering (Tredyffrin)
(barry@prc.unisys.com)

plogan@mentor.com (Patrick Logan) (08/30/89)

In article <1989Aug26.232710.27174@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <70663@yale-celray.yale.UUCP> leichter@CS.YALE.EDU (Jerry Leichter) writes:
>>distressing to think that the machines of the future may be "refined" to the
>>point where doing anything but Unix and C will require extreme effort.  Take a
>>look at the history of LISP on CDC machines to see the possible results.)
>
>Take a look at the history of Lisp on Lisp machines, whose time has come
>*and gone* -- those awful "C-only" RISC machines run Lisp faster than the
>custom-cooked Lisp machines do.
>-- 
>V7 /bin/mail source: 554 lines.|     Henry Spencer at U of Toronto Zoology
>1989 X.400 specs: 2200+ pages. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

Saying that C machines run Lisp faster than Lisp machines is
simplistic. To run Lisp faster on a C machine requires more
declarations in the code (or, as with T, more use of "type-specific"
procedures such as fx+ instead of generic +). This goes against the
grain of Lisp programming.

Lisp machines provide a better environment (e.g. protection, ability
to write fast, generic code) per performance unit. I'd guess the only
reasons their time is gone are market based. They're sure to return in
one form or another as more and more C machine programs are written in
Lisp-like languages. Many C machine architectures (e.g. SPARC) already
incorporate some support for Lisp.
-- 
Patrick Logan                | ...!{decwrl,sequent,tessi}!mntgfx!plogan 
Mentor Graphics Corporation  | plogan@pdx.MENTOR.COM                    
Beaverton, Oregon            | "My other computer is a Lisp machine."

guy@auspex.auspex.com (Guy Harris) (08/31/89)

>Simulate reference bits by making the page non-RW, and faulting when you
>try to read - voila, a reference, and then you just make the page readable.

Yes, that's the trick used by [34].xBSD; I was wondering whether VMS had
some other scheme that used neither real nor simulated reference bits.

lehotsky@pmax7.osf.org (Alan Lehotsky) (08/31/89)

There is one other trick used by VMS (at least on the 780).  The original
11/780 did not have a translation-buffer "read-out".  So, it was
ECO'd to give access to the last 32 virtual-address translations.  This
gave them the most recently used pages, so that they could avoid a truly
random page-replacement strategy.

I don't know whether newer processors retained this feature - but I think
that it was an extremely clever trick!

Al Lehotsky
(former) Bliss Wizard

leichter@CS.YALE.EDU (Jerry Leichter) (08/31/89)

By an interesting coincidence, I received the following message - pointing to
a study behind the VAX and VMS paging architecture - just yesterday, quite
independently of this ongoing discussion:

	In the 1981 Journal of the ACM (0-89791-051-6 /81/0009/0048) Rollins
	Turner and Hank Levy published an article comparing the original
	segmented FIFO VMS mem. mgt. system to a pure Least Recently Used
	(LRU) methodology. In it, they show that the fault behavior of the VMS
	modified FIFO approaches arbitrarily close to that of pure LRU as the
	size of the secondary caches (free and modified lists) is increased.

Note that a JACM article would probably be pretty theoretical in nature.

							-- Jerry

dricejb@drilex.UUCP (Craig Jackson drilex1) (08/31/89)

In article <26504@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <4176@drilex.UUCP> dricejb@drilex.UUCP (Me) writes:
>......
>>P.S. The largest processors of the Unisys A-Series still don't have
>>modified bits.  The designers decided that the memory traffic would be
>>too high.  However, code and read-only arrays (declared at compile time)
>>have been marked specially for years, and need not be paged out.
>
>Say some more.  I'd assume there's a trap-on-write bit or equivalent
>somewhere, but no modified bit set by hardware?
>
>Note that on a paged machine running UNIX, with typically-sized pages
>(i.e., fairly different from A-series, I think), hardware-set modified
>bits are unlikely to consume much bandwidth, unless done by blindly
>writing a page table entry upon every write (as opposed to state
>changes). This is why it is easy to get away with trapping writes to
>set modify bits instead of having the hardware do it.
[ more analysis of Unix write patterns.]

The A-Series hardware essentially has a trap-on-write bit in each
segment descriptor.  Code segments are read-only by definition; data
segments have a bit in the descriptor.  These could be used by an
operating system to detect modified segments, and deal with them.
(At least the data segments could.)  However, the operating system
does not do this: Either the data has been marked read-only by the
compiler, or it is assumed to be read/write.  Read-only data and
code are assumed to be unchangeable throughout execution; they're
always retrieved from the object code file, rather than the page
file.

The reason why there could be quite a bit of traffic for modify
bits is that the A-Series is a segmented machine.  The average
segment size (about 600 bytes on our machine) is much smaller
than the typical Unix page.  Note that the smaller machines in
the A-series do have modified bits; I guess they decided the extra
I/O traffic would be more important than the memory traffic.

However, I still don't understand why they left off the bits
on the large boxes.  They're caching the writes; why couldn't they
cache the modified bit updates too?
-- 
Craig Jackson
dricejb@drilex.dri.mgh.com
{bbn,ll-xn,axiom,redsox,atexnet,ka3ovk}!drilex!{dricej,dricejb}

weaver@weitek.WEITEK.COM (08/31/89)

In article <2396@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
>>Simulate reference bits by making the page non-RW, and faulting when you
>>try to read - voila, a reference, and then you just make the page readable.
>
>Yes, that's the trick used by [34].xBSD; I was wondering whether VMS had
>some other scheme that used neither real nor simulated reference bits.

The 'modified FIFO' scheme used by VMS works as follows: each process at any 
given time has a fixed number of pages dedicated to it (the number is adjusted
periodically according to page fault statistics). These pages are divided into
three sets: the working set, the free list and the dirty list. All three are 
mapped by the process's page table, but only the first set is allowed to be 
used by the process. When a page is faulted out, it goes to the free list 
or the dirty list, depending on whether it was modified. Another page
is dropped off the free or dirty list, in FIFO fashion. Dirty pages must
of course be written to disk. 

When a page fault occurs, the page tables are checked to see if the page is 
in the free list or dirty list. If so, the page is taken off that list and 
put in the working set. This is called a 'soft' page fault. If not, we need
to get a page from disk, a 'hard' page fault. A page is chosen at random 
to be taken out of the working set, and put at the head of the free or 
dirty list.

If you consider only the traffic to and from disk, this looks a lot like LRU
(least recently used), because pages on free and dirty lists, while chosen at 
random, age in a FIFO before being discarded or written back. In place of the
scanning of the working set for reference bits, we have the soft page faults.
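
Here is the hard-fault half of that scheme as a C sketch (the lists, names,
and corner-case handling are invented for illustration; it is not VMS code,
and a real pager would of course write a dirty frame back before reusing it):

    #include <stddef.h>

    struct page { struct page *next; int dirty; };

    /* Secondary caches: pages evicted from the working set age here in
     * FIFO order before their frames are reused or written back. */
    static struct page *free_head, *free_tail;
    static struct page *dirty_head, *dirty_tail;

    static void push_head(struct page **head, struct page **tail, struct page *p)
    {
        p->next = *head;
        *head = p;
        if (*tail == NULL)
            *tail = p;
    }

    static struct page *pop_tail(struct page **head, struct page **tail)
    {
        struct page *p = *head, *prev = NULL;
        if (p == NULL)
            return NULL;
        while (p->next) { prev = p; p = p->next; }
        if (prev) prev->next = NULL; else *head = NULL;
        *tail = prev;
        return p;
    }

    /* Hard fault with a full working set: push the chosen victim onto the
     * head of the free or dirty list, then reuse the oldest cached frame
     * for the incoming page.  In steady state the lists are non-empty, so
     * the victim itself ages there for a while before its frame is reused. */
    struct page *hard_fault(struct page *victim)
    {
        struct page *frame;

        if (victim->dirty)
            push_head(&dirty_head, &dirty_tail, victim);
        else
            push_head(&free_head, &free_tail, victim);

        frame = pop_tail(&free_head, &free_tail);
        if (frame == NULL) {
            frame = pop_tail(&dirty_head, &dirty_tail);
            /* a real pager writes this frame to the paging file first */
        }
        return frame;   /* caller reads the faulting page into this frame */
    }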

Michael Weaver
Weitek

meissner@tiktok.rtp.dg.com (Michael Meissner) (08/31/89)

In article <2389@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:

>  OK, so which of the OSes that support working set scheduling do so
>  without having to be told by some external agent what the working set of
>  a process at some given time is?  (Do the VMS calls even tell the OS
>  that, or do they just tell it how big the working set is?)

AOS/VS and its big brother AOS/VS-II both support working set
scheduling.  The underlying hardware (the proprietary Eclipse/MV
series) supports reference/modified bits.  The native UNIX port, DG/UX,
does not use working sets for individual processes, but uses a global
page replacement scheme.

In AOS/VS (and AOS/VS-II), there are a bunch of ways to tune the
working set.  The system manager can specify the minimum and maximum
number of pages that can be resident at any one time, on a per-user
basis (there are separate minimums/maximums for interactive and batch
processes).  Also, you can indicate, in a program file or at process
creation, the minimum working set that the program needs (and, at
process creation, the maximum as well).

--
Michael Meissner, Data General.
Uucp:		...!mcnc!rti!xyzzy!meissner		If compiles were much
Internet:	meissner@dg-rtp.DG.COM			faster, when would we
Old Internet:	meissner%dg-rtp.DG.COM@relay.cs.net	have time for netnews?

mash@mips.COM (John Mashey) (08/31/89)

In article <1989Aug30.152155.9613@mentor.com> plogan@mentor.com (Patrick Logan) writes:
>In article <1989Aug26.232710.27174@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>>Take a look at the history of Lisp on Lisp machines, whose time has come
>>*and gone* -- those awful "C-only" RISC machines run Lisp faster than the
>>custom-cooked Lisp machines do.

>Saying that C machines run Lisp faster than Lisp machines is
>simplistic. To run Lisp faster on a C machine requires more
>declarations in the code (or, as with T, more use of "type-specific"
>procedures such as fx+ instead of generic +). This goes against the
>grain of Lisp programming.
>
>Lisp machines provide a better environment (e.g. protection, ability
>to write fast, generic code) per performance unit. I'd guess the only
>reasons their time is gone are market based. They're sure to return in
>one form or another as more and more C machine programs are written in
>Lisp-like languages. Many C machine architectures (e.g. SPARC) already
>incorporate some support for Lisp.

There are many good ideas in LISP, and more of them will filter over.
The issue (and it has been discussed before in comp.arch) is that
the economics of the computer business end up with the following:
	a) If something is a low-volume, special-purpose architecture,
	it must be MUCH better at what it does than the high-volume,
	cheap competition.  If it isn't, it will end up going away,
	because the competition gets to keep tracking the technology
	curve at speed, rather than getting behind it.  Also,
	if something has high volume in VLSI, it gets to be cheap.
	b) The current RISC chips are getting fast and cheap enough
	that this is happening.
	c) There is still some fine-tuning to be done, but most of
	it amounts to tweaks added to some of the existing
	architectures (tagged adds, for example, by themselves, do
	not necessarily solve the problem: consider the exception-
	handling sequences also needed).

Useful datapoints might be, now and in future:
	a) Symbolics, Inc. history, where it seems like they're
	getting out of the hardware-design business in favor of (fine) software.
	b) TI: Ivory chip versus SPARCs

CONJECTURE/PREDICTION: fast VLSI RISCs will end up capturing the high
end of the LISP market over the next few years.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

pcg@aber-cs.UUCP (Piercarlo Grandi) (09/03/89)

In article <3398@cbnewsh.ATT.COM> dwc@cbnewsh.ATT.COM (Malaclypse the
Elder) writes:

   i've been doing work on demand paging for unix system v for quite
   some time [ ... ] it has been labeled "bad" by academia because of
   belady's "fifo anomaly" which showed that with fifo, there exist
   "some reference strings" in which having more memory will lead to
   more faults than with smaller allocations.

   now this is only my opinion, but i believe that it does not pay to
   worry about such situations as long as, in general, giving more
   memory leads to less page faults.  and fifo does exhibit this
   trend.

The non-monotonicity of fifo is not terribly worrying, as long as you
can show that it does not occur frequently.

My own favorite page replacement policy is (local) page fault frequency, as
it automatically balances the load between memory and disc. The great
advantage of ws over pff is that ws performance is not very sensitive to its
parameters, so tuning is not difficult, while pff is not so lucky.  Pff, by
the way, is non-monotonic, but apparently this is not too bad.  There is a
paper comparing pff and ws in an oldish issue of an IEEE journal.

A very interesting, forgotten article by Dijkstra argues for what is
essentially pff.  The argument is compelling; ws minimizes the space-time
integral, and pff balances the load between disc and memory.  That is, ws
is optimal only if the configuration is optimally balanced for the job mix
running on it, because the lowest space-time integral gives the lowest
cost only in that case.  Example: assume that memory is scarce relative to
disc bandwidth; it may well pay to allow higher page fault rates, that is,
a non-optimal space-time integral, for the sake of exploiting the disc's
bandwidth.

Admittedly, given that knees are pronounced in many quantity vs. performance
curves, this is not expected to give much better performance; but the real
point is that pff is probably more flexible than ws as the job mix changes.
Moreover, paging device bandwidth is often *the* bottleneck, and pff is
driven by it directly.
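
For concreteness, a minimal two-threshold variant of pff in C (the
thresholds, time units, and helper routines are all invented; real pff
implementations differ in how they shrink the resident set):

    struct pinfo {
        unsigned long last_fault_time;  /* time of this process's last fault */
        long          resident_pages;   /* current resident-set allocation */
    };

    #define GROW_THRESHOLD    50   /* faulting more often than this: grow   */
    #define SHRINK_THRESHOLD 500   /* faulting less often than this: shrink */
    #define MIN_RESIDENT      16

    /* Placeholders for the real frame management. */
    static void release_one_resident_page(struct pinfo *p) { (void)p; }
    static void replace_within_allocation(struct pinfo *p) { (void)p; }

    /* Called on every page fault of the process, with a cheap clock value. */
    void pff_fault(struct pinfo *p, unsigned long now)
    {
        unsigned long interval = now - p->last_fault_time;
        p->last_fault_time = now;

        if (interval < GROW_THRESHOLD) {
            /* Faulting too often: grow the resident set - bring the new
             * page in without stealing one of the process's own frames. */
            p->resident_pages++;
        } else if (interval > SHRINK_THRESHOLD && p->resident_pages > MIN_RESIDENT) {
            /* Faulting rarely: give a frame back to the system and
             * replace within the (now smaller) allocation. */
            p->resident_pages--;
            release_one_resident_page(p);
            replace_within_allocation(p);
        } else {
            /* In balance: ordinary replacement within the allocation. */
            replace_within_allocation(p);
        }
    }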

Another interesting idea contained in Dijkstra's paper is that it would
be nice, actually very very nice, if the hw had ways to cheaply
calculate the page fault rate with one page more and one page less than
those allocated to the current job. This would improve working set
selection very much, especially with pff.

   however, the other "problem" is that based on some simulations with
   some reference strings, fifo seems not to do as well as lru.

Fifo is bad because usage frequency does not really feed back, of
course, and locality info is not exploited very well.

   anyway, in my studies, i have also come to the conclusion that a
   reference bit may not be that desirable anyway since you don't
   necessarily want to do a linear scan of all the page table entries
   of either a process (local stealing) or the system (global stealing)
   in order to find a page to steal.

There are shortcuts. A paper I remember on SIGOPS detailed the design
of the Univac 80 (IBM lookalike, ex RCA) ws pager. It tried hard to
minimize such overheads, and to provide an efficient proxy to the true
ws rule of time based victim selection. Seemed well done. Another
interesting (old as well) paper seems to be that describing the VM ws
pager done at Grenoble.

The current situation is that clock pagers in BSD and system V are not
nice.

The BSD pager was designed very very well by an extremely competent
person with a specialization in performance analysis of virtual memory;
unfortunately it was developed on a machine with 512Kbytes. On today's
machines with 8-16-32 Mbytes, a clock scan may take several minutes,
making the well designed algorithm essentially useless. A good example
of how unanticipated changes of scale make algorithms inappropriate.
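
For concreteness, a minimal one-handed clock scan in C (the frame count,
structures, and names are invented); the point is simply that a full pass
has to visit every frame, so the cost grows with memory size:

    #define PAGE_SIZE 4096u
    #define NFRAMES   ((32u * 1024 * 1024) / PAGE_SIZE)  /* e.g. 32 Mbytes */

    struct frame {
        unsigned char in_use;
        unsigned char referenced;   /* real or simulated reference bit */
    };

    static struct frame core[NFRAMES];
    static unsigned long hand;

    /* Advance the hand until it finds an in-use frame that has not been
     * referenced since the previous pass, clearing reference bits on the
     * way (second chance).  Assumes at least one frame is in use. */
    unsigned long clock_select_victim(void)
    {
        for (;;) {
            struct frame *f = &core[hand];
            unsigned long victim = hand;

            hand = (hand + 1) % NFRAMES;
            if (!f->in_use)
                continue;
            if (f->referenced)
                f->referenced = 0;  /* touched recently: spare it this pass */
            else
                return victim;      /* idle since the last pass: steal it */
        }
    }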

   and there are people here who were looking at working set schedulers
   for system v that attempt to address the issues presented by the
   original poster.

The system V pager also uses clock, with the same problem. Too bad that
the system V pager interacts with one of the worst swappers around. I
would really like AT&T to redesign the system V swapper. Even Bach says
it stinks.

   but we have to re-engineer those ideas for sun's VM architecture
   which makes its system v debut in release 4 and it remains to be
   seen if those ideas are compatible with the VM architecture.

I hope that system V.4 has a better pager and swapper.  It is not true,
as many people assume, that memory is infinite and all machines are
single-user, so that paging essentially never occurs and swapping need
never occur.  On top of that, the SUN vm architecture is not the most
smooth around; it was designed with the SUN hacker's principle in mind,
"as long as it performs/works in some frequent enough cases, people
will buy it".
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

jim@cs.strath.ac.uk (Jim Reid) (09/04/89)

In article <1114@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>I hope that system V.4 has a better pager and swapper. It is not true
>like many people assume that memory is infinite and all machines are
>single user, so paging essentially never occurs and swapping should
>never occur.  On top of that, the SUN vm architecture is not the most
>smooth around; it was designed with the SUN hacker's principle in mind,
>"as long as it performs/works in some frequent enough cases, people
>will buy it".

It's hard to see how System V.4 will dramatically improve the VM system.
System V.4 is being touted as the ultimate all-singing, all-dancing UNIX
version.  This is supposed to run on most hardware platforms: from humble
machines with maybe only 0.5 or 1 Mbyte of RAM (the Sun kernel people
will soon show those users the error of their ways) to monster multi-user
systems with many tens of Megabytes of memory. On top of that are the
architectural considerations: what's best for a multiprocessor box or a
RISC? What about the interaction of page tables with the MMU? How can
that code be written in a generic way for different computers? Then there
will be the variety of job mixes to consider: window managers, lisp
processes, database packages, simulators and so on. Then consider "new"
ideas like paging through the file system and lightweight processes. Some
of these are bound to surface in System V.4.

A VM system general enough to perform well for most potential users and
their applications on most potential hardware platforms is asking a lot.

		Jim

dmocsny@uceng.UC.EDU (daniel mocsny) (09/05/89)

In article <278@baird.cs.strath.ac.uk>, jim@cs.strath.ac.uk (Jim Reid) writes:
> A VM system general enough to perform well for most potential users and
> their applications on most potential hardware platforms is asking a lot.

And the cost of failing to give this to the users is a lot more.
Namely, OS fragmentation today costs users $$ billions/yr in lost
time, reduced software availability, artificially increased
complexity, and delays in quickly exploiting higher-performance
architecture.

This isn't pocket change, nor is it funny money. This is real wealth,
the kind you can take to the bank. Fragmentation takes this money out
of the users' pockets. Meaning that those users have less remaining in
their pockets to spend on more high-tech toys. The computer industry
is large enough now that it begins to tax the ability of society to
subsidize its past failures to develop and adhere to comprehensive
standards. Maximizing future computer industry growth depends on
removing as much nonessential complexity from the user as possible.

The best way to do this is to listen to what the users are screaming
for. From the user's standpoint, a computer is essentially a tool to
access the work of programmers. I.e., users don't buy computers
because they like the layout of the motherboard, the elegance of the
address decoders, etc. They buy computers to run software. The only
questions are: how much software will the box run, and how fast? The
ability of a programmer to generate wealth for users is proportional
to the number of boxes that can run that programmer's work. Therefore,
for the computer industry to deliver the most value to the customer
(and to consequently share in that value) it must take whatever steps
are necessary to stop destroying the work of programmers.

Dan Mocsny
dmocsny@uceng.uc.edu

pcg@thor.cs.aber.ac.uk (Piercarlo Grandi) (09/06/89)

In article <PCG.89Sep2220027@thor.cs.aber.ac.uk>
pcg@thor.cs.aber.ac.uk (Piercarlo Grandi) writes:

   A very interesting, forgotten article by Dijkstra argues for
   what is essentially pff.

My reference and article archive/database is still in a state of
disarray.  I cannot find the book where this paper was.
It was published by Springer Verlag in their LNCS series, and
it was part of the proceedings of an OS course in the seventies,
probably a NATO one.

   There are shortcuts. A paper I remember on SIGOPS detailed the
   design of the Univac 80 (IBM lookalike, ex RCA) ws pager. It
   tried hard to minimize such overheads, and to provide an
   efficient proxy to the true ws rule of time based victim
   selection. Seemed well done. Another interesting (old as well) paper
   seems to be that describing the VM ws pager done at Grenoble.

The Univac paper is

	Marc H. Fogel "The VMOS paging algorithm: a practical
	implementation of the working set model", ACM SIGOPS, Jan 1974.

The VM paper is

	Juan Rodriguez-Rosell, Jean-Pierre Dupuy "The design,
	implementation, and evaluation of a working set dispatcher",
	CACM, Apr 1973.

The paper that compares pff and ws is:

	Ram K. Gupta, Mark A. Franklin "Working set and page fault
	frequency algorithms: a performance comparison", IEEE TC,
	Aug 1978.

Other interesting policies are those described in:

	Alan J. Smith "A modified working set paging algorithm",
	IEEE TC, Sep 1976.

	Richard W. Carr, John L. Hennessy "WSClock - a simple and
	effective algorithm for virtual memory management", ACM SIGOPS 1981.

	Domenico Ferrari "VSWS: the variable interval sample working set
	policy", IEEE TSE, May 1983.

Hope this helps. I think that the Gupta paper is extremely interesting
reading, and so are the two on actual working set simulations.  VSWS also
seems nice.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

mds@wang.UUCP (Marc San Soucie) (09/07/89)

In article <2089@uceng.UC.EDU>, dmocsny@uceng.UC.EDU (daniel mocsny) writes:

> Therefore,
> for the computer industry to deliver the most value to the customer
> (and to consequently share in that value) it must take whatever steps
> are necessary to stop destroying the work of programmers.
> 
> Dan Mocsny
> dmocsny@uceng.uc.edu

For instance, by eliminating once and for all the scourge of 64K segmented
memory from the computer industry. That this might drive the minions of
Intel and Microsoft to seek more productive lines of work causes me to
shed no more tears than I shed for the passing of the Z80.

The byte gods alone know how much programmer productivity has been sent down
the shiny chute because of the unnatural, irrational, inconceivably obscure
demands of MS/DOS and the 80x86, keepers of the foul flame of Medium Model
and its unwelcome cousins Huge and Outrageous.

Not that UNIX is likely to save the world either.

The planet needs, deserves, something more enlightened than the carp floating
about these days. But who can afford to develop something new in these times
of economic shortsightedness?

    Marc San Soucie
    Massachusetts
    mds@wang.wang.com (opinions those of the author and maybe his mother)

slackey@bbn.com (Stan Lackey) (09/08/89)

In article <2089@uceng.UC.EDU> dmocsny@uceng.UC.EDU (daniel mocsny) writes:
>In article <278@baird.cs.strath.ac.uk>, jim@cs.strath.ac.uk (Jim Reid) writes:
>> A VM system general enough to perform well for most potential users and
>> their applications on most potential hardware platforms is asking a lot.
>
>And the cost of failing to give this to the users is a lot more.

This gets back to fundamental laws.  One of them is "special purpose
is always cheaper than general purpose [to do the same specific job]".
In other words, if you want the best paging performance on a VAX, you
get VMS.  Because UNIX is a general purpose tool, ie it was made to
run on anything, it can't be as good on a specific box as an OS that
was written and tuned to run on that specific box and nowhere else.

Giving up something is required for generality; no general purpose
machine is going to run FFT's as fast as an FFT box that costs the
same.

Now, lots of high speed box vendors have tuned their proprietary
versions of UNIX around their own architectures, usually taking
advantage of the specifics of their own boxes.  But the new version
is not general purpose.
-Stan

dmocsny@uceng.UC.EDU (daniel mocsny) (09/08/89)

In article <45344@bbn.COM>, slackey@bbn.com (Stan Lackey) writes:
> [A] fundamental [law] ... is "special purpose
> is always cheaper than general purpose [to do the same specific job]".
> In other words, if you want the best paging performance on a VAX, you
> get VMS.  Because UNIX is a general purpose tool, ie it was made to
> run on anything, it can't be as good on a specific box as an OS that
> was written and tuned to run on that specific box and nowhere else.

Agreed. But why can't the hardware be tuned to run UNIX? Why build
interesting hardware first, and then almost as an afterthought put
an OS on it that is going to give users a porting nightmare?

Another question is this: is *every* difference between VMS and UNIX
essential to the superior paging performance of VMS on VAX? In other
words, *how much* UNIX would we have to give up to get that paging
performance, as opposed to how much we give up by running VMS instead?

> Giving up something is required for generality; no general purpose
> machine is going to run FFT's as fast as an FFT box that costs the
> same.

Ah, but not all incompatibilities provide any clear-cut benefit.
Indeed, I think the opposite situation is more common. Consider the
proprietary minicomputer market. Can any truly disinterested observer
claim that all the fragmentation there leads to any significant
functional differences? I'm certain some exist, but in the real world
can they really justify the massive loss to users caused by reduced
software availability, porting problems, heterogeneous network
nightmares, and so on?

In another newsgroup, I proposed the notion of horizontal vs.
vertical fragmentation. Certainly as technology advances, we cannot
expect to remain indefinitely compatible with obsolescent hardware,
protocols, etc. The incompatibility resulting from advancing
technology I call vertical fragmentation. This is an essential
cost of progress, which the gains of progress offset.

On the other hand, when several vendors follow arbitrary and
incompatible paths to yield systems of roughly equivalent performance,
they have horizontally fragmented the market. One does not need great
imagination to list hundreds of examples of this. Just look around any
multi-vendor shop. Horizontal fragmentation yields possible
short-term competitive advantages to some vendors at the cost of
greatly increased artificial complexity to the user. This reduces the
amount of wealth the user can generate, ultimately reducing the size
of the overall market for all vendors.

Finally, computers are, after all, general-purpose devices. Unless you
are building embedded controllers or selling to users with high
technical sophistication and stable plans, you face another
fundamental law: Most people have no real idea of all the things a
computer can do for them until they actually try them. If a computer
doesn't act as a window into the wider work of the programming
community, the user will simply fail to learn about all the unforeseen
benefits available out there. Again, this translates into lost wealth
to the user, leading to a smaller market for the vendors.

Dan Mocsny
dmocsny@uceng.uc.edu

gary@dgcad.SV.DG.COM (Gary Bridgewater) (09/09/89)

In article <45344@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>In article <2089@uceng.UC.EDU> dmocsny@uceng.UC.EDU (daniel mocsny) writes:
>>In article <278@baird.cs.strath.ac.uk>, jim@cs.strath.ac.uk (Jim Reid) writes:
>>> A VM system general enough to perform well for most potential users and
>>> their applications on most potential hardware platforms is asking a lot.
>This gets back to fundamental laws.  One of them is "special purpose
>is always cheaper than general purpose [to do the same specific job]".

I think it goes deeper than that. People want to buy the cheapest box that
will do their job. They want to pull it out of the box, plug it in and
have it satisfy their deepest desires from the get-go.

A reasonable method of solving the paging problem would be to divorce the
paging functionality from the policy decisions. Then you document the
interface to the policy modules, provide a default and let the end users
customize the thing as much as they want. All the necessary information
would be made available via tables/calls to allow the finest micro-tuning.
Note that this is (intended to be) a far cry from buying the sources, hiring
a guru and letting him/her fool around for six months.
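
As a purely hypothetical sketch of what such a documented policy interface
might look like (this is nobody's actual interface; every name below is
made up), in C:

#include <stddef.h>

struct page_stats {
    long faults;          /* faults charged to this process            */
    long resident;        /* pages currently resident                  */
    long last_ref_time;   /* most recent reference "time" observed     */
};

struct paging_policy {
    const char *name;
    /* pick a victim page for this process; return index or -1 */
    long (*select_victim)(const struct page_stats *st);
    /* should the whole process be swapped out? */
    int  (*should_swap_out)(const struct page_stats *st, long free_pages);
    /* called on every fault so the policy can keep its own books */
    void (*note_fault)(struct page_stats *st, long now);
};

/* Trivial default policy, just to show the shape of a module. */
static long default_victim(const struct page_stats *st)
{
    return st->resident > 0 ? st->resident - 1 : -1;   /* last page */
}

static int default_swap_out(const struct page_stats *st, long free_pages)
{
    return free_pages == 0 && st->faults > 1000;       /* arbitrary */
}

static void default_note_fault(struct page_stats *st, long now)
{
    st->faults++;
    st->last_ref_time = now;
}

static struct paging_policy default_policy = {
    "default", default_victim, default_swap_out, default_note_fault
};

static struct paging_policy *active_policy = &default_policy;

void set_paging_policy(struct paging_policy *p)
{
    active_policy = p ? p : &default_policy;   /* privileged in real life */
}

/* The mechanism consults the policy; it never decides anything itself. */
void pager_handle_fault(struct page_stats *st, long now, long free_pages)
{
    active_policy->note_fault(st, now);
    if (active_policy->should_swap_out(st, free_pages)) {
        /* mechanism would swap the whole process out here */
    } else if (free_pages == 0) {
        long victim = active_policy->select_victim(st);
        (void)victim;   /* mechanism would write the victim out and reuse it */
    }
}

The kernel keeps the mechanism (fault handling, disk I/O, frame accounting)
and only consults the table for decisions, so a site could drop in a ws,
pff, or home-grown policy without touching the rest of the pager.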

The problem is that this is complicated and may, horrors, require reading the
manual (that shiny thing in the shrink wrap package in the back of your file
drawer, remember?). In any case, I still don't expect a hue and cry from the
user community for this. Memory is cheaper than gurus' salaries over the
same time frame. I don't mean to imply that customers are stupid; they just
aren't (usually) computer literate, and they don't want to be.

I don't think you can have the perfect (to some degree) generic algorithm.
The variables multiply too fast and behavior differences are nearly
infinite. And, don't forget that the purpose of VM is to save money, not
time. If you want real memory performance then get real memory. If you
want to emulate real memory at a cost of V/R then get VM. TANSTAAFL or
TANSTAFM.
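
Just to put a number on that trade (made-up figures, nothing measured):
with a 100 ns memory reference and a 20 ms disk fault, the arithmetic below
shows that one fault per 100,000 references already triples the effective
reference time, which is where the "save money, not time" point bites.

#include <stdio.h>

int main(void)
{
    double t_mem   = 100e-9;   /* assumed main-memory reference, seconds */
    double t_fault = 20e-3;    /* assumed page-fault service time        */
    double rates[] = { 1e-7, 1e-6, 1e-5 };
    int i;

    for (i = 0; i < 3; i++) {
        double p   = rates[i];
        double eff = (1.0 - p) * t_mem + p * t_fault;
        printf("fault rate %.0e: %.0f ns per reference (%.1fx base)\n",
               p, eff * 1e9, eff / t_mem);
    }
    return 0;
}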
-- 
Gary Bridgewater, Data General Corp., Sunnyvale Ca.
gary@sv4.ceo.sv.dg.com or 
{amdahl,aeras,amdcad,mas1,matra3}!dgcad.SV.DG.COM!gary
No good deed goes unpunished.

baum@Apple.COM (Allen J. Baum) (09/12/89)

>In article <70663@yale-celray.yale.UUCP> leichter@CS.YALE.EDU (Jerry Leichter) writes:
>I have never understood this need people have to use the VAX as the standard
>bad example of EVERYTHING.  In fact, the designers of the VAX worked in ways
>very similar to the way good RISC designers work today:  Hardware and software
>people cooperated, and there was an attempt to balance the complexity each
>had to deal with.
.....
>the compiler writers heavily influenced the design of the ISA, asking for
>things like 3-address arithmetic instructions in addition to 2-address and
>indexed addressing modes, but given the compiler technology of the day, this
>is what they thought would be useful to them.  Compiler technology has changed
>greatly since then, as has hardware technology.
>
>The main difference between the VAX design team and current design teams is
>probably in the inclusion of the OS designers.  The process structure of the
>VAX, the support in hardware for AST's, the security structure - all of these
>are direct reflections of the OS being developed.

Hear, hear! I'm not particularly a fan of the VAX, but it's getting a lot of
bad press from the RISC people because of its performance vs. the new chips.
This is NOT because it is a bad design; it's because it's an old design, and
the nature of the tradeoffs has changed. This includes not just compiler
and OS technology, but much of the hardware technology: memories, I/O, logic,
and packaging. These make a big difference in how a system is designed.

The major fault of VAX proponents is that they haven't recognized that these
tradeoffs have made a difference, and they still think they can compete (on
the basis of performance, as opposed to just running old code).

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

johnl@esegue.segue.boston.ma.us (John R. Levine) (09/12/89)

In article <34640@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>Hear, hear! I'm not particularly a fan of the VAX, but it's getting a lot of
>bad press from the RISC people because of its performance vs. the new chips.
>This is NOT because it is a bad design; it's because it's an old design, and
>the nature of the tradeoffs has changed. ...

Maybe I'm just an old grouch (and programming around all the microcode bugs
while bringing up 4.1 on a /750 made me mighty grouchy), but it seems to me
that the Vax was never that good a design.  The architecture was predicated
on a technological bet that microcode memory would always be faster than
regular memory, and when that bet turned out to be wrong, the penalty for the
heavily microcoded implementation that the Vax demands became very large.

Also, it has the distinct flavor of having been designed by a bunch of guys
sitting around saying "yeah, that would be nice, we'll put it in microcode
so it will be fast" and thereby ending up with an architecture that is
practically impossible to pipeline and has complicated instructions that
nobody uses because it's faster to get the same effect using simpler ones.
I don't doubt that the OS and compiler guys were in on the design, but a
little more scepticism about what the software guys asked for would have
helped a lot.

For comparison, consider the considerably older IBM 360 architecture.  They
made a few real goofs, notably 24 rather than 32 bit addresses which they
are still slowly fixing, and hex floating point.  Nonetheless, the
relatively regular architecture and fixed instruction formats have worn
well.  There are relatively few instructions that nobody uses (MVZ and LNR,
perhaps).  IBM has certainly had more success with really fast 360s than DEC
has with really fast Vaxen.  (The software which one runs on a really fast
360 is an entirely different argument, though.)
-- 
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 492 3869
johnl@esegue.segue.boston.ma.us, {ima|lotus}!esegue!johnl, Levine@YALE.edu
Massachusetts has 64 licensed drivers who are over 100 years old.  -The Globe