[comp.os.vms] Performance Parameters for the VAX.

NIELAND@FALCON.BERKELEY.EDU.UUCP (12/03/87)

Does anyone out there have any recommendations for which parameters to tweak 
on an 11/780 to handle periods of heavy CPU usage?

Don't suggest the Dynamic Load Balancer; here is the background.

The system is used for electronic mail and an electronic bulletin board for 
administrative use, and for hard research work.  An 8650 is on order and is 
scheduled to arrive in March; all research work will be moved to the 8650, 
leaving only the electronic communications on the 780.

Currently, whenever the hard researchers start gearing up and doing several 
compiles and heavy analysis runs, the response time goes way down and the mail 
people get frustrated.  I would like to get the system to let mail and other 
small CPU items go in and do what they need when they want, with the crunchers 
taking up the slack (so the small jobs never have to compete with the crunching).

Some of the people doing the crunching are also people using the admin 
electronic messaging, so I can set the priority down on their accounts.

I don't want to buy the Dynamic Load Balancer software that is available, for 
a problem that should only be around for a few more months.

I am wondering what parameters should be tweaked to balance out the system.

Please send responses to me.

--------------------------------------------------------------------------------
|                M. Edward (Ted) Nieland - Systems Analyst                     |
|------------------------------------------------------------------------------|
| US Snail:                            | Arpa Internet:                        |
| Systems Research Laboratories, Inc.  | TNIELAND@WPAFB-AAMRL.ARPA             |
| 2800 Indian Ripple Road   WP 196     | NIELAND%FALCON@WPAFB-AAMRL.ARPA       |
| Dayton, OH  45440                    |                                       |
|------------------------------------------------------------------------------|
| A T & T:  (513) 255-5156                                                     |
--------------------------------------------------------------------------------

mike@VAX.OIT.UDEL.EDU ("Michael J. Porter") (12/04/87)

We have all our users submit heavy crunch jobs as batch.  The batch
queues have a lower priority than interactive, so our interactive
users are not affected.  Putting CPU time limits on users will
ensure that they submit jobs.  We simply threatened to do this and
our users cooperated, making life easier for all.
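
For anyone setting this up from scratch, the pieces look roughly like this
(queue, user, and file names below are made up; note that a batch queue's
/CPUMAXIMUM qualifier can give batch jobs a longer CPU limit than the UAF
allows, so the cap only pinches interactive use):

	$ ! A batch queue that runs below interactive priority (default 4):
	$ INITIALIZE/QUEUE/BATCH/START/JOB_LIMIT=1/BASE_PRIORITY=2 CRUNCH$BATCH
	$ ! Users submit their heavy jobs to it:
	$ SUBMIT/QUEUE=CRUNCH$BATCH/NOTIFY BIGJOB.COM
	$ ! Cap per-process CPU time (ten minutes here) via AUTHORIZE:
	$ MCR AUTHORIZE
	UAF> MODIFY CRUNCHER/CPUTIME=00:10:00
	UAF> EXIT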

				mike@vax.oit.udel.edu
------

jeh@crash.cts.com (Jamie Hanrahan) (12/05/87)

In article <8712040527.AA06002@ucbvax.Berkeley.EDU>,
  NIELAND@FALCON.BERKELEY.EDU (Ted Nieland - SRL) writes:
>Does anyone out there have any recommendations for which parameters to tweak 
>on an 11/780 to handle periods of heavy CPU usage? ...
>
>Currently, whenever the hard researchers start gearing up and doing several 
>compiles and heavy analysis runs, the response time goes way down and the mail 
>people get frustrated.  I would like to get the system to let mail and other 
>small CPU items go in and do what they need when they want, with the crunchers 
>taking up the slack (so the small jobs never have to compete with the crunching).

You need to investigate a little bit with the MONITOR utility and find out
what's happening to the system when the crunchers are running.  Watch it for
a while when they're not running, and then when they are, to get a feel for
the differences.  

I suspect that when the crunchers aren't on you've got lots of CPU idle time,
and when they're running you have none.  If this isn't the case, you may be
disk or memory bound instead, and the following ideas should be ignored.  

Your first look should be at MONITOR MODES.  Look at idle time without and
with the crunchers.  If it drops to zero (or nearly so) during the crunch,
keep reading.  
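
The commands themselves are trivial; the interval and the summary file name
below are just my choices:

	$ ! Watch the live display, sampling every 5 seconds:
	$ MONITOR MODES/INTERVAL=5
	$ ! Or record a half-hour summary to compare quiet vs. crunch periods:
	$ MONITOR MODES/INTERVAL=5/ENDING="+00:30:00"/SUMMARY=QUIET.SUM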

The next thing to worry about is where the CPU is spending its time.  

If all or most of the additional load is in user mode, you can try:

	- Reducing the SYSGEN parameter QUANTUM.  As a start, set it to 5
	(50 milliseconds) instead of the default 20.  Despite old VMS 
	folklore, even absurdly low QUANTUM values (e.g. 2) have minimal
	`overhead' impact on a 780-class VAX.  A lower quantum will stir
	up the queues of `computable' processes faster, giving everyone a
	shot at the CPU more often.  You want to set QUANTUM to the minimum
	value that will let your typical interactive user get one keystroke's
	worth of work done within one quantum.  (Commands for this and the
	next two suggestions are sketched just after this list.)

	- Ensure that everyone's UAF entries specify the same base priority.
	VMS priorities are absolute, not proportional -- if there are 
	priority 5 and priority 4 processes, both CPU hogs, the pri. 5 one
	will get all the CPU and the other will get none.  (The priority
	increment controlled by the SYSGEN parameter PIXSCAN modifies this
	somewhat, but the principle still holds.)  

	- Administrative cures:  If the crunchers aren't doing their work
	interactively (i.e. if they're just waiting for the DCL prompt to come
	back), encourage them to run their stuff in a batch queue, and set the
	job limit on the queue to something small, like 1, and set the
	base priority on the queue to 2.  Note that it is possible to enforce
	this via ACLs -- you can set ACLs on the offending images that require
	the executing process to hold the BATCH identifier in order to run
	them.  Explain to your users that it does them no good to try to run,
	say, 8 cpu hogs at once instead of serially; there's only one cpu and
	they are still essentially running one at a time.  In fact they'll get 
	done faster if run one at a time, since a batch queue with a low job
	limit can specify a very large working set extent, which can speed 
	execution, and besides they'll avoid all the context-switching over-
	head (which will get worse as you decrease QUANTUM!).  This is really
	the best answer, but it can be hard to implement, depending on the
	political situation at your shop.  
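
To put commands to all three of the above (queue, user, and image names are
invented; QUANTUM is dynamic, so WRITE ACTIVE takes effect immediately):

	$ ! 1. Lower QUANTUM (units of 10 ms; 5 = 50 ms):
	$ MCR SYSGEN
	SYSGEN> SET QUANTUM 5
	SYSGEN> WRITE ACTIVE
	SYSGEN> EXIT
	$ ! 2. Level everyone's base priority (4 is the VMS default):
	$ MCR AUTHORIZE
	UAF> MODIFY CRUNCHER/PRIORITY=4
	UAF> EXIT
	$ ! 3. A single-stream, low-priority batch queue with a large
	$ !    working set extent:
	$ INITIALIZE/QUEUE/BATCH/START/JOB_LIMIT=1/BASE_PRIORITY=2 -
		/WSEXTENT=8192 CRUNCH$BATCH
	$ ! Force the hogs through the queue:  remove world access to the
	$ ! image, then grant execute to holders of the BATCH identifier:
	$ SET PROTECTION=(WORLD) SYS$SYSTEM:CRUNCHER.EXE
	$ SET ACL SYS$SYSTEM:CRUNCHER.EXE -
		/ACL=(IDENTIFIER=BATCH,ACCESS=READ+EXECUTE)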

If all or most of the additional load is in kernel mode:

This is an indication that the hogs are probably incurring lots of soft 
(in-memory) page faults.  Check MONITOR PAGE and look at the total number of 
faults per second vs. the number of page read I/Os per second; the page read 
I/O rate is essentially the hard fault rate (not exactly, I know, but close 
enough for our purposes), and the remainder are soft faults.  

While a soft fault doesn't involve disk I/O it still takes about fifty to
a hundred instruction times to resolve, so if you can reduce their number 
you can dramatically reduce the amount of cpu time the offending image needs
to execute.  You reduce the soft faults by giving the users of the image a
large WSEXTENT, and setting WSINC to some large value (200 or 500 works well)
so that the system will increase the process's working set very fast as soon
as a few faults occur.  I am assuming here that you have the memory available
to do this; if you don't, you may be out of luck (but see below).  
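
In commands (the user name and the numbers are placeholders; WSINC is
dynamic, while a new WSEXTENT applies at the user's next login):

	$ MCR AUTHORIZE
	UAF> MODIFY CRUNCHER/WSEXTENT=8192
	UAF> EXIT
	$ MCR SYSGEN
	SYSGEN> SET WSINC 500
	SYSGEN> WRITE ACTIVE
	SYSGEN> EXIT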

While you're looking at MONITOR PAGE, check the system page fault rate.  
These are faults against the system working set, and hurt everybody.  2
per second is about the most you should see.  If there are more, increase
SYSMWCNT (the SYSGEN parameter that controls the size of the system
working set) in steps of 30% or so until they go away.  
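
SYSMWCNT isn't dynamic, so the change takes a reboot; the safe route is a
line in MODPARAMS.DAT followed by AUTOGEN (the value below is an example
only):

	$ ! Add to SYS$SYSTEM:MODPARAMS.DAT:
	$ !     SYSMWCNT = 650
	$ ! then rebuild the parameter file and reboot:
	$ @SYS$UPDATE:AUTOGEN GETDATA REBOOT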

If you appear to have some CPU idle time even though the system is bogged
down, do a MONITOR PAGE and check the page I/O rates, and look for swap
activity (the inswap rate under MONITOR IO, and the outswapped states under
MONITOR STATES).  What *may* be happening is that the offending processes'
working sets are too large, so that when they run, they force others to be
outswapped.  Cure:  Set their WSQUOTAs down and increase their WSEXTENTs
instead.  Then they'll only be able to use the extra memory when it's
available.  Caution:  When the memory isn't available, they'll start
incurring lots of soft (and maybe hard) page faults; see above.
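
In AUTHORIZE terms (name and numbers invented):

	$ MCR AUTHORIZE
	UAF> MODIFY CRUNCHER/WSQUOTA=350/WSEXTENT=8192
	UAF> EXIT

The quota is guaranteed; growth beyond it toward the extent happens only
while the free page list is big enough, so the hogs' appetite becomes
self-limiting.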

Another thing that can force unnecessary swapping is BALSETCNT.  This should
normally be two less than MAXPROCESSCNT.  Only BALSETCNT processes (plus
null and swapper) can be inswapped at a time; if there are more procs on
your system, the remainder will be forced to be outswapped whether there's
memory for them or not.  Do not, however, twiddle BALSETCNT without using
AUTOGEN - if you have a large VIRTUALPAGECNT, setting BALSETCNT too high
may result in an unbootable system.  
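
To see where you stand, and to let AUTOGEN do the arithmetic (the MODPARAMS
value below is an example, two less than a MAXPROCESSCNT of 64):

	$ MCR SYSGEN
	SYSGEN> SHOW BALSETCNT
	SYSGEN> SHOW MAXPROCESSCNT
	SYSGEN> EXIT
	$ ! If BALSETCNT needs raising, add, e.g.,
	$ !     BALSETCNT = 62
	$ ! to SYS$SYSTEM:MODPARAMS.DAT, then:
	$ @SYS$UPDATE:AUTOGEN GETDATA REBOOT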

>Some of the people doing the crunching are also people using the admin 
>electronic messaging, so I can set the priority down on their accounts.

Not a good idea; then they may get no CPU time at all.  See the above
discussion on priorities.  

>I don't want to buy the Dynamic Load Balancer software that is available, for 
>a problem that should only be around for a few more months.

DLB mostly is good for systems that are short on memory.  It does a great
job of parcelling out memory to the processes that need it, and of taking
away memory that processes don't need, thereby reducing page fault rates.
If your system is memory-bound DLB may well help.  

Disclaimer:  The above is the standard list of first things to try.  Your
machine may have some other obscure problems, in which case none of the
above will likely help.  But this should give you a good start.  

Caution:  The standard rule of thumb is that `tuning' a VAX can only buy
you about 10-20%, unless it is radically mistuned in the first place.
So don't expect miracles.

Footnote:  Since batch queues are unpopular, what is REALLY needed for 
situations like this (and most shops have situations like this; very few
VMS machines have homogeneous loads) is a way to specify priority adjustments 
according to the image being run.  Compilers, analysis packages, and the like
should get an automatic "-1" adjustment, or else editors and the like should
get an automatic "+1".  It would be nice to be able to specify a priority
adjustment along with enhanced privs when you INSTALL an image (and, while
we're at it, let's add the ability to associate rights identifiers with an
installed image as well; this would be like Unix's setuid feature, only more
powerful).  

Failing that, a reasonably competent programmer should be able to take one of
the `idle process killer' programs available from DECUS and add this sort of 
feature -- it could check the image name for each process against a list of 
known names, apply priority fixes as specified by the list, and reset the 
priorities to normal when the images go away.  I've got far too much on my
"to do in my spare time" list as it is to volunteer to do such a thing, but
I'd sure like to see it done...
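
To make the idea concrete, the scanning loop might look something like the
following DCL (the image names, the demoted priority, and the scan interval
are all arbitrary; it needs WORLD privilege, plus ALTPRI to push anything
back above its own base, and a real version would remember each process's
normal priority and restore it when the image goes away):

	$ ! PRIDEMOTE.COM -- demote processes running known cpu-hog images.
	$ LOOP:
	$   ctx = ""                          ! context for F$PID scan
	$ NEXT:
	$   pid = F$PID(ctx)                  ! step through all processes
	$   IF pid .EQS. "" THEN GOTO NAP
	$   image = F$GETJPI(pid, "IMAGNAME") ! full image file spec, if any
	$   IF image .EQS. "" THEN GOTO NEXT
	$   IF F$LOCATE("CRUNCHER", image) .EQ. F$LENGTH(image) .AND. -
	       F$LOCATE("FORTRAN", image) .EQ. F$LENGTH(image) THEN GOTO NEXT
	$   SET PROCESS/IDENTIFICATION='pid'/PRIORITY=3
	$   GOTO NEXT
	$ NAP:
	$   WAIT 00:01:00                     ! rescan once a minute
	$   GOTO LOOP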