[comp.unix.wizards] Performance Tuning Ultrix 4.1

corey@milton.u.washington.edu (Corey Satten) (04/30/91)

(This is cross-posted to unix-wizards because it may also apply to 4.3BSD)


	    Performance Tuning a DEC Ultrix 4.1 Workstation

			       Round 2

		Corey Satten, corey@cac.washington.edu
		  Networks and Distributed Computing
		       University of Washington
			  Seattle, Washington
			      April 1991



This is a follow-up to work first posted in September 1990.

History:

      Our department is using a rather maximally configured DECstation as
  a time-sharing host.  It is a DEC 5000 running Ultrix 4.1 and has six
  disks, mostly 660 meg or 1 gigabyte.  It serves /usr/local/bin via NFS
  to about a dozen workstations; talks to several printers; is the
  departmental electronic mail machine; hosts some campus wide mailing
  lists; is our anonymous FTP server; is one of two campus default domain
  nameservers; and is also the time-sharing host for about 20 X-terminals
  plus a dozen or more other users connected via telnet.  We are supporting
  about 250 megs of swap space on roughly 43 megabytes of the 56 megabyte
  physical memory.  A 'ps aux' listing usually has 400+ processes in it.

      In my September posting, I described how tweaking some global
  variables in the kernel allowed us to improve performance by paging more
  and swapping less and maintaining a larger chunk of free memory.  Several
  people on campus and in netland followed our lead and reported similar
  improvements.  (Global kernel variables can be conveniently tweaked on
  a running system with the kmem program included with this posting.)

      Our system, thus tweaked, spoiled us by the times it was fast and
  frustrated us by the times it wasn't.  Occasionally we had reports of
  very large (20+ second) character echo delays experienced by one user
  while others in the same environment running the same programs saw no
  delays.  Furthermore, large programs tended to lose too many pages to
  run satisfactorily.  These continuing problems, plus my gut feeling that
  56 megabytes should really be enough memory for what we're doing,
  motivated me to continue investigating.

Current Work:

      Close examination of our system revealed that we had some processes
  which had not run in a very long time (perhaps days) but which still
  had a significant RSS.  Scrutiny of the source combined with some
  careful experiments led me to discover that data pages never page out
  under normal circumstances even though all the complicated code to do
  so is there.  Judging from the code, this was a conscious decision made
  in BSD Unix.  Modern Unix systems running X windows tend to have more
  idle processes than ever before and those processes tend to have larger
  data spaces than their non-windowing ancestors.  Thus, not paging data
  pages causes memory to be tied up with junk which only swapping can
  remove.  On our system, I estimated perhaps 20 megabytes fell into
  this category.

      So why don't these data pages eventually swap out?  When free memory
  becomes less than "desfree", the kernel looks for processes which have
  been sleeping longer than "maxslp" (usually 20 seconds) to swap out.
  To my horror, I discovered that it always starts looking at the beginning
  of the process table and stops when it has satisfied the need for free
  memory.  A quick modification to "ps" to print the processes in
  process-table order and also print the number of times each had swapped
  confirmed this.  No process in the last half of our process table had
  ever swapped and those at the front had swapped a lot.  The extreme
  prejudice which is directed against processes living at the front of
  the table could well explain the horrible performance some users
  occasionally reported that the rest of us didn't see.

      To completely rectify both the swapping and paging problems
  described above, kernel changes are required.  However, I believe fixing
  the data paging problem has the biggest effect, and later I will describe
  a partial workaround to the data paging problem which may work for those
  of you who can't change your kernels.  (You might also try asking your
  DEC rep to supply these changes in binary form.)

      To fix the swapping problem, I modified the FORALLPROC macro used
  by vm_sched.c to begin swapping where it last left off so, in the long
  run, process table position has no predictable effect on swapping and
  long time sleepers will eventually swap out.  To achieve this, I also
  needed to make other small changes to vm_sched.c.

      To fix the data paging problem I simply changed the two places which
  prevented data paging in vm_page.c.  In both vm_sched.c and vm_page.c,
  I inserted global variables which I can set and test at runtime to
  experiment.  Since data pages can be more expensive to page out than
  text pages, BSD and Ultrix have 2 limits on data pageouts.  1) Only
  maxpgio/4 data pages per second will page out.  2) Data pages are only
  paged out if the process they belong to has an RSS > (saferss - sleeptime),
  where sleeptime is the number of seconds since the process has run.
  Our system seems to be doing fine with these defaults.

      If you can't change your kernel but you still want to try to get
  data pages to page out you may be able to use the following trick to
  force all processes to exceed their soft memory limit.  Rename /etc/init
  to /etc/init.orig and replace /etc/init with the following trivial
  program:

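	#include <stdio.h>
	#include <sys/time.h>
	#include <sys/resource.h>

	/* installed as /etc/init: lower the soft RSS limit, run the real init */
	main(argc, argv, envp)
	    int argc;
	    char *argv[], *envp[];
	{
	    struct rlimit rlp;

	    getrlimit(RLIMIT_RSS,&rlp);	/* keep the hard limit unchanged */
	    rlp.rlim_cur = 1;		/* zero may be better if it works */
	    setrlimit(RLIMIT_RSS,&rlp);	/* inherited by all descendants */
	    execve("/etc/init.orig", argv, envp);
	    perror("/etc/init.orig");	/* reached only if execve fails */
	    return 1;
	}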
  
  This will effectively nice them all (which shouldn't matter since it
  happens to them all uniformly) and allow their data pages to be paged
  out as long as the resident set size is greater than the value of
  "saferss" (6 pages) minus the idle time of the process.  Unfortunately,
  DEC changed p_rssize from signed long to unsigned long so as soon as a
  process has been idle longer than saferss seconds the data pages stop
  paging out again.  The workaround for those who can't re-compile is to
  set saferss to 127 which will guarantee the subtraction never goes
  negative and idle processes will eventually lose data pages (although
  not as quickly as with a fixed kernel and a smaller value of saferss).

      For those of you with source to Ultrix 4.1, context diffs of my
  changes are appended below.  I have manually deleted my monitoring
  changes from the diffs to make the functional changes clearer and the
  diffs about 90% smaller.  In addition, there are several things to note:
  first, that I applied the FORALLPROC change only to vm_sched.c by naming
  the changed file procNDC.h and changing the #include in vm_sched.c
  accordingly.  Second, the p++ which is uniformly changed to ++p is a
  vestige of earlier changes which have been removed.  The only lingering
  effect is to enlarge the context of this context-diff to include the
  entire macro (which is why I left it).  Third, even though the new
  FORALLPROC code is complicated, it maintains the desirable timing
  characteristics of the original since the new stuff only executes twice
  for each invocation of the FORALLPROC macro.

      One final note for kernel hacking purists: in vm_sched.c, where
  processes are swapped when RSS reaches zero, what I really want is to
  have them swap when RSS is zero AND the number of swapouts of the process
  is less than 2 AND the number of pageins is greater than zero.  I
  didn't trust myself to implement that change, since the per-process count
  of swapouts and pageins is in the u.u_ru structure and I was unsure how
  to access it properly.  As a result, our nfsd and biod processes
  aimlessly swap in and out, fortunately at negligible cost.

      We are now running with the following kernel parameters poked into
  /dev/kmem by my kmem program (appended somewhere below).

	lotsfree 128 -> 768	/* begin scanning for pages with 3meg free */
	desfree   64 -> 512	/* begin swapping sleeping procs at 2meg */
	coreyf1   xx -> 40	/* ... if sleeping longer than 40 seconds */
	minfree   28 -> 128	/* consider swapping running procs at .5meg */

  The elevated scan rate we needed when data pages didn't page seems to be
  unnecessary now.  Slow scanning on our system encounters plenty of stale
  pages to keep the free list replenished.

Results:

  1) Our system now almost never swaps jobs because of memory shortfall.
  2) The active real memory (displayed by vmstat -v) is about double
     what it was before -- about 25 megabytes -- with spikes as high
     as 37 megabytes.  We never saw such high numbers before.
  3) RSS of idle processes eventually reaches zero and these processes
     are then gently swapped out.
  4) Paging activity (scan rate, etc.) is much lower than before -- often
     the system goes for minutes with a scan rate of zero.
  5) Interactive response is good on FrameMaker and other medium size
     programs which were formerly frustratingly slow.
  6) Large programs (such as cc -O on perl version 3's eval.c where the
     optimizer grows to 20 megabytes) can run effectively -- cpu idle
     time drops to zero for the entire 2-minute optimizer run and
     interactive response in other applications remains good.
  7) The load average on the machine is somewhat lower and the spikes
     are considerably lower.
  8) We are considering increasing our file-system buffer-cache from
     the default 15% since most of our idle time now seems to be
     file-system related rather than virtual memory related.

--------
Corey Satten, corey@cac.washington.edu
Networks and Distributed Computing
University of Washington
(206)543-5611

The kmem program and context diffs follow:

*** /usr/src/Ultrix-4.1-RISC/sys/h/proc.h	Fri Jul  6 07:18:00 1990
--- /usr/src/Ultrix-4.1-RISC/sys/h/procNDC.h	Fri Apr  5 21:13:18 1991
***************
*** 398,434 ****
   * out of the For loop, and not one of the inner While loops
   */
  
! #define NEXTPROC	{ pp++; goto _a ; }
  
  #define FORALLPROC(X) {						\
! 	register unsigned long *_bp;				\
! 	register struct proc *pp = proc;			\
  	register unsigned long _mask;				\
  								\
  	/*							\
  	 * for the whole index into the table			\
  	 */							\
! 	for ( _bp = proc_bitmap;				\
! 		_bp < &proc_bitmap[max_proc_index] ; _bp++ ) {	\
  		/*						\
  		 * If any bits in this longword are used,	\
  		 * find the associated structures		\
  		 */						\
! 		if (_mask = *_bp) {				\
  			_a: if (_mask) {			\
  				_b: if ((_mask&1) == 0) {	\
  					_mask = _mask >> 1;	\
! 					pp++;			\
  					goto _b;		\
  				}				\
  				_mask = _mask >> 1;		\
  				{ X }				\
! 				pp++;				\
  				goto _a;			\
  			} else {				\
  				if (_mask = ((pp-proc)%32))	\
! 					pp += 32 - _mask;	\
  			}					\
  		} else pp += 32;				\
  	}							\
  }
--- 398,452 ----
   * out of the For loop, and not one of the inner While loops
   */
  
! #define NEXTPROC	{ ++pp; goto _a ; }
  
+ #define INC_bp(X,Y) (X < &proc_bitmap[max_proc_index-1] ? \
+ 			++X : (Y=proc, X=proc_bitmap))
+ 
  #define FORALLPROC(X) {						\
! 	static unsigned long *_bp = proc_bitmap;		\
! 	register struct proc *pp = proc + 32*(_bp-proc_bitmap);	\
! 	static struct proc *_opp = 0;				\
  	register unsigned long _mask;				\
+ 	register unsigned long *_bpe = _bp;			\
+ 	register int _more;					\
+ 	unsigned long _maskmask;				\
  								\
  	/*							\
  	 * for the whole index into the table			\
  	 */							\
! 	for (_more = 2; ; INC_bp(_bp,pp)) {			\
! 		if (_bp == _bpe)				\
! 			if (--_more) {				\
! 				int i = _opp-pp+1;		\
! 				_maskmask = ~0;			\
! 				if (i<32 && i>=0)		\
! 				    _maskmask <<= i;		\
! 				else _maskmask = 0;		\
! 				_mask = *_bp & _maskmask;	\
! 			} else _mask = *_bp & ~_maskmask;	\
! 		else _mask = *_bp;				\
  		/*						\
  		 * If any bits in this longword are used,	\
  		 * find the associated structures		\
  		 */						\
! 		if (_mask) {					\
  			_a: if (_mask) {			\
  				_b: if ((_mask&1) == 0) {	\
  					_mask = _mask >> 1;	\
! 					++pp;			\
  					goto _b;		\
  				}				\
  				_mask = _mask >> 1;		\
+ 				_opp = pp;			\
  				{ X }				\
! 				++pp;				\
  				goto _a;			\
  			} else {				\
  				if (_mask = ((pp-proc)%32))	\
! 					pp += 32-_mask;		\
  			}					\
  		} else pp += 32;				\
+ 	if (!_more) break;					\
  	}							\
  }


*** /usr/src/Ultrix-4.1-RISC/sys/vm/vm_page.c_orig	Tue Jul 17 12:30:21 1990
--- /usr/src/Ultrix-4.1-RISC/sys/vm/vm_page.c	Tue Apr 23 12:58:18 1991
***************
*** 274,279 ****
--- 274,283 ----
  int	nohash = 0;		/* turn on/off hashing */
  int	nobufcache = 1;		/* turn on/off buf cache for data */
  
+ /* symbols added for performance prodding at request of corey@cac */
+ int     coreyp1 = 0;            /* data does page out (stock is coreyp1 = 1) */
+ int     coreyp2 = 10;           /* data pageouts per second (was maxpgio/4) */
+ 
  extern int swapfrag;
  /*
   * Handle a page fault.
***************
*** 1318,1324 ****
  			(void) splx(s);
  			return(0);
  		}
! 		if ((rp->p_vm & (SSEQL|SUANOM)) == 0 &&
  		    rp->p_rssize <= rp->p_maxrss) {
  			smp_unlock(seg_lock);
  			smp_unlock(&lk_cmap);
--- 1322,1328 ----
  			(void) splx(s);
  			return(0);
  		}
! 		if ((rp->p_vm & (SSEQL|SUANOM)) == 0 && coreyp1 &&
  		    rp->p_rssize <= rp->p_maxrss) {
  			smp_unlock(seg_lock);
  			smp_unlock(&lk_cmap);
***************
*** 1332,1338 ****
  		 * Guarantee a minimal investment in data
  		 * space for jobs in balance set.
  		 */
! 		if (rp->p_rssize < saferss - rp->p_slptime) {
  			smp_unlock(&lk_p_vm);
  			smp_unlock(&lk_cmap);
  			(void) splx(s);
--- 1336,1342 ----
  		 * Guarantee a minimal investment in data
  		 * space for jobs in balance set.
  		 */
! 		if ((long)rp->p_rssize < saferss - rp->p_slptime) {
  			smp_unlock(&lk_p_vm);
  			smp_unlock(&lk_cmap);
  			(void) splx(s);
***************
*** 1371,1377 ****
  		 * Limit pushes to avoid saturating
  		 * pageout device.
  		 */
! 		    (pushes > maxpgio / 4)) {
  			if (seg_lock != &lk_p_vm)
  				smp_unlock(&lk_p_vm);
  			smp_unlock(seg_lock);
--- 1375,1381 ----
  		 * Limit pushes to avoid saturating
  		 * pageout device.
  		 */
! 		    (pushes > coreyp2 /* was maxpgio / 4 */)) {
  			if (seg_lock != &lk_p_vm)
  				smp_unlock(&lk_p_vm);
  			smp_unlock(seg_lock);



*** /usr/src/Ultrix-4.1-RISC/sys/vm/vm_sched.c_orig	Fri Jul  6 06:41:49 1990
--- /usr/src/Ultrix-4.1-RISC/sys/vm/vm_sched.c		Thu Apr 25 10:12:23 1991
***************
*** 103,109 ****
  #include "../h/seg.h"
  #include "../h/dir.h"
  #include "../h/user.h"
! #include "../h/proc.h"
  #include "../h/text.h"
  #include "../h/vm.h"
  #include "../h/cmap.h"
--- 103,109 ----
  #include "../h/seg.h"
  #include "../h/dir.h"
  #include "../h/user.h"
! #include "../h/procNDC.h"
  #include "../h/text.h"
  #include "../h/vm.h"
  #include "../h/cmap.h"
***************
*** 143,148 ****
--- 143,154 ----
  int     minfree = 0;
  int     desfree = 0;
  int	lotsfree= 0;
+ 
+ /* symbols added for performance prodding at request of corey@cac */
+ int     swload = TO_FIX(2);
+ int	coreyf0 = 1;		/* goto loop after swapout */
+ int	coreyf1 = 20;		/* softswap enabled after this many seconds */
+ int	coreyf4 = 1;		/* don't swap when RSS=0 */
  #endif mips
  
  #ifdef vax
***************
*** 291,297 ****
  	    (avenrun[0] >= 2 && imax(avefree, avefree30) < desfree &&
  #endif vax
  #ifdef mips
!             (avenrun[0] >= TO_FIX(2) && imax(avefree, avefree30) < desfree &&
  #endif mips
  	    (rate.v_pgin + rate.v_pgout > maxpgio || avefree < minfree))) {
  		desperate = 1;
--- 297,307 ----
  	    (avenrun[0] >= 2 && imax(avefree, avefree30) < desfree &&
  #endif vax
  #ifdef mips
! 	    /* 
! 	     * symbol "swload" added for performance prodding at
! 	     * request of corey@cac 26 Feb 91
! 	     */
!             (avenrun[0] >= swload && imax(avefree, avefree30) < desfree &&
  #endif mips
  	    (rate.v_pgin + rate.v_pgout > maxpgio || avefree < minfree))) {
  		desperate = 1;
***************
*** 340,353 ****
  
  	case SSLEEP:
  	case SSTOP:
! 		if ((freemem < desfree || pp->p_rssize == 0) &&
! 		    pp->p_slptime > maxslp &&
  		   (!pp->p_textp || (pp->p_textp->x_flag&(XLOCK|XNOSW))==0) &&
  		    swappable(pp)) {
  			/*
  			 * Kick out deadwood.
  			 */
  
  			pp->p_sched &= ~SLOAD;
  
  			smp_unlock(&lk_rq);
--- 350,366 ----
  
  	case SSLEEP:
  	case SSTOP:
! 		if (coreyf1 &&
! 		   (freemem < desfree || (pp->p_rssize == 0 && coreyf4)) &&
! 		    pp->p_slptime > coreyf1 /* was maxslp */ &&
  		   (!pp->p_textp || (pp->p_textp->x_flag&(XLOCK|XNOSW))==0) &&
  		    swappable(pp)) {
+ 			int breakout;
  			/*
  			 * Kick out deadwood.
  			 */
  
+ 			breakout = pp->p_rssize ? 1 : 1-coreyf4;
  			pp->p_sched &= ~SLOAD;
  
  			smp_unlock(&lk_rq);
***************
*** 357,363 ****
  			goto loop;
  #endif vax
  #ifdef mips
! 			NEXTPROC;
  #endif mips
  		} 
  	        smp_unlock(&lk_rq);
--- 370,381 ----
  			goto loop;
  #endif vax
  #ifdef mips
! 			if (coreyf0 && breakout) {
! 			    goto loop;
! 			    }
! 			else {
! 			    NEXTPROC;
! 			    }
  #endif mips
  		} 
  	        smp_unlock(&lk_rq);
***************
*** 540,545 ****
--- 558,564 ----
  			if (sleeper < pp->p_slptime) {
  				p = pp;
  				sleeper = pp->p_slptime;
+ 				if (sleeper == 127) return(p);	/* Corey */
  			}
  		} else if (!sleeper && (pp->p_stat==SRUN||pp->p_stat==SSLEEP)) {
  			rppri = pp->p_rssize;



: ----- cut here ----- cut here ----- cut here ----- cut here -----
: This is a "shell archive".  Save everything after the cut mark
: in a file called thisstuff, then feed it to sh by typing sh thisstuff.
: SHAR archive format.  Archive created Thu Apr 25 10:26:03 PDT 1991
echo x - kmem.c
echo '-rw-r--r--  1 corey       3925 Apr 25 10:24 kmem.c    (as sent)'
sed 's/^-//' >kmem.c <<'+FUNKY+STUFF+'
-/*
- * a tool to use in place of adb (on systems without adb) which lets you
- * peek and poke at the values of kernel variables in /dev/kmem
- *
- * usage:	kmem [-s#] var1 var2 ... varN
- *  or
- * usage:	kmem -w var1=val1 var2=val2 ... varN=valN
- *
- * If -s# is given, loop every # seconds and repeat.  This is handy for
- * watching variables like freemem or debugging flags.  The following simple
- * awk script can postprocess the output filtering out values which don't
- * change:
- *		{ 	if (NF > 2) date = $0
- *			else if (seen[$1] != $2) {
- *			    seen[$1] = $2
- *			    if (date != "") {
- *				print ""; print date
- *				date = ""
- *				}
- *			    print
- *			    }
- * 		 }
- *
- * Corey Satten, corey@cac.washington.edu, 9/6/90 - Ultrix 4.0 version
- */
-#include<stdio.h>
-#include<nlist.h>
-#include<sys/file.h>
-#include <time.h>
-
-struct nlist *nl;		/* how we find locations of names */
-int *nv;			/* the new values for each name */
-int w_flag = 0;			/* write new values? */
-char *file = "/vmunix";		/* default file to read symbols from */
-int kmem;
-
-main(argc, argv)
-    int argc;
-    char *argv[];
-{
-    int f;			/* walks argv upto index of first non-flag */
-    int i;			/* walks through remaining arguments */
-    int value = 0;
-    int rc = 0;
-    int sleeptime = 0;		/* if set nonzero with -s#, repeat every # secs */
-
-    /*
-     * flag parsing
-     */
-    for (f=1; f<argc && *(argv[f]) == '-'; ++f) {
-	switch(argv[f][1]) {
-	default:
-	    fprintf(stderr, "%s: unknown flag -%c\n", argv[0], argv[f][1]);
-	    exit(1);
-	case 'w':
-	    w_flag = 1;
-	    break;
-	case 's':
-	    sscanf(argv[f][2] ? argv[f]+2 : argv[++f], "%d", &sleeptime);
-	    break;
-	case 'f':
-	    file = argv[++f];
-	    break;
-	}
-    }
-
-    /*
-     * handle the remaining arguments as either symname or symname=value
-     * depending on whether -w (w_flag) was specified.
-     */
-
-    nl = (struct nlist *) malloc( sizeof(*nl) * (argc-f+1) );
-    nv = (int *) malloc( sizeof(int) * (argc-f+1) );
-    if (!nv || !nl) {perror("malloc"); exit(1);};
-
-    for (i=0; i<argc-f; ++i) {
-	char *name = (char *)malloc(strlen(argv[i+f])+1);
-
-	if (!name) {perror("malloc"); exit(1);};
-	rc = sscanf(argv[i+f], "%[^=]=%d", name, &value);
-	if (rc - w_flag != 1) {
-	    fprintf(stderr, "%s: bad argument: %s\n", argv[0], argv[i+f]);
-	    exit(1);
-	    }
-	nl[i].n_name = name;
-	nv[i] = value;
-	}
-    nl[i].n_name = "";
-
-    /*
-     * now figure out where to read/write in /dev/kmem and do it
-     */
-    
-    nlist(file, nl);
-
-    kmem = open("/dev/kmem", w_flag ? O_RDWR : O_RDONLY);
-    if (kmem < 0) {
-	perror("/dev/kmem open");
-	exit(1);
-	}
-
-    sleeploop:
-
-	if (sleeptime) {
-	    putchar('\n'); date();
-	    }
-
-	for (i=0; i<argc-f; ++i) {
-	    long seekto = (long)nl[i].n_value;
-
-	    if (nl[i].n_type == 0) {
-		fprintf(stderr, "%s: symbol `%s' not found in namelist of %s\n",
-		    argv[0], nl[i].n_name, file);
-	    /*
-	     *  We promise to do all writes in command line order, so if one
-	     *  is going to fail, we'd best bail out rather than continue.
-	     */
-		if (w_flag) exit(2);
-		else	continue;
-		}
-	    if ( lseek(kmem, seekto, 0) != seekto ) {
-		perror("/dev/kmem lseek"); exit(2);
-		}
-	    if ( read(kmem, &value, sizeof(int)) != sizeof(int) ) {
-		perror("/dev/kmem read"); exit(2);
-		}
-
-	    printf("%s(0x%x)\t%d", nl[i].n_name, nl[i].n_value, value);
-
-	    if (w_flag) {
-		if ( lseek(kmem, seekto, 0) != seekto ) {
-		    perror("/dev/kmem lseek"); exit(2);
-		    }
-		value = nv[i];
-		printf(" -> %d", value);
-		if ( write(kmem, &value, sizeof(int)) != sizeof(int) ) {
-		    perror("/dev/kmem write"); exit(2);
-		    }
-		}
-	    putchar('\n');
-	    }
-
-    if (sleeptime) {
-	fflush(stdout);
-	sleep(sleeptime);
-	goto sleeploop;
-	}
-}
-/*
- * print ascii version of current date and time
- */
-date() {
-    char *at;
-    static char db[30];
-    int i;
-
-    time(&i);
-    at = asctime(localtime(&i));
-    strcpy(db, at+4);
-    db[20] = 0;
-    puts(db);
-    }
+FUNKY+STUFF+
chmod u=rw,g=r,o=r kmem.c
ls -l kmem.c
exit 0

torek@elf.ee.lbl.gov (Chris Torek) (05/02/91)

In article <1991Apr30.160331.16215@milton.u.washington.edu>
corey@milton.u.washington.edu (Corey Satten) writes:
>(This is cross-posted to unix-wizards because it may also apply to 4.3BSD)

Just a quick note:  It seems unlikely to, as the code being changed is
almost completely different in 4.3BSD.  In particular, the swapout code
picks the five largest processes, not the `first' set of processes that
will satisfy the immediate demands.
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

corey@milton.u.washington.edu (Corey Satten) (05/02/91)

In article <12714@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>In article <1991Apr30.160331.16215@milton.u.washington.edu>
>corey@milton.u.washington.edu (Corey Satten) writes:
>>(This is cross-posted to unix-wizards because it may also apply to 4.3BSD)
>
>Just a quick note:  It seems unlikely to, as the code being changed is
>almost completely different in 4.3BSD.  In particular, the swapout code
>picks the five largest processes, not the `first' set of processes that
>will satisfy the immediate demands.

Oh dear, I can't believe I'm about to take issue with Chris Torek.  This
is indeed a rare day.  I hope I don't blow it.

Chris, I find almost identical code in 4.3BSD (tahoe I think).  In file
vm_page.c the lines which prevent data from paging are almost identical
(but without DEC's SMP locking).  In file vm_sched.c the code near the
comment about swapping out deadwood is almost identical with Ultrix and
on our system, that was the code which was doing almost all of the
swapping (though possibly because I elevated the values of lotsfree and
desfree and scan rate to improve performance in round 1).  I think you
are talking about the "hardswap" code (when desperate == 1), which, on
our system, turned out to rarely execute.  In any case, even if
hardswap does happen, it still looks like it doesn't use the largest 5
processes (as the comment would indicate) if there are any processes
which have been sleeping longer than maxslp (20 seconds).  Instead, I think
it uses the one closest to the beginning of the process table which has
been sleeping the longest.  I think longest maxes out at 127 seconds so
with our process mix, that means it would usually swap the first
few processes which have been sleeping at least 127 seconds.

Anyway, as I said, I haven't measured a BSD system and I didn't study
hardswap as closely as the other since it almost never happened on our
system, so I only say it may apply because the code looks to me very
similar to the Ultrix code I have become rather familiar with.
Furthermore, now that we are paging out data, we aren't swapping
processes with RSS>0 at all so I think the paging part of the fix may
be more important than the swapping part anyway.

--------
Corey Satten, corey@cac.washington.edu
Networks and Distributed Computing
University of Washington
(206)543-5611

torek@elf.ee.lbl.gov (Chris Torek) (05/03/91)

In article <1991May2.052140.27048@milton.u.washington.edu>
corey@milton.u.washington.edu (Corey Satten) writes:
>Oh dear, I can't believe I'm about to take issue with Chris Torek.

Well, I *do* goof sometimes :-)

>Chris, I find almost identical code in 4.3BSD (tahoe I think).  In file
>vm_page.c the lines which prevent data from paging are almost identical

I was only talking about your second posting, which I thought dealt only
with the swap code.  I presume you mean these three lines:

	if (c->c_type != CTEXT) {
		if (rp->p_rssize < saferss - rp->p_slptime)
			return (0);
	}

Since p_rssize is in units of `core clicks' (512 or 1024 bytes) and
saferss is 32 and p_slptime is in [0..127] and all of the numbers
are signed (I checked :-) ), this should only affect processes with
less than 16 or 32 KB of resident set size, and once they have been
asleep for 32 seconds, it should not affect them at all (since rssize
will always be >= 0 and the rhs of the compare will be <= 0).

>In file vm_sched.c the code near the comment about swapping out
>deadwood is almost identical with Ultrix and on our system, that
>was the code which was doing almost all of the swapping

Aha!  This code is supposed to run `almost never', as far as I can
tell.  The idea is:

	if we think we need memory, or if the process is already out; and
	if this process has not run for `a long time' (20 seconds); and
	if nothing funny is going on with the text or proc;
	throw it out.  (See below as to why already-out means anything)

Hence `freemem < desfree' rather than `freemem < lotsfree': the pageout
daemon is supposed to keep free memory in the range [desfree..lotsfree)
under normal load.

>(though possibly because I elevated the values of lotsfree and
>desfree and scan rate to improve performance in round 1).  I think you
>are talking about the "hardswap" code (when desperate == 1), which on
>our system, turned out to rarely execute.

I was.  (I believed the comment rather than trying to figure out the
code.  This code is all gone in the new Mach-based VM anyway.)

>Furthermore, now that we are paging out data, we aren't swapping
>processes with RSS>0 at all so I think the paging part of the fix may
>be more important than the swapping part anyway.

It is.  The swap code in the old BSD VM is only supposed to fire off
in a few special cases:

	- very low on memory, and pageout daemon cannot keep up (this
	  is the `hardswap' case);
	- expansion swaps (need space between p0 and p1 and the usual
	  easy expansion failed);
	- process has already paged out entirely, and the UPAGES pages
	  of u. might help, so kick out its u. as well (this is one of
	  the `deadwood' cases)---since UPAGES is 16 (*1024 bytes) this
	  is not really very profitable;
	- very low on memory (freemem < desfree) and process has been
	  idle for some time (this is the other `deadwood' case);
	- `kernelmap' has become fragmented (need contiguous pte
	  pages): we swap like crazy just to defragment it (horrible,
	  but rare).

Anyway, now that I am looking at the hardswap code, I think you are
right:  the code iterates through the whole proc table (never stops)
but only accumulates `big' processes if it does not find a `sleeper'
(something sleeping for > 20 seconds).  Generally there is always at
least one such, and it takes that one and then starts the whole thing
over (as you described).  If the paging system has done its job,
however, this will gain little and after a dozen or so sleepers
have been swapped out, the `big' process code will fire.

Incidentally, it is not surprising that the code worked poorly on
DECstations: it is tuned for machines on which the CPU is considerably
slower than the I/O, rather than the other way around.  On the 780,
it was often better to `work hard' than to `work smart'....
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

corey@milton.u.washington.edu (Corey Satten) (05/03/91)

In article <12759@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>
>I presume you mean these three lines:
>
>	if (c->c_type != CTEXT) {
>		if (rp->p_rssize < saferss - rp->p_slptime)
>			return (0);
>	}
>
>Since p_rssize is in units of `core clicks' (512 or 1024 bytes) and
>saferss is 32 and p_slptime is in [0..127] and all of the numbers
>are signed (I checked :-) ), this should only affect processes with
>less than 16 or 32 KB of resident set size, and once they have been
>asleep for 32 seconds, it should not affect them at all (since rssize
>will always be >= 0 and the rhs of the compare will be <= 0).

Yes, that's the place which is OK in BSD, but DEC changed p_rssize to
unsigned, so it would have worked only until the right side became negative;
however, I believe that code never executed on Ultrix or BSD because of
the code a few lines before (BSD code fragment):

		if ((rp->p_flag & (SSEQL|SUANOM)) == 0 &&
		    rp->p_rssize <= rp->p_maxrss)
			return (0);

which the front hand does for valid pages.  I think this means that unless
the process has executed a vadvise() to warn of sequential or anomalous
paging behavior, the front hand never invalidates data pages.  But I need
to back off a little here and remind readers that this is what I discovered
happens on Ultrix and the code looks very similar but I haven't experimented
with BSD.

>Anyway, now that I am looking at the hardswap code, I think you are
>right:  the code iterates through the whole proc table (never stops)
>but only accumulates `big' processes if it does not find a `sleeper'
>(something sleeping for > 20 seconds).  Generally there is always at
>least one such, and it takes that one and then starts the whole thing
>over (as you described).  If the paging system has done its job,
>however, this will gain little and after a dozen or so sleepers
>have been swapped out, the `big' process code will fire.

So on our system, the paging system was not doing its job because it was
only paging out what little text it could find between lots of stale data.
We also had enough processes in the process table that there were always
enough sleepers at the beginning of the table to swap out and in, and
the results were not pretty.

Chris, I really thank you for taking the time to look this over.  I hope,
as you say, it is already done right in the new code.

--------
Corey Satten, corey@cac.washington.edu
Networks and Distributed Computing
University of Washington
(206)543-5611

torek@elf.ee.lbl.gov (Chris Torek) (05/04/91)

In article <1991May2.231911.23612@milton.u.washington.edu>
corey@milton.u.washington.edu (Corey Satten) writes:
>however I believe that code never executed on Ultrix or BSD because of
>the code a few lines before (BSD code fragment):
>
>		if ((rp->p_flag & (SSEQL|SUANOM)) == 0 &&
>		    rp->p_rssize <= rp->p_maxrss)
>			return (0);
>
>which the front hand does for valid pages.  I think this means that unless
>the process has executed a vadvise() to warn of sequential or anomalous
>paging behavior, the front hand never invalidates data pages.

Not in the old BSD kernel.  The code path is, for the front hand (on
the VAX):

	/*
	 * `page cluster' info is generally treated as `bits in the
	 * first pte mapping a page in the cluster', hence the `mark first'
	 * code below.
	 */
	if (page cluster is valid) {
		mark page cluster invalid;
		mark process SPTECHG;
		if (any page in the cluster is modified)
			mark the first page modified;
		make all pages in the cluster look like the first one;
		if (it is a text page cluster)
			let other users know about changes;
		if (process is normal)
			we are done, return 0;
	}

The SEQL and SUANOM cases, and the rssize > maxrss case, are where
the process should not be paged in LRU fashion but rather in `almost
MRU'.  In this case, if the page was valid and we made it not-valid
we also try to page it out.  This is not quite right---we should
not be paging out text pages as the vadvise call is for data, not
text---but is probably `good enough'.

On the Tahoe, which has a reference bit, the valid bit and `fast
reclaim' stuff is unnecessary, and the code path looks like:

	if (this is a text page cluster)
		if (any process using it has referenced it)
			mark the first page as referenced;
	if (the page cluster has been referenced) {
		mark page cluster not-referenced;
		if (any page in the cluster is modified)
			mark the first page modified;
		make all pages in the cluster look like the first one;
		if (it is a text page cluster)
			let other users know about changes;
		if (process is normal)
			we are done, return 0;
	}

The back hand, of course, never looks at valid/referenced pages at
all.  Thus, the code in `>' above is meant only to cause the front hand
to do pageouts.  For most processes, the front hand merely paves the
way for the back hand to do pageouts.  Imagine a wall clock with both
hands moving at the same speed: the time it takes for the back hand to
pass the same spot as the front hand is the time a process is given to
`reclaim' a page.  We take it away, but leave it in memory, and if you
ask for it before the back hand gets around to it, you get to keep it.
If not, we dust off the page (write it to swap) if it is dirty, and
then put it in the `clean' pile (free pages).

For processes that have done a vadvise(SSEQL) or (SUANOM), we presume
instead that `recently used' means `unlikely to be reused', so in this
case we have the front hand dust it off instead---if you just used it,
we take it away.

The planned 4.2BSD `new VM' (which is only now being implemented by
beating the Mach VM into a different shape) had an `madvise' call which
was intended to mark anomalous or sequential behaviour on a per-region
or per-page basis, rather than per-process.  In the meantime the old
vadvise call was deprecated, but it lives on. . . .

Presumably someone broke this in the DEC MIPS port.  The MIPS chip does
not have PTEs, so PTEs must be done in software, so you can define your
own used/modified/ref'd/etc bits.  This is much easier in the new
Mach-based VM, where the responsibility for hardware management is in
a separate file.
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

corey@milton.u.washington.edu (Corey Satten) (05/07/91)

In article <12792@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>In article <1991May2.231911.23612@milton.u.washington.edu>
>corey@milton.u.washington.edu (Corey Satten) writes:
>>however I believe that code never executed on Ultrix or BSD because of
>>the code a few lines before (BSD code fragment):
>>
>>		if ((rp->p_flag & (SSEQL|SUANOM)) == 0 &&
>>		    rp->p_rssize <= rp->p_maxrss)
>>			return (0);
>>
>>which the front hand does for valid pages.  I think this means that unless
>>the process has executed a vadvise() to warn of sequential or anomalous
>>paging behavior, the front hand never invalidates data pages.
>
>Not in the old BSD kernel.  The code path is, for the front hand (on
>the VAX):

Chris, I stand corrected for both BSD *and* for Ultrix.  The fragment where
I force all processes to behave as if SEQL or SUANOM was set is (as you
point out) not required to get data pages to page out -- so on Ultrix, only
the signed comparison fix is really required in vm_page.c.  However...

On our Ultrix system, since I have the SUANOM test anded with a global I
can poke while the system is running, I can experiment with turning it on
and off.  I find that without my (technically unnecessary) "fix", our
system can't free enough pages to keep from swapping continuously unless
the scan rate is increased very substantially, and on one of our smaller,
more memory starved systems, the highest scan rate achievable (by poking
fastscan=1 and slowscan=2) is insufficient.  I'm still unsure exactly how
to interpret this (especially in light of the fact that the number of
dirty pages written to disk is limited to a rather small number per second),
but now that I've experimentally confirmed this, I did want to correct my
earlier posting before everyone forgets what we're talking about.

Thanks again for taking the time to look this over with me.

--------
Corey Satten, corey@cac.washington.edu
Networks and Distributed Computing
University of Washington
(206)543-5611