corey@milton.u.washington.edu (Corey Satten) (04/30/91)
(This is cross-posted to unix-wizards because it may also apply to 4.3BSD)

Performance Tuning a DEC Ultrix 4.1 Workstation
Round 2

Corey Satten, corey@cac.washington.edu
Networks and Distributed Computing
University of Washington
Seattle, Washington
April 1991

This is a follow-up to work first posted in September 1990.

History:

Our department is using a rather maximally configured DECstation as a time-sharing host. It is a DEC 5000 running Ultrix 4.1 and has six disks, mostly 660 meg or 1 gigabyte. It serves /usr/local/bin via NFS to about a dozen workstations; talks to several printers; is the departmental electronic mail machine; hosts some campus wide mailing lists; is our anonymous FTP server; is one of two campus default domain nameservers; and is also the time-sharing host for about 20 X-terminals plus a dozen or more other users connected via telnet. We are supporting about 250 megs of swap space on roughly 43 megabytes of the 56 megabyte physical memory. A 'ps aux' listing usually has 400+ processes in it.

In my September posting, I described how tweaking some global variables in the kernel allowed us to improve performance by paging more and swapping less and maintaining a larger chunk of free memory. Several people on campus and in netland followed our lead and reported similar improvements. (Global kernel variables can be conveniently tweaked on a running system with the kmem program included with this posting.)

Our system, thus tweaked, spoiled us by the times it was fast and frustrated us by the times it wasn't. Occasionally we had reports of very large (20+ second) character echo delays experienced by one user while others in the same environment running the same programs saw no delays. Furthermore, large programs tended to lose too many pages to run satisfactorily. These continuing problems, plus my gut feeling that 56 megabytes should really be enough memory for what we're doing, motivated me to continue investigating.
Current Work:

Close examination of our system revealed that we had some processes which had not run in a very long time (perhaps days) but which still had a significant RSS. Scrutiny of the source combined with some careful experiments led me to discover that data pages never page out under normal circumstances even though all the complicated code to do so is there. Judging from the code, this was a conscious decision made in BSD Unix.

Modern Unix systems running X windows tend to have more idle processes than ever before and those processes tend to have larger data spaces than their non-windowing ancestors. Thus, not paging data pages causes memory to be tied up with junk which only swapping can remove. On our system, I estimated perhaps 20 megabytes fell into this category.

So why don't these data pages eventually swap out? When free memory becomes less than "desfree", the kernel looks for processes which have been sleeping longer than "maxslp" (usually 20 seconds) to swap out. To my horror, I discovered that it always starts looking at the beginning of the process table and stops when it has satisfied the need for free memory. A quick modification to "ps" to print the processes in process-table order and also print the number of times each had swapped confirmed this. No process in the last half of our process table had ever swapped and those at the front had swapped a lot. The extreme prejudice which is directed against processes living at the front of the table could well explain the horrible performance some users occasionally reported that the rest of us didn't see.

In order to completely rectify both the swapping and paging problems described above, kernel changes are required. However, I believe fixing the data paging problem has the biggest effect, and later I will describe a partial workaround to the data paging problem which may work for those of you who can't change your kernels.
(You might also try asking your DEC rep to supply these changes in binary form.)

To fix the swapping problem, I modified the FORALLPROC macro used by vm_sched.c to begin swapping where it last left off so, in the long run, process table position has no predictable effect on swapping and long time sleepers will eventually swap out. To achieve this, I also needed to make other small changes to vm_sched.c.

To fix the data paging problem I simply changed the two places which prevented data paging in vm_page.c. In both vm_sched.c and vm_page.c, I inserted global variables which I can set and test at runtime to experiment.

Since data pages can be more expensive to page out than text pages, BSD and Ultrix have 2 limits on data pageouts.

1) only maxpgio/4 data pages per second will pageout.
2) data pages are only paged out if the process they belong to has an RSS > (saferss - sleeptime) where sleeptime is the number of seconds since the process has run.

Our system seems to be doing fine with these defaults.

If you can't change your kernel but you still want to try to get data pages to page out you may be able to use the following trick to force all processes to exceed their soft memory limit. Rename /etc/init to /etc/init.orig and replace /etc/init with the following trivial program:

    #include <sys/time.h>
    #include <sys/resource.h>

    main(argc, argv, envp)
        char *argv[], *envp[];
        int argc;
    {
        struct rlimit rlp;

        getrlimit(RLIMIT_RSS,&rlp);
        rlp.rlim_cur = 1;       /* zero may be better if it works */
        setrlimit(RLIMIT_RSS,&rlp);
        execve("/etc/init.orig", argv, envp);
    }

This will effectively nice them all (which shouldn't matter since it happens to them all uniformly) and allow their data pages to be paged out as long as the resident set size is greater than the value of "saferss" (6 pages) minus the idle time of the process.
Unfortunately, DEC changed p_rssize from signed long to unsigned long so as soon as a process has been idle longer than saferss seconds the data pages stop paging out again. The workaround for those who can't re-compile is to set saferss to 127 which will guarantee the subtraction never goes negative and idle processes will eventually lose data pages (although not as quickly as with a fixed kernel and a smaller value of saferss).

For those of you with source to Ultrix 4.1, context diffs of my changes are appended below. I have manually deleted my monitoring changes from the diffs to make the functional changes clearer and the diffs about 90% smaller. In addition, there are several things to note:

First, I applied the FORALLPROC change only to vm_sched.c by naming the changed file procNDC.h and changing the #include in vm_sched.c accordingly.

Second, the p++ which is uniformly changed to ++p is a vestige of earlier changes which have been removed. The only lingering effect is to enlarge the context of this context-diff to include the entire macro (which is why I left it).

Third, even though the new FORALLPROC code is complicated, it maintains the desirable timing characteristics of the original since the new stuff only executes twice for each invocation of the FORALLPROC macro.

One final note for kernel hacking purists: in vm_sched.c where processes are swapped when RSS reaches zero, what I really want is to have them swap when RSS is zero AND the number of swapouts of the process is less than 2 AND the number of pageins is greater than zero, but I didn't trust myself to implement that change since the per process count of swapouts and pageins is in the u.u_ru structure and I was unsure how to access it properly. As a result, our nfsd and biod processes aimlessly swap in and out, fortunately at negligible cost.

We are now running with the following kernel parameters poked into /dev/kmem by my kmem program (appended somewhere below).

    lotsfree 128 -> 768   /* begin scanning for pages with 3meg free */
    desfree   64 -> 512   /* begin swapping sleeping procs at 2meg */
    coreyf1   xx ->  40   /* ... if sleeping longer than 40 seconds */
    minfree   28 -> 128   /* consider swapping running procs at .5meg */

The elevated scan rate we needed when data pages didn't page seems to be unnecessary now. Slow scanning on our system encounters plenty of stale pages to keep the free list replenished.

Results:

1) Our system now almost never swaps jobs because of memory shortfall.
2) The active real memory (displayed by vmstat -v) is about double what it was before -- about 25 megabytes -- with spikes as high as 37 megabytes. We never saw such high numbers before.
3) RSS of idle processes eventually reaches zero and these processes are then gently swapped out.
4) Paging activity (scan rate, etc.) is much lower than before -- often the system goes for minutes with a scan rate of zero.
5) Interactive response is good on FrameMaker and other medium size programs which were formerly frustratingly slow.
6) Large programs (such as cc -O on perl version 3's eval.c where the optimizer grows to 20 megabytes) can run effectively -- cpu idle time drops to zero for the entire 2-minute optimizer run and interactive response in other applications remains good.
7) The load average on the machine is somewhat lower and the spikes are considerably lower.
8) We are considering increasing our file-system buffer-cache from the default 15% since most of our idle time now seems to be file-system related rather than virtual memory related.

--------
Corey Satten, corey@cac.washington.edu
Networks and Distributed Computing
University of Washington
(206)543-5611

The kmem program and context diffs follow:

*** /usr/src/Ultrix-4.1-RISC/sys/h/proc.h Fri Jul 6 07:18:00 1990
--- /usr/src/Ultrix-4.1-RISC/sys/h/procNDC.h Fri Apr 5 21:13:18 1991
***************
*** 398,434 ****
   * out of the For loop, and not one of the inner While loops
   */
  
! #define NEXTPROC { pp++; goto _a ; }
  
  #define FORALLPROC(X) { \
!     register unsigned long *_bp; \
!     register struct proc *pp = proc; \
      register unsigned long _mask; \
      \
      /* \
       * for the whole index into the table \
       */ \
!     for ( _bp = proc_bitmap; \
!           _bp < &proc_bitmap[max_proc_index] ; _bp++ ) { \
          /* \
           * If any bits in this longword are used, \
           * find the associated structures \
           */ \
!         if (_mask = *_bp) { \
  _a:         if (_mask) { \
  _b:             if ((_mask&1) == 0) { \
                      _mask = _mask >> 1; \
!                     pp++; \
                      goto _b; \
                  } \
                  _mask = _mask >> 1; \
                  { X } \
!                 pp++; \
                  goto _a; \
              } else { \
                  if (_mask = ((pp-proc)%32)) \
!                     pp += 32 - _mask; \
              } \
          } else pp += 32; \
      } \
  }
--- 398,452 ----
   * out of the For loop, and not one of the inner While loops
   */
  
! #define NEXTPROC { ++pp; goto _a ; }
  
+ #define INC_bp(X,Y) (X < &proc_bitmap[max_proc_index-1] ? \
+         ++X : (Y=proc, X=proc_bitmap))
+ 
  #define FORALLPROC(X) { \
!     static unsigned long *_bp = proc_bitmap; \
!     register struct proc *pp = proc + 32*(_bp-proc_bitmap); \
!     static struct proc *_opp = 0; \
      register unsigned long _mask; \
+     register unsigned long *_bpe = _bp; \
+     register int _more; \
+     unsigned long _maskmask; \
      \
      /* \
       * for the whole index into the table \
       */ \
!     for (_more = 2; ; INC_bp(_bp,pp)) { \
!         if (_bp == _bpe) \
!             if (--_more) { \
!                 int i = _opp-pp+1; \
!                 _maskmask = ~0; \
!                 if (i<32 && i>=0) \
!                     _maskmask <<= i; \
!                 else _maskmask = 0; \
!                 _mask = *_bp & _maskmask; \
!             } else _mask = *_bp & ~_maskmask; \
!         else _mask = *_bp; \
          /* \
           * If any bits in this longword are used, \
           * find the associated structures \
           */ \
!         if (_mask) { \
  _a:         if (_mask) { \
  _b:             if ((_mask&1) == 0) { \
                      _mask = _mask >> 1; \
!                     ++pp; \
                      goto _b; \
                  } \
                  _mask = _mask >> 1; \
+                 _opp = pp; \
                  { X } \
!                 ++pp; \
                  goto _a; \
              } else { \
                  if (_mask = ((pp-proc)%32)) \
!                     pp += 32-_mask; \
              } \
          } else pp += 32; \
+         if (!_more) break; \
      } \
  }
*** /usr/src/Ultrix-4.1-RISC/sys/vm/vm_page.c_orig Tue Jul 17 12:30:21 1990
--- /usr/src/Ultrix-4.1-RISC/sys/vm/vm_page.c Tue Apr 23 12:58:18 1991
***************
*** 274,279 ****
--- 274,283 ----
  int nohash = 0;         /* turn on/off hashing */
  int nobufcache = 1;     /* turn on/off buf cache for data */
  
+ /* symbols added for performance prodding at request of corey@cac */
+ int coreyp1 = 0;        /* data does page out (stock is coreyp1 = 1) */
+ int coreyp2 = 10;       /* data pageouts per second (was maxpgio/4) */
+ 
  extern int swapfrag;
  /*
   * Handle a page fault.
***************
*** 1318,1324 ****
              (void) splx(s);
              return(0);
          }
!         if ((rp->p_vm & (SSEQL|SUANOM)) == 0 &&
              rp->p_rssize <= rp->p_maxrss) {
              smp_unlock(seg_lock);
              smp_unlock(&lk_cmap);
--- 1322,1328 ----
              (void) splx(s);
              return(0);
          }
!         if ((rp->p_vm & (SSEQL|SUANOM)) == 0 && coreyp1 &&
              rp->p_rssize <= rp->p_maxrss) {
              smp_unlock(seg_lock);
              smp_unlock(&lk_cmap);
***************
*** 1332,1338 ****
               * Guarantee a minimal investment in data
               * space for jobs in balance set.
               */
!             if (rp->p_rssize < saferss - rp->p_slptime) {
                  smp_unlock(&lk_p_vm);
                  smp_unlock(&lk_cmap);
                  (void) splx(s);
--- 1336,1342 ----
               * Guarantee a minimal investment in data
               * space for jobs in balance set.
               */
!             if ((long)rp->p_rssize < saferss - rp->p_slptime) {
                  smp_unlock(&lk_p_vm);
                  smp_unlock(&lk_cmap);
                  (void) splx(s);
***************
*** 1371,1377 ****
               * Limit pushes to avoid saturating
               * pageout device.
               */
!             (pushes > maxpgio / 4)) {
              if (seg_lock != &lk_p_vm)
                  smp_unlock(&lk_p_vm);
              smp_unlock(seg_lock);
--- 1375,1381 ----
               * Limit pushes to avoid saturating
               * pageout device.
               */
!             (pushes > coreyp2 /* was maxpgio / 4 */)) {
              if (seg_lock != &lk_p_vm)
                  smp_unlock(&lk_p_vm);
              smp_unlock(seg_lock);
*** /usr/src/Ultrix-4.1-RISC/sys/vm/vm_sched.c_orig Fri Jul 6 06:41:49 1990
--- /usr/src/Ultrix-4.1-RISC/sys/vm/vm_sched.c Thu Apr 25 10:12:23 1991
***************
*** 103,109 ****
  #include "../h/seg.h"
  #include "../h/dir.h"
  #include "../h/user.h"
! #include "../h/proc.h"
  #include "../h/text.h"
  #include "../h/vm.h"
  #include "../h/cmap.h"
--- 103,109 ----
  #include "../h/seg.h"
  #include "../h/dir.h"
  #include "../h/user.h"
! #include "../h/procNDC.h"
  #include "../h/text.h"
  #include "../h/vm.h"
  #include "../h/cmap.h"
***************
*** 143,148 ****
--- 143,154 ----
  int minfree = 0;
  int desfree = 0;
  int lotsfree= 0;
+ 
+ /* symbols added for performance prodding at request of corey@cac */
+ int swload = TO_FIX(2);
+ int coreyf0 = 1;        /* goto loop after swapout */
+ int coreyf1 = 20;       /* softswap enabled after this many seconds */
+ int coreyf4 = 1;        /* don't swap when RSS=0 */
  #endif mips
  
  #ifdef vax
***************
*** 291,297 ****
          (avenrun[0] >= 2 && imax(avefree, avefree30) < desfree &&
  #endif vax
  #ifdef mips
!         (avenrun[0] >= TO_FIX(2) && imax(avefree, avefree30) < desfree &&
  #endif mips
          (rate.v_pgin + rate.v_pgout > maxpgio || avefree < minfree))) {
          desperate = 1;
--- 297,307 ----
          (avenrun[0] >= 2 && imax(avefree, avefree30) < desfree &&
  #endif vax
  #ifdef mips
!         /*
!          * symbol "swload" added for performance prodding at
!          * request of corey@cac 26 Feb 91
!          */
!         (avenrun[0] >= swload && imax(avefree, avefree30) < desfree &&
  #endif mips
          (rate.v_pgin + rate.v_pgout > maxpgio || avefree < minfree))) {
          desperate = 1;
***************
*** 340,353 ****
  
          case SSLEEP:
          case SSTOP:
!             if ((freemem < desfree || pp->p_rssize == 0) &&
!                 pp->p_slptime > maxslp &&
                  (!pp->p_textp ||
                  (pp->p_textp->x_flag&(XLOCK|XNOSW))==0) &&
                  swappable(pp)) {
                  /*
                   * Kick out deadwood.
                   */
                  pp->p_sched &= ~SLOAD;
                  smp_unlock(&lk_rq);
  
--- 350,366 ----
  
          case SSLEEP:
          case SSTOP:
!             if (coreyf1 &&
!                 (freemem < desfree || (pp->p_rssize == 0 && coreyf4)) &&
!                 pp->p_slptime > coreyf1 /* was maxslp */ &&
                  (!pp->p_textp ||
                  (pp->p_textp->x_flag&(XLOCK|XNOSW))==0) &&
                  swappable(pp)) {
+                 int breakout;
                  /*
                   * Kick out deadwood.
                   */
+                 breakout = pp->p_rssize ? 1 : 1-coreyf4;
                  pp->p_sched &= ~SLOAD;
                  smp_unlock(&lk_rq);
  
***************
*** 357,363 ****
                  goto loop;
  #endif vax
  #ifdef mips
!                 NEXTPROC;
  #endif mips
              }
              smp_unlock(&lk_rq);
--- 370,381 ----
                  goto loop;
  #endif vax
  #ifdef mips
!                 if (coreyf0 && breakout) {
!                     goto loop;
!                 }
!                 else {
!                     NEXTPROC;
!                 }
  #endif mips
              }
              smp_unlock(&lk_rq);
***************
*** 540,545 ****
--- 558,564 ----
              if (sleeper < pp->p_slptime) {
                  p = pp;
                  sleeper = pp->p_slptime;
+                 if (sleeper == 127) return(p); /* Corey */
              }
          } else if (!sleeper && (pp->p_stat==SRUN||pp->p_stat==SSLEEP)) {
              rppri = pp->p_rssize;

: ----- cut here ----- cut here ----- cut here ----- cut here -----
: This is a "shell archive". Save everything after the cut mark
: in a file called thisstuff, then feed it to sh by typing sh thisstuff.
: SHAR archive format. Archive created Thu Apr 25 10:26:03 PDT 1991
echo x - kmem.c
echo '-rw-r--r-- 1 corey 3925 Apr 25 10:24 kmem.c (as sent)'
sed 's/^-//' >kmem.c <<'+FUNKY+STUFF+'
-/*
- * a tool to use in place of adb (on systems without adb) which lets you
- * peek and poke at the values of kernel variables in /dev/kmem
- *
- * usage: kmem [-s#] var1 var2 ... varN
- *  or
- * usage: kmem -w var1=val1 var2=val2 ... varN=valN
- *
- * If -s# is given, loop every # seconds and repeat. This is handy for
- * watching variables like freemem or debugging flags. The following simple
- * awk script can postprocess the output filtering out values which don't
- * change:
- *     { if (NF > 2) date = $0
- *       else if (seen[$1] != $2) {
- *         seen[$1] = $2
- *         if (date != "") {
- *             print ""; print date
- *             date = ""
- *         }
- *         print
- *       }
- *     }
- *
- * Corey Satten, corey@cac.washington.edu, 9/6/90 - Ultrix 4.0 version
- */
-#include<stdio.h>
-#include<nlist.h>
-#include<sys/file.h>
-#include <time.h>
-
-struct nlist *nl;           /* how we find locations of names */
-int *nv;                    /* the new values for each name */
-int w_flag = 0;             /* write new values? */
-char *file = "/vmunix";     /* default file to read symbols from */
-int kmem;
-
-main(argc, argv)
-    int argc;
-    char *argv[];
-{
-    int f;                  /* walks argv upto index of first non-flag */
-    int i;                  /* walks through remaining arguments */
-    int value = 0;
-    int rc = 0;
-    int sleeptime = 0;      /* if set nonzero with -s#, repeat every # secs */
-
-    /*
-     * flag parsing
-     */
-    for (f=1; f<argc && *(argv[f]) == '-'; ++f) {
-        switch(argv[f][1]) {
-            default:
-                fprintf(stderr, "%s: unknown flag -%c\n", argv[0], argv[f][1]);
-                exit(1);
-            case 'w':
-                w_flag = 1;
-                break;
-            case 's':
-                sscanf(argv[f][2] ? argv[f]+2 : argv[++f], "%d", &sleeptime);
-                break;
-            case 'f':
-                file = argv[++f];
-                break;
-        }
-    }
-
-    /*
-     * handle the remaining arguments as either symname or symname=value
-     * depending on whether -w (w_flag) was specified.
-     */
-
-    nl = (struct nlist *) malloc( sizeof(*nl) * (argc-f+1) );
-    nv = (int *) malloc( sizeof(int) * (argc-f+1) );
-    if (!nv || !nl) {perror("malloc"); exit(1);};
-
-    for (i=0; i<argc-f; ++i) {
-        char *name = (char *)malloc(strlen(argv[i+f])+1);
-
-        if (!name) {perror("malloc"); exit(1);};
-        rc = sscanf(argv[i+f], "%[^=]=%d", name, &value);
-        if (rc - w_flag != 1) {
-            fprintf(stderr, "%s: bad argument: %s\n", argv[0], argv[i+f]);
-            exit(1);
-        }
-        nl[i].n_name = name;
-        nv[i] = value;
-    }
-    nl[i].n_name = "";
-
-    /*
-     * now figure out where to read/write in /dev/kmem and do it
-     */
-
-    nlist(file, nl);
-
-    kmem = open("/dev/kmem", w_flag ? O_RDWR : O_RDONLY);
-    if (kmem < 0) {
-        perror("/dev/kmem open");
-        exit(1);
-    }
-
-    sleeploop:
-
-    if (sleeptime) {
-        putchar('\n'); date();
-    }
-
-    for (i=0; i<argc-f; ++i) {
-        long seekto = (long)nl[i].n_value;
-
-        if (nl[i].n_type == 0) {
-            fprintf(stderr, "%s: symbol `%s' not found in namelist of %s\n",
-                argv[0], nl[i].n_name, file);
-            /*
-             * We promise to do all writes in command line order, so if one
-             * is going to fail, we'd best bail out rather than continue.
-             */
-            if (w_flag) exit(2);
-            else continue;
-        }
-        if ( lseek(kmem, seekto, 0) != seekto ) {
-            perror("/dev/kmem lseek"); exit(2);
-        }
-        if ( read(kmem, &value, sizeof(int)) != sizeof(int) ) {
-            perror("/dev/kmem read"); exit(2);
-        }
-
-        printf("%s(0x%x)\t%d", nl[i].n_name, nl[i].n_value, value);
-
-        if (w_flag) {
-            if ( lseek(kmem, seekto, 0) != seekto ) {
-                perror("/dev/kmem lseek"); exit(2);
-            }
-            value = nv[i];
-            printf(" -> %d", value);
-            if ( write(kmem, &value, sizeof(int)) != sizeof(int) ) {
-                perror("/dev/kmem write"); exit(2);
-            }
-        }
-        putchar('\n');
-    }
-
-    if (sleeptime) {
-        fflush(stdout);
-        sleep(sleeptime);
-        goto sleeploop;
-    }
-}
-/*
- * print ascii version of current date and time
- */
-date() {
-    char *at;
-    static char db[30];
-    int i;
-
-    time(&i);
-    at = asctime(localtime(&i));
-    strcpy(db, at+4);
-    db[20] = 0;
-    puts(db);
-    }
+FUNKY+STUFF+
chmod u=rw,g=r,o=r kmem.c
ls -l kmem.c
exit 0
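
The signed/unsigned problem that the (long) cast in the vm_page.c diff addresses can be reproduced outside the kernel. What follows is only a user-space sketch: the function names are hypothetical and the kernel fields are modelled as plain arguments. It assumes, as the post states, that p_rssize is unsigned long in Ultrix 4.1, so the right-hand side of the comparison is converted to unsigned before the test.

```c
/* Illustration only: a user-space model of the data-page test in vm_page.c.
 * A nonzero return means "leave this data page alone". */

/* Stock behaviour as described in the post: with an unsigned p_rssize the
 * right-hand side is converted to unsigned, so a negative difference turns
 * into a huge positive number. The explicit cast spells out that implicit
 * conversion. */
int protected_unsigned(unsigned long rssize, long saferss, long slptime)
{
    return rssize < (unsigned long)(saferss - slptime);
}

/* Patched behaviour: casting p_rssize to long keeps the comparison signed,
 * so the protection ends once slptime exceeds saferss. */
int protected_signed(unsigned long rssize, long saferss, long slptime)
{
    return (long)rssize < saferss - slptime;
}
```

With saferss at 6 and a process idle for 20 seconds, the unsigned form still protects a 100-page resident set while the signed form does not. Setting saferss to 127 keeps the difference from going negative for any sleep time up to 127, which is why the no-recompile workaround behaves sensibly.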
torek@elf.ee.lbl.gov (Chris Torek) (05/02/91)
In article <1991Apr30.160331.16215@milton.u.washington.edu> corey@milton.u.washington.edu (Corey Satten) writes:
>(This is cross-posted to unix-wizards because it may also apply to 4.3BSD)

Just a quick note: It seems unlikely to, as the code being changed is almost completely different in 4.3BSD. In particular, the swapout code picks the five largest processes, not the `first' set of processes that will satisfy the immediate demands.
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA Domain: torek@ee.lbl.gov
corey@milton.u.washington.edu (Corey Satten) (05/02/91)
In article <12714@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>In article <1991Apr30.160331.16215@milton.u.washington.edu>
>corey@milton.u.washington.edu (Corey Satten) writes:
>>(This is cross-posted to unix-wizards because it may also apply to 4.3BSD)
>
>Just a quick note: It seems unlikely to, as the code being changed is
>almost completely different in 4.3BSD. In particular, the swapout code
>picks the five largest processes, not the `first' set of processes that
>will satisfy the immediate demands.

Oh dear, I can't believe I'm about to take issue with Chris Torek. This is indeed a rare day. I hope I don't blow it.

Chris, I find almost identical code in 4.3BSD (tahoe I think). In file vm_page.c the lines which prevent data from paging are almost identical (but without DEC's SMP locking).

In file vm_sched.c the code near the comment about swapping out deadwood is almost identical with Ultrix and on our system, that was the code which was doing almost all of the swapping (though possibly because I elevated the values of lotsfree and desfree and scan rate to improve performance in round 1). I think you are talking about the "hardswap" code (when desperate == 1), which on our system, turned out to rarely execute.

In any case, even if hardswap does happen, it still looks like it doesn't use the largest 5 processes (as the comment would indicate) if there are any processes which have been sleeping longer than maxslp (20 seconds); I think it uses the one closest to the beginning of the process table which has been sleeping the longest. I think longest maxes out at 127 seconds so with our process mix, that means it would usually swap the first few processes which have been sleeping at least 127 seconds.
Anyway, as I said, I haven't measured a BSD system and I didn't study hardswap as closely as the other since it almost never happened on our system, so I only say it may apply because the code looks to me very similar to the Ultrix code I have become rather familiar with. Furthermore, now that we are paging out data, we aren't swapping processes with RSS>0 at all so I think the paging part of the fix may be more important than the swapping part anyway. -------- Corey Satten, corey@cac.washington.edu Networks and Distributed Computing University of Washington (206)543-5611
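
The tie-breaking effect described above for the hardswap path can be sketched in user space. This is an illustration, not the kernel routine: the name pick_sleeper, the array representation and the SLPTIME_MAX constant are assumptions made for the example. Only the strict `<' comparison (visible in the last vm_sched.c hunk) and the 127-second cap come from the thread.

```c
/* p_slptime is reported in the thread to max out at 127 seconds. */
#define SLPTIME_MAX 127

/* Return the index of the longest sleeper among n table slots whose sleep
 * time exceeds maxslp, or -1 if there is none. Because the comparison is a
 * strict '<', the earliest slot wins whenever several slots are tied. */
int pick_sleeper(const int slptime[], int n, int maxslp)
{
    int best = -1, sleeper = 0, i;

    for (i = 0; i < n; i++) {
        /* model the saturating sleep counter */
        int t = slptime[i] > SLPTIME_MAX ? SLPTIME_MAX : slptime[i];

        if (t > maxslp && sleeper < t) {
            best = i;
            sleeper = t;
        }
    }
    return best;
}
```

Once several processes have all reached the cap they compare equal, so the slot nearest the front of the table is chosen every time, which matches the behaviour reported in the post.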
torek@elf.ee.lbl.gov (Chris Torek) (05/03/91)
In article <1991May2.052140.27048@milton.u.washington.edu> corey@milton.u.washington.edu (Corey Satten) writes:
>Oh dear, I can't believe I'm about to take issue with Chris Torek.

Well, I *do* goof sometimes :-)

>Chris, I find almost identical code in 4.3BSD (tahoe I think). In file
>vm_page.c the lines which prevent data from paging are almost identical

I was only talking about your second posting, which I thought dealt only with the swap code. I presume you mean these three lines:

    if (c->c_type != CTEXT) {
        if (rp->p_rssize < saferss - rp->p_slptime)
            return (0);
    }

Since p_rssize is in units of `core clicks' (512 or 1024 bytes) and saferss is 32 and p_slptime is in [0..127] and all of the numbers are signed (I checked :-) ), this should only affect processes with less than 16 or 32 KB of resident set size, and once they have been asleep for 32 seconds, it should not affect them at all (since rssize will always be >= 0 and the rhs of the compare will be <= 0).

>In file vm_sched.c the code near the comment about swapping out
>deadwood is almost identical with Ultrix and on our system, that
>was the code which was doing almost all of the swapping

Aha! This code is supposed to run `almost never', as far as I can tell. The idea is:

    if we think we need memory, or if the process is already out; and
    if this process has not run for `a long time' (20 seconds); and
    if nothing funny is going on with the text or proc;
        throw it out.

(See below as to why already-out means anything) Hence `freemem < desfree' rather than `freemem < lotsfree': the pageout daemon is supposed to keep free memory in the range [desfree..lotsfree) under normal load.

>(though possibly because I elevated the values of lotsfree and
>desfree and scan rate to improve performance in round 1). I think you
>are talking about the "hardswap" code (when desperate == 1), which on
>our system, turned out to rarely execute.

I was. (I believed the comment rather than trying to figure out the code.
This code is all gone in the new Mach-based VM anyway.)

>Furthermore, now that we are paging out data, we aren't swapping
>processes with RSS>0 at all so I think the paging part of the fix may
>be more important than the swapping part anyway.

It is. The swap code in the old BSD VM is only supposed to fire off in a few special cases:

 - very low on memory, and pageout daemon cannot keep up (this is the `hardswap' case);
 - expansion swaps (need space between p0 and p1 and the usual easy expansion failed);
 - process has already paged out entirely, and the UPAGES pages of u. might help, so kick out its u. as well (this is one of the `deadwood' cases)---since UPAGES is 16 (*1024 bytes) this is not really very profitable;
 - very low on memory (freemem < desfree) and process has been idle for some time (this is the other `deadwood' case);
 - `kernelmap' has become fragmented (need contiguous pte pages): we swap like crazy just to defragment it (horrible, but rare).

Anyway, now that I am looking at the hardswap code, I think you are right: the code iterates through the whole proc table (never stops) but only accumulates `big' processes if it does not find a `sleeper' (something sleeping for > 20 seconds). Generally there is always at least one such, and it takes that one and then starts the whole thing over (as you described). If the paging system has done its job, however, this will gain little and after a dozen or so sleepers have been swapped out, the `big' process code will fire.

Incidentally, it is not surprising that the code worked poorly on DECstations: it is tuned for machines on which the CPU is considerably slower than the I/O, rather than the other way around. On the 780, it was often better to `work hard' than to `work smart'....
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA Domain: torek@ee.lbl.gov
corey@milton.u.washington.edu (Corey Satten) (05/03/91)
In article <12759@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>
>I presume you mean these three lines:
>
>    if (c->c_type != CTEXT) {
>        if (rp->p_rssize < saferss - rp->p_slptime)
>            return (0);
>    }
>
>Since p_rssize is in units of `core clicks' (512 or 1024 bytes) and
>saferss is 32 and p_slptime is in [0..127] and all of the numbers
>are signed (I checked :-) ), this should only affect processes with
>less than 16 or 32 KB of resident set size, and once they have been
>asleep for 32 seconds, it should not affect them at all (since rssize
>will always be >= 0 and the rhs of the compare will be <= 0).

Yes, that's the place which is OK in BSD but DEC changed p_rssize to unsigned so it would have worked only until the right side became negative; however I believe that code never executed on Ultrix or BSD because of the code a few lines before (BSD code fragment):

    if ((rp->p_flag & (SSEQL|SUANOM)) == 0 &&
        rp->p_rssize <= rp->p_maxrss)
        return (0);

which the front hand does for valid pages. I think this means that unless the process has executed a vadvise() to warn of sequential or anomalous paging behavior, the front hand never invalidates data pages. But I need to back off a little here and remind readers that this is what I discovered happens on Ultrix and the code looks very similar but I haven't experimented with BSD.

>Anyway, now that I am looking at the hardswap code, I think you are
>right: the code iterates through the whole proc table (never stops)
>but only accumulates `big' processes if it does not find a `sleeper'
>(something sleeping for > 20 seconds). Generally there is always at
>least one such, and it takes that one and then starts the whole thing
>over (as you described). If the paging system has done its job,
>however, this will gain little and after a dozen or so sleepers
>have been swapped out, the `big' process code will fire.
So on our system, the paging system was not doing its job because it was only paging out what little text it could find between lots of stale data and we had enough processes in the process table that there were always enough sleepers to swap out and in at the beginning of the process table and the results were not pretty. Chris, I really thank you for taking the time to look this over. I hope, as you say, it is already done right in the new code. -------- Corey Satten, corey@cac.washington.edu Networks and Distributed Computing University of Washington (206)543-5611
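
The difference between a scan that always restarts at the front of the process table and one that resumes where it left off can be shown with a toy model. This is a simplified sketch with hypothetical names, not the FORALLPROC macro from the diffs: it treats the table as a flat array of idle flags and ignores the bitmap, locking and swap-in.

```c
/* Swap out up to `need` idle slots, scanning n slots circularly from
 * `start`. swaps[i] counts how often slot i was chosen. Returns the slot
 * after the last one examined, which a rotating scanner would pass back
 * in as `start` on the next call. */
int swap_pass(const int idle[], int swaps[], int n, int need, int start)
{
    int i = start, seen;

    for (seen = 0; seen < n && need > 0; seen++) {
        if (idle[i]) {
            swaps[i]++;
            need--;
        }
        i = (i + 1) % n;
    }
    return i;
}
```

Calling swap_pass() with start fixed at 0 models the stock behaviour, where the same front-of-table slots absorb every swap. Feeding the return value back in as the next start spreads the swaps evenly over all idle slots, which is the long-run fairness the modified macro is aiming for.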
torek@elf.ee.lbl.gov (Chris Torek) (05/04/91)
In article <1991May2.231911.23612@milton.u.washington.edu> corey@milton.u.washington.edu (Corey Satten) writes:
>however I believe that code never executed on Ultrix or BSD because of
>the code a few lines before (BSD code fragment):
>
>    if ((rp->p_flag & (SSEQL|SUANOM)) == 0 &&
>        rp->p_rssize <= rp->p_maxrss)
>        return (0);
>
>which the front hand does for valid pages. I think this means that unless
>the process has executed a vadvise() to warn of sequential or anomalous
>paging behavior, the front hand never invalidates data pages.

Not in the old BSD kernel. The code path is, for the front hand (on the VAX):

    /*
     * `page cluster' info is generally treated as `bits in the
     * first pte mapping a page the cluster', hence the `mark first'
     * code below.
     */
    if (page cluster is valid) {
        mark page cluster invalid;
        mark process SPTECHG;
        if (any page in the cluster is modified)
            mark the first page modified;
        make all pages in the cluster look like the first one;
        if (it is a text page cluster)
            let other users know about changes;
        if (process is normal)
            we are done, return 0;
    }

The SEQL and SUANOM cases, and the rssize > maxrss case, are where the process should not be paged in LRU fashion but rather in `almost MRU'. In this case, if the page was valid and we made it not-valid we also try to page it out. This is not quite right---we should not be paging out text pages as the vadvise call is for data, not text---but is probably `good enough'.
On the Tahoe, which has a reference bit, the valid bit and `fast reclaim' stuff is unnecessary, and the code path looks like:

    if (this is a text page cluster)
        if (any process using it has referenced it)
            mark the first page as referenced;
    if (the page cluster has been referenced) {
        mark page cluster not-referenced;
        if (any page in the cluster is modified)
            mark the first page modified;
        make all pages in the cluster look like the first one;
        if (it is a text page cluster)
            let other users know about changes;
        if (process is normal)
            we are done, return 0;
    }

The back hand, of course, never looks at valid/referenced pages at all.

Thus, the code in `>' above is meant only to cause the front hand to do pageouts. For most processes, the front hand merely paves the way for the back hand to do pageouts. Imagine a wall clock with both hands moving at the same speed: the time it takes for the back hand to pass the same spot as the front hand is the time a process is given to `reclaim' a page. We take it away, but leave it in memory, and if you ask for it before the back hand gets around to it, you get to keep it. If not, we dust off the page (write it to swap) if it is dirty, and then put it in the `clean' pile (free pages).

For processes that have done a vadvise(SSEQL) or (SUANOM), we presume instead that `recently used' means `unlikely to be reused', so in this case we have the front hand dust it off instead---if you just used it, we take it away.

The planned 4.2BSD `new VM' (which is only now being implemented by beating the Mach VM into a different shape) had an `madvise' call which was intended to mark anomalous or sequential behaviour on a per-region or per-page basis, rather than per-process. In the meantime the old vadvise call was deprecated, but it lives on. . . .

Presumably someone broke this in the DEC MIPS port. The MIPS chip does not have PTEs, so PTEs must be done in software, so you can define your own used/modified/ref'd/etc bits.
This is much easier in the new Mach-based VM, where the responsibility for hardware management is in a separate file. -- In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427) Berkeley, CA Domain: torek@ee.lbl.gov
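
The wall-clock picture above can be turned into a small simulation. This is a deliberately simplified sketch, not the BSD pageout code: the struct, the function names and the fixed gap between the hands are assumptions for illustration, and the step where a dirty page is written to swap is left out.

```c
/* One entry per physical page in the toy model. */
struct page {
    int valid;  /* set while the owner may use the page without faulting */
    int freed;  /* set once the back hand has taken the page away */
};

/* Advance both hands by one page. The front hand invalidates the page it
 * points at. The back hand, `gap` pages behind (0 < gap < n), frees its
 * page only if nobody has revalidated it since the front hand went by.
 * Returns the new front-hand position. */
int clock_tick(struct page pg[], int n, int front, int gap)
{
    int back = (front - gap + n) % n;

    pg[front].valid = 0;
    if (!pg[back].valid && !pg[back].freed)
        pg[back].freed = 1;
    return (front + 1) % n;
}

/* A process touching a page before the back hand arrives reclaims it. */
void touch(struct page pg[], int i)
{
    if (!pg[i].freed)
        pg[i].valid = 1;
}
```

Starting with every page valid and a gap of 2, a page that is touched between the two hands survives the back hand's visit, while an untouched neighbour is freed. The gap divided by the scan rate is the reclaim window described in the post.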
corey@milton.u.washington.edu (Corey Satten) (05/07/91)
In article <12792@dog.ee.lbl.gov> torek@elf.ee.lbl.gov (Chris Torek) writes:
>In article <1991May2.231911.23612@milton.u.washington.edu>
>corey@milton.u.washington.edu (Corey Satten) writes:
>>however I believe that code never executed on Ultrix or BSD because of
>>the code a few lines before (BSD code fragment):
>>
>>    if ((rp->p_flag & (SSEQL|SUANOM)) == 0 &&
>>        rp->p_rssize <= rp->p_maxrss)
>>        return (0);
>>
>>which the front hand does for valid pages. I think this means that unless
>>the process has executed a vadvise() to warn of sequential or anomalous
>>paging behavior, the front hand never invalidates data pages.
>
>Not in the old BSD kernel. The code path is, for the front hand (on
>the VAX):

Chris, I stand corrected for both BSD *and* for Ultrix. The fragment where I force all processes to behave as if SEQL or SUANOM was set is (as you point out) not required to get data pages to page out -- so on Ultrix, only the signed comparison fix is really required in vm_page.c. However...

On our Ultrix system, since I have the SUANOM test anded with a global I can poke while the system is running, I can experiment with turning it on and off. I find that without my (technically unnecessary) "fix", our system can't free enough pages to keep from swapping continuously unless the scan rate is increased very substantially, and on one of our smaller, more memory starved systems, the highest scan rate achievable (by poking fastscan=1 and slowscan=2) is insufficient.

I'm still unsure exactly how to interpret this (especially in light of the fact that the number of dirty pages written to disk is limited to a rather small number per second), but now that I've experimentally confirmed this, I did want to correct my earlier posting before everyone forgets what we're talking about.

Thanks again for taking the time to look this over with me.

--------
Corey Satten, corey@cac.washington.edu
Networks and Distributed Computing
University of Washington
(206)543-5611