corey@milton.u.washington.edu (Corey Satten) (04/30/91)
(This is cross-posted to unix-wizards because it may also apply to 4.3BSD)

          Performance Tuning a DEC Ultrix 4.1 Workstation
                            Round 2

              Corey Satten, corey@cac.washington.edu
                Networks and Distributed Computing
                     University of Washington
                       Seattle, Washington
                            April 1991

This is a follow-up to work first posted in September 1990.

History:

Our department is using a rather maximally configured DECstation as a time-sharing host. It is a DEC 5000 running Ultrix 4.1 and has six disks, mostly 660 meg or 1 gigabyte. It serves /usr/local/bin via NFS to about a dozen workstations; talks to several printers; is the departmental electronic mail machine; hosts some campus-wide mailing lists; is our anonymous FTP server; is one of two campus default domain nameservers; and is also the time-sharing host for about 20 X-terminals plus a dozen or more other users connected via telnet. We are supporting about 250 megs of swap space on roughly 43 megabytes of the 56 megabytes of physical memory. A 'ps aux' listing usually has 400+ processes in it.

In my September posting, I described how tweaking some global variables in the kernel allowed us to improve performance by paging more, swapping less, and maintaining a larger chunk of free memory. Several people on campus and in netland followed our lead and reported similar improvements. (Global kernel variables can be conveniently tweaked on a running system with the kmem program included with this posting.)

Our system, thus tweaked, spoiled us by the times it was fast and frustrated us by the times it wasn't. Occasionally we had reports of very large (20+ second) character echo delays experienced by one user while others in the same environment running the same programs saw no delays. Furthermore, large programs tended to lose too many pages to run satisfactorily. These continuing problems, plus my gut feeling that 56 megabytes should really be enough memory for what we're doing, motivated me to continue investigating.

Current Work:

Close examination of our system revealed that we had some processes which had not run in a very long time (perhaps days) but which still had a significant RSS. Scrutiny of the source combined with some careful experiments led me to discover that data pages never page out under normal circumstances, even though all the complicated code to do so is there. Judging from the code, this was a conscious decision made in BSD Unix.

Modern Unix systems running X windows tend to have more idle processes than ever before, and those processes tend to have larger data spaces than their non-windowing ancestors. Thus, not paging data pages causes memory to be tied up with junk which only swapping can remove. On our system, I estimated perhaps 20 megabytes fell into this category.

So why don't these data pages eventually swap out? When free memory becomes less than "desfree", the kernel looks for processes which have been sleeping longer than "maxslp" (usually 20 seconds) to swap out. To my horror, I discovered that it always starts looking at the beginning of the process table and stops when it has satisfied the need for free memory. A quick modification to "ps" to print the processes in process-table order and also print the number of times each had swapped confirmed this. No process in the last half of our process table had ever swapped, and those at the front had swapped a lot. The extreme prejudice which is directed against processes living at the front of the table could well explain the horrible performance some users occasionally reported that the rest of us didn't see.

In order to completely rectify both the swapping and paging problems described above, kernel changes are required. However, I believe fixing the data paging problem has the biggest effect, and later I will describe a partial workaround to the data paging problem which may work for those of you who can't change your kernels.
(You might also try asking your DEC rep to supply these changes in binary form.)

To fix the swapping problem, I modified the FORALLPROC macro used by vm_sched.c to begin swapping where it last left off so that, in the long run, process table position has no predictable effect on swapping and long-time sleepers will eventually swap out. To achieve this, I also needed to make other small changes to vm_sched.c.

To fix the data paging problem, I simply changed the two places which prevented data paging in vm_page.c. In both vm_sched.c and vm_page.c, I inserted global variables which I can set and test at runtime to experiment.

Since data pages can be more expensive to page out than text pages, BSD and Ultrix have 2 limits on data pageouts:

  1) only maxpgio/4 data pages per second will page out.
  2) data pages are only paged out if the process they belong to has an RSS > (saferss - sleeptime), where sleeptime is the number of seconds since the process has run.

Our system seems to be doing fine with these defaults.

If you can't change your kernel but you still want to try to get data pages to page out, you may be able to use the following trick to force all processes to exceed their soft memory limit. Rename /etc/init to /etc/init.orig and replace /etc/init with the following trivial program:

    #include <sys/time.h>
    #include <sys/resource.h>

    main(argc, argv, envp)
    char *argv[], *envp[];
    int argc;
    {
        struct rlimit rlp;

        getrlimit(RLIMIT_RSS,&rlp);
        rlp.rlim_cur = 1;       /* zero may be better if it works */
        setrlimit(RLIMIT_RSS,&rlp);
        execve("/etc/init.orig", argv, envp);
    }

This will effectively nice them all (which shouldn't matter since it happens to them all uniformly) and allow their data pages to be paged out as long as the resident set size is greater than the value of "saferss" (6 pages) minus the idle time of the process.
Unfortunately, DEC changed p_rssize from signed long to unsigned long, so as soon as a process has been idle longer than saferss seconds, the subtraction goes negative, the comparison is done unsigned, and the data pages stop paging out again. The workaround for those who can't re-compile is to set saferss to 127, which will guarantee the subtraction never goes negative, and idle processes will eventually lose data pages (although not as quickly as with a fixed kernel and a smaller value of saferss).

For those of you with source to Ultrix 4.1, context diffs of my changes are appended below. I have manually deleted my monitoring changes from the diffs to make the functional changes clearer and the diffs about 90% smaller. In addition, there are several things to note:

First, I applied the FORALLPROC change only to vm_sched.c by naming the changed file procNDC.h and changing the #include in vm_sched.c accordingly.

Second, the p++ which is uniformly changed to ++p is a vestige of earlier changes which have been removed. The only lingering effect is to enlarge the context of this context-diff to include the entire macro (which is why I left it).

Third, even though the new FORALLPROC code is complicated, it maintains the desirable timing characteristics of the original, since the new stuff only executes twice for each invocation of the FORALLPROC macro.

One final note for kernel hacking purists: in vm_sched.c, where processes are swapped when RSS reaches zero, what I really want is to have them swap when RSS is zero AND the number of swapouts of the process is less than 2 AND the number of pageins is greater than zero. I didn't trust myself to implement that change, since the per-process count of swapouts and pageins is in the u.u_ru structure and I was unsure how to access it properly. As a result, our nfsd and biod processes aimlessly swap in and out, fortunately at negligible cost.

We are now running with the following kernel parameters poked into /dev/kmem by my kmem program (appended somewhere below).

    lotsfree  128 -> 768    /* begin scanning for pages with 3meg free */
    desfree    64 -> 512    /* begin swapping sleeping procs at 2meg */
    coreyf1    xx ->  40    /* ... if sleeping longer than 40 seconds */
    minfree    28 -> 128    /* consider swapping running procs at .5meg */

The elevated scan rate we needed when data pages didn't page seems to be unnecessary now. Slow scanning on our system encounters plenty of stale pages to keep the free list replenished.

Results:

  1) Our system now almost never swaps jobs because of memory shortfall.

  2) The active real memory (displayed by vmstat -v) is about double what it was before -- about 25 megabytes -- with spikes as high as 37 megabytes. We never saw such high numbers before.

  3) RSS of idle processes eventually reaches zero and these processes are then gently swapped out.

  4) Paging activity (scan rate, etc.) is much lower than before -- often the system goes for minutes with a scan rate of zero.

  5) Interactive response is good on FrameMaker and other medium size programs which were formerly frustratingly slow.

  6) Large programs (such as cc -O on perl version 3's eval.c where the optimizer grows to 20 megabytes) can run effectively -- cpu idle time drops to zero for the entire 2-minute optimizer run and interactive response in other applications remains good.

  7) The load average on the machine is somewhat lower and the spikes are considerably lower.

  8) We are considering increasing our file-system buffer-cache from the default 15% since most of our idle time now seems to be file-system related rather than virtual memory related.

--------
Corey Satten, corey@cac.washington.edu
Networks and Distributed Computing
University of Washington
(206)543-5611

The kmem program and context diffs follow:

*** /usr/src/Ultrix-4.1-RISC/sys/h/proc.h	Fri Jul  6 07:18:00 1990
--- /usr/src/Ultrix-4.1-RISC/sys/h/procNDC.h	Fri Apr  5 21:13:18 1991
***************
*** 398,434 ****
   * out of the For loop, and not one of the inner While loops
   */
  
! #define NEXTPROC { pp++; goto _a ; }
  
  #define FORALLPROC(X) { \
!     register unsigned long *_bp; \
!     register struct proc *pp = proc; \
      register unsigned long _mask; \
      \
      /* \
       * for the whole index into the table \
       */ \
!     for ( _bp = proc_bitmap; \
!           _bp < &proc_bitmap[max_proc_index] ; _bp++ ) { \
          /* \
           * If any bits in this longword are used, \
           * find the associated structures \
           */ \
!         if (_mask = *_bp) { \
  _a:         if (_mask) { \
  _b:             if ((_mask&1) == 0) { \
                      _mask = _mask >> 1; \
!                     pp++; \
                      goto _b; \
                  } \
                  _mask = _mask >> 1; \
                  { X } \
!                 pp++; \
                  goto _a; \
              } else { \
                  if (_mask = ((pp-proc)%32)) \
!                     pp += 32 - _mask; \
              } \
          } else pp += 32; \
      } \
  }
--- 398,452 ----
   * out of the For loop, and not one of the inner While loops
   */
  
! #define NEXTPROC { ++pp; goto _a ; }
  
+ #define INC_bp(X,Y) (X < &proc_bitmap[max_proc_index-1] ? \
+             ++X : (Y=proc, X=proc_bitmap))
+ 
  #define FORALLPROC(X) { \
!     static unsigned long *_bp = proc_bitmap; \
!     register struct proc *pp = proc + 32*(_bp-proc_bitmap); \
!     static struct proc *_opp = 0; \
      register unsigned long _mask; \
+     register unsigned long *_bpe = _bp; \
+     register int _more; \
+     unsigned long _maskmask; \
      \
      /* \
       * for the whole index into the table \
       */ \
!     for (_more = 2; ; INC_bp(_bp,pp)) { \
!         if (_bp == _bpe) \
!             if (--_more) { \
!                 int i = _opp-pp+1; \
!                 _maskmask = ~0; \
!                 if (i<32 && i>=0) \
!                     _maskmask <<= i; \
!                 else _maskmask = 0; \
!                 _mask = *_bp & _maskmask; \
!             } else _mask = *_bp & ~_maskmask; \
!         else _mask = *_bp; \
          /* \
           * If any bits in this longword are used, \
           * find the associated structures \
           */ \
!         if (_mask) { \
  _a:         if (_mask) { \
  _b:             if ((_mask&1) == 0) { \
                      _mask = _mask >> 1; \
!                     ++pp; \
                      goto _b; \
                  } \
                  _mask = _mask >> 1; \
+                 _opp = pp; \
                  { X } \
!                 ++pp; \
                  goto _a; \
              } else { \
                  if (_mask = ((pp-proc)%32)) \
!                     pp += 32-_mask; \
              } \
          } else pp += 32; \
+         if (!_more) break; \
      } \
  }
*** /usr/src/Ultrix-4.1-RISC/sys/vm/vm_page.c_orig	Tue Jul 17 12:30:21 1990
--- /usr/src/Ultrix-4.1-RISC/sys/vm/vm_page.c	Tue Apr 23 12:58:18 1991
***************
*** 274,279 ****
--- 274,283 ----
  int nohash = 0;         /* turn on/off hashing */
  int nobufcache = 1;     /* turn on/off buf cache for data */
  
+ /* symbols added for performance prodding at request of corey@cac */
+ int coreyp1 = 0;        /* data does page out (stock is coreyp1 = 1) */
+ int coreyp2 = 10;       /* data pageouts per second (was maxpgio/4) */
+ 
  extern int swapfrag;
  /*
   * Handle a page fault.
***************
*** 1318,1324 ****
              (void) splx(s);
              return(0);
          }
!         if ((rp->p_vm & (SSEQL|SUANOM)) == 0 &&
              rp->p_rssize <= rp->p_maxrss) {
              smp_unlock(seg_lock);
              smp_unlock(&lk_cmap);
--- 1322,1328 ----
              (void) splx(s);
              return(0);
          }
!         if ((rp->p_vm & (SSEQL|SUANOM)) == 0 && coreyp1 &&
              rp->p_rssize <= rp->p_maxrss) {
              smp_unlock(seg_lock);
              smp_unlock(&lk_cmap);
***************
*** 1332,1338 ****
           * Guarantee a minimal investment in data
           * space for jobs in balance set.
           */
!         if (rp->p_rssize < saferss - rp->p_slptime) {
              smp_unlock(&lk_p_vm);
              smp_unlock(&lk_cmap);
              (void) splx(s);
--- 1336,1342 ----
           * Guarantee a minimal investment in data
           * space for jobs in balance set.
           */
!         if ((long)rp->p_rssize < saferss - rp->p_slptime) {
              smp_unlock(&lk_p_vm);
              smp_unlock(&lk_cmap);
              (void) splx(s);
***************
*** 1371,1377 ****
           * Limit pushes to avoid saturating
           * pageout device.
           */
!         (pushes > maxpgio / 4)) {
              if (seg_lock != &lk_p_vm)
                  smp_unlock(&lk_p_vm);
              smp_unlock(seg_lock);
--- 1375,1381 ----
           * Limit pushes to avoid saturating
           * pageout device.
           */
!         (pushes > coreyp2 /* was maxpgio / 4 */)) {
              if (seg_lock != &lk_p_vm)
                  smp_unlock(&lk_p_vm);
              smp_unlock(seg_lock);
*** /usr/src/Ultrix-4.1-RISC/sys/vm/vm_sched.c_orig	Fri Jul  6 06:41:49 1990
--- /usr/src/Ultrix-4.1-RISC/sys/vm/vm_sched.c	Thu Apr 25 10:12:23 1991
***************
*** 103,109 ****
  #include "../h/seg.h"
  #include "../h/dir.h"
  #include "../h/user.h"
! #include "../h/proc.h"
  #include "../h/text.h"
  #include "../h/vm.h"
  #include "../h/cmap.h"
--- 103,109 ----
  #include "../h/seg.h"
  #include "../h/dir.h"
  #include "../h/user.h"
! #include "../h/procNDC.h"
  #include "../h/text.h"
  #include "../h/vm.h"
  #include "../h/cmap.h"
***************
*** 143,148 ****
--- 143,154 ----
  int minfree = 0;
  int desfree = 0;
  int lotsfree= 0;
+ 
+ /* symbols added for performance prodding at request of corey@cac */
+ int swload = TO_FIX(2);
+ int coreyf0 = 1;        /* goto loop after swapout */
+ int coreyf1 = 20;       /* softswap enabled after this many seconds */
+ int coreyf4 = 1;        /* don't swap when RSS=0 */
  #endif mips
  
  #ifdef vax
***************
*** 291,297 ****
          (avenrun[0] >= 2 && imax(avefree, avefree30) < desfree &&
  #endif vax
  #ifdef mips
!         (avenrun[0] >= TO_FIX(2) && imax(avefree, avefree30) < desfree &&
  #endif mips
          (rate.v_pgin + rate.v_pgout > maxpgio || avefree < minfree))) {
          desperate = 1;
--- 297,307 ----
          (avenrun[0] >= 2 && imax(avefree, avefree30) < desfree &&
  #endif vax
  #ifdef mips
!         /*
!          * symbol "swload" added for performance prodding at
!          * request of corey@cac 26 Feb 91
!          */
!         (avenrun[0] >= swload && imax(avefree, avefree30) < desfree &&
  #endif mips
          (rate.v_pgin + rate.v_pgout > maxpgio || avefree < minfree))) {
          desperate = 1;
***************
*** 340,353 ****
  
          case SSLEEP:
          case SSTOP:
!             if ((freemem < desfree || pp->p_rssize == 0) &&
!                 pp->p_slptime > maxslp &&
                  (!pp->p_textp ||
                  (pp->p_textp->x_flag&(XLOCK|XNOSW))==0) &&
                  swappable(pp)) {
                  /*
                   * Kick out deadwood.
                   */
                  pp->p_sched &= ~SLOAD;
                  smp_unlock(&lk_rq);
--- 350,366 ----
  
          case SSLEEP:
          case SSTOP:
!             if (coreyf1 &&
!                 (freemem < desfree || (pp->p_rssize == 0 && coreyf4)) &&
!                 pp->p_slptime > coreyf1 /* was maxslp */ &&
                  (!pp->p_textp ||
                  (pp->p_textp->x_flag&(XLOCK|XNOSW))==0) &&
                  swappable(pp)) {
+                 int breakout;
                  /*
                   * Kick out deadwood.
                   */
+                 breakout = pp->p_rssize ? 1 : 1-coreyf4;
                  pp->p_sched &= ~SLOAD;
                  smp_unlock(&lk_rq);
***************
*** 357,363 ****
                  goto loop;
  #endif vax
  #ifdef mips
!                 NEXTPROC;
  #endif mips
              }
              smp_unlock(&lk_rq);
--- 370,381 ----
                  goto loop;
  #endif vax
  #ifdef mips
!                 if (coreyf0 && breakout) {
!                     goto loop;
!                 }
!                 else {
!                     NEXTPROC;
!                 }
  #endif mips
              }
              smp_unlock(&lk_rq);
***************
*** 540,545 ****
--- 558,564 ----
              if (sleeper < pp->p_slptime) {
                  p = pp;
                  sleeper = pp->p_slptime;
+                 if (sleeper == 127) return(p); /* Corey */
              }
          } else if (!sleeper && (pp->p_stat==SRUN||pp->p_stat==SSLEEP)) {
              rppri = pp->p_rssize;

: ----- cut here ----- cut here ----- cut here ----- cut here -----
: This is a "shell archive".  Save everything after the cut mark
: in a file called thisstuff, then feed it to sh by typing sh thisstuff.
: SHAR archive format.  Archive created Thu Apr 25 10:26:03 PDT 1991
echo x - kmem.c
echo '-rw-r--r--  1 corey        3925 Apr 25 10:24 kmem.c    (as sent)'
sed 's/^-//' >kmem.c <<'+FUNKY+STUFF+'
-/*
- * a tool to use in place of adb (on systems without adb) which lets you
- * peek and poke at the values of kernel variables in /dev/kmem
- *
- * usage: kmem [-s#] var1 var2 ... varN
- *   or
- * usage: kmem -w var1=val1 var2=val2 ... varN=valN
- *
- * If -s# is given, loop every # seconds and repeat.  This is handy for
- * watching variables like freemem or debugging flags.  The following simple
- * awk script can postprocess the output filtering out values which don't
- * change:
- *      { if (NF > 2) date = $0
- *        else if (seen[$1] != $2) {
- *              seen[$1] = $2
- *              if (date != "") {
- *                  print ""; print date
- *                  date = ""
- *              }
- *              print
- *        }
- *      }
- *
- * Corey Satten, corey@cac.washington.edu, 9/6/90 - Ultrix 4.0 version
- */
-#include<stdio.h>
-#include<nlist.h>
-#include<sys/file.h>
-#include <time.h>
-
-struct nlist *nl;               /* how we find locations of names */
-int *nv;                        /* the new values for each name */
-int w_flag = 0;                 /* write new values? */
-char *file = "/vmunix";         /* default file to read symbols from */
-int kmem;
-
-main(argc, argv)
-    int argc;
-    char *argv[];
-{
-    int f;              /* walks argv upto index of first non-flag */
-    int i;              /* walks through remaining arguments */
-    int value = 0;
-    int rc = 0;
-    int sleeptime = 0;  /* if set nonzero with -s#, repeat every # secs */
-
-    /*
-     * flag parsing
-     */
-    for (f=1; f<argc && *(argv[f]) == '-'; ++f) {
-        switch(argv[f][1]) {
-            default:
-                fprintf(stderr, "%s: unknown flag -%c\n", argv[0], argv[f][1]);
-                exit(1);
-            case 'w':
-                w_flag = 1;
-                break;
-            case 's':
-                sscanf(argv[f][2] ? argv[f]+2 : argv[++f], "%d", &sleeptime);
-                break;
-            case 'f':
-                file = argv[++f];
-                break;
-        }
-    }
-
-    /*
-     * handle the remaining arguments as either symname or symname=value
-     * depending on whether -w (w_flag) was specified.
-     */
-
-    nl = (struct nlist *) malloc( sizeof(*nl) * (argc-f+1) );
-    nv = (int *) malloc( sizeof(int) * (argc-f+1) );
-    if (!nv || !nl) {perror("malloc"); exit(1);};
-
-    for (i=0; i<argc-f; ++i) {
-        char *name = (char *)malloc(strlen(argv[i+f])+1);
-
-        if (!name) {perror("malloc"); exit(1);};
-        rc = sscanf(argv[i+f], "%[^=]=%d", name, &value);
-        if (rc - w_flag != 1) {
-            fprintf(stderr, "%s: bad argument: %s\n", argv[0], argv[i+f]);
-            exit(1);
-        }
-        nl[i].n_name = name;
-        nv[i] = value;
-    }
-    nl[i].n_name = "";
-
-    /*
-     * now figure out where to read/write in /dev/kmem and do it
-     */
-
-    nlist(file, nl);
-
-    kmem = open("/dev/kmem", w_flag ? O_RDWR : O_RDONLY);
-    if (kmem < 0) {
-        perror("/dev/kmem open");
-        exit(1);
-    }
-
-  sleeploop:
-
-    if (sleeptime) {
-        putchar('\n'); date();
-    }
-
-    for (i=0; i<argc-f; ++i) {
-        long seekto = (long)nl[i].n_value;
-
-        if (nl[i].n_type == 0) {
-            fprintf(stderr, "%s: symbol `%s' not found in namelist of %s\n",
-                argv[0], nl[i].n_name, file);
-            /*
-             * We promise to do all writes in command line order, so if one
-             * is going to fail, we'd best bail out rather than continue.
-             */
-            if (w_flag) exit(2);
-            else continue;
-        }
-        if ( lseek(kmem, seekto, 0) != seekto ) {
-            perror("/dev/kmem lseek"); exit(2);
-        }
-        if ( read(kmem, &value, sizeof(int)) != sizeof(int) ) {
-            perror("/dev/kmem read"); exit(2);
-        }
-
-        printf("%s(0x%x)\t%d", nl[i].n_name, nl[i].n_value, value);
-
-        if (w_flag) {
-            if ( lseek(kmem, seekto, 0) != seekto ) {
-                perror("/dev/kmem lseek"); exit(2);
-            }
-            value = nv[i];
-            printf(" -> %d", value);
-            if ( write(kmem, &value, sizeof(int)) != sizeof(int) ) {
-                perror("/dev/kmem write"); exit(2);
-            }
-        }
-        putchar('\n');
-    }
-
-    if (sleeptime) {
-        fflush(stdout);
-        sleep(sleeptime);
-        goto sleeploop;
-    }
-}
-/*
- * print ascii version of current date and time
- */
-date() {
-    char *at;
-    static char db[30];
-    int i;
-
-    time(&i);
-    at = asctime(localtime(&i));
-    strcpy(db, at+4);
-    db[20] = 0;
-    puts(db);
-    }
+FUNKY+STUFF+
chmod u=rw,g=r,o=r kmem.c
ls -l kmem.c
exit 0
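
A note on the (long) cast in the vm_page.c diff, for anyone wondering why one cast matters so much. The following is a minimal user-space sketch, not kernel code. The function names and the plain int parameters are hypothetical stand-ins for rp->p_rssize, saferss and rp->p_slptime, and the exact kernel types of the latter two are an assumption here. It only illustrates the C conversion rule: when one side of "<" is unsigned long, a negative right-hand side is converted to a very large unsigned value, so the "protect this process's data pages" test succeeds for any process idle longer than saferss seconds.

```c
#include <assert.h>

/*
 * Stock comparison, modeled on the unpatched line
 *     rp->p_rssize < saferss - rp->p_slptime
 * With an unsigned long on the left, C converts the right-hand side to
 * unsigned long too.  The explicit cast below only spells out that
 * implicit conversion.  Returns 1 when data pages would be protected.
 */
int protected_stock(unsigned long rssize, int saferss, int slptime)
{
    return rssize < (unsigned long)(saferss - slptime);
}

/* Patched comparison: casting the left side to long keeps the test signed. */
int protected_fixed(unsigned long rssize, int saferss, int slptime)
{
    return (long)rssize < (long)(saferss - slptime);
}

/* A few hypothetical cases, checked with assert. */
void check_examples(void)
{
    /* 100 resident pages, idle 20s, saferss 6: 6-20 is negative. */
    assert(protected_stock(100, 6, 20) == 1);   /* wrongly protected */
    assert(protected_fixed(100, 6, 20) == 0);   /* pages may be taken */

    /* saferss 127 workaround on a stock kernel: never negative. */
    assert(protected_stock(100, 127, 20) == 1);  /* 100 < 107, still protected */
    assert(protected_stock(100, 127, 127) == 0); /* eventually released */
}
```

Calling check_examples() from a small main() should pass silently. The saferss=127 cases show why the workaround is slower than the fixed kernel: a process with 100 resident pages stays protected until it has slept long enough for 127 minus its sleep time to drop to 100 or below.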
ggm@brolga.cc.uq.oz.au (George Michaelson) (05/01/91)
corey@milton.u.washington.edu (Corey Satten) writes:

>(This is cross-posted to unix-wizards because it may also apply to 4.3BSD)
>          Performance Tuning a DEC Ultrix 4.1 Workstation
>                            Round 2
>              Corey Satten, corey@cac.washington.edu

This is absolutely fascinating reading. For the faint of heart (like me): is there even a REMOTE chance that DEC's Ultrix development people will fold in similar functionality sometime soon? I would really rather run something passed to me via the CD, since if it breaks I get to complain.

You've done your work on Ultrix 4.1, and 4.2 is already on some people's machines... what sort of timescale for a 4.2[ABCD] upgrade might there be?

....probably I'm asking for a DECtype to reply...

	-George
-- 
G.Michaelson
Internet: G.Michaelson@cc.uq.oz.au   Phone: +61 7 365 4079
Postal: George Michaelson, the Prentice Centre,
        The University of Queensland, St Lucia, QLD Australia 4072.
mogul@wrl.dec.com (Jeffrey Mogul) (05/02/91)
In article <1991Apr30.235901.3132@brolga.cc.uq.oz.au> ggm@brolga.cc.uq.oz.au (George Michaelson) writes:
>>          Performance Tuning a DEC Ultrix 4.1 Workstation
>>                            Round 2
>>              Corey Satten, corey@cac.washington.edu
>
>This is absolutely fascinating reading. For the faint of heart (like me)
>Is there even a REMOTE chance that DECs Ultrix development people will
>fold in similar functionality sometime soon? I would really rather run
>something passed into me via the CD, since if it breaks I get to complain.
>
>....probably I'm asking for a DECtype to reply...

I'm a "DECtype", but I'm in a research group, so you can't believe what I say.

I took the liberty of forwarding the "Performance Tuning" paper to the people in the Ultrix group who are working on VM. They will probably find it fascinating reading, too.

Whether these changes make it into Ultrix is an interesting question. One of the many potential benefits of a switch to an OSF/1 kernel is that it comes with the Mach virtual memory system. I suspect that Corey Satten's discoveries won't apply there, since the Mach VM system has little or no connection with the BSD VM system that he has been fixing. Of course, it might have interesting new bugs.

-Jeff