gordoni@chook.ua.oz.au (Gordon Irlam) (07/09/90)
A Guide to Sun-4 Virtual Memory Performance
===========================================

Gordon Irlam, Adelaide University.
(gordoni@cs.ua.oz.au or gordoni@chook.ua.oz.au)

Throughput on a Sparcstation drops substantially once the amount of
active virtual memory exceeds 16M, and by the time it reaches 25M the
machine can be running up to 10 times slower than normal.  This is the
conclusion I reach from running a simplistic test program on an
otherwise idle Sparcstation.

Note that the limit involves the amount of ACTIVE virtual memory used.
Additional virtual memory may be consumed by processes that remain
idle without incurring any penalty.  (SunOS usually steals pages from
idle processes, so strictly speaking such memory is not normally
considered to be part of the virtual memory consumed.)  Also note that
it is 16M of active VIRTUAL memory.  Dynamically linked libraries,
shared text segments, and copy-on-write forking mean that the amount
of PHYSICAL memory used could conceivably be as little as half this
value.  I would guess that any physical memory added to a typical
Sparcstation beyond around 14M will effectively only be used as a disk
cache.

This problem exists on all Sun-4 systems.  It is a result of poorly
designed MMU hardware, and of the failure of the operating system to
attempt to minimize the effects of that design.  Sun-4s have a fixed
amount of memory that can be used for storing page tables; on
Sparcstations in particular this memory area is far too small.

This posting quantifies to some extent the performance losses
resulting from the Sun-4 memory management subsystem, describes the
cause of the problem, and suggests work-arounds that may be useful in
overcoming some of its worst effects.  It is based in part on a
previous posting on the subject and the helpful responses received -
many thanks.

1. Sparcstation Virtual Memory Performance
------------------------------------------

The following table shows the throughput of a Sparcstation-1 as a
function of active virtual memory.  The program used to obtain these
figures is included at the end of this posting.  The program forks
several times, and each child spends its life sequentially accessing
pages of a shared 2M data segment over and over again.  Forking and
the use of a shared data segment allow the test program to be run on a
machine with very little physical memory, but otherwise do not
significantly affect the results obtained.

The first two columns show a sudden performance drop beyond 16M.  The
remaining columns contain raw data that can be used to understand what
is happening.

    virtual   relative   elapsed    user     system   translation   swap
    memory    speed      time       time     time     faults        ins
    (Mb)                 (sec)      (sec)    (sec)

       2       1.00         3.5      2.7       0.8         1224       1
       4       1.09         6.4      5.3       1.1         1840       1
       6       1.14         9.2      8.1       1.2         2442       0
       8       1.15        12.2     10.7       1.4         2729       0
      10       1.17        15.0     13.3       1.7         3381       0
      12       1.17        18.0     16.1       1.9         4121       0
      14       1.12        21.8     19.6       2.1         5275       0
      16       1.08        25.9     22.6       3.1         8746       2
      18       0.57        55.3     29.1      25.9        98251       6
      20       0.40        87.7     34.4      53.0       200296       7
      22       0.25       151.3     41.8     109.0       406885      12
      24       0.11       388.3     61.9     325.3      1202899      20
      26       0.12       371.9     62.6     304.5      1118388      22
      28       0.06       764.8     91.8     655.4      2412144      39
      30       0.03      1607.1    156.3    1446.2      5316313      56
      32       0.02      2601.0    221.5    2373.1      8665839      88

Note that the test program is designed to illustrate the nature of the
virtual memory problem in a simple fashion, not to provide realistic
estimates of expected system performance.  More realistic performance
estimates can be made after taking into account the issues raised in
sections 3 and 4 below.  In particular, the performance of a real
system will probably not degrade as rapidly as shown in the above
table.
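In outline, each child in the test program executes a loop of the
following shape (this fragment is lifted from the full listing in
section 6; STEP_SIZE is the page size and blank is the shared 2M
segment):

    for (count = 0; count < LOOP_COUNT; count++)
        for (i = 0; i < PROCESS_SIZE; i += STEP_SIZE)
            if (blank[i] != 0)          /* never true; defeats the optimizer */
                fprintf(stderr, "Optimizer food.\n");

Each pass touches one byte in every page of the segment, so a child's
entire address space stays active for its whole life.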
From this table it can clearly be seen that once the amount of active
virtual memory exceeds 16M the system suddenly finds itself having to
handle an incredibly large number of page faults.  This causes a
drastic increase in the amount of system time consumed, which results
in a devastating drop in throughput.  I emphasize that the machine
does not run out of physical memory at 16M.  It has plenty of free
memory during all of the tests - the free list is several megabytes in
size, and the machine does not page fault to disk.

2. A Few Minor Details
----------------------

This section can be skipped.

The first few figures show a relative speed slightly greater than
1.00.  This is because the cost of invoking the initial image is
amortized over a greater number of processes.

When the tests were repeated, those with a very low throughput
produced figures that varied by around 30%.  The slightest
perturbation of the machine when it is very heavily loaded
significantly alters the elapsed time.  Where a test was run several
times, the figures presented above are those with the smallest elapsed
time.

The amount of user time consumed grows at a faster rate beyond 16M of
active virtual memory than below it.  This may be a result of
inaccuracies in the process accounting subsystem.  Alternatively it
could be some sort of user-level cost resulting from context
invalidations.

The swapping figures are not significant.  They are the result of a
strange feature of SunOS.  Once all the page tables for a process's
data segment have been stolen, the process is conceptually swapped.
This involves moving text pages that are not currently shared onto the
free list.  In this case no such pages exist.  But even if they did,
no disk activity would occur, because the free list has plenty of
space.  On a real system this quirk typically adds significantly to
the performance degradation that occurs once the virtual memory limit
has been exceeded.

The possibility that the sudden increase in system time beyond 16M is
a result of context switching can be discounted by running similar
tests in which each process uses 4M instead of 2M.  A sudden
performance drop will be observed at around 20M.  This figure is
slightly higher than 16M because fewer page tables are wasted mapping
less than the maximum possible amount of memory.

The above figures were obtained under SunOS 4.0.3; subsequent
measurements have shown that essentially identical results are
obtained under SunOS 4.1.

3. Implications for a Real System
---------------------------------

The amount of active virtual memory at which a sudden drop in
throughput occurs, and the severity of the drop, should not be viewed
as precise parameters of the system.  In a real system the observed
performance will be heavily dependent on process sizes, memory access
patterns, and context switching patterns.  For instance, the elapsed
time given above for 32M of active virtual memory would have been five
times larger if every data access had resulted in a page fault.
Alternatively, on a real system, locality of address references could
have had the opposite effect and reduced the elapsed time by a factor
of 5.
The context switching rate has a significant effect on the performance
obtained when the system is short of pmegs, since it determines how
long a process will be given to run before having its pmegs stolen
from it.  If the context switching rate is too high, processes will
get very little useful work done: they will spend all their time
faulting on the pages of their resident sets, and never get a chance
to execute while all the pages of their resident sets are resident.

Because the performance losses are a function of the amount of virtual
memory used, dynamically linked libraries, shared code pages, and
copy-on-write forking mean that it is possible for these problems to
occur on a machine with substantially less physical memory than the
16M of virtual memory at which the problem starts to occur.  On the
other hand, locality of reference will reduce the severity of the
problem.  Large scientific applications that don't display much data
reference locality will be an exception.

The impression I have is that virtual memory performance will not
normally be a serious problem on a Sparcstation with less than 16M of
physical memory; with between 16M and 32M it could be a problem,
depending upon the job mix; and it will almost certainly be a serious
problem on any Sparcstation with 32M or more.  If it isn't a problem
on a machine with 32M or more, you have almost certainly wasted your
money buying the extra memory, as you do not appear to be using it.
[It's a sorry tale to go out and buy lots of memory to stop a system
thrashing, install it, turn the machine on, and find the system still
thrashes - but thanks to the large disk cache you have just installed,
it is now able to do so at previously unheard of rates.]

A giveaway indication that the virtual memory system is a problem on a
running system is the presence of swapping, as shown by "vmstat -S 5",
but with a free list of perhaps a megabyte or more in size.  This
swapping does not involve real disk traffic.  Pages are simply being
moved back and forth onto the free list.  Note that if you are only
running one or two large processes this swapping behavior will
probably not be observed.  Regardless of whether you see this behavior
or not, vmstat should also be showing the system spending most of its
time in system mode.

The ratio of user time to system time obtained using vmstat should
give you a rough estimate of the cost associated with the virtual
memory management problems.  You can get a more accurate estimate by
looking at the number of translation faults (8.7 million in the
previous table) and the time taken to handle them (2400 seconds), and
then computing the time taken to handle a single fault (280us).  Now
look at the hatcnt data structure in the kernel using adb:

    # adb -k /vmunix /dev/mem
    physmem 17f4
    hatcnt/8D
    _hatcnt:
    _hatcnt:        2129059   2034884   19942909   3173659
                    2685512   0         0          0
    $q
    #

The 4th word is the total number of pmeg allocations (see below) since
the system was booted (3173659), while the 5th word is the number of
pmeg allocations that stole a pmeg from another process (2685512).
Estimating, say, 32 faults per stolen pmeg allocation, you can work
out the total time the system has spent handling these faults (7
hours).  This time can then be compared to the total amount of time
the system has been up (48 hours).  On a non-Sparcstation Sun-4 you
should estimate around 16 faults per stolen pmeg allocation, rather
than 32.
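As a worked example, the estimate just described can be expressed as a
few lines of C.  The constants are the figures quoted above - the 32M
row of the table in section 1 and the hatcnt output - so treat this
purely as an illustration and plug in the numbers from your own
system.

    /* Estimate time spent handling translation faults, using the
     * figures quoted in the text.  Substitute your own vmstat/adb
     * values. */
    #include <stdio.h>

    main()
    {
        double faults = 8665839.0;   /* translation faults, 32M row */
        double sys_time = 2373.1;    /* system time for that run (sec) */
        double per_fault = sys_time / faults;   /* roughly 280us */
        double stolen = 2685512.0;   /* 5th hatcnt word: stolen pmegs */
        double per_steal = 32.0;     /* ~32 faults per steal on a
                                      * Sparcstation, ~16 elsewhere */
        double total = stolen * per_steal * per_fault;

        printf("%.0f us per fault\n", per_fault * 1000000.0);
        printf("%.1f hours handling faults\n", total / 3600.0);
        return 0;
    }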
4. The Sun-4 Memory Management Architecture
-------------------------------------------

The 4/460 has a three-level address translation scheme; all other
Sun-4 machines have a two-level scheme.  Sparcstations have 4k pages;
all other machines have 8k pages.  The level 2 page tables (level 3
tables on the 4/460) are referred to by Sun as page management entry
groups, or simply pmegs.  Each pmeg on a Sparcstation contains 64
entries, and since the pages are 4k in size this means that a single
pmeg can map up to 256k of virtual memory.  On all other Sun-4
machines the pmegs contain 32 entries, but the page size is 8k, so
once again a single pmeg can map up to 256k.

Most systems use high speed static RAM to cache individual page table
entries and hence speed up address translations.  This is not done on
Sun-4s.  Instead, all page tables (pmegs) are permanently stored in
high speed static RAM.  This results in address translation hardware
that is both simple and reasonably fast.  The downside, however, is
that the number of pmegs that can be stored is limited by the amount
of static RAM available.  On the Sparcstations the static RAM can
store up to 128 pmegs, giving a total mapping of up to 32M.  A 4/1xx
or 4/3xx can map up to 64M, a 4/2xx can map up to 128M, and a 4/4xx
can map up to 256M of virtual memory.

32M is the maximum amount of virtual memory that can be mapped on a
Sparcstation; however, since a pmeg can only be used to map pages
within a single contiguous 256k-aligned range of virtual addresses,
the amount of virtual memory actually mapped when a machine runs out
of pmegs will be substantially less.  This is particularly evident
when it is realized that separate pmegs will be assigned to map the
text, data, and stack sections of each process, and some of these will
probably be much smaller than 256k.  Currently under SunOS pmegs are
never shared between processes, even if they map identical virtual
addresses to identical physical addresses, as could be the case with a
common text segment.  Dynamically linked libraries are also probably
bad in this respect, as they will require several pmegs per process,
whereas if the process were statically linked the number of pmegs
consumed would be reduced, because pmegs would only be consumed
mapping in the routines that are actually used.

When a process needs to access a page that is not referenced by any of
the pmegs that are currently being stored, and no free pmegs exist, it
steals a pmeg belonging to another process.  When the other process
next goes to access a page contained in this pmeg, it will get a
translation fault and also have to steal a pmeg from some other
process.  Having got the pmeg back, however, all the page table
entries associated with that pmeg will have been marked invalid, and
thus the process will receive additional address translation faults as
it goes to access each of the 64 pages that are associated with the
pmeg (32 pages on a machine other than a Sparcstation).

The problem is compounded by SunOS swapping out processes whose
resident set size is zero.  If all the pmegs belonging to a process
get stolen from it, the kernel determines that the process's resident
set size is zero, and promptly swaps the process out.  Fortunately
this swapping only involves moving all of the process's pages onto the
free list, and not to disk.  But the CPU load associated with doing
this appears to be substantial, and there is no obvious justification
for doing it.
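To make the pmeg arithmetic of this section concrete, here is a small
illustrative program using the Sparcstation figures given above.  The
example segment placement is hypothetical; the point is that pmegs are
consumed per 256k-aligned region, not per byte actually mapped.

    /* Illustrative pmeg arithmetic (Sparcstation figures from the
     * text).  The example segment below is hypothetical. */
    #include <stdio.h>

    #define PAGE_SIZE 4096       /* Sparcstation; 8192 on other Sun-4s */
    #define PTES_PER_PMEG 64     /* 32 on other Sun-4s */
    #define PMEGS 128            /* pmegs held in the MMU's static RAM */

    main()
    {
        long per_pmeg = (long)PAGE_SIZE * PTES_PER_PMEG;     /* 256k */
        long start = 0x20000;    /* hypothetical 192k segment at 128k */
        long len = 0x30000;
        long used = (start + len - 1) / per_pmeg
                    - start / per_pmeg + 1;

        printf("one pmeg maps %ldk\n", per_pmeg / 1024);
        printf("%d pmegs map at most %ldM\n", PMEGS,
            PMEGS * per_pmeg / (1024 * 1024));
        /* The segment straddles a 256k boundary, so it consumes 2
         * pmegs even though it is only 192k long. */
        printf("the example segment consumes %ld pmegs for %ldk\n",
            used, len / 1024);
        return 0;
    }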
5. Working Around the Problem
-----------------------------

Although the problems with the Sun-4 MMU hardware architecture
probably can't be completely overcome by modifying SunOS, a number of
actions can probably be taken to diminish their effect.

Applications that have a large virtual address space and whose working
set is spread out in a sparse manner are problematic for the Sun-4
memory management architecture, and the only alternatives may be to
upgrade to a more expensive model, or to switch to a different make of
computer.  Large numerical applications, certain Lisp programs, and
large database applications are the most likely candidates.

A reasonable solution to the problem in many cases would be for SunOS
to keep a copy of all pmegs in software.  Alternatively, a cache of
the active pmegs could be kept.  In either case, when a page belonging
to a stolen group is next accessed, the entire pmeg can be loaded,
instead of each page in the pmeg causing a fault and being
individually loaded.  Doing this would probably involve between 50 and
100 lines of source code.  The sketch below illustrates the idea.
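A minimal sketch of what such a software pmeg cache might look like
follows.  All names and helper routines here are hypothetical - this
is not SunOS source, just an illustration of the fault path reloading
a whole group at once.

    /* Hypothetical sketch of a software pmeg cache.  None of these
     * names exist in SunOS; the extern routines are assumed
     * primitives. */

    #define PTES_PER_PMEG 64          /* Sparcstation; 32 elsewhere */

    struct soft_pmeg {
        int valid;                    /* do we hold a copy of this group? */
        unsigned int pte[PTES_PER_PMEG];  /* saved page table entries */
    };

    /* Assumed primitives: find the software copy covering the 256k
     * region of a process's address space containing vaddr, and load
     * either a whole group or a single mapping into the MMU. */
    extern struct soft_pmeg *soft_pmeg_lookup();  /* (pid, vaddr) */
    extern void hw_load_pmeg();                   /* (vaddr, ptes) */
    extern void hw_load_pte();                    /* (vaddr) */

    void
    translation_fault(pid, vaddr)
    int pid;
    unsigned vaddr;
    {
        struct soft_pmeg *sp = soft_pmeg_lookup(pid, vaddr);

        if (sp != 0 && sp->valid)
            /* Reload all 64 mappings in one fault, instead of taking
             * a separate fault for each page in the stolen group. */
            hw_load_pmeg(vaddr & ~(PTES_PER_PMEG * 4096 - 1), sp->pte);
        else
            hw_load_pte(vaddr);   /* existing one-page-at-a-time path */
    }

Keeping one software copy per 256k region of each process makes
invalidation on unmap straightforward; a cache shared between
processes would need more bookkeeping.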
It may also be desirable for pmegs to be shared between processes when
both pmegs map the same virtual addresses to the same physical
addresses.  This would be useful for dynamically linked libraries and
shared text sections.  Unfortunately this is probably difficult to do
given SunOS's current virtual memory software architecture.

Until a solution similar to the one proposed above is available, a
number of other options can be used as stop-gap measures.  None of
these solutions is entirely satisfactory, and depending on your job
mix, in certain circumstances they could conceivably make the
situation worse.

Preventing "eager swapping" will significantly improve performance in
many cases.  Despite this swapping not necessarily involving any disk
traffic, we noticed a significant improvement on our machines when we
did this; the response time during times of peak load probably
improved by between a factor of 5 and 10.

The simplest way to prevent eager swapping is to prevent all swapping.
This is quite acceptable provided you have sufficient physical memory
to keep the working sets of all active processes resident.

    # adb -w /vmunix
    nosched?W 1
    $q
    # reboot

Or use "adb -w /sys/sun4c/OBJ/vm_sched.o" to fix any future kernels
you may build.

A better solution is to prevent only eager swapping, although doing
this is slightly more complex.  The following example shows how to do
this for a Sparcstation under SunOS 4.0.3.  The offset will probably
differ slightly on a machine other than a Sparcstation, or under SunOS
4.1, although hopefully not by too much.

    # adb -w /sys/sun4c/OBJ/vm_sched.o
    sched+1f8?5i
    _qs+0xf8:       orcc    %g0, %o0, %g0
                    bne,a   _qs + 0x1b8
                    ld      [%i5 + 0x8], %i5
                    ldsb    [%i5 + 0x1e], %o1
                    sethi   %hi(0x0), %o3
    sched+1f8?W 80a02001
    _qs+0xf8:       0x80900008      =       0x80a02001
    sched+1f8?i
    _qs+0xf8:       cmp     %g0, 0x1
    $q
    # "rebuild kernel and reboot"

Another important thing to do is to try to minimize the number of
context switches that are occurring.  How to do this will depend
heavily on the applications you are running.  Make sure you consider
the effect of trivial applications such as clocks and load meters.
These can significantly increase the context switching rate, and
consume valuable pmegs.  As a rough guide, when a machine is short of
pmegs, up to 40 context switches per second will probably be
acceptable on a Sparcstation, while a larger machine should be able to
cope with maybe 100 or 200 context switches per second.  These figures
will depend on the number of pmegs consumed by the average program,
the number of pmegs the machine has, and whether context switching is
occurring between the same few programs or amongst different programs.
The above values are based on observations of machines that are mainly
used to run a large number of reasonably small applications.  They are
probably not valid for a site that runs a few large applications.

Finally, if you are using a Sparcstation to support many copies of a
small number of programs that are dynamically linked to large
libraries, you might want to try building them to use static
libraries.  For instance, this would be the case if you are running 10
or more xterms, clocks, or window managers on the one machine.  The
benefit here is that page management groups won't be wasted mapping
routines into processes that never call them.  The libraries have to
be reasonably large for you to gain any benefit from doing this, since
only 1 page management group per process will be saved for every 256k
of library code.  And you need to be running multiple copies of the
program, so that the physical memory cost incurred by not sharing the
library code used by other applications is amortized amongst the
multiple instances of this application.

6. A Small Test Program
-----------------------

The test program below can be run without any problems on any Sun-4
system with 8M of physical memory or more.  Indeed, it will probably
work on a Sun-4 system with as little as 4M.  The test program is
intended to illustrate in a simple fashion the nature of the problem
with the Sun-4 virtual memory subsystem.  It is not intended to be
used to measure the performance of a Sun-4 under typical conditions.
It should, however, allow you to get a rough feel for the amount of
active virtual memory that you will typically be able to use.

When running the test program make sure no-one else is using the
system.  Then run "vmtest 64" and "vmstat -S 5" concurrently for a
minute or two to make sure that no paging to disk is occurring - "fre"
should be greater than 256, and po and sr should both be 0.  Note that
swapping to the free list may be occurring due to the previously
mentioned quirk of SunOS.  Once the system has settled down, kill
vmtest and vmstat.

To run the test program use the command "time vmtest n", where n is
the amount of virtual memory to use in megabytes.  More detailed
information can be determined by using a command similar to the
following and comparing the output of vmstat before and after each
test.

    $ for mb in 2 4 8 16 32 64
    do
        (echo "=== $mb megabytes ==="; vmstat -s; time vmtest $mb; vmstat -s) >> vm_stat.results
    done
    $

You may want to alter the process size or context switching rate to
see what sort of effects these have on the results.  Larger processes
mean that fewer pmegs are wasted mapping less than a full address
range; hence the amount of active virtual memory that can be used
before problems start to show up will increase.  A faster context
switching rate will reduce the amount of time a process gets to
execute before being descheduled.  If there is a pmeg shortage, by the
time the process is next scheduled many or all of its pmegs will be
gone.  Adjusting the context switching rate to a typical value seen on
your system may be informative.

/* vmtest.c, test Sun-4 virtual memory performance.
 *
 * Gordon Irlam (gordoni@cs.ua.oz.au), June 1990.
 *
 * Compile: cc -O vmtest.c -o vmtest
 * Run:     time vmtest n
 *          (Test performance for n megabytes of active virtual memory,
 *          n should be even.)
 */

#include <stdio.h>
#include <sys/wait.h>

#define STEP_SIZE 4096               /* Will step on every page. */
#define MEGABYTE 1048576
#define LOOP_COUNT 5000
#define PROCESS_SIZE (2 * MEGABYTE)
#define CONTEXT_SWITCH_RATE 50

char blank[PROCESS_SIZE];            /* Shared data. */

main(argc, argv)
int argc;
char *argv[];
{
    int size, proc_count, pid, proc, count, i;

    if (argc != 2 || sscanf(argv[1], "%d", &size) != 1 || size > 500) {
        fprintf(stderr, "Usage: %s size\n", argv[0]);
        exit(1);
    }

    /* Touch zero fill pages so that they will be shared upon forking. */
    for (i = 0; i < PROCESS_SIZE; i += STEP_SIZE)
        blank[i] = 0;

    /* Fork several times creating processes that will use the memory.
     * Children will go into a loop accessing each of their pages in
     * turn. */
    proc_count = size * MEGABYTE / PROCESS_SIZE;
    for (proc = 0; proc < proc_count; proc++) {
        pid = fork();
        if (pid == -1)
            fprintf(stderr, "Fork failed.\n");
        if (pid == 0) {
            for (count = 0; count < LOOP_COUNT; count++)
                for (i = 0; i < PROCESS_SIZE; i += STEP_SIZE)
                    if (blank[i] != 0)
                        fprintf(stderr, "Optimizer food.\n");
            exit(0);
        }
    }

    /* Loop waiting for children to exit.  Don't block; instead sleep
     * for short periods of time so as to create a realistic context
     * switch rate. */
    proc = proc_count;
    while (proc > 0) {
        usleep(2 * (1000000 / CONTEXT_SWITCH_RATE));
        if (wait3(0, WNOHANG, 0) != 0)
            proc--;
    }
}
jblind@griffith.eng.sun.com (Joanne Blind-Griffith) (07/25/90)
Sun Microsystems' Response to A Guide to Sun-4 Virtual Memory Performance

Joanne Blind-Griffith, Product Manager, Sun Microsystems.

The recent Sun-Spots posting by Gordon Irlam is essentially accurate
in describing the hardware limitations of the Sun MMU.  As he points
out, whether this limitation is encountered on any particular machine
depends on which Sun hardware is involved and what sort of
applications are being used.  It is our experience that this
limitation is rarely encountered with applications which show typical
locality of reference.  Most common applications and job mixes will
never encounter this limit.  However, some very large applications,
and some applications which share memory between many processes, will
encounter it.

The Sun MMU design results in a very fast MMU with a minimum of
hardware.  The Sun MMU is best thought of as a cache for
virtual-to-physical mappings.  As with all caches, it was designed to
be large enough for the sort of typical applications to be run on the
machine.  Nearly all applications achieve a very high hit rate on this
cache.  However, like any cache, there are applications that will
exceed its capacity, greatly lowering the hit rate.  Since this cache
(i.e., the Sun MMU) is loaded by software, the cost of a cache miss
can be quite expensive.

We have improved the algorithms that manage the Sun MMU.  The
improvement involves adding another level of caching between the MMU
management software and the upper levels of the kernel.  This is a
classic space/time tradeoff, where a little bit of space for this
software cache saves a lot of time in reloading the MMU for those
applications which exceed the hardware limits of the MMU.  In
addition, many other changes have been made to the MMU management
software to improve performance in general and to reduce the effects
of some worst case behaviour.

Following are the test results using Gordon's vmtest program run on a
12MB SPARCstation 1+ with the improved MMU management software:

    virtual   elapsed   user    system
    memory    time      time    time
    (MB)      (sec)     (sec)   (sec)

       2         2        2.3     0.6
       4         5        4.7     0.8
       8        10        9.4     1.1
      10        13       11.9     1.2
      12        16       14.3     1.4
      14        18       16.8     1.5
      16        21       19.5     1.7
      18        25       22.5     1.9
      20        27       25.3     2.0
      22        30       27.6     2.2
      24        33       30.4     2.5
      26        36       33.3     2.5
      28        39       35.7     2.7
      30        41       38.1     2.9
      32        44       40.8     3.1

Note that the performance is essentially linear through 32MB.

This improved MMU management software will be included in the next
release of SunOS.  It will be available as a patch for SunOS 4.1
(Sun4c and Sun4 platforms) and 4.1 PSR A at the end of July, and for
SunOS 4.0.3c (Sun4c machines) in early August.