gordoni@chook.ua.oz.au (Gordon Irlam) (07/09/90)
A Guide to Sun-4 Virtual Memory Performance
===========================================

Gordon Irlam, Adelaide University.
(gordoni@cs.ua.oz.au or gordoni@chook.ua.oz.au)

Throughput on a Sparcstation drops substantially once the amount of
active virtual memory exceeds 16M, and by the time it reaches 25M the
machine can be running up to 10 times slower than normal.  This is the
conclusion I reach from running a simplistic test program on an
otherwise idle Sparcstation.

Note that the limit involves the amount of ACTIVE virtual memory used.
Additional virtual memory may be consumed by processes that remain
idle without incurring any penalty.  (SunOS usually steals pages from
idle processes, so strictly speaking such memory is not normally
considered to be part of the virtual memory consumed.)  Also note that
it is 16M of active VIRTUAL memory.  Dynamically linked libraries,
shared text segments, and copy-on-write forking mean that the amount
of PHYSICAL memory used could conceivably be as little as half this
value.  I would guess that any physical memory added to a typical
Sparcstation beyond around 14M will effectively only be used as a disk
cache.

This problem exists on all Sun-4 systems.  It is a result of poorly
designed MMU hardware, and of the failure of the operating system to
attempt to minimize the effects of that design.  Sun-4s have a fixed
amount of memory that can be used for storing page tables; on
Sparcstations in particular this memory area is far too small.

This posting quantifies to some extent the performance losses
resulting from the Sun-4 memory management subsystem, describes the
cause of the problem, and suggests work-arounds that may be useful in
overcoming some of its worst effects.  It is based in part on a
previous posting on the subject and the helpful responses received -
many thanks.

1. Sparcstation Virtual Memory Performance
------------------------------------------

The following table shows the throughput of a Sparcstation-1 as a
function of active virtual memory.  The program used to obtain these
figures is included at the end of this posting.  The program forks
several times, and each child spends its life sequentially accessing
pages of a shared 2M data segment over and over again.  Forking and
the use of a shared data segment allow the test program to be run on a
machine with very little physical memory, but otherwise do not
significantly affect the results obtained.

The first two columns show a sudden performance drop beyond 16M.  The
remaining columns contain raw data that can be used to understand what
is happening.

    virtual   relative   elapsed    user     system   translation   swap
    memory    speed      time       time     time     faults        ins
    (Mb)                 (sec)      (sec)    (sec)

       2       1.00         3.5      2.7       0.8         1224       1
       4       1.09         6.4      5.3       1.1         1840       1
       6       1.14         9.2      8.1       1.2         2442       0
       8       1.15        12.2     10.7       1.4         2729       0
      10       1.17        15.0     13.3       1.7         3381       0
      12       1.17        18.0     16.1       1.9         4121       0
      14       1.12        21.8     19.6       2.1         5275       0
      16       1.08        25.9     22.6       3.1         8746       2
      18       0.57        55.3     29.1      25.9        98251       6
      20       0.40        87.7     34.4      53.0       200296       7
      22       0.25       151.3     41.8     109.0       406885      12
      24       0.11       388.3     61.9     325.3      1202899      20
      26       0.12       371.9     62.6     304.5      1118388      22
      28       0.06       764.8     91.8     655.4      2412144      39
      30       0.03      1607.1    156.3    1446.2      5316313      56
      32       0.02      2601.0    221.5    2373.1      8665839      88

Note that the test program is designed to illustrate the nature of the
virtual memory problem in a simple fashion, not to provide realistic
estimates of expected system performance.  More realistic performance
estimates can be made after taking into account the issues raised in
sections 3 and 4 below.  In particular, the performance of a real
system will probably not degrade as rapidly as shown in the above
table.
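In outline, each child in the test program executes a loop of the
following shape (this fragment is lifted from the full listing in
section 6; STEP_SIZE is the page size and blank is the shared 2M
segment):

    for (count = 0; count < LOOP_COUNT; count++)
        for (i = 0; i < PROCESS_SIZE; i += STEP_SIZE)
            if (blank[i] != 0)          /* never true; defeats the optimizer */
                fprintf(stderr, "Optimizer food.\n");

Each pass touches one byte in every page of the segment, so a child's
entire address space stays active for its whole life.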
From this table it can clearly be seen that once the amount of active
virtual memory exceeds 16M the system suddenly finds itself having to
handle an incredibly large number of page faults.  This causes a
drastic increase in the amount of system time consumed, which results
in a devastating drop in throughput.  I emphasize that the machine
does not run out of physical memory at 16M.  It has plenty of free
memory during all of the tests - the free list is several megabytes in
size, and the machine does not page fault to disk.

2. A Few Minor Details
----------------------

This section can be skipped.

The first few figures show a relative speed slightly greater than
1.00.  This is because the cost of invoking the initial image is
amortized over a greater number of processes.

When the tests were repeated, those with a very low throughput
produced figures that varied by around 30%.  The slightest
perturbation of the machine when it is very heavily loaded
significantly alters the elapsed time.  Where a test was run several
times, the figures presented above are those with the smallest elapsed
time.

The amount of user time consumed grows at a faster rate beyond 16M of
active virtual memory than below it.  This may be a result of
inaccuracies in the process accounting subsystem.  Alternatively it
could be some sort of user-level cost resulting from context
invalidations.

The swapping figures are not significant.  They are the result of a
strange feature of SunOS.  Once all the page tables for a process's
data segment have been stolen, the process is conceptually swapped.
This involves moving text pages that are not currently shared onto the
free list.  In this case no such pages exist.  But even if they did,
no disk activity would occur, because the free list has plenty of
space.  On a real system this quirk typically adds significantly to
the performance degradation that occurs once the virtual memory limit
has been exceeded.

The possibility that the sudden increase in system time beyond 16M is
a result of context switching can be discounted by running similar
tests in which each process uses 4M instead of 2M.  A sudden
performance drop will be observed at around 20M.  This figure is
slightly higher than 16M because fewer page tables are wasted mapping
less than the maximum possible amount of memory.

The above figures were obtained under SunOS 4.0.3; subsequent
measurements have shown that essentially identical results are
obtained under SunOS 4.1.

3. Implications for a Real System
---------------------------------

The amount of active virtual memory at which a sudden drop in
throughput occurs, and the severity of the drop, should not be viewed
as precise parameters of the system.  In a real system the observed
performance will be heavily dependent on process sizes, memory access
patterns, and context switching patterns.  For instance, the elapsed
time given above for 32M of active virtual memory would have been five
times larger if every data access had resulted in a page fault.
Alternatively, on a real system, locality of address references could
have had the opposite effect and reduced the elapsed time by a factor
of 5.
The context switching rate has a significant effect on the performance
obtained when the system is short of pmegs, since it determines how
long a process will be given to run before having its pmegs stolen
from it.  If the context switching rate is too high, processes will
get very little useful work done: they will spend all their time
faulting on the pages of their resident sets, and never get a chance
to execute while all the pages of their resident sets are resident.

Because the performance losses are a function of the amount of virtual
memory used, dynamically linked libraries, shared code pages, and
copy-on-write forking mean that it is possible for these problems to
occur on a machine with substantially less physical memory than the
16M of virtual memory at which the problem starts to occur.  On the
other hand, locality of reference will reduce the severity of the
problem.  Large scientific applications that don't display much data
reference locality will be an exception.

The impression I have is that virtual memory performance will not
normally be a serious problem on a Sparcstation with less than 16M of
physical memory; with between 16M and 32M it could be a problem,
depending upon the job mix; and it will almost certainly be a serious
problem on any Sparcstation with 32M or more.  If it isn't a problem
on a machine with 32M or more, you have almost certainly wasted your
money buying the extra memory, as you do not appear to be using it.
[It's a sorry tale to go out and buy lots of memory to stop a system
thrashing, install it, turn the machine on, and find the system still
thrashes - but thanks to the large disk cache you have just installed,
it is now able to do so at previously unheard of rates.]

A giveaway indication that the virtual memory system is a problem on a
running system is the presence of swapping, as shown by "vmstat -S 5",
but with a free list of perhaps a megabyte or more in size.  This
swapping does not involve real disk traffic.  Pages are simply being
moved back and forth onto the free list.  Note that if you are only
running one or two large processes this swapping behavior will
probably not be observed.  Regardless of whether you see this behavior
or not, vmstat should also be showing the system spending most of its
time in system mode.

The ratio of user time to system time obtained using vmstat should
give you a rough estimate of the cost associated with the virtual
memory management problems.  You can get a more accurate estimate by
looking at the number of translation faults (8.7 million in the
previous table) and the time taken to handle them (2400 seconds), and
then computing the time taken to handle a single fault (280us).  Now
look at the hatcnt data structure in the kernel using adb:

    # adb -k /vmunix /dev/mem
    physmem 17f4
    hatcnt/8D
    _hatcnt:
    _hatcnt:        2129059   2034884   19942909   3173659
                    2685512   0         0          0
    $q
    #

The 4th word is the total number of pmeg allocations (see below) since
the system was booted (3173659), while the 5th word is the number of
pmeg allocations that stole a pmeg from another process (2685512).
Estimating, say, 32 faults per stolen pmeg allocation, you can work
out the total time the system has spent handling these faults (7
hours).  This time can then be compared to the total amount of time
the system has been up (48 hours).  On a non-Sparcstation Sun-4 you
should estimate around 16 faults per stolen pmeg allocation, rather
than 32.
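As a worked example, the estimate just described can be expressed as a
few lines of C.  The constants are the figures quoted above - the 32M
row of the table in section 1 and the hatcnt output - so treat this
purely as an illustration and plug in the numbers from your own
system.

    /* Estimate time spent handling translation faults, using the
     * figures quoted in the text.  Substitute your own vmstat/adb
     * values. */
    #include <stdio.h>

    main()
    {
        double faults = 8665839.0;   /* translation faults, 32M row */
        double sys_time = 2373.1;    /* system time for that run (sec) */
        double per_fault = sys_time / faults;   /* roughly 280us */
        double stolen = 2685512.0;   /* 5th hatcnt word: stolen pmegs */
        double per_steal = 32.0;     /* ~32 faults per steal on a
                                      * Sparcstation, ~16 elsewhere */
        double total = stolen * per_steal * per_fault;

        printf("%.0f us per fault\n", per_fault * 1000000.0);
        printf("%.1f hours handling faults\n", total / 3600.0);
        return 0;
    }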
4. The Sun-4 Memory Management Architecture
-------------------------------------------

The 4/460 has a three-level address translation scheme; all other
Sun-4 machines have a two-level scheme.  Sparcstations have 4k pages;
all other machines have 8k pages.  The level 2 page tables (level 3
tables on the 4/460) are referred to by Sun as page management entry
groups, or simply pmegs.  Each pmeg on a Sparcstation contains 64
entries, and since the pages are 4k in size this means that a single
pmeg can map up to 256k of virtual memory.  On all other Sun-4
machines the pmegs contain 32 entries, but the page size is 8k, so
once again a single pmeg can map up to 256k.

Most systems use high speed static RAM to cache individual page table
entries and hence speed up address translations.  This is not done on
Sun-4s.  Instead, all page tables (pmegs) are permanently stored in
high speed static RAM.  This results in address translation hardware
that is both simple and reasonably fast.  The downside, however, is
that the number of pmegs that can be stored is limited by the amount
of static RAM available.  On the Sparcstations the static RAM can
store up to 128 pmegs, giving a total mapping of up to 32M.  A 4/1xx
or 4/3xx can map up to 64M, a 4/2xx can map up to 128M, and a 4/4xx
can map up to 256M of virtual memory.

32M is the maximum amount of virtual memory that can be mapped on a
Sparcstation; however, since a pmeg can only be used to map pages
within a single contiguous 256k-aligned range of virtual addresses,
the amount of virtual memory actually mapped when a machine runs out
of pmegs will be substantially less.  This is particularly evident
when it is realized that separate pmegs will be assigned to map the
text, data, and stack sections of each process, and some of these will
probably be much smaller than 256k.  Currently under SunOS pmegs are
never shared between processes, even if they map identical virtual
addresses to identical physical addresses, as could be the case with a
common text segment.  Dynamically linked libraries are also probably
bad in this respect, as they will require several pmegs per process,
whereas if the process were statically linked the number of pmegs
consumed would be reduced, because pmegs would only be consumed
mapping in the routines that are actually used.

When a process needs to access a page that is not referenced by any of
the pmegs that are currently being stored, and no free pmegs exist, it
steals a pmeg belonging to another process.  When the other process
next goes to access a page contained in this pmeg, it will get a
translation fault and also have to steal a pmeg from some other
process.  Having got the pmeg back, however, all the page table
entries associated with that pmeg will have been marked invalid, and
thus the process will receive additional address translation faults as
it goes to access each of the 64 pages that are associated with the
pmeg (32 pages on a machine other than a Sparcstation).

The problem is compounded by SunOS swapping out processes whose
resident set size is zero.  If all the pmegs belonging to a process
get stolen from it, the kernel determines that the process's resident
set size is zero, and promptly swaps the process out.  Fortunately
this swapping only involves moving all of the process's pages onto the
free list, and not to disk.  But the CPU load associated with doing
this appears to be substantial, and there is no obvious justification
for doing it.
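To make the pmeg arithmetic of this section concrete, here is a small
illustrative program using the Sparcstation figures given above.  The
example segment placement is hypothetical; the point is that pmegs are
consumed per 256k-aligned region, not per byte actually mapped.

    /* Illustrative pmeg arithmetic (Sparcstation figures from the
     * text).  The example segment below is hypothetical. */
    #include <stdio.h>

    #define PAGE_SIZE 4096       /* Sparcstation; 8192 on other Sun-4s */
    #define PTES_PER_PMEG 64     /* 32 on other Sun-4s */
    #define PMEGS 128            /* pmegs held in the MMU's static RAM */

    main()
    {
        long per_pmeg = (long)PAGE_SIZE * PTES_PER_PMEG;     /* 256k */
        long start = 0x20000;    /* hypothetical 192k segment at 128k */
        long len = 0x30000;
        long used = (start + len - 1) / per_pmeg
                    - start / per_pmeg + 1;

        printf("one pmeg maps %ldk\n", per_pmeg / 1024);
        printf("%d pmegs map at most %ldM\n", PMEGS,
            PMEGS * per_pmeg / (1024 * 1024));
        /* The segment straddles a 256k boundary, so it consumes 2
         * pmegs even though it is only 192k long. */
        printf("the example segment consumes %ld pmegs for %ldk\n",
            used, len / 1024);
        return 0;
    }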
5. Working Around the Problem
-----------------------------

Although the problems with the Sun-4 MMU hardware architecture
probably can't be completely overcome by modifying SunOS, a number of
actions can probably be taken to diminish their effect.

Applications that have a large virtual address space and whose working
set is spread out in a sparse manner are problematic for the Sun-4
memory management architecture, and the only alternatives may be to
upgrade to a more expensive model, or to switch to a different make of
computer.  Large numerical applications, certain Lisp programs, and
large database applications are the most likely candidates.

A reasonable solution to the problem in many cases would be for SunOS
to keep a copy of all pmegs in software.  Alternatively, a cache of
the active pmegs could be kept.  In either case, when a page belonging
to a stolen group is next accessed, the entire pmeg can be loaded,
instead of each page in the pmeg causing a fault and being
individually loaded.  Doing this would probably involve between 50 and
100 lines of source code.  The sketch below illustrates the idea.
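A minimal sketch of what such a software pmeg cache might look like
follows.  All names and helper routines here are hypothetical - this
is not SunOS source, just an illustration of the fault path reloading
a whole group at once.

    /* Hypothetical sketch of a software pmeg cache.  None of these
     * names exist in SunOS; the extern routines are assumed
     * primitives. */

    #define PTES_PER_PMEG 64          /* Sparcstation; 32 elsewhere */

    struct soft_pmeg {
        int valid;                    /* do we hold a copy of this group? */
        unsigned int pte[PTES_PER_PMEG];  /* saved page table entries */
    };

    /* Assumed primitives: find the software copy covering the 256k
     * region of a process's address space containing vaddr, and load
     * either a whole group or a single mapping into the MMU. */
    extern struct soft_pmeg *soft_pmeg_lookup();  /* (pid, vaddr) */
    extern void hw_load_pmeg();                   /* (vaddr, ptes) */
    extern void hw_load_pte();                    /* (vaddr) */

    void
    translation_fault(pid, vaddr)
    int pid;
    unsigned vaddr;
    {
        struct soft_pmeg *sp = soft_pmeg_lookup(pid, vaddr);

        if (sp != 0 && sp->valid)
            /* Reload all 64 mappings in one fault, instead of taking
             * a separate fault for each page in the stolen group. */
            hw_load_pmeg(vaddr & ~(PTES_PER_PMEG * 4096 - 1), sp->pte);
        else
            hw_load_pte(vaddr);   /* existing one-page-at-a-time path */
    }

Keeping one software copy per 256k region of each process makes
invalidation on unmap straightforward; a cache shared between
processes would need more bookkeeping.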
It may also be desirable for pmegs to be shared between processes when
both pmegs map the same virtual addresses to the same physical
addresses.  This would be useful for dynamically linked libraries and
shared text sections.  Unfortunately this is probably difficult to do
given SunOS's current virtual memory software architecture.

Until a solution similar to the one proposed above is available, a
number of other options can be used as stop-gap measures.  None of
these solutions is entirely satisfactory, and depending on your job
mix, in certain circumstances they could conceivably make the
situation worse.

Preventing "eager swapping" will significantly improve performance in
many cases.  Despite this swapping not necessarily involving any disk
traffic, we noticed a significant improvement on our machines when we
did this; the response time during times of peak load probably
improved by between a factor of 5 and 10.

The simplest way to prevent eager swapping is to prevent all swapping.
This is quite acceptable provided you have sufficient physical memory
to keep the working sets of all active processes resident.

    # adb -w /vmunix
    nosched?W 1
    $q
    # reboot

Or use "adb -w /sys/sun4c/OBJ/vm_sched.o" to fix any future kernels
you may build.

A better solution is to prevent only eager swapping, although doing
this is slightly more complex.  The following example shows how to do
this for a Sparcstation under SunOS 4.0.3.  The offset will probably
differ slightly on a machine other than a Sparcstation, or under SunOS
4.1, although hopefully not by too much.

    # adb -w /sys/sun4c/OBJ/vm_sched.o
    sched+1f8?5i
    _qs+0xf8:       orcc    %g0, %o0, %g0
                    bne,a   _qs + 0x1b8
                    ld      [%i5 + 0x8], %i5
                    ldsb    [%i5 + 0x1e], %o1
                    sethi   %hi(0x0), %o3
    sched+1f8?W 80a02001
    _qs+0xf8:       0x80900008      =       0x80a02001
    sched+1f8?i
    _qs+0xf8:       cmp     %g0, 0x1
    $q
    # "rebuild kernel and reboot"

Another important thing to do is to try to minimize the number of
context switches that are occurring.  How to do this will depend
heavily on the applications you are running.  Make sure you consider
the effect of trivial applications such as clocks and load meters.
These can significantly increase the context switching rate, and
consume valuable pmegs.  As a rough guide, when a machine is short of
pmegs, up to 40 context switches per second will probably be
acceptable on a Sparcstation, while a larger machine should be able to
cope with maybe 100 or 200 context switches per second.  These figures
will depend on the number of pmegs consumed by the average program,
the number of pmegs the machine has, and whether context switching is
occurring between the same few programs or amongst different programs.
The above values are based on observations of machines that are mainly
used to run a large number of reasonably small applications.  They are
probably not valid for a site that runs a few large applications.

Finally, if you are using a Sparcstation to support many copies of a
small number of programs that are dynamically linked to large
libraries, you might want to try building them to use static
libraries.  For instance, this would be the case if you are running 10
or more xterms, clocks, or window managers on the one machine.  The
benefit here is that page management groups won't be wasted mapping
routines into processes that never call them.  The libraries have to
be reasonably large for you to gain any benefit from doing this, since
only 1 page management group per process will be saved for every 256k
of library code.  And you need to be running multiple copies of the
program, so that the physical memory cost incurred by not sharing the
library code used by other applications is amortized amongst the
multiple instances of this application.

6. A Small Test Program
-----------------------

The test program below can be run without any problems on any Sun-4
system with 8M of physical memory or more.  Indeed, it will probably
work on a Sun-4 system with as little as 4M.  The test program is
intended to illustrate in a simple fashion the nature of the problem
with the Sun-4 virtual memory subsystem.  It is not intended to be
used to measure the performance of a Sun-4 under typical conditions.
It should, however, allow you to get a rough feel for the amount of
active virtual memory that you will typically be able to use.

When running the test program make sure no-one else is using the
system.  Then run "vmtest 64" and "vmstat -S 5" concurrently for a
minute or two to make sure that no paging to disk is occurring - "fre"
should be greater than 256, and po and sr should both be 0.  Note that
swapping to the free list may be occurring due to the previously
mentioned quirk of SunOS.  Once the system has settled down, kill
vmtest and vmstat.

To run the test program use the command "time vmtest n", where n is
the amount of virtual memory to use in megabytes.  More detailed
information can be determined by using a command similar to the
following and comparing the output of vmstat before and after each
test.

    $ for mb in 2 4 8 16 32 64
    do
        (echo "=== $mb megabytes ==="; vmstat -s; time vmtest $mb; vmstat -s) >> vm_stat.results
    done
    $

You may want to alter the process size or context switching rate to
see what sort of effects these have on the results.  Larger processes
mean that fewer pmegs are wasted mapping less than a full address
range; hence the amount of active virtual memory that can be used
before problems start to show up will increase.  A faster context
switching rate will reduce the amount of time a process gets to
execute before being descheduled.  If there is a pmeg shortage, by the
time the process is next scheduled many or all of its pmegs will be
gone.  Adjusting the context switching rate to a typical value seen on
your system may be informative.

/* vmtest.c, test Sun-4 virtual memory performance.
 *
 * Gordon Irlam (gordoni@cs.ua.oz.au), June 1990.
 *
 * Compile: cc -O vmtest.c -o vmtest
 * Run:     time vmtest n
 *          (Test performance for n megabytes of active virtual memory,
 *          n should be even.)
 */

#include <stdio.h>
#include <sys/wait.h>

#define STEP_SIZE 4096               /* Will step on every page. */
#define MEGABYTE 1048576
#define LOOP_COUNT 5000
#define PROCESS_SIZE (2 * MEGABYTE)
#define CONTEXT_SWITCH_RATE 50

char blank[PROCESS_SIZE];            /* Shared data. */

main(argc, argv)
int argc;
char *argv[];
{
    int size, proc_count, pid, proc, count, i;

    if (argc != 2 || sscanf(argv[1], "%d", &size) != 1 || size > 500) {
        fprintf(stderr, "Usage: %s size\n", argv[0]);
        exit(1);
    }

    /* Touch zero fill pages so that they will be shared upon forking. */
    for (i = 0; i < PROCESS_SIZE; i += STEP_SIZE)
        blank[i] = 0;

    /* Fork several times creating processes that will use the memory.
     * Children will go into a loop accessing each of their pages in
     * turn. */
    proc_count = size * MEGABYTE / PROCESS_SIZE;
    for (proc = 0; proc < proc_count; proc++) {
        pid = fork();
        if (pid == -1)
            fprintf(stderr, "Fork failed.\n");
        if (pid == 0) {
            for (count = 0; count < LOOP_COUNT; count++)
                for (i = 0; i < PROCESS_SIZE; i += STEP_SIZE)
                    if (blank[i] != 0)
                        fprintf(stderr, "Optimizer food.\n");
            exit(0);
        }
    }

    /* Loop waiting for children to exit.  Don't block; instead sleep
     * for short periods of time so as to create a realistic context
     * switch rate. */
    proc = proc_count;
    while (proc > 0) {
        usleep(2 * (1000000 / CONTEXT_SWITCH_RATE));
        if (wait3(0, WNOHANG, 0) != 0)
            proc--;
    }
}
jblind@griffith.eng.sun.com (Joanne Blind-Griffith) (07/25/90)
Sun Microsystems' Response to A Guide to Sun-4 Virtual Memory Performance

Joanne Blind-Griffith, Product Manager, Sun Microsystems.

The recent Sun-Spots posting by Gordon Irlam is essentially accurate
in describing the hardware limitations of the Sun MMU.  As he points
out, whether this limitation is encountered on any particular machine
depends on which Sun hardware is involved and what sort of
applications are being used.  It is our experience that this
limitation is rarely encountered with applications which show typical
locality of reference.  Most common applications and job mixes will
never encounter this limit.  However, some very large applications,
and some applications which share memory between many processes, will
encounter it.

The Sun MMU design results in a very fast MMU with a minimum of
hardware.  The Sun MMU is best thought of as a cache for
virtual-to-physical mappings.  As with all caches, it was designed to
be large enough for the sort of typical applications to be run on the
machine.  Nearly all applications achieve a very high hit rate on this
cache.  However, like any cache, there are applications that will
exceed its capacity, greatly lowering the hit rate.  Since this cache
(i.e., the Sun MMU) is loaded by software, the cost of a cache miss
can be quite expensive.

We have improved the algorithms that manage the Sun MMU.  The
improvement involves adding another level of caching between the MMU
management software and the upper levels of the kernel.  This is a
classic space/time tradeoff, where a little bit of space for this
software cache saves a lot of time in reloading the MMU for those
applications which exceed the hardware limits of the MMU.  In
addition, many other changes have been made to the MMU management
software to improve performance in general and to reduce the effects
of some worst case behaviour.

Following are the test results using Gordon's vmtest program run on a
12MB SPARCstation 1+ with the improved MMU management software:

    virtual   elapsed   user    system
    memory    time      time    time
    (MB)      (sec)     (sec)   (sec)

       2         2        2.3     0.6
       4         5        4.7     0.8
       8        10        9.4     1.1
      10        13       11.9     1.2
      12        16       14.3     1.4
      14        18       16.8     1.5
      16        21       19.5     1.7
      18        25       22.5     1.9
      20        27       25.3     2.0
      22        30       27.6     2.2
      24        33       30.4     2.5
      26        36       33.3     2.5
      28        39       35.7     2.7
      30        41       38.1     2.9
      32        44       40.8     3.1

Note that the performance is essentially linear through 32MB.

This improved MMU management software will be included in the next
release of SunOS.  It will be available as a patch for SunOS 4.1
(Sun4c and Sun4 platforms) and 4.1 PSR A at the end of July, and for
SunOS 4.0.3c (Sun4c machines) in early August.