gordoni@cs.adelaide.edu.au (Gordon Irlam) (03/27/91)
The Low Level Computational Performance of Sparcstations
========================================================

                 A Guide for Compiler Writers
                 ----------------------------

           Gordon Irlam (gordoni@cs.adelaide.edu.au)
                         March 1991

This document provides information on the Sparcstation architecture
that may be of use in targeting compilers.  It is an attempt to
address the common problem of architectural definitions failing to
provide concrete discussion of performance related issues.  This
document discusses the following aspects of performance: clock rates,
instruction timings, the memory sub-system, register windows, and
address translation.  Issues relating to multiuser performance are
not discussed.  Familiarity with the SPARC processor architecture,
but not the Sparcstation architecture, is assumed.

Overall architecture.
---------------------

The SS1 family (SS1, SLC, SS1+, IPC) use the Fujitsu MB86901A or its
equivalent.  The SS2 uses the Cypress CY7C601.  The SS1 and SLC have
a 20MHz clock.  The SS1+ and IPC have a 25MHz clock.  The SS2 has a
40MHz clock.  All machines have a 64k write-through cache.  The SS2
is able to map twice as much virtual memory as the machines in the
SS1 family.

    Sparcstation model summary

    Machine   Chip      Clock   Cache   Pmegs
    SS1       Fujitsu   20MHz   64k     128
    SLC       Fujitsu   20MHz   64k     128
    SS1+      Fujitsu   25MHz   64k     128
    IPC       Fujitsu   25MHz   64k     128
    SS2       Cypress   40MHz   64k     256

SPARC instruction times.
------------------------

In the vast majority of cases the two chips have identical
instruction timings, with most instructions taking 1 cycle.  The
exceptions are mentioned here.

Single word loads (LD*) take 2 cycles.  Double word loads and single
word stores (LDD*, ST*) take 3 cycles.  Double word stores and the
atomic load-store instructions (STD*, LDSTUB*, and SWAP*) take 4
cycles.

The only common case in which the two chips take different times is
branch instructions.  In most cases branches (B* and FB*) take 1
cycle, but on the Fujitsu chip untaken branches consume 2 cycles.  On
either chip, if the delay slot of a branch is annulled, an additional
1 cycle penalty is incurred, since the annulled delay instruction is
still fetched.

Jumps, indirect calls, returns, and trap returns (JMPL and RETT) take
2 cycles.  Note that direct calls (CALL) take 1 cycle.  For delayed
control transfer instructions (B*, FB*, CALL, JMPL, and RETT), if the
following instruction slot cannot be filled with a useful
instruction, an additional 1 cycle penalty is effectively incurred.
Traps incur a penalty of 3 cycles while the pipeline is drained and
refilled.

In the vast majority of cases bypass paths within the CPU allow an
instruction to use a value computed by the previous instruction in
the pipeline without a stall occurring.  The only exception is the
load instructions (LD*).  A 1 cycle stall penalty is incurred if the
instruction following a load attempts to read the register to which
the load is destined.  In the case of a double word load the penalty
is only incurred if the following instruction accesses the second
register to be loaded.

Floating point instruction timings depend upon the type of floating
point coprocessor used with the chip.  Unfortunately I don't have any
timing values for these.

The Fujitsu chip does not implement the atomic swap instruction,
although it does implement the atomic load-store byte instruction.
Neither chip implements integer multiplication or division, but under
SunOS 4.1 the integer multiplication and division instructions are
emulated in software.
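To make the load-use stall concrete, here is a small scheduling
sketch in Sun assembler syntax.  The register names and the
surrounding computation are invented for illustration; the cycle
counts are those given above.

    ! Unscheduled: the add reads %o1 in the cycle after the load,
    ! so a 1 cycle stall is added to the 2 cycle load.
    ld      [%o0], %o1
    add     %o1, 1, %o1

    ! Scheduled: an independent instruction separates the load from
    ! its use, and the stall disappears.
    ld      [%o0], %o1
    sub     %o2, 1, %o2         ! independent work
    add     %o1, 1, %o1

The same reasoning applies to delay slots: a compiler that can move a
useful instruction into the slot of a delayed control transfer avoids
the effective 1 cycle penalty of filling it with a nop.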
Trivia: on the Cypress chip, but not the Fujitsu chip, loads to
register g0 (which is always 0) followed by reads of g0 do not incur
the above mentioned stall penalty.

    SPARC instruction timings

    Instruction       Cycles
    LD*               2 + stall if next inst. uses dest. reg.
    LDD*              3 + stall if next inst. uses 2nd dest. reg.
    ST*               3
    STD*              4
    LDSTUB*, SWAP*    4
    B*, FB*           1 + delay (2 + delay on Fujitsu if not taken)
    JMPL              2 + delay
    RETT              2 + delay
    CALL              1 + delay
    all others        1
    traps             +3

Sparcstation memory times.
--------------------------

The above times all assume that instruction fetches, load
instructions, and store instructions never have to wait upon the
memory system.  This is not true.  Both the SS1 family and the SS2
use a single level 64k virtually addressed write-through cache with a
line size of 16 bytes.  I currently don't have any information on
memory timings for the SS2.  The following times refer to the SS1
family.

Attempting to measure memory access times on a Sun is very difficult
due to the large number of possible contributing factors - cache
size, cache organization, address translation, the write back queue,
and the DRAM access regime - as well as having to deal with
interference caused by instruction fetches and context switching.
Thus the following data is not exact.  It is an attempt to reconcile
the results of a number of experiments, some of which are
contradictory.

When the referenced line is in the cache no penalty is incurred.
Thus a single word load will take 2 cycles, and a single word store
will take 3 cycles.

Before continuing with what happens when a cache miss occurs,
remember that the SS1 family are low cost machines.  A certain amount
of performance has been traded off to end up with a box that is very
cheap to produce.  The more expensive Suns offer a considerable
improvement in this area, but you have to pay for it.

If the referenced line is not in the cache a stall will occur while
the address is translated and main memory is accessed.  For a load,
or presumably an instruction fetch, the stall lasts for 12 cycles
while the line is loaded into the cache.  Thus a single word load
will take 14 cycles.  For a store the stall lasts 3 cycles, causing a
single word store to take 6 cycles.  In the case of a store the
target line remains uncached.

When several stores occur in sequence a stall will occur once the
memory write buffers fill up.  This will occur even if the target
addresses are all cached, and is in addition to any stalls due to
cache misses.  On the SS1 family even a single store double
instruction causes the write buffer to fill.  The following table
shows this.

    Instruction sequence    Total cycles    Stall cycles
    ST                        3               0
    ST; ST                   11               5
    ST; ST; ST               17.5             8.5
    ST; ST; ST; ST           23.5            11.5
    STD                       8               4
    STD; STD                 21.5            13.5
    STD; STD; STD            33.5            21.5
    STD; STD; STD; STD       46              30

Non-integer cycle times are presumably the result of one or more of
the following: freezing the CPU while address translation is
performed, cycle stealing for DRAM refresh, use of a fancy DRAM
access regime, the presence of other devices on the bus, or
measurement error.

To prevent stalls a minimum of 3 cycles between two store
instructions is necessary.  Depending upon the addresses of
successive stores 4 cycles may be required.  If a write back stall is
triggered it appears to last at least 2 cycles.  Thus if two single
word stores are separated by 3 nops the sequence will take either
2 * 3 + 3 * 1 = 9 cycles or 2 * 3 + 3 * 1 + 2 = 11 cycles, whereas
two single word stores separated by 4 nops will take
2 * 3 + 4 * 1 = 10 cycles.
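To make the store separation rule concrete, here is a sketch in Sun
assembler syntax.  The register names and the intervening arithmetic
are invented for illustration; the cycle arithmetic is as given
above.

    ! Back to back stores: the write buffer fills and the second
    ! store stalls (about 11 cycles for the pair, even when cached).
    st      %o1, [%o0]
    st      %o2, [%o0 + 4]

    ! Separated by 3 independent instructions: 2 * 3 + 3 * 1 = 9
    ! cycles, or 11 if the addresses trigger a write back stall.
    st      %o1, [%o0]
    add     %o3, 1, %o3         ! independent work
    xor     %o4, %o5, %o4       ! independent work
    sll     %o3, 2, %g1         ! independent work
    st      %o2, [%o0 + 4]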
Based on my observations, I would recommend separating stores by 4
cycles only when all 4 intervening cycles can be usefully filled.
Otherwise I would recommend adding nops to give a minimum of 3 cycles
between a double word store and the next store, but only filling as
many cycles as can be usefully filled between a single word store and
the next store.

Note that on an SS1 class machine the double word store instruction
provides very little advantage over a pair of single word store
instructions.  A double word store takes 8 cycles.  Two single word
stores, plus 4 intervening operations, will only require an
additional 2 cycles.  Despite this it is probably sensible to use
double word store instructions when possible, unless it is clear that
the code is likely to be run exclusively on SS1 class machines.

    SS1 family memory sub-system timings

    Operation    Cached cycles    Uncached cycles
    LD            2               14
    LDD           3               15
    ST            3                6
    STD           8               11

    Sequence          Required separation
    ST, .., store     3-4 cycles
    STD, .., store    3 cycles

SunOS register window handlers.
-------------------------------

For compilers that don't have sophisticated register allocation
algorithms, register windows can offer a substantial reduction in the
cost of procedure calls and returns.  How register windows stack up
against sophisticated register allocation algorithms is an open
question.

I believe the Fujitsu chip has 7 register windows.  The Cypress chip
has 8 register windows.  Because one window is reserved for use by
the register window trap handler the user is left with either 6 or 7
register windows.

A register window overflow/underflow trap pair takes about 260 cycles
on an SS1 class machine (Hennessy and Patterson notwithstanding).
This is due to the poor performance of a sequence of 8 double word
store instructions on a write-through cache, the cost of testing to
make sure the required page is present, checking to see if the window
spans a page boundary, and making sure the register window handler is
not used to circumvent system security.  The machines on which
register windows were first developed (RISC I, RISC II, SOAR, and
SPUR), and on which register windows were proclaimed to offer great
performance advantages, did not have to deal with any of these
problems, and consequently an overflow/underflow trap pair took about
40 cycles.

Despite the significant cost of handling register window traps on
Sparcstations, in the absence of heavy recursion register windows
probably still offer some performance advantages for procedure call
intensive languages such as Smalltalk, or for compilers that don't
perform careful register allocation.  For languages such as C,
however, the implementation cost could outweigh the benefits.  That
is, in comparison to the amount of effort required to modify a
compiler/debugger/OS to take advantage of register windows, the gains
are probably not all that significant.  Thus I would recommend
attempting to quantify the expected gains obtained by using register
windows before deciding to use them.  I would expect they offer some
performance advantage, say 10%, far less than the 70% performance
improvement found on SOAR.

One possible counter-argument to some of the doubts about the
significance of register windows that I have expressed is that by
having register windows, Sun has been able to expend less effort on
maximizing the performance of loads and stores.  Because of register
windows, poor memory performance has been acceptable, which in turn
forces compiler writers to make use of register windows to get
acceptable performance.  This results in an interesting circularity
that complicates the evaluation of the significance of register
windows as an architectural feature.
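To give a feel for what the window mechanism costs and saves, here is
a sketch in Sun assembler syntax of the same trivial routine compiled
two ways.  The routine itself is invented for illustration; only the
window behavior is the point.

    ! Windowed version: save/restore switch register windows, and
    ! can trap (roughly 260 cycles per overflow/underflow pair).
_double:
    save    %sp, -96, %sp       ! new window, minimal stack frame
    add     %i0, %i0, %i0       ! result = arg + arg
    ret                         ! return to caller
    restore                     ! pop the window in the delay slot

    ! Leaf version: no window is allocated, so no window trap can
    ! occur; the argument stays in the caller's %o registers.
_double_leaf:
    retl                        ! leaf return via %o7
    add     %o0, %o0, %o0       ! result computed in the delay slot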
Sparcstation address translation.
---------------------------------

Sparcstations have a software managed cache that can hold the address
translations for each of several regions of virtual memory.  The SS1
family can concurrently map up to 128 regions of virtual memory, each
of size 256k and aligned on a 256k boundary.  The SS2 can
concurrently map up to 256 such regions.  Attempting to make active
use of more than 128 (or 256) of these virtually addressed regions at
once will result in a large number of page faults.  This is unlikely
to be a problem except for systems that have very sparse memory
access patterns.  Persistent environments and very large Lisp
applications are the two application areas most likely to suffer from
these problems.  In addition it is important that such applications
make use of the vadvise/madvise system calls to control paging
behavior.

Notes.
------

1. Some of the above information has been empirically determined, and
   is undoubtedly wrong; corrections welcome.

2. I would be interested in low level details of the expected integer
   performance of superscalar SPARC implementations.

3. Working out memory timing values is very difficult.  If anyone
   wants a challenge they might have a go at doing this for the SS2.
   Contact me for some pointers/caveats.

4. I have some additional information on window handlers and address
   translation if anyone is interested.  One day I plan to evaluate
   the effect of register windows on the performance of the SPEC
   benchmarks.  Working out the costs is easy; the hard part is
   working out how many additional loads and stores would be required
   in their absence.

5. Thanks to Peter Van Roy (vanroy@pisces.Berkeley.EDU) for helping
   me clean up an early version of this document.
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.