[comp.arch] Why is SPARC so slow? long, more data

mash@mips.UUCP (John Mashey) (12/17/87)
In article <36626@sun.uucp> garner@sun.UUCP (Robert Garner) writes:

....thank goodness...some useful discussion on this topic, from someone at
Sun who is close to it.
....

>Baskett's message was refreshing in that he accurately differentiated
>between implementation and architecture.  (Quite unlike previous
>criticisms, such as from the so-called "MIPS Performance Brief.")

I don't understand this.  The "Performance Brief" just shows a lot of
benchmark data, without any discussion or speculation on architecture versus
implementation.  The only architectural discussion was purely architectural,
and related to why Dhrystone was unusual.  What does "so-called" mean?

>However, Baskett's article continues to incorrectly portray the integer
>performance of Sun-4/200 workstations and SPARC in general.
>Sun's data on MIPS performance implies that the Sun-4/200
>has approximately the same INTEGER performance as the M/1000.

All you have to do is publish this data.  We've published comparative results
for Dhrystone, Stanford [Sun's choices in announcement materials], and
4 UNIX command tests [grep, diff, yacc, nroff].  People get to make up
their own minds on the basis of the data: we at least show what we have
so people can shoot at it.  We wish we had meaningful, nonproprietary,
large integer benchmarks whose numbers we could publish, but we don't.
Note: we continue to think that the Sun-4 is at best about the same
as the "8-mips" M/800 on user-level integer performance, but usually a little
slower. Of the 6 benchmarks shown above, the Sun-4 is equal on 1 (grep),
and slower on 5. (at the end of this posting, I'll replicate the
"Performance Brief Update - December 7, 1987" that has this data.)

>This fact is frequently ignored since the Sun-4/200 floating-point
>performance is generally (but not always) less than the M/1000....
Again, publish some data.

>This hand waving is too fast!  A standard, off-the-shelf gate array is 
>NOT in the same league as a custom CMOS design.  Indeed, that a gate 
>array has the same integer performance as a tuned, full-custom, 
---------------^^^^---- data?
>"similar technology" implementation is an indication of the strength
>of the architecture!
I'm not the one to debate this one [let the VLSIers do that].  I do
observe that we won't have to wait long.  Cypress "20-mips", 25MHz
parts are supposed to appear in 2Q88, so we'll see how they compare
with whatever exists whenever they appear in Sun-4s.  Certainly they will have
fine CMOS technology.

>Forest attempted to deduce the gate-array CPI value for integer 
>and floating-point programs.  From this analysis, he concluded:
>
>> These ratios [based on CPIs] are also consistent with the benchmark 
>> results in the Performance Brief. 
>
>Yes, floating-point suffers because of the Weitek chips.
>And yes, MIPS' "Performance Brief" attempts to stigmatize SPARC 
>by dwelling on this:  its benchmark suite and MIPS-rate calculations
>are conveniently based almost entirely on floating-point programs!
------------------------------^^^^^^^^
Please re-read the Performance Brief. The main summary has 6 FP ones,
and 3 integer ones.  Of the integer ones, 1 is a geometric mean of
4 UNIX benchmarks (different programs), summarized there because
I couldn't fit them on the page separately, and because we think the
summary is a reasonable indicator of the typical UNIX command's performance.
Of the set of 9 summary items, 5 [DHRY, STAN INT, LNPK DP, LNPK SP,
and SPICE] were those used in Sun's product introduction. The Digital
Review one was initiated by Sun [requested a tape from the magazine and
got the numbers published].  To this set, we added Doduc and Livermore Loops
[both used inside Sun, BTW], and the MIPS UNIX set [4 programs].

Finally, I would point out that ANYONE (Sun or otherwise) who bases their
comparisons against VAXen (and Sun's literature says so), uses a single
number (Vax-mips) to describe overall performance, and compares mips-ratings
with 8800s, really MUST understand DEC's performance methodology and how they
compute their own relative-performance numbers. [It includes floating-point].

I generally don't use CPI (where I = native instruction mips),
just because it is SO slippery to get a handle on, and it doesn't
lend itself to external comparisons.  We use "cycles-per-vax-mip equivalent",
i.e., derived from actual benchmarks.  As it happens, Forrest gets
a reasonable estimate anyways.  I personally prefer the following gross
estimator:
	1) Measure the time Tvax on a 780 with good compilers.
	2) Measure the time Tx on machine X, with clock rate Y.
	3) Then the vax-mips for X = xvmips = Tvax/Tx
	4) and the cycles/vax-mips = Y/xvmips
(Forrest and I may well disagree, but I've said before that CPIs
are fine things for computer architects, and within a family,
they're very useful figures of merit, but I have my reservations when
they get turned into marketing numbers.  Note that Sun's "RISC Tutorial",
July 1987, p15 says the Sun-4 has a CPI of 1.3.  This is a perfect
example of why I don't like CPIs: how can anybody ever
verify this number (other than people with SPARC simulators), or even
know what it means?  If two machines have the same clock rate, and one
has a lower CPI, does it run a real program faster?)
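
A minimal sketch of the four-step estimator above, in Python.  The
function names, the timings, and the clock rate are all hypothetical,
chosen only to make the arithmetic easy to follow; they are not
measurements of any machine discussed here.  The clock rate Y is taken to
be in MHz.

```python
def vax_mips(t_vax_secs, t_x_secs):
    """Step 3: performance of machine X relative to the 780 (780 = 1.0)."""
    return t_vax_secs / t_x_secs


def cycles_per_vax_mips(clock_mhz, t_vax_secs, t_x_secs):
    """Step 4: clock rate (in MHz) divided by the measured vax-mips."""
    return clock_mhz / vax_mips(t_vax_secs, t_x_secs)


# Hypothetical inputs: a benchmark that takes 100 s on the 780 and
# 10 s on a machine X clocked at 15 MHz.
T_VAX = 100.0
T_X = 10.0
CLOCK_MHZ = 15.0

xvmips = vax_mips(T_VAX, T_X)
cpvm = cycles_per_vax_mips(CLOCK_MHZ, T_VAX, T_X)
print(f"vax-mips = {xvmips:.1f}, cycles/vax-mips = {cpvm:.2f}")
```

Every input here can be measured from outside with a stopwatch and a data
sheet, which is the appeal of this estimator over a native CPI figure.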

>But no, one can not accurately judge different processors
>by comparing their implementation-dependent "cycles per instruction" 
>(CPI) values....
I agree, although it is interesting that Forrest's
analysis seemed to be a good approximation of what we see in live benchmarks.
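
One way to see why CPI alone settles nothing: run time is instruction
count times CPI divided by clock rate, so a lower CPI can lose if the
machine needs more native instructions for the same program.  The sketch
below uses made-up instruction counts, CPIs, and clock rate for two
hypothetical machines A and B; none of the numbers describe a real system.

```python
def run_time_secs(instructions, cpi, clock_hz):
    """Execution time = instruction count * cycles per instruction / clock."""
    return instructions * cpi / clock_hz


CLOCK_HZ = 16_000_000  # same hypothetical clock rate for both machines

# Machine A: lower CPI, but more native instructions for the program.
time_a = run_time_secs(instructions=120_000_000, cpi=1.3, clock_hz=CLOCK_HZ)
# Machine B: higher CPI, fewer native instructions for the same program.
time_b = run_time_secs(instructions=100_000_000, cpi=1.5, clock_hz=CLOCK_HZ)

print(f"A: {time_a:.3f} s, B: {time_b:.3f} s")
```

With these assumed figures B finishes first despite the higher CPI, which
is why the comparison has to come from timing real programs.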

>The Sun-4/200, for LARGE C, integer programs runs at about 1.65 CPI.  
>This includes 15% loads and 5% stores AND the miss cost associated
>with the 128K-byte cache and the large, asynchronous main memory.
>(Baskett's calculation assumed MIPS' distribution, 20% loads and 10% stores, 
>which is not applicable to SPARC.  Since cache effects can dominate 
>performance, I suspect that the M/1000, large-C-program CPI
>could be near 1.6 if its cache/memory is taken into account.)
It is around 1.5 cycles/vax-mip-equiv.

Note: we've published numbers on how many of our instructions are register
save/restores, and the highest instruction % we could find was 11%,
and the mean was 4.5%, so if you're getting only 15/5, instead of 20/10,
we must be doing different kinds of benchmarks (which is quite possible).
Could you name what the LARGE C, integer benchmarks are?
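
The arithmetic behind that remark, as a rough sketch.  It assumes the two
load/store mixes can be compared as if they were percentages of the same
instruction stream, and that every register save/restore is a load or a
store; both are simplifications, not claims from either posting.

```python
# Instruction-mix figures quoted in the discussion (percent of instructions).
sparc_loads, sparc_stores = 15.0, 5.0   # Sun-4/200 figures quoted above
mips_loads, mips_stores = 20.0, 10.0    # the "MIPS distribution" quoted above

# Register save/restore share of MIPS instructions (mean and highest seen).
save_restore_mean = 4.5
save_restore_max = 11.0

# Difference between the two load/store mixes, in percentage points.
gap = (mips_loads + mips_stores) - (sparc_loads + sparc_stores)

print(f"load/store gap between the two mixes : {gap:.1f} points")
print(f"save/restores available on average   : {save_restore_mean:.1f} points")
print(f"save/restores available at most      : {save_restore_max:.1f} points")
print(f"gap not covered in the average case  : {gap - save_restore_mean:.1f} points")
```

Under those assumptions the 10-point gap is only reachable on the single
worst-case benchmark, not on the mean, which is why the choice of
benchmarks matters.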

>Of course, the most error-free measure of performance is wall clock time.
>Until there are more results of some large integer programs running
>both on the Sun-4 and the M/1000, speculation can be unproductive.
For CPU performance, you can pick either user-level CPU or user+kernel.
Wall-clock time is a measure of CPU performance, operating system tuning,
and disk performance.  Both CPU time and wall-clock time are useful.
We agree about speculation (although it's SUCH fun), but where's the
data?  One can argue about the validity (or lack thereof) of specific benchmarks,
but it is hard to rationally discuss performance without real numbers.
If the performance is what is claimed, it should
be SO EASY to refute what we say with numbers that people can evaluate for
themselves. Half of my brain loves arguments about CPIs and qualitative
goodness/badness of features, but the other half insists I run benchmarks
of real programs and look at the numbers.
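
For anyone who wants to collect the three kinds of time mentioned above
(user-level CPU, user+kernel, and wall-clock), here is a small sketch in
Python.  The helper name `measure` and the toy workload are hypothetical
stand-ins; a real comparison would run an actual program such as the UNIX
commands used in the brief.

```python
import os
import time


def measure(workload):
    """Run workload() once; return (user, user+kernel, wall-clock) seconds."""
    cpu_before = os.times()
    wall_before = time.perf_counter()
    workload()
    wall_after = time.perf_counter()
    cpu_after = os.times()
    user = cpu_after.user - cpu_before.user
    kernel = cpu_after.system - cpu_before.system
    return user, user + kernel, wall_after - wall_before


def toy_workload():
    # Hypothetical stand-in for a benchmark: pure user-level integer work.
    total = 0
    for i in range(200_000):
        total += i * i
    return total


user, user_kernel, wall = measure(toy_workload)
print(f"user {user:.3f}s  user+kernel {user_kernel:.3f}s  wall {wall:.3f}s")
```

A workload that does disk I/O would show wall-clock time pulling away from
the two CPU figures, which is the distinction drawn above.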

>Now, what about register windows?  In Baskett's second article
.....
>A minimum SPARC implementation could have 40 registers:  8 ins, 
>8 locals, 8 outs, 8 globals, and 8 local registers for the trap handler.
>Such an implementation is not precluded by the architecture, but
>would probably imply IRA-type optimizations.  It would function
>as if there were no windows, although window-based code would
>properly execute, albeit inefficiently. 
This is a perfectly fair technical comment.  In practice, will the
"shrink-wrapped, ABI, compatible-everywhere binary code"
	a) Use windows? OR
	b) Use IRA?

>Register windows have several advantages over a fixed set of registers,
>besides reducing the number of loads and stores by about 30%:
>They work well in LISP (incremental compilation) and object-oriented
>environments (type-specific procedure linking) where IRA is impractical...

These seem like fair possibilities for windows (over IRA).
Could you describe the benchmarks on which the loads and stores
are reduced by 30%? (And from what base?  The base is important:
from our base, windows could eliminate at best 4.5% (over a set of large
benchmarks), and I suspect the kernel is probably MUCH less.)

Well, in any case, thanks for posting the comments: it's good to see
some technical discussion, since, from outside Sun, ALL that we see are:
	a) Sun-4 == VAX 8800 advertising AND
	b) Benchmarking amongst prospects and customers

------appendix---- (no more discussion, except to note that we
discount the UNIX benchmarks about 5-10% because the VAX baseline ran 4.3BSD
rather than a system with a global optimizer).

                    MIPS Performance Brief - Update for 3.0

                               December 7, 1987

The October 1987 Issue of the MIPS Performance Brief used several UNIX
benchmark numbers for the Sun-4 that were estimated from earlier versions of
the benchmarks, as described in section 4.1, and carried through to the
summary.  We've since been able to get actual Sun-4 numbers.  Following are
the replacement tables for page 3 (Summary of Benchmark Results) and page 8
(MIPS UNIX Benchmarks).  Our original estimate of the overall Sun-4
performance on the MIPS UNIX benchmark set was about 5% low.

                         Summary of Benchmark Results

                     (VAX 11/780 = 1.0, Bigger is Faster)

   Integer (C)            Floating Point (FORTRAN)

MIPS   DHRY  STAN   LLNL  LNPK   LNPK   SPCE  DIG    DDUC   Publ
UNIX   1.1   INT     DP    DP     SP    2G6   REV           mips  System

 1      1     1     1      1      1     1      1      1       1   VAX 11/780#

 2.1    1.9   1.8   1.9    2.9    2.5   1.6   *2     *1.3     2   Sun3/160 FPA

*4      4.1   4.7   2.8    3.3    3.4   2.4   *3      1.7     4   Sun3/260 FPA
 5.5    7.4   7.2   2.5    4.3    3.7   3.4    4.9    3.8     5   MIPS M/500

*6      5.9   6.5   5.9    6.9    5.6   5.3    6.2    5.2     6   VAX 8700
 8.4   10.8   7.3   4.5    7.9    6.4   4.1    4.4    3.5    10   Sun4/260

 9.2   11.3  11.8   8.1    7.1    7.6   6.6    7.6    7.3     8   MIPS M/800
11.3   13.5  14.1   9.7    8.6    9.2   8.0    9.3    8.8    10   MIPS M/1000


# VAX 11/780 runs 4.3BSD for MIPS UNIX, Ultrix 2.0 (vcc) for Stanford, VAX/VMS
  for all others.  Use of 4.3BSD (no global optimizer) probably inflates the
  MIPS UNIX column by about 10%.

* Although it is nontrivial to gather a full set of numbers, it is important to
  avoid holes in benchmark tables, as it is too easy to be misleading.  Thus,
  we had to make reasoned guesses at these numbers.  The MIPS UNIX values for
  VAX 8700 and Sun-3/260 were taken from the Published mips-ratings, which are
  consistent (+/- 10%) with experience with these machines.  DIG REV and DDUC
  were guessed by noting that most machines do somewhat better on DIG REV than
  on SPCE, and that a Sun-3/260 is usually 1.5X faster than a Sun-3/160 on
  floating-point benchmarks.

                          MIPS UNIX Benchmarks Results

     grep         diff         yacc         nroff        Total      Geom
   Secs  Rel.   Secs  Rel.   Secs  Rel.   Secs  Rel.   Secs+  Rel.  Mean  System

   11.2   1.0  246.4   1.0  101.1   1.0   18.8   1.0   377.5   1.0   1.0  11/780 4.3BSD
    5.6   2.0  105.3   2.3   48.1   2.1    9.0   2.1   168.0   2.2   2.1  Sun-3/160M
      -     -      -     -      -     -    5.0   3.8       -   3.8   3.8  DEC VAX 8600
    2.4   4.7   35.8   6.9   19.5   5.2    3.3   5.7    61.0   6.2   5.5  MIPS M/500

    1.6   7.0   25.1   9.7   11.8   8.6    2.2   8.6    40.7   9.3   8.4  Sun-4,-O3
    1.6   7.0   21.6  11.4   11.2   9.0    1.9   9.9    36.3  10.4   9.2  MIPS M/800
    1.3   8.6   18.0  13.7    9.3  10.9    1.5  12.5    30.1  12.5  11.3  MIPS M/1000

  + Simple summation of the time for all benchmarks.  "Total Rel." is ratio of
  the totals.
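
For readers who want to check the two summary columns, the following
sketch recomputes "Total Rel." and "Geom Mean" for the M/1000 row from the
seconds in the table above.  The function names are mine, and whether the
brief took its geometric mean over rounded or unrounded ratios is an
assumption; here it is taken over the unrounded ones.

```python
import math

# Seconds per benchmark, copied from the MIPS UNIX table above.
VAX_780 = {"grep": 11.2, "diff": 246.4, "yacc": 101.1, "nroff": 18.8}
M1000 = {"grep": 1.3, "diff": 18.0, "yacc": 9.3, "nroff": 1.5}


def relative(base, machine):
    """Per-benchmark speed relative to the base machine (bigger is faster)."""
    return {name: base[name] / machine[name] for name in base}


def geometric_mean(values):
    """n-th root of the product of the values, computed via logarithms."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


def total_rel(base, machine):
    """Ratio of the summed times: the table's "Total Rel." column."""
    return sum(base.values()) / sum(machine.values())


rels = relative(VAX_780, M1000)
print(f"Geom Mean  = {geometric_mean(rels.values()):.1f}")
print(f"Total Rel. = {total_rel(VAX_780, M1000):.1f}")
```

The ratio of totals is dominated by the longest-running benchmark (diff),
while the geometric mean weights the four programs equally, which is why
the two columns differ.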

-the end!
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086