[comp.arch] SPARC vs. MIPS on gcc

edkelly%aisling@Sun.COM (Ed Kelly) (12/17/88)

            A COMPARISON OF SPARC VS MIPS ON A LARGE C PROGRAM.

For the comparison we chose a large portable C program (the GNU C Compiler rev
1.24) and compiled the identical source on a Sun-4/280 with the SPARC compiler
to produce a SPARC binary, and on a MIPS M/1000 with the MIPS compiler to 
produce a MIPS binary.
      Then using the same data (the file gcc.c) we ran the
benchmark on both machines and gathered the dynamic trace statistics provided 
by SPIXSTATS and PIXIE, the respective statistics gathering programs for SPARC 
and MIPS. We also measured the user and system time on both machines.   
       The compiler optimization level was set at -O2 for MIPS (the highest
level at which the program would compile) and at -O4 for SPARC. Both compilers
were the standard production versions as of September 1988. MIPS -O2 and SPARC
-O4 are comparable levels of optimization. -O3 was the highest MIPS
optimization level available, but data from other C programs shows that -O3
produces a gain of less than 2% on average over -O2, so we feel the comparison
is valid.
     The following is divided into two sections. The first section covers a
SPARC vs. MIPS instruction set architecture comparison and the second is an
implementation comparison of the Sun-4/280 vs. the M/1000. The architecture
comparison counts INSTRUCTIONS and is useful for comparing instruction sets
and compiler efficiency. This will not vary across implementations if the
compilers are held constant. If you are interested in architecture and wish to
avoid the confusion of implementation details these are the numbers of most
interest. The implementation comparison counts CYCLES and includes effects
such as multi-cycle loads and cache misses.


__________________________________________________________________
  INSTRUCTION SET/REGISTER ARCHITECTURE AND COMPILER COMPARISON
__________________________________________________________________

			SPARC		MIPS		MIPS-SPARC

Total Instructions	16,313,907	18,635,185	+2,321,278
------------------------------------------------------------------
                     Detailed Breakdown
------------------------------------------------------------------
Branch nops		109,079		1,170,306
Load nops		na		1,113,019
Jump nops		102,417		211,409
other nops		20,110		99,495
annulled delay slots	(634,700)	na
load interlock cycles	(1,474,619)	na
------------------------------------------------------------------
nops sub-total		231,606	(1.4%)	2,594,229 (14%)	+2,362,623

loads                   3,242,293 (19.9%)   3,928,710 (21%)     +686,417

stores                  1,175,530 (7.2%)    2,037,266 (10.9%)   +861,736

conditional branches	2,699,885	2,559,648
unconditional branches	225,739		190,456
jumps			326,578		498,865
calls			214,662		213,118
------------------------------------------------------------------
jmp/branch sub-total    3,466,864 (21%)     3,462,087 (18.5%)   -4,777


shift			716,666		890,281
logical set cc          850,121         na
logical			1,396,645       1,335,473
arithmetic set cc       1,914,842       na
arithmetic		1,853,659	3,241,789
set			na		666,309
save/restore		337,820		na		
others			84,094		41,838
------------------------------------------------------------------
computational sub-total	7,192,084(44%)	6,175,690(33%)	-1,016,394


sethi/lui               1,003,713 (6.15%)   437,207 (2.3%)      -566,506
------------------------------------------------------------------

Some notes on the categories. 

	MIPS "set" could be categorized as arithmetic or arithmetic set cc.

	SPARC "save/restore" could be categorized as arithmetic. They 
	adjust the stack pointer and the increment/decrement window pointer; 
        The equivalent MIPS operation is adjust the stack pointer.

	The SPARC nops listed as "other" are mostly associated with 
	calls.
 
	The "others" category is mostly multiply related 

As will surprise most observers, SPARC executes fewer instructions than MIPS.

Some specific observations.

1) SPARC has many fewer loads and stores (1,548,153 fewer), which points out
the significant ARCHITECTURAL advantage of register windows.  Stated another
way, for this case MIPS has 35% more loads and stores than SPARC. This
benchmark contains more loads and stores than our "average" case of 15% loads
and 6% stores, so the benefits of register windows may actually be understated
here.
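
The arithmetic behind both figures comes straight from the breakdown table
above; a short C sketch (counts copied from the table) reproduces them:

    #include <stdio.h>

    int main(void)
    {
        /* load and store instruction counts from the breakdown table */
        double sparc_ls = 3242293.0 + 1175530.0;  /* SPARC loads + stores */
        double mips_ls  = 3928710.0 + 2037266.0;  /* MIPS  loads + stores */

        /* difference: 1,548,153; ratio: ~1.35, i.e. 35% more on MIPS */
        printf("difference = %.0f\n", mips_ls - sparc_ls);
        printf("ratio      = %.2f\n", mips_ls / sparc_ls);
        return 0;
    }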

2) There are lots of NOPs in MIPS code. This is an ARCHITECTURAL feature. 
NOPs are not benign. As well as the direct cycles lost, lots of NOPs is bad 
for code density, and it increases instruction cache miss penalties(due to more
memory accesses and greater probability of a miss).
       A subtle point about the NOPs is that they distort statistics presented
as percentages. MIPS's combined load/store percentage is 32% for this
benchmark. If there were no NOPs the percentage would be 37%, vs. SPARC's 27%.
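
To see the distortion, recompute the load/store percentages with and without
the NOPs in the denominator; a minimal C sketch using the table's totals:

    #include <stdio.h>

    int main(void)
    {
        /* totals and NOP sub-totals from the table above */
        double sparc_tot = 16313907.0, sparc_nops = 231606.0;
        double mips_tot  = 18635185.0, mips_nops  = 2594229.0;
        double sparc_ls  = 3242293.0 + 1175530.0;  /* loads + stores */
        double mips_ls   = 3928710.0 + 2037266.0;

        /* prints roughly 32%, 37%, and 27% respectively */
        printf("MIPS  ld/st of total:  %4.1f%%\n",
               100.0 * mips_ls / mips_tot);
        printf("MIPS  ld/st of useful: %4.1f%%\n",
               100.0 * mips_ls / (mips_tot - mips_nops));
        printf("SPARC ld/st of useful: %4.1f%%\n",
               100.0 * sparc_ls / (sparc_tot - sparc_nops));
        return 0;
    }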
     However, current SPARC implementations incur a clock-cycle penalty for
some of the cases where MIPS has to insert NOPs, so counting all NOPs against
MIPS overstates the situation. This includes the load-use interlock case
(1,474,619 cycles) and the untaken annulled branch case (634,700 cycles).
While these cycles are not "architectural", many implementations will incur
them.
     The ARCHITECTURAL advantage that the annulling feature confers on SPARC 
probably needs more explanation.  As the MIPS numbers demonstrate, it is 
difficult to fill branch delay slots. SPARC uses standard delayed branches 
until it cannot fill branch delay slots. It then uses annulling branches and 
fills almost all the remaining branch delay slots. Annulling branches that are 
taken incur no penalty and represent a performance win for SPARC that MIPS 
cannot realize. 
           Minimizing the number of load interlock cycles and predicting
conditional branches are functions of compiler technology. From a comparison
with the MIPS number, the load interlock cost could be around 1,000,000
cycles. The number of annulled instructions that incur a penalty is reduced
with reasonable branch prediction. Several papers have shown that static
branch prediction can reach 85% for C programs. Currently the Sun compiler
gets 60% correct prediction for this benchmark; 85% prediction would reduce
the untaken annulled branch cycles lost to 263,306.
     The bottom line about NOPs is SPARC is better due to the annulling 
ARCHITECTURE feature.

3) SPARC has more sethi instructions (566,506 more). Most of these are due to
the way addresses to global data are generated by the compiler. An
optimization that MIPS employs would eliminate these instructions. SPARC once
performed the optimization (during early development), but we decided to keep
the old a.out format and the old linker and so postponed the benefit. The
SPARC ABI will allow us to remedy this situation.

4)  The category that has the biggest discrepancy against SPARC is computational
(1,016,394). Some of this is probably due to the need to set condition codes,
an ARCHITECTURAL feature of SPARC, but it is not straightforward to analyze. 

5) There are other significant ARCHITECTURAL differences between MIPS and
   SPARC that either are not represented in this benchmark or cannot be
   isolated with the data. I include this list for completeness.
 
 a)  SPARC has a register + register addressing mode for loads and stores that 
    MIPS lacks. 

 b) MIPS has integer multiply and divide instructions that SPARC lacks
    in current implementations. 

 c) SPARC has load- and store-double operations (integer and floating point);
    MIPS has no equivalent instructions.
 
 d) MIPS has instructions to move data directly between the integer registers
    and the floating point registers. SPARC has no equivalent instructions.


In summary, for this benchmark, the ARCHITECTURAL benefits of register windows 
and annulling more than balance the ARCHITECTURAL losses in computational. The
relatively simple enhancements of sethi elimination, branch prediction and 
load interlock removal can buy more than 1,000,000 instructions for SPARC. 
From random inspection of code sequences the current SPARC compiler appears to 
produce redundant code, so some improvement can be expected in this area as 
well.
      For many observers the interesting fact is that for this benchmark, the 
MIPS compiler is not significantly better than the current SPARC compiler. 
Considering the bad press, I will admit I was surprised by this myself. 
             Being a SPARC advocate I would claim that SPARC is ARCHITECTURALLY
fundamentally better, but the degree of difference is probably in the noise in 
the broader scheme of things.




                        IMPLEMENTATION ANALYSIS.

This is mainly for historical perspective and to present a complete picture.

___________________________________________________________________________
			User machine cycles comparison.
___________________________________________________________________________

			Sun-4/280		MIPS M1000

instructions		16,313,907 		18,635,185
loads (extra cycle)	3,242,293
stores (extra cycles)	2,351,060
load interlock	 (")	1,474,619
untaken branch	 (")	1,179,319
annulled cycles	 (")	634,700
jmp		 (")	326,578
mult/div	 (")	na			363,987
basic block interlock?	na			51,983
-----------------------------------------------------------
total raw cycles	25,522,476		18,999,172

cache miss cycles	4,427,524*		14,000,828*
-----------------------------------------------------------
total machine cycles	29,950,000		33,000,000
-----------------------------------------------------------
CPI			1.84			1.77
CPUI (Cycles per Useful
Instruction)**          1.86                    2.02
MIPS                    9.06                    9.23
MUIPS (Millions of Useful
Instructions/Sec)       8.95                    8.06



___________________________________________________________________________
  			Rough Memory System Analysis
___________________________________________________________________________
memory references       20,731,730              24,601,161 (+3,869,431, +18.6%)
average miss penalty    10 cycles               10 cycles??*
misses/other            442,752 (2.1%)*         1,400,083 (5.7%)??*


Benchmark Data
			Sun-4/280		MIPS M1000
clock			16.67MHz		15MHz
user time		1.797secs		2.2secs
system time		.285secs		.3secs

   * These are rough numbers obtained by working backwards from the time
     necessary to run the program and the clock frequency. The MIPS cache is
     write-through and incurs significant penalties in write stalls. I cannot
     distinguish the magnitude of this effect here.

   ** Useful Instructions are all instructions not including NOPs.
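
As a cross-check, the derived numbers in the two tables above follow directly
from the raw counts; a small C sketch for the Sun-4/280 column (the M/1000
column is built the same way, with the rough 10-cycle miss penalty assumed
in the * note above):

    #include <stdio.h>

    int main(void)
    {
        /* Sun-4/280 column from the tables above */
        double insts   = 16313907.0;  /* total instructions      */
        double nops    =   231606.0;  /* NOP sub-total           */
        double cycles  = 29950000.0;  /* total machine cycles    */
        double miss    =  4427524.0;  /* cache miss cycles       */
        double refs    = 20731730.0;  /* memory references       */
        double clock   =    16.67e6;  /* clock frequency in Hz   */
        double penalty =       10.0;  /* assumed cycles per miss */

        printf("time   = %.3f s\n", cycles / clock);         /* ~1.797 s */
        printf("CPI    = %.2f\n", cycles / insts);           /* ~1.84    */
        printf("CPUI   = %.2f\n", cycles / (insts - nops));  /* ~1.86    */
        /* back out the miss count and miss rate: ~442,752 misses, ~2.1% */
        printf("misses = %.0f (%.1f%% of refs)\n",
               miss / penalty, 100.0 * miss / (penalty * refs));
        return 0;
    }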


______________________________________________________________________________
                        OPERATING SYSTEM OVERHEAD
______________________________________________________________________________

The time spent in the operating system is broadly comparable on both machines. 
Detailed analysis of how this breaks down is difficult. In current SPARC
implementations window overflow/underflow is accomplished with trap handlers. 
MIPS currently handles TLB misses with trap handlers. 
     The number of overflows for the Sun-4/280 (with 7 windows) was 4,439 
and underflows 4,438, for a total of 8,877 traps. For SPARC the number of 
overflows and underflows is dependent on the number of register windows in an 
implementation. (e.g. A Cypress based design with 8 windows would have 
2,569 overflows and 2,568 underflows for this program). Each overflow performs
8 store doubles and each underflow 8 load doubles. This is equivalent to about
71,024 extra stores and 71,024 extra loads for the 4/280, a tiny fraction (3%)
of the total loads and stores.
 
     If the TLB miss rate for MIPS was .1% (an optimistic assumption),
this would have resulted in 24,601 traps. As an approximation, both machines'
trap overheads appear comparable for this benchmark. Most of the system
overhead is not in these trap handlers. For the 4/280 the overflow/underflow 
trap handlers take about 545,932 cycles out of the approximately 5,000,000 
cycles of system time. 
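
The trap arithmetic above is easy to reconstruct; a small C sketch, assuming
each overflow or underflow moves one 16-register window (8 doubleword
transfers) and taking the 0.1% TLB miss rate as given:

    #include <stdio.h>

    int main(void)
    {
        /* Sun-4/280 window traps and M/1000 memory references, from above */
        double overflows  = 4439.0;
        double underflows = 4438.0;
        double window     = 16.0;        /* registers moved per trap   */
        double mips_refs  = 24601161.0;
        double tlb_rate   = 0.001;       /* assumed 0.1% TLB miss rate */

        /* expected: 8,877 window traps, 71,024 extra stores,
           ~71,000 extra loads, ~24,601 TLB traps */
        printf("window traps = %.0f\n", overflows + underflows);
        printf("extra stores = %.0f\n", overflows  * window);
        printf("extra loads  = %.0f\n", underflows * window);
        printf("TLB traps    = %.0f\n", mips_refs * tlb_rate);
        return 0;
    }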

     I should clarify why I am treating underflow and overflow penalties
in this section and not under architecture. As the numbers above show, nearly 
all aspects of underflow/overflow penalties are IMPLEMENTATION specific. The
number of register windows and details of hardware or trap handler organization,
all of which are determined by hardware or kernel implementations, are what
account for this overhead.

______________________________________________________________________________
			GENERAL IMPLEMENTATION COMMENTS
------------------------------------------------------------------------------

These numbers represent significant differences in the IMPLEMENTATION
philosophies at Sun and at MIPS. The central goal at MIPS appears to have been
to achieve a single cycle per instruction, even at the cost of cycle time and
complexity. Clearly that was not a central goal at Sun. 
    Most of the raw CPI differences are due to the multi-cycle loads and stores.
This is due to SPARC's single 32-bit bus vs. MIPS's multiplexed 32-bit bus. The
single 32-bit bus was chosen for system simplicity. It also facilitates
designing low-cost systems and multiprocessor systems.
     Our goals were dominated by cycle time and system simplicity. Performance
on large programs was our design metric.
The first SPARC implementation achieved a faster cycle time than the best
of MIPS's first implementations, despite inferior technology. The Cypress SPARC
implementation is achieving a better cycle time than the latest MIPS
implementation from Performance Semiconductor (33MHz vs. 25MHz). This is not
coincidental. Fujitsu has announced a new SPARC part for next year with
multiple 64-bit busses that will demonstrate a good CPI and bury the myth
that SPARC is tied to multi-cycle loads and stores.
      MIPS generates more memory references (18.6% more, see above) than SPARC,
and the first MIPS implementations compounded this with poor cache/memory
system design. As a result, large integer programs perform better overall on
the SPARC implementation, which has a better cache/memory system.
      The MIPS performance brief has concentrated on relatively small 
integer programs that fit in the cache and so benefit well from the single cycle
loads and stores. This overstates the integer performance for large programs,
which are after all what people buy fast machines to run. MIPS implicitly
acknowledges this by calling the M1000 a 10-MIPS box despite the fact that all
the published data in the MIPS performance brief would say integer performance
is greater than 12 MIPS. The performance brief also leans heavily on the 
floating point performance side where the first SPARC implementations are 
clearly inferior to the first MIPS implementations. This weakness was 
redressed by the parts announced by Cypress some time ago.

    As the data demonstrates, for a real and significant program, the Sun-4/280
is comparable to the M1000. The data also shows that for this program the 
SPARC instruction set and compiler duo are comparable to the MIPS instruction 
set and compiler duo.

Ed Kelly

The opinions here are my own and do not necessarily represent those of
Sun Microsystems.

aglew@mcdurb.Urbana.Gould.COM (12/18/88)

Wow! I suppose Ed Kelly has started a performance analysis war 
between MIPS and SUN. Don't worry, I'm not getting into it -
I'm new enough in Motorola that I don't want this much exposure
just yet. :-(

Ed, can you give us any floating point comparisons between SPARC and MIPS?

I'm sure someone from MIPS will make a detailed response. 
Me, I just want to ask a few questions:

>loads                  3,242,293 (19.9%)   3,928,710 (21%)     +686,417
>stores                 1,175,530 (7.2%)    2,037,266 (10.9%)   +861,736

Phew! Now I'll expose how new I am at this game by giving a big sigh of
relief. I had to do some explaining a while back about why my measurements
were showing load/store ratios of 2-2.5:1, as opposed to the 3:1
everyone KNEW was the typical ratio of loads/stores. (NB. this was not
on an 88K. I cannot report any 88K data.) Investigation showed that many
of the extra stores were in register saves - which may also be shown
by the above difference between the SPARC with register windows and the MIPS
without. Does anyone at MIPS have breakdowns for their load/store traffic
according to purpose?


>     The bottom line about NOPs is SPARC is better due to the annulling 
>ARCHITECTURE feature.

Just so long as annulling doesn't cost you anything in cycle time.
As I am sure everyone will point out.

>instructions		16,313,907 		18,635,185
>-----------------------------------------------------------
>total raw cycles	25,522,476		18,999,172
>
>cache miss cycles	4,427,524*		14,000,828*
>-----------------------------------------------------------
>total machine cycles	29,950,000		33,000,000

Well, I'm afraid that I don't see these numbers as expressing
a fundamental difference between the two processors.

Architecturally SPARC wins out with fewer instructions.

Implementationally (is that a word?) MIPS wins out with fewer
cycles - but SPARC might always be implemented with fewer cycles
per instruction (but then, so might a VAX :-).

SPARC takes fewer cache misses, due to register windows and better
code density - but it is always possible that in a future version 
of MIPS a better memory system, possibly in the form of a large
on-chip cache using the space that register windows occupy,
might win this back (one of my favorites is caching the first word
of every cache line on chip, with the rest off-chip -- but then,
I was thinking about vector processing mostly in my last job (that's
head-of-vector caching)).

Finally, for the next step in microprocessor architectures, 
I'd guess that it would be easier to dispatch more than one of MIPS'
instructions at once, rather than SPARC's comparatively complex instructions
(addressing modes and condition codes are a bitch).



Once again, before the wars start - I'd like to thank Ed Kelly for
presenting this data.


Andy "Krazy" Glew   aglew@urbana.mcd.mot.com   uunet!uiucdcs!mcdurb!aglew
   Motorola Microcomputer Division, Champaign-Urbana Design Center
	   1101 E. University, Urbana, Illinois 61801, USA.
   
My opinions are my own, and are not the opinions of my employer, or
any other organisation. I indicate my company only so that the reader
may account for any possible bias I may have towards our products.

elg@killer.DALLAS.TX.US (Eric Green) (12/18/88)

in article <82150@sun.uucp>, edkelly%aisling@Sun.COM (Ed Kelly) says:
>             A COMPARISON OF SPARC VS MIPS ON A LARGE C PROGRAM.
> 
> For the comparison we chose a large portable C program (the GNU C Compiler rev
> 1.24) and compiled the identical source on a Sun-4/280 with the SPARC compiler
> to produce a SPARC binary, and on a MIPS M/1000 with the MIPS compiler to 
> produce a MIPS binary.

Step 1: choose a program. Fine. You did that right. Even used the
right compiler -- the standard one.


>       Then using the same data (the file gcc.c) we ran the
> benchmark on both machines and gathered the dynamic trace statistics provided 
> by SPIXSTATS and PIXIE, 
> If you are interested in architecture and wish to avoid the 
> confusion of implementation details these are the numbers of most
> interest. 

OK, so you captured dynamic trace statistics. So what. Lower number of
instructions executed doesn't necessarily mean faster execution, or
else the Vax 780 would be the world's fastest machine ;-). I happen to
agree that some sort of register window setup is a Big Advantage
architecturally, but don't think that a dogmatic "Register windows are
better" is warranted.

> 2) There are lots of NOPs in MIPS code. This is an ARCHITECTURAL feature. 
> NOPs are not benign. As well as the direct cycles lost, lots of NOPs is bad 
> for code density, and it increases instruction cache miss penalties(due to more
> memory accesses and greater probability of a miss).

The delay slots filled by NOPs also allow you to schedule instructions
on LOADs etc. when the pipeline would otherwise be stalled, which
seems to me to make the whole issue somewhat of a tossup. You can do
the same sort of instruction rearrangement without that guaranteed
delay, but it becomes more of an iffy proposition.
     As I mentioned before, if code density were the sole determinant
of architectural quality, we should all use Vaxen.

> In summary, for this benchmark, the ARCHITECTURAL benefits of register windows 
> and annulling more than balance the ARCHITECTURAL losses in
> computational.

Hmm... I wouldn't be quite so dogmatic about it if I were you. The
information presented looks fairly convincing, but there may be
alternate explanations. The only things certain in life are death, and
taxes. 

> MIPS compiler is not significantly better than the current SPARC compiler. 
> Considering the bad press, I will admit I was surprised by this
> myself. 

Doesn't surprise me too greatly. The register windows compensate quite
well for outdated compiler technology, which is why the UCB guys used
them in the first place (so they could re-target PCC, instead of
having to dig up some compiler guys to do a moby optimizing hack).

> -----------------------------------------------------------
> total raw cycles	25,522,476		18,999,172
> 
> cache miss cycles	4,427,524*		14,000,828*
> -----------------------------------------------------------
> total machine cycles	29,950,000		33,000,000

Looks like the M/1000 used here needed a larger cache. As David
Patterson explains so ably in his various papers, a larger cache can
make up for a lot of memory bandwidth (which is why a RISC can be
faster than a Vax 780). Statistics on how much cache was available on
each machine were not published with this so-called "performance
comparison". I would not be surprised if a MIPS processor needed a
larger cache than a SPARC, just as I would not be surprised if a SPARC
needed a larger cache than a Vax 780. Again, no clear performance
advantage here. Take away a few million cache misses, and the MIPS looks
better than the SPARC (cycle-wise).

[specs on cycle times, other implementation features:]

> These numbers represent significant differences in the IMPLEMENTATION
> philosophies at Sun and at MIPS.

I suspect it's a matter of cash. The more cash you have, the faster
process technology you can buy. Sun isn't exactly cash-starved ;-).

>       The MIPS performance brief has concentrated on relatively small 
> integer programs that fit in the cache and so benefit well from the single cycle
> loads and stores. This overstates the integer performance for large programs,
> which are after all what people buy fast machines to run. 

This, I agree with. So, apparently, does MIPS, since they're part of a
group trying to design better benchmarks. 

> The opinions here are my own and do not necessarily represent those of
> Sun Microsystems.

Are you sure?

I mean, it sounded so much like a product of the Sun Microsystems PR
department! (except that they would not be so clumsy about it, of
course). 

I don't particularly like the MIPS architecture (my favorite of the
recent RISCs is the AMD29000), but the above statistics did not seem
to warrant the conclusions drawn. 

--
Eric Lee Green    ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg
          Snail Mail P.O. Box 92191 Lafayette, LA 70509              
Netter A: In Hell they run VMS.
Netter B: No.  In Hell, they run MS-DOS.  And you only get 256k.

aoki@faerie.Berkeley.EDU (Paul M. Aoki) (12/18/88)

In article <6476@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green) writes:
>in article <82150@sun.uucp>, edkelly%aisling@Sun.COM (Ed Kelly) says:
>> For the comparison we chose a large portable C program (the GNU C Compiler rev
>> 1.24)
>Step 1: choose a program. Fine. You did that right. 

Is it necessarily right?  How about "Step 1: choose a large number of common 
integer and floating point programs"?

>> If you are interested in architecture and wish to avoid the 
>> confusion of implementation details these are the numbers of most
>> interest. 
>OK, so you captured dynamic trace statistics. So what. Lower number of
>instructions executed doesn't necessarily mean faster execution, or
>else the Vax 780 would be the world's fastest machine ;-).

Hey, I get to pull out those notes from Patterson's class again!
(Actually I'm pulling this out of [cache?] memory.)

[ CR = clock rate (cycles/sec), IC = # inst (/prog), CPI = cycles/inst, 
  P = "performance" (prog/sec) ]

"Performance" is CR/(CPI * IC).  A 11/780 may have a lower IC but the CR/CPI 
isn't at all comparable to a Sun4 or M/1000.  One the other hand, if two 
machines have similar CR/CPI figures (as these two do) the machine that 
executes the fewest instructions wins (until the technology changes again).

So IC really does matter here, and it will continue to matter a lot as
long as the CR/CPIs are comparable.
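
A quick C rendering of that formula, plugging in the approximate CR, CPI, and
IC values from Kelly's tables, lands right on the measured user times:

    #include <stdio.h>

    /* performance in programs per second: P = CR / (CPI * IC) */
    static double perf(double cr, double cpi, double ic)
    {
        return cr / (cpi * ic);
    }

    int main(void)
    {
        /* rough figures from Kelly's posting */
        double pa = perf(16.67e6, 1.84, 16.3e6);  /* Sun-4/280   */
        double pb = perf(15.00e6, 1.77, 18.6e6);  /* MIPS M/1000 */

        printf("Sun-4/280 time = %.2f s\n", 1.0 / pa);  /* ~1.8 s */
        printf("M/1000    time = %.2f s\n", 1.0 / pb);  /* ~2.2 s */
        printf("ratio          = %.2f\n", pa / pb);
        return 0;
    }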

Got that?  There will be a quiz at the end of this posting...

>> MIPS compiler is not significantly better than the current SPARC compiler. 
>> Considering the bad press, I will admit I was surprised by this
>> myself. 
>Doesn't surprise me too greatly.

Well, here are some more sample dynamic instruction counts from pixie 
and spixstats, in millions:

Machine:	Sun4	M/1000
Opt Level:	-O4	-O3

bison		28.5	21.9
cc1 (gcc-1.30)	10.9	12.0	[ -O2 for mips, uld dumped core at -O3 ]
compress	197	202	[ two loops ]
gnu diff	30.3	102	[ bug in mips cc ]
gnu egrep	3.3	5.1	[ one loop, difference is all nops, addr calc ]
gnu awk-1.1	28	27	[ weird code, both optimizers had a hard time ]
TimberWolf3.3	230	175
doduc		366	287	[ sun does lots of extra s<->d prec conversion ]

So it can go both ways, for both compiler and ISA reasons.  I have my 
own opinions about the compilers from looking at assembly code but 
I'll let qualified people pass official judgment on them.  [ I'm
in enough trouble, grad students aren't supposed to have opinions in 
the first place :-) ]

I find it hard to argue that SPARC is better architecturally because 
it executes fewer instructions -- it really isn't always true, and 
sometimes it REALLY isn't always true.  I mean -- sweeping generalities
based on a sample of one?

>				  The register windows compensate quite
>well for outdated compiler technology, which is why the UCB guys used
>them in the first place (so they could re-target PCC, instead of
>having to dig up some compiler guys to do a moby optimizing hack).

Well, he wasn't *just* talking about loads and stores...

>> The opinions here are my own and do not necessarily represent those of
>> Sun Microsystems.
>Are you sure?
>I mean, it sounded so much like a product of the Sun Microsystems PR
>department! (except that they would not be so clumsy about it, of
>course). 

Sigh.  Looks like the RISC wars really are on again, bigger and badder 
than ever ... 

[ OK, so I lied about the quiz. ]
----------------
Paul M. Aoki
CS Division, Dept. of EECS // UCB // Berkeley, CA 94720		(415) 642-1863
aoki@postgres.Berkeley.EDU					...!ucbvax!aoki

pavlov@hscfvax.harvard.edu (G.Pavlov) (12/19/88)

In article <82150@sun.uucp>, edkelly%aisling@Sun.COM (Ed Kelly) writes:
> 
>             A COMPARISON OF SPARC VS MIPS ON A LARGE C PROGRAM.
> 
> For the comparison we chose a large portable C program (the GNU C Compiler rev
> 1.24) and compiled the identical source on a Sun-4/280 with the SPARC compiler
> to produce a SPARC binary, and on a MIPS M/1000 with the MIPS compiler to 
> produce a MIPS binary.              ^^^^^^^^^^^

  Why did you not use a current-generation MIPS in the comparison ???

dce@mips.COM (David Elliott) (12/19/88)

In article <697@hscfvax.harvard.edu> pavlov@hscfvax.harvard.edu (G.Pavlov) writes:
>In article <82150@sun.uucp>, edkelly%aisling@Sun.COM (Ed Kelly) writes:

[comparison of code generated by SPARC on Sun 4 and MIPS on an M/1000]

>  Why did you not use a current-generation MIPS in the comparison ???

There's no point.  The M/120 and M/2000 have the same compilers and run
the same object code.  Since the comparison is of code generation and
not execution speed, the actual machine doesn't make a big difference.

-- 
David Elliott		dce@mips.com  or  {ames,prls,pyramid,decwrl}!mips!dce
"Did you see his eyes?  Did you see his crazy eyes?" -- Iggy (who else?)

dce@mips.COM (David Elliott) (12/20/88)

In article <697@hscfvax.harvard.edu> pavlov@hscfvax.harvard.edu (G.Pavlov) writes:
>  Why did you not use a current-generation MIPS in the comparison ???

In another article, I stated that this didn't matter.  I wasn't
paying attention to the original article enough to realize that
there were some system-dependent things being measured with respect
to the cache.

I tried to cancel that article, but upon failing that, am submitting
this retraction.  If our news administrator can figure out how to
cancel this article, he can cancel the other one as well.

-- 
David Elliott		dce@mips.com  or  {ames,prls,pyramid,decwrl}!mips!dce
"Did you see his eyes?  Did you see his crazy eyes?" -- Iggy (who else?)

csimmons@hqpyr1.oracle.UUCP (Charles Simmons) (12/20/88)

In article <6476@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green) writes:
>I don't particularly like the MIPS architecture (my favorite of the
>recent RISCs is the AMD29000),
>--
>Eric Lee Green    ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg

Not that anyone will care, but I was kind of thinking that the
MIPS architecture was the sexiest architecture I'd seen since
the PDP-11.  I feel like I stand a reasonable chance of keeping
the complete instruction set in my head, including all the special
privileged instructions.

(Nice job guys!)

-- Chuck

andrew@frip.gwd.tek.com (Andrew Klossner) (12/20/88)

> Benchmark Data
> 			Sun-4/280		MIPS M1000
> clock			16.67MHz		15MHz
> user time		1.797secs		2.2secs
> system time		.285secs		.3secs

The MIPS data looks suspect.  How many times was the program run, and
what were the standard deviations for these measurements?

If, as it appears, you ran the MIPS job only once, you don't have
enough precision to draw the conclusions you did.  And even if these
figures are precise, to derive a seven-digit number like 1,400,083 from
two-digit numbers seems a little silly.

  -=- Andrew Klossner   (uunet!tektronix!hammer!frip!andrew)    [UUCP]
                        (andrew%frip.gwd.tek.com@relay.cs.net)  [ARPA]

cosmos@druhi.ATT.COM (Ronald A. Guest) (12/20/88)

In article <697@hscfvax.harvard.edu>, pavlov@hscfvax.harvard.edu (G.Pavlov) writes:
> In article <82150@sun.uucp>, edkelly%aisling@Sun.COM (Ed Kelly) writes:
> > 
> >             A COMPARISON OF SPARC VS MIPS ON A LARGE C PROGRAM.
> > 
> > For the comparison we chose a large portable C program (the GNU C Compiler rev
> > 1.24) and compiled the identical source on a Sun-4/280 with the SPARC compiler
> > to produce a SPARC binary, and on a MIPS M/1000 with the MIPS compiler to 
> > produce a MIPS binary.              ^^^^^^^^^^^
> 
>   Why did you not use a current-generation MIPS in the comparison ???

I agree.  Trying to pick the same clock rate is bogus.  What customers care
about is what they can get their hands on today.  MIPS has both a M/120 and
an M/2000.  Somehow I think if you had used an M/2000 you would have gotten
different performance results.  I could care less about subjective measures
of architectural nicety, since I am a CPU user.  What I care about is cost
and performance.  And who said a compiler was a good benchmark?  Is this the
only public program you did this test on?  Doesn't really matter, I suppose.
The mud-slinging does make this one of the more interesting 'technical'
newsgroups!

Ronald A. Guest, Supervisor     cosmos@druhi.ATT.COM  or  att!druhi!cosmos
AT&T Bell Laboratories          <--- but these are my thoughts, not theirs
12110 N. Pecos St.              Denver, Colorado 80234          (303) 538-4896

root@helios.toronto.edu (Operator) (12/23/88)

In article <697@hscfvax.harvard.edu> pavlov@hscfvax.harvard.edu (G.Pavlov) writes:
>In article <82150@sun.uucp>, edkelly%aisling@Sun.COM (Ed Kelly) writes:
>> 
>>             A COMPARISON OF SPARC VS MIPS ON A LARGE C PROGRAM.
>> 
>> For the comparison we ...
>> ... compiled the identical source on a Sun-4/280 with the SPARC compiler
>> to produce a SPARC binary, and on a MIPS M/1000 with the MIPS compiler to 
>> produce a MIPS binary.              ^^^^^^^^^^^
>
>  Why did you not use a current-generation MIPS in the comparison ???

But that would hardly be a fair test. The Sun 4/280 has been out for well
over a year now, as has the M/1000. They are same-generation machines. 
And while we all know that MIPS has a new R3000 RISC chip, has anybody
seen a machine using it in full operation yet? It will probably be out
shortly, but I don't doubt Sun has something in the works as well (they'd
have to if they want to keep selling machines). Then we can compare the two 
of those.

Of course, even comparing a year-old Sun 4 to a year-old M/1000, the MIPS
machine wins hands down. We have found the M/1000 to be at least 3 times
faster than a Sun 4/280, for both C and Fortran. Admittedly part of this
advantage is due to MIPS' terrific compilers, but hey, that's all part of
the contest. Ultimate performance is what counts, not just hardware speed,
and more companies would do well to follow MIPS' example of carefully
optimizing their compilers for their RISC architecture.
-- 
 Ruth Milner          UUCP - {uunet,pyramid}!utai!helios.physics!sysruth
 Systems Manager      BITNET - sysruth@utorphys
 U. of Toronto        INTERNET - sysruth@helios.physics.utoronto.ca
  Physics/Astronomy/CITA Computing Consortium

jk3k+@andrew.cmu.edu (Joe Keane) (12/23/88)

I call a win for MIPS.  MIPS has 15% more instructions, accounted for almost
exactly by the difference in NOPs (everything else balances out).  SPARC has 34%
more raw cycles, due mostly to (surprise, surprise) loads and stores.
Unfortunately, the M1000 seems to lose big from a too-small cache.  But I have
no doubt that (at any given time) the newest MIPS implementation should have
more MHz and a bigger cache than the newest SPARC implementation.  Flame away...

--Joe

cosmos@druhi.ATT.COM (Ronald A. Guest) (12/27/88)

In article <677@helios.toronto.edu>, root@helios.toronto.edu (Operator) writes:
> In article <697@hscfvax.harvard.edu> pavlov@hscfvax.harvard.edu (G.Pavlov) writes:
> >In article <82150@sun.uucp>, edkelly%aisling@Sun.COM (Ed Kelly) writes:
> >> 
> >>             A COMPARISON OF SPARC VS MIPS ON A LARGE C PROGRAM.
> >> 
> >> For the comparison we ...
> >> ... compiled the identical source on a Sun-4/280 with the SPARC compiler
> >> to produce a SPARC binary, and on a MIPS M/1000 with the MIPS compiler to 
> >> produce a MIPS binary.              ^^^^^^^^^^^
> >
> >  Why did you not use a current-generation MIPS in the comparison ???
> 
> But that would hardly be a fair test. The Sun 4/280 has been out for well

Ahhh....But from a user's standpoint it would be a very fair test.  As a
user, I compare the best of what is available today from all vendors.  And,
I interpreted the article (the original posting by Sun) as one oriented more
toward users than architectural niceties.  As other posters have
pointed out, it really wasn't a scientific study of the pros and cons of the
two architectures (and they both do have pros and cons).  As far as which
architecture is 'best', I think that can really only be answered in terms of
the application.  Gcc might be typical for some applications, but for others
it would yield misleading results.  And since we are talking about fast RISC
machines, has anyone done extensive independent benchmarking on the
Silicon Graphics multiprocessor system?  I understand they have implemented
a second level cache and cache snooping.

Ronald A. Guest, Supervisor     cosmos@druhi.ATT.COM  or  att!druhi!cosmos
AT&T Bell Laboratories          <--- but these are my thoughts, not theirs
12110 N. Pecos St.              Denver, Colorado 80234          (303) 538-4896

mash@mips.COM (John Mashey) (12/31/88)

Well, I've been over in the Far East for a couple weeks and then out with the
holidays, so I'm only just finally getting dug out enough to read the news and
see I've missed all the fun.
Thanx to Ed Kelly for starting this, raising some issues, and posting some
actual numbers so that people can do some analysis.  Some of the
comments I might have made have already been made by people quicker with the
keyboards.  This also stirred me up to complete some relevant discussion
that I've been working on for a while.

Here's the outline:

PART 1: Necessary Benchmarking Background (mashey)
1.1 BENCHMARKING PERFORMANCE DISTRIBUTIONS
	CASE 0: (generic case)
	CASE 1: (R(i) = Geom(B) for all i)
	CASE 2: Modest variation
	CASE 3: (even more variation)
	CASE 4: wild variation
1.2 HARDWARE FACTORS THAT AFFECT VARIATIONS
1.3 SPEC
1.4 IMPLICATIONS

PART 2A: Some quick notes on duplicating the benchmark (killian)
PART 2B: Detailed Analysis of the GCC Benchmark (mashey)

(Part 2 to follow: Part 1 is already long; too many of the relevant people
have been away, and we had a little trouble duplicating the numbers; look for
it in about a week; it's about the same size as this one.)

---------
1.1 BENCHMARKING PERFORMANCE DISTRIBUTIONS

Suppose you run N benchmarks on two machines, A and B.  For benchmark i, let
T(i,x) be the time to run on x, and compute the performance ratio R(i) for
benchmark i as T(i,A)/T(i,B).  [If A happened to be a VAX-11/780 under VMS, this
number would be the VUP (VAX Unit of Performance) number for the specific
benchmark.]
Now, renumber the benchmarks in order of increasing R(i) and graph the result,
on a scale where (of course) the performance of A is 1 on every benchmark,
and where Geom(x) = geometric mean of the performance ratios for machine x.
(Why geometric mean?  Arithmetic mean is wrong for ratios; one could argue
about harmonic...)  Here are some cases you might see:

CASE 0: (generic case)
Big R 	|
	|						B	B
	|					B
Geom(B)	|.	.	.	.	B	.	.	.
	|			B
	|	B	B
Small R	|B
	 1	2	3	4	5	6	7...	N

Just to be clear, this means B underperforms Geom(B) on 1-4 and outperforms
it on 6...N.

People who've seen graphs like this sometimes ask why we usually sort it
to get one or more of the machines monotonic or close.  Nothing magic: we've
tried various ways, and this has the least visual confusion, especially
when graphing 3-5 machines together on one chart.  Of course, it's not generally
possible to get ALL machines in such a chart monotonic, especially for
unrelated machines; nevertheless, people seem to see the graphs easier
if some sorting is done to remove the jerkiness.  Finally, it is
sometimes easier to see patterns when this is done.

Of course, I didn't say where Geom(A) (= 1.0 by definition) was on the chart:
it could be < Geom(B), in which case you'd claim that A was slower than
B on this set of benchmarks, it could be ==, it could be >.
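
For concreteness, a small C sketch of the ratio-and-geometric-mean
computation (the benchmark times are made up):

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* made-up times (seconds) for N benchmarks on machines A and B */
        double ta[] = { 10.0, 8.0, 12.0, 30.0, 5.0 };
        double tb[] = {  9.0, 8.5,  7.0, 15.0, 6.0 };
        int n = sizeof ta / sizeof ta[0];
        double logsum = 0.0;
        int i;

        for (i = 0; i < n; i++) {
            double r = ta[i] / tb[i];       /* R(i) = T(i,A)/T(i,B) */
            printf("R(%d) = %.2f\n", i + 1, r);
            logsum += log(r);
        }
        /* geometric mean = exp(mean of the logs); an arithmetic mean of
           ratios would overweight the benchmarks where B looks best */
        printf("Geom(B) = %.2f\n", exp(logsum / n));
        return 0;
    }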

Now, one can observe some interesting cases, depending on the benchmarks
selected:

CASE 1: (R(i) = Geom(B) for all i)
R 	|
Geom(B)	|B	B	B	B	B	B	B	B
	|					
Geom(A)	|.	.	.	.	.	.	.	. (1.0)
	|
	|
	 1	2	3	4	5	6	7...	N

In this case, B is uniformly faster than A.  The only realistic circumstance
I've ever seen come close to this is where you're doing CPU benchmarks
of two machines that use the same CPU, same software, same memory system,
and only differ in clock rate THAT SCALES UNIFORMLY THROUGHOUT THE CPU+MEMORY
SYSTEM.  Needless to say, few peripherals work this way, so I/O tests of
two systems like this don't scale the same way.  In addition, on some
tests you might get surprised by timer interrupts (the faster
machine has fewer of them to deal with if the timer isn't scaled up, which
usually is not done), or DRAM refresh overhead (maybe).  That means most
graphs look more like CASE 0 after all, which means that it is now time to look
at the values on the vertical axis.

EXAMPLE 1: at MIPS, the closest case to this is the M/800(M/1000) pair,
which are essentially identical, except for the clock rates (12.5Mhz or 15MHz).

EXAMPLE 2: various PC families show this behavior, differing only by clock rate.

Consider the case where there is some variation, but many of the ratios
cluster around the same value, probably close to the geometric mean:

CASE 2: Modest variation
R 	|
1.1G(B)	|						B	B
Geom(B)	|.	.	B	B	B	B	.	.
	|	B
.9G(B)	|B
	 1	2	3	4	5	6	7...	N

(Note, this doesn't mean that Geom(B) == 1; it means that the R values
cluster from .9*G(B) to 1.1*G(B).)

You'd guess from this that the two machines are part of the same family,
with the same software, and moderate differences in clock-rate or
small details of implementation. (Note: major differences in clock-rate
will probably spread things further apart).

EXAMPLE 1:
VAX 8650:8700 comparison looks this way (I think), with differences in
pipelining and the switch from write-back cache (in 8650) to write-thru
(in 8700) sometimes making noticeable differences in either direction. (Compare
differences in Whetstone & Linpack between them, for example).  I think the
two machines are about the same speed; maybe someone from DEC will comment.

EXAMPLE 2: 386-based PCs sometimes show this behavior more than do 286-based
ones, as the former more often use different sorts of memory systems.

EXAMPLE 3: the MIPS M/1000 versus M/120 is sort of this way:
	a) Both use 64KI + 64KD caches, 4-deep write-buffers, 1-word
	cache-refill.
	b) The clock rate changed from 15MHz to 16.7MHz, which by itself
	would not spread the ratios.
	c) The M/120 has a lower-latency memory system, i.e., so that it
	survives high-cache-miss rate programs better.  Thus, programs
	with low cache-miss rates will tend towards the left of the chart,
	where a 120 gets only the clock-rate difference (16.7/15);
	higher cache-miss rate programs will tend towards the right side.

CASE 3: (even more variation)
R 	|
1.5*G(B)|						B	B
	|					B
Geom(B)	|.	.	.	.	B	.	.	.
	|			B
	|	B	B
.75*G(B)|B
	 1	2	3	4	5	6	7...	N

This says that there is a 2X variation (1.5/.75) in the relative
performance of the two machines.  This is not at all atypical of a
randomly-selected pair of real machines.  As shown by DEC (McInnis,
Kusik, and Bhandarkar, "VAX 8800 System Overview", IEEE CH2409-01/87)
a VAX 8700 was anywhere from 3X to 7X faster
than an 11/780, even with the same software.   Most of the MIPS systems
versus VAXen have a similar-looking chart, as a gross first-order
approximation.   Some of the more extreme MIPS pairwise combinations
get up around a 1.3X variation (for example, M/2000 versus M/1000):
	if a benchmark has a low cache miss rate, the ratio is close to
		the clock-rate difference.
	if a benchmark has a high data cache miss rate, and block-fetch
		works, the 2000 is better than the 1000 by more than the
		clock rate.
	if a benchmark has a high data cache miss rate, and block-fetch
		DOESN'T work (compress is the notorious example), then
		a 2000 is not as much better as the clock-rate difference.
		(Compress is notorious because it hashes data into a huge
		sparse array & 1-word-refilled data caches are BETTER
		than N-word-refilled caches, which is not often true.)


CASE 4: wild variation
100G(B)	|							B
	|						B
	|					B
Geom(B)	|.	.	.	.	B	.	.	.
	|			B
	|	B	B
.5G(B)	|B
	 1	2	3	4	5	6	7...	N

This is what you typically would see when comparing a vector machine (B)
with a scalar machine, or maybe two vector machines optimized towards
shorter or longer vector lengths, or multiprocessors of various kinds,
or....  There is of course, nothing inherently bad about such variation,
except that the MORE VARIATION THERE IS, THE MORE YOU'D BETTER BE CAREFUL
ABOUT YOUR OWN WORKLOAD AND UNDERSTANDING WHICH BENCHMARKS, IF ANY, ARE
TRULY REPRESENTATIVE OF IT.

1.2 HARDWARE FACTORS THAT AFFECT THE DISTRIBUTIONS IN CPU PERFORMANCE:

1) Cached versus non-cached systems
The easiest machines to compare are simple non-cached ones with simple
memory systems, because the graphs will tend to look like CASE 2.
In particular, a few integer benchmarks, and a few FP ones will quickly
give you some idea of what is happening.  Unfortunately, this domain is
basically limited to the slower microprocessor designs, as most others
either use caches, or if not, not may be vector machines with memory systems
optimized for vector transfers.

2) If cached:
	size of cache
	joint versus split I & D
	level of associativity
	write-thru versus write-back
	linesize
	block-transfer size
	etc, etc.
	Size is one of the easiest ones to get surprised with, especially
	on scientific benchmarks with varying array sizes.
	(You can REALLY get misled if you happen to pick a benchmark where
	the particular size happens to fit into machine A's cache, but not
	quite into machine B's.  You can get odd effects where B performs
	relatively better on small problems and big problems, but A does
	better on middle-sized problems.)  This is becoming especially
	relevant as caches grow in size to consume popular benchmarks,
	i.e., the typical 100x100 Linpack is noticeably helped by current
	caches! (I think this is why Dongarra & friends are emphasizing
	plotting of MFLOPS rates over many array sizes to avoid weird
	single-point effects.)

3) Memory system
	random-access-equal versus random-access-unequal variations,
	such as when using page-mode DRAMS, SCRAM-cache (as in Sun-4/110),
	banking schemes, etc, where any of the following might occur:
		2nd access to same page is faster than random
		2nd access is faster if not to the same bank

4) Vector versus scalar system design
	This is one of the most common causes of large variations in
	performance.

5) Multi-processor versus uniprocessor

6) Memory-management design.
	Small programs; big programs; sparse versus dense data, etc, etc.

ALL OF THE ABOVE CAN AFFECT SINGLE-THREAD CPU COMPUTATIONAL
PERFORMANCE.  Finally, along with other design elements (like exception-
	processing), they can affect other performance attributes, which
	this analysis has no pretense of doing anything but listing:
	multi-user/server (versus workstation) performance
	big jobs versus little ones
	user code versus kernel code (they're different)
	commercial versus technical applications
	balance across different applications versus tuned for a few

1.3 SPEC
Most high-speed machines OF COURSE use combinations of the things that
will cause more variations.  This is one of the reasons that {Apollo,
HP, MIPS, Sun} have started SPEC: a lot of the simpler benchmarks used in
the micro world just don't cope very well with the inherent variability
of current high-performance machines; we need more good benchmarks that
cover a wider range of applications.  It is not that this is a new problem,
of course, it's just that in the last few years, fast machines are getting
very cheap, and so the complexity issues found in mainframes/supercomputers,
and then superminis, are migrating into cheap machines, and hence are
more visible to more people.

1.4 IMPLICATIONS
1) If you have N benchmarks, and you select M << N of them, you can usually
prove almost anything about the relative performance of A & B.
For example, there exist benchmarks that can drag almost any machine
down to DRAM random-access speed, no matter what its cache architecture is.
Of course, different benchmarks drag different architectures down to different
degrees.
Some of the nasties are even real programs (like compress, for example).
Note that this implies that a GOOD benchmark suite would:
	a) Avoid tiny programs that fit into trivial caches.
	b) Include some nasties that break everybody's caches (at least
	this year; next year's caches are going to be harder to fill!)
	c) Include some that are real programs, but may fit into
	some caches. [You can't hope to use only b), because caches are
	getting bigger fast, and many real programs are significantly
	helped by 1988/1989 microprocessor caches.]
	d) When possible, have benchmarks whose data sizes are easily
	variable, so that you can plot multiple points, avoiding surprises
	from happenstances of sizing.  This is probably easiest with
	scientific programs; sometimes it's just about impossible with
	some systems programs.  Note that the point is not to make a benchmark
	run long enough to be interesting [that's a separate issue], but to
	vary the sizes to better analyze memory-system effects.

2) As usual, your own application is the best benchmark.  If you don't have 
that, your best bet is to hope to find that some of the benchmarks are
ones that you've found usually correlate with your own applications.

3) If you are able to make N fairly large, you may be able to find
patterns, like "good integer performance, bad floating-point",
"good vector floating-point, bad integer", "good on small benchmarks,
bad on big ones", etc.  The most common patterns are:
	a:Balanced integer and scalar floating-point (superminis, mainframes)
	b:Integer noticeably better than FP (most microprocessors)
	c:FP better than integer, and vector FP much better than scalar FP,
	relative to the mainframe/supermini pattern (most supercomputers and
	mini-supers, although not all.)

4) Most of this was about CPU performance.  When you add other realistic
systems benchmarks, it only gets more complicated, although some of the
same syndromes show up [like testing disk I/O with repeated access to
files that do / do not fit into UNIX buffer caches.]

In PART 2, we'll look at the gcc benchmark described by Ed Kelly in
<82150@sun.uucp>.  Although the test case shown doesn't run as long as one would
like, using gcc as a benchmark is a reasonable and worthy thing, as it:
	a. is a large integer program
	b. is widely available in source form, if not public domain
	c. is an example of something that people really use
and hence, it is probably a good candidate for inclusion in good benchmark
suites, especially as it is excruciatingly hard to get compilers that obey b.

It is also a good example to analyze in detail:
	It is typical of other integer programs in some ways
	It is somewhat atypical of them, in other ways
	It offers a good example of some of the specific cautions mentioned
		above in data interpretation, especially as applied to
		fast machines with differing memory system designs
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

earl@wright.mips.com (Earl Killian) (01/04/89)

Ed Kelly of Sun studied gcc compiling gcc.c on MIPS and SPARC, and
posted some statistics together with his analysis and conclusions.  I
decided to take a look myself (also, it's a likely SPEC benchmark, so
understanding it will be useful).  At first I was unable to duplicate
Kelly's statistics.

gcc compiled on MIPS with cc -O3 and ran without a hitch, whereas Kelly
said -O3 didn't work (-O4 also works if you fix a trivial bug in the
gcc source).  Subsequently we were told that Sun's -O3 problem was
that it ran out of space in /tmp on their machine and not a compiler
bug.

With -O3 I get 17.40M instructions.  At -O2, I get 17.82M instructions
instead of his 18.64M, so there was a big difference to explain.

The major difference between -O2 and -O3 is inter-procedural register
allocation.  A minor difference is that -O2 by default declines to
optimize "big" procedures (> 500 basic blocks) to save on compilation
time during program development.  It warns you by saying
	uopt: Warning: expand_expr: this procedure not optimized
	      because it exceeds size threshold; to optimize this
	      procedure, use -Olimit option with value >= 656.
For benchmarking, I go back and add a -Olimit to the Makefile and
recompile, just as the warning suggests.  If I leave off the
-Olimit then several procedures remain unoptimized and the result
is 18.25M instructions.  Closer to Kelly's result, but still not
there.  (Note that two of the unoptimized procedures are yyparse and
yylex, which are the 2nd and 3rd heaviest contributors to CPU
cycles...)

Kelly was running this benchmark on a System V M/1000 as opposed to a
BSD M/1000 (MIPS sells both flavors of Unix).  When I tried it on
System V I got link errors for BSD-only routines such as bcopy and
bzero, which I solved by adding -lbsd to the command line.  My guess
is that Kelly didn't know about -lbsd and chose to use
straightforward byte-at-a-time bcopy/bzero substitutes.  When I try
that I get 18.68M instructions, which is quite close to his result.

In summary:
18.68M	-O2, no opt of yyparse, yylex etc., no use of library bcopy/bzero
18.64M	posted number
18.25M	-O2, no opt of yyparse, yylex, etc.
17.82M	-O2, optimize yyparse, yylex, etc.
17.40M	-O3
(All results use the MIPS 1.31 compilers, which were released in mid-1988.)

The point of this was to show that Kelly's analysis was built on
questionable statistics.  But even with his statistics as a basis,
some of his conclusions are unwarranted.

As many people pointed out, gcc is only one data point, and it is
unreasonable to conclude anything from a single data point.  There
might be something anomalous in that one case, for example.

One thing I learned in porting gcc is that the MIPS compiler generates
poor code for a C construct that gcc uses heavily (a bit-field enum
that is both aligned and 16 bits in length).  Oh well, every compiler
has some simple things it doesn't bother to special case.  This will
be fixed in a future compiler release.  With that compiler the gcc
instruction count on Kelly's input is 16.47M instructions at -O3
(about 6% fewer instructions).  It is exactly this sort of sensitivity
to small details that makes single-data-point conclusions unreliable.

It also turns out that 6% of the instruction cycles are spent in
printf etc.  I don't know whether the SPARC printf has been heavily
tuned or not; ours has not.  It is fair to include the cost of this as
a system test: that's what the user sees.  However, it is hard to draw
conclusions about Instruction Set Architecture (ISA) + Compilers,
where one is concerned about a % here or there, when noticeable parts
of the code are from libraries.

With those caveats in mind, let's look at some of Kelly's remarks:

   "As will surprise most observers, SPARC executes fewer instructions
   than MIPS."

This doesn't surprise me when I look closer and see how the
instruction counts differ.  After all, the RISC vs. CISC wars were
begun with the premise that instructions were only one term in the
performance equation.  Total performance is what matters.

As several people pointed out on the net, the difference in
instruction counts is primarily attributable to MIPS using a NOP
instruction instead of a hardware interlock for load instructions
(shifting responsibility from hardware to software).  With
interlocking, the load NOPs would be replaced by a single-cycle stall,
so the load NOPs have no direct performance impact (an indirect effect
is the increase in code size affects i-cache miss rates).  To
compensate for the difference in interlocking approach (hardware vs.
software), you can either subtract load nops (.91M) from the MIPS
counts or add SPARC interlocks (1.47M) to the SPARC counts.  With our
1.31 compilers, that makes the difference +-1% for adjusted
instruction count.  (With the compiler that optimizes aligned 16-bit
bit-fields to halfwords, it is 5 to 8% in favor of MIPS.)

But again, instruction counts aren't a good basis for comparison.  I
don't think you can compare ISAs without looking at implementations.
For example, MIPS has a divide instruction and SPARC has none.  Should
we add in our divide interlocks to be fair?  But a hypothetical MIPS
machine could have an 8-cycle divide, so maybe we ought to use 8, not
35, in ISA comparison?  How can this work?

In contrast, comparing cycles or time is more meaningful.  Kelly gives
25.52M as the raw cpu cycle count.  The corresponding MIPS number
(1.31 compilers) is 17.74M.  The large difference is of course due to
the Fujitsu SPARC chip using one extra cycle on loads, 2 extra cycles
on stores, and one extra cycle on untaken branches.

To go beyond cpu performance we need to pick a memory system.  This is
probably a good place to point out that the M/1000 Kelly used is a
lower performance machine than anything we now sell; it has been
essentially obsoleted by the 16.7MHz M/120 (like the M/1000, based on
the R2000) and the 25MHz M/2000 (based on the R3000), both of which
are in production and shipping.

Adding in cache miss cycles, Kelly gives a total of 29.95M cycles for
the Sun 4/280.  For the MIPS M/120 I get 24.19M (27.80M for the
M/1000).  Since the cycle time is the same for both the 4/280 and the
M/120, the cycle counts are directly related to time.

I don't think there's much to squabble about here.  Time is time.  All
the trade-offs have been reduced to a single number.  Kelly might
object that a hypothetical SPARC implementation could avoid the extra
load/store/branch cycles.  Such an implementation is said to be in
progress.  When it's appropriate, why not use it for comparison with
the corresponding MIPS system?

	"For many observers the interesting fact is that for this
   benchmark, the MIPS compiler is not significantly better than the
   current SPARC compiler.  Considering the bad press, I will admit I was
   surprised by this myself."

This statement was unsubstantiated; it is not obvious to me how to
compare compilers based on instruction statistics from different
architectures, especially on only one benchmark.  The few things that
do come to mind suggest that the MIPS compiler is doing a better job,
but given the importance of library code in this benchmark, the whole
subject is on thin ice.  Perhaps Kelly can elaborate?

	"Being a SPARC advocate I would claim that SPARC is
   ARCHITECTURALLY fundamentally better, but the degree of difference is
   probably in the noise in the broader scheme of things."

(-: Gee, being a MIPS advocate, and given the corrected numbers,
should I claim that MIPS ISA is 5-8% fundamentally better?
:-)

Kelly moves on to discuss the architecture of the entire system, not
just the ISA.  I have some quibbles with his methodology (e.g.
inferring anything from Unix runtimes on the order of 1-2 seconds,
where the error per measurement is probably 10% or more), but I really
have to restrict myself to addressing a few of his off-hand remarks
(this posting is already too long).

   "These numbers represent significant differences in the IMPLEMENTATION
   philosophies at Sun and at MIPS. The central goal at MIPS appears to
   have been to achieve a single cycle per instruction, even at the cost
   of cycle time and complexity. Clearly that was not a central goal at
   Sun."

Certainly single-cycle execution was one of several MIPS goals, but I
would not say it came at the expense of cycle time or complexity at
all.  The most significant pressure on cycle time in the R2000 comes
from the use of physical rather than virtual caches, not from
single-cycle execution.  Virtual caches simplify the CPU at the expense
of multi-programming performance and multi-processing implementation
complexity.

	"Our goals were dominated by cycle time and system simplicity.
   Performance on large programs was our design metric.  The first
   SPARC implementation achieved a faster cycle time than the best of
   MIP's first implementations, despite inferior technology."

This is not true.  Both the Fujitsu SPARC and the R2000 are 16.7MHz
chips.  The M/1000 system, based on the R2000, was 15MHz instead of
16.7MHz because it used memory boards from the M/500 generation system
(you could upgrade with a cpu board replacement), and those memory
boards are good to 15MHz.  (The M/500 was introduced 18 months before
the Sun 4/260.)  Both MIPS and its customers ship systems based on the
R2000 at 16.7MHz (the M/1000 just isn't one of them).

Is the Fujitsu SPARC implemented in an inferior technology to the
R2000?  That's hard to call.  The Fujitsu SPARC is implemented in what
is, I think, a 1.5 micron CMOS gate array technology whereas the R2000
is implemented in 2.0 micron custom CMOS technology.  I'm not sure how
to compare these particular apples and oranges.

	 "The MIPS performance brief has concentrated on relatively
   small integer programs that fit in the cache and so benefit well
   from the single cycle loads and stores."

The MIPS performance brief concentrates on large programs.  It is true
that the large programs are floating point; large public-domain
floating point programs are easier to find than large public-domain
integer programs.  The UNIX commands listed in the Brief are at least
reasonably-sized real programs, not toys, and they're what a lot of
people use.  What about the Sun performance brief?  It relies on the
dhrystone and stanford benchmarks, which are much smaller than the
MIPS Unix suite.

   "This overstates the integer performance for large programs, which
   are after all what people buy fast machines to run. MIPS implicitly
   acknowledges this by calling the M1000 a 10 MIP box despite the
   fact that all the published data in the MIPS performance brief
   would say integer performance is greater than 12 MIPs."

Unlike Sun, but like DEC, we consider both floating point and integer
performance when assigning a VUPS (sometimes called MIPS) rating to
our machines.  And yes, we don't use toys like dhrystone and stanford
for our ratings (we report results for them because they're popular).
Read
section 2.1 of the MIPS performance brief for details.  Is there
something wrong with basing ratings on large, real programs?
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086

lgy@blake.acs.washington.edu (Laurence Yaffe) (01/04/89)

In article <677@helios.toronto.edu> sysruth@helios.physics.utoronto.ca (Ruth Milner) writes:
-[...] And while we all know that MIPS has a new R3000 RISC chip, has anybody
-seen a machine using it in full operation yet? It will probably be out
-shortly, ...
- 
- Ruth Milner          UUCP - {uunet,pyramid}!utai!helios.physics!sysruth
- Systems Manager      BITNET - sysruth@utorphys
- U. of Toronto        INTERNET - sysruth@helios.physics.utoronto.ca
-  Physics/Astronomy/CITA Computing Consortium

	I've been using a MIPS M/2000, containing an R3000 cpu for the
last three months.  Right now, it's only running with a 20 MHz clock,
but it should be getting upgraded to 25 MHz any day now.  Even at 20 MHz
it's pretty impressive - I've been meaning to post some performance
figures from my own real programs, but haven't found time.  Sometime
soon, hopefully.

-- 
Laurence G. Yaffe		Internet: lgy@newton.phys.washington.edu
Department of Physics, FM-15	      or: yaffe@phast.phys.washington.edu
University of Washington	Bitnet:   yaffe@phast.bitnet
Seattle WA 98195

lgy@blake.acs.washington.edu (Laurence Yaffe) (01/04/89)

In article <10436@winchester.mips.COM> mash@mips.COM (John Mashey) writes:

	[[Much stuff about benchmarking deleted]]

-Some of the more extreme MIPS pairwise combinations
-get up around a 1.3X variation (for example, M/2000 versus M/1000):
-	if a benchmark has a low cache miss rate, the ratio is close to
-		the clock-rate difference.
-	if a benchmark has a high data cache miss rate, and block-fetch
-		works, the 2000 is better than the 1000 by more than the
-		clock rate.
-	if a benchmark has a high data cache miss rate, and block-fetch
-		DOESN'T work (compress is the notorious example), then
-		a 2000 is not as much better as the clock-rate difference.
-		(Compress is notorious because it hashes data into a huge
-		sparse array & 1-word-refilled data caches are BETTER
-		than N-word-refilled caches, which is not often true.)
					     ^^^^^^^^^^^^^^^^^^^^^^^

    I'm curious about the basis for this judgement.  In much of my own
recent work, I've been dealing with several large integer programs that
do special-purpose symbolic algebra.  Much of the execution time of
these programs is devoted to searches in a large ordered hash table
(~50 KB), plus assorted string operations which typically only access
the first few characters in a string.  These appear to be examples of
programs for which multi-word data cache refill is not helpful.  For
example, comparing a MIPS M/120 (16.7 MHz) versus a 20 MHz M/2000, I've
found:

Program #1 ("obsgen")	MIPS M/120 (-O2)	 821 sec
			MIPS M/2000 (-O2;3.10)	 795 sec

Program #2 ("scrgen")	MIPS M/120 (-O2)	 808 sec
			MIPS M/2000 (-O2;3.10)	 826 sec

Obviously, these two programs may not be representative of "typical"
programs (whatever those are).  However, I would not be surprised if
many "data-management" type programs (with large hash tables, binary
trees, etc.) have similar behavior - namely better performance with
single-word data cache refill.
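
For what it's worth, the access pattern being described looks roughly
like the contrived fragment below (not the actual symbolic-algebra
code; the table size and hash constant are invented).  Each probe lands
on an essentially random line of a large table and reads one word, so
an N-word refill drags in N-1 words that are usually evicted unused,
while a 1-word refill fetches only what is needed.

    /* Hypothetical sketch of a sparse probe pattern that favors
     * single-word cache refill.
     */
    #include <stddef.h>

    #define TABLE_WORDS (1 << 20)        /* large relative to the d-cache */

    static unsigned table[TABLE_WORDS];

    unsigned probe(unsigned key)
    {
        size_t i = (key * 2654435761u) % TABLE_WORDS;  /* scatter probes */
        return table[i];                 /* touches one word of the line */
    }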

-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
-UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
-DDD:  	408-991-0253 or 408-720-1700, x253
-USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

-- 
Laurence G. Yaffe		Internet: lgy@newton.phys.washington.edu
Department of Physics, FM-15	      or: yaffe@phast.phys.washington.edu
University of Washington	Bitnet:   yaffe@phast.bitnet
Seattle WA 98195

earl@wright.mips.com (Earl Killian) (01/07/89)

When I was unable to duplicate Ed Kelly's gcc results I speculated on
a couple of things that could have caused the discrepancy.  Afterward
Ed Kelly and Bob Cmelik from Sun came by and we swapped gcc sources
and found the real answer.  The sources were essentially identical,
but the bison-generated grammar was fairly different.  I don't know
why, but Sun's grammar spends 1.22x more time in yyparse.  So we now
agree that with that grammar, compiled -O2, the result is 18.63M total
instructions.  The -O3 result is 18.19M.

The rest of my comments stand with some modification of the numbers:

Of the 12% difference in instruction count, 6% is due to load nops, 4%
is due to instructions that are fetched but annulled on SPARC, leaving
2% more real instructions on MIPS.  The pre-memory-system cycle count
is now 38% higher for SPARC.  The cpu+memory cycle count is 17% higher
on the Sun 4/260 than on the M/120.  (As before, there's a further 6%
improvement in instructions and cycles with the next compiler, which
fixes the enum:16 bit-field embarrassment.)

I'd like to thank Ed and Bob for bringing their source and helping me
get my facts straight.  Things will be a lot easier with SPEC...
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086