edkelly%aisling@Sun.COM (Ed Kelly) (12/17/88)
A COMPARISON OF SPARC VS MIPS ON A LARGE C PROGRAM.

For the comparison we chose a large portable C program (the GNU C Compiler rev
1.24) and compiled the identical source on a Sun-4/280 with the SPARC compiler
to produce a SPARC binary, and on a MIPS M/1000 with the MIPS compiler to
produce a MIPS binary. Then using the same data (the file gcc.c) we ran the
benchmark on both machines and gathered the dynamic trace statistics provided
by SPIXSTATS and PIXIE, the respective statistics-gathering programs for SPARC
and MIPS. We also measured the user and system time on both machines.

The compiler optimization level was set at -O2 for MIPS (the highest that
would compile) and at -O4 for SPARC. Both compilers were the standard
production versions as of September 1988. MIPS -O2 and SPARC -O4 are
comparable levels of optimization. -O3 was the highest MIPS optimization
available; from data on other C programs, -O3 produces a gain of less than 2%
on average over -O2, so we feel the comparison is valid.

The following is divided into two sections. The first section covers a SPARC
vs. MIPS instruction set architecture comparison, and the second is an
implementation comparison of the Sun-4/280 vs. the M/1000. The architecture
comparison counts INSTRUCTIONS and is useful for comparing instruction sets
and compiler efficiency; it will not vary across implementations if the
compilers are held constant. If you are interested in architecture and wish to
avoid the confusion of implementation details, these are the numbers of most
interest. The implementation comparison counts CYCLES and includes effects
like multi-cycle loads and cache misses.
__________________________________________________________________
 INSTRUCTION SET/REGISTER ARCHITECTURE AND COMPILER COMPARISON
__________________________________________________________________
                           SPARC             MIPS          MIPS-SPARC
Total Instructions      16,313,907        18,635,185       +2,321,278
------------------------------------------------------------------
Detailed Breakdown
------------------------------------------------------------------
Branch nops                109,079         1,170,306
Load nops                       na         1,113,019
Jump nops                  102,417           211,409
other nops                  20,110            99,495
annulled delay slots      (634,700)               na
load interlock cycles   (1,474,619)               na
------------------------------------------------------------------
nops sub-total          231,606 (1.4%)   2,594,229 (14%)   +2,362,623
loads                 3,242,293(19.9%)   3,928,710(21%)      +686,417
stores                1,175,530 (7.2%)   2,037,266(10.9%)    +861,736
conditional branches     2,699,885         2,559,648
unconditional branches     225,739           190,456
jumps                      326,578           498,865
calls                      214,662           213,118
------------------------------------------------------------------
jmp/branch sub-total  3,466,864 (21%)    3,462,087(18.5%)      -4,777
shift                      716,666           890,281
logical set cc             850,121
logical                  1,396,645         1,335,473
arithmetic set cc        1,914,842
arithmetic               1,853,659         3,241,789
set                             na           666,309
save/restore               337,820                na
others                      84,094            41,838
------------------------------------------------------------------
computational sub-total 7,192,084 (44%)  6,175,690 (33%)   -1,016,394
sethi/lui             1,003,713(6.15%)     437,207(2.3%)     -566,506
------------------------------------------------------------------

Some notes on the categories. MIPS "set" could be categorized as arithmetic
or arithmetic set cc. SPARC "save/restore" could be categorized as
arithmetic: they adjust the stack pointer and increment/decrement the window
pointer; the equivalent MIPS operation is to adjust the stack pointer. The
SPARC nops listed as "other" are mostly associated with calls. The "others"
category is mostly multiply related.

As will surprise most observers, SPARC executes fewer instructions than MIPS.
Some specific observations.

1) SPARC has many fewer loads and stores (1,548,153 fewer), which points out
the significant ARCHITECTURAL advantage of register windows. Stated another
way, for this case MIPS has 35% more loads and stores than SPARC. This
benchmark contains more loads and stores than our "average" case of 15% loads
and 6% stores, so the benefits of register windows may actually be understated
here.

2) There are lots of NOPs in MIPS code. This is an ARCHITECTURAL feature.
NOPs are not benign. As well as the direct cycles lost, a large number of
NOPs is bad for code density, and it increases instruction cache miss
penalties (due to more memory accesses and a greater probability of a miss).
A subtle point about the NOPs is that they distort statistics presented as
percentages. MIPS's combined load/store percentage is 32% for this benchmark;
if there were no NOPs the percentage would be 37%, vs. SPARC's 27%.

Current SPARC implementations incur a clock cycle penalty for some of the
cases where MIPS has to insert NOPs, however, so counting all NOPs against
MIPS overstates the situation. This includes the load-use interlock case
(1,474,619) and the untaken annulled branch case (634,700). While these
cycles are not "architectural", many implementations will incur these
penalties.

The ARCHITECTURAL advantage that the annulling feature confers on SPARC
probably needs more explanation. As the MIPS numbers demonstrate, it is
difficult to fill branch delay slots. SPARC uses standard delayed branches
until it cannot fill branch delay slots; it then uses annulling branches and
fills almost all the remaining branch delay slots. Annulling branches that
are taken incur no penalty and represent a performance win for SPARC that
MIPS cannot realize. Minimizing the number of load interlock cycles and
predicting conditional branches is a function of compiler technology. The
load interlock cost could be around 1,000,000 cycles, judging from a
comparison with the MIPS number.
The number of annulled instructions that incur a penalty is reduced with
reasonable branch prediction. Several papers have shown that static branch
prediction can get to 85% for C programs. Currently the Sun compiler gets 60%
correct prediction for this benchmark; 85% prediction would reduce the untaken
annulled branch cycles lost to 263,306. The bottom line about NOPs is that
SPARC is better, due to the annulling ARCHITECTURE feature.

3) SPARC has more sethi instructions (566,506 more). Most of these are due to
the way addresses to global data are generated by the compiler. An
optimization that MIPS employs would eliminate these instructions. SPARC once
performed the optimization (during early development) but we decided to keep
the old a.out format and the old linker, and so postponed the benefit. The
SPARC ABI will allow us to remedy this situation.

4) The category that has the biggest discrepancy against SPARC is
computational (1,016,394). Some of this is probably due to the need to set
condition codes, an ARCHITECTURAL feature of SPARC, but it is not
straightforward to analyze.

5) There are other significant ARCHITECTURAL differences between MIPS and
SPARC that either are not represented in this benchmark or cannot be isolated
with the data. I include this list for completeness.
a) SPARC has a register + register addressing mode for loads and stores that
   MIPS lacks.
b) MIPS has integer multiply and divide instructions that SPARC lacks in
   current implementations.
c) SPARC has load and store double operations (integer and floating point);
   MIPS has no equivalent instructions.
d) MIPS has instructions to move data directly between the integer registers
   and the floating point registers. SPARC has no equivalent instructions.

In summary, for this benchmark, the ARCHITECTURAL benefits of register windows
and annulling more than balance the ARCHITECTURAL losses in computational.
The relatively simple enhancements of sethi elimination, branch prediction
and load interlock removal can buy more than 1,000,000 instructions for
SPARC. From random inspection of code sequences the current SPARC compiler
appears to produce redundant code, so some improvement can be expected in
this area as well.

For many observers the interesting fact is that for this benchmark, the MIPS
compiler is not significantly better than the current SPARC compiler.
Considering the bad press, I will admit I was surprised by this myself. Being
a SPARC advocate I would claim that SPARC is ARCHITECTURALLY fundamentally
better, but the degree of difference is probably in the noise in the broader
scheme of things.

IMPLEMENTATION ANALYSIS. This is mainly for historical perspective and to
present a complete picture.

___________________________________________________________________________
User machine cycles comparison.
___________________________________________________________________________
                               Sun-4/280      MIPS M1000
instructions                  16,313,907      18,635,185
loads (extra cycle)            3,242,293
stores (extra cycles)          2,351,060
load interlock (")             1,474,619
untaken branch (")             1,179,319
annulled cycles (")              634,700
jmp (")                          326,578
mult/div (")                          na         363,987
basic block interlock?                na          51,983
-----------------------------------------------------------
total raw cycles              25,522,476      18,999,172
cache miss cycles              4,427,524*     14,000,828*
-----------------------------------------------------------
total machine cycles          29,950,000      33,000,000
-----------------------------------------------------------
CPI                                 1.84            1.77
CPUI (Cycles per Useful
      Instruction)**                1.86            2.02
MIPS                                9.06            9.23
MUIPS (Millions of Useful
       Instructions/Sec)            8.95            8.06

___________________________________________________________________________
Rough Memory System Analysis
___________________________________________________________________________
                               Sun-4/280      MIPS M1000
memory references             20,731,730      24,601,161 (+3,869,431 +18.6%)
average penalty                       10           10??*
misses/other                442,752(2.1%)*  1,400,083(5.7%)??*

Benchmark Data
                               Sun-4/280      MIPS M1000
clock                           16.67MHz         15MHz
user time                       1.797secs        2.2secs
system time                     .285secs         .3secs

*  These are rough numbers working backwards from the time necessary to run
   the program and the clock frequency. The MIPS cache is write through and
   incurs significant penalties in write stalls. I cannot distinguish the
   magnitude of this effect here.
** Useful Instructions are all instructions not including NOPs.

______________________________________________________________________________
OPERATING SYSTEM OVERHEAD
______________________________________________________________________________

The time spent in the operating system is broadly comparable on both
machines. Detailed analysis of how this breaks down is difficult. In current
SPARC implementations window overflow/underflow is accomplished with trap
handlers. MIPS currently handles TLB misses with trap handlers. The number of
overflows for the Sun-4/280 (with 7 windows) was 4,439 and underflows 4,438,
for a total of 8,877 traps. For SPARC the number of overflows and underflows
is dependent on the number of register windows in an implementation. (e.g. A
Cypress-based design with 8 windows would have 2,569 overflows and 2,568
underflows for this program.)
Each overflow performs eight store doubles and each underflow eight load
doubles. This is equivalent to 71,024 extra loads and 71,024 extra stores for
the 4/280, a tiny fraction (3%) of the total loads and stores. If the TLB
miss rate for MIPS were .1% (an optimistic assumption) this would have
resulted in 24,601 traps. As an approximation, both machines' trap overheads
appear comparable for this benchmark. Most of the system overhead is not in
these trap handlers; for the 4/280 the overflow/underflow trap handlers take
about 545,932 cycles out of the approximately 5,000,000 cycles of system
time.

I should clarify why I am treating underflow and overflow penalties in this
section and not under architecture. As the numbers above show, nearly all
aspects of underflow/overflow penalties are IMPLEMENTATION specific. The
number of register windows and the details of hardware or trap handler
organization, all of which are determined by hardware or kernel
implementations, are what account for this overhead.

______________________________________________________________________________
GENERAL IMPLEMENTATION COMMENTS
------------------------------------------------------------------------------

These numbers represent significant differences in the IMPLEMENTATION
philosophies at Sun and at MIPS. The central goal at MIPS appears to have
been to achieve a single cycle per instruction, even at the cost of cycle
time and complexity. Clearly that was not a central goal at Sun. Most of the
raw CPI differences are due to the multi-cycle loads and stores. This is due
to the single 32-bit bus vs. MIPS's multiplexed 32-bit bus. The single 32-bit
bus was chosen for system simplicity; it also facilitates designing low-cost
systems and multiprocessor systems. Our goals were dominated by cycle time
and system simplicity. Performance on large programs was our design metric.
The first SPARC implementation achieved a faster cycle time than the best of
MIPS's first implementations, despite inferior technology.
The Cypress SPARC implementation is achieving a better cycle time than the
latest MIPS implementation from Performance Semi. (33MHz vs. 25MHz). This is
not coincidental. Fujitsu has announced a new SPARC part for next year that
will have multiple 64-bit busses, will demonstrate a good CPI, and will bury
the myth that SPARC is tied to multi-cycle loads and stores.

MIPS generates more memory references (18.6%, see above) than SPARC, and the
first implementations compounded this with poor cache/memory system design.
As a result, large integer programs perform better overall on the SPARC
implementation, which has a better cache/memory system. The MIPS performance
brief has concentrated on relatively small integer programs that fit in the
cache and so benefit well from the single-cycle loads and stores. This
overstates the integer performance for large programs, which are after all
what people buy fast machines to run. MIPS implicitly acknowledges this by
calling the M1000 a 10 MIPS box despite the fact that all the published data
in the MIPS performance brief would say integer performance is greater than
12 MIPS.

The performance brief also leans heavily on the floating point performance
side, where the first SPARC implementations are clearly inferior to the first
MIPS implementations. This weakness was redressed by the parts announced by
Cypress some time ago.

As the data demonstrates, for a real and significant program, the Sun-4/280
is comparable to the M1000. The data also shows that for this program the
SPARC instruction set and compiler duo are comparable to the MIPS instruction
set and compiler duo.

Ed Kelly
The opinions here are my own and do not necessarily represent those of
Sun Microsystems.
aglew@mcdurb.Urbana.Gould.COM (12/18/88)
Wow! I suppose Ed Kelly has started a performance analysis war between MIPS
and SUN. Don't worry, I'm not getting into it - I'm new enough at Motorola
that I don't want this much exposure just yet. :-(

Ed, can you give us any floating point comparisons between SPARC and MIPS?

I'm sure someone from MIPS will make a detailed response. Me, I just want to
ask a few questions:

>loads      3,242,293(19.9%)  3,928,710(21%)    +686,417
>stores     1,175,530(7.2%)   2,037,266(10.9%)  +861,736

Phew! Now I'll expose how new I am at this game by giving a big sigh of
relief. I had to do some explaining a while back about why my measurements
were showing load/store ratios of 2-2.5:1, as opposed to the 3:1 everyone
KNEW was the typical ratio of loads to stores. (NB: this was not on an 88K. I
cannot report any 88K data.) Investigation showed that many of the extra
stores were in register saves - which may also be shown by the above
difference between the SPARC with register windows and the MIPS without. Does
anyone at MIPS have breakdowns for their load/store traffic according to
purpose?

> The bottom line about NOPs is SPARC is better due to the annulling
>ARCHITECTURE feature.

Just so long as annulling doesn't cost you anything in cycle time. As I am
sure everyone will point out.

>instructions          16,313,907    18,635,185
>-----------------------------------------------------------
>total raw cycles      25,522,476    18,999,172
>
>cache miss cycles      4,427,524*   14,000,828*
>-----------------------------------------------------------
>total machine cycles  29,950,000    33,000,000

Well, I'm afraid that I don't see these numbers as expressing a fundamental
difference between the two processors. Architecturally SPARC wins out with
fewer instructions. Implementationally (is that a word?) MIPS wins out with
fewer cycles - but SPARC might always be implemented with fewer cycles per
instruction (but then, so might a VAX :-).
SPARC takes fewer cache misses, due to register windows and better code
density - but it is always possible that in a future version of MIPS a better
memory system, possibly in the form of a large on-chip cache using the space
that register windows occupy, might win this back. (One of my favorites is
caching the first word of every cache line on chip, with the rest off-chip --
but then, I was thinking mostly about vector processing in my last job.)

Finally, for the next step in microprocessor architectures, I'd guess that it
would be easier to dispatch more than one of MIPS' instructions at once than
SPARC's comparatively complex instructions (addressing modes and condition
codes are a bitch).

Once again, before the wars start - I'd like to thank Ed Kelly for presenting
this data.

Andy "Krazy" Glew   aglew@urbana.mcd.mot.com   uunet!uiucdcs!mcdurb!aglew
Motorola Microcomputer Division, Champaign-Urbana Design Center
1101 E. University, Urbana, Illinois 61801, USA.

My opinions are my own, and are not the opinions of my employer, or any other
organisation. I indicate my company only so that the reader may account for
any possible bias I may have towards our products.
elg@killer.DALLAS.TX.US (Eric Green) (12/18/88)
in article <82150@sun.uucp>, edkelly%aisling@Sun.COM (Ed Kelly) says:
> A COMPARISON OF SPARC VS MIPS ON A LARGE C PROGRAM.
>
> For the comparison we chose a large portable C program (the GNU C Compiler rev
> 1.24) and compiled the identical source on a Sun-4/280 with the SPARC compiler
> to produce a SPARC binary, and on a MIPS M/1000 with the MIPS compiler to
> produce a MIPS binary.

Step 1: choose a program. Fine. You did that right. Even used the right
compiler -- the standard one.

> Then using the same data (the file gcc.c) we ran the
> benchmark on both machines and gathered the dynamic trace statistics provided
> by SPIXSTATS and PIXIE,
> If you are interested in architecture and wish to avoid the
> confusion of implementation details these are the numbers of most
> interest.

OK, so you captured dynamic trace statistics. So what? A lower number of
instructions executed doesn't necessarily mean faster execution, or else the
Vax 780 would be the world's fastest machine ;-). I happen to agree that some
sort of register window setup is a Big Advantage architecturally, but I don't
think that a dogmatic "register windows are better" is warranted.

> 2) There are lots of NOPs in MIPS code. This is an ARCHITECTURAL feature.
> NOPs are not benign. As well as the direct cycles lost, lots of NOPs is bad
> for code density, and it increases instruction cache miss penalties(due to more
> memory accesses and greater probability of a miss).

The delay slots filled by NOPs also allow you to schedule instructions on
LOADs etc. when the pipeline would otherwise be stalled, which seems to me to
make the whole issue somewhat of a tossup. You can do the same sort of
instruction rearrangement without that guaranteed delay, but it becomes more
of an iffy proposition. As I mentioned before, if code density were the sole
determinant of architectural quality, we should all use Vaxen.
> In summary, for this benchmark, the ARCHITECTURAL benefits of register windows
> and annulling more than balance the ARCHITECTURAL losses in
> computational.

Hmm... I wouldn't be quite so dogmatic about it if I were you. The
information presented looks fairly convincing, but there may be alternate
explanations. The only things certain in life are death and taxes.

> MIPS compiler is not significantly better than the current SPARC compiler.
> Considering the bad press, I will admit I was surprised by this
> myself.

Doesn't surprise me too greatly. The register windows compensate quite well
for outdated compiler technology, which is why the UCB guys used them in the
first place (so they could re-target PCC, instead of having to dig up some
compiler guys to do a moby optimizing hack).

> -----------------------------------------------------------
> total raw cycles      25,522,476    18,999,172
>
> cache miss cycles      4,427,524*   14,000,828*
> -----------------------------------------------------------
> total machine cycles  29,950,000    33,000,000

Looks like the M/1000 used needed a larger cache. As David Patterson explains
so ably in his various papers, a larger cache can make up for a lot of memory
bandwidth (which is why a RISC can be faster than a Vax 780). Statistics on
how much cache was available on each machine were not published with this
so-called "performance comparison". I would not be surprised if a MIPS
processor needed a larger cache than a SPARC, just as I would not be
surprised if a SPARC needed a larger cache than a Vax 780. Again, no clear
performance advantage here. Take away a few million cache misses, and the
MIPS looks better than the SPARC (cycle-wise).

[specs on cycle times, other implementation features:]
> These numbers represent significant differences in the IMPLEMENTATION
> philosophies at Sun and at MIPS.

I suspect it's a matter of cash. The more cash you have, the faster process
technology you can buy. Sun isn't exactly cash-starved ;-).
> The MIPS performance brief has concentrated on relatively small
> integer programs that fit in the cache and so benefit well from the single cycle
> loads and stores. This overstates the integer performance for large programs,
> which are after all what people buy fast machines to run.

This, I agree with. So, apparently, does MIPS, since they're part of a group
trying to design better benchmarks.

> The opinions here are my own and do not necessarily represent those of
> Sun Microsystems.

Are you sure? I mean, it sounded so much like a product of the Sun
Microsystems PR department! (Except that they would not be so clumsy about
it, of course.)

I don't particularly like the MIPS architecture (my favorite of the recent
RISCs is the AMD29000), but the above statistics did not seem to warrant the
conclusions drawn.

--
Eric Lee Green    ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg
Snail Mail P.O. Box 92191, Lafayette, LA 70509

Netter A: In Hell they run VMS.
Netter B: No. In Hell, they run MS-DOS. And you only get 256k.
aoki@faerie.Berkeley.EDU (Paul M. Aoki) (12/18/88)
In article <6476@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green)
writes:
>in article <82150@sun.uucp>, edkelly%aisling@Sun.COM (Ed Kelly) says:
>> For the comparison we chose a large portable C program (the GNU C Compiler rev
>> 1.24)
>Step 1: choose a program. Fine. You did that right.

Is it necessarily right? How about "Step 1: choose a large number of common
integer and floating point programs"?

>> If you are interested in architecture and wish to avoid the
>> confusion of implementation details these are the numbers of most
>> interest.
>OK, so you captured dynamic trace statistics. So what. Lower number of
>instructions executed doesn't necessarily mean faster execution, or
>else the Vax 780 would be the world's fastest machine ;-).

Hey, I get to pull out those notes from Patterson's class again! (Actually
I'm pulling this out of [cache?] memory.)

[ CR = clock rate (cycles/sec), IC = # inst (/prog), CPI = cycles/inst,
  P = "performance" (prog/sec) ]

"Performance" is CR/(CPI * IC). An 11/780 may have a lower IC, but its CR/CPI
isn't at all comparable to a Sun4 or M/1000. On the other hand, if two
machines have similar CR/CPI figures (as these two do), the machine that
executes the fewest instructions wins (until the technology changes again).
So IC really does matter here, and it will continue to matter a lot as long
as the CR/CPIs are comparable. Got that? There will be a quiz at the end of
this posting...

>> MIPS compiler is not significantly better than the current SPARC compiler.
>> Considering the bad press, I will admit I was surprised by this
>> myself.
>Doesn't surprise me too greatly.
Well, here are some more sample dynamic instruction counts from pixie and
spixstats, in millions:

Machine:        Sun4    M/1000
Opt Level:      -O4     -O3
bison           28.5    21.9
cc1 (gcc-1.30)  10.9    12.0   [ -O2 for mips, uld dumped core at -O3 ]
compress        197     202    [ two loops ]
gnu diff        30.3    102    [ bug in mips cc ]
gnu egrep       3.3     5.1    [ one loop, difference is all nops, addr calc ]
gnu awk-1.1     28      27     [ weird code, both optimizers had a hard time ]
TimberWolf3.3   230     175
doduc           366     287    [ sun does lots of extra s<->d prec conversion ]

So it can go both ways, for both compiler and ISA reasons. I have my own
opinions about the compilers from looking at assembly code, but I'll let
qualified people pass official judgment on them. [ I'm in enough trouble;
grad students aren't supposed to have opinions in the first place :-) ]

I find it hard to argue that SPARC is better architecturally because it
executes fewer instructions -- it really isn't always true, and sometimes it
REALLY isn't true. I mean -- sweeping generalities based on a sample of one?

> The register windows compensate quite
>well for outdated compiler technology, which is why the UCB guys used
>them in the first place (so they could re-target PCC, instead of
>having to dig up some compiler guys to do a moby optimizing hack).

Well, he wasn't *just* talking about loads and stores...

>> The opinions here are my own and do not necessarily represent those of
>> Sun Microsystems.
>Are you sure?
>I mean, it sounded so much like a product of the Sun Microsystems PR
>department! (except that they would not be so clumsy about it, of
>course).

Sigh. Looks like the RISC wars really are on again, bigger and badder than
ever... [ OK, so I lied about the quiz. ]

----------------
Paul M. Aoki
CS Division, Dept. of EECS // UCB // Berkeley, CA 94720
(415) 642-1863    aoki@postgres.Berkeley.EDU    ...!ucbvax!aoki
pavlov@hscfvax.harvard.edu (G.Pavlov) (12/19/88)
In article <82150@sun.uucp>, edkelly%aisling@Sun.COM (Ed Kelly) writes:
>
> A COMPARISON OF SPARC VS MIPS ON A LARGE C PROGRAM.
>
> For the comparison we chose a large portable C program (the GNU C Compiler rev
> 1.24) and compiled the identical source on a Sun-4/280 with the SPARC compiler
> to produce a SPARC binary, and on a MIPS M/1000 with the MIPS compiler to
> produce a MIPS binary.
                 ^^^^^^^^^^^
Why did you not use a current-generation MIPS in the comparison ???
dce@mips.COM (David Elliott) (12/19/88)
In article <697@hscfvax.harvard.edu> pavlov@hscfvax.harvard.edu (G.Pavlov)
writes:
>In article <82150@sun.uucp>, edkelly%aisling@Sun.COM (Ed Kelly) writes:
[comparison of code generated by SPARC on a Sun 4 and MIPS on an M/1000]
> Why did you not use a current-generation MIPS in the comparison ???

There's no point. The M/120 and M/2000 have the same compilers and run the
same object code. Since the comparison is of code generation and not
execution speed, the actual machine doesn't make a big difference.

--
David Elliott    dce@mips.com or {ames,prls,pyramid,decwrl}!mips!dce
"Did you see his eyes? Did you see his crazy eyes?" -- Iggy (who else?)
dce@mips.COM (David Elliott) (12/20/88)
In article <697@hscfvax.harvard.edu> pavlov@hscfvax.harvard.edu (G.Pavlov)
writes:
> Why did you not use a current-generation MIPS in the comparison ???

In another article, I stated that this didn't matter. I wasn't paying enough
attention to the original article to realize that there were some
system-dependent things being measured with respect to the cache.

I tried to cancel that article, but upon failing that, am submitting this
retraction. If our news administrator can figure out how to cancel this
article, he can cancel the other one as well.

--
David Elliott    dce@mips.com or {ames,prls,pyramid,decwrl}!mips!dce
"Did you see his eyes? Did you see his crazy eyes?" -- Iggy (who else?)
csimmons@hqpyr1.oracle.UUCP (Charles Simmons) (12/20/88)
In article <6476@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green)
writes:
>I don't particularly like the MIPS architecture (my favorite of the
>recent RISCs is the AMD29000),
>--
>Eric Lee Green ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg

Not that anyone will care, but I was kind of thinking that the MIPS
architecture was the sexiest architecture I'd seen since the PDP-11. I feel
like I stand a reasonable chance of keeping the complete instruction set in
my head, including all the special privileged instructions. (Nice job, guys!)

-- Chuck
andrew@frip.gwd.tek.com (Andrew Klossner) (12/20/88)
> Benchmark Data
>                Sun-4/280    MIPS M1000
> clock          16.67MHz     15MHz
> user time      1.797secs    2.2secs
> system time    .285secs     .3secs

The MIPS data looks suspect. How many times was the program run, and what
were the standard deviations for these measurements? If, as it appears, you
ran the MIPS job only once, you don't have enough precision to draw the
conclusions you did. And even if these figures are precise, to derive a
seven-digit number like 1,400,083 from two-digit numbers seems a little
silly.

  -=- Andrew Klossner   (uunet!tektronix!hammer!frip!andrew)     [UUCP]
                        (andrew%frip.gwd.tek.com@relay.cs.net)   [ARPA]
cosmos@druhi.ATT.COM (Ronald A. Guest) (12/20/88)
In article <697@hscfvax.harvard.edu>, pavlov@hscfvax.harvard.edu (G.Pavlov)
writes:
> In article <82150@sun.uucp>, edkelly%aisling@Sun.COM (Ed Kelly) writes:
> >
> > A COMPARISON OF SPARC VS MIPS ON A LARGE C PROGRAM.
> >
> > For the comparison we chose a large portable C program (the GNU C Compiler rev
> > 1.24) and compiled the identical source on a Sun-4/280 with the SPARC compiler
> > to produce a SPARC binary, and on a MIPS M/1000 with the MIPS compiler to
> > produce a MIPS binary.
                   ^^^^^^^^^^^
> Why did you not use a current-generation MIPS in the comparison ???

I agree. Trying to pick the same clock rate is bogus. What customers care
about is what they can get their hands on today. MIPS has both an M/120 and
an M/2000. Somehow I think if you had used an M/2000 you would have gotten
different performance results. I could care less about subjective measures of
architectural nicety, since I am a CPU user. What I care about is cost and
performance.

And who said a compiler was a good benchmark? Is this the only public program
you did this test on? Doesn't really matter, I suppose. The mud-slinging does
make this one of the more interesting 'technical' newsgroups!

Ronald A. Guest, Supervisor    cosmos@druhi.ATT.COM or att!druhi!cosmos
AT&T Bell Laboratories  <--- but these are my thoughts, not theirs
12110 N. Pecos St., Denver, Colorado 80234   (303) 538-4896
root@helios.toronto.edu (Operator) (12/23/88)
In article <697@hscfvax.harvard.edu> pavlov@hscfvax.harvard.edu (G.Pavlov)
writes:
>In article <82150@sun.uucp>, edkelly%aisling@Sun.COM (Ed Kelly) writes:
>>
>> A COMPARISON OF SPARC VS MIPS ON A LARGE C PROGRAM.
>>
>> For the comparison we ...
>> ... compiled the identical source on a Sun-4/280 with the SPARC compiler
>> to produce a SPARC binary, and on a MIPS M/1000 with the MIPS compiler to
>> produce a MIPS binary.
                  ^^^^^^^^^^^
>
> Why did you not use a current-generation MIPS in the comparison ???

But that would hardly be a fair test. The Sun 4/280 has been out for well
over a year now, as has the M/1000. They are same-generation machines. And
while we all know that MIPS has a new R3000 RISC chip, has anybody seen a
machine using it in full operation yet? It will probably be out shortly, but
I don't doubt Sun has something in the works as well (they'd have to if they
want to keep selling machines). Then we can compare the two of those.

Of course, even comparing a year-old Sun 4 to a year-old M/1000, the MIPS
machine wins hands down. We have found the M/1000 to be at least 3 times
faster than a Sun 4/280, for both C and Fortran. Admittedly part of this
advantage is due to MIPS' terrific compilers, but hey, that's all part of the
contest. Ultimate performance is what counts, not just hardware speed, and
more companies would do well to follow MIPS' example of carefully optimizing
their compilers for their RISC architecture.

--
Ruth Milner               UUCP - {uunet,pyramid}!utai!helios.physics!sysruth
Systems Manager           BITNET - sysruth@utorphys
U. of Toronto             INTERNET - sysruth@helios.physics.utoronto.ca
Physics/Astronomy/CITA Computing Consortium
jk3k+@andrew.cmu.edu (Joe Keane) (12/23/88)
I call a win for MIPS. MIPS has 15% more instructions, accounted for almost exactly by the difference in NOPs (everything else balances out). SPARC has 34% more raw cycles, due mostly to (surprise, surprise) loads and stores. Unfortunately, the M/1000 seems to lose big from a too-small cache. But I have no doubt that (at any given time) the newest MIPS implementation should have more MHz and a bigger cache than the newest SPARC implementation. Flame away...

--Joe
cosmos@druhi.ATT.COM (Ronald A. Guest) (12/27/88)
In article <677@helios.toronto.edu>, root@helios.toronto.edu (Operator) writes:
> In article <697@hscfvax.harvard.edu> pavlov@hscfvax.harvard.edu (G.Pavlov) writes:
> >In article <82150@sun.uucp>, edkelly%aisling@Sun.COM (Ed Kelly) writes:
> >>
> >> A COMPARISON OF SPARC VS MIPS ON A LARGE C PROGRAM.
> >>
> >> For the comparison we ...
> >> ... compiled the identical source on a Sun-4/280 with the SPARC compiler
> >> to produce a SPARC binary, and on a MIPS M/1000 with the MIPS compiler to
> >> produce a MIPS binary.
> >                ^^^^^^^^^^^
> >
> > Why did you not use a current-generation MIPS in the comparison ???
>
> But that would hardly be a fair test. The Sun 4/280 has been out for well

Ahhh.... But from a user's standpoint it would be a very fair test. As a user, I compare the best of what is available today from all vendors. And, I interpreted the article (the original posting by Sun) as one oriented more toward users than architectural niceties. As other posters have pointed out, it really wasn't a scientific study of the pros and cons of the two architectures (and they both do have pros and cons).

As far as which architecture is 'best', I think that can really only be answered in terms of the application. Gcc might be typical for some applications, but for others it would yield misleading results.

And since we are talking about fast RISC machines, has anyone done extensive independent benchmarking on the Silicon Graphics multiprocessor system? I understand they have implemented a second level cache and cache snooping.

Ronald A. Guest, Supervisor     cosmos@druhi.ATT.COM or att!druhi!cosmos
AT&T Bell Laboratories          <--- but these are my thoughts, not theirs
12110 N. Pecos St.
Denver, Colorado 80234          (303) 538-4896
mash@mips.COM (John Mashey) (12/31/88)
Well, I've been over in the Far East for a couple weeks and then out with the holidays, so I'm only just finally getting dug out enough to read the news and see I've missed all the fun. Thanx to Ed Kelly for starting this, raising some issues, and posting some actual numbers so that people can do some analysis. Some of the comments I might have made have already been made by people quicker with the keyboards. This also stirred me up to complete some relevant discussion that I've been working on for a while. Here's the outline:

PART 1: Necessary Benchmarking Background (mashey)
    1.1 BENCHMARKING PERFORMANCE DISTRIBUTIONS
        CASE 0: (generic case)
        CASE 1: (R(i) = Geom(B) for all i)
        CASE 2: Modest variation
        CASE 3: (even more variation)
        CASE 4: wild variation
    1.2 HARDWARE FACTORS THAT AFFECT VARIATIONS
    1.3 SPEC
    1.4 IMPLICATIONS
PART 2A: Some quick notes on duplicating the benchmark (killian)
PART 2B: Detailed Analysis of the GCC Benchmark (mashey)
    (to follow: Part 1 is already long; too many of the relevant people have been away, and we had a little trouble duplicating the numbers; look for it in about a week; it's about the same size as this one.)

---------
1.1 BENCHMARKING PERFORMANCE DISTRIBUTIONS

Suppose you run N benchmarks on two machines, A and B. For benchmark i, let T(i,x) be the time to run on x, and compute the performance ratio R(i) for benchmark i as T(i,A)/T(i,B). [If A happened to be a VAX-11/780 under VMS, this number would be the VUP (VAX Unit of Performance) number for the specific benchmark.] Now, renumber the benchmarks in order of increasing R(i) and graph the result, on a scale where (of course) the performance of A is 1 on every benchmark, and where Geom(x) = geometric mean of the performance ratios for machine x. (Why geometric mean? Arithmetic mean is wrong for ratios; one could argue about harmonic...) Here are some cases you might see:

CASE 0: (generic case)

Big R   |
        |            B B
        |          B
Geom(B) |. . . . B . . .
        |      B
        |  B B
Small R |B
         1 2 3 4 5 6 7... N

Just to be clear, this means B underperforms Geom(B) on 1-4 and outperforms it on 6...N. People who've seen graphs like this sometimes ask why we usually sort it to get one or more of the machines monotonic or close. Nothing magic: we've tried various ways, and this has the least visual confusion, especially when graphing 3-5 machines together on one chart. Of course, it's not generally possible to get ALL machines in such a chart monotonic, especially for unrelated machines; nevertheless, people seem to see the graphs more easily if some sorting is done to remove the jerkiness. Finally, it is sometimes easier to see patterns when this is done.

Of course, I didn't say where Geom(A) (= 1.0 by definition) was on the chart: it could be < Geom(B), in which case you'd claim that A was slower than B on this set of benchmarks; it could be ==; it could be >. Now, one can observe some interesting cases, depending on the benchmarks selected:

CASE 1: (R(i) = Geom(B) for all i)

R       |
Geom(B) |B B B B B B B B
        |
Geom(A) |. . . . . . . .  (1.0)
        |
        |
         1 2 3 4 5 6 7... N

In this case, B is uniformly faster than A. The only realistic circumstance I've ever seen come close to this is where you're doing CPU benchmarks of two machines that use the same CPU, same software, same memory system, and only differ in clock rate THAT SCALES UNIFORMLY THROUGHOUT THE CPU+MEMORY SYSTEM. Needless to say, few peripherals work this way, so I/O tests of two systems like this don't scale the same way. In addition, on some tests you might get surprised by: timer interrupts (the faster machine has fewer of them to deal with if the timer isn't scaled up, which usually is not done), or DRAM refresh overhead (maybe). That means most graphs look more like CASE 0 after all, which means that it is now time to look at the values on the vertical axis.

EXAMPLE 1: at MIPS, the closest case to this is the M/800(M/1000) pair, which are essentially identical, except for the clock rates (12.5MHz or 15MHz).
EXAMPLE 2: various PC families show this behavior, differing only by clock rate.

Consider the case where there is some variation, but many of the ratios cluster around the same value, probably close to the geometric mean:

CASE 2: Modest variation

R       |
1.1G(B) |            B B
Geom(B) |. . B B B B . .
        |  B
.9G(B)  |B
         1 2 3 4 5 6 7... N

(Note, this doesn't mean that Geom(B) == 1; it means that the R values cluster from .9*G(B) to 1.1*G(B).) You'd guess from this that the two machines are part of the same family, with the same software, and moderate differences in clock-rate or small details of implementation. (Note: major differences in clock-rate will probably spread things further apart).

EXAMPLE 1: the VAX 8650:8700 comparison looks this way (I think), with differences in pipelining and the switch from write-back cache (in 8650) to write-thru (in 8700) sometimes making noticeable differences in either direction. (Compare differences in Whetstone & Linpack between them, for example). I think the two machines are about the same speed; maybe someone from DEC will comment.

EXAMPLE 2: 386-based PCs sometimes show this behavior more than do 286-based ones, as the former more often use different sorts of memory systems.

EXAMPLE 3: the MIPS M/1000 versus M/120 is sort of this way:
    a) Both use 64KI + 64KD caches, 4-deep write-buffers, 1-word cache-refill.
    b) The clock rate changed from 15MHz to 16.7MHz, which by itself would not spread the ratios.
    c) The M/120 has a lower-latency memory system, i.e., so that it survives high-cache-miss-rate programs better.

Thus, programs with low cache-miss rates will tend towards the left of the chart, where a 120 gets only the clock-rate difference (16.7/15); higher cache-miss-rate programs will tend towards the right side.

CASE 3: (even more variation)

R        |
1.5*G(B) |            B B
         |          B
Geom(B)  |. . . . B . . .
         |      B
         |  B B
.75*G(B) |B
          1 2 3 4 5 6 7... N

This says that there is a 2X variation (1.5/.75) in the relative performance of the two machines.
This is not at all atypical of a randomly-selected pair of real machines. As shown by DEC (McInnis, Kusik, and Bhandarkar, "VAX 8800 System Overview", IEEE CH2409-01/87), a VAX 8700 was anywhere from 3X to 7X faster than an 11/780, even with the same software. Most of the MIPS systems versus VAXen have a similar-looking chart, as a gross first-order approximation. Some of the more extreme MIPS pairwise combinations get up around a 1.3X variation (for example, M/2000 versus M/1000):

    if a benchmark has a low cache miss rate, the ratio is close to the clock-rate difference.
    if a benchmark has a high data cache miss rate, and block-fetch works, the 2000 is better than the 1000 by more than the clock rate.
    if a benchmark has a high data cache miss rate, and block-fetch DOESN'T work (compress is the notorious example), then a 2000 is not as much better as the clock-rate difference. (Compress is notorious because it hashes data into a huge sparse array & 1-word-refilled data caches are BETTER than N-word-refilled caches, which is not often true.)

CASE 4: wild variation

100G(B) |              B
        |            B
        |          B
Geom(B) |. . . . B . . .
        |      B
        |  B B
.5G(B)  |B
         1 2 3 4 5 6 7... N

This is what you typically would see when comparing a vector machine (B) with a scalar machine, or maybe two vector machines optimized towards shorter or longer vector lengths, or multiprocessors of various kinds, or.... There is, of course, nothing inherently bad about such variation, except that the MORE VARIATION THERE IS, THE MORE YOU'D BETTER BE CAREFUL ABOUT YOUR OWN WORKLOAD AND UNDERSTANDING WHICH BENCHMARKS, IF ANY, ARE TRULY REPRESENTATIVE OF IT.

1.2 HARDWARE FACTORS THAT AFFECT THE DISTRIBUTIONS IN CPU PERFORMANCE:

1) Cached versus non-cached systems

The easiest machines to compare are simple non-cached ones with simple memory systems, because the graphs will tend to look like CASE 2. In particular, a few integer benchmarks and a few FP ones will quickly give you some idea of what is happening.
Unfortunately, this domain is basically limited to the slower microprocessor designs, as most others either use caches or, if not, may be vector machines with memory systems optimized for vector transfers.

2) If cached:
    size of cache
    joint versus split I & D
    level of associativity
    write-thru versus write-back
    linesize
    block-transfer size
    etc, etc.

Size is one of the easiest ones to get surprised with, especially on scientific benchmarks with varying array sizes. (You can REALLY get misled if you happen to pick a benchmark where the particular size happens to fit into machine A's cache, but not quite into machine B's. You can get odd effects where B performs relatively better on small problems and big problems, but A does better on middle-sized problems.) This is becoming especially relevant as caches grow in size to consume popular benchmarks, i.e., the typical 100x100 Linpack is noticeably helped by current caches! (I think this is why Dongarra & friends are emphasizing plotting of MFLOPS rates over many array sizes to avoid weird single-point effects.)

3) Memory system: random-access-equal versus random-access-unequal variations, such as when using page-mode DRAMs, SCRAM-cache (as in Sun-4/110), banking schemes, etc, where any of the following might occur:
    2nd access to same page is faster than random
    2nd access is faster if not to the same bank

4) Vector versus scalar system design

This is one of the most common causes of large variations in performance.

5) Multi-processor versus uniprocessor

6) Memory-management design. Small programs; big programs; sparse versus dense data, etc, etc.

ALL OF THE ABOVE CAN AFFECT SINGLE-THREAD CPU COMPUTATIONAL PERFORMANCE.
Finally, along with other design elements (like exception processing), they can affect other performance attributes, which this analysis has no pretense of doing anything but listing:
    multi-user/server (versus workstation) performance
    big jobs versus little ones
    user code versus kernel code (they're different)
    commercial versus technical applications
    balance across different applications versus tuned for a few

1.3 SPEC

Most high-speed machines OF COURSE use combinations of the things that will cause more variations. This is one of the reasons that {Apollo, HP, MIPS, Sun} have started SPEC: a lot of the simpler benchmarks used in the micro world just don't cope very well with the inherent variability of current high-performance machines; we need more good benchmarks that cover a wider range of applications. It is not that this is a new problem, of course; it's just that in the last few years, fast machines are getting very cheap, and so the complexity issues found in mainframes/supercomputers, and then superminis, are migrating into cheap machines, and hence are more visible to more people.

1.4 IMPLICATIONS

1) If you have N benchmarks, and you select M << N of them, you can usually prove almost anything about the relative performance of A & B. For example, there exist benchmarks that can drag almost any machine down to DRAM random-access speed, no matter what its cache architecture is. Of course, different benchmarks drag down different architectures worse. Some of the nasties are even real programs (like compress, for example). Note that this implies that a GOOD benchmark suite would:
    a) Avoid tiny programs that fit into trivial caches.
    b) Include some nasties that break everybody's caches (at least this year; next year's caches are going to be harder to fill!)
    c) Include some that are real programs, but may fit into some caches.
[You can't hope to use only b), because caches are getting bigger fast, and many real programs are significantly helped by 1988/1989 microprocessor caches.]
    d) When possible, have benchmarks whose data sizes are easily variable, so that you can plot multiple points, avoiding surprises from happenstances of sizing. This is probably easiest with scientific programs; sometimes it's just about impossible with some systems programs. Note that the point is not to make a benchmark run long enough to be interesting [that's a separate issue], but to vary the sizes to better analyze memory-system effects.

2) As usual, your own application is the best benchmark. If you don't have that, your best bet is to hope to find that some of the benchmarks are ones that you've found usually correlate with your own applications.

3) If you are able to make N fairly large, you may be able to find patterns, like "good integer performance, bad floating-point", "good vector floating-point, bad integer", "good on small benchmarks, bad on big ones", etc. The most common patterns are:
    a: Balanced integer and scalar floating-point (superminis, mainframes)
    b: Integer noticeably better than FP (most microprocessors)
    c: FP better than integer, and vector FP much better than scalar FP, relative to the mainframe/supermini pattern (most supercomputers and mini-supers, although not all.)

4) Most of this was about CPU performance. When you add other realistic systems benchmarks, it only gets more complicated, although some of the same syndromes show up [like testing disk I/O with repeated access to files that do / do not fit into UNIX buffer caches.]

In PART 2, we'll look at the gcc benchmark described by Ed Kelly in <82150@sun.uucp>. Although the test case shown doesn't run as long as one would like, using gcc as a benchmark is a reasonable and worthy thing, as it:
    a. is a large integer program
    b. is widely available in source form, if not public domain
    c. is an example of something that people really use

and hence, it is probably a good candidate for inclusion in good benchmark suites, especially as it is excruciatingly hard to get compilers that satisfy b. It is also a good example to analyze in detail:
    It is typical of other integer programs in some ways
    It is somewhat atypical of them, in other ways
    It offers a good example of some of the specific cautions mentioned above in data interpretation, especially as applied to fast machines with differing memory system designs
--
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:    408-991-0253 or 408-720-1700, x253
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
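[Moderator's note: Mashey's ratio methodology above lends itself to a small sketch. The following is a minimal illustration, not from the posting; the benchmark times are invented. It computes R(i) = T(i,A)/T(i,B) for each benchmark and the geometric mean he recommends instead of the arithmetic mean for ratios:]

```python
import math

def perf_ratios(times_a, times_b):
    """R(i) = T(i,A) / T(i,B) for each benchmark i, sorted ascending
    (the sort is only for graphing, as in CASE 0)."""
    return sorted(ta / tb for ta, tb in zip(times_a, times_b))

def geom_mean(ratios):
    """Geometric mean: the appropriate average for ratios, since it gives
    the same answer whichever machine is used as the reference."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

# Invented times (seconds) for 5 benchmarks on machines A and B:
t_a = [10.0, 12.0, 8.0, 20.0, 15.0]
t_b = [9.0, 6.0, 7.0, 11.0, 5.0]

rs = perf_ratios(t_a, t_b)   # the spread of R(i) is the "variation"
gb = geom_mean(rs)           # Geom(B), the dotted line on the charts
```

Note that geom_mean([2.0, 0.5]) is exactly 1.0, whereas the arithmetic mean of the same two ratios would claim B is 25% faster; that symmetry is why the arithmetic mean is "wrong for ratios."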
earl@wright.mips.com (Earl Killian) (01/04/89)
Ed Kelly of Sun studied gcc compiling gcc.c on MIPS and SPARC, and posted some statistics together with his analysis and conclusions. I decided to take a look myself (also, it's a likely SPEC benchmark, so understanding it will be useful).

At first I was unable to duplicate Kelly's statistics. gcc compiled on MIPS with cc -O3 and ran without a hitch, whereas Kelly said -O3 didn't work (-O4 also works if you fix a trivial bug in the gcc source). Subsequently we were told that Sun's -O3 problem was that it ran out of space in /tmp on their machine, and not a compiler bug. With -O3 I get 17.40M instructions. At -O2, I get 17.82M instructions instead of his 18.64M, so there was a big difference to explain.

The major difference between -O2 and -O3 is inter-procedural register allocation. A minor difference is that -O2 by default declines to optimize "big" procedures (> 500 basic blocks) to save on compilation time during program development. It warns you by saying

    uopt: Warning: expand_expr: this procedure not optimized because it
    exceeds size threshold; to optimize this procedure, use -Olimit option
    with value >= 656.

For benchmarking, I go back and add a -Olimit to the Makefile and recompile, just as the warning suggests. If I leave off the -Olimit, then several procedures remain unoptimized and the result is 18.25M instructions. Closer to Kelly's result, but still not there. (Note that two of the unoptimized procedures are yyparse and yylex, which are the 2nd and 3rd heaviest contributors to CPU cycles...)

Kelly was running this benchmark on a System V M/1000 as opposed to a BSD M/1000 (MIPS sells both flavors of Unix). When I tried it on System V I got link errors for BSD-only routines such as bcopy and bzero, which I solved by adding -lbsd to the command line. My guess is that Kelly didn't know about -lbsd and chose to use straightforward byte-at-a-time bcopy/bzero substitutes. When I try that I get 18.68M instructions, which is quite close to his result.
In summary:

    18.68M  -O2, no opt of yyparse, yylex, etc., no use of library bcopy/bzero
    18.64M  posted number
    18.25M  -O2, no opt of yyparse, yylex, etc.
    17.82M  -O2, optimize yyparse, yylex, etc.
    17.40M  -O3

(All results use the MIPS 1.31 compilers, which were released in mid-88.)

The point of this was to show that Kelly's analysis was built on questionable statistics. But even with his statistics as a basis, some of his conclusions are unwarranted. As many people pointed out, gcc is only one data point, and it is unreasonable to conclude anything from a single data point. There might be something anomalous in that one case, for example. One thing I learned in porting gcc is that the MIPS compiler generates poor code for a C construct that gcc uses heavily (a bit-field enum that is both aligned and 16 bits in length). Oh well, every compiler has some simple things it doesn't bother to special-case. This will be fixed in a future compiler release. With that compiler the gcc instruction count on Kelly's input is 16.47M instructions at -O3 (about 6% fewer instructions). It is exactly this sort of sensitivity to small details that makes single-data-point conclusions unreliable.

It also turns out that 6% of the instruction cycles are spent in printf etc. I don't know whether the SPARC printf has been heavily tuned or not; ours has not. It is fair to include the cost of this as a system test: that's what the user sees. However, it is hard to draw conclusions about Instruction Set Architecture (ISA) + Compilers, where one is concerned about a % here or there, when noticeable parts of the code are from libraries.

With those caveats in mind, let's look at some of Kelly's remarks:

"As will surprise most observers, SPARC executes fewer instructions than MIPS."

This doesn't surprise me when I look closer and see how the instruction counts differ. After all, the RISC vs. CISC wars were begun with the premise that instructions were only one term in the performance equation.
Total performance is what matters. As several people pointed out on the net, the difference in instruction counts is primarily attributable to MIPS using a NOP instruction instead of a hardware interlock for load instructions (shifting responsibility from hardware to software). With interlocking, the load NOPs would be replaced by a single-cycle stall, so the load NOPs have no direct performance impact (an indirect effect is that the increase in code size affects i-cache miss rates). To compensate for the difference in interlocking approach (hardware vs. software), you can either subtract load nops (.91M) from the MIPS counts or add SPARC interlocks (1.47M) to the SPARC counts. With our 1.31 compilers, that makes the difference +-1% for adjusted instruction count. (With the compiler that optimizes aligned 16-bit bit-fields to halfwords, it is 5 to 8% in favor of MIPS.)

But again, instruction counts aren't a good basis for comparison. I don't think you can compare ISAs without looking at implementations. For example, MIPS has a divide instruction and SPARC has none. Should we add in our divide interlocks to be fair? But a hypothetical MIPS machine could have an 8-cycle divide, so maybe we ought to use 8, not 35, in the ISA comparison? How can this work? In contrast, comparing cycles or time is more meaningful.

Kelly gives 25.52M as the raw cpu cycle count. The corresponding MIPS number (1.31 compilers) is 17.74M. The large difference is of course due to the Fujitsu SPARC chip using one extra cycle on loads, 2 extra cycles on stores, and one extra cycle on untaken branches.

To go beyond cpu performance we need to pick a memory system. This is probably a good place to point out that the M/1000 Kelly used is a lower-performance machine than anything we now sell; it has been essentially obsoleted by the 16.7MHz M/120 (like the M/1000, based on the R2000) and the 25MHz M/2000 (based on the R3000), both of which are in production and shipping.
Adding in cache miss cycles, Kelly gives a total of 29.95M cycles for the Sun 4/280. For the MIPS M/120 I get 24.19M (27.80M for the M/1000). Since the cycle time is the same for both the 4/280 and the M/120, the cycle counts are directly related to time. I don't think there's much to squabble about here. Time is time. All the trade-offs have been reduced to a single number. Kelly might object that a hypothetical SPARC implementation might avoid the extra load/store/branch cycles. Such an implementation is said to be in progress. When it's appropriate, why not use it for comparison with the corresponding MIPS system?

"For many observers the interesting fact is that for this benchmark, the MIPS compiler is not significantly better than the current SPARC compiler. Considering the bad press, I will admit I was surprised by this myself."

This statement was unsubstantiated; it is not obvious to me how to compare compilers based on instruction statistics from different architectures, especially on only one benchmark. The few things that do come to mind suggest that the MIPS compiler is doing a better job, but given the importance of library code in this benchmark, the whole subject is on thin ice. Perhaps Kelly can elaborate?

"Being a SPARC advocate I would claim that SPARC is ARCHITECTURALLY fundamentally better, but the degree of difference is probably in the noise in the broader scheme of things."

(-: Gee, being a MIPS advocate, and given the corrected numbers, should I claim that the MIPS ISA is 5-8% fundamentally better? :-)

Kelly moves on to discuss the architecture of the entire system, not just the ISA. I have some quibbles with his methodology (e.g. inferring anything from Unix runtimes on the order of 1-2 seconds, where the error per measurement is probably 10% or more), but I really have to restrict myself to addressing a few of his off-hand remarks (this posting is already too long).
"These numbers represent significant differences in the IMPLEMENTATION philosophies at Sun and at MIPS. The central goal at MIPS appears to have been to achieve a single cycle per instruction, even at the cost of cycle time and complexity. Clearly that was not a central goal at Sun."

Certainly single-cycle execution was one of several MIPS goals, but I would not say it was at the expense of cycle time or complexity at all. The most significant pressure on cycle time in the R2000 is due to physical instead of virtual caches, not single-cycle execution. Virtual caches simplify the CPU at the expense of multi-programming performance and multi-processing implementation complexity.

"Our goals were dominated by cycle time and system simplicity. Performance on large programs was our design metric. The first SPARC implementation achieved a faster cycle time than the best of MIP's first implementations, despite inferior technology."

This is not true. Both the Fujitsu SPARC and the R2000 are 16.7MHz chips. The M/1000 system, based on the R2000, was 15MHz instead of 16.7MHz because it used memory boards from the M/500-generation system (you could upgrade with a cpu board replacement), and those memory boards are good to 15MHz. (The M/500 was introduced 18 months before the Sun 4/260.) Both MIPS and its customers ship systems based on the R2000 at 16.7MHz (the M/1000 just isn't one of them.)

Is the Fujitsu SPARC implemented in an inferior technology to the R2000? That's hard to call. The Fujitsu SPARC is implemented in what is, I think, a 1.5-micron CMOS gate-array technology, whereas the R2000 is implemented in 2.0-micron custom CMOS technology. I'm not sure how to compare these particular apples and oranges.

"The MIPS performance brief has concentrated on relatively small integer programs that fit in the cache and so benefit well from the single cycle loads and stores."

The MIPS performance brief concentrates on large programs.
It is the case that the large programs are floating point; large public domain floating point programs are easier to find than large public domain integer programs. The UNIX commands listed in the Brief are at least reasonably-sized real programs, not toys, and they're what a lot of people use. What about the Sun performance brief? It relies on the dhrystone and stanford benchmarks, which are much smaller than the MIPS Unix suite.

"This overstates the integer performance for large programs, which are after all what people buy fast machines to run. MIPS implicitly acknowledges this by calling the M1000 a 10 MIP box despite the fact that all the published data in the MIPS performance brief would say integer performance is greater than 12 MIPs."

Unlike Sun, but like DEC, we consider both floating point and integer performance when assigning a VUPS (sometimes called MIPS) rating to our machines. And yes, we don't use toys like dhrystone and stanford for our ratings (we give results because they're popular). Read section 2.1 of the MIPS performance brief for details. Is there something wrong with basing ratings on large, real programs?
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
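[Moderator's note: Killian's normalization of the two instruction counts reduces to a few lines. This sketch uses the counts from Kelly's original posted table (in millions; Killian's 1.31-compiler numbers differ slightly) and applies the two adjustments he describes: charge SPARC for its hardware interlock stalls, or forgive MIPS its software load nops.]

```python
# Counts from Kelly's posted table, in millions.
sparc_insns = 16.31       # total SPARC instructions
sparc_interlocks = 1.47   # SPARC load interlock cycles (hardware stalls)
mips_insns = 18.64        # total MIPS instructions
mips_load_nops = 1.11     # explicit MIPS load-delay nops

# Adjustment 1: count SPARC's one-cycle load stalls as if they were nops.
sparc_adjusted = sparc_insns + sparc_interlocks   # 17.78M
# Adjustment 2: drop the MIPS nops an interlocked design wouldn't execute.
mips_adjusted = mips_insns - mips_load_nops       # 17.53M

gap = (mips_adjusted - sparc_adjusted) / sparc_adjusted   # about -1.4%
```

Either adjustment brings the two totals within a couple of percent of each other, which is the basis for Killian's "+-1%" remark; the residual's exact sign and size depend on which compiler release the MIPS counts come from.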
lgy@blake.acs.washington.edu (Laurence Yaffe) (01/04/89)
In article <677@helios.toronto.edu> sysruth@helios.physics.utoronto.ca (Ruth Milner) writes:
-[...] And while we all know that MIPS has a new R3000 RISC chip, has anybody
-seen a machine using it in full operation yet? It will probably be out
-shortly, ...
-
I've been using a MIPS M/2000, containing an R3000 cpu, for the
last three months. Right now, it's only running with a 20 MHz clock,
but it should be getting upgraded to 25 MHz any day now. Even at 20 MHz
it's pretty impressive - I've been meaning to post some performance figures
from my own real programs, but haven't found time. Sometime soon, hopefully.
--
Laurence G. Yaffe Internet: lgy@newton.phys.washington.edu
Department of Physics, FM-15 or: yaffe@phast.phys.washington.edu
University of Washington Bitnet: yaffe@phast.bitnet
Seattle WA 98195
lgy@blake.acs.washington.edu (Laurence Yaffe) (01/04/89)
In article <10436@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
[[Much stuff about benchmarking deleted]]
-Some of the more extreme MIPS pairwise combinations
-get up around a 1.3X variation (for example, M/2000 versus M/1000):
- if a benchmark has a low cache miss rate, the ratio is close to
- the clock-rate difference.
- if a benchmark has a high data cache miss rate, and block-fetch
- works, the 2000 is better than the 1000 by more than the
- clock rate.
- if a benchmark has a high data cache miss rate, and block-fetch
- DOESN'T work (compress is the notorious example), then
- a 2000 is not as much better as the clock-rate difference.
- (Compress is notorious because it hashes data into a huge
- sparse array & 1-word-refilled data caches are BETTER
- than N-word-refilled caches, which is not often true.)
^^^^^^^^^^^^^^^^^^^^^^^
I'm curious about the basis for this judgement. In much of my own
recent work, I've been dealing with several large, integer programs which
do special purpose symbolic algebra. Much of the execution time of these
programs is devoted to searches in a large ordered hash table (~50 Kb),
plus assorted string operations which typically only access the first
few characters in a string. These appear to be examples of programs
for which multi-word data cache refill is not helpful. For example,
comparing a MIPS M/120 (16.7 MHz) versus a 20 MHz M/2000, I've found:
Program #1 ("obsgen")    MIPS M/120  (-O2)        821 sec
                         MIPS M/2000 (-O2; 3.10)  795 sec
Program #2 ("scrgen")    MIPS M/120  (-O2)        808 sec
                         MIPS M/2000 (-O2; 3.10)  826 sec
Obviously, these two programs may not be representative of "typical"
programs (whatever those are). However, I would not be surprised if
many "data-management" type programs (with large hash tables, binary
trees, etc.) have similar behavior - namely better performance with
single word data cache refill.
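[Moderator's note: the effect Yaffe describes (and Mashey's compress example) can be reproduced with a toy cache model. This is a sketch, not a model of any real MIPS memory system: a direct-mapped cache where each miss pays a fixed base cost plus one extra transfer cycle per additional word in the refill. Under hash-like sparse access there is almost no spatial locality, so widening the refill adds cost without adding hits.]

```python
import random

def refill_cycles(n_accesses, words_per_line, cache_lines=1024,
                  array_words=1 << 22, miss_base=6, seed=1):
    """Cycles spent refilling a toy direct-mapped cache.
    Each miss costs miss_base + (words_per_line - 1) transfer cycles."""
    random.seed(seed)
    tags = [None] * cache_lines
    cycles = 0
    for _ in range(n_accesses):
        word = random.randrange(array_words)   # sparse, hash-table-like access
        line = word // words_per_line
        index = line % cache_lines
        if tags[index] != line:                # miss: fetch the whole line
            tags[index] = line
            cycles += miss_base + (words_per_line - 1)
    return cycles

one_word = refill_cycles(20000, 1)    # 1-word refill (M/1000-style)
four_word = refill_cycles(20000, 4)   # 4-word block refill
```

Run the same model over a sequential address stream and the four-word refill wins easily, amortizing one miss over four accesses; it is only the pathological sparse case, where nearly every access misses regardless of line size, that makes one-word refill come out ahead.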
--
Laurence G. Yaffe Internet: lgy@newton.phys.washington.edu
Department of Physics, FM-15 or: yaffe@phast.phys.washington.edu
University of Washington Bitnet: yaffe@phast.bitnet
Seattle WA 98195
earl@wright.mips.com (Earl Killian) (01/07/89)
When I was unable to duplicate Ed Kelly's gcc results I speculated on a couple of things that could have caused the discrepancy. Afterward Ed Kelly and Bob Cmelik from Sun came by and we swapped gcc sources and found the real answer. The sources were essentially identical, but the bison-generated grammar was fairly different. I don't know why, but Sun's grammar spends 1.22x more time in yyparse. So we now agree that with that grammar, compiled -O2, the result is 18.63M total instructions. The -O3 result is 18.19M.

The rest of my comments stand with some modification of the numbers: Of the 12% difference in instruction count, 6% is due to load nops, 4% is due to instructions that are fetched but annulled on SPARC, leaving 2% more real instructions on MIPS. The cycle difference before the memory system is now 38% more for SPARC. The cpu+memory cycle count is 17% higher on the Sun4/260 than the M/120. (As before, there's a 6% further improvement in instructions/cycles when using the next compiler, which fixes the enum:16 bit-field embarrassment.)

I'd like to thank Ed and Bob for bringing their source and helping me get my facts straight. Things will be a lot easier with SPEC...
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086