baskett@baskett (12/10/87)
I have been asking myself the question, why is SPARC so slow? I've been sparked by John Mashey's fascinating "Performance Brief" and by continuing reports from our customers that our own 4D/70 12.5 MHz MIPS-based workstations outperform Sun-4's on their CPU intensive applications, including image rendering and mechanical design and analysis, in a manner consistent with the benchmarks reported in the Performance Brief. SPARC is not slow compared to traditional microprocessors, granted. But as a Risc microprocessor it seems to have some problems, at least in the first two implementations.

Below are my observations so far on why the Fujitsu version of SPARC is slow compared to the MIPS Risc microprocessor. At least some of the problems of the Fujitsu version (the one in the Sun-4) are also present in the Cypress version, according to the preliminary data sheets. These problems don't necessarily mean that the SPARC architecture has problems, but I'd be reluctant to accept SPARC as the basis for an Application Binary Interface standard until I saw some evidence that high performance implementations of SPARC are possible.

Loads and stores are slow. Loads on both implementations take two cycles and stores take three cycles for 32-bit words, compared to one cycle for each on a MIPS R2000. There are several interrelated reasons for this situation. Briefly, they are lack of a separate address adder, lack of split instruction and data caches, and inability to cycle the address and data bus twice per main clock cycle. Details follow.

Lack of a separate address adder for loads and stores. The R2000 can start the address generation for a load or a store in the second stage of the pipeline because the register access is fast and an address adder is present. Thus the load or store can "execute" in stage 3 of the pipeline, just like the rest of the instructions.
On SPARCs (so far) address generation appears to use the regular ALU in the third stage of the pipeline and then begin the actual cache access in the fourth stage. For a load, you then need an extra stage to get the data back.

Lack of split instruction and data caches. Because both SPARCs have a single cache rather than the separate instruction and data caches of the R2000, the extra pipeline stage needed to get the data back for a load can't be used to fetch an instruction anyway. For a store the relevant cache line is read on the fourth cycle and updated and written back on the fifth cycle. So there are two cycles that can't be used to fetch instructions, bringing the total cost of a store to three cycles.

Inability to cycle the address and data bus twice per main clock cycle. The SPARC chips aren't double cycling the address and data bus, so both loads and stores mean that you can't fetch instructions. The R2000 also has a single address bus and a single data bus but it can use them twice per cycle. This means you can then split your cache into an instruction cache and a data cache and make use of the extra bandwidth by fetching an instruction every cycle in spite of loads and stores. However, if register windows eliminated enough loads and stores, these two SPARC implementations might represent reasonable engineering design decisions. Both benchmarks and careful studies of code sequences indicate that the load and store savings are not that great, generally less than five percent. We can also ask if the overhead of register windows leaves enough time in the second stage of the pipe to do an address add, assuming we could fit such an adder into the implementation. (Windowed registers take up a lot of space.)

Branches are slow. Since taken branches need only one delay slot there must be an address adder for the program counter. But with a single cache you have to decide early what the next instruction address is.
Both SPARC chips always decide that a branch will be taken, so there is an additional cycle penalty when the condition isn't satisfied and you have to junk the instruction you fetched and fetch the right one. On the R2000, the instruction address comes out in the second half of the cycle on the double-cycled address bus, so you have time to check the condition in the first half of the cycle and put out the right target address every time. The separate instruction and data caches only run at single cycle rates but they run a half cycle out of phase with each other, so it all works out. (Pretty slick, don't you think?) The first delay slot can be used by a useful instruction a majority of the time on both architectures, so they are even there.

However, the SPARC architecture requires that conditional branches be based on a value in a condition code register rather than the value in a regular register, as in the MIPS architecture. Honest people can (and do) disagree about which approach is better. But the compiler studies I have seen indicate that, on the average, you need an extra instruction for setting the condition code a noticeable fraction of the time. So my guesstimate is that the average conditional branch on a SPARC is 2.5 cycles and on an R2000 is 1.5 cycles. (Further study is needed here.)

Floating point is very slow. Here we only know about the Fujitsu version of the architecture. The Cypress version is likely to be better since the Weitek parts that the Fujitsu version uses are rather old designs (WTL 1164 and WTL 1165). Weitek's more recent designs are faster, so we presume the Cypress version will be better, too. Nevertheless, here are the numbers (from the data sheets). I use cycle counts just to keep it simple.

                 Fujitsu SPARC      MIPS R2000
                  SP      DP         SP     DP
  add/subtract     9      11          2      2
  multiply         9      12          4      5
  divide          34      65         12     19

These are the total latency times from start to finish for both systems.
Both systems can execute other integer operations in parallel with floating point operations after the floating point operations are launched. However, the launch cost on SPARC is two cycles while it is one cycle on the R2000. The launch time is included in the above table. Both systems appear able to do simultaneous multiplies and adds with no pipelining.

If we summarize these cycles per instruction by looking at a conservative estimate of instruction frequencies, we get the following results, first for integer programs and then for single precision floating point programs.

  Integer programs:
                  SPARC    MIPS    frequency
                  cycles   cycles  (percent)
    loads           2        1        20
    stores          3        1        10
    branches        2.5      1.5      15
    most other      1        1        55
    rare other     >1       >1        ~0
    average         1.63     1.08    ratio = 1.51

  Single precision floating point programs:
                  SPARC    MIPS    frequency
                  cycles   cycles  (percent)
    loads           2        1        20
    stores          3        1        10
    branches        2.5      1.5      15
    most other      1        1        45
    sp fp other     9        2        10
    average         2.43     1.18    ratio = 2.06

These ratios are also consistent with the benchmark results in the Performance Brief. Since MIPS and Sun seem to be producing these systems with similar technologies at similar clock rates at similar times in history, these differences in the cycle counts for our most favorite and popular instructions seem to go a long way toward explaining why SPARC is so slow.

Forest Baskett
Silicon Graphics Computer Systems
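[Editor's note: Forest's weighted averages are easy to check mechanically. The following sketch is mine, not part of the original post; it just plugs his frequency and cycle-count tables into a weighted average.]

```python
# Reproducing the weighted-average cycle counts from the tables above.
# Each entry is a (cycles, frequency) pair taken directly from the post.

def average_cpi(mix):
    """Weighted average cycles per instruction for an instruction mix."""
    return sum(cycles * freq for cycles, freq in mix)

integer_sparc = [(2, .20), (3, .10), (2.5, .15), (1, .55)]
integer_mips  = [(1, .20), (1, .10), (1.5, .15), (1, .55)]
fp_sparc = [(2, .20), (3, .10), (2.5, .15), (1, .45), (9, .10)]
fp_mips  = [(1, .20), (1, .10), (1.5, .15), (1, .45), (2, .10)]

print(average_cpi(integer_sparc))  # ~1.63, as in the table
print(average_cpi(integer_mips))   # ~1.08
print(average_cpi(integer_sparc) / average_cpi(integer_mips))  # ~1.51
print(average_cpi(fp_sparc) / average_cpi(fp_mips))            # ~2.06
```

The rounded results match the averages and ratios Forest quotes.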
bcase@apple.UUCP (Brian Case) (12/11/87)
In article <8809@sgi.SGI.COM> baskett@baskett writes:

[A lot of well-considered stuff about why the current and soon-to-be SPARC machines are/will be "so slow."]

Forest, I agree completely with your reasoning on most points: slow loads and stores, slow branches, and (intertwined with the previous) only one bus cycled only once per cycle.

>necessarily mean that the SPARC architecture has problems but I'd be
>reluctant to accept SPARC as the basis for an Application Binary
>Interface standard until I saw some evidence that high performance
>implementations of SPARC are possible.

I also agree that a standardization of this kind is not the right idea. But I believe it is possible to have a high-performance implementation of the SPARC. By high performance, I mean close enough to others in its class so as to make the difference not worth too much worry. Without large, on-chip caches, processors in the class about which we are speaking need chip-boundary bandwidth commensurate with on-chip data and instruction consumption rates. The lack of such bandwidth is, in my opinion, the main failing of the SPARC implementation. Notice that the Cypress version will be no better, if not worse (the floating-point bus is gone!). With regards to:

>The R2000
>also has a single address bus and a single data bus but it can use them
>twice per cycle. This means you can then split your cache into an
>instruction cache and a data cache and make use of the extra bandwidth
>by fetching an instruction every cycle in spite of loads and stores.
...
>The separate instruction and data cache only run
>at single cycle rates but they run a half cycle out of phase with each
>other so it all works out. (Pretty slick, don't you think?)

Yes, I do think it is pretty slick, but I also think this is a liability at clock speeds higher than 16 MHz (and maybe even at 16 MHz). I am sure, though, that MIPS has a plan to fix this problem. It sure seems like the way to go at 8 MHz. Preventing bus crashes (i.e.
meeting real-world timing constraints) can be a problem. And:

>Since MIPS and Sun seem to be producing these systems with similar
>technologies at similar clock rates at similar times in history, these
>differences in the cycle counts for our most favorite and popular
>instructions seem to go a long way toward explaining why SPARC is so
>slow.
>Forest Baskett

Thanks again for the analysis. However, I have one last point of contention. SUN is not MIPS in many respects, not the least of which is dedication to working with fabs and process technologies. SUN's business seems to be standards. In light of their constraints, I applaud their success in squeezing so much onto a lowly gate array. I am sure one of their chief concerns was future ECL implementation. Sure, the SPARC processor core (the stuff that actually does the work, minus register file) is virtually the same as anyone else's in function and in size (at least I think this is true), and with that in mind, the MIPS R2000, the Am29000, or whatever are all equally scalable (the other components on chips are largely implementations of integrated system, not architectural, functions; the Branch Target Cache of the 29000 is NOT an architectural feature). But by choosing register windows (which lets them vary the number of registers, in window increments, for a given implementation) and a very simple definition otherwise, SUN simply did the best they could to make future implementation easy.

However, I am a little dismayed (but happy for SUN) at the incredible backing SPARC is getting in the world of huge, influential conglomerates. I think the standardization of UNIX is good, but the standardization of processors is BAD. We should have a way to achieve processor independence without necessarily transporting source code (and in fact, I have an idea for this, but can't share it). We must not bet our future on a given processor! Comments?
pf@diab.UUCP (Per Fogelstrom) (12/11/87)
Well, history repeats itself once again. A new RISC chip is launched and people's expectations reach new "high scores". A few years ago there was another RISC chip set brought to the market, called the Clipper. This processor's performance was claimed to sweep all competitors off the scene, and it was often compared to the DEC 8x00 computers. For this chip set the picture has cleared now. The performance range is not much more than can be achieved with a 16-20 MHz 68020. The most I have seen of the 33 MHz versions is one running at room temperature. Intergraph is one of the companies still using the Clipper (they recently bought the rights to the chip set from NS/Fairchild). From what I recall they threw out the NS32032 for the Clipper. Well, they could have had 2-3 times the Clipper performance with the NS32532 today. And they called the buy a bargain!

It's not surprising that the MIPS R2000 gives the most power per MHz; the architecture has evolved over many years, without hard pressure from marketing such as 'We must have it NOW!!!' (John Mashey maybe has another opinion; this is only my guess.)

SO: Why is everybody so surprised????!
davidsen@steinmetz.steinmetz.UUCP (William E. Davidsen Jr) (12/11/87)
In article <6964@apple.UUCP> bcase@apple.UUCP (Brian Case) writes:

| conglomerates. I think the standardization of UNIX is good, but the
| standardization of processors is BAD. We should have a way to achieve
| processor independence without necessarily transporting source code (and
| in fact, I have an idea for this, but can't share it). We must not bet our
| future on a given processor! Comments?

The concept of portable object code is not new... the "UCSD Pascal P-System" allowed compilers to generate pseudo code from a number of languages, and port the Pcode. Later some Pcode compilers were developed to give the speed of compiled code without passing source code around. Then there was a peephole optimizer for Pcode, and, as I recall, there was a compatible Ada compiler. I used it, but the name of the vendor has escaped me, hopefully forever. Hope your idea for portable code can do better.
-- 
bill davidsen  (wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me
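[Editor's note: the P-System idea bill describes can be sketched in a few lines. This toy interpreter is entirely my own illustration, not actual UCSD P-code (whose opcodes and formats were far richer); the point is only that one compiler can emit a simple stack-machine code once, and any host that implements the little interpreter can run it without seeing the source.]

```python
# A toy "portable pseudo-code" interpreter: the compiler emits a list of
# stack-machine operations, and each target machine needs only this
# interpreter (or a pcode-to-native translator) to run the program.

def run(pcode):
    stack = []
    for op, *args in pcode:
        if op == "push":
            stack.append(args[0])          # push an immediate operand
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        else:
            raise ValueError("unknown opcode: " + op)
    return stack[-1]

# (2 + 3) * 4, expressed once, runnable on any host with the interpreter:
print(run([("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]))  # 20
```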
baskett@baskett (12/12/87)
In article <6964@apple.UUCP>, bcase@apple.UUCP (Brian Case) writes:
> ...
> >The separate instruction and data cache only run
> >at single cycle rates but they run a half cycle out of phase with each
> >other so it all works out. (Pretty slick, don't you think?)
>
> Yes, I do think it is pretty slick, but I also think this is a liability
> at clock speeds higher than 16 Mhz (and maybe even at 16MHz). I am sure,
> though, that MIPS has a plan to fix this problem. It sure seems like the
> way to go at 8 Mhz. Preventing bus crashes (i.e. meeting real-world
> timing constraints) can be problem.

The 16 MHz MIPS parts we have work fine. If it becomes a problem, the fix is pretty obvious, too.

> I am sure one of their chief concerns was future ECL implementation.

I have an ECL implementation of an experimental Risc processor (board) in my office. My experience with the team that designed and built it (a great group of people at DEC's Western Research Lab, by the way) tells me that the MIPS architecture is more suitable for ECL implementation than the SPARC architecture. (see next comment)

> by choosing register windows (which lets them vary the number of registers,
> in window increments, for a given implementation) and a very simple
> definition otherwise, SUN simply did the best they could to make future
> implementation easy.

It may have been the best they could do but it looks like a mistake to me. In higher performance technologies the speed of register access becomes more and more critical, so about the only thing you can do with register windows is to scale them down. And as the number of windows goes down, the small gain that you might have had goes away and procedure call overhead goes up. Attacking the procedure call overhead problem at compile time rather than at run time is a more scalable approach.

Forest Baskett
Silicon Graphics Computer Systems
dennisr@ncr-sd.SanDiego.NCR.COM (Dennis Russell) (12/12/87)
In article <8809@sgi.SGI.COM> baskett@baskett writes:
>
>I have been asking myself the question, why is SPARC so slow?
>.......
>Loads and stores are slow. Loads on both implementations take two
>cycles and stores take 3 cycles for 32-bit words compared to one cycle
>for each on a MIPS R2000. There are several interrelated reasons for
>this situation. Briefly, they are lack of a separate address adder,
>lack of split instruction and data caches, and inability to cycle the
>address and data bus twice per main clock cycle. Details follow.
>
>Lack of a separate address adder for loads and stores. The R2000 can
>start the address generation for a load or a store in the second stage
>of the pipeline because the register access is fast and an address adder
>is present. Thus the load or store can "execute" in stage 3 of the
>pipeline, just like the rest of the instructions. On SPARCs (so far)
>address generation appears to use the regular ALU in the third stage of
>the pipeline and then begin the actual cache access in the fourth stage.
>For a load, you then need an extra stage to get the data back.
>

The block diagram in the data sheet of the Fujitsu SPARC shows an Address Generation Unit that is separate from the Arithmetic and Logic Unit. Both branch target addresses and load/store addresses are calculated in the AGU. Further on in the data sheet the four stage pipeline is described: Fetch, Decode, Execute, and Write. It is stated explicitly that "Memory addresses are evaluated for loads, stores, and control transfers" in the Decode stage. It can be concluded that the Fujitsu SPARC does indeed have a separate address adder and that load/store addresses are generated in the second stage (Decode) of the pipeline.

The R2000 has a five stage pipeline: Fetch, Decode, Execute, Memory Access, Write Back. Memory address generation occurs in the third stage (Execute) and the load/store "executes" in the fourth stage (Memory Access).
The reason for the 2 cycle load in the Fujitsu SPARC is the multiplexing of the external address and data busses between instructions and memory data. A SPARC load requires 1 cycle of the external busses, so instruction fetching stalls for this 1 cycle.

>Lack of split instruction and data caches. Because both SPARCs have a
>single cache rather than the separate instruction and data caches of
>the R2000, the extra pipeline stage needed to get the data back for a
>load can't be used to fetch an instruction anyway. For a store the
>relevant cache line is read on the fourth cycle and updated and written
>back on the fifth cycle. So there are two cycles that can't be used
>to fetch instructions, bringing the total cost of a store to three cycles.
>

SPARC supports base register plus index register memory addressing. During the first half of the Decode stage the base and index registers are accessed. During the second half they are added together to form the virtual memory address. Since the register file in the Fujitsu SPARC has only 2 ports, store data cannot be accessed from the register file until the third stage (Execute). Thus, on a store the address goes out during the third stage (Execute) and the data during the fourth stage (Write). Since stores use the external busses for two consecutive cycles, during which time fetching of instructions is suspended, the execution time for stores is 3 cycles.

>Inability to cycle the address and data bus twice per main clock cycle.
>The SPARC chips aren't double cycling the address and data bus so that
>both loads and stores mean that you can't fetch instructions. The R2000
>also has a single address bus and a single data bus but it can use them
>twice per cycle. This means you can then split your cache into an
>instruction cache and a data cache and make use of the extra bandwidth
>by fetching an instruction every cycle in spite of loads and stores.
>

This is indeed true.
The price the R2000 pays for this is a complex clocking scheme whereby a 4-phase input clock at double frequency is required in order to control the double cycle external busses. Since at 16.7 MHz the R2000's I/O interface runs at 33.3 MHz, it remains to be seen whether the H/W architecture of the R2000 is scalable - can it be carried to 25-30 MHz, where the bus must run at 50-60 MHz?

>Branches are slow. Since taken branches need only one delay slot
>there must be an address adder for the program counter. But with a
>single cache you have to decide early what the next instruction address
>is. Both SPARC chips always decide that a branch will be taken so there
>is an additional cycle penalty when the condition isn't satisfied and you
>have to junk the instruction you fetched and fetch the right one. On
>

I think there might be some confusion here on the operation of the Annul Bit during conditional branches. It is my understanding that when this bit is 0 the delay instruction (the instruction following the branch) is executed whether the branch is taken or not. When this bit is 1 the delay instruction is executed only if the branch is taken - if the branch is not taken then the delay instruction, which is already in the pipeline, is aborted.

Therefore, with the Annul Bit equal to 0, branches execute in 1 cycle whether the branch is taken or not. With the Annul Bit at 1, a taken branch executes in 1 cycle while an untaken branch takes 2 cycles - 1 cycle for the branch and 1 cycle for the aborted delay instruction.

The advantage of the Annul Bit is in conditional branches that terminate loops. With the Annul Bit at 1, a loop instruction can be placed in the delay slot. This instruction is executed when the loop is executed and is not executed when you fall through the loop.
-- 
Dennis Russell                          | NCR Corp., M/S 4720
phone: 619-485-3214                     | 16550 W. Bernardo Dr.
UUCP: ...{ihnp4|pyramid}!ncr-sd!dennisr | San Diego, CA 92128
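[Editor's note: the annul-bit rules Dennis describes reduce to a small table. This is my own encoding of them, with cycle charges as he states them, not anything from the data sheet.]

```python
# Cycle cost of a SPARC conditional branch plus its delay slot,
# per the annul-bit rules described above.

def branch_cycles(annul_bit, taken):
    if annul_bit == 0:
        # Delay instruction always executes, so it does useful work:
        # the branch effectively costs 1 cycle either way.
        return 1
    # annul_bit == 1: the delay instruction executes only if the branch
    # is taken; an untaken branch pays 1 extra cycle for the aborted
    # delay instruction already in the pipeline.
    return 1 if taken else 2

for annul in (0, 1):
    for taken in (True, False):
        print("annul =", annul, "taken =", taken,
              "-> cycles =", branch_cycles(annul, taken))
```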
elh@mips.UUCP (Ed Hudson) (12/13/87)
In article <1941@ncr-sd.SanDiego.NCR.COM> dennisr@ncr-sd.SanDiego.NCR.COM writes:
>In article <8809@sgi.SGI.COM> baskett@baskett writes:
>>both loads and stores mean that you can't fetch instructions. The R2000
>>also has a single address bus and a single data bus but it can use them
>>twice per cycle. This means you can then split your cache into an
>>instruction cache and a data cache and make use of the extra bandwidth
>>by fetching an instruction every cycle in spite of loads and stores.
>>
<dennisr@ncr-sd.SanDiego.NCR.COM writes:>
>This is indeed true. The price the R2000 pays for this is a complex
>clocking scheme whereby a 4 phase input clock at double frequency is
>required in order to control the double cycle external busses.

the cost of the 'complex' interface is a few dollars for a tapped delay line. pretty cheap for a scheme that controls the processor's io timings well enough to double the available pin bandwidth with moderately cheap, fast srams. further, in chip and package design, it's the io transitions themselves that are expensive; the rate of the transitions is secondary.

>
>Since at 16.7 MHz the R2000's I/O interface runs at 33.3MHz it remains to
>be seen whether the H/W architecture of the R2000 is scaleable - can it be
>carried to 25-30MHz where the bus must run at 50-60MHz ?
>

20ns ttl-io srams today support transaction rates of 50mhz, and 15ns rams hit 67mhz. although a full cpu subsystem is a little more demanding, this is indicative of what is technologically possible. the current r2000 interface was the result of careful optimization between the cpu subsystem (ie, 16-64k cache rams, AS ttl), 1985 cpu process and packaging technology. i expect that future mips implementations will also be similar optimizations of the then available technologies.

-Ed Hudson
DISCLAIMER: I speak only for myself.
elh@mips.com or {ames,decwrl,prls}!mips!elh
MIPS Computer Systems, 930 E.
Arques, Sunnyvale, CA 94086 (408) 720-1700
jesup@pawl22.pawl.rpi.edu (Randell E. Jesup) (12/13/87)
In article <1111@mips.UUCP> elh@mips.UUCP (Ed Hudson) writes:
><dennisr@ncr-sd.SanDiego.NCR.COM writes:>
>>This is indeed true. The price the R2000 pays for this is a complex
>>clocking scheme whereby a 4 phase input clock at double frequency is
>>required in order to control the double cycle external busses.
>	the cost of the 'complex' interface is a few dollars for
>	a tapped delay line. pretty cheap for a scheme that allows
>	sufficient control of the processors' io timings well enough
>	to double the available pin bandwidth with moderately cheap,
>	fast srams. further, in chip and package design, it's the io
>	transistions themselves that are expensive, the rate of the
>	transitions is secondary.
>>Since at 16.7 MHz the R2000's I/O interface runs at 33.3MHz it remains to
>>be seen whether the H/W architecture of the R2000 is scaleable - can it be
>>carried to 25-30MHz where the bus must run at 50-60MHz ?
>	20ns ttl-io srams today support transaction rates of 50mhz, and 15ns
>	rams hit 67mhz. although a full cpu subsytem is a little more
>	demanding, this is indicative of what is technologically possible.
>	the current r2000 interface was the result of carefull optimization
>	between the cpu subsystem (ie, 16-64k cache rams, AS ttl), 1985 cpu
>	process and packaging technology. i expect that future mips implement-
>	ations will also be similar optimizitions of the then available
>	technologies.

The real problem is the fact that your chip edge is clocked at twice your instruction frequency. Running a higher-speed clock than the instruction rate is fine, and makes internal design much easier. However, packaging technology will be your limiting factor for some time to come, not really ram speed per se. For the large number of pins required, it is hard to find packages certified at that speed. Given current technology, the r2000 could probably be scaled to about 20 MHz.
However, custom RISC designs in CMOS are now reaching 40 MHz, which would be impossible with the double-clocked interface currently on the r2000. Perhaps the interface could be removed, given enough pins, but that gets you back into the packaging limits. One of the prime considerations in state-of-the-art RISC design today HAS to be chip-edge bandwidth: how to improve it and how to minimize its usage (conserve it). Even going to off-chip cache is expensive at these speeds. I suspect you'll be seeing interesting attempts in this area soon.

     //	Randell Jesup                     Lunge Software Development
    //	Dedicated Amiga Programmer        13 Frear Ave, Troy, NY 12180
 \\//	lunge!jesup@beowulf.UUCP          (518) 272-2942
  \/	(uunet!steinmetz!beowulf!lunge!jesup)
mash@mips.UUCP (John Mashey) (12/14/87)
In article <1941@ncr-sd.SanDiego.NCR.COM> dennisr@ncr-sd.SanDiego.NCR.COM (Dennis Russell) writes:
>In article <8809@sgi.SGI.COM> baskett@baskett writes:
......
>>Branches are slow. Since taken branches need only one delay slot
>>there must be an address adder for the program counter. But with a
>>single cache you have to decide early what the next instruction address
>>is. Both SPARC chips always decide that a branch will be taken so there
>>is an additional cycle penalty when the condition isn't satisfied and you
>>have to junk the instruction you fetched and fetch the right one. On
>>
>I think there might be some confusion here on the operation of the Annul
>Bit during conditional branches. It is my understanding that when this bit
>is 0 then the delay instruction (the instruction following the branch) is
>executed whether the branch is taken or not. When this bit is 1 then the
>delay instruction is executed only if the branch is taken - if the branch
>is not taken then the delay instruction which is already in the pipeline is
>aborted.
>
>Therefore, with the Annul Bit equal to 0 branches execute in 1 cycle
>whether the branch is taken or not. With the Annul Bit at 1 a taken branch
>executes in 1 cycle while an untaken branch takes 2 cycles - 1 cycle for the
>branch and 1 cycle for the aborted delay instruction.

Forest and Dennis are talking about different things. See the Fujitsu SPARC data sheet, and Namjoo & Agrawal, "Preserve high speed in CPU-to-cache transfers", Electronic Design, August 20, 1987, 91-96. These are consistent in saying:

Fujitsu: "In performing delayed control transfer, the MB86900 processor always fetches the next instruction following a control transfer.
Then the processor either executes this instruction or annuls it....This enables the pipeline to advance while the control target instruction is being fetched...By assuming a conditional branch to be taken, the processor minimizes pipeline interlock by providing one cycle execution for taken branches, or two cycle execution for untaken branches."

Namjoo, Agrawal: "In this pipeline, the fetch address for instruction n is generated during the decoding stage of instruction n-2. Since all branch instructions are delayed by one cycle, all relative branch instructions take one cycle if the branch condition is true because the target instruction is fetched before the condition codes are ready. If, after condition codes are evaluated, it was determined that the branch was not taken, the processor ignores the target instruction and continues to fetch the next instruction in the sequence."

Thus, given instructions:
	1: conditional branch
	2: branch delay slot
	3: after branch delay slot
	N: target of branch

	Taken branch:   1, 2*, N        (*  = might be annulled)
	Untaken branch: 1, 2*, N**, 3   (** = ignored)

The implication is that the CPU doesn't quite know the condition code result in time, and thus has to guess. I can't tell from the Cypress data sheet whether or not they do the same thing. [Does anybody know who can say?] Given that one has decided to take some hit, this is probably the right way, in that taken conditional branches are on the order of 15% of instructions and untaken ones are on the order of 5% (on our machines), although this does vary: 1/3 of the programs we looked at had more untaken than taken branches. [I think Earl Killian posted this data a while back.]
Thus, the SPARC branch design has (in terms of + = good, - = bad):

	+ annul bit
	+ ability to set condition codes on ALU ops
	- extra cycle for untaken conditional branch
	- condition-code based branches, i.e., often requiring a compare for
	  eq, neq, etc. that could actually be done as 1-cycle cmp-branches

Also, in looking at SPARC assembly code, one notes that cmp's are usually moved away from the conditional branches, so perhaps these CPUs, or later ones, will take advantage of cases where the condition code setting is early enough to avoid the extra I-fetch.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
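[Editor's note: John's frequencies make the guess-taken trade-off easy to put numbers on. This back-of-envelope arithmetic is mine, using his ~15% taken / ~5% untaken figures and the 1-or-2-cycle costs stated in his post.]

```python
# Expected cycles per conditional branch under each static guess,
# assuming taken branches are ~15% of instructions and untaken ~5%.

taken, untaken = 0.15, 0.05
total = taken + untaken

# Guess "taken" (what the SPARC chips do): taken = 1 cycle, untaken = 2.
guess_taken = (taken * 1 + untaken * 2) / total

# Guess "untaken" instead: taken = 2 cycles, untaken = 1.
guess_untaken = (taken * 2 + untaken * 1) / total

print(guess_taken)    # ~1.25 cycles per conditional branch
print(guess_untaken)  # ~1.75
```

With that instruction mix, guessing taken is the cheaper policy on average, which supports John's "probably the right way" conclusion (though, as he notes, a third of the programs measured lean the other way).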
bcase@apple.UUCP (Brian Case) (12/15/87)
In article <8885@sgi.SGI.COM> baskett@baskett writes:
>In article <6964@apple.UUCP>, bcase@apple.UUCP (Brian Case) writes:
>> ...
>> >The separate instruction and data cache only run
>> >at single cycle rates but they run a half cycle out of phase with each
>> >other so it all works out. (Pretty slick, don't you think?)
>>
>> Yes, I do think it is pretty slick, but I also think this is a liability
>> at clock speeds higher than 16 Mhz (and maybe even at 16MHz). I am sure,
>> though, that MIPS has a plan to fix this problem. It sure seems like the
>> way to go at 8 Mhz. Preventing bus crashes (i.e. meeting real-world
>> timing constraints) can be problem.
>
>The 16 MHz MIPS parts we have work fine. If it becomes a problem, the fix
>is pretty obvious, too.

Oh, I am sure they work great. I didn't mean that they would be flaky or intermittent or something, just that the system design is trickier.

>> I am sure one of their chief concerns was future ECL implementation.
>I have an ECL implementation of an experimental Risc processor (board)

[Yes, that's a good machine! I hear it is the "DEC Dorado."]

>in my office. My experience with the team that designed and built it
>(a great group of people at DEC's Western Research Lab, by the way)
>tells me that the MIPS architecture is more suitable for ECL implementation
>than the SPARC architecture. (see next comment)
>
>> by choosing register windows (which lets them vary the number of registers,
>> in window increments, for a given implementation) and a very simple
>> definition otherwise, SUN simply did the best they could to make future
>> implementation easy.
>
>It may have been the best they could do but it looks like a mistake to me.

Well, notice that it was *I* who said that they were doing "the best they could." Please don't take my word as the official SUN position! Seldom does anyone really do "the best they could." One man's mistake is another man's stroke of genius.
>In higher performance technologies the speed of register access becomes
>more and more critical so about the only thing you can do with register
>windows is to scale them down.

Yes, in the first ECL single-chip implementation. Then, as the technology gets denser, you can scale them back up to the desired level. I was not talking about discrete ECL implementation; I should have made that clear. You may think that even single-chip ECL implementations suffer with large register files, but I don't believe so (but I'm still youngish and naive).

>And as the number of windows goes down,
>the small gain that you might have had goes away and procedure call
>overhead goes up. Attacking the procedure call overhead problem at
>compile time rather than at run time is a more scalable approach.

Well, I understand what you are saying: "the available density of the technology is irrelevant, to a degree, with a smallish [my opinion], fixed-size register file." On the other hand, *by definition,* the SUN approach is more scalable, since there is at least some opportunity for scaling; a fixed-size register file cannot, by definition, be scaled. (Or have I missed something? Sorry if so.)

1) Notice that if SUN decides to dump the overlapping register window approach, they can! They can treat one procedure context as the only context available and use a procedure calling mechanism like MIPS'. Compatibility can be maintained by having the old instructions trap and do the right thing. This will allow them to implement a register file the same size as the MIPS register file. Presumably, we'll be at such processing speeds then that old binaries, which use the old procedure calling mechanism, will run fast enough, even with the trap overhead. (The idea here makes sense, but I'm not sure I'm communicating it well.)

2) Didn't David Wall do research on register allocation at link time that showed that lots of registers are better?
Admittedly, his approach needed a large pool of registers, like in the Am29000, not the overlapping register windows of the SPARC (couldn't resist! :-). Do you now think that the MIPS 32-entry file is as good as the 64-entry file on the experimental machine to which you refer? I'm genuinely curious here, not asking a rhetorical question. I was under the impression that register allocation at link time was sorta "the wave of the future" (I hate that expression); if so, wouldn't 32 be too small?

3) You have to remember that it will be necessary to have at least some TLB-type or other cache-type function finish in one machine cycle. True, the array technology used for TLBs can be denser, and therefore a little faster, than multi-ported register file array technology. However, if you can get your TLB array access and compare done in one cycle, why do you think that you can't get your register-file-array access and address compute (be it an add, or whatever) done in one cycle? What was the cycle-limiting factor in the experimental machine that you have in your office? Thanks in advance.
mash@mips.UUCP (John Mashey) (12/16/87)
In article <344@ma.diab.UUCP> pf@ma.UUCP (Per Fogelstrom) writes:
....
>It's not surprising that the MIPS 2000 gives most power/MHz. The architecture
>has evolved during many years, without hard pressure from the marketing such
>as 'We must have it NOW!!!'. (John Mashey maybe has another opinion, only my
>guess)

Be serious! In a startup, the marketing pressure is for "we must have it yesterday!" As is well documented in papers and presentations, the architecture owes a lot to earlier work, like the IBM 801 and Stanford MIPS, but almost all of the basic architecture work for the existing R2000/R2010 was done in about 6-8 months, starting in Nov-Dec 1984.
-- 
-john mashey
DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
hansen@mips.UUCP (Craig Hansen) (12/16/87)
In article <140@imagine.PAWL.RPI.EDU>, jesup@pawl22.pawl.rpi.edu (Randell E. Jesup) writes:
> The real problem is the fact that your chip edge is clocked at twice your
> instruction frequency. Running a higher-speed clock than the instruction rate
> is fine, and makes internal design much easier. However, packaging
> technology will be your limiting factor for some time to come, not really
> ram speed per se. For the large number of pins required, it is hard to find
> packages certified at that speed.

For speeds well above 40 MHz in CMOS technology, our studies suggest that this will not be a limiting factor at all, and that multiplexed busses can work as fast as non-multiplexed busses at least up to that clock rate. The real enemy of high clock rates is clock skew, and controlling that skew is the reason why we use a tapped-delay-line clocking system and phase-locked-loop technology. The tapped delay lines also allow the timing of the chip to be adjusted to accommodate variations in the timing specifications of SRAM chips; the phase-locked-loop technology allows the CPU and FPU to have matched timings, no matter how the CMOS processing causes circuit speed variations.

In our earlier days, some other company that shall remain nameless (but recently had their "ship" repossessed) made some wild claims (that we heard repeatedly but always "second"-hand) that the MIPS chip was no faster (when used at half speed) than theirs (when used at full speed), but was actually clocked twice as fast as we said it was. After all, we have double-frequency clock inputs. Well, OK. But the other company's chip had double-frequency clock inputs, too, and when you compared the two chips at their specified clock rates, ours ran more than twice as fast (benchmark-wise), and theirs had double the input clock rate. Talk about double-speak!

The multiplexed buses, far from being a limiting factor, are an important reason why the MIPS chip is "fast" and the SPARC chip is "slow."
By putting all the important cache interface logic on the CPU chip, the critical paths in the cache are entirely set by the speed of static RAM chips, without interference from tag comparison logic and parity generation and checking logic. Because SRAMs are used as technology drivers for new CMOS and BiCMOS technologies, MIPS can be assured of a good supply of highly aggressive SRAMs that will work with the MIPS part. The real problem with the other RISC designs is that they require specialized cache RAMs or external tag comparators. SRAM vendors can do good business selling RAMs with internal tag comparison logic that try to cover up the faults of the RISC processor designers, yet are based on the previous generation of technology. To be blunt, specialized cache RAMs are a winner for the RAM vendor (who gets to sell a proprietary part on old technology for a high price) but a loser for the RAM vendee (who pays too much per bit for slower RAM). ...and the specialized cache chips from the RISC vendors have been even slower, smaller, and more expensive per bit than standard SRAMs.
-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...{ames,decwrl,prls}!mips!hansen or hansen@mips.com
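[Editor's note: Hansen's multiplexed-bus point connects back to Baskett's earlier double-cycling observation: a split-cache design needs roughly one instruction fetch every cycle plus a data transfer on load/store cycles. A back-of-the-envelope check, using an assumed load/store frequency chosen for illustration, not a measured figure:]

```python
# Rough bus-bandwidth arithmetic for the double-cycled bus argument.
# 0.30 data references per instruction is an assumed combined load+store
# frequency; real instruction mixes vary by workload.

IFETCH_PER_INSTR = 1.0        # one instruction fetch every cycle
DATA_REFS_PER_INSTR = 0.30    # assumed load + store frequency

demand = IFETCH_PER_INSTR + DATA_REFS_PER_INSTR  # transfers per instruction

# A bus cycled once per clock offers only 1 transfer slot per cycle, so
# instruction fetch must stall on load/store cycles; a bus cycled twice
# per clock (as on the R2000) offers 2 slots and absorbs the data traffic.
assert demand > 1.0
assert demand <= 2.0
```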
lindsay@K.GP.CS.CMU.EDU (Donald Lindsay) (12/16/87)
In article <6993@apple.UUCP> bcase@apple.UUCP (Brian Case) writes:
>On the other hand, *by definition,* the SUN
>approach is more scalable since there is at least some opportunity for
>scaling; a fixed-size register file cannot, by definition, be scaled.

There is an optimum size for a windowed register set. The optimum may be hard to locate precisely, but it has to exist. For example, there are programs which don't overflow the window set of current chips. Building chips with more registers cannot speed those programs up.

A more interesting question is: just what makes something scalable? Sun's answer seems mostly to be "complexity" - they tried to minimize the chip design time so that implementation can track the implementation technology. But, of course, eventually someone will build a complex implementation, because they had all those gates left over ... and so it goes.

Hewlett-Packard just cancelled the project that was building a Spectrum out of ECL. Reportedly they were $20M in. Does anyone know why? Does this mean anything for the ECL SPARC, or for scalability?
-- 
Don lindsay@k.gp.cs.cmu.edu CMU Computer Science
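[Editor's note: Lindsay's observation, that a program whose call depth never overflows the window set gains nothing from more windows, can be sketched with a toy model. This is purely illustrative; the window count, the single window reserved for traps, and the spill policy are assumptions, not any vendor's implementation.]

```python
# Toy model of a windowed register file: "save" claims a window on
# procedure entry, "restore" releases it on exit; running out of windows
# "traps" to software, which would spill/refill registers to memory
# (elided here, only counted). All parameters are assumptions.

class WindowFile:
    def __init__(self, nwindows=7):  # window count is implementation-dependent
        self.nwindows = nwindows
        self.depth = 0               # windows currently in use
        self.spills = 0              # overflow traps (would write to memory)
        self.fills = 0               # underflow traps (would read from memory)

    def save(self):                  # procedure entry
        if self.depth == self.nwindows - 1:  # one window reserved for traps
            self.spills += 1         # handler would spill the oldest window
        else:
            self.depth += 1

    def restore(self):               # procedure exit
        if self.depth == 0:
            self.fills += 1          # handler would refill from memory
        else:
            self.depth -= 1

# A call chain 4 deep fits entirely in the windows: no memory traffic.
rf = WindowFile(nwindows=7)
for _ in range(4):
    rf.save()
for _ in range(4):
    rf.restore()
assert rf.spills == 0 and rf.fills == 0

# A chain 10 deep overflows and traps; only this case benefits from
# building a chip with more windows, which is Lindsay's optimum-size point.
deep = WindowFile(nwindows=7)
for _ in range(10):
    deep.save()
assert deep.spills == 4
```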
ian@esl.UUCP (Ian Kaplan) (12/16/87)
In article <344@ma.diab.UUCP> pf@ma.UUCP (Per Fogelstrom) writes:
>Well, the history repeats once again. A new RISC chip is launched and people's
>expectations reach new "high scores". A few years ago there was another risc
>chip set brought to the market, called the Clipper.
[ deleted text ]
>For this chip set the picture has cleared now. The perfor-
>mance range is not much more than can be achieved with a 16-20 Mhz 68020.
[ deleted text ]
>Well they could have had 2-3 times the
>clipper performance with the NS32532 today. And they called the buy a bargain!

I think that the discussion on the SPARC vs. the MIPS R2000 centers around why the SPARC is not faster than it is - specifically, why it is not as fast as the MIPS processor. What seems to have been missed here, if I properly understand Mr. Fogelstrom's article, is that the SPARC is quite fast. The lab I work in has a Sun 4/280 and I can tell you that it smokes. It may be 20% slower than the MIPS processor, but it is by no means a failure. The SPARC is much faster than the Motorola 68020 and, I would bet, the National processors. How the MIPS and SPARC scale remains to be seen. You should remember that neither Sun nor MIPS will keep their hardware architecture static.

I have greatly enjoyed the discussion of SPARC vs. MIPS architecture. This sort of interchange makes comp.arch worth reading. A happy holiday season to you all,

Ian L. Kaplan
ESL, Advanced Technology Systems M/S 302
495 Java Dr. P.O. Box 3510
Sunnyvale, CA 94088-3510
decvax!decwrl!borealis!\ sdcsvax!seismo!- ames!esl!ian ucbcad!ucbvax!/ / ihnp4!lll-lcc!
garner@gaas.Sun.COM (Robert Garner) (12/16/87)
The expositions on comp.arch about SPARC and the gate array implementation are interesting. Some of the inaccuracies have been addressed but others remain unanswered. Mashey's recent article <1115@winchester.UUCP> did clear up the confusion surrounding the implementation of conditional branches that was incorrectly portrayed by Forest Baskett <8809@sgi.SGI.COM> and Dennis Russell <1941@ncr-sd.SanDiego.NCR.COM>. Brian Case has taken a fairly impartial look at the architecture in <6964@apple.UUCP> and <6993@apple.UUCP>. Baskett's message was refreshing in that he accurately differentiated between implementation and architecture. (Quite unlike previous criticisms, such as from the so-called "MIPS Performance Brief.") However, Baskett's article continues to incorrectly portray the integer performance of Sun-4/200 workstations and of SPARC in general.

Sun's data on MIPS performance implies that the Sun-4/200 has approximately the same INTEGER performance as the M/1000. This fact is frequently ignored, since the Sun-4/200 floating-point performance is generally (but not always) less than the M/1000's. Baskett correctly deduces that this is due to the use of the Weitek 1164/54 floating-point chips, which are slow compared to MIPS' custom FPU. The Fujitsu gate arrays plus the Weitek chips were a reasonable vehicle for a SYSTEMS company like Sun to prove, and quickly bring to market, an OPEN, RISC-based workstation/server plus a wide range of application SOFTWARE. Sun, unlike MIPS, is not organized around the task of designing and fine-tuning custom-designed ICs. It has even taken MIPS, whose lifeblood depends on a fast processor, more time than expected to deliver parts at speed (15-16 MHz). Now that SPARC is established, Sun is working closely with the semiconductor companies themselves. This work includes improved floating-point implementations.
Forest concluded his article by saying:
> Since MIPS and Sun seem to be producing these systems with similar
> technologies at similar clock rates at similar times in history, these
> differences in the cycle counts for our most favorite and popular
> instructions seem to go a long way toward explaining why SPARC is so slow.

This hand waving is too fast! A standard, off-the-shelf gate array is NOT in the same league as a custom CMOS design. Indeed, that a gate array has the same integer performance as a tuned, full-custom, "similar technology" implementation is an indication of the strength of the architecture!

Forest attempted to deduce the gate-array CPI value for integer and floating-point programs. From this analysis, he concluded:
> These ratios [based on CPIs] are also consistent with the benchmark
> results in the Performance Brief.

Yes, floating point suffers because of the Weitek chips. And yes, MIPS' "Performance Brief" attempts to stigmatize SPARC by dwelling on this: its benchmark suite and MIPS-rate calculations are conveniently based almost entirely on floating-point programs! But no, one cannot accurately judge different processors by comparing their implementation-dependent "cycles per instruction" (CPI) values. Performance also depends on the number of instructions (N) issued by a compiler. For example, MIPS's delayed load does not affect their CPI but increases their N when NOPs are required, whereas SPARC's interlocked load decreases N but counts against its CPI. SPARC's register windows and correspondingly fewer loads and stores also decrease its N relative to MIPS. By avoiding a more detailed analysis that includes N (via simulations), one ignores the state of the compilers and associated optimizations (via SPARC's annul bit, for instance). In general, there is always room for improvement in compiler-generated code. The Sun-4/200, for LARGE C, integer programs, runs at about 1.65 CPI.
This includes 15% loads and 5% stores AND the miss cost associated with the 128K-byte cache and the large, asynchronous main memory. (Baskett's calculation assumed MIPS' distribution, 20% loads and 10% stores, which is not applicable to SPARC. Since cache effects can dominate performance, I suspect that the M/1000, large-C-program CPI could be near 1.6 if its cache/memory is taken into account.)

As processor cycle time shrinks, the CPI for CPUs of all types increases because the miss cost rises. This is because main memory access times are not scaling as rapidly as processor cycle times. This negative effect on CPIs must be offset by improvements in CPU pipelines and is even more pronounced in low-CPI machines. SPARC implementations are balanced in a way that achieves shorter cycle times, does not cause an increase in CPI, and carefully considers chip-edge bandwidth issues. SPARC implementations include single-cycle loads and single-cycle untaken branches.

Of course, the most error-free measure of performance is wall clock time. Until there are more results from some large integer programs running both on the Sun-4 and on the M/1000, speculation can be unproductive.

Now, what about register windows? In Baskett's second article <8885@sgi.SGI.COM>, he writes:
> It may have been the best they could do but it looks like a mistake to me.
> In higher performance technologies the speed of register access becomes
> more and more critical so about the only thing you can do with register
> windows is to scale them down. And as the number of windows goes down,
> the small gain that you might have had goes away and procedure call
> overhead goes up. Attacking the procedure call overhead problem at
> compile time rather than at run time is a more scalable approach.

Two points: (1) It is hard to visualize the future difference between implementing 1K-bit vs. 4K-bit register files (i.e., 32 registers versus 128 registers). Memories can turn out larger and faster than intuition indicates.
(2) SPARC does NOT PRECLUDE interprocedural register allocation (IRA) optimizations and thus ALLOWS for "attacking the procedure call overhead problem at compile time rather than at run time." SPARC has two mechanisms to reduce load/store traffic: register windows and IRA! In SPARC, the procedure call and return instructions are different from the ones that increment and decrement the window pointer. (SPARC's "save" and "restore" instructions decrement and increment the window pointer. They also perform an "add", which usually adjusts the stack pointer. The pc-relative "call" and register-indirect "jump-and-link" do NOT affect the window pointer.)

A minimum SPARC implementation could have 40 registers: 8 ins, 8 locals, 8 outs, 8 globals, and 8 local registers for the trap handler. Such an implementation is not precluded by the architecture, but it would probably imply IRA-type optimizations. It would function as if there were no windows, although window-based code would execute properly, albeit inefficiently.

Register windows have several advantages over a fixed set of registers, besides reducing the number of loads and stores by about 30%: They work well in LISP (incremental compilation) and object-oriented environments (type-specific procedure linking) where IRA is impractical. They can also be used in specialized controller applications that require extremely fast context switching: a pair of windows (32 registers) can be allocated per context.
--------------------------------
Robert Garner
Sun Microsystems

P.S. There will be two sessions devoted to SPARC at the IEEE Spring Compcon: one session will cover the architecture, compilers, and the SunOS port, and the other will cover the Fujitsu, Cypress, and BIT implementations.

DISCLAIMER: I speak for myself only and do not represent the views of Sun Microsystems, or any other company.
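[Editor's note: Garner's central claim, that CPI alone cannot rank processors because execution time is N (instruction count) times CPI times cycle time, with cache miss cost folded into CPI, is easy to illustrate numerically. Every figure below is invented for illustration; none are measured Sun-4 or M/1000 values.]

```python
# Execution time = N x CPI x cycle time: two hypothetical machines with
# different CPIs but identical wall time. All numbers are made up.

def run_time_ns(n_instructions, cpi, cycle_ns):
    return n_instructions * cpi * cycle_ns

# Machine A pads delayed loads with NOPs: N rises, CPI stays at 1.0.
# Machine B interlocks loads instead: N stays lower, CPI rises.
a = run_time_ns(1_100_000, 1.0, 60)
b = run_time_ns(1_000_000, 1.1, 60)
assert abs(a - b) < 1.0   # same wall time despite a 10% CPI difference

# Folding cache behavior into CPI, as in the 1.65-CPI Sun-4/200 figure:
# effective CPI = base CPI + memory refs/instruction x miss rate x penalty.
def effective_cpi(base_cpi, refs_per_instr, miss_rate, miss_penalty):
    return base_cpi + refs_per_instr * miss_rate * miss_penalty

# 1.0 I-fetch + 0.20 data refs per instruction (15% loads + 5% stores);
# the 1.3 base CPI, 2% miss rate, and 15-cycle penalty are assumed.
cpi = effective_cpi(1.3, 1.20, 0.02, 15)
assert abs(cpi - 1.66) < 0.01
```

Under these assumed numbers the miss term alone contributes 0.36 cycles per instruction, which is why comparing base CPIs while ignoring both N and the memory system tells you little.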
mash@mips.UUCP (John Mashey) (12/17/87)
In article <538@esl.UUCP> ian@esl.UUCP (Ian Kaplan) writes:
...
> I think that the discussion on the SPARC vs. the MIPS R2000 centers
> around why the SPARC is not faster than it is - specifically, why it is
> not as fast as the MIPS processor. What seems to have been missed here,
> if I properly understand Mr. Fogelstrom's article, is that the SPARC is
> quite fast. The lab I work in has a Sun 4/280 and I can tell you that
> it smokes. It may be 20% slower than the MIPS processor, but it is by
> no means a failure. The SPARC is much faster than the Motorola 68020
> and, I would bet, the National processors....

I don't think that anyone is arguing that SPARC is slower than a 68K, or a failure. (It's not, and it isn't.) What is going on is some serious architectural debate (which is what this newsgroup is for!). Of course, some of this has been stirred up by people finally being able to get some actual data, and then trying to understand what's going on.

Note that with one exception [Dave Hough's careful posting a while back of various FP benchmarks], the following are the only performance appraisals that have been offered by Sun to the general public [to my knowledge; I'll be glad to hear of others]:

1) The original set of benchmarks in the Sun-4 introduction publicity (Dhrystone, Stanford, Linpack SP, Linpack DP, and Spice).

2) Introduction materials, brochures, and advertisements: The original announcement described the Sun-4 as ``10 mips'', ``ten times faster than a VAX 11/780'' [1], and ``in the same performance class as the VAX 8800''. [2] "Relative to other manufacturers' high-end offerings, the Sun-4/200 excels in floating-point performance. In fact, the Sun-4/200 will execute floating-point-intensive applications faster than the VAX 8800 superminicomputer." [3] "Our new Sun-4/260, the first born of our brand new family of supercomputing workstations and servers. In computer-ese, it delivers the performance of 10 MIPS.
For the sake of comparison, that's as much horsepower as a minicomputer like the DEC VAX 8800." [4] "SPARC is an open architecture, available today, that Sun uses to implement the best price/performance system available, [5] reach as low as $4000 per million instructions per second." [6] "SPARC is the first [7] RISC architecture to incorporate the features found in supercomputers such as Cray systems. Single-cycle access to a large cache memory, [7a] a large register file, [7b] and pipelining, [7c] features pioneered on supercomputers, are part of SPARC. Register-to-register and load/store design, [7d] along with fixed-format [7e] instructions and more concurrency [7f] in our architecture, propel Sun's RISC machines to unmatched performance." [8] "SPARC is an open, scalable architecture. It is the most scalable [9] RISC architecture available today."

Now, of the numbered assertions, some are amenable to quantitative analysis, by gathering data carefully and publishing it. If properly done, and if enough real benchmarks can be obtained, people can reach their own conclusions about whether or not they believe these assertions. [1], [2], [3], [4] (except horsepower is a little fuzzy: does it include floating point? does it include multi-user performance?), and [8] are statements about performance, and should be testable. [5] and [6] are a little more slippery, as one needs to compute performance first, and then get costs for comparable configurations. [7] can be analyzed by comparing with all other RISC architectures that shipped before SPARC. [9] is hard to evaluate, since scalability is not easily measured.

Anyway, when Forest asks "why is SPARC slow", I think what he means is that neither the published data nor customer benchmarking seems to justify the conclusions reached above, and from the outside, those are the hypotheses that the rest of us get to deal with.
Ian: since you have a 4/280, perhaps you might offer some benchmarks, which would add to our knowledge of a (controversial) topic. In particular, it would be wonderful if you've got any large, actual integer applications, especially if they can be made public-domain so that they can be run anywhere. [floating-point ones are fine, too, but there already exist lots of those, whereas there's a sad lack of integer ones.] Unfortunately, saying that a machine "smokes" doesn't help as much! Also, it would help to specify which Motorola implementation it was much faster than: a recent Computerworld article fell into the trap of saying the Sun-4 was exceeding expectations, because it looked 3-5X faster (than the Sun-3), more even than the claimed 2.5X. If you consider 2-mips 3/100s and 4-mips 3/200s, you can see what happened. To summarize: statements about performance are either completely meaningless, or they're actually supposed to tell something about how computers behave. If they're the latter, you should be able to test them. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
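[Editor's note: the Computerworld "trap" Mashey describes is just arithmetic on the informal mips ratings he quotes.]

```python
# Speedup "over the Sun-3" depends entirely on which Sun-3 you mean.
# Ratings are the informal mips figures from the article above.

SUN4_MIPS = 10        # announced Sun-4 rating
SUN3_100_MIPS = 2     # a 2-mips 3/100
SUN3_200_MIPS = 4     # a 4-mips 3/200

assert SUN4_MIPS / SUN3_100_MIPS == 5.0   # looks like "5X faster"
assert SUN4_MIPS / SUN3_200_MIPS == 2.5   # the claimed 2.5X
```

Both ratios are "the Sun-4 versus the Sun-3"; reporting a 3-5X range against an unspecified baseline is how a claim can appear to be exceeded without the machine being any faster than promised.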
mike@ivory.SanDiego.NCR.COM (Michael Lodman) (12/17/87)
In article <36626@sun.uucp> garner@sun.UUCP (Robert Garner) writes: >Sun's data on MIPS performance implies that the Sun-4/200 >has approximately the same INTEGER performance as the M/1000. The data I've seen in no way backs up this statement. From integer benchmarks I've run on the Sun-4 and a 12Mhz MIPS M/800, the MIPS is indeed about 20%-30% faster conservatively. I haven't yet run any floating-point benchmarks. -- Michael Lodman (619) 485-3335 Advanced Development NCR Corporation E&M San Diego mike.lodman@ivory.SanDiego.NCR.COM {sdcsvax,cbatt,dcdwest,nosc.ARPA,ihnp4}!ncr-sd!ivory!mike When you die, if you've been very, very good, you'll go to ... Montana.
henry@utzoo.uucp (Henry Spencer) (12/17/87)
> ALso, in looking at SPARC assembly code, one notes that cmp's are usually > moved away from the conditional branches, so that perhaps these CPUs, > or later ones, will take advantage of cases where the condition code setting > is early enough to avoid the extra I-fetch. AT&T's CRISP machine in fact takes this to its logical (?) extreme: it basically has one condition-code bit, and if you can manage to set that slightly ahead of time, then the execution time for an in-cache branch is *zero*. (The actual story is a bit more complicated, but that's the general idea, as I recall it from the paper in Sigarch 14.) -- Those who do not understand Unix are | Henry Spencer @ U of Toronto Zoology condemned to reinvent it, poorly. | {allegra,ihnp4,decvax,utai}!utzoo!henry
ian@esl.UUCP (Ian Kaplan) (12/18/87)
In article <1156@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>Ian: since you have a 4/280, perhaps you might offer some benchmarks,
>which would add to our knowledge of a (controversial) topic.
>In particular, it would be wonderful if you've got any large,
>actual integer applications, especially if they can be made public-domain
>so that they can be run anywhere. [floating-point ones are fine, too,
>but there already exist lots of those, whereas there's a sad lack of
>integer ones.]
>Unfortunately, saying that a machine "smokes" doesn't help as much!
>
>To summarize: statements about performance are either completely meaningless,
>or they're actually supposed to tell something about how computers
>behave. If they're the latter, you should be able to test them.
>--

We had a chance to use a Sun-4 for several months before we actually purchased the machine. We ran the standard benchmarks (e.g., Dhrystone and Whetstone) and several of our in-house applications. Our results show that the Sun-4 is about 8 VAX 11/780 MIPS on some benchmarks and as high as 10 VAX MIPS on others. Our VLSI group has been running their design software, and they have found that the Sun-4 is almost three times faster than a Sun 3/260, or around 10 MIPS, for one of their design packages, which uses primarily integer arithmetic. On one of my group's graphics applications, which does 3-D rotation (and is fairly floating-point intensive), the Sun-4 is over 4 times the speed of a Sun 3/180 with a floating-point accelerator board. The VLSI group has not run HSPICE on the Sun-4 yet, but if and when they do, I will post the results. Unfortunately, all the code for the applications I have mentioned is proprietary and cannot be distributed.
>
>Also, it would help to specify which Motorola implementation it was much
>faster than: a recent Computerworld article fell into the trap of
>saying the Sun-4 was exceeding expectations, because it looked 3-5X faster
>(than the Sun-3), more even than the claimed 2.5X. If you consider
>2-mips 3/100s and 4-mips 3/200s, you can see what happened.

Well, I guess that I fell into the same "trap" as Computerworld. My remarks regarding the SPARC vs. the 680x0 are relative to the Sun 3 computer systems. My understanding is that the Sun 3 family does a fairly good job of utilizing the 68020.

I will state once again that I have really enjoyed the discussion of SPARC vs. MIPS. This discussion has been an example of comp.arch at its best. There is no question that solid benchmark data is needed to evaluate various architectural approaches. However, there are many factors that go into making a successful commercial computer system. One of these is system price. I do not think that Sun is trying to build the fastest computer in its class, but I do think that they are trying to build the computer with the best system price per MIP.

Ian L. Kaplan
ESL, Advanced Technology Systems M/S 302
495 Java Dr. P.O. Box 3510
Sunnyvale, CA 94088-3510
decvax!decwrl!\ sdcsvax!seismo!- ames!esl!ian ucbcad!ucbvax!/ / ihnp4!lll-lcc!
irf@kuling.UUCP (Stellan Bergman) (12/18/87)
How does the SPARC architecture differ from HP's HPA/RISC? The latter has been around for quite a while now. I am curious to know, since we plan to move to something faster soon.

Bo Thide', Swedish Institute of Space Physics. UUCP: ...enea!kuling!irfu!bt
jesup@pawl22.pawl.rpi.edu (Randell E. Jesup) (12/18/87)
<John Mashey posts a long article about Sun and MIPS performance, at the end including unix benchmarks for grep, diff, yacc, etc.>

John, one thing was left unmentioned: the unix benchmarks you give are VERY dependent on file-system implementation and, even more so, on the disks used and how fragmented they are. I'm sure I could cause a factor-of-two decrease in performance by fragmenting your disks or making you use slower ones. Any benchmarks that rely on file-system access should either use the same disks, preferably the exact same ones so the fragmentation is the same (I know, impractical), or at least try to quantify these differences and state exactly what the hardware is, avg seek times, how fragmented, etc. Better yet is not to rely on any OS-dependent, and certainly not file-system-dependent, benchmarks.

If you are testing the performance of a specific system configuration (CPU, memory, disk, OS, etc., etc.), fine, do so. If you are addressing the performance of the CPU/FPU/cache (which seems to be what is discussed in this group), don't use those benchmarks, or at least try to factor out the peripheral/OS/whatever differences.

Just trying to adhere to your philosophy of real data wherever possible. :-)

// Randell Jesup Lunge Software Development // Dedicated Amiga Programmer 13 Frear Ave, Troy, NY 12180 \\// lunge!jesup@beowulf.UUCP (518) 272-2942 \/ (uunet!steinmetz!beowulf!lunge!jesup)
mash@mips.UUCP (John Mashey) (12/20/87)
In article <167@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
><John Mashey posts a long article about Sun and MIPS performance, at the
>end including unix benchmarks for grep, diff, yacc, etc.>
>John, one thing was left unmentioned: the unix benchmarks you give are
>VERY dependent on file-system implementation and, even more so, on
>the disks used and how fragmented they are. I'm sure I could cause a
>factor-of-two decrease in performance by fragmenting your disks or making
>you use slower ones. Any benchmarks that rely on file-system access should
>either use the same disks, preferably the exact same ones so the
>fragmentation is the same (I know, impractical), or at least try to quantify
>these differences and state exactly what the hardware is, avg seek times,
>how fragmented, etc. Better yet is not to rely on any OS-dependent, and
>certainly not file-system-dependent, benchmarks.
>If you are testing the performance of a specific system configuration (CPU,
>memory, disk, OS, etc., etc.), fine, do so. If you are addressing the
>performance of the CPU/FPU/cache (which seems to be what is discussed in
>this group), don't use those benchmarks, or at least try to factor out the
>peripheral/OS/whatever differences.
>Just trying to adhere to your philosophy of real data wherever possible. :-)

Most of these are perfectly valid points. Fortunately, I followed all of the rules. Unfortunately, I didn't replicate the context, so anybody who hadn't seen the earlier posting, or hadn't gotten a copy of the Brief, might be misled. The benchmarks posted were an update to the last Performance Brief, posted here, and something we distribute to anyone who asks. It spells out in EXCRUCIATING detail how things are measured, what the configurations were, why we measured what we measured, what we think each benchmark measures, etc., etc. The UNIX benchmarks listed have almost nothing to do with the file-system and OS issues raised. They give user CPU times.
Now, on a cached machine, user CPU times are not necessarily independent of kernel CPU times, and kernel CPU times are certainly not independent of disks. However, the system CPU times on these range from 1-20% of the user CPU (specifically: yacc: 1%, diff: 5%, nroff: 15%, grep: 20%), or about a mean of 10%; hence, any system interference is fairly small. As a guess, the influence of disk fragmentation upon the numbers presented is at worst in the 1% range, which is probably below the noise level of UNIX timing.

To predict UNIX command performance (which is highly relevant to some people), it is difficult to find benchmarks that people understand, might be able to duplicate, and might relate to real performance, but which are not OS-dependent and do no I/O. We picked ones that did a fair amount of user-level computing, compared to kernel computing, to do our best to be clear about what was, and was not, being measured. (OS performance is a whole separate area, quite important also.)
-- 
-john mashey
DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
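[Editor's note: Mashey's defense can be checked against the fractions he quotes. A small illustrative calculation; the "doubling" scenario is an assumed pessimistic bound, not a measurement.]

```python
# System CPU time as a fraction of user CPU time, per benchmark, taken
# from the article above.
sys_fraction = {"yacc": 0.01, "diff": 0.05, "nroff": 0.15, "grep": 0.20}

mean = sum(sys_fraction.values()) / len(sys_fraction)
assert abs(mean - 0.1025) < 1e-6   # "about a mean of 10%"

# Pessimistic bound: even if fragmentation doubled every system time, the
# combined user+system time would grow by only f/(1+f), where f is the
# system fraction. And the reported numbers are user CPU time, which
# excludes system time, so the direct effect on them is smaller still.
worst_growth = max(f / (1.0 + f) for f in sys_fraction.values())
assert worst_growth < 0.17         # under 17% even for grep's 20% fraction
```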
peter@sugar.UUCP (Peter da Silva) (12/22/87)
Ah, benchmarks. Here's something to consider when you start believing benchmarks. The Commodore-Amiga can approach the equivalent of 10 MIPS for certain logical operations on arrays, if one codes a tight loop using the Blitter (a special-purpose processor that does BitBlt operations). The only non-graphics algorithm I have heard of that comes anywhere near this is a 20-generation-per-second 320-by-200 LIFE program. Of course, the machine as a whole doesn't do anywhere near even a single MIP for general-purpose work. It's just a 68000, after all. So think of that next time you believe benchmarks.
-- 
-- Peter da Silva `-_-' ...!hoptoad!academ!uhnix1!sugar!peter
-- Disclaimer: These U aren't mere opinions... these are *values*.