kissell@garth.UUCP (Kevin Kissell) (10/16/86)
In article <722@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
>However, a useful attribute of Roger's measures (or variant thereof)
>is that looking at the measure (units of real performance) per Mhz,
>you get some idea of architectural efficiency, i.e., smaller numbers are
>better, in that (cycle time) is likely to be a property of the technology,
>and hard to improve, at a given level of technology.  [This is clearly
>a RISC-style argument of reducing the cycle count for delivered performance,
>and then letting technology carry you forward.]  Using the numbers above,
>one gets KiloWhets / Mhz, for example:

I don't understand how someone of John's sophistication can insist on
repeating such a clearly fallacious argument.  The statement "cycle time
is likely to be a property of the technology" is simply untrue, as I have
pointed out in previous postings.  Cycle time is the product of gate delays
(a property of technology) and the number of sequential gates between latches
(a property of architecture).  For example, let us consider two machines
that are familiar to John and myself and yet of interest to the newsgroup:
the MIPS R2000 and the Fairchild Clipper.  An 8 Mhz R2000 has a cycle time
of 125ns.  A 33 Mhz Clipper has a cycle time of 30ns.  Yet both are built
with essentially the same 2-micron CMOS technology.  I somehow doubt that
Fairchild's CMOS transistors switch four times faster than those of whoever
is secretly building R2000s this week.  The difference is architectural.

As I understand it, the R2000 was designed to take advantage of delayed
load/branch techniques, and to execute instructions in a small number of
clocks, which in fact go hand in hand.  A load or branch can take as little
as two clocks.  But the addition of two numbers cannot take less than one
clock, and so the ALU has a leisurely 125ns to do something that it could
in principle have done more quickly, had it been more heavily pipelined.
The Clipper was designed from fairly well-established supercomputer and
mainframe techniques.  The cycle time is the time required to do the smallest
amount of useful work - an integer ALU operation at 30ns.  Other instructions
must then of course be multiples of that basic unit.  Assuming cache hits,
a load takes 4/6 clocks (120/180ns vs. 250ns for the R2000) and a branch
takes 9 (270ns vs. 250ns for the R2000).

It should be noted that both machines allow for the overlapped execution
of instructions, but in different ways.  The R2000 overlaps register
operations with loads and branches using delay slots.  The Clipper
overlaps loads but not branches, using resource scoreboarding instead
of delay slots.  This means that the R2000 can branch more efficiently
(assuming the assembler can fill the delay slot), but the Clipper can
have more instructions executing concurrently than the R2000 (4 vs. 2)
in in-line code.  Draw your own conclusions about "architectural efficiency".

>Machine       Mhz   KWhet  KWhet/Mhz
>80287         8     300    40
>32332-32081   15    728    50    (these from Ray Curry,
>32332-32381   15    1200   80     in <3833@nsc.UUCP>) (projected)
>32332-32310   15    1600   100*  "" "" (projected)
>Clipper?      33    1200?  40    guess? anybody know better #?
>68881         12.5  755    60    (from discussion)
>68881         20    1240   60    claimed by Moto, in SUN3-260
>SUN FPA       16.6  1700   100*  DP (from Hough) (in SUN3-160)
>MIPS R2360    8     1160   140*  DP (interim, with restrictions)
>MIPS R2010    8     4500   560   DP (simulated)

John's guess for the Clipper is off by over a factor of two.  The Clipper
FORTRAN compiler was brought up only recently.  In its present sane but
unoptimizing state, I obtained the following result on an Interpro 32C
running CLIX System V.3 at 33 Mhz (1 wait state), using a prototype Green
Hills Clipper FORTRAN compiler with Fairchild math libraries:

          Mhz   Kwhet   Kwhet/Mhz
Clipper   33    2920    Who cares?  Kwhet/Kg and Kwhet/cm2 are of
                        more practical consequence.

Kevin D. Kissell
Fairchild Advanced Processor Division
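The ratio column in the tables above is just KWhet divided by MHz. A quick sketch recomputing it (figures copied from the postings in this thread, including Kevin's measured Clipper result; note the thread's own table rounds its ratio column, so exact quotients differ slightly):

```python
# Recompute the KWhet/MHz column from figures quoted in the thread.
machines = {
    "80287":      (8.0,  300),
    "68881":      (12.5, 755),
    "SUN FPA":    (16.6, 1700),
    "MIPS R2010": (8.0,  4500),
    "Clipper":    (33.0, 2920),   # Kevin Kissell's measured figure
}

# ratio = delivered performance per unit of clock rate
ratios = {name: kwhet / mhz for name, (mhz, kwhet) in machines.items()}

for name, r in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:12s} {r:6.1f} KWhet/MHz")
```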
gnu@hoptoad.uucp (John Gilmore) (10/17/86)
Speaking from MIPS, mash@mips.UUCP (John Mashey) writes:
> ...looking at the measure (units of real performance) per Mhz,
>you get some idea of architectural efficiency, i.e., smaller numbers are
>better, in that (cycle time) is likely to be a property of the technology,
>and hard to improve, at a given level of technology.

Speaking from Fairchild, kissell@garth.UUCP (Kevin Kissell) writes:
> I don't understand how someone of John's sophistication can insist on
> repeating such a clearly fallacious argument.  The statement "cycle time
> is likely to be a property of the technology" is simply untrue...

I love it!  The Intel/Motorola wars have been fun, but I'm glad they're
temporarily in abeyance.  Onward with the RISC versus RISC wars!  B*}
-- 
John Gilmore  {sun,ptsfa,lll-crg,ihnp4}!hoptoad!gnu  jgilmore@lll-crg.arpa
(C) Copyright 1986 by John Gilmore.  May the Source be with you!
jason@hpcnoe.UUCP (Jason Zions) (10/19/86)
jlg@lanl.ARPA (Jim Giles) / 12:59 pm  Oct 15, 1986 /
> Mflops:  (Millions of FLoating point OPerations per Second)
> MHz:  (Millions of cycles per second)
>
> Therefore 'Mflops per MHz':  (Millions^2 FLoating point OPeration cycles
> per sec^2)

'Scuse me?

  Mflops per MHz = Mflops / MHz

                   Millions Flop / Sec
                 = --------------------
                   Millions Hertz / Sec

                 = Floating Point OP / Hertz

                 = FLOP per Cycle.

In other words, how many (or few) floating point operations happen in a
single cycle.  Yeah, it's gonna be a small number, but not as silly a
number as your derivation shows.

> J. Giles
> Los Alamos
-- 
Jason Zions
Hewlett-Packard Colorado Networks Division
3404 E. Harmony Road
Mail Stop 102
Ft. Collins, CO 80525
{ihnp4,seismo,hplabs,gatech}!hpfcdc!hpcnoe!jason
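The unit cancellation above can be written out directly; a minimal sketch (the sample figures are hypothetical, not from any machine in the thread):

```python
def flops_per_cycle(mflops: float, mhz: float) -> float:
    """Mflops / MHz: the millions-per-second factors cancel, leaving a
    dimensionless count of floating-point operations per clock cycle."""
    return mflops / mhz

# Hypothetical machine: 2 Mflops at an 8 MHz clock completes 0.25 FLOPs
# per cycle, i.e. one floating-point operation every 4 cycles.
rate = flops_per_cycle(2.0, 8.0)
```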
hansen@mips.UUCP (10/19/86)
In article <377@garth.UUCP> kissell@garth.UUCP (Kevin Kissell) writes:
>I don't understand how someone of John's sophistication can insist on
>repeating such a clearly fallacious argument.  The statement "cycle time
>is likely to be a property of the technology" is simply untrue, as I have
>pointed out in previous postings.  Cycle time is the product of gate delays
>(a property of technology) and the number of sequential gates between
>latches (a property of architecture).  For example, let us consider two
>machines that are familiar to John and myself and yet of interest to the
>newsgroup: the MIPS R2000 and the Fairchild Clipper.  An 8 Mhz R2000 has a
>cycle time of 125ns.  A 33 Mhz Clipper has a cycle time of 30ns.  Yet both
>are built with essentially the same 2-micron CMOS technology.  I somehow
>doubt that Fairchild's CMOS transistors switch four times faster than
>those of whoever is secretly building R2000s this week.  The difference is
>architectural.

"Cycle time is likely to be a property of the technology" is clearly a
simplification that is useful for making relatively crude comparisons
between widely varying machine designs.  Cycle time, while a crude measure,
has the advantage that it is clearly observable and well-documented.  In
practice, the number of sequential gates between latches is also generally
a property of the technology, given that designers are attempting to
optimize their own design.  It is counterproductive to over-pipeline a
design, as pipe registers themselves add delay and complexity.  Let me
emphasize, however, that I do not intend to assert that the Fairchild
design is over-pipelined.

Now let us address the general issue of a comparison of the technology of
the two machines discussed above (two machines that were clearly chosen
entirely at random).  It is indeed safe to assume that an 8 MHz R2000 has a
cycle time of 125 ns.
However, 8 MHz is not the maximum clock rate that the silicon will support -
that figure is 16.67 MHz, or a cycle time of 60 ns (worst case over
commercial temperatures).  This 16.67 MHz R2000 part is built in a 2-micron
CMOS technology, and Fairchild's part is built in a process that is also
described as a 2-micron CMOS technology.  However, the phrase "2-micron
CMOS technology" is actually very vague.  The available public literature
from both companies is not sufficient to compare these technologies
point-by-point, but I fully expect that Fairchild has pushed harder on
effective transistor gate length and oxide thickness to reach 33 MHz than
MIPS has yet employed to reach 16.67 MHz.  A difference in comparable gate
speed of a factor of two is actually entirely plausible, though we believe
the actual ratio is more on the order of 1.5.  We have been getting our
process technology from the same suppliers week after week.  By using a
slightly less aggressive technology, we are able to get reliable,
multiple-sourced processing.

>As I understand it, the R2000 was designed to take advantage of delayed
>load/branch techniques, and to execute instructions in a small number of
>clocks, which in fact go hand-in-hand.  A load or branch can take as little
>as two clocks.  But the addition of two numbers cannot take less than one
>clock, and so the ALU has a leisurely 125ns to do something that it could
>in principle have done more quickly, had it been more heavily pipelined.

I have to disagree on several of the points claimed here.  The R2000 design
will execute load and branch instructions at a rate of one instruction per
cycle (a 60 ns cycle), and takes one 60 ns cycle to perform an integer ALU
operation.  In fact, the R2000 will execute ALL instructions in a single
cycle, which substantially simplified the design.
It is, of course, entirely untrue that the addition of two numbers cannot
take less than one clock, but this is not the heart of the matter: the
integer ALU is not the critical path in the R2000 design.

>The Clipper was designed from fairly well-established supercomputer and
>mainframe techniques.  The cycle time is the time required to do the
>smallest amount of useful work - an integer ALU operation at 30ns.  Other
>instructions must then of course be multiples of that basic unit.  Assuming
>cache hits, a load takes 4/6 clocks (120/180ns vs 250ns for the R2000) and
>a branch takes 9 (270ns vs. 250ns for the R2000).

Correcting the numbers above, we have 120/180 ns (Clipper) vs. 60 ns
(R2000) for a load, and 270 ns vs. 60 ns for a branch.

>It should be noted that both machines allow for the overlapped execution
>of instructions, but in different ways.  The R2000 overlaps register
>operations with loads and branches using delay slots.  The Clipper
>overlaps loads but not branches, using resource scoreboarding instead
>of delay slots.  This means that the R2000 can branch more efficiently
>(assuming the assembler can fill the delay slot), but the Clipper can
>have more instructions executing concurrently than the R2000 (4 vs 2)
>in in-line code.

Resource scoreboarding is no more effective at using load delay slots
(which are delays inherent in the computation) than static scheduling.
Since instructions are issued in the order in which they are presented in a
scoreboard controller, an operation that depends on the value of a pending
load instruction must wait for the load to complete on either machine.  The
number of delay cycles is, however, an important factor in determining
performance.  It is hardly advantageous to have 4 cycle (or is it 6 cycle?)
load instructions, no matter how slickly this is portrayed as a feature
with the phrase "can have more instructions executing concurrently."
The R2000 can fill the delay slot with a useful instruction (which can even
be an additional load instruction) over 70% of the time.  With what
frequency can Clipper compilers find three instructions, none of which can
be a load, to fill the three load delay slots on a Clipper?

>Draw your own conclusions about "architectural efficiency".

The Clipper designers claim 5 MIPS performance at 33 MHz, while the R2000
performs at 10 MIPS at 16.67 MHz.  The Fairchild technology is as much as
twice as aggressive as the R2000 technology, but the Clipper only achieves
half the performance.  My conclusion is that the R2000 is two to four times
as "efficient" an architecture.  For Clipper to reach the same performance
in the same technology, using their current architecture, they need 66 MHz
parts, with an input clock rate well above the broadcast FM radio band.

>>Machine       Mhz   KWhet  KWhet/Mhz
>>80287         8     300    40
>>32332-32081   15    728    50    (these from Ray Curry,
>>32332-32381   15    1200   80     in <3833@nsc.UUCP>) (projected)
>>32332-32310   15    1600   100*  "" "" (projected)
>>Clipper?      33    1200?  40    guess? anybody know better #?
>>68881         12.5  755    60    (from discussion)
>>68881         20    1240   60    claimed by Moto, in SUN3-260
>>SUN FPA       16.6  1700   100*  DP (from Hough) (in SUN3-160)
>>MIPS R2360    8     1160   140*  DP (interim, with restrictions)
>>MIPS R2010    8     4500   560   DP (simulated)
>
>John's guess for the Clipper is off by over a factor of two.  The Clipper
>FORTRAN compiler was brought up only recently.  In its present sane but
>unoptimizing state, I obtained the following result on an Interpro 32C
>running CLIX System V.3 at 33 Mhz (1 wait state), using a prototype Green
>Hills Clipper FORTRAN compiler with Fairchild math libraries:
>
>          Mhz   Kwhet   Kwhet/Mhz
>Clipper   33    2920    Who cares?  Kwhet/Kg and Kwhet/cm2 are of
>                        more practical consequence.
>
>Kevin D.
>Kissell
>Fairchild Advanced Processor Division

Clipper   33    2930    90 = Kwhet/MHz

I'd like to thank Kevin for providing this performance data and point out
that this ratio is a respectable accomplishment on Fairchild's part - this
number is comparable to the values obtained by using multiple-chip FP
processors built with Weitek arithmetic units and interfaced to microcoded
processors.  While the FP arithmetic operations take longer in the Clipper
than in Weitek parts (which are built in an unmistakably slower
technology), by reducing communications overhead, the overall performance
comes out comparably well.

Let me make clear why Kwhet/MHz or MIPS/MHz ratios are useful: they provide
some insight into where the emphasis was placed in the design, and where
future derivative designs can reach.  It's my view that Kevin's remarks
confirm that the Clipper design was intended from the start to build a
machine with a low MIPS/MHz ratio, with the clock rate based on the lowest
conceivable executable unit.  It should also be clear what level of
architectural efficiency results from optimizing integer ALU operations
(Clipper), rather than by optimizing the architecture to execute load,
store and branch operations (MIPS).
-- 
Craig Hansen			| "Evahthun' tastes
MIPS Computer Systems		|  bettah when it
...decwrl!mips!hansen		|  sits on a RISC"
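The delay-slot argument in the thread can be turned into a rough cost model. This is only a sketch: the one number taken from the postings is the R2000's ~70% fill rate for its single load delay slot; the Clipper fill rate below is a pure placeholder, since the thread leaves that question open.

```python
def effective_load_cycles(delay_slots: int, fill_rate: float) -> float:
    """Average cycles per load: the load itself plus one wasted cycle for
    each delay slot the compiler fails to fill with useful work."""
    return 1 + delay_slots * (1 - fill_rate)

r2000 = effective_load_cycles(1, 0.70)          # ~1.3 cycles per load
clipper_guess = effective_load_cycles(3, 0.30)  # placeholder fill rate
```

Under these assumptions the three-slot machine wastes far more issue opportunities per load unless its compiler fills slots at a much higher rate.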
mash@mips.UUCP (10/19/86)
In article <377@garth.UUCP> kissell@garth.UUCP (Kevin Kissell) writes:
>In article <722@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
>...that are familiar to John and myself and yet of interest to the
>newsgroup: the MIPS R2000 and the Fairchild Clipper.  An 8 Mhz R2000 has a
>cycle time of 125ns.  A 33Mhz Clipper has a cycle time of 30ns.  Yet both
>are built with essentially the same 2-micron CMOS technology.  I somehow
>doubt that Fairchild's CMOS transistors switch four times faster than that
>of whoever is secretly building R2000s this week.  The difference is
>architectural.

(One of my colleagues got here first, hansen@mips, in 726@mips.UUCP, so
I'll just add a few notes where they don't overlap too much.)

There was no intent in the original posting to start a MIPS versus Clipper
war [contrary to John Gilmore's posting in <1198@hoptoad.uucp>: sorry John,
another Moto versus Intel battle we do not need, fun though it may be to
watch!]  I was only trying to be reasonably inclusive of relevant 32-bit
micros.  However, now that the issue has been raised.....

An 8Mhz R2000 isn't pushing the technology very hard, ON PURPOSE!!!  8Mhz
parts appear first, followed by 12s and 16s, for the same reasons you got
12Mhz 68020s before 16s and 25s.  Also, I'm told that the 2u design doesn't
push 2u technology as hard as it might have, in order to let the same
design be shrunk to 1.5u and 1.2u with minimal effort.

Now, the reason one might care about MWhets/MHz (or any similar measure
that compares the delivered real performance with some basic technology
speed) is to understand the margin and headroom in a design.  Since Kevin
brought the issue up, some hypothetical questions:

a) Will there be 66Mhz Clippers in 2u CMOS?  [To get actual performance
like a 16Mhz R2000 in 2u.]  [If the answer is yes, I know a bunch of
people, not all at MIPS, either, who have some real tough questions
involving transmission-line effects, how to do ECL or other
reduced-voltage-swing I/O, etc.]
b) If they will be, what year will they be?  [1987?]

c) When will there be bigger / (more in parallel) CAMMU chips?  [Because if
there aren't, how are the caches going to get enough bigger to keep the
delivered performance in line with the CPU clock speed improvements (for
real programs)?]  Chips get faster with shrinks, but they don't magically
get re-laid-out to acquire more memory.  CAMMU chips have some good ideas
in them, but they're not very big, especially compared with the needs of
some of the real programs that people would like to run on high-performance
micros.  (There is some real nasty stuff lurking out there!  People keep
putting them on our machines, so we know....  If the Clipper FORTRAN
compilers just came up recently, and they haven't yet tried running 500KLOC
FORTRAN programs... interesting times are ahead....)

>The Clipper was designed from fairly well-established supercomputer and
>mainframe techniques....

"Fairly well-established supercomputer and mainframe techniques" is
interesting.  I can think of 2 ways to read this assertion:
a) High-performance VLSI designs should be done just like big machines.  OR
b) High-performance VLSI should be designed with good understanding of big
machines, as well as good understanding of the tradeoffs necessary for VLSI
[margin, headroom, packaging constraints, processes, etc, etc], where those
are different from the design tradeoffs of the big ECL boxes.
I hope Kevin meant b), which most people would agree with.

>John's guess for the Clipper is off by over a factor of two.  The Clipper

Thanks for the info: all I'd seen were random guesses from people around
the net, and it's a useful contribution to see numbers from somebody that
knows.  Hopefully, we'll see more?  [I assume that was DP?]

>FORTRAN compiler was brought up only recently.
>In its present sane but
>unoptimizing state, I obtained the following result on an Interpro 32C
>running CLIX System V.3 at 33 Mhz (1 wait state), using a prototype Green
>Hills Clipper FORTRAN compiler with Fairchild math libraries:
>
>          Mhz   Kwhet   Kwhet/Mhz
>Clipper   33    2920    Who cares?  Kwhet/Kg and Kwhet/cm2 are of
>                        more practical consequence.

As hansen@mips noted, these are reasonable results, and I'd assume they'll
improve somewhat with more mature compiler technology.

Actually, this raises a set of questions that might be of general interest
in this newsgroup, basically:
1) What metrics are interesting?
2) How do you define them?
3) In what problem domains are they relevant?
4) What are different constraints that people use?
5) How do different metrics correlate?  Specifically, are some of the
simpler (easier-to-measure) ones good predictors of the more complex ones?

For example, here are some metrics, all of which have appeared in this
newsgroup at some time or other.  Proposals are solicited:
a) Clock rate (Mhz)                                   --
b) Peak Mips [i.e., typically back-to-back cached,
   register-register adds]                            --
c) Sustained Mips                                     ?
d) Benchmark performance relative to other computers  ++
e) Peak Mflops [i.e., "" "" for FP]                   --
f) Dhrystones
g) Whetstones                                         +
h) LINPACK MFlops                                     ++
i) Kwhets / Mflops [g/e]                              -
j) Kwhets / Mhz [g/a]                                 +
k) Kg
l) cm2 (or cm3)
m) Watts
n) $$                                                 +++
o) Kwhets / Kg [g/k]
p) Kwhets / cm2 [g/l]                                 +
q) Kwhets / Watt [g/m]                                +
r) (any of the above) / $$                            +++ (esp. if d))
---------
(-- & ++ indicate general impression of these metrics)

What's interesting is that people have all sorts of different constraint
combinations or optimization functions over any of these.  Let me try a few
examples, and solicit some more:
1) Maximize g), h), etc., subject to few constraints, i.e., for people who
buy CRAYs, etc., money is (almost) no object.
2) Maximize one of the performance numbers, subject to some constraint.
The constraint might be: absolute cm2 or cm3, as in some avionics things,
i.e., if it doesn't fit, it doesn't matter how fast it is!  Or $$: get me
the most for some fixed amount of money, and I don't care if another
machine is 2X faster, even if it's more cost-effective.
3) Performance may not be particularly important at all, relative to
object-code compatibility, software availability, service, etc.

Comments?  What sorts of metrics are important to the people who read this
newsgroup?  What kinds of constraints?  How do you buy machines?  If you
buy CPU chips, how do you decide what to pick?
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
rentsch@unc.UUCP (Tim Rentsch) (10/27/86)
In article <727@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
> Now, the reason one might care about MWhets/MHz (or any similar measure
> that compares the delivered real performance with some basic technology
> speed) is to understand the margin and headroom in a design.

There is a subtle pitfall in arguing that FLOPS/HZ (or IPS/HZ) is a measure
of architectural "goodness".  Certainly, measuring FLOPS/HZ is a reasonable
attempt to factor out the particulars of the device fabrication, which are
obviously irrelevant to architecture.  (If your chip runs twice as fast as
my chip only because it is 5 times as small, your process technology is
better than mine, but your architecture may not be.)  BUT -- and here is
the pitfall -- it just might be that given identical fabrication methods,
the better FLOPS/HZ choice would still run slower because it would not
support the higher clock rate.  RISC proponents would argue that one reason
for having simple instruction sets is to *lower the cycle time* so that the
machine can run faster and get more work done.  Your machine's FLOPS/HZ may
be twice as good as mine, but if my HZ is three times yours (in identical
technology), my machine is faster -- and so my architecture is better.

> Comments?  What sorts of metrics are important to the people who read
> this newsgroup?  What kinds of constraints?  How do you buy machines?
> If you buy CPU chips, how do you decide what to pick?

The metrics I'm interested in measure speed.  (Basically, I'm hooked on
fast machines.)  Other constraints are less interesting because: (1) I will
buy the fastest machine I can afford, and (2) in terms of architecture,
speed is the bottom line -- all else is just mitigating circumstances.
("I know machine X runs 3 times as fast as machine Y, but machine X is
Gallium Arsenide."  Compare architectures, not technologies.)
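The pitfall described above is easy to put in numbers. In this hypothetical comparison (figures invented to match the "twice the FLOPS/HZ, one third the clock" scenario), machine A wins the ratio metric but machine B delivers more actual FLOPS:

```python
# Hypothetical machines in the same technology:
# A does more per cycle, B clocks three times faster.
a_flops_per_cycle, a_mhz = 2.0, 10.0
b_flops_per_cycle, b_mhz = 1.0, 30.0

a_mflops = a_flops_per_cycle * a_mhz   # delivered throughput of A
b_mflops = b_flops_per_cycle * b_mhz   # delivered throughput of B

# A "wins" the ratio metric; B wins actual throughput.
assert a_flops_per_cycle > b_flops_per_cycle
assert b_mflops > a_mflops
```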
Here are my favorite metrics (in no particular order):

(1) micro-micro-benchmark: well defined task, with well defined algorithm,
hand coded in the lowest level language available (microcode if it comes to
that) by an arbitrarily clever programmer who can take advantage of all
machine dependencies (instruction timings, overlaps and/or interlocks,
special instructions, cache sizes, etc.).  The algorithm can change
slightly to take advantage of machine characteristics, but must be
"recognizable".

(1a) same as above, but at assembly language level.  Instruction set
cleverness is allowed; microcode and special knowledge such as cache size
is not.

(2) micro-benchmark: well defined task, with algorithm given in some
particular programming language (and the benchmark must be compiled from
the given algorithm).  The point here is to measure the speed of the
machine in "typical" situations, including compiler effectiveness.  The
time taken to do the compile is irrelevant, as long as it is reasonably
finite.

(3) macro-benchmark: the problem with (1) and (2) is that they don't
measure all kinds of things that inevitably take place in real systems.
(On the other hand, (1) and (2) are easy to run, and also easy to fudge, so
they are more often done.)  A macro-benchmark is like (2) in having a given
program, except that the given program is very large, so that code size is
comparable to the amount of real memory on the machine (hopefully code >
real memory).  Now the effectiveness of the machine for
problems-in-the-large will be measured, including things like swapping
speeds and TLB hit rates, etc.  Sadly, this is a vague measure because
there are so few large programs which can be used as the benchmark, and
many variable parameters creep in (such as how fast the disk seeks are,
etc.).  Even so, it is worth remembering that speed in the small is
different from speed in the large, and that the latter is really what we
desire.  (Or should that be, "what I desire"? :-)

cheers,

txr
crowl@rochester.ARPA (Lawrence Crowl) (10/27/86)
>>In article <727@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
>> Now, the reason one might care about MWhets/MHz (or any similar measure
>> that compares the delivered real performance with some basic technology
>> speed) is to understand the margin and headroom in a design.

>In article <103@unc.unc.UUCP> rentsch@unc.UUCP (Tim Rentsch) writes:
> There is a subtle pitfall in arguing that FLOPS/HZ (or IPS/HZ) is a
> measure of architectural "goodness".  Certainly, measuring FLOPS/HZ is a
> reasonable attempt to factor out the particulars of the device
> fabrication, which are obviously irrelevant to architecture.  ...  BUT --
> and here is the pitfall -- it just might be that given identical
> fabrication methods, the better FLOPS/HZ choice would still run slower
> because it would not support the higher clock rate.

Perhaps what we are missing is that for a given level of technology, a
longer clock cycle allows us to have a larger depth of combinational
circuitry.  That is, we can have each clock work through more gates.  So, a
4 MHz clock which governs propagation through a combinational circuit 4
gates deep will do roughly the same work as a 1 MHz clock governing
propagation through a combinational circuit 16 gates deep.  Perhaps a
better measure is the depth of gates required to implement a FLOP (or an
instruction, or a window, etc.).

The very fast clock, heavily pipelined machines like the Cray and Clipper
follow the first approach, while the slower clock, less pipelined machines
like the Berkeley RISC and MIPS follow the second approach.  Which is
better is probably dependent upon the technology used to implement the
architecture and the desired speed.  For instance, if we want a very fast
vector processor, we should probably choose the fast clock, more pipelined
architecture.  If we want a better price/performance ratio, we should
probably choose the slow clock, less pipelined architecture.

BOLD UNSUPPORTED CLAIM: The "best" architecture is technology dependent.
The quality of an architecture is dependent on the technology used to
implement it, and no architecture is "best" under more than a limited range
of technologies.  For instance, under technologies in which the bandwidth
to memory is most limited, stack architectures (Burroughs, Lilith) will be
"better".  Under technologies where the ability to process instructions is
most limited, the wide register to register architectures will be "better".
-- 
Lawrence Crowl			716-275-5766
University of Rochester		crowl@rochester.arpa
Computer Science Department	...!{allegra,decvax,seismo}!rochester!crowl
Rochester, New York, 14627
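The 4 MHz / 1 MHz example above reduces to simple arithmetic: with a fixed per-gate delay, clock rate times combinational depth is constant, so both designs traverse the same number of gate levels per second. A sketch, where the 62.5 ns gate delay is a hypothetical value chosen to reproduce the example's clock rates:

```python
GATE_DELAY_NS = 62.5   # hypothetical fixed gate delay for the technology

def max_clock_mhz(depth: int) -> float:
    """Clock rate permitted by `depth` sequential gates between latches."""
    cycle_ns = depth * GATE_DELAY_NS
    return 1000.0 / cycle_ns           # convert ns period to MHz

shallow = max_clock_mhz(4)    # 4 gates deep  -> 4 MHz
deep    = max_clock_mhz(16)   # 16 gates deep -> 1 MHz

# Gate levels traversed per microsecond are identical for both designs.
assert shallow * 4 == deep * 16
```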
bcase@amdcad.UUCP (Brian Case) (10/28/86)
>Perhaps what we are missing is that for a given level of technology, a
>longer clock cycle allows us to have a larger depth of combinational
>circuitry.  That is, we can have each clock work through more gates.  So,
>a 4 MHz clock which governs propagation through a combinational circuit 4
>gates deep will do roughly the same work as a 1 MHz clock governing
>propagation through a combinational circuit 16 gates deep.  Perhaps a
>better measure is the depth of gates required to implement a FLOP, (or an
>instruction, or a window, etc.).

Yes, but if the 4 Mhz/4 gates implementation can support pipelining and the
pipeline can be kept full (one of the major goals of RISC), then it will do
4 times the work at 4 times the clock speed; in other words the FLOPS/MHz
or MIPS/MHz or whatever/MHz will be the same!  Thus, I still think this
isn't such a bad metric to use for comparison.  If pipelining can't be
implemented or the pipeline can't be kept full for a reasonable portion of
the time, then the FLOPS/MHz will indeed go down, making FLOPS/MHz a
misleading indicator.

>The very fast clock, heavily pipelined machines like the Cray and Clipper
>follow the first approach, while the slower clock, less pipelined machines
>like the Berkeley RISC and MIPS follow the second approach.  Which is

Now wait a minute.  I don't think anyone at Berkeley, Stanford, or MIPS Co.
will agree with this statement.  The clock speeds may vary among the
machines you mention, but that is basically a consequence of implementation
technology.  I think everyone is trying to make pipestages as short as
possible so that future implementations will be able to exploit future
technology to the fullest extent.

>probably dependent upon the technology used to implement the architecture
>and the desired speed.  For instance, if we want a very fast vector
>processor, we should probably choose the fast clock, more pipelined
>architecture.
>If we want
>a better price/performance ratio, we should probably choose the slow
>clock, less pipelined architecture.

I certainly agree that if a very fast vector processor is required, the
highest clock speed possible with the most pipelining that makes sense
should be chosen.  But why should we choose a different approach for the
better price/performance ratio?  Unless you are trying only to decrease
price (which is not the same as increasing price/performance), one should
still aim for the highest possible clock speed and pipelining.  If the
price/performance is right, I don't care if my add takes one cycle at 1 MHz
or 4 at 4 Mhz.  In addition, for little extra cost (I claim but can't
unconditionally prove), the 4 at 4 Mhz version will in some cases give me
the option of 4 times the throughput.  I do acknowledge that I am starting
to talk about a machine for which FLOPS/MHz may not be a good comparison
metric.

>BOLD UNSUPPORTED CLAIM: The "best" architecture is technology dependent.
>The quality of an architecture is dependent on the technology used to
>implement it, and no architecture is "best" under more than a limited
>range of technologies.  For instance, under technologies in which the
>bandwidth to memory is most limited, stack architectures (Burroughs,
>Lilith) will be "better".  Under technologies where the ability to process
>instructions is most limited, the wide register to register architectures
>will be "better".

I agree that technology influences (or maybe "should influence")
architecture.  But I don't think limited memory bandwidth indicates a stack
architecture; rather, I would say a stack architecture is contraindicated!
If memory bandwidth is a limiting factor on performance, then many
registers are needed!  Optimizations which reduce memory bandwidth
requirements are those that keep computed results in registers for later
re-use; such optimizations are difficult, at best, to realize for a stack
architecture.
When you say "the ability to process instructions is most limited" I guess
that you mean "the ability to fetch instructions is most limited" (because
any processor whose ability to actually process its own instructions is
most limited is probably not worth discussing).  In this case, I would
think that shorter instructions in which some part of operand addressing is
implicit (e.g. instructions for a stack machine) would be indicated; "wide
register to register" instructions would simply make matters worse.
Probably the best thing to do is design the machine right the first time,
i.e. give it enough instruction bandwidth.

I fear that this posting reads like a flame; it is not intended to be a
flame.
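The "one cycle at 1 MHz or 4 at 4 Mhz" point in the posting above is the usual latency/throughput distinction; a sketch with the thread's numbers, assuming (as the posting does) that the 4-stage pipeline can be kept full:

```python
# Latency of one add is identical in both designs.
shallow_latency_us = 1 / 1.0    # 1 cycle at 1 MHz
deep_latency_us    = 4 / 4.0    # 4 cycles at 4 MHz

# Throughput favors the pipelined design only when the pipeline is full.
shallow_adds_per_us = 1.0       # one add completes per 1 MHz cycle
deep_adds_per_us    = 4.0       # one add completes per 4 MHz cycle, if a
                                # new add is issued every cycle

assert shallow_latency_us == deep_latency_us
assert deep_adds_per_us == 4 * shallow_adds_per_us
```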
mash@mips.UUCP (John Mashey) (10/29/86)
In article <21944@rochester.ARPA> crowl@rochtest.UUCP (Lawrence Crowl) writes:
>In article <727@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
>> ... MWhets/MHz, etc., as a way to factor out transient technology...
>
>Perhaps what we are missing is that for a given level of technology, a longer
>clock cycle allows us to have a larger depth of combinational circuitry. That
>is, we can have each clock work through more gates. So, a 4 MHz clock which
>governs propagation through a combinational circuit 4 gates deep will do
>roughly the same work as a 1 MHz clock governing propagation through a
>combinational circuit 16 gates deep. Perhaps a better measure is the depth of
>gates required to implement a FLOP (or an instruction, or a window, etc.).

Can you suggest some numbers for different machines? One of the reasons I
proposed a (simplistic) measure is the absolute difficulty of finding such
things out.

>BOLD UNSUPPORTED CLAIM: The "best" architecture is technology dependent. The
>quality of an architecture is dependent on the technology used to implement
>it, and no architecture is "best" under more than a limited range of
>technologies. For instance, under technologies in which the bandwidth to
>memory is most limited, stack architectures (Burroughs, Lilith) will be
>"better". Under technologies where the ability to process instructions is
>most limited, the wide register to register architectures will be "better".

Much of this seems true. We always claim that the real meaning of RISC in
VLSI RISC is "Response to Inherent Shifts in Computer technology", i.e., in
hardware: fast, dense, cheap SRAMs and higher-pincount VLSI packages; and in
software: more use of high-level languages and portable OS's like UNIX. In
the days of core memories, it is likely that the more aggressively undense
RISCs [i.e., those with only 32-bit instructions] would have been bad ideas
for anything but high-end machines.
Given TTL, NMOS, CMOS, ECL, and GaAs, for example, it would be interesting
to hear from people who are implementing / have implemented the same machine
over multiple technologies [such as DEC VAXen, IBM 370s, and HP Spectrums,
all of which are supposed to exist in at least 3 of the first 4 of the above;
I think most GaAs designs are RISCs, given smaller gate counts.]
-- 
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
rentsch@unc.UUCP (Tim Rentsch) (11/02/86)
In article <1903@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>I must disagree. Reliability is at least as important as speed.

Right. But reliability is a measure of the *implementation*, not of the
architecture.

cheers, txr
crowl@rochester.ARPA (Lawrence Crowl) (11/04/86)
>>> mash@mips.UUCP (John Mashey)
))  crowl@rochtest.UUCP (Lawrence Crowl)
>   bcase@amdcad.UUCP (Brian Case)
]   mash@mips.UUCP (John Mashey)
    crowl@rochtest.UUCP (Lawrence Crowl)

>>> ... MWhets/MHz, etc., as a way to factor out transient technology...

))Perhaps what we are missing is that for a given level of technology, a longer
))clock cycle allows us to have a larger depth of combinational circuitry. That
))is, we can have each clock work through more gates. So, a 4 MHz clock which
))governs propagation through a combinational circuit 4 gates deep will do
))roughly the same work as a 1 MHz clock governing propagation through a
))combinational circuit 16 gates deep. Perhaps a better measure is the depth of
))gates required to implement a FLOP (or an instruction, or a window, etc.).

]Can you suggest some numbers for different machines? One of the reasons
]I proposed a (simplistic) measure is the absolute difficulty of finding
]such things out.

No, I cannot suggest numbers. I suspect they would be difficult to obtain.
Maybe I should think more next time.

>Yes, but if the 4 MHz/4 gates implementation can support pipelining and the
>pipeline can be kept full (one of the major goals of RISC), then it will do
>4 times the work at 4 times the clock speed; in other words the FLOPS/MHz or
>MIPS/MHz or whatever/MHz will be the same! Thus, I still think this isn't
>such a bad metric to use for comparison. If pipelining can't be implemented
>or the pipeline can't be kept full for a reasonable portion of the time,
>then the FLOPS/MHz will indeed go down, making FLOPS/MHz a misleading
>indicator.

One of us is confused here, and I do not know which. Assume an instruction
takes a constant 16 combinational gates. The 4 MHz and 4 gates will require
4 stages while the 1 MHz and 16 gates will require one stage. Both machines
will execute 1 MIPS. But they have a factor of 4 difference in MHz/MIPS.
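That arithmetic can be checked with a toy model. The uniform 62.5 ns gate
delay below is an assumption chosen so that 16 gate delays come to 1000 ns;
the 16-gate instruction is the hypothetical from the discussion, not a
measurement of any real machine:

```python
GATE_DELAY_NS = 62.5          # assumed: 16 gate delays = 1000 ns of logic
GATES_PER_INSTRUCTION = 16

def stats(gates_per_stage):
    """Return (clock in MHz, MIPS, MHz/MIPS) for an unpipelined design."""
    stages = GATES_PER_INSTRUCTION // gates_per_stage
    clock_mhz = 1000.0 / (gates_per_stage * GATE_DELAY_NS)
    # Unpipelined: the next instruction waits until all stages are done.
    mips = clock_mhz / stages
    return clock_mhz, mips, clock_mhz / mips

# 16 gates/stage: 1 MHz clock, single stage, 1 MIPS, MHz/MIPS = 1.
assert stats(16) == (1.0, 1.0, 1.0)
# 4 gates/stage: 4 MHz clock, four stages, still 1 MIPS, MHz/MIPS = 4.
assert stats(4) == (4.0, 1.0, 4.0)
```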
If we pipeline the 4 MHz and 4 gates into a four-stage pipeline, the MHz/MIPS
will be the same but the performance will be a factor of 4 different.

))The very fast clock, heavily pipelined machines like the Cray and Clipper
))follow the first approach, while the slower clock, less pipelined machines
))like the Berkeley RISC and MIPS follow the second approach. Which is better is

>Now wait a minute. I don't think anyone at Berkeley, Stanford, or MIPS Co.
>will agree with this statement. The clock speeds may vary among the machines
>you mention, but that is basically a consequence of implementation technology.
>I think everyone is trying to make pipestages as short as possible so that
>future implementations will be able to exploit future technology to the
>fullest extent.

There are at least two approaches, exemplified by the following two examples.
The first has a clock controlling progress through three stages: from the
register bank to the ALU, through the ALU, and back to the register bank. The
second approach is to do all this in one stage. The first approach has the
potential to pipe while the second has a lower clock rate. In both cases
faster clock rates allow faster implementations. Which machines take which
approach?

))probably dependent upon the technology used to implement the architecture and
))the desired speed. For instance, if we want a very fast vector processor, we
))should probably choose the fast clock, more pipelined architecture. If we
))want a better price/performance ratio, we should probably choose the slow
))clock, less pipelined architecture.

>I certainly agree that if a very fast vector processor is required, the
>highest clock speed possible with the most pipelining that makes sense should
>be chosen. But why should we choose a different approach for the better
>price/performance ratio?
>Unless you are trying only to decrease price (which is not the same as
>increasing price/performance), one should still aim for the highest possible
>clock speed and pipelining. If the price/performance is right, I don't care
>if my add takes one cycle at 1 MHz or 4 at 4 MHz. In addition, for little
>extra cost (I claim but can't unconditionally prove), the 4-at-4-MHz version
>will in some cases give me the option of 4 times the throughput. I do
>acknowledge that I am starting to talk about a machine for which FLOPS/MHz
>may not be a good comparison metric.

Higher clock rates generally imply higher quality parts, more EMI shielding,
etc., which implies a higher cost. You do not expect a 3000 RPM engine to
cost the same as an 8000 RPM engine, do you? In addition, exploiting pipeline
potential generally costs significant development effort and gates to control
the piping. Now, adding some pipelining to a simple scheme is probably cost
effective, but adding as much as is possible is not. We must find a balance.

))BOLD UNSUPPORTED CLAIM: The "best" architecture is technology dependent. The
))quality of an architecture is dependent on the technology used to implement
))it, and no architecture is "best" under more than a limited range of
))technologies. For instance, under technologies in which the bandwidth to
))memory is most limited, stack architectures (Burroughs, Lilith) will be
))"better". Under technologies where the ability to process instructions is
))most limited, the wide register to register architectures will be "better".

>I agree that technology influences (or maybe "should influence") architecture.
>But I don't think limited memory bandwidth indicates a stack architecture;
>rather, I would say a stack architecture is contraindicated! If memory
>bandwidth is a limiting factor on performance, then many registers are needed!
>Optimizations which reduce memory bandwidth requirements are those that keep
>computed results in registers for later re-use; such optimizations are
>difficult, at best, to realize for a stack architecture.

Stacks and registers are not incompatible. It is easy to imagine a machine
which did pushes and pops between the stack and a register bank. If register
to register architectures are allowed to store temporaries and local
variables in registers, the stack architecture should be allowed to also. We
should separate the notion of registers as a means to evaluate expressions
and as a storage medium.

>When you say "the ability to process instructions is most limited" I guess
>that you mean "the ability to fetch instructions is most limited" (because
>any processor whose ability to actually process its own instructions is most
>limited is probably not worth discussing). In this case, I would think that
>shorter instructions in which some part of operand addressing is implicit
>(e.g. instructions for a stack machine) would be indicated; "wide register to
>register" instructions would simply make matters worse. Probably the best
>thing to do is design the machine right the first time, i.e. give it enough
>instruction bandwidth.

"The ability to fetch instructions" is precisely what I did NOT mean. You
seem to have effectively argued for a stack architecture when bandwidth to
memory is limited. After all, instructions are in memory. What I meant by
"the ability to process instructions" is: once you have the instruction in
the CPU, how quickly can you deal with it (relative to getting it into the
CPU in the first place)?
-- 
Lawrence Crowl                 716-275-5766    University of Rochester
crowl@rochester.arpa           Computer Science Department
...!{allegra,decvax,seismo}!rochester!crowl   Rochester, New York, 14627
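The "pushes and pops between the stack and a register bank" idea can be
sketched as a toy evaluation stack whose top slots live in registers and
spill to memory only on overflow. This is entirely hypothetical code for
illustration; it does not describe any real machine's scheme:

```python
class RegisterCachedStack:
    """Evaluation stack whose newest entries live in a fast register bank;
    older entries spill to (simulated) memory only when the bank overflows."""

    def __init__(self, n_regs=8):
        self.regs = []        # top-of-stack cache (the register bank)
        self.memory = []      # spilled, older entries
        self.n_regs = n_regs
        self.mem_ops = 0      # memory traffic counter

    def push(self, value):
        if len(self.regs) == self.n_regs:     # bank full: spill oldest
            self.memory.append(self.regs.pop(0))
            self.mem_ops += 1
        self.regs.append(value)

    def pop(self):
        if not self.regs and self.memory:     # bank empty: refill
            self.regs.append(self.memory.pop())
            self.mem_ops += 1
        return self.regs.pop()

stack = RegisterCachedStack(n_regs=8)
stack.push(2)
stack.push(3)
stack.push(stack.pop() + stack.pop())   # evaluate 2 + 3 on the stack
assert stack.pop() == 5
assert stack.mem_ops == 0               # shallow expression: no memory traffic

deep = RegisterCachedStack(n_regs=2)
for v in range(5):                      # deep expression: oldest entries spill
    deep.push(v)
assert deep.mem_ops == 3
```

The point of the sketch is only that such a machine keeps its working set in
registers exactly as a register-to-register machine would, while presenting a
stack model to the instruction set.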
dvk@sei.cmu.edu (Daniel Klein) (11/04/86)
Okay, blatant, flaming opinion time... I really don't care how fast the
internal engine has to run to produce my output. If my little Alfa Romeo is
tooling down the highway at 70 MPH with an internal engine cycle time of
3100 RPM, and I get passed by a Ferrari doing 110 MPH with an internal
engine speed of 4900 RPM, who is going faster? Certainly not me, no matter
how you multiply the numbers! My MPH/RPM is a little higher, but I got my
doors blown off nonetheless.

So if I am able to build some bizarre semi-synchronous architecture with a
2 GHz clock rate, does it mean my machine is slower (when you divide out the
clock in MFlops/MHz)? I don't think so. If we are looking for an esoteric
comparison of architectural efficiency, *then* perhaps we have a reasonable
metric here.

Now, wasn't it interesting how the MIPS machines appeared at the top of the
performance chart in the initial posting by Mashey? Personally, I think RISC
architectures are a good idea, so I'm not arguing architectural values here.
But RISC looks just *great* when you use the clever little formula of
MFlops/MHz. All I care about, though, is who gets my jobs done the fastest.

--> The standard disclaimer: my opinions are my own, so there, nyaa nyaa.
-- 
--=============--=============--=============--=============--=============--
Daniel V. Klein, who lives in Pittsburgh, allegedly works for the Software
Engineering Institute, and strives to survive as best he can.
ARPA: dvk@sei.cmu.edu    USENET: {ucbvax,harvard,cadre}!dvk@sei.cmu.edu
"The only thing that separates us from the animals is superstition and
mindless rituals."
guy@sun.uucp (Guy Harris) (11/06/86)
> I really don't care how fast the internal engine has to run to produce my
> output. If my little Alfa Romeo is tooling down the highway at 70 MPH with
> an internal engine cycle time of 3100 RPM, and I get passed by a Ferrari
> doing 110 MPH with an internal engine speed of 4900 RPM, who is going
> faster? Certainly not me, no matter how you multiply the numbers!
> My MPH/RPM is a little higher, but I got my doors blown off nonetheless.

Yes, but what if:

	1) Horsepower, say, were linearly proportional to RPM
	2) The horsepower needed by both cars to sustain a particular speed
	   were the same
	3) Your Alfa had a redline of 20,000 RPM, while the Ferrari had a
	   redline of 6000 RPM
	4) "All other things are equal"

Then just step on the gas hard enough to get near the redline, and blow the
Ferrari's doors off.

I believe Mashey's thesis is that this is more-or-less the proper analogy:
the maximum clock rate possible is mainly a function of the chip technology,
not the architecture, so an architecture that gets more work done per clock
tick can ultimately be made to run faster than ones that get less work done
per clock tick. I shall voice no opinion on whether this is the case or not
(I don't know enough to *have* an opinion on this), but will just let the
chip designers battle it out.

> So if I am able to build some bizarre semi-synchronous architecture with a
> 2 GHz clock rate, does it mean my machine is slower (when you divide out
> the clock in MFlops/MHz)? I don't think so.

Since MFlops/MHz is !N*O*T! a measure of machine speed, and was never
intended as such by Mashey, the machine is neither faster nor slower "when
you divide out the clock in MFlops/MHz". If you don't divide out the clock,
no, it doesn't mean your machine is slower. Nobody would argue that it did.

> If we are looking for an esoteric comparison of architectural efficiency,
> *then* perhaps we have a reasonable metric here.

Well, what did you *think* MFlops/MHz was intended as?
It *was* intended for comparing architectural efficiency! Please, people, before you flame this measure as absurd, make sure you're not flaming it for not being a measure of raw speed; it wasn't *intended* to be a measure of raw speed. *You*, the end-user, may not be interested in architectural efficiency, but may only be interested in "how fast something gets your job done"; the person who has to design and build that something, however, is going to be interested in architectural efficiency. -- Guy Harris {ihnp4, decvax, seismo, decwrl, ...}!sun!guy guy@sun.com (or guy@sun.arpa)
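For what it's worth, the redline hypothetical above can be made numeric. The
70 MPH / 3100 RPM and 110 MPH / 4900 RPM figures come from Klein's post and
the redlines from Harris's assumptions; the linear speed-vs-RPM scaling is
part of the same hypothetical, not real automotive physics:

```python
# MPH per RPM for each car, from the figures in the two posts,
# assuming speed scales linearly with RPM ("all other things equal").
alfa_mph_per_rpm = 70 / 3100
ferrari_mph_per_rpm = 110 / 4900

# Klein's observation: the Alfa's MPH/RPM ("work per tick") is higher.
assert alfa_mph_per_rpm > ferrari_mph_per_rpm

# Harris's point: crank both to their hypothetical redlines and the
# higher-MPH/RPM car wins, despite being slower at the observed moment.
alfa_top = alfa_mph_per_rpm * 20000      # ~452 MPH at a 20,000 RPM redline
ferrari_top = ferrari_mph_per_rpm * 6000 # ~135 MPH at a 6000 RPM redline
assert alfa_top > ferrari_top
```

The analogy maps MPH/RPM to work-per-cycle and redline to the maximum clock
the technology allows, which is exactly the sense in which MFlops/MHz was
offered as an efficiency metric rather than a speed metric.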
kds@mipos3.UUCP (Ken Shoemaker ~) (11/07/86)
Arguments about architectural efficiency aside, you'd have an easier time
making a system that runs at 8 MHz than one that runs at 33 MHz (or
whatever), even if the overall memory access time requirement is the same.
And you'd have a much easier time making a system that goes at 16 MHz than
one that goes at 66 MHz.
-- 
The above views are personal. I've seen the future, I can't afford it...
Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|amdcad|qantel|pur-ee|scgvaxd|oliveb}!intelca!mipos3!kds
csnet/arpanet: kds@mipos3.intel.com
franka@mmintl.UUCP (Frank Adams) (01/01/87)
>> Comments? What sorts of metrics are important to the people who read
>> this newsgroup? What kinds of constraints? How do you buy machines?
>> If you buy CPU chips, how do you decide what to pick?
>
>The metrics I'm interested in measure speed. (Basically, I'm hooked
>on fast machines.) Other constraints are less interesting because:
>(1) I will buy the fastest machine I can afford, and (2) in terms of
>architecture, speed is the bottom line -- all else is just
>mitigating circumstances.

I must disagree. Reliability is at least as important as speed.

Frank Adams                        ihnp4!philabs!pwa-b!mmintl!franka
Multimate International    52 Oakland Ave North    E. Hartford, CT 06108