steve@kontron.UUCP (Steve McIntosh) (07/26/85)
[ From the DTACK newsletter #44 (August 1985) ]

"RISC ARCHITECTURES: Although simplified instruction sets were pioneered
by Seymour Cray and by IBM's 801 project, the RISC concept was
popularized by Patterson of UC Berkeley, assisted by the traditional
slave labor available to a university professor. UC Berkeley provides a
very nice ivory tower where competitive pressures and real-world
pragmatism need not be considered. Patterson provided a number of
benchmarks which conclusively proved that the RISC was superior to
anything around. He also chose the benchmarks, controlled the conditions
of the tests, and most particularly, chose to perform benchmarks in the
HLL language 'C' exclusively. This provides opportunities for much
mischief (as when Intel benchmarks its 16-megabyte addressing-range 286
EXCLUSIVELY in a 64k baby-pen).

( Author goes on for 7 paragraphs ... general description of RISC,
attributes performance advantages of RISC architecture (if any) to
sizable overlapping register set ... lack of software support for new
chips etc... )

WHAT IS A MIP?

Technically, a MIP is a million instructions per second. OK then, what's
an instruction? Ah! That's a very good question! Take, for example, the
following 68000 instruction:

	MOVE.W D7,(A3)+

That instruction stores the lower word of the 32-bit register D7 at the
address contained in the 32-bit address register A3, and then increments
A3 by two (two bytes = one word). Here is that instruction's equivalent
for the Nat Semi 32016:

	MOVW D7,(SB)
	ADDQD 2,SB

The same operation in a hypothetical RISC machine:

	MOVE R7,(R#)
	LOAD 2,R4
	ADD R4,R#

For simplicity, suppose that those three computers each performed that
equivalent instruction (or instruction sequence) in exactly one
microsecond. Then the 68000 would be operating at 1 MIP, the 32000
series at 2 MIPS, and the hypothetical RISC machine at 3 MIPS.

EACH COMPUTER WOULD BE PERFORMING EXACTLY THE SAME AMOUNT OF WORK!
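[ The newsletter's point reduces to simple arithmetic. A short sketch, in
modern Python for convenience, of exactly the hypothetical example above
(instruction counts and the one-microsecond figure are the article's, not
measured data): ]

```python
# Model of the newsletter's MIPS argument: each machine finishes the
# same store-and-increment operation in 1 microsecond, but uses a
# different number of instructions to do it, so the MIPS ratings differ
# while the useful work per second is identical.
instructions_per_op = {"68000": 1, "32016": 2, "RISC": 3}
op_time_us = 1.0  # same elapsed time for all three, per the example

for cpu, n in instructions_per_op.items():
    mips = n / op_time_us            # instructions per microsecond == MIPS
    ops_per_sec = 1e6 / op_time_us   # identical useful work for all three
    print(f"{cpu}: {mips:.0f} MIPS, {ops_per_sec:.0f} operations/sec")
```

The MIPS column varies by a factor of three; the operations/sec column
does not move.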
When a RISC advocate proudly points to the MIP performance rating of his
preferred architecture, remember that a RISC HAS to run at a lot more
MIPS than a CISC machine just to keep up. WHAT IS A MIP, INDEED!

THE VON NEUMANN BOTTLENECK:

No von Neumann machine can run faster than allowed by its bus-bandwidth.
Both RISC and CISC architectures are von Neumann machines. Let us
examine those three equivalent instruction (sequences) from a
bus-bandwidth standpoint:

The 68000 requires two memory cycles, the 32016 three memory cycles and
the RISC machine either three or four memory cycles depending on whether
it includes a LOAD QUICK instruction which contains small integers
inside the instruction field.

In fact, the 68000 executes that instruction (singular in the case of
the 68000) in exactly two memory cycles (eight clocks). The 32016
requires AT LEAST three memory cycles (12 clocks) to execute, and can be
even slower than that if the instructions are not word aligned. (68000
instructions are ALWAYS word aligned.) The hypothetical RISC machine
requires 3 or 4 memory cycles to perform the same operation.

Therefore, the poor little 68000, operating at a mere 1 MIPS, is running
AT LEAST 50% faster than a 2 MIPS 32016 and 50% to 100% faster than a 3
MIPS RISC machine!

To further emphasize this important point, our hypothetical RISC machine
would have to have a memory cycle time 50% to 100% faster than the 68000
to have equal performance, and that would give it a 4.5 (50% faster) or
6.0 (100% faster) MIPS rating. If the 68000 is running at 12.5 MHz with
zero wait states using 120 nsec DRAM (say, two megabytes of it) then
JUST WHERE DO WE FIND DRAM FOR THAT RISC WHICH IS 50% TO 100% FASTER?
(At an affordable price - srmc)

... Since a 68000 in fact performs that instruction in 640 nsec,
corresponding to 1.56 MIPS, that hypothetical RISC machine would have to
run at 7.03 to 9.38 MIPS to equal the 12.5 MHz 68000!

WE STRONGLY SUGGEST THAT YOU IGNORE THE MIPS RATING OF RISC MACHINES!
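[ The 640 nsec, 1.56 MIPS, and 7.03-9.38 MIPS figures above all follow
from the quoted clock and cycle counts; a quick Python check of the
newsletter's own arithmetic (assumptions are exactly the article's
numbers): ]

```python
# 68000: the MOVE.W takes two memory cycles = 8 clocks at 12.5 MHz.
clock_mhz = 12.5
clocks_per_op = 8
op_ns = clocks_per_op / clock_mhz * 1000   # elapsed time for the operation
mips_68000 = 1000.0 / op_ns                # one instruction per op

# The RISC sequence needs 3-4 memory cycles vs. the 68000's 2, so its
# memory (and instruction rate) must be 1.5x to 2x faster to tie; at 3
# instructions per operation that sets its required MIPS rating.
risc_equal = [3 * factor * mips_68000 for factor in (1.5, 2.0)]

print(f"{op_ns:.0f} ns, {mips_68000:.4f} MIPS; RISC needs "
      f"{risc_equal[0]:.2f} to {risc_equal[1]:.2f} MIPS to keep up")
```

This reproduces 640 ns, 1.5625 MIPS, and the 7.03 to 9.38 MIPS range
claimed in the newsletter.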
RISC machines HAVE to have a high MIPS rating just to get out of their
own way!"

	DTACK GROUNDED			$15/10 issues (US & Canada)
	1415 E. McFadden, Ste. E	$25/10 issues elsewhere
	Santa Ana CA 92705		(US funds)

=======================================================================
I do not work for DTACK, I just enjoy the newsletter. Flames to them,
please (they like it, and DO publish flame letters that they get in
response to the newsletter.)
=======================================================================
Steve McIntosh, Kontron Electronics, Irvine CA / usual disclaimers /
mahar@weitek.UUCP (mahar) (07/29/85)
In article <419@kontron.UUCP>, steve@kontron.UUCP (Steve McIntosh) writes:
> "RISC ARCHITECTURES:
> what's an instruction? Ah! That's a very good question!
> Take, for example, the following 68000 instruction:
>
>	MOVE.W D7,(A3)+
>
> That instruction stores the lower word of the 32-bit register D7 at the
> address contained in the 32-bit address register A3, and then
> increments A3 by two (two bytes = one word).
> The same operation in a hypothetical RISC machine:
>
>	MOVE R7,(R#)
>	LOAD 2,R4
>	ADD R4,R#

On the Berkeley RISC, the equivalent instruction sequence is:

	MOVE R7,(R4)
	ADD #2,R4,R4

> For simplicity, suppose that those three computers each performed that
> equivalent instruction (or instruction sequence) in exactly one
> microsecond. Then the 68000 would be operating at 1 MIP, the 32000
> series at 2 MIPS, and the hypothetical RISC machine at 3 MIPS.
>
> EACH COMPUTER WOULD BE PERFORMING EXACTLY THE SAME AMOUNT OF WORK!
>
> When a RISC advocate proudly points to the MIP performance rating of
> his preferred architecture, remember that a RISC HAS to run at a lot
> more MIPS than a CISC machine just to keep up. WHAT IS A MIP, INDEED!

I agree that MIP is not a very good measure of computer performance. A
rule of thumb that I have heard is that the VAX 780 is 1 MIP. If your
computer does more work in less time than a 780, it is faster than 1
MIP. If it does less, it's slower.

> THE VON NEUMANN BOTTLENECK:
>
> No von Neumann machine can run faster than allowed by its
> bus-bandwidth. Both RISC and CISC architectures are von Neumann
> machines. Let us examine those three equivalent instruction (sequences)
> from a bus-bandwidth standpoint:
>
> The 68000 requires two memory cycles, the 32016 three memory cycles
> and the RISC machine either three or four memory cycles depending on
> whether it includes a LOAD QUICK instruction which contains small
> integers inside the instruction field.

Once again, the Berkeley RISC is two memory cycles.
> In fact, the 68000 executes that instruction (singular in the case of
> the 68000) in exactly two memory cycles (eight clocks). The 32016
> requires AT LEAST three memory cycles (12 clocks) to execute, and can
> be even slower than that if the instructions are not word aligned.
> (68000 instructions are ALWAYS word aligned) The hypothetical RISC
> machine requires 3 or 4 memory cycles to perform the same operation.

Since the Berkeley machine is also two memory cycles (2 clocks), this
is a tie.

> Therefore, the poor little 68000, operating at a mere 1 MIPS, is
> running AT LEAST 50% faster than a 2 MIPS 32016 and 50% to 100% faster
> than a 3 MIPS RISC machine!
>
> To further emphasize this important point, our hypothetical RISC
> machine would have to have a memory cycle time 50% to 100% faster than
> the 68000 to have equal performance, and that would give it a 4.5 (50%
> faster) or 6.0 (100% faster) MIPS rating. If the 68000 is running at
> 12.5Mhz with zero wait states using 120 nsec DRAM (say, two megabytes
> of it) then JUST WHERE DO WE FIND DRAM FOR THAT RISC WHICH IS 50% TO
> 100% FASTER? (At an affordable price - srmc)

With that same 120 nsec DRAM, the Berkeley RISC could run at 8 MHz.
Since the 68000 does its work in 8 cycles, it takes 8/12.5 or about 640
nsec. The Berkeley RISC does its work in 2 cycles. So, at 8 MHz it takes
2/8 or 250 nsec. All for the same memory bandwidth.

In fact, however, the Berkeley RISC only runs at about 4 MHz. The memory
cycle is 250 nsec. The same sequence would be 500 nsec. A 4 MHz RISC
takes 25/32 the time to do what a 12.5 MHz 68000 would do.

I agree that, in the example given, the RISC took twice as many
instructions to do the same job. The MIP designation is ambiguous. One
must look at how much work is done in a given amount of time.
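[ The 25/32 figure checks out; a Python sketch of the comparison, using
only the numbers quoted in this post (640 nsec for the 68000, 2 cycles
at the actual ~4 MHz for the Berkeley RISC): ]

```python
from fractions import Fraction

t_68000_ns = 8 / 12.5 * 1000   # 8 clocks at 12.5 MHz -> 640 nsec
t_risc_ns = 2 / 4.0 * 1000     # 2 cycles at ~4 MHz -> 500 nsec

# Ratio of elapsed times for the same store-and-increment operation.
ratio = Fraction(int(t_risc_ns), int(t_68000_ns))
print(ratio)   # the 25/32 claimed in the post
```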
wfmans@ihuxb.UUCP (w. mansfield) (08/01/85)
No, I'm not going to repeat the whole article. It contrasted RISC and
CISC micros and determined from the MIP ratings that RISC micros are
silly.

1. MIP ratings have been discussed here before, and all agree that they
are a stupid measure of anything. A RISC will generally achieve one
instruction per cycle (that's the goal, anyway), while a CISC requires
many cycles to do an instruction. Also, typical CISCs report their MIP
ratings for their shortest instruction (NOP?), and inflate their MIPs
accordingly.

2. The basic premise of RISC machines is to do the instructions that
are used often very fast. Indexing is a complex operation that just
isn't done that often (and which can usually be simplified by good
compilers (e.g. sophisticated compilers as used by the 801 and MIPS
projects)).

3. It is becoming apparent to folks doing objective measurements from
models of RISCs that much of the performance of the RISC isn't from the
reduced instructions, it's from the register windows et al. Agreed,
these architectural improvements aren't part of RISC per se, but try
getting them to fit in silicon along with a complex instruction set.

No flames, just observations. Newsletter sounds interesting, like a
reprint of net.bellicose.
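[ For readers who haven't seen register windows: a toy Python model of
the Berkeley-style overlapping register set mentioned in point 3. The
sizes here are hypothetical, purely for illustration; the point is that
a call just slides a window pointer over one large physical register
file, so arguments pass through the overlap with no memory traffic: ]

```python
PHYS = [0] * 64            # one large physical register file
WINDOW, OVERLAP = 16, 6    # hypothetical window and overlap sizes
cwp = 0                    # current window pointer

def reg(i):
    # Index of logical register i within the current window.
    return (cwp + i) % len(PHYS)

def call():
    # A procedure call slides the window; the OVERLAP registers at the
    # boundary are shared between caller and callee.
    global cwp
    cwp += WINDOW - OVERLAP

PHYS[reg(WINDOW - OVERLAP)] = 42   # caller writes its first "out" register
call()
print(PHYS[reg(0)])                # callee sees it as its first "in" register
```

No store or load was needed to pass the value, which is where much of
the measured performance is said to come from.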
hammond@petrus.UUCP (Rich A. Hammond) (08/02/85)
> In article <419@kontron.UUCP>, steve@kontron.UUCP (Steve McIntosh) writes:
> > Take, for example, the following 68000 instruction:
> >
> >	MOVE.W D7,(A3)+

This takes 2 16-bit memory cycles (instruction fetch, operand store).

> On the Berkeley RISC, the equivalent instruction sequence is:
>
>	MOVE R7,(R4)
>	ADD #2,R4,R4

This takes 3 32-bit memory cycles, i.e. 3 instruction cycles
(instruction fetch, operand store, instruction fetch). The RISC I takes
3 clock cycles per instruction cycle (see Computer, Sept. '82). The
Berkeley RISC takes 2 instruction cycles per load/store (see CACM Jan
'85).

The point was that for the above operation, the more compact encoding
of the 68000 requires fewer memory cycles and hence is faster. The
number of clock cycles per memory cycle, assuming a reasonable
architecture, is irrelevant, since the RISC can do at most 1
instruction/memory cycle since it has to fetch an instruction. Note that
the 68020 in fact uses only 3 clock cycles per memory cycle (like the
RISC I).

What does all this have to do with RISC vs CISC? Is the auto-increment
mode common in compiler generated code? How about other operations? In
other words, I can accept that certain operations will be slower if the
overall performance improves, so picking on an individual sequence only
helps if we know its relative frequency in real code.

A side note: the Berkeley RISCs have no absolute addressing mode; they
fake it by using R0 (always 0) plus an offset. BUT, the offset can only
be 13 bits, hence they can only absolutely address the first 2**13
locations in memory. Large programs, e.g. the UNIX kernel (particularly
from Berkeley), use much more than 2**13 (like 2**19) for instructions,
hence the problem is how well would a RISC do when it takes 2
instructions to form an absolute address and probably requires a
register? I'll accept RISCs when I see one running 4.3 BSD faster than
an 11/780.
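[ The 13-bit limit in the side note can be illustrated with a small
Python sketch. The "load upper part first" scheme modeled here is a
hypothetical two-instruction sequence of the general kind being
described, not the exact Berkeley RISC encoding: ]

```python
OFFSET_BITS = 13   # offset field size per the post

def instructions_for(addr):
    # Addresses below 2**13 fit the immediate field of one load/store.
    if addr < 2**OFFSET_BITS:
        return 1
    # Otherwise an extra instruction must first build the high part of
    # the address in a register; the low bits ride in the offset field.
    hi = addr >> OFFSET_BITS
    lo = addr & (2**OFFSET_BITS - 1)
    assert (hi << OFFSET_BITS) | lo == addr  # the pair recovers the address
    return 2

# A small kernel address vs. a 2**19-sized kernel text address.
print(instructions_for(0x1000), instructions_for(2**19))
```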
joel@peora.UUCP (Joel Upchurch) (08/05/85)
>A side note: the Berkeley RISCs have no absolute addressing mode;
>they fake it by using R0 (always 0) plus an offset. BUT, the offset
>can only be 13 bits, hence they can only absolutely address the first
>2**13 locations in memory. Large programs, e.g. the UNIX kernel
>(particularly from Berkeley), use much more than 2**13 (like 2**19) for
>instructions, hence the problem is how well would a RISC do when it
>takes 2 instructions to form an absolute address and probably requires
>a register? I'll accept RISCs when I see one running 4.3 BSD faster
>than an 11/780.

I would like to point out that the IBM 370 (usually considered a CISC
:-> ) doesn't have absolute addressing and that it only has a
displacement of 2**12 bytes. They seem to be able to get some rather
large operating systems, including UNIX, to run on it.
hammond@petrus.UUCP (Rich A. Hammond) (08/06/85)
> >... The Berkeley RISCs have no absolute addressing mode ...
> >I'll accept RISCs when I see one running 4.3 BSD faster than an 11/780.
>
> I would like to point out that the IBM 370 (usually considered
> a CISC :-> ) doesn't have absolute addressing and that it only
> has a displacement of 2**12 bytes. They seem to be able to get
> some rather large operating systems, including UNIX, to run on
> it.

I never claimed it was impossible to get something running; I was more
concerned with whether the RISC retains its speed advantage when faced
with large amounts of absolute addressing. Perhaps a couple of global
registers used as base registers (a la the IBM 360) would cover the
commonly accessed global data structures (or a smart loader would pack
most of them together).

Anyway, the RISC claims will always be slightly dubious to me until I
actually see a machine performing as claimed in a real situation.
Hidden gotchas have a way of getting missed in paper exercises.
jtb@kitc.UUCP (John Burgess) (08/06/85)
I hate to be picky folks, but MIP is not an acronym of anything. The
correct acronym is MIPS -- Million Instructions Per Second. What is a
"Million Instructions Per"???

Note that MIPS is both singular and plural. (Actually it's always
plural, but saying 1 MIPS makes it look singular.) Here's another way
to tell that the S is necessary: if it weren't, you'd say 1 MIP, but 2
MIPs (lower-case s) to make it plural.

OK, enough already. Just mind your Ps and Qs and Ss!
-- 
John Burgess	ATT-IS Labs, So. Plainfield NJ (HP 1C-221)
{most Action Central sites}!kitc!jtb
(201) 561-7100 x2481 (8-259-2481)
darrell@sdcsvax.UUCP (Darrell Long) (08/07/85)
In article <437@petrus.UUCP> hammond@petrus.UUCP (Rich A. Hammond) writes:
>[...] I'll accept RISCs when
>I see one running 4.3 BSD faster than an 11/780.

You should see how much faster our Pyramid runs 4.2 than our 11/780!
-- 
Darrell Long
Department of Electrical Engineering and Computer Science
University of California, San Diego

USENET: sdcsvax!darrell
ARPA:   darrell@sdcsvax
bcase@uiucdcs.Uiuc.ARPA (08/07/85)
/* Written 7:28 am Aug 6, 1985 by hammond@petrus.UUCP in uiucdcs:net.arch */
Anyway, the RISC claims will always be slightly dubious to me until I
actually see a machine performing as claimed in a real situation.
Hidden gotchas have a way of getting missed in paper exercises.
/* End of text from uiucdcs:net.arch */

You probably won't have to wait TOO long, since Acorn computer co. of
England just announced (in Electronics magazine) that they have good,
working die, the first time, after only 18 months of effort by 4
people. The machine is able to sustain 3 of its MIPS and, running real
programs (code produced by compilers), was "about 2 times a VAX 11/780"
and 10 times an IBM PC-AT. The article did not mention anything about
memory management, so they may or may not be winning because of the
lack of translation.

Anyway, they will be selling an evaluation board for $2000 (which
contains 1 MByte of memory, the processor and some bootstrap ROM) which
plugs into the $400 Acorn 6502-based microcomputer. Thus for about
$3000, a person can get a really nice personal workstation, and what
performance. Oh, the board comes with a BCPL compiler and a MODULA-2
compiler, a small operating system, and a window-oriented text editor.
More software is on the way (C, Pascal, etc.).

It is true that this RISC will have about the same performance as the
68020, but this RISC is fabricated with MUCH less aggressive
technology, and when shrunk to modern design rules, will probably be
significantly faster than more complex 32-bitters (the minor cycle time
of a 68020 is 60 ns (at 16.67 MHz) while the minor cycle time of this
RISC is 150 ns). True, we should not count our RISCs before they are
hatched, but there are real advantages to RISC.
hammond@petrus.UUCP (Rich A. Hammond) (08/08/85)
> In article <437@petrus.UUCP> I said
> >[...] I'll accept RISCs when
> >I see one running 4.3 BSD faster than an 11/780.
>
> Darrell Long replies:
> You should see how much faster our Pyramid runs 4.2 than our 11/780!

1) I'm not sure I'll accept the claim that the Pyramids, Ridge, ... are
truly RISC machines; they have taken the overlapping registers idea,
and that alone, even on a CISC, gives a great advantage.

2) What I should have said was that I wanted to see a RISC chip running
4.? BSD faster than a VAX. The claims from UCB about the RISC I & II
were based on simulations which avoided the nasty problems of making
the kernel run. As I noted before, hidden gotchas have a way of popping
up when you actually try to get something running. Also, I want a chip
fabricated with the technology used for the M68000 when it came out. It
seems clear that since a 68020 can run faster than a 780, a RISC chip
made now with leading edge technology should also run faster.

3) The claims that a RISC is better have to be taken with 3 provisions:

a) Technology is important (i.e. if you need to have 1 memory cycle per
instruction, you'd better have fairly fast memory relative to the CPU
implementation. This is the current state, but it may change.)

b) I take a large grain of salt with the claim that the RISC was
designed faster than conventional micros, since a lot of what I suspect
is complex on other micros is interrupt, trap, supervisor vs non-priv
support and documentation, none of which the UCB people did much of.

c) Although UCB claims to have "avoided complications" in their
comparisons by using the same technology for the compiler (pcc), I
think they introduced a very serious bias. The pcc was never designed
to generate good code, and the RISC architecture might simply be a
better match for pcc than a CISC architecture. This seems to be
supported by the CACM article, which said that recoding some of the
benchmarks for the RISC, 68000 and Z8000 in assembly resulted in code
which was 1/2 the size of the RISC and significantly faster.

In summary, I like the ideas of RISC; I'm not convinced they're the
only way to go, but it was a good area to explore.

Rich Hammond
mash@mips.UUCP (John Mashey) (08/11/85)
> >> Anyway, the RISC claims will always be slightly dubious to me until I
> >> actually see a machine performing as claimed in a real situation.
> >> Hidden gotchas have a way of getting missed in paper exercises.
>
> You probably won't have to wait TOO long since Acorn computer co. of
> England just announced (in Electronics magazine) that they have good,
> working die, the first time, after only 18 months of effort by 4
> people. The machine is able to sustain 3 of its MIPS and running
> real programs (code produced by compilers) was "about 2 times a VAX
> 11/780" and 10 times an IBM PC-AT. The article did not mention anything
> about memory management, so they may or may not be winning because of
> the lack of translation.....
> .... True, we should not count our
> RISCs before they are hatched, but there are real advantages to RISC.

1) Real advantages to RISC: yes.

2) Lack of memory management: unless one is just interested in a point
product with a fairly narrow performance range, one must be exceedingly
careful in memory management design for RISC, or you rapidly discover
that you get a chip OK for controllers or unprotected systems, and
near-useless for reasonable multi-tasking operating systems.

3) (Opinion, perhaps quite biased): it's pretty easy to do a RISC
that's 2X a 780. What's harder, but necessary, is to figure out how the
same chip architecture gets you to 5X, and soon 10X, in a reasonable
way; just running the clock faster doesn't do it.
-- 
-john mashey	UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD: 415-960-1200
USPS: MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043
henry@utzoo.UUCP (Henry Spencer) (08/13/85)
> ... I was more
> concerned with whether the RISC retains its speed advantage when faced
> with large amounts of absolute addressing.

What, pray tell, would *require* large amounts of absolute addressing?
The number of simple variables in a program is generally modest, and
they tend to be local rather than global. Arrays and such tend to
require address arithmetic anyway, so there is little penalty in simply
parking a pointer to them in an anonymous simple variable.
-- 
Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,linus,decvax}!utzoo!henry
boston@celerity.UUCP (Boston Office) (08/14/85)
In article <419@kontron.UUCP> steve@kontron.UUCP (Steve McIntosh) writes:
>[ From the DTACK newsletter #44 (August 1985) ]
>
>"RISC ARCHITECTURES:
>
>WHAT IS A MIP?
>
>Technically, a MIP is a million instructions per second. OK then,
>what's an instruction? Ah! That's a very good question!
>
>Take, for example, the following 68000 instruction:
>
>	MOVE.W D7,(A3)+
>
>That instruction stores the lower word of the 32-bit register D7 at the
>address contained in the 32-bit address register A3, and then
>increments A3 by two (two bytes = one word). Here is that instruction's
>equivalent for the Nat Semi 32016:
>
>	MOVW D7,(SB)
>	ADDQD 2,SB
>
>The same operation in a hypothetical RISC machine:
>
>	MOVE R7,(R#)
>	LOAD 2,R4
>	ADD R4,R#
>
>For simplicity, suppose that those three computers each performed that
>equivalent instruction (or instruction sequence) in exactly one
>microsecond. Then the 68000 would be operating at 1 MIP, the 32000
>series at 2 MIPS, and the hypothetical RISC machine at 3 MIPS.
>
>EACH COMPUTER WOULD BE PERFORMING EXACTLY THE SAME AMOUNT OF WORK!
> ...

PRECISELY! That is why, when evaluating a system's power, we need
standards apart from the individual architecture. Ask for Whetstone
MIPS if you seek a general raw-power figure, or other benchmarks that
better reflect your needs. (When Celerity quotes MIPS, for example, we
quote Whetstone MIPS - even within a RISC-like architecture.)
eugene@ames.UUCP (Eugene Miya) (08/19/85)
I have been following this discussion for some time. So what's the
bottom line? If architecture is your thing, the proof of the pudding is
going to be running systems. The critics of RISC then focus their
attack on a definition of MIPS. We've had two (or more) "What is a
MIP?" letters in addition to the original one.

A couple of years ago, just before the Winter Usenix conference, I saw
the Massively Parallel Processor (MPP). It is rated at a peak speed of
16 billion instructions per second (16 GIPS). They only add later that
this is an 8-bit word with 1-bit serial data paths. It burns me up when
recently I saw a television ad that Goodyear Aerospace did this `great'
thing for NASA at the cost of 10 million dollars. {oops, sorry, got out
of hand, flame off}

A problem is, certainly, how we measure things. One letter brought out
the need to define what an instruction was. The letter did not
specifically mention a property by name: that was `atomicity.' Another
problem is a common base set of units: is there an appropriate
conversion factor of a 64-bit instruction to an 8-bit one? Is a factor
of 8 good enough? Probably not. Let's keep "standards" out of the
discussion for now, and explore this a bit. Whetstones were mentioned
in another letter, but the only people who use these are computer
manufacturers. [MGT]FLOPS are another ghastly measure. What qualities
do our performance metrics need to have?

--eugene miya
  NASA Ames Research Center
  {hplabs,ihnp4,dual,hao,decwrl,allegra}!ames!aurora!eugene
  emiya@ames-vmsb
gas@lanl.ARPA (08/20/85)
Another metric is a sampling of large, mostly unmodifiable, commercial
codes. My preference is MSC/Nastran (0.5e6+ lines of finite element
code). It runs on a surprising number of machines, and is a rigorous
test of not only performance (CPU and I/O) but the
scientific/engineering environment available to the average user. It
helps draw the line between special purpose and general purpose
environments (or, less tactfully, usable and unusable machines).

george spix
gas@lanl
(MSC - MacNeal-Schwendler Corporation, Los Angeles)
jer@peora.UUCP (J. Eric Roskos) (08/22/85)
> Whetstones were mentioned in another letter, but the only people who use
> these are computer manufacturers.

This statement isn't true. For example, back when I was a
graduate-student researcher in computer architectures, we used the
Whetstones to test our vertical-migration software.

> What qualities do our performance metrics need to have?

I think you need to make your performance measurements in such a way
that you get a set of distinct numbers which can be used analytically
to determine performance for a given program if you know certain
properties of the program. For example:

1) The rate of execution of each member of the set of arithmetic
operations provided by the machine's instruction set, assuming the
operands are all in registers, with cache disabled. This would give you
an approximation of the execution time of the algorithms for the
arithmetic operations, along with the instruction-fetch times, when not
aided by caching.

2) The rate of execution of 1-word memory-to-memory moves, with cache
disabled. This gives you the word-sized operand fetch and store times,
along with (again) the instruction-fetch times for these instructions.

3) The rate of execution of a tight loop performing only
register-to-register moves, with cache disabled.

4) The rate of execution of a tight loop performing only
register-to-register moves, with cache enabled.

5) The rate of execution of a tight loop performing (same word size as
#3 and #4 above) memory-to-memory moves that produce all cache "hits",
with cache enabled. Note that this gives you two properties of your
cache: your speedup for operand fetch and store resulting from caching,
and any performance penalties resulting from a write-through vs.
write-back cache.

6) Specifications such as the number of registers available to the
user, the size of the cache, etc.

Well, you get the idea, anyway...
Personally, I tend to feel that statistical performance measurements
are not nearly as useful as analytical ones; I would rather see a list
of fairly distinct performance properties of a processor anytime, since
I think you can do more with them in terms of saying how the machine
will perform for a given application that way. To do this, you do need
to understand your application, though.

I separated out the various forms of caching (operations in registers,
and use of a cache between the CPU and the primary memory) because so
many people "fudge" their results that way without giving any
information from which you can determine real performance.

The above list is just meant to suggest "qualities" rather than being
an exhaustive list; i.e., the performance metrics should reveal (rather
than hide) the set of factors that actually influence performance.
[Unfortunately, this would never suit most marketing organizations nor
customers, since they want an all-encompassing number.] The metrics
should also be compiler-independent, especially if you are making
measurements on microcomputers, since the majority of microcomputer
compilers today generate terrible object code (see my posting awhile
back of a "hand-compiled" 68000 program for the Macintosh in
net.micro.mac for an example of how bad this can be (and how little the
significance of this was understood!)).
-- 
Shyy-Anzr: J. Eric Roskos
UUCP: ..!{decvax,ucbvax,ihnp4}!vax135!petsd!peora!jer
US Mail: MS 795; Perkin-Elmer SDC;
2486 Sand Lake Road, Orlando, FL 32809-7642
"Gurl ubyq gur fxl/Ba gur bgure fvqr/Bs obeqreyvarf..."
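[ The analytic approach amounts to simple bookkeeping: once you have a
rate for each distinct primitive, a program's run time is predicted
from its operation mix. A Python sketch with entirely hypothetical
rates and mix, just to show the shape of the calculation: ]

```python
# Hypothetical measured rates for distinct primitives, in operations/sec.
rates = {
    "reg_arith": 5.0e6,     # register-to-register arithmetic
    "mem_move": 1.2e6,      # memory-to-memory move, cache disabled
    "cached_move": 3.0e6,   # memory-to-memory move, all cache hits
}

# Hypothetical operation mix of some program of interest.
program_mix = {"reg_arith": 800_000, "mem_move": 150_000, "cached_move": 50_000}

# Predicted run time: sum over primitives of (count / rate).
predicted_s = sum(count / rates[op] for op, count in program_mix.items())
print(f"predicted run time: {predicted_s:.3f} s")
```

One all-encompassing MIPS number could not produce this prediction; the
distinct per-primitive rates can.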
chuck@dartvax.UUCP (Chuck Simmons) (08/23/85)
> A problem is, certainly, how we measure things. One letter brought out
> the need to define what an instruction was. The letter did not specifically
> mention a property by name: that was `atomicity.' ...
> Whetstones were mentioned in another letter, but the only people
> who use these are computer manufacturers. [MGT]FLOPS are another ghastly
> measure. What qualities do our performance metrics need to have?
>
> --eugene miya

It might be interesting to define some fairly simple standard
operations and ask how long it takes to perform the operations. Typical
standard operations might be (high-level language pseudo-code in
parens):

add -- takes two words (at least 32 bits) from memory, adds them
together, and puts the result back in memory. (A := B + C)

index -- picks up an array offset from memory, performs bounds checking
on the offset (we don't all write in C), and loads the indexed element
into a register. (A[i])

ptr_load -- picks up a pointer and an offset into a record and loads
the appropriate word from the pointed-to record. (P->Record.Field)

array_loop -- load each element of an array into a register. It is
cheating to assume that the array contains a special value at either
end. (for i = 1 to n do ... A[i] ...;)

The advantage of using these simple operations instead of FLOPS is that
a lot of programs don't use floating point operations very much. These
simple operations would be a better measure than even simpler
instructions because each operation does something "useful". These
operations can also have advantages over high-level language benchmarks
because they are not dependent on the quality of a compiler.

The qualities that I am aiming for here are primarily usefulness and
simplicity. Each of the above operations will be found in a wide
variety of programs, and each operation should be easy to implement on
most machines in that machine's native assembler. These are short
pieces of code that a compiler would generate fairly often.
-- 
Chuck
chuck@dartvax
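[ The four proposed operations can be pinned down precisely with a short
Python sketch. This is only a reference semantics for what each
operation must do; on a real machine each would be a few assembler
instructions, timed in a tight loop: ]

```python
def add(mem, b, c, a):
    # A := B + C, all three operands living in "memory" (a dict here).
    mem[a] = mem[b] + mem[c]

def index(arr, i):
    # A[i] with explicit bounds checking (we don't all write in C).
    if not 0 <= i < len(arr):
        raise IndexError(i)
    return arr[i]

def ptr_load(record, field):
    # P->Record.Field: follow a pointer, load one field of the record.
    return record[field]

def array_loop(arr):
    # Touch every element via its index; no sentinel value at either end.
    total = 0
    for i in range(len(arr)):
        total += index(arr, i)
    return total

mem = {"B": 2, "C": 3, "A": 0}
add(mem, "B", "C", "A")
print(mem["A"], array_loop([1, 2, 3, 4]))
```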
bobbyo@celerity.UUCP (Bob Ollerton) (09/03/85)
I agree that using some real, heavy duty, commercial codes can be a
good way of measuring the performance of CPU architectures. RISC CPUs
can sometimes be difficult to get a handle on if the particular
implementation is strong in some cases, and weak in others.

Here are some results from a finite element modeler from Swanson
Analysis, called ANSYS. It is a large Fortran program written quite a
few years ago and continuously enhanced. It uses both single and double
precision math, I/O, and lots of virtual memory. Please note that these
results, while supplied by the various vendors, are being presented to
you from a biased source: me.

---------------------------------------------------------------------
Combined ANSYS benchmarks SP1, SP2, SP3, SP4:

	Vendor			CPU seconds
	-----------------------------
	Prime 750		6505
	RIDGE 32		5750
	APOLLO X60		5372
	VAX 780 w/fpa		4574
	DG MV8000		3290
	IBM 4341-1		2973
	Celerity C1200		2506

-- 
Bob Ollerton; Celerity Computing; 9692 Via Excelencia;
San Diego, Ca 92126; (619) 271 9940
{decvax || ucbvax || ihnp4}!sdcsvax!celerity!bobbyo
akgua!celerity!bobbyo