reiter@endor.harvard.edu (Ehud Reiter) (03/17/87)
In article <15192@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes: >The Am29000 ... 25 MHz clock (40 ns cycle time) ... >25 MIPS max., 17 MIPS sustained running big programs MIPS is of course one of the most unfortunate terms in computer performance. It seems to have two meanings: a) how much faster a manufacturer thinks his machine is compared to a VAX-11/780 (usually comparing integer C programs against a 4.2 BSD VAX using standard Berkeley cc). b) the number of million instructions per second that a computer executes. In a sane world, (b) would be the common definition. Unfortunately, in our world, (a) seems to be the common definition. So, when someone uses the word "MIPS" and means (b) (as seems to be the case in the above posting), I suspect that many people get confused and think that definition (a) was meant. I don't mean to criticize the Am29000, which I'm sure is a fine machine. But in the interest of taking whatever feeble measures are possible to reduce the confusion level in this area, I wish people would stick to definition (a) of "MIPS", unless they explicitly say they're using definition (b). Ehud Reiter reiter@harvard (ARPA,BITNET,UUCP)
tim@amdcad.UUCP (Tim Olson) (03/18/87)
In article <1423@husc6.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes: >In article <15192@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes: >>The Am29000 ... 25 MHz clock (40 ns cycle time) ... >>25 MIPS max., 17 MIPS sustained running big programs > >MIPS is of course one of the most unfortunate terms in computer performance. >It seems to have two meanings: > a) how much faster a manufacturer thinks his machine is compared >to a VAX-11/780 (usually comparing integer C programs against a 4.2 BSD VAX >using standard Berkeley cc). > b) the number of million instructions per second that a computer executes. > You are correct; the numbers quoted above are Am29000 MIPS. The 17 MIPS sustained average translates to around 14 - 15 VEMs (Vax-Equivalent MIPS) (Gee, maybe we should use that term for b! :-) "MIPS" is not necessarily a meaningless indicator. It can provide information on processor throughput (i.e. how much the processor is affected by pipeline stalls, jumps, cache and TLB misses...) when running real programs. However, it must be combined with the number of instructions executed to derive the real measure of performance (1/s). -- Tim Olson Advanced Micro Devices Product Planning, Programmable Processors (tim@amdcad)
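The point about combining native MIPS with the instruction count can be made concrete with a small sketch. The function names and the second machine's figures below are hypothetical, chosen only for illustration; the only number taken from the thread is the 17 MIPS sustained rate.

```python
def execution_time_seconds(instruction_count: int, native_mips: float) -> float:
    """Time = instructions executed / (native MIPS * 1e6 instructions per second)."""
    return instruction_count / (native_mips * 1e6)


def runs_per_second(instruction_count: int, native_mips: float) -> float:
    """The '1/s' measure: reciprocal of the execution time of one run."""
    return 1.0 / execution_time_seconds(instruction_count, native_mips)


# Hypothetical workload: the same program needs a different number of
# instructions on each machine (these counts are invented for illustration).
machine_a = runs_per_second(instruction_count=17_000_000, native_mips=17.0)
machine_b = runs_per_second(instruction_count=10_000_000, native_mips=12.0)

print(f"machine A: {machine_a:.2f} runs/s at 17 native MIPS")
print(f"machine B: {machine_b:.2f} runs/s at 12 native MIPS")
```

Under these assumed instruction counts, the machine with the lower native MIPS rating finishes the program sooner, which is why the native figure alone does not rank machines.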
tim@amdcad.UUCP (Tim Olson) (03/18/87)
>(Gee, maybe we should use that term for b! :-)
^
I guess it's too early in the morning for me -- I meant "a" here.
-- Tim Olson
Advanced Micro Devices
Product Planning, Programmable Processors
(tim@amdcad)
klein%gravity@Sun.COM (Mike Klein) (03/18/87)
In article <1423@husc6.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes: >In article <15192@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes: >>The Am29000 ... 25 MHz clock (40 ns cycle time) ... >>25 MIPS max., 17 MIPS sustained running big programs > >I don't mean to criticize the Am29000, which I'm sure is a fine machine. But >in the interest of taking whatever feeble measures are possible to reduce the >confusion level in this area, I wish people would stick to definition (a) >of "MIPS", unless they explicitly say they're using definition (b). In this case, the meaning of Brian Case's MIPS should be pretty obvious. A 40 nS cycle means 25 peak MIPS, for the case where the Am29000 is executing a loop of NOPs out of its on-board I cache (as one example). Elsewhere, I saw it mentioned that the average (simulated) cycles per instruction on this machine is about 1.5 on large C programs; dividing 25 MIPS by 1.5 gives you 16.67 MIPS. Neither number is a comparison against a VAX/780. I don't see any numbers on the performance impacts of things like cache misses, interrupts, and the processor running an operating system. -- Mike Klein klein@sun.{arpa,com} Sun Microsystems, Inc. {ucbvax,hplabs,ihnp4,seismo}!sun!klein Mountain View, CA
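The arithmetic in the post above can be reproduced in a few lines. This is only a sketch of the calculation as described (one instruction per cycle at peak, then division by the average cycles per instruction); the function names are made up for illustration.

```python
def peak_mips(cycle_time_ns: float) -> float:
    """Peak native MIPS assuming one instruction completes every cycle.

    1e9 ns per second / cycle_time_ns gives instructions per second;
    dividing by 1e6 converts to millions, hence the factor of 1e3.
    """
    return 1e3 / cycle_time_ns


def sustained_mips(cycle_time_ns: float, cycles_per_instruction: float) -> float:
    """Sustained native MIPS for a given average cycles per instruction."""
    return peak_mips(cycle_time_ns) / cycles_per_instruction


peak = peak_mips(40.0)                 # 40 ns cycle time from the posting
sustained = sustained_mips(40.0, 1.5)  # about 1.5 cycles per instruction

print(f"peak: {peak:.2f} MIPS, sustained: {sustained:.2f} MIPS")
```

Neither figure involves a VAX-11/780 anywhere in the calculation, which is the point being made: both are definition (b) numbers.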
cmcmanis@sun.uucp (Chuck McManis) (03/19/87)
Recently several people have pointed out the MIPS debate to those on the net that haven't been exposed to it yet (lucky them). I propose that in the future the small letter 'm' be prepended to machine specifications in 'machine' MIPS, and the small letter 'v' be prepended to specifications referring to 'Vax 11/780' MIPS. So the AMD29000 would have a performance rate of 17 mMIPS and the Sun 3/200 is rated at 4 vMIPS. It would clear things up noticeably. Also, if you are relaying information, why not just use ?MIPS if you aren't sure which way your source was representing them. -- --Chuck McManis uucp: {anywhere}!sun!cmcmanis BIX: cmcmanis ARPAnet: cmcmanis@sun.com These opinions are my own and no one else's, but you knew that didn't you.
tim@amdcad.UUCP (Tim Olson) (03/19/87)
In article <15243@sun.uucp>, klein%gravity@Sun.COM (Mike Klein) writes: > A 40 nS cycle means 25 peak MIPS, for the case where the Am29000 is executing > a loop of NOPs out of its on-board I cache (as one example). Elsewhere, I saw > it mentioned that the average (simulated) cycles per instruction on this > machine is about 1.5 on large C programs; dividing 25 MIPS by 1.5 gives you > 16.67 MIPS. Neither number is a comparison against a VAX/780. > > I don't see any numbers on the performance impacts of things like cache misses, > interrupts, and the processor running an operating system. Perhaps a small description of our simulation environment is in order. Our internal simulator simulates the Am29000 in conjunction with an external memory environment, which may also include caches (both instruction and data). The simulator is written at a very detailed level, incorporating all of the possible pipeline stalls and exceptions (and their interactions) that may be encountered during the execution of a program. The external memory model used to derive these numbers consists of separate, 64K byte instruction and data caches, which have a 2-cycle access time, with a single-cycle burst mode interface. Main memory has a 4-cycle (160 ns) access with single-cycle burst for reload. Branch target cache misses, TLB misses and reloads, and external instruction and data cache misses were included in the simulations. Actually, external caches aren't required for decent performance; we can also interface to video-DRAMS quite well. With this simulation model, we have attempted to describe a real (i.e. semi-easily attainable) system, instead of a "hot-box". We have seen real programs running at 24.5 MIPS on this system. -- Tim Olson Advanced Micro Devices (tim@amdcad)
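As a rough cross-check of the figures in this post, the cycle counts can be converted to nanoseconds, and the 24.5 MIPS observation can be turned back into an implied cycles-per-instruction value. This is a sketch only: the helper names are hypothetical, and the 80 ns external-cache figure is derived here from the stated 2-cycle access time rather than quoted in the post.

```python
CYCLE_NS = 40.0  # 25 MHz clock, as stated earlier in the thread


def access_time_ns(cycles: int) -> float:
    """Convert an access time in processor cycles to nanoseconds."""
    return cycles * CYCLE_NS


def implied_cpi(native_mips: float) -> float:
    """Average cycles per instruction implied by an observed native MIPS rate."""
    peak = 1e3 / CYCLE_NS  # 25 native MIPS at one instruction per cycle
    return peak / native_mips


main_memory_ns = access_time_ns(4)     # matches the 160 ns quoted in the post
external_cache_ns = access_time_ns(2)  # derived: 2-cycle external cache access
best_case_cpi = implied_cpi(24.5)      # programs seen running at 24.5 MIPS

print(main_memory_ns, external_cache_ns, round(best_case_cpi, 3))
```

An implied CPI close to 1.0 for the 24.5 MIPS case suggests those programs rarely stalled, in contrast to the roughly 1.5 cycles per instruction cited for large C programs.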
lyang%jennifer@Sun.COM (Larry Yang) (03/19/87)
In article <15213@amdcad.UUCP> tim@amdcad.UUCP (Tim Olson) writes: >In article <1423@husc6.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes: >>In article <15192@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes: >>>The Am29000 ... 25 MHz clock (40 ns cycle time) ... >>>25 MIPS max., 17 MIPS sustained running big programs >> >>MIPS is of course one of the most unfortunate terms in computer performance. >>It seems to have two meanings: >> a) how much faster a manufacturer thinks his machine is compared >>to a VAX-11/780 (usually comparing integer C programs against a 4.2 BSD VAX >>using standard Berkeley cc). >> b) the number of million instructions per second that a computer executes. > >"MIPS" is not necessarily a meaningless indicator. It can provide >information on processor throughput (i.e. how much the processor is >affected by pipeline stalls, jumps, cache and TLB misses...) when >running real programs. However, it must be combined with the number of >instructions executed to derive the real measure of performance (1/s). If the field of computer architecture existed in a vacuum, then (b) would be the most important figure. You can look at it to study how the architecture performs despite branches, cache misses, etc. But processors exist in the context of the real world, where we all want to be able to compare processors against each other. That is why (a) is a much more meaningful figure. All in all, to use a *very* old joke: "MIPS = Meaningless Index of Processor Speed" ================================================================================ --Larry Yang [lyang@sun.com,{backbone}!sun!lyang]| A REAL _|> /\ | Sun Microsystems, Inc., Mountain View, CA | signature | | | /-\ |-\ /-\ "The attention span of a computer is only as | <|_/ \_| \_/\| |_\_| long as its electrical cord." | _/ _/
shebanow@ji.Berkeley.EDU (Mike Shebanow) (03/19/87)
In article <15217@amdcad.UUCP> tim@amdcad.UUCP (Tim Olson) writes: >Perhaps a small description of our simulation environment is in order. Our >internal simulator simulates the Am29000 in conjunction with an external >memory environment, which may also include caches (both instruction and data). >The simulator is written at a very detailed level, incorporating all of the >possible pipeline stalls and exceptions (and their interactions) that may be >encountered during the execution of a program. > >The external memory model used to derive these numbers consists of separate, >64K byte instruction and data caches, which have a 2-cycle access time, with >a single-cycle burst mode interface. Main memory has a 4-cycle (160 ns) >access with single-cycle burst for reload. Branch target cache misses, TLB >misses and reloads, and external instruction and data cache misses were >included in the simulations. Actually, external caches aren't required for >decent performance; we can also interface to video-DRAMS quite well. I have several questions regarding the simulation. It would be quite unfair to compare the running times for a real VAX 11/780 against a simulation, unless the following effects are included in the simulation: 1) Were cold start effects included? If so, how are they simulated? 2) How were page faults simulated? Are these times included? 3) Was logic for the cache and TLB (an assumption) simulated? 4) Are the effects of I/O simulated (say 10, 25, and 50% bus bandwidth consumed by I/O devices)? What model? 5) How are system times (I assume that UNIX is used) calculated? Is this work done by the UNIX kernel simulated? I don't mean to put the AM29000 down (as including all of the above into a simulation is beyond difficult), but using simulation times to compare performance against a real machine is unreasonable (as is a MIPS to MIPS comparison). Mike Shebanow shebanow@ji.berkeley.edu
tihor@acf4.UUCP (Stephen Tihor) (03/19/87)
Digital Product managers talking to intelligent users tend to refer to VUPS ("Vax Units of Performance" relative to the platinum-iridium VAX-11/780 in the bell jar in the Mill.). Historically, I was given to understand, the MIP referred to some IBM box.
tim@amdcad.UUCP (Tim Olson) (03/20/87)
Mike Shebanow writes: | I have several questions regarding the simulation. It would | be quite unfair to compare the running times for a real VAX 11/780 against | a simulation, unless the following effects are included in the simulation: +----- Agreed. You must always take *any* benchmark comparisons with a grain of salt. The numbers posted were only meant to provide a rough idea of the performance attainable. | 1) Were cold start effects included? If so, how are they simulated? +----- Yes. All caches (including the TLB) are initially invalidated. The simulated program exists in memory, but is "faulted in" to the TLB during execution. | 2) How were page faults simulated? Are these times included? +----- Page faults are a secondary effect of TLB misses, and times for them are greatly dependent on disk speed, etc. Page fault processing time is not included in the accumulated user time on real machines, so we don't include it either. We *do* count all of the processing (both user and system) time it takes to perform TLB miss handling and system call entry/exit. | 3) Was logic for the cache and TLB (an assumption) simulated? +----- Yes. All of the caches (both internal and external) were fully simulated with "real" misses, reloading, etc. The models are not derived from statistical averages, and the numbers we used (both size and access time) were taken from what we felt would be feasible both now and in the near future. | 4) Are the effects of I/O simulated (say 10, 25, and 50% bus bandwidth | consumed by I/O devices)? What model? +----- This opens a whole new "can of worms", and is why most benchmarks don't address the I/O issue. I/O effects were not simulated, but, as far as bus bandwidth is concerned, external cache reload is only consuming 20% to 25%, so concurrent DMA into memory should not degrade the performance too much. Note that our large number of registers reduce the number of data cache accesses, which further reduces the reload bandwidth requirement. 
| 5) How are system times (I assume that UNIX is used) calculated? Is this | work done by the UNIX kernel simulated? +----- If you are asking whether we are simulating an entire UNIX kernel, the answer is no. We are simulating compiled C code supported by a C runtime library. System calls are implemented as traps, some of which are executed directly in 29000 code, others which are I/O dependent (like fopen) are passed to the host system for processing. However, we are not comparing system times, just user times. Granted, there is some interaction; we have tried to stay on the conservative side with our numbers. | I don't mean to put the AM29000 down (as including all of the above into | a simulation is beyond difficult), but using simulation times to compare | performance against a real machine is unreasonable (as is a MIPS | to MIPS comparison). | | Mike Shebanow | shebanow@ji.berkeley.edu +---- You certainly do bring up valid points; we also inform customers about our potential simulation limitations when they run benchmarks on our simulator. We aren't trying to "fool" anyone here, just attempting to provide a realistic assessment of performance until parts are available. -- Tim Olson Advanced Micro Devices
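The bus-bandwidth remark in answer 4 can be illustrated with a deliberately naive budget. The sketch below assumes, purely for illustration, that cache-reload traffic and I/O traffic simply add; that additive model is not from the post and ignores arbitration and contention. The 25% reload figure is the upper end of the quoted range, and the I/O shares are the ones Mike Shebanow asked about.

```python
RELOAD_PERCENT = 25          # upper end of the 20% to 25% quoted for cache reload
IO_PERCENTS = (10, 25, 50)   # I/O shares named in the original question


def combined_utilization(reload_percent: int, io_percent: int) -> int:
    """Naive additive model (an assumption): total bus demand in percent."""
    return reload_percent + io_percent


# Total bus demand for each hypothetical I/O share.
budget = {io: combined_utilization(RELOAD_PERCENT, io) for io in IO_PERCENTS}

for io_share, total in budget.items():
    print(f"I/O {io_share}% + reload {RELOAD_PERCENT}% = {total}% of bus bandwidth")
```

Under this additive assumption the bus is not saturated even at a 50% I/O share, which is consistent with the claim that concurrent DMA should not degrade performance too much. A real system would need the contention effects that the post says were not simulated.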
mash@mips.UUCP (03/22/87)
In article <17915@ucbvax.BERKELEY.EDU> shebanow@ji.Berkeley.EDU.UUCP (Mike Shebanow) writes: >In article <15217@amdcad.UUCP> tim@amdcad.UUCP (Tim Olson) writes: >>Perhaps a small description of our simulation environment is in order.... > >I have several questions regarding the simulation. It would >be quite unfair to compare the running times for a real VAX 11/780 against >a simulation, unless the following effects are included in the simulation: >...(list of important effects)... > >I don't mean to put the AM29000 down (as including all of the above into >a simulation is beyond difficult), but using simulation times to compare >performance against a real machine is unreasonable (as is a MIPS >to MIPS comparison). In defense of AMD, comparing simulation against a real machine is NOT unreasonable; it's been done plenty of times, and if done well, can be quite accurate [we do this all of the time, and generally get within several percent, sometimes better.] Serious computer design requires accurate simulators: there are too many tradeoffs to do it by intuition alone. Note, of course, that simulators can have bugs, and that one really only knows once you can start correlating simulations with the real numbers. Finally, Tim O. had replied (well) to the original list of questions. Let me add one more: How often were caches swept in the simulations to account for context-switch, clocks [running under unix, you can lose several % to the fact that clock interrupts execute a bunch of code], system-call overhead, etc? Note that even when you only measure user time, going in and out of the kernel can blast your I-cache, and, if you're not careful, every single read/write can zap big hunks of your D-cache. The kernel time to do this is (properly) not counted, but you can be surprised by cache effects in this type of machine. 
-- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
smeeta@amdcad.UUCP (03/23/87)
In article <218@winchester.mips.UUCP>, mash@mips.UUCP (John Mashey) writes: > > How often were caches swept in the simulations to account for context-switch, > clocks [running under unix, you can lose several % to the fact that > clock interrupts execute a bunch of code], system-call overhead, etc? > The simulations done for benchmarking the Am29000 did not explicitly sweep the caches to account for context-switches in between simulation runs. However, most of the simulations performed on the Am29000 executed in 60,000 cycles or less. So the cost for a cold start for filling caches and TLB entries is amortized over the relatively short runtime. For example, we modified Dhrystone 1.1 to run for 50 iterations instead of the 500,000 iterations (standard). This made the simulation runtime more manageable and also had the effect of a full "cache sweep" every 2.4 msec. For the longer running benchmarks one should take into account the context-switch time. However, we have not yet determined the best model for simulating the effect of a multiprogramming environment on caches, without actually implementing the full OS kernel. Smeeta Gupta Tim Olson Strategic Development for Processor Products. Advanced Micro Devices, Sunnyvale,Ca.
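The 2.4 msec figure follows directly from the cycle count and the 40 ns cycle time given earlier in the thread. A minimal sketch of that conversion, with a hypothetical helper name:

```python
def run_time_ms(cycles: int, cycle_time_ns: float = 40.0) -> float:
    """Wall-clock time of a simulated run in milliseconds (1 ms = 1e6 ns)."""
    return cycles * cycle_time_ns / 1e6


# A run of 60,000 cycles that starts with cold caches and TLB behaves like
# a full cache sweep once per run.
sweep_interval_ms = run_time_ms(60_000)

print(f"effective cache sweep every {sweep_interval_ms:.1f} ms")
```

Because each short run begins with every cache invalidated, the cold-start cost recurs at this interval, which is the sense in which the short benchmarks already include a sweep.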