[comp.arch] Am29000 and MIPS

reiter@endor.harvard.edu (Ehud Reiter) (03/17/87)

In article <15192@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>The Am29000 ...  25 MHz clock (40 ns cycle time) ...
>25 MIPS max., 17 MIPS sustained running big programs

MIPS is of course one of the most unfortunate terms in computer performance.
It seems to have two meanings:
   a)  how much faster a manufacturer thinks his machine is compared
to a VAX-11/780 (usually comparing integer C programs against a 4.2 BSD VAX
using standard Berkeley cc).
   b)  the number of million instructions per second that a computer executes.

In a sane world, (b) would be the common definition.  Unfortunately, in our
world, (a) seems to be the common definition.  So, when someone uses the word
"MIPS" and means (b) (as seems to be the case in the above posting), I suspect
that many people get confused and think that definition (a) was meant.

I don't mean to criticize the Am29000, which I'm sure is a fine machine.  But
in the interest of taking whatever feeble measures are possible to reduce the
confusion level in this area, I wish people would stick to definition (a)
of "MIPS", unless they explicitly say they're using definition (b).

					Ehud Reiter
					reiter@harvard	(ARPA,BITNET,UUCP)

tim@amdcad.UUCP (Tim Olson) (03/18/87)

In article <1423@husc6.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes:
>In article <15192@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>>The Am29000 ...  25 MHz clock (40 ns cycle time) ...
>>25 MIPS max., 17 MIPS sustained running big programs
>
>MIPS is of course one of the most unfortunate terms in computer performance.
>It seems to have two meanings:
>   a)  how much faster a manufacturer thinks his machine is compared
>to a VAX-11/780 (usually comparing integer C programs against a 4.2 BSD VAX
>using standard Berkeley cc).
>   b)  the number of million instructions per second that a computer executes.
>
You are correct, the numbers quoted above are Am29000 MIPS.  The 17 MIPS
sustained average translates to around 14 - 15 VEMs (Vax-Equivalent MIPS)
(Gee, maybe we should use that term for b! :-)

"MIPS" is not necessarily a meaningless indicator.  It can provide
information on processor throughput (i.e. how much the processor is
affected by pipeline stalls, jumps, cache and TLB misses...) when
running real programs.  However, it must be combined with the number of
instructions executed to derive the real measure of performance (1/s).
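
The point about combining MIPS with instruction count can be sketched in a few lines of Python. This is only an illustration: the function name and both instruction counts are hypothetical, chosen to show that a machine with a higher native MIPS rating can still take about as long (or longer) if it needs more instructions for the same program.

```python
def run_time_seconds(instruction_count, native_mips):
    """Execution time = instructions executed / instructions per second."""
    return instruction_count / (native_mips * 1e6)

# Hypothetical instruction counts for the same program on two machines.
# Neither count comes from the thread; only the 17 MIPS figure does.
machine_a = run_time_seconds(instruction_count=34_000_000, native_mips=17.0)
machine_b = run_time_seconds(instruction_count=10_000_000, native_mips=4.0)

print(f"machine A: {machine_a:.2f} s, machine B: {machine_b:.2f} s")
```

With these made-up counts, a 4x gap in native MIPS shrinks to a 1.25x gap in run time, which is why the native rating alone says little.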

	-- Tim Olson
	Advanced Micro Devices
	Product Planning, Programmable Processors
	(tim@amdcad)

tim@amdcad.UUCP (Tim Olson) (03/18/87)

>(Gee, maybe we should use that term for b! :-)
					 ^
I guess it's too early in the morning for me -- I meant "a" here.

	-- Tim Olson
	Advanced Micro Devices
	Product Planning, Programmable Processors
	(tim@amdcad)

klein%gravity@Sun.COM (Mike Klein) (03/18/87)

In article <1423@husc6.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes:
>In article <15192@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>>The Am29000 ...  25 MHz clock (40 ns cycle time) ...
>>25 MIPS max., 17 MIPS sustained running big programs
>
>I don't mean to criticize the Am29000, which I'm sure is a fine machine.  But
>in the interest of taking whatever feeble measures are possible to reduce the
>confusion level in this area, I wish people would stick to definition (a)
>of "MIPS", unless they explicitly say they're using definition (b).

In this case, the meaning of Brian Case's MIPS should be pretty obvious.
A 40 ns cycle means 25 peak MIPS, for the case where the Am29000 is executing
a loop of NOPs out of its on-board I cache (as one example).  Elsewhere, I saw
it mentioned that the average (simulated) cycles per instruction on this
machine is about 1.5 on large C programs; dividing 25 MIPS by 1.5 gives you
16.67 MIPS.  Neither number is a comparison against a VAX/780.
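
The arithmetic above, written out as a small Python sketch. The function names are made up for illustration; the 40 ns cycle time and the 1.5 cycles per instruction are the figures quoted in this thread, and the peak figure assumes one instruction completes per cycle.

```python
def peak_mips(cycle_time_ns):
    """Native MIPS if one instruction completes every cycle."""
    return 1000.0 / cycle_time_ns

def sustained_mips(cycle_time_ns, cycles_per_instruction):
    """Native MIPS at a given average cycles-per-instruction (CPI)."""
    return peak_mips(cycle_time_ns) / cycles_per_instruction

peak = peak_mips(40)                 # 25 MHz clock
sustained = sustained_mips(40, 1.5)  # simulated average CPI on large C programs

print(f"peak: {peak:.2f} MIPS, sustained: {sustained:.2f} MIPS")
```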

I don't see any numbers on the performance impacts of things like cache misses,
interrupts, and the processor running an operating system.
--
	Mike Klein		klein@sun.{arpa,com}
	Sun Microsystems, Inc.	{ucbvax,hplabs,ihnp4,seismo}!sun!klein
	Mountain View, CA

cmcmanis@sun.uucp (Chuck McManis) (03/19/87)

Recently several people have pointed out the MIPS debate to those on
the net that haven't been exposed to it yet (lucky them). I propose
that in the future the small letter 'm' be prepended to specifications
given in 'machine' MIPS, and the small letter 'v' be prepended to
specifications given in 'VAX-11/780' MIPS. So the Am29000
would have a performance rate of 17 mMIPS and the Sun 3/200 would be rated
at 4 vMIPS. It would clear things up noticeably. Also, if you are relaying
information, why not just use ?MIPS if you aren't sure which way your
source was representing them?


-- 
--Chuck McManis
uucp: {anywhere}!sun!cmcmanis   BIX: cmcmanis  ARPAnet: cmcmanis@sun.com
These opinions are my own and no one elses, but you knew that didn't you.

tim@amdcad.UUCP (Tim Olson) (03/19/87)

In article <15243@sun.uucp>, klein%gravity@Sun.COM (Mike Klein) writes:
> A 40 ns cycle means 25 peak MIPS, for the case where the Am29000 is executing
> a loop of NOPs out of its on-board I cache (as one example).  Elsewhere, I saw
> it mentioned that the average (simulated) cycles per instruction on this
> machine is about 1.5 on large C programs; dividing 25 MIPS by 1.5 gives you
> 16.67 MIPS.  Neither number is a comparison against a VAX/780.
> 
> I don't see any numbers on the performance impacts of things like cache misses,
> interrupts, and the processor running an operating system.

Perhaps a small description of our simulation environment is in order.  Our
internal simulator simulates the Am29000 in conjunction with an external
memory environment, which may also include caches (both instruction and data).
The simulator is written at a very detailed level, incorporating all of the
possible pipeline stalls and exceptions (and their interactions) that may be
encountered during the execution of a program. 

The external memory model used to derive these numbers consists of separate,
64K byte instruction and data caches, which have a 2-cycle access time, with
a single-cycle burst mode interface.  Main memory has a 4-cycle (160 ns)
access with single-cycle burst for reload.  Branch target cache misses, TLB
misses and reloads, and external instruction and data cache misses were
included in the simulations.  Actually, external caches aren't required for
decent performance; we can also interface to video-DRAMS quite well.
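
As a rough illustration of what these access times mean, here is a Python sketch of the cycles needed to move a block of words. It assumes that "N-cycle access with single-cycle burst" means N cycles for the first word and one cycle for each word after it; that reading, the function name, and the 4-word block size are assumptions, since the block size is not stated here.

```python
CYCLE_NS = 40  # 25 MHz clock, as quoted earlier in the thread

def burst_cycles(first_access_cycles, words):
    """Cycles to transfer `words` words: initial access, then one word per cycle."""
    return first_access_cycles + (words - 1)

BLOCK_WORDS = 4  # hypothetical block size, not given in the thread

cache_access_cycles = burst_cycles(2, BLOCK_WORDS)   # external cache, 2-cycle access
memory_reload_cycles = burst_cycles(4, BLOCK_WORDS)  # main memory, 4-cycle access
memory_reload_ns = memory_reload_cycles * CYCLE_NS

print(cache_access_cycles, memory_reload_cycles, memory_reload_ns)
```

A single-word main memory access under this model is 4 cycles, or 160 ns, which matches the figure given above.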

With this simulation model, we have attempted to describe a real (i.e.
semi-easily attainable) system, instead of a "hot-box".  We have seen
real programs running at 24.5 MIPS on this system.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad)

lyang%jennifer@Sun.COM (Larry Yang) (03/19/87)

In article <15213@amdcad.UUCP> tim@amdcad.UUCP (Tim Olson) writes:
>In article <1423@husc6.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes:
>>In article <15192@amdcad.UUCP> bcase@amdcad.UUCP (Brian Case) writes:
>>>The Am29000 ...  25 MHz clock (40 ns cycle time) ...
>>>25 MIPS max., 17 MIPS sustained running big programs
>>
>>MIPS is of course one of the most unfortunate terms in computer performance.
>>It seems to have two meanings:
>>   a)  how much faster a manufacturer thinks his machine is compared
>>to a VAX-11/780 (usually comparing integer C programs against a 4.2 BSD VAX
>>using standard Berkeley cc).
>>   b)  the number of million instructions per second that a computer executes.
>
>"MIPS" is not necessarily a meaningless indicator.  It can provide
>information on processor throughput (i.e. how much the processor is
>affected by pipeline stalls, jumps, cache and TLB misses...) when
>running real programs.  However, it must be combined with the number of
>instructions executed to derive the real measure of performance (1/s).

If the field of computer architecture existed in a vacuum, then (b)
would be the most important figure.  You can look at it to study how
the architecture performs despite branches, cache misses, etc.

But processors exist in the context of the real world, where we all want
to be able to compare processors against each other.  That is why (a) is
a much more meaningful figure.

All in all, to use a *very* old joke:

	"MIPS = Meaningless Index of Processor Speed"

================================================================================

--Larry Yang [lyang@sun.com,{backbone}!sun!lyang]|   A REAL _|> /\ |
  Sun Microsystems, Inc., Mountain View, CA      | signature |   | | /-\ |-\ /-\
  "The attention span of a computer is only as   |          <|_/ \_| \_/\| |_\_|
   long as its electrical cord."                 |                _/          _/

shebanow@ji.Berkeley.EDU (Mike Shebanow) (03/19/87)

In article <15217@amdcad.UUCP> tim@amdcad.UUCP (Tim Olson) writes:
>Perhaps a small description of our simulation environment is in order.  Our
>internal simulator simulates the Am29000 in conjunction with an external
>memory environment, which may also include caches (both instruction and data).
>The simulator is written at a very detailed level, incorporating all of the
>possible pipeline stalls and exceptions (and their interactions) that may be
>encountered during the execution of a program. 
>
>The external memory model used to derive these numbers consists of separate,
>64K byte instruction and data caches, which have a 2-cycle access time, with
>a single-cycle burst mode interface.  Main memory has a 4-cycle (160 ns)
>access with single-cycle burst for reload.  Branch target cache misses, TLB
>misses and reloads, and external instruction and data cache misses were
>included in the simulations.  Actually, external caches aren't required for
>decent performance; we can also interface to video-DRAMS quite well.

I have several questions regarding the simulation. It would
be quite unfair to compare the running times for a real VAX 11/780 against
a simulation, unless the following effects are included in the simulation:

1) Were cold start effects included? If so, how are they simulated?

2) How were page faults simulated?  Are these times included?

3) Was logic for the cache and TLB (an assumption) simulated?

4) Are the effects of I/O simulated (say 10, 25, and 50% bus bandwidth
consumed by I/O devices)? What model?

5) How are system times (I assume that UNIX is used) calculated? Is the
work done by the UNIX kernel simulated?

I don't mean to put the AM29000 down (as including all of the above into
a simulation is beyond difficult), but using simulation times to compare
performance against a real machine is unreasonable (as is a MIPS
to MIPS comparison).

Mike Shebanow
shebanow@ji.berkeley.edu

tihor@acf4.UUCP (Stephen Tihor) (03/19/87)

Digital Product managers talking to intelligent users tend to
refer to VUPS ("Vax Units of Performance" relative to the platinum-iridium
VAX-11/780 in the bell jar in the Mill.).

Historically, I was given to understand, the MIP referred to some IBM box.

tim@amdcad.UUCP (Tim Olson) (03/20/87)

Mike Shebanow writes:

| I have several questions regarding the simulation. It would
| be quite unfair to compare the running times for a real VAX 11/780 against
| a simulation, unless the following effects are included in the simulation:
+-----
Agreed. You must always take *any* benchmark comparisons with a grain of
salt. The numbers posted were only meant to provide a rough idea of the
performance attainable.

| 1) Were cold start effects included? If so, how are they simulated?
+-----
Yes.  All caches (including the TLB) are initially invalidated.  The
simulated program exists in memory, but is "faulted in" to the TLB
during execution.
 
| 2) How were page faults simulated?  Are these times included?
+-----
Page faults are a secondary effect of TLB misses, and times for them are
greatly dependent on disk speed, etc.  Page fault processing time is not
included in the accumulated user time on real machines, so we don't
include it either.  We *do* count all of the processing (both user and
system) time it takes to perform TLB miss handling and system call
entry/exit.

| 3) Was logic for the cache and TLB (an assumption) simulated?
+-----
Yes.  All of the caches (both internal and external) were fully
simulated with "real" misses, reloading, etc.  The models are not
derived from statistical averages, and the numbers we used (both size
and access time) were taken from what we felt would be feasible both now
and in the near future.

| 4) Are the effects of I/O simulated (say 10, 25, and 50% bus bandwidth
| consumed by I/O devices)? What model?
+-----
This opens a whole new "can of worms", and is why most benchmarks don't
address the I/O issue.  I/O effects were not simulated, but, as far as
bus bandwidth is concerned, external cache reload is only consuming 20%
to 25%, so concurrent DMA into memory should not degrade the
performance too much.  Note that our large number of registers reduce
the number of data cache accesses, which further reduces the reload
bandwidth requirement.

| 5) How are system times (I assume that UNIX is used) calculated? Is the
| work done by the UNIX kernel simulated?
+-----
If you are asking whether we are simulating an entire UNIX kernel, the
answer is no.  We are simulating compiled C code supported by a C
runtime library.  System calls are implemented as traps; some of these
are executed directly in 29000 code, while others that are I/O dependent
(like fopen) are passed to the host system for processing.  However, we
are not comparing system times, just user times.  Granted, there is some
interaction; we have tried to stay on the conservative side with our
numbers.


| I don't mean to put the AM29000 down (as including all of the above into
| a simulation is beyond difficult), but using simulation times to compare
| performance against a real machine is unreasonable (as is a MIPS
| to MIPS comparison).
| 
| Mike Shebanow
| shebanow@ji.berkeley.edu
+----
You certainly do bring up valid points; we also inform customers about
our potential simulation limitations when they run benchmarks on our
simulator. We aren't trying to "fool" anyone here, just attempting to
provide a realistic assessment of performance until parts are available.

	-- Tim Olson
	Advanced Micro Devices

mash@mips.UUCP (03/22/87)

In article <17915@ucbvax.BERKELEY.EDU> shebanow@ji.Berkeley.EDU.UUCP (Mike Shebanow) writes:
>In article <15217@amdcad.UUCP> tim@amdcad.UUCP (Tim Olson) writes:
>>Perhaps a small description of our simulation environment is in order....
>
>I have several questions regarding the simulation. It would
>be quite unfair to compare the running times for a real VAX 11/780 against
>a simulation, unless the following effects are included in the simulation:
>...(list of important effects)...
>
>I don't mean to put the AM29000 down (as including all of the above into
>a simulation is beyond difficult), but using simulation times to compare
>performance against a real machine is unreasonable (as is a MIPS
>to MIPS comparison).

In defense of AMD, comparing simulation against a real machine is NOT
unreasonable; it's been done plenty of times, and if done well, can be
quite accurate [we do this all of the time, and generally get within
several percent, sometimes better.]  Serious computer design requires
accurate simulators: there are too many tradeoffs to do it by intuition alone.

Note, of course, that simulators can have bugs, and that one only really
knows once one can start correlating simulations with the real numbers.

Finally, Tim O. had replied (well) to the original list of questions.
Let me add one more:

How often were caches swept in the simulations to account for context-switch,
clocks [running under unix, you can lose several % to the fact that
clock interrupts execute a bunch of code], system-call overhead, etc?

Note that even when you only measure user time, going in and out of the
kernel can blast your I-cache, and, if you're not careful, every single
read/write can zap big hunks of your D-cache.  The kernel time to
do this is (properly) not counted, but you can be surprised by cache
effects in this type of machine.
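
A toy Python sketch of the effect described above: a direct-mapped cache model that is optionally wiped every so many accesses, standing in for kernel entries that displace user data. Everything here (the function name, the cache size, the trace, the flush interval) is hypothetical and far simpler than a real simulator; it only shows how periodic sweeps multiply the miss count of a loop that would otherwise fit in the cache.

```python
def count_misses(trace, num_lines, flush_every=None):
    """Count misses in a direct-mapped cache holding one address per line.

    If flush_every is set, the whole cache is invalidated every that many
    accesses, as a crude stand-in for a kernel entry sweeping the cache.
    """
    cache = [None] * num_lines
    misses = 0
    for i, addr in enumerate(trace):
        if flush_every and i > 0 and i % flush_every == 0:
            cache = [None] * num_lines
        slot = addr % num_lines
        if cache[slot] != addr:
            misses += 1
            cache[slot] = addr
    return misses

# A loop touching 64 addresses, repeated 100 times; it fits in a 64-line cache.
trace = list(range(64)) * 100

no_sweep = count_misses(trace, num_lines=64)
swept = count_misses(trace, num_lines=64, flush_every=640)

print(no_sweep, swept)
```

Without sweeps only the 64 cold-start misses occur; with a sweep every 640 accesses the same trace takes ten times as many misses, even though no "kernel time" is charged to the program.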
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

smeeta@amdcad.UUCP (03/23/87)

In article <218@winchester.mips.UUCP>, mash@mips.UUCP (John Mashey) writes:
> 
> How often were caches swept in the simulations to account for context-switch,
> clocks [running under unix, you can lose several % to the fact that
> clock interrupts execute a bunch of code], system-call overhead, etc?
> 
The simulations done for benchmarking the Am29000 did not explicitly sweep
the caches to account for context-switches in between simulation runs. However,
most of the simulations performed  on the Am29000 executed in 60,000 cycles or 
less. So the cost for a cold start for filling caches and TLB entries is 
amortized over the relatively short runtime. 

For example, we modified Dhrystone 1.1 to run for 50 iterations instead of the
standard 500,000 iterations. This made the simulation runtime more
manageable and also had the effect of a full "cache sweep" every 2.4 msec.
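
The 2.4 msec figure appears consistent with the 60,000-cycle run length mentioned above at a 40 ns cycle time. A small Python sketch of that conversion follows; the function name is made up, and treating 60,000 cycles as the length of the Dhrystone run is an inference, not something stated outright.

```python
CYCLE_TIME_NS = 40  # 25 MHz clock, as quoted earlier in the thread

def cycles_to_ms(cycles, cycle_time_ns=CYCLE_TIME_NS):
    """Convert a cycle count to milliseconds of simulated time."""
    return cycles * cycle_time_ns / 1e6

# Assumed run length: the "60,000 cycles or less" figure from this post.
run_ms = cycles_to_ms(60_000)

print(f"{run_ms:.1f} ms per cold-start run")
```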

For the longer running benchmarks one should take into account the context-
switch time. However, we have not yet determined the best model for simulating
the effect of a multiprogramming environment on caches, without actually 
implementing the full OS kernel. 


Smeeta Gupta
Tim Olson

Strategic Development for Processor Products.

Advanced Micro Devices, Sunnyvale, CA.