[net.arch] RISC and MIPS

steve@kontron.UUCP (Steve McIntosh) (07/26/85)

[ From the DTACK newsletter #44 (August 1985) ]

"RISC ARCHITECTURES:

Although simplified instruction sets were pioneered by Seymour Cray and
by IBM's 801 project, the RISC concept was popularized by Patterson of
UC Berkeley, assisted by the traditional slave labor available to a
university professor. UC Berkeley provides a very nice ivory tower where
competitive pressures and real-world pragmatism need not be considered.

Patterson provided a number of benchmarks which conclusively proved
that the RISC was superior to anything around. He also chose the
benchmarks, controlled the conditions of the tests, and most
particularly, chose to perform benchmarks in the HLL language 'C'
exclusively. This provides opportunities for much mischief (as when
Intel benchmarks its 16-megabyte addressing-range 286 EXCLUSIVELY in a
64k baby-pen).

( Author goes on for 7 paragraphs ... general description of RISC,
attributes performance advantages of RISC architecture (if any) to
sizable overlapping register set ... lack of software support for new
chips, etc. )

WHAT IS A MIP?

Technically, a MIP is a million instructions per second. OK then,
what's an instruction? Ah! That's a very good question!

Take, for example, the following 68000 instruction:

	MOVE.W D7,(A3)+

That instruction stores the lower word of the 32-bit register D7 at the
address contained in the 32-bit address register A3, and then
increments A3 by two (two bytes = one word). Here is that instruction's
equivalent for the Nat Semi 32016:

	MOVW D7,(SB)
	ADDQD 2,SB

The same operation in a hypothetical RISC machine:

	MOVE R7,(R#)
	LOAD 2,R4
	ADD  R4,R#

For simplicity, suppose that those three computers each performed that
equivalent instruction (or instruction sequence) in exactly one
microsecond. Then the 68000 would be operating at 1 MIP, the 32000
series at 2 MIPS, and the hypothetical RISC machine at 3 MIPS. 

EACH COMPUTER WOULD BE PERFORMING EXACTLY THE SAME AMOUNT OF WORK!

When a RISC advocate proudly points to the MIP performance rating of
his preferred architecture, remember that a RISC HAS to run at a lot
more MIPS than a CISC machine just to keep up. WHAT IS A MIP, INDEED!

THE VON NEUMANN BOTTLENECK:

No von Neumann machine can run faster than allowed by its
bus-bandwidth. Both RISC and CISC architectures are von Neumann
machines. Let us examine those three equivalent instruction (sequences)
from a bus-bandwidth standpoint:

The 68000 requires two memory cycles, the 32016 three memory cycles
and the RISC machine either three or four memory cycles depending on
whether it includes a LOAD QUICK instruction which contains small
integers inside the instruction field.

In fact, the 68000 executes that instruction (singular in the case of
the 68000) in exactly two memory cycles (eight clocks). The 32016
requires AT LEAST three memory cycles (12 clocks) to execute, and can
be even slower than that if the instructions are not word aligned
(68000 instructions are ALWAYS word aligned). The hypothetical RISC
machine requires 3 or 4 memory cycles to perform the same operation.

Therefore, the poor little 68000, operating at a mere 1 MIPS, is
running AT LEAST 50% faster than a 2 MIPS 32016 and 50% to 100% faster
than a 3 MIPS RISC machine!

To further emphasize this important point, our hypothetical RISC
machine would have to have a memory cycle time 50% to 100% faster than
the 68000 to have equal performance, and that would give it a 4.5 (50%
faster) or 6.0 (100% faster) MIPS rating. If the 68000 is running at
12.5Mhz with zero wait states using 120 nsec DRAM (say, two megabytes
of it) then JUST WHERE DO WE FIND DRAM FOR THAT RISC WHICH IS 50% TO
100% FASTER? (At an affordable price - srmc)

... Since a 68000 in fact performs that instruction in 640 nsec,
corresponding to 1.56 MIPS, that hypothetical RISC machine would have
to run at 7.03 to 9.38 MIPS just to equal the 1.56 MIPS 68000! WE STRONGLY
SUGGEST THAT YOU IGNORE THE MIPS RATING OF RISC MACHINES! RISC machines
HAVE to have a high MIPS rating just to get out of their own way!"
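
[ A rough cross-check of the arithmetic above, added as an illustration
only: the 12.5 MHz clock and 8-clock cycle count are simply taken from the
newsletter, and this small C fragment turns them into the 640 nsec and
1.56 MIPS figures, then shows why a three-instruction sequence needs
roughly three times the MIPS rating to deliver the same work. ]

#include <stdio.h>

int main(void)
{
	double clock_mhz   = 12.5;  /* 68000 clock, per the newsletter       */
	double clocks_used = 8.0;   /* clocks for MOVE.W D7,(A3)+, per above */

	double usec_per_op = clocks_used / clock_mhz;   /* 0.64 usec         */
	double mips_68000  = 1.0 / usec_per_op;         /* about 1.56 MIPS   */

	/* A RISC needing 3 instructions for the same operation must retire */
	/* all 3 in those 0.64 usec just to break even:                     */
	double mips_needed = 3.0 / usec_per_op;         /* about 4.7 MIPS    */

	printf("68000: %.0f nsec/op, %.2f MIPS\n",
	       usec_per_op * 1000.0, mips_68000);
	printf("3-instruction RISC needs %.2f MIPS for the same work\n",
	       mips_needed);
	return 0;
}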

- DTACK GROUNDED		$15/10 issues (US & Canada)
  1415 E. McFadden, Ste.E	$25/10 issues elsewhere
  Santa Ana CA 92705	        (US funds)

=======================================================================
I do not work for DTACK, I just enjoy the newsletter. Flames to them,
please (they like it, and DO publish flame letters that they get in
response to the newsletter.)
=======================================================================
Steve McIntosh, Kontron Electronics, Irvine CA / usual disclaimers /

mahar@weitek.UUCP (mahar) (07/29/85)

In article <419@kontron.UUCP>, steve@kontron.UUCP (Steve McIntosh) writes:
> "RISC ARCHITECTURES:
> what's an instruction? Ah! That's a very good question!
> Take, for example, the following 68000 instruction:
> 
> 	MOVE.W D7,(A3)+
> 
> That instruction stores the lower word of the 32-bit register D7 at the
> address contained in the 32-bit address register A3, and then
> increments A3 by two (two bytes = one word).
> The same operation in a hypothetical RISC machine:
> 
> 	MOVE R7,(R#)
> 	LOAD 2,R4
> 	ADD  R4,R#

	On the Berkeley RISC, the equivalent instruction sequence is:

	MOVE R7,(R4)
	ADD #2,R4,R4

> 
> For simplicity, suppose that those three computers each performed that
> equivalent instruction (or instruction sequence) in exactly one
> microsecond. Then the 68000 would be operating at 1 MIP, the 32000
> series at 2 MIPS, and the hypothetical RISC machine at 3 MIPS. 
> 
> EACH COMPUTER WOULD BE PERFORMING EXACTLY THE SAME AMOUNT OF WORK!
> 
> When a RISC advocate proudly points to the MIP performance rating of
> his preferred architecture, remember that a RISC HAS to run at a lot
> more MIPS than a CISC machine just to keep up. WHAT IS A MIP, INDEED!

I agree that MIP is not a very good measure of computer performance.
A rule of thumb that I have heard is that the VAX 780 is 1 MIP. If your
computer does more work in less time than a 780, it is faster than 1 MIP.
If it does less, it's slower.
> 
> THE VON NEUMANN BOTTLENECK:
> 
> No von Neumann machine can run faster than allowed by its
> bus-bandwidth. Both RISC and CISC architectures are von Neumann
> machines. Let us examine those three equivalent instruction (sequences)
> from a bus-bandwidth standpoint:
> 
> The 68000 requires two memory cycles, the 32016 three memory cycles
> and the RISC machine either three or four memory cycles depending on
> whether it includes a LOAD QUICK instruction which contains small
> integers inside the instruction field.

Once again, the Berkeley RISC takes two memory cycles.

> 
> In fact, the 68000 executes that instruction (singular in the case of
> the 68000) in exactly two memory cycles (eight clocks). The 32016
> requires AT LEAST three memory cycles (12 clocks) to execute, and can
> be even slower than that if the instructions are not word aligned.
> (68000 instructions are ALWAYS word aligned) The hypothetical risk
> machine requires 3 or 4 memory cycles to perform the same operation.

Since the Berkeley machine also takes two memory cycles (2 clocks), this is
a tie.

> 
> Therefore, the poor little 68000, operating at a mere 1 MIPS, is
> running AT LEAST 50% faster than a 2 MIPS 32016 and 50% to 100% faster
> than a 3 MIPS RISC machine!
> 
> To further emphasize this important point, our hypothetical RISC
> machine would have to have a memory cycle time 50% to 100% faster than
> the 68000 to have equal performance, and that would give it a 4.5 (50%
> faster) or 6.0 (100% faster) MIPS rating. If the 68000 is running at
> 12.5Mhz with zero wait states using 120 nsec DRAM (say, two megabytes
> of it) then JUST WHERE DO WE FIND DRAM FOR THAT RISC WHICH IS 50% TO
> 100% FASTER? (At an affordable price - srmc)

With that same 120 nsec DRAM, the Berkeley RISC could run at 8 MHz.
Since the 68000 does its work in 8 cycles, it takes 8 cycles at 12.5 MHz,
or about 640 nsec. The Berkeley RISC does its work in 2 cycles, so at
8 MHz it takes 250 nsec, all for the same memory bandwidth. In fact,
however, the Berkeley RISC only runs at about 4 MHz. At 4 MHz the cycle
time is 250 nsec, so the same sequence would take 500 nsec. A 4 MHz RISC
thus takes 25/32 of the time a 12.5 MHz 68000 needs.

I agree: in the example given, the RISC took twice as many instructions
to do the same job. The MIP designation is ambiguous. One must look at
how much work is done in a given amount of time.

wfmans@ihuxb.UUCP (w. mansfield) (08/01/85)

No, I'm not going to repeat the whole article.
It contrasted RISC and CISC micros and determined from the MIP
ratings that RISC micros are silly.

1.  MIP ratings have been discussed here before, and all agree that
    they are a stupid measure of anything.  A RISC will generally
    achieve one instruction per cycle (that's the goal, anyway),
    while a CISC requires many cycles to do an instruction.  Also,
    typical CISCs report their MIP ratings for their shortest
    instruction (NOP?), and inflate their MIPS accordingly.
    
2.  The basic premise of RISC machines is to do the instructions
    that are used often very fast.  Indexing is a complex operation
    that just isn't done that often (and which can usually be
    simplified by good compilers (e.g. sophisticated compilers as
    used by 801 and MIPS projects)).
    
3.  It is becoming apparent to folks doing objective measurements
    from models of RISCs that much of the performance of the RISC
    isn't from the reduced instructions; it's from the register
    windows et al.  Agreed, these architectural improvements aren't
    part of RISC per se, but try getting them to fit in silicon
    along with a complex instruction set.
    
    No flames, just observations.  Newsletter sounds interesting,
    like a reprint of net.bellicose.
    

hammond@petrus.UUCP (Rich A. Hammond) (08/02/85)

> In article <419@kontron.UUCP>, steve@kontron.UUCP (Steve McIntosh) writes:
> > Take, for example, the following 68000 instruction:
> > 
> > 	MOVE.W D7,(A3)+
> >
This takes 2 16-bit memory cycles (instruction fetch, operand store).
> > 
> 	On the Berkeley RISC, the equivalent istruction sequence is:
> 
> 	MOVE R7,(R4)
> 	ADD #2,R4,R4
>
This takes 3 32-bit memory cycles (i.e. 3 instruction cycles):
(instruction fetch, operand store, instruction fetch)

The RISC I takes 3 clock cycles per instruction cycle. (See Computer, Sept. '82.)
The Berkeley RISC takes 2 instruction cycles per load/store (see CACM Jan '85).

The point was that for the above operation, the more compact encoding of
the 68000 requires fewer memory cycles and hence is faster.  The number of
clock cycles per memory cycle, assuming a reasonable architecture, is
irrelevant, since the RISC can do at most 1 instruction/memory cycle
since it has to fetch an instruction.  Note that the 68020 in fact
uses only 3 clock cycles per memory cycle (like the RISC I).

What does all this have to do with RISC vs CISC?  Is the auto-increment mode
common in compiler-generated code?  How about other operations?  In other
words, I can accept that certain operations will be slower if the overall
performance improves, so picking on an individual sequence only helps if
we know its relative frequency in real code.

A side note:  The Berkeley RISCs have no absolute addressing mode; they
fake it by using R0 (always 0) plus an offset.  BUT, the offset can only
be 13 bits, hence they can only absolutely address the first 2**13 locations
in memory.  Large programs, e.g. the UNIX kernel (particularly from Berkeley),
use much more than 2**13 (like 2**19) for instructions, so the question is
how well a RISC would do when it takes 2 instructions to form an absolute
address and probably requires a register.  I'll accept RISCs when I see one
running 4.3 BSD faster than an 11/780.
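
[ A sketch of what "2 instructions to form an absolute address" amounts to.
The split into an upper part plus a 13-bit offset is my illustration of the
idea; the instruction names are not taken from the Berkeley papers. ]

#include <stdio.h>

int main(void)
{
	unsigned long addr = 0x0007F123UL;      /* > 2**13, so one 13-bit   */
	unsigned long low  = addr & 0x1FFFUL;   /* offset can't reach it    */
	unsigned long high = addr & ~0x1FFFUL;

	/* In effect: one instruction loads the upper bits of the address   */
	/* into a register, and a second instruction (or the load/store's   */
	/* own 13-bit offset field) supplies the rest.                      */
	printf("addr=%#lx  high=%#lx  low=%#lx  recombined=%#lx\n",
	       addr, high, low, high | low);
	return 0;
}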

joel@peora.UUCP (Joel Upchurch) (08/05/85)

>A side note:  The Berkeley RISC's have no  absolute  addressing  mode,
>they  fake  it by using R0 (always 0) plus an offset.  BUT, the offset
>can only be 13 bits, hence they can only absolutely address the  first
>2**13  locations  in  memory.  Large  programs,  eg  the  UNIX  kernel
>(particularly from Berkeley) use much more than 2**13 (like 2**19) for
>instructions,  hence  the  problem is how well would a RISC do when it
>takes 2 instructions to form an absolute address and probably requires
>a  register?  I'll accept RISCs when I see one runnning 4.3 BSD faster
>than an 11/780.

        I would like to point out that the IBM 370 (usually considered
	a CISC :-> ) doesn't have absolute addressing and that it only
	has a displacement of 2**12 bytes. They seem to be able to get
	some rather large operating systems, including UNIX, to run on
	it.

hammond@petrus.UUCP (Rich A. Hammond) (08/06/85)

> >...  The Berkeley RISC's have no  absolute  addressing  mode ...
> >I'll accept RISCs when I see one runnning 4.3 BSD faster than an 11/780.
> 
>         I would like to point out that the IBM 370 (usually considered
> 	a CISC :-> ) doesn't have absolute addressing and that it only
> 	has a displacement of 2**12 bytes. They seem to be able to get
> 	some rather large operating systems, including UNIX, to run on
> 	it.

I never claimed it was impossible to get something running; I was more
concerned with whether the RISC retains its speed advantage when faced
with large amounts of absolute addressing.  Perhaps a couple of global
registers used as base registers (a la the IBM 360) would cover the
commonly accessed global data structures (or a smart loader would pack
most of them together).
Anyway, the RISC claims will always be slightly dubious to me until I
actually see a machine performing as claimed in a real situation.
Hidden gotchas have a way of getting missed in paper exercises.

jtb@kitc.UUCP (John Burgess) (08/06/85)

I hate to be picky, folks, but MIP is not an acronym of anything.
The correct acronym is MIPS -- Million Instructions Per Second.
What is a "Million Instructions Per" ???

Note that MIPS is both singular and plural.
(Actually it's always plural, but saying 1 MIPS makes it look
singular.)

Another way to tell that the S is necessary:
if it weren't, you'd say 1 MIP, but 2 MIPs (lower-case s)
to make it plural.

OK, enough already.  Just mind your Ps and Qs and Ss!
-- 
John Burgess
ATT-IS Labs, So. Plainfield NJ  (HP 1C-221)
{most Action Central sites}!kitc!jtb
(201) 561-7100 x2481  (8-259-2481)

darrell@sdcsvax.UUCP (Darrell Long) (08/07/85)

In article <437@petrus.UUCP> hammond@petrus.UUCP (Rich A. Hammond) writes:
>[...]  I'll accept RISCs when
>I see one runnning 4.3 BSD faster than an 11/780.

You should see how much faster our Pyramid runs 4.2 than our 11/780!
-- 
Darrell Long
Department of Electrical Engineering and Computer Science
University of California, San Diego

USENET: sdcsvax!darrell
ARPA:   darrell@sdcsvax

bcase@uiucdcs.Uiuc.ARPA (08/07/85)

/* Written  7:28 am  Aug  6, 1985 by hammond@petrus.UUCP in uiucdcs:net.arch */
Anyway, the RISC claims will always be slightly dubious to me until I
actually see a machine performing as claimed in a real situation.
Hidden gotchas have a way of getting missed in paper exercises.
/* End of text from uiucdcs:net.arch */

You probably won't have to wait TOO long, since Acorn Computer Co. of
England just announced (in Electronics magazine) that they have good,
working die, the first time, after only 18 months of effort by 4
people.  The machine is able to sustain 3 of its MIPS, and running
real programs (code produced by compilers) it was "about 2 times a VAX
11/780" and 10 times an IBM PC-AT.  The article did not mention anything
about memory management, so they may or may not be winning because of
the lack of translation.  Anyway, they will be selling an evaluation
board for $2000 (which contains 1 MByte of memory, the processor and
some bootstrap ROM) which plugs into the $400 Acorn 6502-based micro-
computer.  Thus for about $3000, a person can get a really nice
personal workstation, and what performance.  Oh, the board comes with
a BCPL compiler and a MODULA-2 compiler, a small operating system,
and a window-oriented text editor.  More software is on the way (C,
Pascal, etc.).  It is true that this RISC will have about the same
performance as the 68020, but this RISC is fabricated with MUCH less
aggressive technology, and when shrunk to modern design rules, will
probably be significantly faster than more complex 32-bitters (the
minor cycle time of a 68020 is 60 ns. (at 16.67 MHz) while the minor
cycle time of this RISC is 150 ns.).  True, we should not count our
RISCs before they are hatched, but there are real advantages to RISC.

hammond@petrus.UUCP (Rich A. Hammond) (08/08/85)

> In article <437@petrus.UUCP> I said
> >[...]  I'll accept RISCs when
> >I see one runnning 4.3 BSD faster than an 11/780.
> 
> Darrell Long replies:
> You should see how much faster our Pyramid runs 4.2 than our 11/780!

1) I'm not sure I'll accept the claim that the Pyramids, Ridge, ...
are truly RISC machines; they have taken the overlapping registers idea,
and that alone, even on a CISC, gives a great advantage.

2) What I should have said was that I wanted to see a RISC chip running
4.? BSD faster than a VAX.  The claims from UCB about the RISC I & II
were based on simulations which avoided the nasty problems of making
the kernel run.  As I noted before, hidden gotchas have a way of popping
up when you actually try to get something running.  Also, I want a
chip fabricated with the technology used for the M68000 when it came out.
It seems clear that, since a 68020 can run faster than a 780, a RISC
chip made now with leading-edge technology should also run faster.

3) The claims that a RISC is better have to be taken with 3 provisions:
  a) Technology is important (i.e. if you need to have 1 memory cycle
     per instruction you'd better have fairly fast memory relative to
     the CPU implementation).  This is the current state, but it may change.
  b) I take with a large grain of salt the claim that the RISC was designed
     faster than conventional micros, since a lot of what I suspect is
     complex on other micros is interrupt, trap, supervisor vs non-priv
     support and documentation, none of which the UCB people did much of.
  c) Although UCB claims to have "avoided complications" in their comparisons
     by using the same technology for the compiler (pcc) in comparisons,
     I think they introduced a very serious bias.  The pcc was never
     designed to generate good code, and the RISC architecture might simply
     be a better match for pcc than a CISC architecture.  This seems to
     be supported by the CAN article, which said that recoding some of the
     benchmarks for the RISC, 68000 and Z8000 in assembly resulted in
     code which was 1/2 the size of the RISC code and significantly faster.

In summary, I like the ideas of RISC; I'm not convinced they're the only
way to go, but it was a good area to explore.

Rich Hammond

mash@mips.UUCP (John Mashey) (08/11/85)

> 
>> Anyway, the RISC claims will always be slightly dubious to me until I
>> actually see a machine performing as claimed in a real situation.
>> Hidden gotchas have a way of getting missed in paper exercises.
>> /* End of text from uiucdcs:net.arch */
> 
> You probably won't have to wait TOO long since Acorn computer co. of
> England just announced (in Electronics magazine) that they have good,
> working die, the first time, after only 18 months of effort by 4
> people.  The machine is able to sustain 3 of its MIPS and running
> real programs (code produced by compilers) was "about 2 times a VAX
> 11/780" and 10 times an IBM PC-AT.  The article did not mention anything
> about memory management, so they may or may not be winning because of
> the lack of translation.....
> ....  True, we should not count our
> RISCs before they are hatched, but there are real advanteges to RISC.

1) Real advantages to RISC : yes.
2) lack of memory management: unless one is just interested in a point
product with a fairly narrow performance range, one must be exceedingly
careful in memory management design for RISC, or one rapidly discovers
that the result is a chip OK for controllers or unprotected systems, and
near-useless for reasonable multi-tasking operating systems.
3) (Opinion, perhaps quite biased): it's pretty easy to do a RISC that's
2X a 780.  What's harder, but necessary, is to figure out how the same
chip architecture gets you to 5X, and soon 10X in a reasonable way;
just running the clock faster doesn't do it.
-- 
-john mashey
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	415-960-1200
USPS: 	MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043

henry@utzoo.UUCP (Henry Spencer) (08/13/85)

> ... I was more
> concerned with whether the RISC retains its speed advantage when faced
> with large amounts of absolute addressing.

What, pray tell, would *require* large amounts of absolute addressing?
The number of simple variables in a program is generally modest, and they
tend to be local rather than global.  Arrays and such tend to require
address arithmetic anyway, so there is little penalty in simply parking
a pointer to them in an anonymous simple variable.
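
[ A small C illustration (mine, not Henry's) of "parking a pointer in a
simple variable": the large global object may live at an arbitrary absolute
address, but the code only ever needs the one pointer, which a compiler can
keep in a register. ]

long table[8192];           /* large global object; address formed once  */

long sum_table(void)
{
	long *p = table;        /* the "anonymous simple variable"           */
	long  s = 0;
	int   i;

	for (i = 0; i < 8192; i++)
		s += p[i];          /* ordinary address arithmetic off a pointer */
	return s;
}
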
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

boston@celerity.UUCP (Boston Office) (08/14/85)

In article <419@kontron.UUCP> steve@kontron.UUCP (Steve McIntosh) writes:
>[ From the DTACK newsletter #44 (August 1985) ]
>
>"RISC ARCHITECTURES:
>
>WHAT IS A MIP?
>
>Technically, a MIP is a million instructions per second. OK then,
>what's an instruction? Ah! That's a very good question!
>
>Take, for example, the following 68000 instruction:
>
>	MOVE.W D7,(A3)+
>
>That instruction stores the lower word of the 32-bit register D7 at the
>address contained in the 32-bit address register A3, and then
>increments A3 by two (two bytes = one word). Here is that instruction's
>equivalent for the Nat Semi 32016:
>
>	MOVW D7,(SB)
>	ADDQD 2,SB
>
>The same operation in a hypothetical RISC machine:
>
>	MOVE R7,(R#)
>	LOAD 2,R4
>	ADD  R4,R#
>
>For simplicity, suppose that those three computers each performed that
>equivalent instruction (or instruction sequence) in exactly one
>microsecond. Then the 68000 would be operating at 1 MIP, the 32000
>series at 2 MIPS, and the hypothetical RISC machine at 3 MIPS. 
>
>EACH COMPUTER WOULD BE PERFORMING EXACTLY THE SAME AMOUNT OF WORK!
>
... PRECISELY!  That is why, when evaluating a system's power, we need
standards apart from the individual architecture.  

Ask for Whetstone MIPS if you seek a general raw-power figure, or for other
benchmarks that better reflect your needs.  (When Celerity quotes MIPS, for
example, we quote Whetstone MIPS, even within a RISC-like architecture.)

eugene@ames.UUCP (Eugene Miya) (08/19/85)

I have been following this discussion for some time.  So what's the
bottom line?  If architecture is your thing, the proof of the pudding
is going to be running systems.  The critics of RISC then focus their
attack on a definition of MIPS.  We've had two (or more) "What is a
MIP?" letters in addition to the original one.

A couple of years ago, just before the Winter Usenix conference, I saw the
Massively Parallel Processor (MPP).  It is rated at a peak speed of 16
billion instructions per second (16 GIPS).  They only add later that this
is for 8-bit words with 1-bit serial data paths.  It burns me up when recently
I saw a television ad that Goodyear Aerospace did this `great' thing
for NASA at the cost of 10 million dollars. {oops, sorry, got out of
hand, flame off}

A problem is, certainly, how we measure things.  One letter brought out
the need to define what an instruction was.  The letter did not specifically
mention a property by name: that was `atomicity.'  Another problem is
a common base set of units: is there an appropriate conversion factor from
a 64-bit instruction to an 8-bit one?  Is a factor of 8 good enough?  Probably
not.  Let's keep "standards" out of the discussion for now, and explore this
a bit.  Whetstones were mentioned in another letter, but the only people
who use these are computer manufacturers.  [MGT]FLOPS are another ghastly
measure.  What qualities do our performance metrics need to have?

--eugene miya
  NASA Ames Research Center
  {hplabs,ihnp4,dual,hao,decwrl,allegra}!ames!aurora!eugene
  emiya@ames-vmsb

gas@lanl.ARPA (08/20/85)

Another metric is a sampling of large, mostly unmodifiable, commercial
codes.  My preference is MSC/Nastran (0.5e6+ lines of finite element code).
It runs on a surprising number of machines, and is a rigorous test not
only of performance (cpu and io) but also of the scientific/engineering
environment available to the average user.  It helps draw the line between
special purpose and general purpose environments (or, less tactfully,
usable and unusable machines).           george spix    gas@lanl
(MSC - MacNeal-Schwendler Corporation, Los Angeles)

jer@peora.UUCP (J. Eric Roskos) (08/22/85)

> Whetstones were mentioned in another letter, but the only people who use
> these are computer manufacturers.

This statement isn't true.  For example, back when I was a graduate-student
researcher in computer architectures, we used the Whetstones to test our
vertical-migration software.

> What qualities do our performance metrics need to have?

I think you need to make your performance measurements in such a way that
you get a set of distinct numbers which can be used analytically to determine
performance for a given program if you know certain properties of the
program.  For example:

1) The rate of execution of each member of the set of arithmetic operations
provided by the machine's instruction set, assuming the operands are all
in registers, with cache disabled.  This would give you an approximation of
the execution time of the algorithms for the arithmetic operations,
along with the instruction-fetch times, when not aided by caching.

2) The rate of execution of 1-word memory-to-memory moves, with cache
disabled.  This gives you the word-sized operand fetch and store times,
along with (again) the instruction-fetch times for these instructions.

3) The rate of execution of a tight loop performing only register-to-register
moves, with cache disabled.

4) The rate of execution of a tight loop performing only register-to-register
moves, with cache enabled.

5) The rate of execution of a tight loop performing (same word size as #3
and #4 above) memory-to-memory moves that produce all cache "hits", with
cache enabled.  Note that this gives you two properties of your cache: your
speedup for operand fetch and store resulting from caching, and any
performance penalties resulting from a write-through vs. write-back cache.

6) Specifications such as the number of registers available to the user,
the size of the cache, etc.

Well, you get the idea, anyway... personally I tend to feel that statistical
performance measurements are not nearly as useful as analytical ones; I
would rather see a list of fairly distinct performance properties of a
processor anytime, since I think you can do more with them in terms of
saying how the machine will perform for a given application.
To do this, you do need to understand your application, though.
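
[ A minimal sketch of the kind of measurement item 3 describes, added for
illustration: time a tight loop of register-to-register moves and report a
rate.  Portable C cannot disable the cache or force values into registers,
and an optimizing compiler will happily delete the loop, so this only shows
the shape of such a harness, not a faithful implementation of the metric. ]

#include <stdio.h>
#include <time.h>

#define ITERATIONS 10000000L

int main(void)
{
	register long a = 1, b = 2;
	long i;
	clock_t start, stop;
	double secs;

	start = clock();
	for (i = 0; i < ITERATIONS; i++) {
		a = b;                      /* register-to-register moves      */
		b = a;                      /* (compile without optimization)  */
	}
	stop = clock();

	secs = (double)(stop - start) / CLOCKS_PER_SEC;
	if (secs <= 0.0)
		secs = 1.0 / CLOCKS_PER_SEC;    /* avoid dividing by zero      */

	printf("%ld iterations, %.3f sec, %.0f moves/sec (a=%ld)\n",
	       ITERATIONS, secs, 2.0 * ITERATIONS / secs, a);
	return 0;
}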

I separated out the various forms of caching (operations in registers, and
use of a cache between the CPU and the primary memory) because so many
people "fudge" their results that way without giving any information from
which you can determine real performance.  The above list is just meant to
suggest "qualities" rather than being an exhaustive list; i.e., that the
performance metrics should reveal (rather than hide) the set of factors
that actually influence performance. [Unfortunately, this would never suit
most marketing organizations or customers, since they want an all-
encompassing number.]

The metrics should also be compiler-independent, especially if you are
making measurements on microcomputers, since the majority of microcomputer
compilers today generate terrible object code (see my posting a while back
of a "hand-compiled" 68000 program for the Macintosh in net.micro.mac for
an example of how bad this can be (and how little the significance of this
was understood!)).
-- 
Shyy-Anzr:  J. Eric Roskos
UUCP:       ..!{decvax,ucbvax,ihnp4}!vax135!petsd!peora!jer
US Mail:    MS 795; Perkin-Elmer SDC;
	    2486 Sand Lake Road, Orlando, FL 32809-7642

	"Gurl ubyq gur fxl/Ba gur bgure fvqr/Bs obeqreyvarf..."

chuck@dartvax.UUCP (Chuck Simmons) (08/23/85)

> A problem is, certainly, how we measure things.  One letter brought out
> the need to define what an instruction was.  The letter did not specifically
> mention a property by name: that was `atomicity.'  ...
> Whetstones were mentioned in another letter, but the only people
> who use these are computer manufacturers.  [MGT]FLOPS are another gastly
> measure.  What qualities do our performance metrics need to have?
> 
> --eugene miya

It might be interesting to define some fairly simple standard operations
and ask how long it takes to perform the operations.  Typical standard
operations might be:   (high-level language pseudo-code in parens)

add -- takes two words (at least 32 bits) from memory, adds them together,
and puts the result back in memory.  (A := B + C)

index -- picks up an array offset from memory, performs bounds checking
on the offset (we don't all write in C), and loads the indexed element into a
register.  (A[i])

ptr_load -- picks up a pointer and an offset into a record and loads
the appropriate word from the pointed to record.  (P->Record.Field)

array_loop -- load each element of an array into a register.  It is
cheating to assume that the array contains a special value at either
end.  (for i = 1 to n do ... A[i] ...;)

The advantage of using these simple operations instead of FLOPS is that
a lot of programs don't use floating point operations very much.  These
simple operations would be a better measure than even simpler instructions
because each operation does something "useful".  These operations can
also have advantages over high-level language benchmarks because they are
not dependent on the quality of a compiler.

The qualities that I am aiming for here are primarily usefulness and 
simplicity.  Each of the above operations will be found in a wide variety
of programs, and each operation should be easy to implement on most
machines in that machine's native assembler.  These are short pieces of
code that a compiler would generate fairly often.
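
[ One possible C rendering of the four operations above, added for
illustration; the names, sizes, and bounds-check behavior are my reading
of the pseudo-code, not Chuck's. ]

#include <stdlib.h>

#define N 1000

long A[N];                      /* array used by index and array_loop    */
long X, Y, Z;                   /* simple word-sized variables           */
struct rec { long other; long field; };

void op_add(void)               /* add:        A := B + C                */
{
	X = Y + Z;
}

long op_index(long i)           /* index:      A[i], with bounds check   */
{
	if (i < 0 || i >= N)
		abort();                /* "we don't all write in C"             */
	return A[i];
}

long op_ptr_load(struct rec *p) /* ptr_load:   P->Record.Field           */
{
	return p->field;
}

long op_array_loop(void)        /* array_loop: for i = 1 to n do A[i]    */
{
	long s = 0;
	long i;
	for (i = 0; i < N; i++)     /* no sentinel value at either end       */
		s += A[i];
	return s;
}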

-- Chuck
chuck@dartvax

bobbyo@celerity.UUCP (Bob Ollerton) (09/03/85)

I agree that using some real, heavy duty, commercial codes can be
a good way of measuring the performance of CPU architectures.
RISC CPUs can sometimes be difficult to get a handle on if the
particular implementation is strong in some cases, and weak in others.

Here are some results from a Finite Element Modeler from Swanson Analysis,
called ANSYS.  It is a large fortran program written quite a few years
ago and continuously enhanced.  It uses both single and double precision
math, I/O, and lots of virtual memory.

Please note that these results, while supplied by the various vendors,
are being presented to you by a biased source: me.

---------------------------------------------------------------------


Combined ANSYS benchmarks SP1, SP2, SP3, SP4:

Vendor           CPU seconds
-----------------------------

Prime 750          6505 
RIDGE 32           5750
APOLLO X60         5372
VAX 780 w/fpa      4574
DG MV8000          3290
IBM 4341-1         2973
Celerity C1200     2506


-- 
Bob Ollerton; Celerity Computing; 
9692 Via Excelencia; San Diego, Ca 92126; (619) 271 9940
{decvax || ucbvax || ihnp4}!sdcsvax!celerity!bobbyo
                              akgua!celerity!bobbyo