[net.arch] Floating point performance & Mr. Mashey's Mythical Mhz

kissell@garth.UUCP (Kevin Kissell) (10/16/86)


In article <722@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
>However, a useful attribute of Roger's measures (or a variant thereof)
>is that looking at the measure (units of real performance) per Mhz,
>you get some idea of architectural efficiency, i.e., smaller numbers are
>better, in that (cycle time) is likely to be a property of the technology,
>and hard to improve, at a given level of technology. [This is clearly
>a RISC-style argument of reducing the cycle count for delivered performance,
>and then letting technology carry you forward.]  Using the numbers above,
>one gets KiloWhets / Mhz, for example:

I don't understand how someone of John's sophistication can insist on
repeating such a clearly fallacious argument.  The statement "cycle time
is likely to be a property of the technology" is simply untrue, as I have
pointed out in previous postings.  Cycle time is the product of gate delays
(a property of technology) and the number of sequential gates between latches
(a property of architecture).  For example, let us consider two machines
that are familiar to John and myself and yet of interest to the newsgroup:
the MIPS R2000 and the Fairchild Clipper.  An 8 Mhz R2000 has a cycle time
of 125ns.  A 33Mhz Clipper has a cycle time of 30ns.  Yet both are built
with essentially the same 2-micron CMOS technology.  I somehow doubt that
Fairchild's CMOS transistors switch four times faster than that of whoever
is secretly building R2000s this week.  The difference is architectural.
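
To put rough numbers on that decomposition (the per-gate delay below is
invented, and only the ratio matters; a C back-of-the-envelope, not a claim
about either part's actual logic depth):

    #include <stdio.h>

    int main(void)
    {
        /* cycle time = gate delay * levels of logic between latches */
        double gate_delay_ns = 2.5;   /* hypothetical figure for 2-micron CMOS */

        printf("deep path, 50 levels:    %.0f ns\n", gate_delay_ns * 50); /* ~125 ns */
        printf("shallow path, 12 levels: %.0f ns\n", gate_delay_ns * 12); /* ~30 ns  */
        return 0;
    }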

As I understand it, the R2000 was designed to take advantage of delayed
load/branch techniques, and to execute instructions in a small number of
clocks, which in fact go hand-in-hand.  A load or branch can take as little
as two clocks.  But the addition of two numbers cannot take less than one
clock, and so the ALU has a leisurely 125ns to do something that it could
in principle have done more quickly, had it been more heavily pipelined.

The Clipper was designed from fairly well-established supercomputer and
mainframe techniques.  The cycle time is the time required to do the smallest
amount  of useful work - an integer ALU operation at 30ns.  Other instructions
must then of course be multiples of that basic unit.  Assuming cache hits,
a load takes 4/6 clocks (120/180ns vs 250ns for the R2000) and a branch takes
9 (270ns vs. 250ns for the R2000).
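
Spelled out in nanoseconds (a throwaway C fragment using the clock counts
claimed above):

    #include <stdio.h>

    int main(void)
    {
        double r2000_cycle = 125.0;    /* 8 MHz part  */
        double clipper_cycle = 30.0;   /* 33 MHz part */

        printf("load:   Clipper %g-%g ns, R2000 %g ns\n",
               4 * clipper_cycle, 6 * clipper_cycle, 2 * r2000_cycle);
        printf("branch: Clipper %g ns, R2000 %g ns\n",
               9 * clipper_cycle, 2 * r2000_cycle);
        return 0;
    }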

It should be noted that both machines allow for the overlapped execution
of instructions, but in different ways.  The R2000 overlaps register
operations with loads and branches using delay slots.  The Clipper
overlaps loads but not branches, using resource scoreboarding instead
of delay slots.  This means that the R2000 can branch more efficiently
(assuming the assembler can fill the delay slot), but the Clipper can
have more instructions executing concurrently than the R2000 (4 vs 2)
in in-line code.

Draw your own conclusions about "architectural efficiency".

>Machine	Mhz	KWhet	KWhet/Mhz
>80287		 8	 300	 40
>32332-32081	15	 728	 50		(these from Ray Curry,
>32332-32381	15	1200	 80		in <3833@nsc.UUCP>) (projected)
>32332-32310	15	1600	100*		"" "" (projected)
>Clipper?	33	1200?	 40		guess? anybody know better #?
>68881		12.5	 755	 60		(from discussion)
>68881		20	1240	 60		claimed by Moto, in SUN3-260
>SUN FPA	16.6	1700	100*		DP (from Hough) (in SUN3-160)
>MIPS R2360	 8	1160	140*		DP (interim, with restrictions)
>MIPS R2010	 8	4500	560		DP (simulated)

John's guess for the Clipper is off by over a factor of two.  The Clipper
FORTRAN compiler was brought up only recently.  In its present sane but
unoptimizing state, I obtained the following result on an Interpro 32C
running CLIX System V.3 at 33 Mhz (1 wait state), using a prototype Green
Hills Clipper FORTRAN compiler with Fairchild math libraries:

		Mhz	Kwhet	Kwhet/Mhz
Clipper		33	2920	Who cares?  Kwhet/Kg and Kwhet/cm2 are of
				more practical consequence.


Kevin D. Kissell
Fairchild Advanced Processor Division

gnu@hoptoad.uucp (John Gilmore) (10/17/86)

Speaking from MIPS, mash@mips.UUCP (John Mashey) writes:
>     ...looking at the measure (units of real performance) per Mhz,
>you get some idea of architectural efficiency, i.e., smaller numbers are
>better, in that (cycle time) is likely to be a property of the technology,
>and hard to improve, at a given level of technology. 

Speaking from Fairchild, kissell@garth.UUCP (Kevin Kissell) writes:
> I don't understand how someone of John's sophistication can insist on
> repeating such a clearly fallacious argument.  The statement "cycle time
> is likely to be a property of the technology" is simply untrue...

I love it!  The Intel/Motorola wars have been fun, but I'm glad they're
temporarily in abeyance.  Onward with the RISC versus RISC wars!   B*}
-- 
John Gilmore  {sun,ptsfa,lll-crg,ihnp4}!hoptoad!gnu   jgilmore@lll-crg.arpa
(C) Copyright 1986 by John Gilmore.             May the Source be with you!

jason@hpcnoe.UUCP (Jason Zions) (10/19/86)

jlg@lanl.ARPA (Jim Giles) / 12:59 pm  Oct 15, 1986 /
> Mflops:(Millions of FLoating point OPerations per Second)
> MHz: (Millions of cycles per second)
> 
> Therefore 'Mflops per MHz':(Millions^2 FLoating point OPeration cycles per
> 			    sec^2)
> 
'Scuse me?
	Mflops per MHz

	     =	Mflops / MHz

		Millions of FLOPs / Sec
	     =	------------------------
		Millions of Cycles / Sec

	     =	FLOPs per Cycle.

In other words, how many (or few) floating point operations happen in a
single cycle. Yeah, it's gonna be a small number, but not as silly a number
as your derivation shows.

> J. Giles
> Los Alamos
--
Jason Zions				Hewlett-Packard
Colorado Networks Division		3404 E. Harmony Road
Mail Stop 102				Ft. Collins, CO  80525
	{ihnp4,seismo,hplabs,gatech}!hpfcdc!hpcnoe!jason

hansen@mips.UUCP (10/19/86)

In article <377@garth.UUCP> kissell@garth.UUCP (Kevin Kissell) writes:
>I don't understand how someone of John's sophistication can insist on
>repeating such a clearly fallacious argument.  The statement "cycle time
>is likely to be a property of the technology" is simply untrue, as I have
>pointed out in previous postings.  Cycle time is the product of gate delays
>(a property of technology) and the number of sequential gates between latches
>(a property of architecture).  For example, let us consider two machines
>that are familiar to John and myself and yet of interest to the newsgroup:
>the MIPS R2000 and the Fairchild Clipper.  An 8 Mhz R2000 has a cycle time
>of 125ns.  A 33Mhz Clipper has a cycle time of 30ns.  Yet both are built
>with essentially the same 2-micron CMOS technology.  I somehow doubt that
>Fairchild's CMOS transistors switch four times faster than that of whoever
>is secretly building R2000s this week.  The difference is architectural.

	"cycle time is likely to be a property of the technology" is
	clearly a simplification that is useful for making relatively
	crude comparisons between widely varying machine designs.
	Cycle time, while a crude measure, has the advantage
	that it is clearly observable and well-documented.

	In practice, the number of sequential gates between latches
	is also generally a property of the technology, given that
	designers are attempting to optimize their own design.
	It is counterproductive to over-pipeline a design, as
	pipe registers themselves add delay and complexity.
	Let me emphasize, however, that I do not intend to
	assert that the Fairchild design is over-pipelined.

	Now let us address the general issue of a comparison
	of the technology of the two machines discussed above,
	(two machines that were clearly chosen entirely at random).
	It is indeed safe to assume that an 8 MHz R2000 has a
	cycle time of 125 ns. However, 8 MHz is not the maximum
	clock rate that the silicon will support - that figure
	is 16.67 MHz, or a cycle time of 60 ns (worst case over commercial
	temperatures). This 16.67 MHz R2000 part is built in a
	2-micron CMOS technology, and Fairchild's part is
	built in a process that is also described as a 2-micron CMOS
	technology. However, the phrase "2-micron CMOS technology"
	is actually very vague.
	
	The available public literature from both companies is
	not sufficient to compare these technologies point-by-point,
	but I fully expect that Fairchild has pushed harder
	on effective transistor gate length and oxide thickness to
	reach 33 MHz than MIPS has yet employed to reach 16.67 MHz.
	A difference in comparable gate speed of a factor of
	two is actually entirely plausible, though we believe the
	actual ratio is more on the order of 1.5.

	We have been getting our process technology from the same
	suppliers week after week. By using a slightly less aggressive
	technology, we are able to get reliable, multiple-sourced processing.

>As I understand it, the R2000 was designed to take advantage of delayed
>load/branch techniques, and to execute instructions in a small number of
>clocks, which in fact go hand-in-hand.  A load or branch can take as little
>as two clocks.  But the addition of two numbers cannot take less than one
>clock, and so the ALU has a leisurely 125ns to do something that it could
>in principle have done more quickly, had it been more heavily pipelined.

	I have to disagree on several of the points claimed here.
	The R2000 design will execute load and branch instructions
	at a rate of one instruction per cycle (a 60 ns cycle),
	and takes one 60 ns cycle to perform an integer ALU operation.
	In fact, the R2000 will execute ALL instructions in a
	single cycle, which substantially simplified the design.
	It is, of course, entirely untrue that the addition of
	two numbers cannot take less than one clock, but this is
	not the heart of the matter: the integer ALU is not
	the critical path in the R2000 design.

>The Clipper was designed from fairly well-established supercomputer and
>mainframe techniques.  The cycle time is the time required to do the smallest
>amount  of useful work - an integer ALU operation at 30ns.  Other instructions
>must then of course be multiples of that basic unit.  Assuming cache hits,
>a load takes 4/6 clocks (120/180ns vs 250ns for the R2000) and a branch takes
>9 (270ns vs. 250ns for the R2000).

	Correcting the numbers above, we have 120/180 ns (Clipper)
	vs. 60 ns (R2000) for a load, and 270 ns vs 60 ns for a branch.

>It should be noted that both machines allow for the overlapped execution
>of instructions, but in different ways.  The R2000 overlaps register
>operations with loads and branches using delay slots.  The Clipper
>overlaps loads but not branches, using resource scoreboarding instead
>of delay slots.  This means that the R2000 can branch more efficiently
>(assuming the assembler can fill the delay slot), but the Clipper can
>have more instructions executing concurrently than the R2000 (4 vs 2)
>in in-line code.

	Resource scoreboarding is no more effective at using load
	delay slots (which are delays inherent in the computation)
	than static scheduling. Since instructions are issued in
	the order in which they are presented in a scoreboard
	controller, an operation that depends on the value of
	a pending load instruction must wait for
	the load to complete on either machine. The number of
	delay cycles is, however, an important factor in
	determining performance. It is hardly advantageous
	to have 4-cycle (or is it 6-cycle?) load instructions,
	no matter how slickly this is portrayed as a feature with
	the phrase "can have more instructions executing concurrently."
	The R2000 can fill the delay slot with a useful instruction
	(which can even be an additional load instruction) over 70%
	of the time. With what frequency can Clipper compilers find
	three instructions, none of which can be a load, to
	fill the three load delay slots on a Clipper?
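
	A toy model of the point: whether the delay is exposed as slots
	the compiler fills or hidden behind a scoreboard interlock, a
	dependent instruction still waits out the load latency; what
	matters is how much of that latency can be covered with
	independent work.  (Made-up delay figures, a sketch only.)

	#include <stdio.h>

	/* delay cycles left uncovered by independent instructions */
	static int wasted_cycles(int load_delay, int fillers)
	{
	    int fill = fillers < load_delay ? fillers : load_delay;
	    return load_delay - fill;
	}

	int main(void)
	{
	    printf("1-cycle delay, 1 filler:  %d wasted\n", wasted_cycles(1, 1));
	    printf("3-cycle delay, 1 filler:  %d wasted\n", wasted_cycles(3, 1));
	    printf("3-cycle delay, 3 fillers: %d wasted\n", wasted_cycles(3, 3));
	    return 0;
	}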

>Draw your own conclusions about "architectural efficiency".

	The Clipper designers claim 5 MIPS performance at 33 MHz,
	while the R2000 performs at 10 MIPS at 16.67 MHz.
	The Fairchild technology is as much as twice as
	aggressive as the R2000 technology, but the Clipper
	only achieves half the performance. My conclusion
	is that the R2000 is two to four times as "efficient"
	an architecture.

	For Clipper to reach the same performance in the same technology,
	using their current architecture, they need 66 MHz parts,
	with an input clock rate well above the broadcast FM radio band.
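
	The arithmetic behind the last two paragraphs, for anyone who
	wants to check it (C, with the MIPS figures as claimed above):

	#include <stdio.h>

	int main(void)
	{
	    double clipper_mips = 5.0,  clipper_mhz = 33.0;
	    double r2000_mips  = 10.0,  r2000_mhz   = 16.67;

	    double clipper_eff = clipper_mips / clipper_mhz;  /* ~0.15 MIPS/MHz */
	    double r2000_eff   = r2000_mips / r2000_mhz;      /* ~0.60 MIPS/MHz */

	    printf("efficiency ratio: %.1f\n", r2000_eff / clipper_eff);
	    printf("clock for Clipper to reach %.0f MIPS: %.0f MHz\n",
	           r2000_mips, r2000_mips / clipper_eff);
	    return 0;
	}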

>>Machine	Mhz	KWhet	KWhet/Mhz
>>80287		 8	 300	 40
>>32332-32081	15	 728	 50		(these from Ray Curry,
>>32332-32381	15	1200	 80		in <3833@nsc.UUCP>) (projected)
>>32332-32310	15	1600	100*		"" "" (projected)
>>Clipper?	33	1200?	 40		guess? anybody know better #?
>>68881		12.5	 755	 60		(from discussion)
>>68881		20	1240	 60		claimed by Moto, in SUN3-260
>>SUN FPA	16.6	1700	100*		DP (from Hough) (in SUN3-160)
>>MIPS R2360	 8	1160	140*		DP (interim, with restrictions)
>>MIPS R2010	 8	4500	560		DP (simulated)
>
>John's guess for the Clipper is off by over a factor of two.  The Clipper
>FORTRAN compiler was brought up only recently.  In its present sane but
>unoptimizing state, I obtained the following result on an Interpro 32C
>running CLIX System V.3 at 33 Mhz (1 wait state), using a prototype Green
>Hills Clipper FORTRAN compiler with Fairchild math libraries:
>
>		Mhz	Kwhet	Kwhet/Mhz
>Clipper	33	2920	Who cares?  Kwhet/Kg and Kwhet/cm2 are of
>				more practical consequence.
>
>Kevin D. Kissell
>Fairchild Advanced Processor Division

Clipper		33	2920	90 = Kwhet/MHz

	I'd like to thank Kevin for providing this performance data
	and to point out that this ratio is a respectable accomplishment
	on Fairchild's part - this number is comparable to the
	values obtained by using multiple-chip FP processors
	built with Weitek arithmetic units and interfaced to
	microcoded processors. While the FP arithmetic operations
	take longer in the Clipper than in Weitek parts
	(which are built in an unmistakably slower technology),
	by reducing communications overhead, the overall performance
	comes out comparably well.

	Let me make clear why Kwhet/MHz or MIPS/MHz ratios are useful: 
	they provide some insight into where the emphasis was placed 
	in the design, and how far future derivative designs can reach.
	It's my view that Kevin's remarks confirm that the Clipper design 
	was intended from the start to build a machine with a low MIPS/MHz
	ratio, with the clock rate based on the lowest conceivable
	executable unit. It should also be clear what level of 
	architectural efficiency results from optimizing integer
	ALU operations (Clipper), rather than from optimizing the architecture
	to execute load, store and branch operations (MIPS).

-- 

Craig Hansen			|	 "Evahthun' tastes
MIPS Computer Systems		|	 bettah when it
...decwrl!mips!hansen		|	 sits on a RISC"

mash@mips.UUCP (10/19/86)

In article <377@garth.UUCP> kissell@garth.UUCP (Kevin Kissell) writes:
>In article <722@mips.UUCP> mash@mips.UUCP (John Mashey) writes:

>...that are familiar to John and myself and yet of interest to the newsgroup:
>the MIPS R2000 and the Fairchild Clipper.  An 8 Mhz R2000 has a cycle time
>of 125ns.  A 33Mhz Clipper has a cycle time of 30ns.  Yet both are built
>with essentially the same 2-micron CMOS technology.  I somehow doubt that
>Fairchild's CMOS transistors switch four times faster than that of whoever
>is secretly building R2000s this week.  The difference is architectural.

(One of my colleagues got here first, hansen@mips, in 726@mips.UUCP,
so I'll just add a few notes where they don't overlap too much.)

There was no intent in the original posting to start a MIPS versus 
Clipper war [contrary to John Gilmore's posting in <1198@hoptoad.uucp>:
sorry John, another Moto versus Intel battle we do not need, fun though
it may be to watch!]  I was only trying to be reasonably inclusive of 
relevant 32-bit micros.  However, now that the issue has been raised.....

An 8Mhz R2000 isn't pushing the technology very hard, ON PURPOSE!!!
8Mhz parts appear first, followed by 12s and 16s, for the same reasons you got
12Mhz 68020s before 16s and 25s. Also, I'm told that the 2u design doesn't
push 2u technology as hard as it might have, in order to let the same
design be shrunk to 1.5u and 1.2u with minimal effort.

Now, the reason one might care about MWhets/MHz (or any similar measure
that compares the delivered real performance with some basic technology
speed) is to understand the margin and headroom in a design.
Since Kevin brought the issue up, some hypothetical questions:
	a) Will there be 66Mhz Clippers in 2u CMOS?
		[To get actual performance like 16Mhz R2000 in 2u;]
		[If the answer is yes, I know a bunch of people, not all at
		MIPS, either, who have some real tough questions involving
		transmission-line effects, how to do ECL or other reduced-
		voltage-swing I/O, etc.]
	b) If they will be, what year will they be?
		[1987?]
	c) When will there be bigger / (more in parallel) CAMMU chips?
		[Because if there aren't, how are the caches going to
		get enough bigger to keep the delivered performance in line
		with the CPU clock speed improvements? (for real programs)?
		Chips get faster with shrinks, but they don't magically
		get re-laid-out to acquire more memory.  CAMMU chips have
		some good ideas in them, but they're not very big, especially
		compared with the needs of some of the real programs that
		people would like to run on high-performance micros. (There
		is some real nasty stuff lurking out there!  People keep
		putting them on our machines, so we know....If the Clipper
		FORTRAN compilers just came up recently, and they haven't
		yet tried running 500KLOC FORTRAN programs...interesting
		times are ahead....)]
>
>The Clipper was designed from fairly well-established supercomputer and
>mainframe techniques....

"fairly well-established supercomputer and mainframe techniques"
is interesting.  I can think of 2 ways to read this assertion:
	a) High-performance VLSI designs should be done just like big
	machines.
OR
	b) High-performance VLSI should be designed with good understanding
	of big machines, as well as good understanding of the tradeoffs
	necessary for VLSI [margin, headroom, packaging constraints, processes,
	etc, etc], where those are different from the design tradeoffs of
	the big ECL boxes.
I hope Kevin meant b), which most people would agree with.
>
>John's guess for the Clipper is off by over a factor of two.  The Clipper

Thanks for the info: all I'd seen were random guesses from people around
the net, and it's a useful contribution to see numbers from somebody
that knows.  Hopefully, we'll see more?  [I assume that was DP?]

>FORTRAN compiler was brought up only recently.  In its present sane but
>unoptimizing state, I obtained the following result on an Interpro 32C
>running CLIX System V.3 at 33 Mhz (1 wait state), using a prototype Green
>Hills Clipper FORTRAN compiler with Fairchild math libraries:
>
>		Mhz	Kwhet	Kwhet/Mhz
>Clipper		33	2920	Who cares? Kwhet/Kg and Kwhet/cm2 are of
>				more practical consequence.

As hansen@mips noted, these are reasonable results, and I'd assume they'll
improve somewhat with more mature compiler technology.

Actually, this raises a set of questions that might be of general interest
in this newsgroup, basically:
1) What metrics are interesting?
2) How do you define them?
3) In what problem domains are they relevant?
4) What are different constraints that people use?
5) How do different metrics correlate, specifically, are some of the simpler
(easier-to-measure) good predictors of the more complex ones?

For example, here are some metrics, all of which have appeared in this
newsgroup at some time or other.  Proposals are solicited:

a) Clock rate. (Mhz) --
b) Peak Mips [i.e., typically back-to-back cached, register-register adds]. --
c) Sustained Mips ?
d) Benchmark performance relative to other computers  ++
e) Peak Mflops [i.e., "" "" for FP] --
f) Dhrystones
g) Whetstones +
h) LINPACK MFLops ++
i) Kwhets / Mflops [g/e] -
j) Kwhets / Mhz [g/a] +
k) Kg
l) cm2 (or cm3)
m) Watts
n) $$ +++
o) Kwhets / Kg [g/k]
p) Kwhets / cm2 [g/l] +
q) Kwhets / Watt [g/m] +
r) (any of the above) / $$ +++(esp if d))
---------
(-- & ++ indicate general impression of these metrics)
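
To make metric j) concrete, here is the trivial computation over a few of the
figures already posted in this thread (a sketch; the data come from the KWhet
table above, and the rounding differs slightly from what was posted there):

    #include <stdio.h>

    struct machine { const char *name; double mhz, kwhet; };

    int main(void)
    {
        struct machine m[] = {
            { "68881 @ 12.5", 12.5,  755 },
            { "SUN FPA",      16.6, 1700 },
            { "Clipper",      33.0, 2920 },
            { "MIPS R2010",    8.0, 4500 },
        };
        int i, n = sizeof m / sizeof m[0];

        for (i = 0; i < n; i++)
            printf("%-14s %5.1f MHz  %5.0f KWhet  %4.0f KWhet/MHz\n",
                   m[i].name, m[i].mhz, m[i].kwhet, m[i].kwhet / m[i].mhz);
        return 0;
    }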

What's interesting is that people have all sorts of different constraint
combinations or optimization functions over any of these.  Let me try
a few examples, and solicit some more:
1) Maximize g), h) etc, subject to few constraints, i.e., for people who
	buy CRAYs, etc, money is (almost) no object.
2) Maximize one of the performance numbers, subject to some constraint.
	The constraint might be:
	absolute cm2 or cm3, as in some avionics things, i.e., if it
	doesn't fit, it doesn't matter how fast it is!
	$$: get me the most for some fixed amount of money, and I don't
	care if it's 2X faster, even if it's more cost-effective.
3) Performance may not be particularly important at all, relative to
object-code compatibility, software availability, service, etc.

Comments? What sorts of metrics are important to the people who read
this newsgroup? What kinds of constraints?  How do you buy machines?
If you buy CPU chips, how do you decide what to pick?
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

rentsch@unc.UUCP (Tim Rentsch) (10/27/86)

In article <727@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
> Now, the reason one might care about MWhets/MHz (or any similar measure
> that compares the delivered real performance with some basic technology
> speed) is to understand the margin and headroom in a design.

There is a subtle pitfall in arguing that FLOPS/HZ (or IPS/HZ) is a
measure of architectural "goodness".  Certainly, measuring FLOPS/HZ
is a reasonable attempt to factor out the particulars of the device
fabrication, which are obviously irrelevant to architecture.  (If
your chip runs twice as fast as my chip only because it is 5 times
as small, your process technology is better than mine, but your
architecture may not be.)  BUT -- and here is the pitfall -- it just
might be that given identical fabrication methods, the better
FLOPS/HZ choice would still run slower because it would not support
the higher clock rate.  RISC proponents would argue that one reason
for having simple instruction sets is to *lower the cycle time* so
that the machine can run faster and get more work done.  Your
machine's FLOPS/HZ may be twice as good as mine, but if my HZ is
three times yours (in identical technology), my machine is faster --
and so my architecture is better.
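
In numbers (invented figures, but the same shape as the argument):

    #include <stdio.h>

    int main(void)
    {
        /* A does twice the work per cycle; B clocks three times as fast
         * in the same (hypothetical) technology. */
        double a_flop_per_cycle = 1.0, a_mhz = 10.0;
        double b_flop_per_cycle = 0.5, b_mhz = 30.0;

        printf("A: %.0f MFLOPS\n", a_flop_per_cycle * a_mhz);   /* 10 */
        printf("B: %.0f MFLOPS\n", b_flop_per_cycle * b_mhz);   /* 15 */
        return 0;
    }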


> Comments? What sorts of metrics are important to the people who read
> this newsgroup? What kinds of constraints?  How do you buy machines?
> If you buy CPU chips, how do you decide what to pick?

The metrics I'm interested in measure speed.  (Basically, I'm hooked
on fast machines.)  Other constraints are less interesting because:
(1) I will buy the fastest machine I can afford, and (2) in terms of
architecture, speed is the bottom line -- all else is just
mitigating circumstances. ("I know machine X runs 3 times as fast as
machine Y, but machine X is Gallium Arsenide."  Compare
architectures, not technologies.)

Here are my favorite metrics (in no particular order):

(1) micro-micro-benchmark:  well defined task, with well defined
algorithm, hand coded in lowest level language available (microcode
if it comes to that) by arbitrarily clever programmer who can take
advantage of all machine dependencies (instruction timings, overlaps
and/or interlocks, special instructions, cache sizes, etc.).
Algorithm can change slightly to take advantage of machine
characteristics, but must be "recognizable".

(1a) same as above, but at assembly language level.  instruction set
cleverness is allowed;  microcode and special knowledge such as
cache size is not.

(2) micro-benchmark: well defined task, with algorithm given in some
particular programming language (and benchmark must be compiled from
the given algorithm).  The point here is to measure the speed of the
machine in "typical" situations, including compiler effectiveness.
the time taken to do the compile is irrelevant, as long as it is
reasonably finite.

(3) macro-benchmark: the problem with (1) and (2) is that they don't
measure all kinds of things that inevitably take place in real
systems.  (on the other hand (1) and (2) are easy to run, and also
easy to fudge, so they are more often done.)  a macro-benchmark is
like (2) in having a given program, except that the given program is
very large, so that code size is comparable to amount of real memory
on the machine (hopefully code > real memory).  now the
effectiveness of the machine for problems-in-the-large will be
measured, including things like swapping speeds and TLB hit rates,
etc.  sadly, this is a vague measure because there are so few large
programs which can be used as the benchmark, and many variable
parameters creep in (such as how fast the disk seeks are, etc.).
even so, it is worth remembering that speed in the small is
different from speed in the large, and that the latter is really
what we desire.  (or should that be, "what I desire"?  :-)

cheers,

txr

crowl@rochester.ARPA (Lawrence Crowl) (10/27/86)

>>In article <727@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
>> Now, the reason one might care about MWhets/MHz (or any similar measure
>> that compares the delivered real performance with some basic technology
>> speed) is to understand the margin and headroom in a design.

>In article <103@unc.unc.UUCP> rentsch@unc.UUCP (Tim Rentsch) writes:
> There is a subtle pitfall in arguing that FLOPS/HZ (or IPS/HZ) is a measure
> of architectural "goodness".  Certainly, measuring FLOPS/HZ is a reasonable
> attempt to factor out the particulars of the device fabrication, which are
> obviously irrelevant to architecture.  ...  BUT -- and here is the pitfall
> -- it just might be that given identical fabrication methods, the better
> FLOPS/HZ choice would still run slower because it would not support
> the higher clock rate.  

Perhaps what we are missing is that for a given level of technology, a longer
clock cycle allows us to have a larger depth of combinational circuitry.  That
is, we can have each clock work through more gates.  So, a 4 MHz clock which
governs propogation through a combinational circuit 4 gates deep will do
roughly the same work as a 1 MHz clock governing propogation through a
combinational circuit 16 gates deep.  Perhaps a better measure is the depth of
gates required to implement a FLOP, (or an instruction, or a window, etc.).
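
The idealized version of that tradeoff, in C (invented numbers; the point is
only that clock rate times logic depth per cycle is what stays constant):

    #include <stdio.h>

    int main(void)
    {
        double mhz[]   = { 4.0,  1.0 };
        double depth[] = { 4.0, 16.0 };
        int i;

        for (i = 0; i < 2; i++)
            printf("%.0f MHz x %2.0f gate levels = %.0f million gate-levels/sec\n",
                   mhz[i], depth[i], mhz[i] * depth[i]);
        return 0;
    }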

The very fast clock, heavily pipelined machines like the Cray and Clipper
follow the first approach, while the slower clock, less pipelined machines
like the Berkeley RISC and MIPS follow the second approach.  Which is better is
probably dependent upon the technology used to implement the architecture and
the desired speed.  For instance, if we want a very fast vector processor, we
should probably choose the fast clock, more pipelined architecture.  If we want
a better price/performance ratio, we should probably choose the slow clock,
less pipelined architecture.

BOLD UNSUPPORTED CLAIM: The "best" architecture is technology dependent.  The
quality of an architecture is dependent on the technology used to implement it,
and no architecture is "best" under more than a limited range of technologies.
For instance, under technologies in which the bandwidth to memory is most
limited, stack architectures (Burroughs, Lilith) will be "better".  Under 
technologies where the ability to process instructions is most limited, the
wide register to register architectures will be "better".
-- 
  Lawrence Crowl		716-275-5766	University of Rochester
			crowl@rochester.arpa	Computer Science Department
 ...!{allegra,decvax,seismo}!rochester!crowl	Rochester, New York,  14627

bcase@amdcad.UUCP (Brian Case) (10/28/86)

>Perhaps what we are missing is that for a given level of technology, a longer
>clock cycle allows us to have a larger depth of combinational circuitry.  That
>is, we can have each clock work through more gates.  So, a 4 MHz clock which
>governs propagation through a combinational circuit 4 gates deep will do
>roughly the same work as a 1 MHz clock governing propagation through a
>combinational circuit 16 gates deep.  Perhaps a better measure is the depth of
>gates required to implement a FLOP, (or an instruction, or a window, etc.).

Yes, but if the 4 Mhz/4 gates implementation can support pipelining and the
pipeline can be kept full (one of the major goals of RISC), then it will do
4 times the work at 4 times the clock speed; in other words the FLOPS/MHz or
MIPS/MHz or whatever/MHz will be the same!  Thus, I still think this isn't
such a bad metric to use for comparison.  If pipelining can't be implemented
or the pipeline can't be kept full for a reasonable portion of the time,
then the FLOPS/MHz will indeed go down, making FLOPS/MHz a misleading indicator.

>The very fast clock, heavily pipelined machines like the Cray and Clipper
>follow the first approach, while the slower clock, less pipelined machines
>like the Berkeley RISC and MIPS follow the second approach.  Which is better is

Now wait a minute.  I don't think anyone at Berkeley, Stanford, or MIPS Co.
will agree with this statement.  The clock speeds may vary among the machines
you mention, but that is basically a consequence of implementation technology.
I think everyone is trying to make pipestages as short as possible so that
future implementations will be able to exploit future technology to the
fullest extent.

>probably dependent upon the technology used to implement the architecture and
>the desired speed.  For instance, if we want a very fast vector processor, we
>should probably choose the fast clock, more pipelined architecture.  If we want
>a better price/performance ratio, we should probably choose the slow clock,
>less pipelined architecture.

I certainly agree that if a very fast vector processor is required, the highest
clock speed possible with the most pipelining that makes sense should be
chosen.  But why should we choose a different approach for the better price/
performance ratio?  Unless you are trying only to decrease price (which is
not the same as increasing price/performance), one should still aim for the
highest possible clock speed and pipelining.  If the price/performance is
right, I don't care if my add takes one cycle at 1 MHz or 4 at 4Mhz.  In
addition, for little extra cost (I claim but can't unconditionally prove),
the 4 at 4 Mhz version will in some cases give me the option of 4 times the
throughput.  I do acknowledge that I am starting to talk about a machine
for which FLOPS/MHz may not be a good comparison metric.

>BOLD UNSUPPORTED CLAIM: The "best" architecture is technology dependent.  The
>quality of an architecture is dependent on the technology used to implement it,
>and no architecture is "best" under more than a limited range of technologies.
>For instance, under technologies in which the bandwidth to memory is most
>limited, stack architectures (Burroughs, Lilith) will be "better".  Under 
>technologies where the ability to process instructions is most limited, the
>wide register to register architectures will be "better".

I agree that technology influences (or maybe "should influence") architecture.
But I don't think limited memory bandwidth indicates a stack architecture,
rather, I would say a stack architecture is contraindicated!  If memory
bandwidth is a limiting factor on performance, then many registers are needed!
Optimizations which reduce memory bandwidth requirements are those that keep
computed results in registers for later re-use; such optimizations are
difficult, at best, to realize for a stack architecture.  When you say "the
ability to process instructions is most limited" I guess that you mean "the
ability to fetch instructions is most limited" (because any processor whose
ability to actually process its own instructions is most limited is probably
not worth discussing).  In this case, I would think that shorter instructions
in which some part of operand addressing is implicit (e.g. instructions for a
stack machine) would be indicated; "wide register to register" instructions
would simply make matters worse.  Probably the best thing to do is design the
machine right the first time, i.e. give it enough instruction bandwidth.

I fear that this posting reads like a flame; it is not intended to be a flame.

mash@mips.UUCP (John Mashey) (10/29/86)

In article <21944@rochester.ARPA> crowl@rochtest.UUCP (Lawrence Crowl) writes:
>>>In article <727@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
> ... MWhets/Mhz, etc, as way to factor out transient technology...
>
>Perhaps what we are missing is that for a given level of technology, a longer
>clock cycle allows us to have a larger depth of combinational circuitry.  That
>is, we can have each clock work through more gates.  So, a 4 MHz clock which
>governs propagation through a combinational circuit 4 gates deep will do
>roughly the same work as a 1 MHz clock governing propagation through a
>combinational circuit 16 gates deep.  Perhaps a better measure is the depth of
>gates required to implement a FLOP, (or an instruction, or a window, etc.).
Can you suggest some numbers for different machines? One of the reasons
I proposed a (simplistic) measure is the absolute difficulty of finding
such things out.
>
>
>BOLD UNSUPPORTED CLAIM: The "best" architecture is technology dependent.  The
>quality of an architecture is dependent on the technology used to implement it,
>and no architecture is "best" under more than a limited range of technologies.
>For instance, under technologies in which the bandwidth to memory is most
>limited, stack architectures (Burroughs, Lilith) will be "better".  Under 
>technologies where the ability to process instructions is most limited, the
>wide register to register architectures will be "better".

Much of this seems true.  We always claim that the real meaning of RISC in
VLSI RISC is "Response to Inherent Shifts in Computer technology", i.e.
in hardware: fast, dense, cheap SRAMs; higher-pincount VLSI packages,
and in software: more use of high-level languages; portable OS's like UNIX.
In the days of core memories, it is likely that the more aggressively
undense RISCs [i.e., those with only 32-bit instructions] would have
been bad ideas for anything but high-end machines.
Given: TTL, NMOS, CMOS, ECL, GaAs, for example, it would be interesting to
hear what people think who are / have implemented the same machine over multiple
technologies [such as DEC VAXen, IBM 370s, HP Spectrums, all of which
are supposed to exist in at least 3 of the first 4 of the above; I think most
GaAs designs are RISCs, given smaller gate counts.]
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

rentsch@unc.UUCP (Tim Rentsch) (11/02/86)

In article <1903@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:

> I must disagree.  Reliability is at least as important as speed.

Right.  But reliability is a measure of the *implementation*, not of
the architecture.  

cheers,

txr

crowl@rochester.ARPA (Lawrence Crowl) (11/04/86)

>>> mash@mips.UUCP (John Mashey)
)) crowl@rochtest.UUCP (Lawrence Crowl)
> bcase@amdcad.UUCP (Brian Case)
] mash@mips.UUCP (John Mashey)
crowl@rochtest.UUCP (Lawrence Crowl)

>>> ... MWhets/Mhz, etc, as way to factor out transient technology...

))Perhaps what we are missing is that for a given level of technology, a longer
))clock cycle allows us to have a larger depth of combinational circuitry.  That
))is, we can have each clock work through more gates.  So, a 4 MHz clock which
))governs propagation through a combinational circuit 4 gates deep will do
))roughly the same work as a 1 MHz clock governing propagation through a
))combinational circuit 16 gates deep.  Perhaps a better measure is the depth of
))gates required to implement a FLOP, (or an instruction, or a window, etc.).

]Can you suggest some numbers for different machines? One of the reasons
]I proposed a (simplistic) measure is the absolute difficulty of finding
]such things out.

No, I cannot suggest numbers.  I suspect they would be difficult to obtain.
Maybe I should think more next time.

>Yes, but if the 4 Mhz/4 gates implementation can support pipelining and the
>pipeline can be kept full (one of the major goals of RISC), then it will do
>4 times the work at 4 times the clock speed; in other words the FLOPS/MHz or
>MIPS/MHz or whatever/MHz will be the same!  Thus, I still think this isn't
>such a bad metric to use for comparison.  If pipelining can't be implemented
>or the pipeline can't be kept full for a reasonable portion of the time,
>then the FLOPS/MHz will indeed go down, making FLOPS/MHz a misleading indicator.

One of us is confused here, and I do not know which.  Assume an instruction takes
a constant 16 gate delays of combinational logic.  The 4 MHz, 4-gate-deep design
will require 4 stages, while the 1 MHz, 16-gate-deep design will require one
stage.  Both machines will
execute 1 MIPS.  But they have a factor of 4 difference in MHz/MIPS.  If we
pipeline the 4 MHz and 4 gates into a four stage pipeline, the MHz/MIPS will
be the same but the performance will be a factor of 4 different.
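
The same example in C, to make the distinction explicit (16 gate delays per
instruction assumed, as above):

    #include <stdio.h>

    int main(void)
    {
        double gates_per_insn = 16.0;

        /* unpipelined: MIPS = clock * (gate levels per cycle) / (gates per insn) */
        printf("1 MHz, 16 deep, unpipelined: %.0f MIPS\n", 1.0 * 16.0 / gates_per_insn);
        printf("4 MHz,  4 deep, unpipelined: %.0f MIPS\n", 4.0 *  4.0 / gates_per_insn);

        /* pipelined 4-stage version retires one instruction per cycle */
        printf("4 MHz,  4 deep, 4-stage pipeline: %.0f MIPS\n", 4.0);
        return 0;
    }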

))The very fast clock, heavily pipelined machines like the Cray and Clipper
))follow the first approach, while the slower clock, less pipelined machines
))like the Berkeley RISC and MIPS follow the second approach.  Which is better is

>Now wait a minute.  I don't think anyone at Berkeley, Stanford, or MIPS Co.
>will agree with this statement.  The clock speeds may vary among the machines
>you mention, but that is basically a consequence of implementation technology.
>I think everyone is trying to make pipestages as short as possible so that
>future implementations will be able to exploit future technology to the
>fullest extent.

There are at least two approaches, exemplified by the following two examples.
The first has a clock controlling progress through three stages from the
register bank to the ALU, through the ALU, and back to the register bank.  
The second approach is to do all this in one stage.  The first approach has
the potential to pipe while the second has a lower clock rate.  In both cases
faster clock rates allow faster implementations.  Which machines take which
approach?

))probably dependent upon the technology used to implement the architecture and
))the desired speed.  For instance, if we want a very fast vector processor, we
))should probably choose the fast clock, more pipelined architecture.  If we
))want a better price/performance ratio, we should probably choose the slow
))clock, less pipelined architecture.

>I certainly agree that if a very fast vector processor is required, the highest
>clock speed possible with the most pipelining that makes sense should be
>chosen.  But why should we choose a different approach for the better price/
>performance ratio?  Unless you are trying only to decrease price (which is
>not the same as increasing price/performance), one should still aim for the
>highest possible clock speed and pipelining.  If the price/performance is
>right, I don't care if my add takes one cycle at 1 MHz or 4 at 4Mhz.  In
>addition, for little extra cost (I claim but can't unconditionally prove),
>the 4 at 4 Mhz version will in some cases give me the option of 4 times the
>throughput.  I do acknowledge that I am starting to talk about a machine
>for which FLOPS/MHz may not be a good comparison metric.

Higher clock rates generally imply higher quality parts, more EMI shielding,
etc, which implies a higher cost.  You do not expect a 3000 RPM engine to
cost the same as an 8000 RPM engine, do you?  In addition, exploiting pipeline
potential generally costs significant development effort and gates to control
the piping.  Now, adding some pipelining to a simple scheme is probably cost
effective, but adding as much as is possible is not.  We must find a balance.

))BOLD UNSUPPORTED CLAIM: The "best" architecture is technology dependent.  The
))quality of an architecture is dependent on the technology used to implement
))it, and no architecture is "best" under more than a limited range of
))technologies.  For instance, under technologies in which the bandwidth to
))memory is most limited, stack architectures (Burroughs, Lilith) will be
))"better".  Under technologies where the ability to process instructions is
))most limited, the wide register to register architectures will be "better".

>I agree that technology influences (or maybe "should influence") architecture.
>But I don't think limited memory bandwidth indicates a stack architecture,
>rather, I would say a stack architecture is contraindicated!  If memory
>bandwidth is a limiting factor on performance, then many registers are needed!
>Optimizations which reduce memory bandwidth requirements are those that keep
>computed results in registers for later re-use; such optimizations are
>difficult, at best, to realize for a stack architecture.  

Stacks and registers are not incompatible.  It is easy to imagine a machine
which did pushes and pops between the stack and a register bank.  If register
to register architectures are allowed to store temporaries and local variables
in registers, the stack architecture should be allowed to also.  We should
separate the notion of registers as a means to evaluate expressions and as
a storage medium.

>When you say "the ability to process instructions is most limited" I guess
>that you mean "the ability to fetch instructions is most limited" (because
>any processor whose ability to actually process its own instructions is most
>limited is probably not worth discussing).  In this case, I would think that
>shorter instructions in which some part of operand addressing is implicit
>(e.g. instructions for a stack machine) would be indicated; "wide register to
>register" instructions would simply make matters worse.  Probably the best
>thing to do is design the machine right the first time, i.e. give it enough
>instruction bandwidth.

"The ability to fetch instructions" is precisely what I did NOT mean.  You
seem to have effectively argued for a stack architecture when bandwidth to
memory is limited.  After all, instructions are in memory.  What I meant
by "the ability to process instructions" is once you have the instruction
in the CPU, how quickly can you deal with it (relative to getting it into
the CPU in the first place).
-- 
  Lawrence Crowl		716-275-5766	University of Rochester
			crowl@rochester.arpa	Computer Science Department
 ...!{allegra,decvax,seismo}!rochester!crowl	Rochester, New York,  14627

dvk@sei.cmu.edu (Daniel Klein) (11/04/86)

Okay, blatant, flaming opinion time...

I really don't care how fast the internal engine has to run to produce my
output.  If my little Alfa Romeo is tooling down the highway at 70 MPH with
an internal engine cycle time of 3100 RPM, and I get passed by a Ferrari
doing 110 MPH with an internal engine speed of 4900 RPM, who is going faster?
Certainly not me, no matter how you multiply the numbers!  My MPH/RPM is a
little higher, but I got my doors blown off nonetheless.

So if I am able to build some bizarre semi-synchronous architecture with a
2 GHz clock rate, does it mean my machine is slower (when you divide out the
clock in MFlops/MHz)?  I don't think so.  If we are looking for an esoteric
comparison of architectural efficiency, *then* perhaps we have a reasonable
metric here.

Now, wasn't it interesting how the MIPS machines appeared at the top of the
performance chart in the initial posting by Mashey?  Personally, I think RISC
architectures are a good idea, so I'm not arguing architectural values here.
But RISC looks just *great* when you use the clever little formula of
MFlops/MHz.  All I care about though, is who gets my jobs done the fastest.


--> The standard disclaimer: my opinions are my own, so there, nyaa nyaa.
--
--=============--=============--=============--=============--=============--
Daniel V. Klein, who lives in Pittsburgh, allegedly works for the Software
Engineering Institute, and strives to survive as best he can.

		ARPA:	dvk@sei.cmu.edu
		USENET:	{ucbvax,harvard,cadre}!dvk@sei.cmu.edu

	"The only thing that separates us from the animals is
	    superstition and mindless rituals".

guy@sun.uucp (Guy Harris) (11/06/86)

> I really don't care how fast the internal engine has to run to produce my
> output.  If my little Alfa Romeo is tooling down the highway at 70 MPH with
> an internal engine cycle time of 3100 RPM, and I get passed by a Ferrari
> doing 110 MPH with an internal engine speed of 4900 RPM, who is going
> faster?  Certainly not me, no matter how you multiply the numbers!
> My MPH/RPM is a little higher, but I got my doors blown off nonetheless.

Yes, but what if:

	1) Horsepower, say, were linearly proportional to RPM

	2) The horsepower needed by both cars to sustain a particular
	   speed were the same

	3) Your Alfa had a redline of 20,000 RPM, while the Ferrari had a
	   redline of 6000 RPM

	4) "All other things are equal"

Then just step on the gas hard enough to get near the redline, and blow the
Ferrari's doors off.

I believe Mashey's thesis is that this is more-or-less the proper analogy;
the maximum clock rate possible is mainly a function of the chip technology,
not the architecture, so an architecture that gets more work done per clock
tick can ultimately be made to run faster than ones that get less work done
per clock tick.  I shall voice no opinion on whether this is the case or not
(I don't know enough to *have* an opinion on this), but will just let the
chip designers battle it out.
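
Stated as arithmetic (all numbers invented): if the ceiling on clock rate
really is set by the process rather than the architecture, then at that
ceiling the design that does more work per tick wins.

    #include <stdio.h>

    int main(void)
    {
        double max_mhz = 40.0;           /* set by the technology, per the thesis */
        double work_per_clock_a = 1.0;   /* gets more done per tick */
        double work_per_clock_b = 0.5;   /* gets less done per tick */

        printf("A at the technology limit: %.0f units/sec\n", work_per_clock_a * max_mhz);
        printf("B at the technology limit: %.0f units/sec\n", work_per_clock_b * max_mhz);
        return 0;
    }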

> So if I am able to build some bizarre semi-synchronous architecture with a
> 2 GHz clock rate, does it mean my machine is slower (when you divide out the
> clock in MFlops/MHz)?  I don't think so.

Since MFlops/MHz is !N*O*T! a measure of machine speed, and was never
intended as such by Mashey, the machine is neither faster nor slower "when
you divide out the clock in MFlops/MHz".  If you don't divide out the clock,
no, it doesn't mean your machine is slower.  Nobody would argue that it did.

> If we are looking for an esoteric comparison of architectural efficiency,
> *then* perhaps we have a reasonable metric here.

Well, what did you *think* MFlops/MHz was intended as?  It *was* intended
for comparing architectural efficiency!

Please, people, before you flame this measure as absurd, make sure you're
not flaming it for not being a measure of raw speed; it wasn't *intended* to
be a measure of raw speed.  *You*, the end-user, may not be interested in
architectural efficiency, but may only be interested in "how fast something
gets your job done"; the person who has to design and build that something,
however, is going to be interested in architectural efficiency.
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com (or guy@sun.arpa)

kds@mipos3.UUCP (Ken Shoemaker ~) (11/07/86)

arguments about architectural efficiency aside, you'd have an easier time
making a system that runs at 8 MHz than one that runs at 33 MHz (or whatever)
even if the overall memory access time requirement is the same.  And
you'd have a much easier time making a system that goes at 16 MHz than one
that goes at 66 MHz.
-- 
The above views are personal.

I've seen the future, I can't afford it...

Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|amdcad|qantel|pur-ee|scgvaxd|oliveb}!intelca!mipos3!kds
csnet/arpanet: kds@mipos3.intel.com

franka@mmintl.UUCP (Frank Adams) (01/01/87)

>> Comments? What sorts of metrics are important to the people who read
>> this newsgroup? What kinds of constraints?  How do you buy machines?
>> If you buy CPU chips, how do you decide what to pick?
>
>The metrics I'm interested in measure speed.  (Basically, I'm hooked
>on fast machines.)  Other constraints are less interesting because:
>(1) I will buy the fastest machine I can afford, and (2) in terms of
>architecture, speed is the bottom line -- all else is just
>mitigating circumstances.

I must disagree.  Reliability is at least as important as speed.

Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Multimate International    52 Oakland Ave North    E. Hartford, CT 06108