[comp.arch] Why floating point hardware: micro-parallelism, micro-cycles

dgh@validgh.com (David G. Hough on validgh) (09/09/90)

Stephen Spackman's recent postings recall a question I asked during
the development of the SPARC instruction-set architecture at Sun: since
floating-point instructions can be decomposed into simple integer operations,
how can they be justified in a RISC architecture?  Why is it that they don't
run as fast in software?  (They don't, and can't, but you might have to
try it to convince yourself.  All you need to do is look at 64-bit double
precision floating-point add/subtract on a 32-bit RISC architecture).
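
To see why, here is a rough sketch - not IEEE-754-complete, only the
same-sign, normalized, finite case, truncating instead of rounding, with
names invented for the example - of a 64-bit add built from 32-bit
integer operations, the kind of sequence a software implementation is
stuck with:

#include <stdint.h>

/* one double, split across two integer registers */
struct dw { uint32_t hi, lo; };

/* shift a 64-bit quantity (hi:lo) right by n, 0 <= n < 64 */
static void shr64(uint32_t *hi, uint32_t *lo, unsigned n)
{
    if (n == 0) return;
    if (n >= 32) { *lo = *hi >> (n - 32); *hi = 0; }
    else { *lo = (*lo >> n) | (*hi << (32 - n)); *hi >>= n; }
}

struct dw sw_dadd(struct dw a, struct dw b)
{
    uint32_t sign = a.hi & 0x80000000u;
    int32_t ea = (a.hi >> 20) & 0x7FF, eb = (b.hi >> 20) & 0x7FF;
    /* significands, with the implicit leading 1 made explicit (bit 52) */
    uint32_t ahi = (a.hi & 0xFFFFFu) | 0x100000u, alo = a.lo;
    uint32_t bhi = (b.hi & 0xFFFFFu) | 0x100000u, blo = b.lo;

    if (eb > ea) {                        /* make a the larger operand */
        int32_t te = ea; ea = eb; eb = te;
        uint32_t th = ahi, tl = alo;
        ahi = bhi; alo = blo; bhi = th; blo = tl;
    }
    unsigned d = (unsigned)(ea - eb);     /* align exponents */
    if (d > 52) { bhi = 0; blo = 0; } else shr64(&bhi, &blo, d);

    uint32_t lo = alo + blo;              /* 64-bit add from 32-bit adds, */
    uint32_t hi = ahi + bhi + (lo < alo); /* propagating the carry by hand */

    if (hi & 0x200000u) {                 /* carry out of bit 52: renormalize */
        lo = (lo >> 1) | (hi << 31);
        hi >>= 1;
        ea++;
    }
    struct dw r;
    r.hi = sign | ((uint32_t)ea << 20) | (hi & 0xFFFFFu);
    r.lo = lo;
    return r;
}

Count the macro-instructions in just this easy path - unpack, compare,
swap, multiword shift, carry propagation, renormalize, repack - and then
remember that opposite signs, rounding, overflow, denormals, and NaNs
are all still missing.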

Basically I was attacking the idea that RISC = 'a few simple instructions'.
This was an overly simple definition anyway.  The correct definition of RISC
architecture is 'good engineering' in the sense of 'good engineering
economy', although not everybody has realized this yet.

The underlying answer to the floating-point question is that while 
software floating point is limited by the macro-instruction cycle
time and parallelism is limited by the macro-instruction parallelism
potential, a hardware floating-point implementation can run at a
faster clock and have entirely different kinds of parallelism.  For
instance, one of the Hot Chips Symposium papers this year mentioned
a floating-point addition unit that simultaneously handles the various
cases that arise and picks the correct one at the end.  And I mean
really simultaneously.   Although high performance hardware floating 
point is not microcoded in the usual sense it is often implemented in 
hard-wired micro steps whose clock rate isn't limited by the 
instruction fetch bandwidth as is the macro-clock cycle rate.
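
A sequential caricature of that trick, for concreteness: a carry-select
adder computes both possible upper halves before the lower carry is
known, then muxes.  (This is the simplest instance of the general
"compute every case, select late" idea, not the design from that paper.)
In silicon the three adds overlap completely; C can only narrate them
one after another:

#include <stdint.h>

uint32_t carry_select_add(uint32_t a, uint32_t b)
{
    uint32_t lo  = (a & 0xFFFFu) + (b & 0xFFFFu); /* low 16 bits + carry */
    uint32_t hi0 = (a >> 16) + (b >> 16);         /* case: carry-in = 0 */
    uint32_t hi1 = hi0 + 1;                       /* case: carry-in = 1 */
    uint32_t hi  = (lo >> 16) ? hi1 : hi0;        /* pick once carry known */
    return (hi << 16) | (lo & 0xFFFFu);
}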

What tells you which complex multi-macro-cycle instructions (like floating-point
ops) are appropriate for inclusion in an instruction-set architecture?
One issue that arises if you want to be commercially successful
is that it's not a good idea to completely overlook any major application
area even if it's less than 1% of some "total", especially if your 
competitors didn't overlook it.  Thus MIPS put in integer multiplication
before SPARC and SPARC put in floating-point sqrt before MIPS.  Both
oversights were remedied in the second-generation instruction set
architectures, although I think MIPS has already implemented sqrt 
while no SPARC vendor has implemented integer multiplication.
Thus people like Silverman point out, correctly,
that current SPARC implementations
aren't competitive for their kinds of problems; this is embarrassing,
and I used to worry that it would bother potential customers - most
of whom don't depend on integer multiplication but may not know it -
but it doesn't seem to be much of a problem.  Sun-3 sales are down
in the noise compared to Sun-4, even though Sun-3s can do some integer
arithmetic problems faster at the same clock.
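
(For scale: with no multiply instruction, a 32x32 multiply comes from a
library routine whose core is a shift-and-add loop like the sketch below
- loop structure illustrative only, not any vendor's actual code -
versus a single instruction, often just a few cycles, in hardware.)

#include <stdint.h>

uint32_t umul32(uint32_t a, uint32_t b)
{
    uint32_t product = 0;
    while (b != 0) {
        if (b & 1)      /* multiplier bit set: add shifted multiplicand */
            product += a;
        a <<= 1;        /* multiplicand doubles each iteration */
        b >>= 1;        /* next multiplier bit */
    }
    return product;
}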

Another aspect of commercial success is
mass marketability - generalized processors may be cheaper, more
cost-effective, and faster than specialized ones because of higher run
rates and more attention from vendors in getting them into the latest
device technologies.

Spackman's speculation is that a totally different paradigm for non-integer
calculations could be more cost-effective than conventional floating point.
There are lots of candidate proposals; consult any recent proceedings from
the IEEE Computer Arithmetic Symposia.  But most of them are content to
prove feasibility rather than cost-effectiveness.

As mentioned, the issue is good engineering economy.  The quantitative
approach demonstrated in Hennessy and Patterson is the best basis to
start, but it's much more expensive than thought-experiments posted
to news:  to really test an idea you have to build a hardware simulator
and a good optimizing compiler that properly exploits it, and possibly
design some language extensions to express what you can do.   And even
that's not enough; to avoid the kinds of embarrassments mentioned above
you need to learn as much as possible about what potential customers
actually do with computers and what they would do if they could.
It's a lifetime undertaking.
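
For reference, the core of that quantitative approach is the simple
identity time = instructions x CPI / clock rate; everything else is the
hard work of measuring the three factors honestly.  A toy comparison,
with made-up numbers chosen only to show the form of the argument:

#include <stdio.h>

int main(void)
{
    /* hypothetical: a 2-cycle hardware FP add vs. a 40-instruction
       software sequence of 1-CPI integer ops, both at 25 MHz */
    double clock_hz = 25e6;
    double hw_time = 1.0 * 2.0 / clock_hz;   /* insns x CPI / rate */
    double sw_time = 40.0 * 1.0 / clock_hz;

    printf("hardware %.0f ns, software %.0f ns, ratio %.0fx\n",
           hw_time * 1e9, sw_time * 1e9, sw_time / hw_time);
    return 0;
}

The real discipline is in getting the instruction counts from a real
compiler and the CPIs from a real (simulated) pipeline rather than from
guesses like these.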

Besides Patterson, I should mention that Robert Garner, George Taylor,
John Mashey, and Earl Killian have helped me sort out what RISC
is all about.
-- 

David Hough

dgh@validgh.com		uunet!validgh!dgh	na.hough@na-net.stanford.edu

Chuck.Phillips@FtCollins.NCR.COM (Chuck.Phillips) (09/10/90)

>>>>> On 9 Sep 90 15:17:44 GMT, dgh@validgh.com (David G. Hough on validgh) said:
David> ...since floating-point instructions can be decomposed into simple
David> integer operations, how can they be justified in a RISC
David> architecture?  Why is it that they don't run as fast in software?
David> (They don't, and can't, but you might have to try it to convince
David> yourself.  All you need to do is look at 64-bit double precision
David> floating-point add/subtract on a 32-bit RISC architecture).

David> Basically I was attacking the idea that RISC = 'a few simple
David> instructions'.  This was an overly simple definition anyway.  The
David> correct definition of RISC architecture is 'good engineering' in the
David> sense of 'good engineering economy', although not everybody has
David> realized this yet.

Perhaps RISC does indeed stand for Reduced Instruction Set, and "good
engineering" can be, and has been, applied to CISC architectures (notably the
80486 and the 68040).

Modern processor design is indeed indebted to the RISC pioneers who, in
order to compensate for reduced instruction sets, applied "good
engineering" to come up with some remarkable techniques for parallelism.
_Except for the reduced number of instructions_, these same techniques can
be applied to CISC (albeit some techniques with more difficulty).

If a CISC processor _averages_ close to 1 Cycle Per Instruction, what is
the advantage of removing many of those instructions?  Are you claiming a
CISC processor is somehow transformed into a RISC processor because of an
improved CPI, _even though the actual instruction set has not diminished_?
(e.g.  the 68040 & 80486)

In a given technology, the physics of the medium limits how fast a switch
can toggle, leaving parallelism as the route for even greater throughput.
It appears Reduced Instruction Sets and parallelism are, to a great degree,
orthogonal.  Am I missing something here?

Is it possible higher silicon densities will shift (or have shifted) the
economics of processor design toward more robust parallelized instruction
sets, perhaps even toward "Super CISC"?

	Just for discussion,

David> David Hough
David> dgh@validgh.com	uunet!validgh!dgh	na.hough@na-net.stanford.edu

#include <std/disclaimer.h>
--
Chuck Phillips  MS440
NCR Microelectronics 			Chuck.Phillips%FtCollins.NCR.com
2001 Danfield Ct.
Ft. Collins, CO.  80525   		uunet!ncrlnk!ncr-mpd!bach!chuckp

amos@taux01.nsc.com (Amos Shapir) (09/10/90)

[Quoted from the referenced article by Chuck.Phillips@FtCollins.NCR.COM (Chuck.Phillips)]
>
>In a given technology, the physics of the medium limits how fast a switch
>can toggle, leaving parallelism as the route for even greater throughput.
>It appears Reduced Instruction Sets and parallelism are, to a great degree,
>orthogonal.  Am I missing something here?

What you're missing is that CISC processors are a bitch to parallelize
at the instruction level - each instruction, or part thereof, can take
a different number of cycles and occupy an unpredictable set of
resources; when several functional units have to share those resources,
a lot of effort must be put into interlocking, synchronisation, etc.
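
A toy illustration of the instruction-boundary half of that (encoding
invented for the example: first byte = instruction length): with fixed
4-byte instructions the k-th boundary is just 4*k, so many decoders can
start at once; with variable lengths, each boundary is known only after
the previous instruction has been examined.

#include <stddef.h>

/* fixed-size ISA: boundaries are independent, decoders run in parallel */
const unsigned char *fixed_insn(const unsigned char *code, size_t k)
{
    return code + 4 * k;
}

/* variable-size ISA: an inherently sequential chain of dependencies */
const unsigned char *variable_insn(const unsigned char *code, size_t k)
{
    while (k--)
        code += code[0];   /* must look at insn i to find insn i+1 */
    return code;
}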

-- 
	Amos Shapir		amos@taux01.nsc.com, amos@nsc.nsc.com
National Semiconductor (Israel) P.O.B. 3007, Herzlia 46104, Israel
Tel. +972 52 522255  TWX: 33691, fax: +972-52-558322 GEO: 34 48 E / 32 10 N

mash@mips.COM (John Mashey) (09/15/90)

There are a bunch of things in the following discussion that could
use some clarification, or amplification, so here goes:

In article <CHUCK.PHILLIPS.90Sep9215755@halley.FtCollins.NCR.COM> Chuck.Phillips@FtCollins.NCR.COM (Chuck.Phillips) writes:
>>>>>> On 9 Sep 90 15:17:44 GMT, dgh@validgh.com (David G. Hough on validgh) said:
>David> ...since floating-point instructions can be decomposed into simple
>David> integer operations, how can they be justified in a RISC
>David> architecture?  Why is it that they don't run as fast in software?
>David> (They don't, and can't, but you might have to try it to convince
>David> yourself.  All you need to do is look at 64-bit double precision
>David> floating-point add/subtract on a 32-bit RISC architecture).

>David> Basically I was attacking the idea that RISC = 'a few simple
>David> instructions'.  This was an overly simple definition anyway.  The
>David> correct definition of RISC architecture is 'good engineering' in the
>David> sense of 'good engineering economy', although not everybody has
>David> realized this yet.

Dgh has this right about FP (note that on a MIPS, 64-bit FP add = 2 cycles,
hard to match by sequences of integer instructions),
and it is a good example of what people really do, without
the confusion of counting instructions. 

>Perhaps RISC does indeed stand for Reduced Instruction Set, and "good
>engineering" can be, and has been, applied to CISC architectures (notably the
>80486 and the 68040).
Good engineering can of course be applied to CISCs, and has been, for years.
If you track succeeding designs among, for example, the S/360 & VAX
families, you will find that the designers have carefully studied the
statistics of program behavior, moved some instructions from microcode
into hardware, or vice-versa, or even into software emulation.
Examples include:
	360/44 (didn't have decimal ops, for example)
	MicroVAX II (also didn't have decimal ops)
In addition, successive designs have generally gotten more efficient pipeline
designs and memory hierarchies.  Certainly, the 80486 is a fine implementation,
and the 68040 appears to be well-thought-out, from the published information.
This whole process, in general, goes on amongst all competent
computer designers, has for many years, and is not particularly
new; nor would I expect any knowledgeable RISC designer to tell you
that it was something magic and new.

So what's the difference?  Let's try again.
	RISC micros were designed from the beginning:
	1) To avoid instruction complexity that would require microcode
	in general, which often costs you 1.5-2:1 if used for the
	simpler instructions.
	2) (In better cases) with a great deal of input from software
	people.  Since RISCs are newer, they have a lot of benefit from
	hindsight.  Since RISCs were designed when there was considerably
	more use of high-level languages and (sometimes) optimizing
	compilers, it was much easier to study these things and feed them
	into the design.  As it happens, compiler technology has taken
	leaps in the last decade, and the tradeoffs have changed; not
	surprising, since the entire nature & structure of the computer
	business is a lot different from 10 years ago, and unbelievably
	different from 20 years ago.
	3) RISCs usually were designed after it was clear that caches were
	good things, and that let them make tradeoffs from Day 1, tradeoffs
	that were not necessarily appropriate for architectures designed
	when caches were either unknown or not practical for the part of
	the design space being attacked.  Also in this category are:
		a) Pure code segments
		b) Virtual memory support, if needed
	In some cases, some older machines allowed programs to
	write into their code any time they felt like it (like into the
	immediately succeeding instruction), or they included features
	that conflicted more with VM than they needed to.  All of these
	can be worked around, but hindsight...
	4) RISCs are generally designed to permit clean, simple pipelining,
	without requiring huge amounts of logic for special cases and such.
	This is certainly one of the key differences, and again, some of it
	comes from hindsight.
	5) Avoid those instructions that can easily be simulated by
	sequences of simple ones AT COMPARABLE PERFORMANCE.  Include those
	instructions, NO MATTER HOW "COMPLEX" someone thinks they are,
	if those instructions achieve performance that cannot be approximated
	elsewise, and if the tradeoffs are acceptable.  (again: include
	FP Add, which may well be a huge hunk of hardware, but don't
	include Translate&Test). 
It is interesting, as H&P point out, that never in the history of computing
have a bunch of ISA designs (note: just ISAs, nothing said about architecture
in general) done at the same time resembled each other as much
as the current crop of RISCs do.  (This is where they describe several
different chips by showing their relatively minor differences from their DLX.)
This doesn't mean there aren't important differences among them, but
machines that have 32-bit instructions, load/store orientation, usually 32
integer registers available at once, etc, etc, are a lot more alike, than,
say: IBM 1401, IBM 7074, and IBM 7094, or S/360, CDC 6600, Univac 1108, or
VAX & DG MV, or Intel 8086, Moto 68000, and NSC 32K.

>Modern processor design is indeed indebted to the RISC pioneers who, in
>order to compensate for reduced instruction sets, applied "good
>engineering" to come up with some remarkable techniques for parallelism.
>_Except for the reduced number of instructions_, these same techniques can
>be applied to CISC (albeit some techniques with more difficulty).
As noted, good engineering practice is good engineering practice,
and it didn't start with RISC.
However, the reduced number of instructions is the LEAST of the issues,
and people keep getting confused with this.  Much more relevant are
issues like:
	Operand and instruction alignment, especially in VM systems
		(see the sketch after this list)
	Number and especially kinds of addressing modes, especially
		multi-level indirect, for example.
	Number & size of operand fetches/writes caused by an instruction
	Multiple instruction sizes
	Number and kind of side-effects caused by an instruction, especially
		in VM systems
	Exception model
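
To make the first of those concrete: a load/store RISC that forbids
unaligned operands services one (if at all) with a byte-assembly
sequence like this sketch (little-endian assumed, function name
invented), while a CISC that promises arbitrary alignment must carry
the page-crossing and VM-fault machinery in every load path:

#include <stdint.h>

uint32_t load32_unaligned(const uint8_t *p)
{
    /* four byte loads plus shifts and ors replace one aligned load */
    return  (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}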

>
>If a CISC processor _averages_ close to 1 Cycle Per Instruction, what is
>the advantage of removing many of those instructions?  Are you claiming a
>CISC processor is somehow transformed into a RISC processor because of an
>improved CPI, _even though the actual instruction set has not diminished_?
>(e.g.  the 68040 & 80486)
Well, so far, 80486s don't appear to average close to 1 CPI, although,
as I've pointed out before, only the designers really know.
On the other hand, if you approximate CPI by MHz/(Integer-VAX-mips),
for machines for which that makes sense, and use SPEC integer = Integer-VAX-mips,
you get numbers like these (from "Your Mileage May Vary", Issue 2.0):

Clock	SPEC-Int	Clock/SPEC	Chip	System
25	11.2		2.23		SPARC	SUN SS1+ w/s (64K cache)
25	12.3		2.03		SPARC	Sun SS330 w/s (128K cache)
25	13.3		1.88		486	Intel-reported (128K)
25	18.3		1.37		88K	Moto 8864SP (128K)
25	19.4		1.29		R3000	MIPS Magnum 3000 w/s (64K)
25	19.7		1.27		R3000	MIPS M/2000, RC3260 (128K)
25	20.2		1.24		RS/6000	IBM RS/6000 model 530 w/s (72K)

Note, of course, that there is some element of apples&oranges here,
as these things are not completely contemporaneous in design,
have sometimes rather different silicon budgets, etc.
Still, if you believe clock/SPEC is anywhere near close to CPI for
these machines (it is for MIPS, but that's the only one I can be sure of),
the 486 is still off by a factor of 2.  (Mainframes would get closer to 1,
I think, and I suspect the '040 will do a little better also.)
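
(The Clock/SPEC column is nothing deeper than MHz divided by SPECint;
a few lines reproduce it from the figures above:)

#include <stdio.h>

int main(void)
{
    static const struct { const char *chip; double mhz, specint; } m[] = {
        { "SPARC SS1+",   25.0, 11.2 },
        { "486 (Intel)",  25.0, 13.3 },
        { "R3000 M/2000", 25.0, 19.7 },
        { "RS/6000 530",  25.0, 20.2 },
    };
    int i;

    for (i = 0; i < 4; i++)
        printf("%-14s %.2f clocks per VAX-relative 'instruction'\n",
               m[i].chip, m[i].mhz / m[i].specint);
    return 0;
}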

Of course, doing a heavily-streamlined implementation of a VAX, X86,
68K, etc ... doesn't magically make them RISC architectures, but of
course, one shouldn't care much, either (except for marketing :-).
The engineers are doing what they should be: making them go faster.
Of course, they sometimes have to squeeze harder to get everything in.
I have high respect for the implementation cleverness
that has often gone into such things, because it is VERY HARD WORK
to make ANYTHING go really fast, and people have to live with past
decisions.  Consider people who build mainframes (IBM & PCMs):
they must live with decisions made 25 years ago....

-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

Chuck.Phillips@FtCollins.NCR.COM (Chuck.Phillips) (09/17/90)

>>>>> On 14 Sep 90 23:41:34 GMT, mash@mips.COM (John Mashey) said:
John> As noted, good engineering practice is good engineering practice,
John> and it didn't start with RISC.
John> However, the reduced number of instructions is the LEAST of the issues,
John> and people keep getting confused with this.  Much more relevant are
John> issues like:
John> 	Operand and instruction alignment, especially in VM systems
John> 	Number and especially kinds of addressing modes, especially
John> 		multi-level indirect, for example.
John> 	Number & size of operand fetches/writes caused by an instruction
John> 	Multiple instruction sizes
John> 	Number and kind of side-effects caused by an instruction, especially
John> 		in VM systems
John> 	Exception model

Well put.  So how about ditching the RISC acronym for a more descriptive
one?  (e.g. LOUIS - Load/store One Uniform Instruction Size   1/2 :-)

John> -john mashey   DISCLAIMER: <generic disclaimer, I speak for me only, etc>

Ditto.

John> UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
John> DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
John> USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
--
Chuck Phillips  MS440
NCR Microelectronics 			Chuck.Phillips%FtCollins.NCR.com
2001 Danfield Ct.
Ft. Collins, CO.  80525   		uunet!ncrlnk!ncr-mpd!bach!chuckp

shri@ncst.ernet.in (H.Shrikumar) (09/17/90)

>>>>>> On 9 Sep 90, dgh@validgh.com (David G. Hough on validgh) said:

>David> Basically I was attacking the idea that RISC = 'a few simple
>David> instructions'.  This was an overly simple definition anyway. 

In article ref. above, (Chuck.Phillips) adds:

>Perhaps RISC does indeed stand for Reduced Instruction Set, and "good
>engineering" can be, and has been, applied to CISC architectures (notably the
>80486 and the 68040).

If only we defined RISC = REGULAR Instruction Set Computers .....
                          ^^^^^^^


(and (1/2 :-) CISC = Confusing Instruction Set Computers ? ;-)

-- shrikumar ( shri@ncst.in )

mash@mips.COM (John Mashey) (09/17/90)

In article <CHUCK.PHILLIPS.90Sep17040818@halley.FtCollins.NCR.COM> Chuck.Phillips@FtCollins.NCR.COM (Chuck.Phillips) writes:
...
>John> However, the reduced number of instructions is the LEAST of the issues,
>John> and people keep getting confused with this.  Much more relevant are
....

>Well put.  So how about ditching the RISC acronym for a more descriptive
>one?  (e.g. LOUIS - Load/store One Uniform Instruction Size   1/2 :-)

1) Would be nice, but we're probably stuck with it ...

2) And besides, I'd have to rewrite all of my foils that explain why
RISC (in sense of Reduced) has confused everybody :-)  My usual sequence
of acronyms is:
	Reduced Instruction Set Computer- not really
	Reusable Information Storage Computer-better (Marty Hopkins of IBM)
	Revolutionary Innovation in Science of Computing-no; Seymour at it 25yrs
	Response to Inherent Shifts in Computer technology -yes
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mash@mips.COM (John Mashey) (09/18/90)

In article <919@shakti.ncst.ernet.in>, shri@ncst.ernet.in (H.Shrikumar) writes:
 
> If only we defined RISC = REGULAR Instruction Set Computers .....
>                           ^^^^^^^

No, that doesn't work either.  Many CISCs are as regular as RISCs,
and some are more so.  For instance, the VAX is pretty regular, as is
the NSC 32K.  Sometimes CISCs have completely regular addressing
modes where RISCs don't (e.g., a RISC may offer base+index addressing,
or auto-increment, on only one side of the load/store pairing).

In any case, part of the point of the last posting was that the acronym
didn't really matter much; of course, it's hardly the case that
one can draw a precise line between RISCs and CISCs anyway, and in fact,
being frenzied about which label to apply is marketing, anyway.
Much more relevant is to study the underlying issues about kinds of features
that yield performance or not.  You will note that Hennessy & Patterson's
book doesn't waste a lot of time messing with RISC acronyms...
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086