[comp.arch] Floating point register renaming

bcase@cup.portal.com (Brian bcase Case) (03/13/90)

>> 2) You can change the pipeline, and not recompile all of your code.
>
>Generally, a pipeline change means a new computer, usually bought for
>speed reasons.  Why not exploit all the new speed advantages?  And if
>you refuse to recompile, is it so hard to write a Loader to insert NOPs
>after instructions whose pipelines have lengthened?

Yes, it is so hard.  Indirect jumps are the problem.  A program
that computes a jump address, say, by using a procedure parameter
or looking in a switch table, will not function unless the
algorithm that computes the address is changed by the loader
too.  For certain cases, this is feasible, but in the most
general case, the problem cannot be solved.
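A toy illustration of the indirect-jump problem, in Python and under made-up assumptions: a hypothetical five-instruction program where a loader inserts a NOP after every instruction whose pipeline lengthened (here `mul`). Direct branch targets are visible in the instruction encoding and can be remapped. An address the program keeps as data (a switch-table entry) is just an integer to the loader, so it is left pointing at the old location. The mini-ISA and the `relocate` helper are illustrative only.

```python
# Hypothetical mini-ISA: each instruction is a tuple (opcode, operand).
# "br" takes an absolute target address; "jr" jumps to an address held in data.
program = [
    ("mul", None),    # 0: multi-cycle op whose pipeline "lengthened"
    ("br", 3),        # 1: direct branch, target visible to the loader
    ("add", None),    # 2
    ("jr", "table"),  # 3: indirect jump through a switch table in data memory
    ("add", None),    # 4
]
data = {"table": 4}   # a code address stored as plain data


def relocate(program, needs_nop):
    """Insert a NOP after each instruction whose opcode is in needs_nop.

    Returns the new program and a map from old to new addresses.
    Only direct branch targets are patched; data is left alone because
    the loader cannot tell which integers are code addresses.
    """
    new_addr = {}
    out = []
    for old, insn in enumerate(program):
        new_addr[old] = len(out)
        out.append(insn)
        if insn[0] in needs_nop:
            out.append(("nop", None))
    patched = [
        (op, new_addr[arg]) if op == "br" else (op, arg)
        for op, arg in out
    ]
    return patched, new_addr


new_program, new_addr = relocate(program, {"mul"})
print(new_program[2])              # ('br', 4): direct branch retargeted
print(new_addr[4], data["table"])  # 5 4: target moved, table entry did not
print(new_program[data["table"]])  # ('jr', 'table'): the stale entry hits the jump itself
```

Patching the table entry here would be easy only because we know it holds an address; in general the loader cannot know that, which is the point made above.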

murthy@panini.watson.ibm.com (Sesh Murthy) (03/16/90)

It occurs to me that not all compilers for the RISC System/6000 machines
are going to be produced by IBM.  Also, you sometimes get object
code that has not been optimized.  Wouldn't floating point register
renaming help here?

Of course this is only my UNINFORMED opinion.

Sesh Murthy
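A generic sketch of what renaming buys for code that was not scheduled with a particular pipeline in mind, assuming a simple map table plus free list. The class, register names, and sizes are hypothetical, and this is not a description of the RISC System/6000 hardware. The point it shows: reusing one architectural register as a temporary no longer serializes otherwise independent computations.

```python
class RenameTable:
    """Toy rename map: architectural FP registers -> physical registers.

    Physical registers are never returned to the free list here; a real
    design frees the old mapping when the overwriting instruction commits.
    """

    def __init__(self, n_arch, n_phys):
        self.map = {f"f{i}": f"p{i}" for i in range(n_arch)}
        self.free = [f"p{i}" for i in range(n_arch, n_phys)]

    def rename(self, dest, srcs):
        # Sources are looked up before the destination gets its new name,
        # so "f2 = f2 + f1" still reads the old f2.
        psrcs = tuple(self.map[s] for s in srcs)
        if not self.free:
            raise RuntimeError("out of physical registers: decode would stall")
        pdest = self.free.pop(0)
        self.map[dest] = pdest
        return pdest, psrcs


code = [
    ("f2", ("f0", "f1")),  # f2 = f0 * f1
    ("f3", ("f2", "f3")),  # f3 = f2 + f3
    ("f2", ("f4", "f5")),  # f2 reused as a temporary: f2 = f4 * f5
    ("f6", ("f2", "f6")),  # f6 = f2 + f6
]
table = RenameTable(n_arch=32, n_phys=40)
renamed = [table.rename(d, s) for d, s in code]
for (d, s), (pd, ps) in zip(code, renamed):
    print(f"{d} <- {s}  =>  {pd} <- {ps}")
```

The second write to f2 lands in p34 rather than p32, so the second multiply need not wait for the add that still reads the first f2 (a write-after-read hazard), and the two multiply/add pairs can overlap even though the code reuses one register.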

keith@mips.COM (Keith Garrett) (03/16/90)

In article <3090@elmer.oakhill.UUCP> michaelb@oakhill.UUCP (Mike Becker) writes:
>
>In article <36899@mips.mips.COM> mash@mips.COM (John Mashey) writes:
> [ discussion of 88K scoreboarding claims deleted]
>
>>(Whether they do a complete set, I don't care.)
>Someone cares.  In the case of the R3000's missing load interlock, programmers 
>(and compilers) have less freedom to relocate the loads.  Specifically, a load
>moved upstream on the 88K will not stall the main pipeline if a cache miss
>occurs. In contrast, the R3000 will stall the main pipeline in all cases
>where the load misses in the cache, negating the effect of moving the load
>upstream. (See MATRIX100, SPEC benchmark suite).  Admittedly, one should take 
>into account the frequency with which this occurs, but the point is why bother 
>introducing this constraint when register scoreboarding gets you around it in
>a straightforward fashion.
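The trade-off in the quoted paragraph can be reduced to a toy cycle count. This sketch assumes a single-issue pipe, a 2-cycle load latency on a hit, and an arbitrary 10-cycle miss penalty (illustrative numbers, not R3000 or 88K figures); `stall_cycles` is a hypothetical helper. With a scoreboard, only the first consumer of the loaded register waits, so hoisting the load hides part of the miss. If the whole pipe freezes on a miss, as the quoted post describes for the R3000, hoisting hides none of it.

```python
def stall_cycles(distance, miss_penalty, scoreboard, hit_latency=2):
    """Cycles lost before the first consumer of a missing load can issue.

    Toy single-issue model: the load issues at cycle 0 and the consumer
    would issue at cycle `distance` if nothing stalled. On a hit, the
    result is usable `hit_latency` cycles after the load issues.
    """
    if scoreboard:
        # Only the consumer waits; independent work in between keeps issuing.
        ready = hit_latency + miss_penalty
        return max(0, ready - distance)
    # Stall-on-miss: the whole pipe freezes for the miss, so hoisting the
    # load further ahead does not hide any of the penalty.
    return miss_penalty + max(0, hit_latency - distance)


MISS = 10  # assumed miss penalty in cycles, purely illustrative
for d in (2, 8, 12):
    print(d, stall_cycles(d, MISS, scoreboard=True),
          stall_cycles(d, MISS, scoreboard=False))
```

The model deliberately ignores the point raised in the reply below: with a single port to memory, a second load or store issued while the first miss is outstanding would block anyway, which caps how much the scoreboard can hide.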

as mash pointed out in earlier postings, processors with a single port to
main memory become blocked very quickly on cache misses. It's not clear that
the complexity of register interlocks is justified. MIPS is not opposed to
hardware interlocks.
software interlocks are used for load and branch instructions because they
are easy, and the delay slots can usually be filled with useful instructions.
The R3000 has hardware interlocks for multiple cycle instructions (integer
mult/div, fp). As the pipeline changes for other implementations, we will
add hardware interlocks as needed to maintain compatibility.
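A toy sketch of the software interlock described above, assuming a made-up three-field instruction format and straight-line code with no branches. The toolchain, not the hardware, guarantees that the instruction after a load does not read its result: first by hoisting an independent ALU instruction into the delay slot, otherwise by padding with a NOP. `Insn`, `can_fill`, and `fill_load_delay_slots` are hypothetical names; real MIPS compilers and assemblers use more general schedulers, so this illustrates the idea, not any MIPS tool.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Insn:
    """Hypothetical three-field instruction: opcode, destination, sources."""
    op: str
    dest: Optional[str]
    srcs: Tuple[str, ...]


NOP = Insn("nop", None, ())
ALU_OPS = {"addu", "subu", "and", "or"}


def can_fill(load, consumer, cand):
    """True if `cand`, which sits right after `consumer`, may be hoisted into
    the delay slot of `load` without changing any register data flow."""
    if cand.op not in ALU_OPS:          # leave loads, stores, branches alone
        return False
    if load.dest in cand.srcs:          # would read the load result too early
        return False
    if cand.dest == load.dest:          # would clobber the value consumer needs
        return False
    if cand.dest in consumer.srcs:      # consumer must still see the old value
        return False
    if consumer.dest is not None and (
        consumer.dest in cand.srcs or consumer.dest == cand.dest
    ):                                  # cand depends on, or overwrites, consumer
        return False
    return True


def fill_load_delay_slots(prog):
    """Ensure no instruction directly after a load reads the load's result."""
    out = list(prog)
    i = 0
    while i < len(out):
        insn = out[i]
        if insn.op == "lw" and i + 1 < len(out) and insn.dest in out[i + 1].srcs:
            if i + 2 < len(out) and can_fill(insn, out[i + 1], out[i + 2]):
                out.insert(i + 1, out.pop(i + 2))  # useful work in the slot
            else:
                out.insert(i + 1, NOP)             # nothing safe: pad with a NOP
        i += 1
    return out


def show(insn):
    return insn.op if insn.dest is None else f"{insn.op} {insn.dest} <- {', '.join(insn.srcs)}"


prog = [
    Insn("lw", "t0", ("a0",)),
    Insn("addu", "t1", ("t0", "t0")),   # uses t0 in the delay slot
    Insn("addu", "t2", ("a1", "a2")),   # independent: can move into the slot
    Insn("lw", "t3", ("a3",)),
    Insn("subu", "t4", ("t3", "t1")),   # uses t3, nothing left to hoist
]
result = fill_load_delay_slots(prog)
for insn in result:
    print(show(insn))
```

On this example the first load's slot is filled by the independent `addu t2`, while the second load, which has nothing after its consumer to hoist, gets a NOP.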

>>..1-cycle load .. [on] MIPS chips
>Calling the R3000 load a 1-cycle load is "just marketing silliness, confusing,
>and contradictory to terminology long-used in computer architecture."  This
>usage suggests that the load has the same execution time as an integer add.

load and store instructions spend the same amount of time in each of the same
pipeline stages as any of the other 1-cycle instructions. The only difference
is that the result of load instructions is available one cycle later than
the result of alu ops would be (i.e., a one-cycle load delay). loads do not
stall the pipe except for cache misses. perhaps you are thinking of some other
architecture :?>

> [other claims for register interlocking deleted]
>
>-- mike becker DISCLAIMER: I represent myself only.

-- 
Keith Garrett        "This is *MY* opinion, OBVIOUSLY"
UUCP: keith@mips.com  or  {ames,decwrl,prls}!mips!keith
USPS: Mips Computer Systems,930 Arques Ave,Sunnyvale,Ca. 94086

tim@nucleus.amd.com (Tim Olson) (03/16/90)

In article <37065@mips.mips.COM> keith@mips.COM (Keith Garrett) writes:
| In article <3090@elmer.oakhill.UUCP> michaelb@oakhill.UUCP (Mike Becker) writes:
| >Calling the R3000 load a 1-cycle load is "just marketing silliness, confusing,
| >and contradictory to terminology long-used in computer architecture."  This
| >usage suggests that the load has the same execution time as an integer add.
| 
| load and store instructions spend the same amount of time in each of the same
| pipeline stages as any of the other 1-cycle instructions. The only difference
| is that the result of load instructions is available one cycle later than
| the result of alu ops would be (i.e., a one-cycle load delay). loads do not
| stall the pipe except for cache misses. perhaps you are thinking of some other
| architecture :?>

Yes, loads and stores have a single-cycle issue rate in the MIPS
pipeline, but they have a 2-cycle latency.  Latency is what most people
mean when they quote the cycle count of an operation without further
qualification.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
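The distinction between issue rate and latency can be made concrete with a small in-order timing model. The latencies below are assumptions matching the numbers in this thread (a load result usable 2 cycles after issue, an ALU result 1 cycle after), and `issue_times` is a hypothetical helper; stalling in hardware here stands in for the compiler-filled delay slot, which gives the same timing.

```python
# Assumed latencies: cycles from issue until a dependent instruction may issue.
LATENCY = {"lw": 2, "addu": 1}


def issue_times(prog):
    """Single-issue, in-order: each instruction issues one cycle after the
    previous one, or later if a source register is not ready yet."""
    ready = {}   # register -> first cycle a consumer may issue
    cycle = -1
    times = []
    for op, dest, srcs in prog:
        cycle = max([cycle + 1] + [ready.get(r, 0) for r in srcs])
        times.append(cycle)
        if dest is not None:
            ready[dest] = cycle + LATENCY[op]
    return times


independent = [("lw", f"t{k}", ("a0",)) for k in range(4)]
chained = [("lw", "a0", ("a0",)) for _ in range(4)]  # pointer chasing
print(issue_times(independent))  # [0, 1, 2, 3]: one load issued per cycle
print(issue_times(chained))      # [0, 2, 4, 6]: each load waits for the previous result
```

Independent loads achieve the single-cycle issue rate, while a chain of dependent loads runs at the 2-cycle latency.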