[comp.arch] taxonomy for superscalars/etc

mark@hubcap.clemson.edu (Mark Smotherman) (07/21/90)

I would like to suggest a possible taxonomy to distinguish among
the differing organizations of new processors.  Your comments
and corrections are welcome.

I see three major areas: issue parallelism, start of execution
(which can differ from time of issue if it is the responsibility
of the functional unit to obtain its own operands), and resource
naming (specifically registers).

Thus I would like to give a three-part classification, i/e/n,
to each machine.  The first field would be issue parallelism:

  (1) single issue per cycle
  (s) superscalar issue of independent instructions
  (v) vliw issue in which several opcodes or instructions (i.e. i860)
      are grouped into a wide instruction word

The second field would be execution start time:

  (d) data dependencies (RAW) are interlocked by issue unit, which
      stalls until resolution; the fn unit starts upon issue since
      issue unit provides both the op specification as well as the
      operands
  (c) compiler must reorder instructions to avoid data dependencies
  (x) out-of-order execution, where the fn unit starts only after
      obtaining its operands -- the issue unit does not stall on
      data dependencies but forces the fn unit to resolve them

The third field would be resource naming, specifically register
renaming:

  (p) physical registers named in instructions -- must be concerned
      about anti-dependencies (WAR) and output dependencies (WAW)
  (r) hardware renames logical registers in instructions by tagging
      or assignment of physical registers (removes WAR and WAW)

Using this proposed taxonomy, I would classify the following machines:

  MIPS R3000                  1 / c / p
  MIPS R6000                  1 / d / p
  Motorola 88K                1 / d / p

  Intel i860                  v / d / p
  Multiflow Trace             v / c / p

  Apollo DN10000              s / d / p  ??
  Intel i960                  s / d / p
  Tandem Cyclone              s / d / p  ??

  CDC 6600                    1 / x / p

  IBM S/360 M91 (Tomasulo)    1 / x / r
  Nexgen F86                  1 / x / r

  Metaflow Lightning (SPARC)  s / x / r  ??
  IBM RS/6000                 s / x / r

-- 
Mark Smotherman, Comp. Sci. Dept., Clemson University, Clemson, SC 29634
INTERNET: mark@hubcap.clemson.edu    UUCP: gatech!hubcap!mark

davec@nucleus.amd.com (Dave Christie) (07/24/90)

In article <9782@hubcap.clemson.edu> mark@hubcap.clemson.edu (Mark Smotherman) writes:
>I would like to suggest a possible taxonomy to distinguish among
>the differing organizations of new processors.  Your comments
>and corrections are welcome.
>
>I see three major areas: issue parallelism, start of execution
>(which can differ from time of issue if it is the responsibility
>of the functional unit to obtain its own operands), and resource
>naming (specifically registers).
>
>Thus I would like to give a three-part classification, i/e/n,
>to each machine.  The first field would be issue parallelism:
>
>  (1) single issue per cycle
>  (s) superscalar issue of independent instructions
>  (v) vliw issue in which several opcodes or instructions (i.e. i860)
>      are grouped into a wide instruction word

How about adding (d) = decoupled superscalar issue of independent 
instructions, where instructions are issued simultaneously from two
independent streams (Smith, James E., _Decoupled_Access/Execute_
Computer_Architectures_, ACM TOCS, Nov 1984, plus a half-dozen other
papers by him).  This is considerably different from the type of 
superscalar issue we are beginning to see in advanced RISC chips, 
and has been used commercially (Astronautics Corp's short-lived ZS-1(?) 
for one).  BTW, I'm not sure what your taxonomy is intended to cover - 
features that are architecturally visible, or implementation techniques 
regardless of whether they influence the architecture/compiler?  I assume
the latter, since renaming is used to hide stuff from the compiler.

>The second field would be execution start time:
>
>  (d) data dependencies (RAW) are interlocked by issue unit, which
>      stalls until resolution; the fn unit starts upon issue since
>      issue unit provides both the op specification as well as the
>      operands
>  (c) compiler must reorder instructions to avoid data dependencies
>  (x) out-of-order execution, where the fn unit starts only after
>      obtaining its operands -- the issue unit does not stall on
>      data dependencies but forces the fn unit to resolve them

These catagories aren't mutually exclusive - the R3000 which you
classify as 1/c/p is "d" for cache misses on a load, as well as
accesses to the HI and LO registers during multiply/divide.  Should
it be classified as "dc", or where does one draw the line?

I also have a semantic bone of contention for these first two parts.
When instruction execution becomes separated from instruction
"issue", most literature I've seen refers to the act of sending an
instruction to a functional unit as "dispatch", with the term "issue"
used to refer to the act of placing an instruction in execution.  I
don't think "execution" can be used in place of "issue" here, since
execution can take multiple cycles, and issue (& dispatch) indicate
single cycle events (to me, anyway).  

This may seem pretty picky, but rigorous definition of these terms
definitely helps avoid confusion when working with this stuff.  

>The third field would be resource naming, specifically register
>renaming:
>
>  (p) physical registers named in instructions -- must be concerned
>      about anti-dependencies (WAR) and output dependencies (WAW)
>  (r) hardware renames logical registers in instructions by tagging
>      or assignment of physical registers (removes WAR and WAW)

A few comments:
How about a designation for cases where possible dependencies are
eliminated mainly by using independent register sets (ref. i860
integer/fp registers, decoupled architectures)?

Dependencies between functional units can be handled via queues
between functional units - this is typical for decoupled architectures
(ref. aforementioned paper).  Moreover, this technique can be used 
in single-stream architectures with independent functional units,
and the queues can be architecturally visible (accessed via register
designators) or implemented using renaming (architecturally hidden),
where the renaming is only done for queued operands, not all operations.

I have seen implementations where the renaming is permanent, such
that the architectural state of a process is maintained by a set of
pointers into a pool of registers, and where it is only temporary
(to cover the average or maximum execution latency), with a set of
physically-addressed registers being updated in order from a reorder 
buffer or similar mechanism.  (This is probably getting too detailed
for your purposes.)

>Using this proposed taxonomy, I would classify the following machines:
>
>  Tandem Cyclone              s / d / p  ??

Maybe in a class by itself, considering how it relies on the compiler
to pair up instructions, which are then converted to a vliw-like
micrand via the control store.

>  IBM S/360 M91 (Tomasulo)    1 / x / r
>  IBM RS/6000                 s / x / r

These two use renaming in considerably different ways: in the 360/91
the tags refer to specific physical registers, and not just the floating
point registers.  In the RS6000, at least from what I can tell from the
literature I have, the renaming is used primarily to implement a load
operand queue, as described above.  I don't know whether or not renaming
is used to handle intra-FPU conflicts.  I'm pretty sure renaming is
not used within the fixed-point unit.  (The RS6000 actually has several
attributes of the single-stream decoupled access/execute architecture
described in Jim Smith's aforementioned paper.)

In any case, you should probably indicate what level of detail you wish
to represent with this taxonomy.  One can get really carried away with
this stuff [...he says nonchalantly, concluding the longest posting of
his life...:-)].

----------------------------------
Dave Christie           My opinions only.

davec@nucleus.amd.com (Dave Christie) (07/24/90)

Just a quick correction before I get flamed:

In article <1990Jul23.182546.25777@mozart.amd.com> I write:
>
>>  IBM S/360 M91 (Tomasulo)    1 / x / r
>>  IBM RS/6000                 s / x / r
>
>These two use renaming in considerably different ways: in the 360/91
>the tags refer to specific physical registers, and not just the floating
>point registers.  In the RS6000, at least from what I can tell from the

Sorry, for the 360/91 the tags refer to specific functional units (the 
reservation stations actually) and the entries in the Floating Point 
Buffers (essentially a load operand queue), i.e. the tags identify all 
the possible sources of operands, *except* for the FP registers.

---------------------------------
Dave Christie            My opinions only.

marc@oahu.cs.ucla.edu (Marc Tremblay) (07/25/90)

In article <1990Jul23.182546.25777@mozart.amd.com> davec@nucleus.amd.com (Dave Christie) writes:
>>  IBM S/360 M91 (Tomasulo)    1 / x / r
>>  IBM RS/6000                 s / x / r
>
>These two use renaming in considerably different ways: in the 360/91
>the tags refer to specific physical registers, and not just the floating
>point registers.  In the RS6000, at least from what I can tell from the
>literature I have, the renaming is used primarily to implement a load
>operand queue, as described above.  I don't know whether or not renaming
>is used to handle intra-FPU conflicts.  I'm pretty sure renaming is
>not used within the fixed-point unit.

In the current version of the RS6000, floating point arithmetic instructions
are executed in sequence. There is thus no need to assign new tags to the
destination registers of these instructions. Only the floating point loads
create a new assignment in the mapping table (logical to physical translation
table). So the real purpose of the register renaming scheme in the RS/6000
is to be able to process floating point loads without waiting for source
registers to be used by previous instructions.

This type of out-of-order execution of instructions can lead to heavy hardware
support to deal with precise interrupts, which can eventually slowed down
the clock, but instead the RS/6000 offers a few alternatives/tradeoffs
regarding how tight one wants to monitor exceptions.
For example for some applications it may not be necessary to check
for exceptions immediately after each floating point arithmetic instruction.
For this case, one can insert an instruction that polls the status register
and trap if an exception is detected.
For instance in the following pseudo code:

	fdiv fp3,fp2,fp1
	fld  fp4,(r5)
	fld  fp6,(r7)
	...
	check_for_exceptions
	...

Why wait for the possible exception generated by the fdiv?
Instead the loads can be executed in parallel with the fdiv
and later an instruction can be inserted to check for exceptions.

For the purpose of debugging code and for running critical codes,
exceptions can be checked after every instructions by running
an application in the one-instruction-at-the-time mode.
Which means that you basically lose the throughput provided by
the pipeline implementation of the functional units.

_________________________________________________
Marc Tremblay
internet: marc@CS.UCLA.EDU
UUCP: ...!{uunet,ucbvax,rutgers}!cs.ucla.edu!marc

davec@nucleus.amd.com (Dave Christie) (07/26/90)

In article <37240@shemp.CS.UCLA.EDU> marc@oahu.cs.ucla.edu (Marc Tremblay) writes:
>
>In the current version of the RS6000, floating point arithmetic instructions
>are executed in sequence. There is thus no need to assign new tags to the
>destination registers of these instructions. Only the floating point loads
>create a new assignment in the mapping table (logical to physical translation
>table). So the real purpose of the register renaming scheme in the RS/6000
>is to be able to process floating point loads without waiting for source
>registers to be used by previous instructions.
>

Is the remapping permanent? Or are the load operands eventually 
transferred into physical registers as designated in the instruction?  
If it's permanent, then this _looks_ less like a queue, but the overall
effect is the same.

Permanent remapping would imply that all FP operand references would
have to go through the mapping table, slowing down access a tad.  I've
seen the register set described as "32 Floating Point registers and
six rename registers", but that really doesn't indicate whether it's a
pool of 38 logically-addressed registers, or 32 physically addressed
registers + a queue :-) of 6 load operand registers.

----------------------
Dave Christie        My opinions only.

marc@oahu.cs.ucla.edu (Marc Tremblay) (07/26/90)

In article <1990Jul25.172053.27085@mozart.amd.com> (Dave Christie) writes:
>In article <37240@shemp.CS.UCLA.EDU> (Marc Tremblay) writes:
>>In the current version of the RS6000, floating point arithmetic instructions
>>are executed in sequence. There is thus no need to assign new tags to the
>>destination registers of these instructions. Only the floating point loads
>>create a new assignment in the mapping table (logical to physical translation
>>table). So the real purpose of the register renaming scheme in the RS/6000
>>is to be able to process floating point loads without waiting for source
>>registers to be used by previous instructions.

>Is the remapping permanent? Or are the load operands eventually 
>transferred into physical registers as designated in the instruction?  
>If it's permanent, then this _looks_ less like a queue, but the overall
>effect is the same.

The remapping is permanent ... until of course the same logical register
gets reassigned to another physical register. All subsequent instructions
access the logical tag through the map table so that they access the latest
physical tag assigned.

>Permanent remapping would imply that all FP operand references would
>have to go through the mapping table, slowing down access a tad.  I've
>seen the register set described as "32 Floating Point registers and
>six rename registers", but that really doesn't indicate whether it's a
>pool of 38 logically-addressed registers, or 32 physically addressed
>registers + a queue :-) of 6 load operand registers.

There isn't a separate pool of registers assigned to be "rename registers".
All physical registers can be used for renaming.
During normal execution the 40 tags (some IBM literature says 38 tags)
are distributed as follow:

	1) 32 logical tags are assigned to 32 physical registers.
	2) a number of tags n (0<n<8) are stored in the Free List
	   ready to be used for new assignments.
	3) a number of tags (8-n) are waiting to be released by floating
	   arithmetic instructions or floating-point stores so that
	   they can be sent to the Free List for ulterior assignments.

A full cycle seems to be allocated for register renaming. That's plenty
of time to access the map table and manage the tag lists. I suspect that
they may even do it twice per cycle to reduce the number of ports
of the map table. Indeed if two instructions can be renamed per cycle,
and if both are FMA (Fused multiply and add) which require 3 source
tags and one destination tag, that's 8 ports/cycle for the mapping table.

An extra stage in the pipeline can be a limiting factor but in this case
it does not affect the pipeline result latency among adjacent instructions.
It does affect though the waiting time for example for a conditional
branch waiting for a floating-point compare. But that's a pretty rare case.

_________________________________________________
Marc Tremblay
internet: marc@CS.UCLA.EDU
UUCP: ...!{uunet,ucbvax,rutgers}!cs.ucla.edu!marc