mark@hubcap.clemson.edu (Mark Smotherman) (07/21/90)
I would like to suggest a possible taxonomy to distinguish among the differing organizations of new processors. Your comments and corrections are welcome. I see three major areas: issue parallelism, start of execution (which can differ from time of issue if it is the responsibility of the functional unit to obtain its own operands), and resource naming (specifically registers). Thus I would like to give a three-part classification, i/e/n, to each machine. The first field would be issue parallelism: (1) single issue per cycle (s) superscalar issue of independent instructions (v) vliw issue in which several opcodes or instructions (i.e. i860) are grouped into a wide instruction word The second field would be execution start time: (d) data dependencies (RAW) are interlocked by issue unit, which stalls until resolution; the fn unit starts upon issue since issue unit provides both the op specification as well as the operands (c) compiler must reorder instructions to avoid data dependencies (x) out-of-order execution, where the fn unit starts only after obtaining its operands -- the issue unit does not stall on data dependencies but forces the fn unit to resolve them The third field would be resource naming, specifically register renaming: (p) physical registers named in instructions -- must be concerned about anti-dependencies (WAR) and output dependencies (WAW) (r) hardware renames logical registers in instructions by tagging or assignment of physical registers (removes WAR and WAW) Using this proposed taxonomy, I would classify the following machines: MIPS R3000 1 / c / p MIPS R6000 1 / d / p Motorola 88K 1 / d / p Intel i860 v / d / p Multiflow Trace v / c / p Apollo DN10000 s / d / p ?? Intel i960 s / d / p Tandem Cyclone s / d / p ?? CDC 6600 1 / x / p IBM S/360 M91 (Tomasulo) 1 / x / r Nexgen F86 1 / x / r Metaflow Lightning (SPARC) s / x / r ?? IBM RS/6000 s / x / r -- Mark Smotherman, Comp. Sci. Dept., Clemson University, Clemson, SC 29634 INTERNET: mark@hubcap.clemson.edu UUCP: gatech!hubcap!mark
davec@nucleus.amd.com (Dave Christie) (07/24/90)
In article <9782@hubcap.clemson.edu> mark@hubcap.clemson.edu (Mark Smotherman) writes: >I would like to suggest a possible taxonomy to distinguish among >the differing organizations of new processors. Your comments >and corrections are welcome. > >I see three major areas: issue parallelism, start of execution >(which can differ from time of issue if it is the responsibility >of the functional unit to obtain its own operands), and resource >naming (specifically registers). > >Thus I would like to give a three-part classification, i/e/n, >to each machine. The first field would be issue parallelism: > > (1) single issue per cycle > (s) superscalar issue of independent instructions > (v) vliw issue in which several opcodes or instructions (i.e. i860) > are grouped into a wide instruction word How about adding (d) = decoupled superscalar issue of independent instructions, where instructions are issued simultaneously from two independent streams (Smith, James E., _Decoupled_Access/Execute_ Computer_Architectures_, ACM TOCS, Nov 1984, plus a half-dozen other papers by him). This is considerably different from the type of superscalar issue we are beginning to see in advanced RISC chips, and has been used commercially (Astronautics Corp's short-lived ZS-1(?) for one). BTW, I'm not sure what your taxonomy is intended to cover - features that are architecturally visible, or implementation techniques regardless of whether they influence the architecture/compiler? I assume the latter, since renaming is used to hide stuff from the compiler. >The second field would be execution start time: > > (d) data dependencies (RAW) are interlocked by issue unit, which > stalls until resolution; the fn unit starts upon issue since > issue unit provides both the op specification as well as the > operands > (c) compiler must reorder instructions to avoid data dependencies > (x) out-of-order execution, where the fn unit starts only after > obtaining its operands -- the issue unit does not stall on > data dependencies but forces the fn unit to resolve them These catagories aren't mutually exclusive - the R3000 which you classify as 1/c/p is "d" for cache misses on a load, as well as accesses to the HI and LO registers during multiply/divide. Should it be classified as "dc", or where does one draw the line? I also have a semantic bone of contention for these first two parts. When instruction execution becomes separated from instruction "issue", most literature I've seen refers to the act of sending an instruction to a functional unit as "dispatch", with the term "issue" used to refer to the act of placing an instruction in execution. I don't think "execution" can be used in place of "issue" here, since execution can take multiple cycles, and issue (& dispatch) indicate single cycle events (to me, anyway). This may seem pretty picky, but rigorous definition of these terms definitely helps avoid confusion when working with this stuff. >The third field would be resource naming, specifically register >renaming: > > (p) physical registers named in instructions -- must be concerned > about anti-dependencies (WAR) and output dependencies (WAW) > (r) hardware renames logical registers in instructions by tagging > or assignment of physical registers (removes WAR and WAW) A few comments: How about a designation for cases where possible dependencies are eliminated mainly by using independent register sets (ref. i860 integer/fp registers, decoupled architectures)? Dependencies between functional units can be handled via queues between functional units - this is typical for decoupled architectures (ref. aforementioned paper). Moreover, this technique can be used in single-stream architectures with independent functional units, and the queues can be architecturally visible (accessed via register designators) or implemented using renaming (architecturally hidden), where the renaming is only done for queued operands, not all operations. I have seen implementations where the renaming is permanent, such that the architectural state of a process is maintained by a set of pointers into a pool of registers, and where it is only temporary (to cover the average or maximum execution latency), with a set of physically-addressed registers being updated in order from a reorder buffer or similar mechanism. (This is probably getting too detailed for your purposes.) >Using this proposed taxonomy, I would classify the following machines: > > Tandem Cyclone s / d / p ?? Maybe in a class by itself, considering how it relies on the compiler to pair up instructions, which are then converted to a vliw-like micrand via the control store. > IBM S/360 M91 (Tomasulo) 1 / x / r > IBM RS/6000 s / x / r These two use renaming in considerably different ways: in the 360/91 the tags refer to specific physical registers, and not just the floating point registers. In the RS6000, at least from what I can tell from the literature I have, the renaming is used primarily to implement a load operand queue, as described above. I don't know whether or not renaming is used to handle intra-FPU conflicts. I'm pretty sure renaming is not used within the fixed-point unit. (The RS6000 actually has several attributes of the single-stream decoupled access/execute architecture described in Jim Smith's aforementioned paper.) In any case, you should probably indicate what level of detail you wish to represent with this taxonomy. One can get really carried away with this stuff [...he says nonchalantly, concluding the longest posting of his life...:-)]. ---------------------------------- Dave Christie My opinions only.
davec@nucleus.amd.com (Dave Christie) (07/24/90)
Just a quick correction before I get flamed: In article <1990Jul23.182546.25777@mozart.amd.com> I write: > >> IBM S/360 M91 (Tomasulo) 1 / x / r >> IBM RS/6000 s / x / r > >These two use renaming in considerably different ways: in the 360/91 >the tags refer to specific physical registers, and not just the floating >point registers. In the RS6000, at least from what I can tell from the Sorry, for the 360/91 the tags refer to specific functional units (the reservation stations actually) and the entries in the Floating Point Buffers (essentially a load operand queue), i.e. the tags identify all the possible sources of operands, *except* for the FP registers. --------------------------------- Dave Christie My opinions only.
marc@oahu.cs.ucla.edu (Marc Tremblay) (07/25/90)
In article <1990Jul23.182546.25777@mozart.amd.com> davec@nucleus.amd.com (Dave Christie) writes: >> IBM S/360 M91 (Tomasulo) 1 / x / r >> IBM RS/6000 s / x / r > >These two use renaming in considerably different ways: in the 360/91 >the tags refer to specific physical registers, and not just the floating >point registers. In the RS6000, at least from what I can tell from the >literature I have, the renaming is used primarily to implement a load >operand queue, as described above. I don't know whether or not renaming >is used to handle intra-FPU conflicts. I'm pretty sure renaming is >not used within the fixed-point unit. In the current version of the RS6000, floating point arithmetic instructions are executed in sequence. There is thus no need to assign new tags to the destination registers of these instructions. Only the floating point loads create a new assignment in the mapping table (logical to physical translation table). So the real purpose of the register renaming scheme in the RS/6000 is to be able to process floating point loads without waiting for source registers to be used by previous instructions. This type of out-of-order execution of instructions can lead to heavy hardware support to deal with precise interrupts, which can eventually slowed down the clock, but instead the RS/6000 offers a few alternatives/tradeoffs regarding how tight one wants to monitor exceptions. For example for some applications it may not be necessary to check for exceptions immediately after each floating point arithmetic instruction. For this case, one can insert an instruction that polls the status register and trap if an exception is detected. For instance in the following pseudo code: fdiv fp3,fp2,fp1 fld fp4,(r5) fld fp6,(r7) ... check_for_exceptions ... Why wait for the possible exception generated by the fdiv? Instead the loads can be executed in parallel with the fdiv and later an instruction can be inserted to check for exceptions. For the purpose of debugging code and for running critical codes, exceptions can be checked after every instructions by running an application in the one-instruction-at-the-time mode. Which means that you basically lose the throughput provided by the pipeline implementation of the functional units. _________________________________________________ Marc Tremblay internet: marc@CS.UCLA.EDU UUCP: ...!{uunet,ucbvax,rutgers}!cs.ucla.edu!marc
davec@nucleus.amd.com (Dave Christie) (07/26/90)
In article <37240@shemp.CS.UCLA.EDU> marc@oahu.cs.ucla.edu (Marc Tremblay) writes: > >In the current version of the RS6000, floating point arithmetic instructions >are executed in sequence. There is thus no need to assign new tags to the >destination registers of these instructions. Only the floating point loads >create a new assignment in the mapping table (logical to physical translation >table). So the real purpose of the register renaming scheme in the RS/6000 >is to be able to process floating point loads without waiting for source >registers to be used by previous instructions. > Is the remapping permanent? Or are the load operands eventually transferred into physical registers as designated in the instruction? If it's permanent, then this _looks_ less like a queue, but the overall effect is the same. Permanent remapping would imply that all FP operand references would have to go through the mapping table, slowing down access a tad. I've seen the register set described as "32 Floating Point registers and six rename registers", but that really doesn't indicate whether it's a pool of 38 logically-addressed registers, or 32 physically addressed registers + a queue :-) of 6 load operand registers. ---------------------- Dave Christie My opinions only.
marc@oahu.cs.ucla.edu (Marc Tremblay) (07/26/90)
In article <1990Jul25.172053.27085@mozart.amd.com> (Dave Christie) writes: >In article <37240@shemp.CS.UCLA.EDU> (Marc Tremblay) writes: >>In the current version of the RS6000, floating point arithmetic instructions >>are executed in sequence. There is thus no need to assign new tags to the >>destination registers of these instructions. Only the floating point loads >>create a new assignment in the mapping table (logical to physical translation >>table). So the real purpose of the register renaming scheme in the RS/6000 >>is to be able to process floating point loads without waiting for source >>registers to be used by previous instructions. >Is the remapping permanent? Or are the load operands eventually >transferred into physical registers as designated in the instruction? >If it's permanent, then this _looks_ less like a queue, but the overall >effect is the same. The remapping is permanent ... until of course the same logical register gets reassigned to another physical register. All subsequent instructions access the logical tag through the map table so that they access the latest physical tag assigned. >Permanent remapping would imply that all FP operand references would >have to go through the mapping table, slowing down access a tad. I've >seen the register set described as "32 Floating Point registers and >six rename registers", but that really doesn't indicate whether it's a >pool of 38 logically-addressed registers, or 32 physically addressed >registers + a queue :-) of 6 load operand registers. There isn't a separate pool of registers assigned to be "rename registers". All physical registers can be used for renaming. During normal execution the 40 tags (some IBM literature says 38 tags) are distributed as follow: 1) 32 logical tags are assigned to 32 physical registers. 2) a number of tags n (0<n<8) are stored in the Free List ready to be used for new assignments. 3) a number of tags (8-n) are waiting to be released by floating arithmetic instructions or floating-point stores so that they can be sent to the Free List for ulterior assignments. A full cycle seems to be allocated for register renaming. That's plenty of time to access the map table and manage the tag lists. I suspect that they may even do it twice per cycle to reduce the number of ports of the map table. Indeed if two instructions can be renamed per cycle, and if both are FMA (Fused multiply and add) which require 3 source tags and one destination tag, that's 8 ports/cycle for the mapping table. An extra stage in the pipeline can be a limiting factor but in this case it does not affect the pipeline result latency among adjacent instructions. It does affect though the waiting time for example for a conditional branch waiting for a floating-point compare. But that's a pretty rare case. _________________________________________________ Marc Tremblay internet: marc@CS.UCLA.EDU UUCP: ...!{uunet,ucbvax,rutgers}!cs.ucla.edu!marc