baum@Apple.COM (Allen J. Baum) (02/12/91)
[]
>In article <45792@mips.mips.COM> cprice@mips.COM (Charlie Price) writes:
>Superscalar is pretty easy to define, but
>what *does* superpipelining really mean?

 (details of R4000 pipe... Thanks!)

>Two instructions are issued per EXTERNAL clock,
>this is the same period as the on-chip cache latency.
>To do this, an internal clock runs at double the external clock and
>one instruction is issued per internal clock so
>This requires that cache access is pipelined.

This actually runs contrary to MY definition of superpipelined, which
is probably like MY definition of RISC - there's no such animal as THE
definition (for the curious, my definition requires all functional
units to be multiple cycle - the R4000 behaves like a normal pipeline
with a multiple-delay-slot cache access [that's pipelined]).

I would expect a multiple-cycle hit for taken branches as well, without
some kind of branch acceleration technique like a branch target cache.

I'm also curious about the compatibility provisions. Is there a
'mode'? Are instructions 64 bits long now, or a mixture? Were just a
few new instructions added (like load/store double, shift double) and
the semantics of existing ones changed (load becomes load
signed/unsigned, shifts become shift single, etc.)? If there is a
mode, does it just change where in the word a condition is taken from?
What else?

Inquiring minds would, of course, like to know. Is this level of
detail still subject to non-disclosure?

--
baum@apple.com  (408)974-3385  {decwrl,hplabs}!amdahl!apple!baum
davec@nucleus.amd.com (Dave Christie) (02/13/91)
Sorry, this has gotten rather long...

In article <49041@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>[]
>>In article <45792@mips.mips.COM> cprice@mips.COM (Charlie Price) writes:
>>Superscalar is pretty easy to define, but
>>what *does* superpipelining really mean?
>
> (details of R4000 pipe... Thanks!)
>>
>>Two instructions are issued per EXTERNAL clock,
>>this is the same period as the on-chip cache latency.
>>To do this, an internal clock runs at double the external clock and
>>one instruction is issued per internal clock so
>>This requires that cache access is pipelined.
>
>This actually runs contrary to MY definition of superpipelined, which
>is probably like MY definition of RISC - there's no such animal as THE
>definition (for the curious, my definition requires all functional
>units be multiple cycle- the R4000 behaves like a normal pipeline with
>a multiple delay slot cache access [that's pipelined]).

I was just about to post in a similar vein. When Jouppi coined the
terms superscalar and superpipelined (at least his paper in ACM CAN a
couple of years ago was the first I saw of the terms), he was
comparing the two techniques in an apples-to-apples sort of way,
stating that in a superpipeline even the most basic operations have
more than one pipestage of latency, and that both techniques are
subject to similar dependency stalls.

The R4000 is certainly aggressively pipelined, but not fully
superpipelined - only somewhat more so than the 88K or 29050, for
instance (disregarding the frequency); one could say they're all
superpipelined w.r.t. floating-point operations. Load operations in
the R3000 could be called superpipelined - load latency is twice the
instruction issue interval. The superpipelining criterion mentioned
by Mr. Price in another posting is rather weak: issue rate 2x icache
access?
What if I decided to put in such a large icache that it increased my
cycle time to almost 2x the capabilities of my other (R3K-like)
stages, then merely made the cache double width and allowed two cycles
to access it so I could still issue instructions at the faster rate -
is it suddenly superpipelined? (If you say yes, you must be in
marketing ;-)

I'm certainly not trying to pick on the R4000 - it's a reasonably
impressive design, and timely execution on the rest of the project
will make it more so. It's more the term "superpipelined" - which is
quickly becoming as meaningful as "RISC" and "MIPS". It's been well
known since the early 60's that there are various degrees of
aggressive pipelining; Jouppi refers to the CDC 6600 and 7600 as
superpipelined. These varying degrees of aggressiveness are what will
make the term so slippery (hence a good marketing term - can you tell
I'm not in marketing? :-)

What the R4000 has done, for me at least, is point out that
superpipelining, as defined and compared with superscalar by Jouppi,
is for the most part an academic exercise, at least with respect to
the reality of the technologies we have to work with today: on-chip
cache sizes grow faster than their access times, but by the same token
integer ALU sizes (& latencies) shrink (alright, assuming constant
word size...), so what might have been fairly balanced ALU/cache
access times in an older technology start to get very skewed - it
makes perfect sense to split your cache access into two stages, but
not your ALU. (The same applies to superscalar designs.) One could go
with a smaller, faster pipelined cache, split the ALU, and then maybe
run at an even higher rate - BUT maybe not double, and it would be
more sensitive to operand dependencies. Moreover, a smaller cache
with a higher execution rate is rather pointless. For these reasons,
I don't think one could do an optimal fully-superpipelined
implementation.
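Jouppi's point that superscalar and superpipelined machines suffer
similar dependency stalls can be put in rough numbers with a toy cycle
counter. All the latencies below are invented for illustration - they
are not R4000 (or any other machine's) figures:

```python
# Toy model: cycles to execute a chain of fully dependent instructions
# on (a) a base pipeline, (b) a 2x superpipelined machine, and (c) a
# 2-issue superscalar.  Latencies are made-up illustrative numbers.

def chain_cycles(n_instrs, result_latency):
    """Each instruction needs the previous one's result, so issue
    width is irrelevant: every instruction waits result_latency
    cycles (in the machine's own clock) for its predecessor."""
    return n_instrs * result_latency

n = 8
base        = chain_cycles(n, 1)      # 1-cycle ALU, 1 issue/clock
superpipe   = chain_cycles(n, 2) / 2  # 2-stage ALU at double clock;
                                      # divide by 2 to compare in
                                      # base-clock units
superscalar = chain_cycles(n, 1)      # 2-wide issue, but the chain
                                      # serializes anyway

print(base, superpipe, superscalar)   # -> 8 8.0 8: no win on a chain
```

On independent code both techniques approach their peak rate; on a
pure dependence chain neither helps, which is the apples-to-apples
comparison being described above.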
(I don't know what the relative latencies of the ALU and
combined/split cache stages turned out to be for the R4000, but I'm
tempted to speculate that one of the cache stages turned out to be
still slower than a 32-bit ALU, so going to 64 bits, while maybe even
exceeding that cache stage by a bit, wasn't too painful, and therefore
downright worthwhile. Just pure speculation, of course.)

>I would expect a multiple cycle hit for taken branches as well, without
>some kind of branch acceleration technique like a branch target cache.

Yep - an extra cycle. At least they can still eat one cycle with the
delay slot. This extra branch recovery cycle and the extra load cycle
will cause non-uniform speedups among various integer codes that
aren't re-scheduled.

>I'm also curious about the compatibility provisions. Is there a
>'mode'? Are instructions 64 bits long now, or a mixture? Were just a few
>new instructions added (like load/store double, shift double) and the
>semantics of existing ones change (load becomes load signed/unsigned,
>shifts become shift single, etc)? If there is a mode, does it just change
>where in the word a condition is taken from? What else?

Yeah, good questions. What's the cache impact of pointers taking up
double the space? (I know this doesn't apply to 32-bit "mode", but I
can't imagine that there'll be two sizes of pointers in 64-bit
"mode".) This plus the extra load cycle won't look good for things
like searching linked lists, but then I have yet to see any really
aggressive high-performance design provide balanced speedups across
the spectrum. In any case, one can be sure Mipsco took all that into
account for their target market.

------------------------------------------
Dave Christie   My opinions only.
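The linked-list concern above can be given a back-of-envelope shape:
list traversal is a chain of dependent pointer loads, so any extra
load latency is fully exposed. The latencies here are assumptions
chosen for illustration, not measured R3000/R4000 numbers:

```python
# Back-of-envelope: walking a linked list is a chain of dependent
# loads - the next load can't start until the previous pointer lands.
# Latencies are illustrative assumptions, not real machine data.

def walk_cycles(nodes, load_latency):
    # Per node: one pointer load (serializing) plus one cycle of
    # compare/branch work that we assume cannot be overlapped.
    return nodes * (load_latency + 1)

old = walk_cycles(1000, load_latency=1)  # shallower pipe, delay slot filled
new = walk_cycles(1000, load_latency=2)  # deeper pipe, extra load cycle

print(old, new, new / old)  # -> 2000 3000 1.5
```

And that is before counting the extra cache misses from 8-byte
pointers halving the number of nodes that fit in a fixed-size cache.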
gillies@m.cs.uiuc.edu (Don Gillies) (02/13/91)
I read an article today in "The Microprocessor Report" that said the
increase from 32 to 64 bits added approximately 10-15% to the size of
the chip. This is quite surprising. You'd think that going from 32 to
64 bits would double the size of the ALU and all the data paths and
registers. Does this mean that the datapath, ALU, and registers
accounted for only 10-15% of the microprocessor to begin with? What
is the baseline design for this 10-15% increase in area?
--
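The arithmetic behind that question is worth spelling out. Under the
(possibly wrong) assumption that widening to 64 bits exactly doubles
the datapath area and leaves everything else (caches, TLB, control)
fixed, the reported growth directly implies the original datapath
fraction:

```python
# If the datapath scales by factor k and the rest of the die is fixed,
# total area goes from 1 to (1 - f) + f*k, where f is the original
# datapath fraction.  So growth g = f*(k - 1), i.e. f = g / (k - 1).
# Assumption for illustration: k = 2 (datapath exactly doubles).

def datapath_fraction(area_growth, datapath_scale=2.0):
    """Original datapath fraction implied by a given total-area growth."""
    return area_growth / (datapath_scale - 1.0)

for g in (0.10, 0.15):
    f = datapath_fraction(g)
    print(f"{g:.0%} chip growth -> datapath was {f:.0%} of the die")
```

With k = 2 the fraction equals the growth, so a 10-15% area increase
would indeed mean the datapath was only 10-15% of the original die -
which is exactly the surprise being voiced above.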
cprice@mips.COM (Charlie Price) (02/22/91)
Yet again, hoping the net gods bless this posting...

In article <49041@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>[]
>>In article <45792@mips.mips.COM> cprice@mips.COM (Charlie Price) writes:
>>Superscalar is pretty easy to define, but
>>what *does* superpipelining really mean?
>
> (details of R4000 pipe... Thanks!)

Ya Betcha! By the way, this is the level of detail in the information
that was handed out to the press. You can look for pictures in the
glossy trade rags any time now.

>This actually runs contrary to MY definition of superpipelined, which
>is probably like MY definition of RISC - there's no such animal as THE
>definition...

While I agree that there is no "one true meaning" for the term, we
need *some* terminology we can agree on to talk about these things. I
will have more to say on this in a subsequent posting.

>I would expect a multiple cycle hit for taken branches as well, without
>some kind of branch acceleration technique like a branch target cache.

Yes, increased load and branch latency is one of the effects of
increasing the pipeline depth. This is a non-negligible consideration
in picking a pipeline structure.

>I'm also curious about the compatibility provisions.
>Inquiring minds would, of course, like to know. Is this level of detail
>still subject to non-disclosure?

I can't find any discussion of this in our technology presentations,
so this will have to be a topic for the future.
--
Charlie Price    cprice@mips.mips.com    (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA 94086-23650