[comp.arch] R4000 - compatibility questions

baum@Apple.COM (Allen J. Baum) (02/12/91)

[]
>In article <45792@mips.mips.COM> cprice@mips.COM (Charlie Price) writes:
>Superscalar is pretty easy to define, but
>what *does* superpipelining really mean?

 (details of R4000 pipe... Thanks!)
>
>Two instructions are issued per EXTERNAL clock;
>this is the same period as the on-chip cache latency.
>To do this, an internal clock runs at double the external clock and
>one instruction is issued per internal clock.
>This requires that cache access be pipelined.

This actually runs contrary to MY definition of superpipelined, which
is probably like MY definition of RISC - there's no such animal as THE
definition (for the curious, my definition requires that all functional
units be multiple-cycle - the R4000 behaves like a normal pipeline with
a multiple-delay-slot cache access [that's pipelined]).

I would expect a multiple cycle hit for taken branches as well, without
some kind of branch acceleration technique like a branch target cache.

I'm also curious about the compatibility provisions. Is there a
'mode'?  Are instructions 64 bits long now, or a mixture?  Were just a few
new instructions added (like load/store double, shift double), and did the
semantics of existing ones change (load becomes load signed/unsigned,
shifts become shift single, etc.)?  If there is a mode, does it just change
where in the word a condition is taken from?  What else?

Inquiring minds would, of course, like to know. Is this level of detail
still subject to non-disclosure?

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

davec@nucleus.amd.com (Dave Christie) (02/13/91)

Sorry, this has gotten rather long...

In article <49041@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>[]
>>In article <45792@mips.mips.COM> cprice@mips.COM (Charlie Price) writes:
>>Superscalar is pretty easy to define, but
>>what *does* superpipelining really mean?
>
> (details of R4000 pipe... Thanks!)
>>
>>Two instructions are issued per EXTERNAL clock;
>>this is the same period as the on-chip cache latency.
>>To do this, an internal clock runs at double the external clock and
>>one instruction is issued per internal clock.
>>This requires that cache access be pipelined.
>
>This actually runs contrary to MY definition of superpipelined, which
>is probably like MY definition of RISC - there's no such animal as THE
>definition (for the curious, my definition requires that all functional
>units be multiple-cycle - the R4000 behaves like a normal pipeline with
>a multiple-delay-slot cache access [that's pipelined]).

I was just about to post in a similar vein.  When Jouppi coined the
terms superscalar and superpipelined (at least his paper in ACM CAN a 
couple of years ago was the first I saw of the terms), he was comparing the 
two techniques in an apples-to-apples sort of way, stating that in a
superpipelined machine even the most basic operations have more than one
pipestage of latency, and that both techniques are subject to similar
dependency stalls.  The R4000 is certainly aggressively pipelined, but not
fully superpipelined - only somewhat more so than the 88K or 29050, for
instance (disregarding the frequency) - one could say they're all
superpipelined w.r.t. floating-point operations.  Load operations in the
R3000 could be called superpipelined - a load's latency is two issue slots.
The superpipelined criteria mentioned by Mr. Price in another posting are
rather weak: an issue rate twice the icache access rate?  What if I decided
to put in such a large icache that it increased my cycle time to almost 2x
that of my other (R3K-like) stages, then merely made the cache double width
and allowed two cycles to access it so I could still issue instructions at
the faster rate - is it suddenly superpipelined?  (If you say yes, you must
be in marketing ;-).  
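
To put numbers on that thought experiment, here's a throwaway C sketch.
Every latency in it is invented for illustration - these are not R4000 (or
R3000) figures - but it shows the trade: pipelining a slow cache stage keeps
the issue rate up while stretching the cache access to two issue slots.

#include <stdio.h>

/* Hypothetical stage latencies in nanoseconds -- invented for
 * illustration, not actual R4000 (or R3000) numbers. */
#define ALU_NS        5.0
#define CACHE_NS      9.0   /* big on-chip cache accessed in one stage     */
#define CACHE_HALF_NS 5.0   /* the same access split across two pipestages */

static double max2(double a, double b) { return a > b ? a : b; }

int main(void)
{
    /* The cycle time is set by the slowest pipestage. */
    double t_one = max2(ALU_NS, CACHE_NS);       /* cache in one stage  */
    double t_two = max2(ALU_NS, CACHE_HALF_NS);  /* cache in two stages */

    printf("one-stage cache: cycle = %.1f ns, cache latency = 1 issue slot\n",
           t_one);
    printf("two-stage cache: cycle = %.1f ns, cache latency = 2 issue slots\n",
           t_two);

    /* The issue rate nearly doubles, but the cache access now spans two
     * issue slots -- which is exactly the "is that suddenly
     * superpipelined?" question. */
    return 0;
}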

I'm certainly not trying to pick on the R4000 - it's a reasonably 
impressive design, and timely execution on the rest of the project will 
make it more so.  It's more the term "superpipelined" - which is quickly 
becoming as meaningful as "RISC" and "MIPS".  It's been well known since
the early 60's that there are various degrees of aggressive pipelining;
Jouppi refers to the CDC 6600 and 7600 as superpipelined.  These 
varying degrees of aggressiveness are what will make the term so slippery 
(hence a good marketing term - can you tell I'm not in marketing? :-).  

What the R4000 has done for me, at least, is point out that superpipelining 
as defined and compared with superscalar by Jouppi is for the most part an 
academic exercise, at least with respect to the reality of the technologies 
we have to work with today:  on-chip cache sizes grow faster than their 
access times, but by the same token integer ALU sizes (& latencies) shrink 
(alright, assuming constant word size...), so what might have been fairly 
balanced ALU/cache access times in an older technology start to get very 
skewed - it makes perfect sense to split your cache access into two stages, 
but not your ALU.  (The same applies to superscalar designs.)  One could
go with a smaller, faster pipelined cache, split the ALU, and then maybe
run at an even higher rate - BUT maybe not double the rate, and it would be
more sensitive to operand dependencies.  Moreover, a smaller cache with a
higher execution rate is rather pointless.  For these reasons, I don't think
one could do an optimal fully-superpipelined implementation.     (I don't
know what the relative latencies of the ALU and combined/split cache stages 
turned out to be for the R4000, but I'm tempted to speculate that one of
the cache stages still turned out to be slower than a 32-bit ALU, so going
to 64 bits, while maybe even exceeding that cache stage by a bit, wasn't
too painful, and therefore downright worthwhile.  Just pure speculation,
of course.)
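
To illustrate that speculation with invented numbers (nothing below is a
measured R4000 latency): if one half of the split cache access is still the
critical path, a somewhat slower 64-bit ALU costs little or nothing in
cycle time.

#include <stdio.h>

/* All latencies below are invented for illustration only --
 * they are NOT measured R4000 numbers. */
int main(void)
{
    double cache_stage = 5.5;  /* one half of the split cache access, ns */
    double alu32       = 4.5;  /* hypothetical 32-bit ALU                */
    double alu64       = 5.8;  /* hypothetical 64-bit ALU, a bit slower  */

    double cycle32 = cache_stage > alu32 ? cache_stage : alu32;
    double cycle64 = cache_stage > alu64 ? cache_stage : alu64;

    /* Even when the wider ALU pokes past the cache stage a little,
     * the cycle time only grows by that little bit. */
    printf("cycle time with 32-bit ALU: %.1f ns\n", cycle32);
    printf("cycle time with 64-bit ALU: %.1f ns (%.0f%% slower)\n",
           cycle64, (cycle64 / cycle32 - 1.0) * 100.0);
    return 0;
}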

>I would expect a multiple cycle hit for taken branches as well, without
>some kind of branch acceleration technique like a branch target cache.

Yep.  An extra cycle.  At least they can still eat one cycle with the
delay slot.  This extra branch-recovery cycle and the extra load cycle 
will cause non-uniform speedups among the various integer codes that aren't
rescheduled.  

>I'm also curious about the compatibility provisions. Is there a
>'mode'?  Are instructions 64 bits long now, or a mixture?  Were just a few
>new instructions added (like load/store double, shift double), and did the
>semantics of existing ones change (load becomes load signed/unsigned,
>shifts become shift single, etc.)?  If there is a mode, does it just change
>where in the word a condition is taken from?  What else?

Yeah, good questions.  What's the cache impact of pointers taking up
double the space? (I know this doesn't apply to 32-bit "mode", but I can't
imagine that there'll be two sizes of pointers in 64-bit "mode".)  This
plus the extra load cycle won't look good for things like searching linked 
lists, but then I have yet to see any really aggressive high-performance 
design provide balanced speedups across the spectrum.  In any case, one 
can be sure Mipsco took all that into account for their target market.
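
As a trivial C illustration of the pointer-size point (the node layout and
the 32-byte line size are just assumptions for the example, not anything
MIPS has announced):

#include <stdio.h>

/* A typical list node: one word of payload plus a next pointer.
 * Compile the same source as a 32-bit and as a 64-bit program to compare. */
struct node {
    long         key;    /* natural word size: 4 or 8 bytes            */
    struct node *next;   /* 4 bytes on a 32-bit ABI, 8 on a 64-bit ABI */
};

#define LINE_BYTES 32    /* assumed cache line size, for illustration only */

int main(void)
{
    printf("sizeof(struct node)    = %lu bytes\n",
           (unsigned long)sizeof(struct node));
    printf("nodes per %d-byte line = %lu\n",
           LINE_BYTES, (unsigned long)(LINE_BYTES / sizeof(struct node)));
    /* Half as many nodes per line means roughly twice the misses when
     * chasing a long list -- on top of the extra load-use cycle. */
    return 0;
}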

------------------------------------------
Dave Christie       My opinions only.

gillies@m.cs.uiuc.edu (Don Gillies) (02/13/91)

I read an article today in "The Microprocessor Report" that said the
increase from 32 to 64 bits added approximately 10-15% to the size of
the chip.  This is quite surprising.  You'd think that going from 32
to 64 bits would double the size of the ALU and all the data paths and
registers.  Does this mean that the datapath and ALU and all the
registers accounted for only 10-15% of the microprocessor to begin
with?  What is the baseline design for this 10-15% increase in area?
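For what it's worth, here's the arithmetic behind that inference, sketched
in a few lines of C.  The doubling assumption is mine; the only number from
the article is the 10-15% figure.

#include <stdio.h>

/* If a fraction f of the die doubles in area and the rest stays the same,
 * the new die is (1 - f) + 2f = 1 + f times the old one.  So *if* going
 * to 64 bits really doubles the datapath/ALU/register area, a 10-15%
 * total growth implies those structures were only about 10-15% of the
 * original die.  (Widening needn't exactly double their area, of course.) */
int main(void)
{
    double f;
    for (f = 0.10; f <= 0.151; f += 0.05) {
        double new_area = (1.0 - f) + 2.0 * f;
        printf("if %2.0f%% of the die doubles, the die grows by %2.0f%%\n",
               f * 100.0, (new_area - 1.0) * 100.0);
    }
    return 0;
}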
-- 

cprice@mips.COM (Charlie Price) (02/22/91)

Yet again, hoping the net gods bless this posting...

In article <49041@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>[]
>>In article <45792@mips.mips.COM> cprice@mips.COM (Charlie Price) writes:
>>Superscalar is pretty easy to define, but
>>what *does* superpipelining really mean?
>
> (details of R4000 pipe... Thanks!)

Ya Betcha!
By the way, this is the level of detail in the information that
was handed out to the press.  You can look for pictures in
the glossy trade rags any time now.

>This actually runs contrary to MY definition of superpipelined, which
>is probably like MY definition of RISC - there's no such animal as THE
>definition...

While I agree that there is no "one true meaning" for the term,
we need *some* terminology we can agree on
to talk about these things.
I will have more to say on this in a subsequent posting.

>I would expect a multiple cycle hit for taken branches as well, without
>some kind of branch acceleration technique like a branch target cache.

Yes, increased load and branch latencies are among the effects 
of increasing the pipeline depth.  This is a non-negligible
consideration in picking a pipeline structure.

>I'm also curious about the compatibility provisions.
>Inquiring minds would, of course, like to know. Is this level of detail
>still subject to non-disclosure?

I can't find any discussion of this in our technology presentations,
so this will have to be a topic for the future.
-- 
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086-23650