[comp.arch] questionable nos.--SPARC vs. MIPS on gcc

baum@Apple.COM (Allen J. Baum) (12/23/88)

[]
>In article <82150@sun.uucp> edkelly@sun.UUCP (Ed Kelly) writes:
>2) There are lots of NOPs in MIPS code. This is an ARCHITECTURAL feature. 
>NOPs are not benign. As well as the direct cycles lost, lots of NOPs are bad 
>for code density, and they increase inst. cache miss penalties (due to more
>memory accesses and greater probability of a miss).
>     Current SPARC implementations incur a clock cycle penalty for some of the
>cases where MIPS has to insert NOPs, however, so counting all NOPs against MIPS
>overstates the situation. This includes the load-use interlock case (1,474,619)
>and the untaken annulled branch case (634,700). While these cycles are not 
>"architectural", many implementations will incur these penalties.

    Sorry, but I have to disagree here.
The no-ops are architectural because MIPS thought it unlikely that any
implementation could ever get away without an extra cycle in those
circumstances. Note that SPARC incurs these penalties. I predict that
they will not go away for quite a while.
    Specifically: Load/use interlock. The only way to avoid this is execution
of instructions out of order, which requires that multiple instructions be
examined in parallel, and a lot more. If this can be done, then the "Noop" can 
effectively be executed out of order as well, and its cycle will disappear.
The space it occupies is still there, but the effect on performance is 
second order, and negligible.
    The branch-delay slot annulment is a little different in that it may
take less than Herculean effort to avoid the wasted cycle. But again, that
works both ways. I'll be generous and only put half of those cycles back.

So, putting the cycles back where I believe they belong:

>			SPARC		MIPS		MIPS-SPARC
>
>Total Instructions	16,313,907	18,635,185	+2,321,278
>load interlock cycles	(1,474,619)	na
>untaken-branch annul	(  317,350)	na
Adjusted total		18,105,876	18,635,185	+  529,309
							~3% difference
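
For anyone who wants to check the arithmetic, here is a minimal sketch in
Python. The four counts are the ones quoted in this thread; the variable
names are made up for illustration, and "half of the annulled-branch cycles"
is the generous assumption stated above, not a measured figure.

```python
# Dynamic counts quoted in the thread for the gcc run.
sparc_instructions = 16_313_907
mips_instructions = 18_635_185          # already includes the NOPs
load_interlock_cycles = 1_474_619       # SPARC stalls on load-use
untaken_annulled_branch_cycles = 634_700

# Charge SPARC all of the load-interlock cycles, but only half of the
# untaken annulled-branch cycles (the "generous" assumption).
sparc_adjusted = (sparc_instructions
                  + load_interlock_cycles
                  + untaken_annulled_branch_cycles // 2)

gap = mips_instructions - sparc_adjusted
gap_percent = 100 * gap / sparc_adjusted

print(f"SPARC adjusted: {sparc_adjusted:,}")
print(f"MIPS - SPARC:   {gap:,} ({gap_percent:.1f}%)")
```

This gives 18,105,876 for the adjusted SPARC total and a gap of 529,309,
a little under 3%.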
--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

dsc@raspail.UUCP (Dave Christie) (12/24/88)

In article <22745@apple.Apple.COM>, baum@Apple.COM (Allen J. Baum) writes:
A lot of stuff that I quite agree with, and: 

> The no-ops are architectural because MIPS thought it unlikely that any
> implementation could ever get away without an extra cycle in those
> circumstances. Note that SPARC incurs these penalties. I predict that
> they will not go away for quite a while.
>    Specifically: Load/use interlock. The only way to avoid this is execution
> of instructions out of order, which requires that multiple instructions be
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> examined in parallel and a lot more. If this can be done, then the "Noop" can 
  ^^^^^^^^^^^^^^^^^^^^
> effectively be executed out of order as well, and its cycle will disappear.

This is precisely what the MIPS code reorganizer does - changing
the "natural" order of instructions to eliminate the load delay penalty
as much as possible by examining several instructions at once.
I realize you were probably referring to hardware code reorganization at
run-time, which is indeed a complex matter.  But reorganizing of object code
prior to execution does much the same thing without the complex hardware.
Granted, there are cases where reorganization can be done at runtime that
cannot be done beforehand (depending on how much hardware complexity you
want to throw in), but the global view a compiler has can allow optimizations
that the instruction issue logic can't do.  For the most part, hardware to
reorganize code at runtime to fill in one cycle of load delay would probably
fail to do so (causing a stall) as often as the MIPS code reorganizer has
to insert a noop.  So MIPS is already doing what you suggested. 
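
To make the idea concrete, here is a rough sketch in Python of what such a
reorganizer does for one load delay slot. This is not the actual MIPS
reorganizer: the Instr type, the register names and the
fill_load_delay_slots function are all hypothetical, and the sketch assumes
a single straight-line basic block with no stores or branches, so the only
hazards are register dependences.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    op: str
    dest: Optional[str] = None
    srcs: Tuple[str, ...] = ()


NOP = Instr("nop")


def conflicts(earlier, later):
    """True if `later` may not be moved ahead of `earlier`."""
    raw = earlier.dest is not None and earlier.dest in later.srcs
    war = later.dest is not None and later.dest in earlier.srcs
    waw = earlier.dest is not None and earlier.dest == later.dest
    return raw or war or waw


def fill_load_delay_slots(block):
    """Fill each load delay slot with an independent later instruction,
    or with a NOP when no such instruction can be found."""
    out = list(block)
    i = 0
    while i < len(out) - 1:
        load, user = out[i], out[i + 1]
        if load.op == "lw" and load.dest in user.srcs:
            for j in range(i + 2, len(out)):
                cand = out[j]
                # cand must be independent of everything it jumps over,
                # including the load itself.
                if not any(conflicts(e, cand) for e in out[i:j]):
                    out.insert(i + 1, out.pop(j))
                    break
            else:
                out.insert(i + 1, NOP)  # reorganizer failed: pay for a NOP
        i += 1
    return out


fillable = [
    Instr("lw", "r1", ("r4",)),
    Instr("add", "r2", ("r1", "r5")),   # uses r1 right after the load
    Instr("sub", "r3", ("r6", "r7")),   # independent, can fill the slot
]
stuck = [
    Instr("lw", "r1", ("r4",)),
    Instr("add", "r2", ("r1", "r5")),
    Instr("sub", "r3", ("r2", "r7")),   # depends on the add, cannot move
]
scheduled = [ins.op for ins in fill_load_delay_slots(fillable)]
padded = [ins.op for ins in fill_load_delay_slots(stuck)]
print(scheduled)
print(padded)
```

In the first block the independent sub is hoisted into the slot; in the
second nothing can be moved, so a nop is inserted.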
-- 
Dave Christie, Control Data Canada, Mississauga, Ontario
dsc@raspail.UUCP   or   {backbone}!uunet!rosevax!shamash!raspail!dsc
"Any opinions expressed herein do not necessarily reflect those of CDC,
and for that matter, probably no one else."

baum@Apple.COM (Allen J. Baum) (12/27/88)

[]
>In article <1117@raspail.UUCP> dsc@raspail.UUCP (Dave Christie) writes:
>
>In article <22745@apple.Apple.COM>, baum@Apple.COM (Allen J. Baum) writes:
>A lot of stuff that I quite agree with, and: 
>
>>    Specifically: Load/use interlock. The only way to avoid this is execution
>> of instructions out of order, which requires that multiple instructions be
>                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> examined in parallel and a lot more. If this can be done, then the "Noop" can 
>  ^^^^^^^^^^^^^^^^^^^^
>> effectively be executed out of order as well, and its cycle will disappear.
>
>This is precisely what the MIPS code reorganizer does...

 Of course it does. And, where it fails, it must insert a no-op. SPARC doesn't,
and its hardware will insert a cycle. The original article tried to make the
point that the architectural interlock meant that some implementations of SPARC
would not require that cycle. My point was that in these situations only a
very complex machine, one that examined multiple instructions simultaneously
in hardware and could execute instructions out of order, would be able to
avoid the interlock cycle, and that MIPS could spend that kind of hardware
too, and create 0-cycle no-op instructions to have the same effect.
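
A toy cycle count in Python shows the accounting. It assumes a single-issue,
in-order pipeline with a one-cycle load delay; the tuple encoding and the
cycles function are invented for this example and do not model any real
SPARC or MIPS implementation.

```python
def cycles(code, interlock=False):
    """Count cycles for a list of (op, dest, srcs) tuples.

    interlock=True models hardware that stalls one cycle when an
    instruction reads the result of the load immediately before it.
    With interlock=False the code is assumed to be hazard-free already
    (the compiler has inserted any NOPs that were needed).
    """
    total = 0
    prev = None
    for op, dest, srcs in code:
        total += 1                                  # one issue cycle each
        if interlock and prev is not None:
            prev_op, prev_dest, _ = prev
            if prev_op == "lw" and prev_dest in srcs:
                total += 1                          # hardware stall cycle
        prev = (op, dest, srcs)
    return total


# Same computation, with the load result used immediately.
sparc_style = [("lw", "r1", ("r4",)),
               ("add", "r2", ("r1", "r5"))]
mips_style = [("lw", "r1", ("r4",)),
              ("nop", None, ()),
              ("add", "r2", ("r1", "r5"))]

sparc_cycles = cycles(sparc_style, interlock=True)
mips_cycles = cycles(mips_style)
print(sparc_cycles, len(sparc_style))
print(mips_cycles, len(mips_style))
```

Under these assumptions both sequences take three cycles. The only
difference is that the MIPS-style sequence occupies three instruction words
instead of two.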

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

mo@prisma (12/30/88)

>/* Written  3:59 pm  Dec 26, 1988 by baum@apple in prisma:comp.arch */
>[]
>>In article <1117@raspail.UUCP> dsc@raspail.UUCP (Dave Christie) writes:
>>
>>In article <22745@apple.Apple.COM>, baum@Apple.COM (Allen J. Baum) writes:
>>A lot of stuff that I quite agree with, and: 
>>
>>>    Specifically: Load/use interlock. The only way to avoid this is execution
>>> of instructions out of order, which requires that multiple instructions be
>>                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> examined in parallel and a lot more. If this can be done, then the "Noop" can 
>>  ^^^^^^^^^^^^^^^^^^^^
>>> effectively be executed out of order as well, and its cycle will disappear.
>>
>>This is precisely what the MIPS code reorganizer does...
>
> Of course it does. And, where it fails, it must insert a no-op. SPARC doesn't,
>and its hardware will insert a cycle. The original article tried to make the
>point that the architectural interlock meant that some implementations of SPARC
>would not require that cycle. My point was that in these situations only a
>very complex machine, one that examined multiple instructions simultaneously
>in hardware and could execute instructions out of order, would be able to
>avoid the interlock cycle, and that MIPS could spend that kind of hardware
>too, and create 0-cycle no-op instructions to have the same effect.
>
>--
>		  baum@apple.com		(408)974-3385
>{decwrl,hplabs}!amdahl!apple!baum
>/* End of text from prisma:comp.arch */

Sorry folks, but you don't have to do the full-blown
superscalar "multiple instructions per cycle" to have non-stalling
loads.  Further, the fact that the SPARC doesn't INSIST the
compiler put NOPs in there means that the binaries for the
existing SPARC processors will run faster, without recompilation,
on a machine which can usually avoid the load stalls.

Running the same binaries, but faster, is clearly a win.

	-Mike

baum@Apple.COM (Allen J. Baum) (01/04/89)

[]
>In article <2700003@prisma> mo@prisma writes:
>>In article <22745@apple.Apple.COM>, baum@Apple.COM (Allen J. Baum) writes:
....stuff implying that the load-use interlock can't be avoided without
    looking at multiple instructions at once....

>Sorry folks, but you don't have to do the full-blown
>superscalar "multiple instructions per cycle" to have non-stalling
>loads.  Further, the fact that the SPARC doesn't INSIST the
>compiler put NOPs in there means that the binaries for the
>existing SPARC processors will run faster, without recompilation,
>on a machine which can usually avoid the load stalls.
>
>Running the same binaries, but faster, is clearly a win.

Sorry, but I believe you are wrong. While non-stalling loads don't require
this in general, they do require it in the load-use interlock case, i.e.
where the data is used immediately after it is loaded. If you know a scheme
that can avoid this, please post (and patent it) quickly.
 In any case, the same technique can be used to make the architecturally
required NOP instruction take zero cycles, WITHOUT recompilation.
 If you have a counterexample/argument, I'd love to hear it.
--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum