baum@Apple.COM (Allen J. Baum) (12/23/88)
[]
>In article <82150@sun.uucp> edkelly@sun.UUCP (Ed Kelly) writes:
>2) There are lots of NOPs in MIPS code. This is an ARCHITECTURAL feature.
>NOPs are not benign. As well as the direct cycles lost, lots of NOPs is bad
>for code density, and it increases inst. cache miss penalties(due to more
>memory accesses and greater probability of a miss).
> Current SPARC implementations incur a clock cycle penalty for some of the
>cases where MIPS has to insert NOPs however, so counting all NOPs against MIPS
>overstates the situation. This includes the load-use interlock case(1,474,619)
>and the untaken annulled branch case(634,700). While these cycles are not
>"architectural" many implementations will incur these penalties.

Sorry, but I have to disagree here. The no-ops are architectural because MIPS
thought it unlikely that any implementation could ever get away without an
extra cycle in those circumstances. Note that SPARC incurs these penalties.
I predict that they will not go away for quite a while.

Specifically: Load/use interlock. The only way to avoid this is execution
of instructions out of order, which requires that multiple instructions be
examined in parallel, and a lot more. If this can be done, then the "Noop" can
effectively be executed out of order as well, and its cycle will disappear.
The space it occupies is still there, but the effect on performance is
second order, and negligible.

The branch-delay slot annulment is a little different in that it may take
less than Herculean effort to avoid the wasted cycle. But again, that works
both ways. I'll be generous and only put half of those cycles back.

So, putting the cycles back where I believe they belong:

>                          SPARC          MIPS          MIPS-SPARC
>
>Total Instructions        16,313,907     18,635,185    +2,321,278
>load interlock cycles     (1,474,619)    na
>untaken-branch annul      (  317,350)
                           18,105,876     18,635,185    +  529,309

~3% difference

--
		baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum
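The adjustment in the table above is plain arithmetic, and a few lines of Python are enough to re-derive it. This is only a check of the figures quoted in the post; the function name and the `annul_fraction` parameter are made up for illustration, with 0.5 standing for the "only put half of those cycles back" choice.

```python
# Figures as quoted in the post above.
SPARC_INSTRUCTIONS = 16_313_907
MIPS_INSTRUCTIONS = 18_635_185            # already includes the NOPs
LOAD_INTERLOCK_CYCLES = 1_474_619         # SPARC load-use stalls
UNTAKEN_ANNULLED_BRANCH_CYCLES = 634_700  # SPARC untaken annulled branches


def adjusted_counts(annul_fraction=0.5):
    """Charge SPARC for its stall cycles and compare with MIPS.

    annul_fraction is the (hypothetical) share of annulled-branch
    cycles counted against SPARC; the post uses one half.
    Returns (sparc_total, mips_total, difference, percent_difference).
    """
    annul_cycles = round(UNTAKEN_ANNULLED_BRANCH_CYCLES * annul_fraction)
    sparc_total = SPARC_INSTRUCTIONS + LOAD_INTERLOCK_CYCLES + annul_cycles
    difference = MIPS_INSTRUCTIONS - sparc_total
    percent = 100.0 * difference / sparc_total
    return sparc_total, MIPS_INSTRUCTIONS, difference, percent


if __name__ == "__main__":
    sparc, mips, diff, pct = adjusted_counts()
    print(f"SPARC {sparc:,}  MIPS {mips:,}  diff {diff:,}  ({pct:.1f}%)")
```

Setting `annul_fraction=1.0` charges SPARC for every annulled-branch cycle and narrows the gap further; `0.0` ignores those cycles entirely.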
dsc@raspail.UUCP (Dave Christie) (12/24/88)
In article <22745@apple.Apple.COM>, baum@Apple.COM (Allen J. Baum) writes:

A lot of stuff that I quite agree with, and:

> The no-ops are architectural because MIPs thought it unlikely that any
> implementation could ever get away without an extra cycle in those
> circumstances. Note that SPARC incurs these penalties. I predict that
> they will not go away for quite a while.
> Specifically: Load/use interlock. The only way to avoid this is execution
> of instructions out of order, which requires that multiple instructions be
                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> examined in parallel and a lot more. It this can be done, then the "Noop" can
  ^^^^^^^^^^^^^^^^^^^^
> effectively be executed out of order as well, and its cycle will disappear.

This is precisely what the MIPS code reorganizer does - changing the "natural"
order of instructions to eliminate the load delay penalty as much as possible
by examining several instructions at once. I realize you were probably
referring to hardware code reorganization at run-time, which is indeed a
complex matter. But reorganizing of object code prior to execution does much
the same thing without the complex hardware.

Granted, there are cases where reorganization can be done at runtime that
cannot be done beforehand (depending on how much hardware complexity you want
to throw in), but the global view a compiler has can allow optimizations that
the instruction issue logic can't do. For the most part, hardware to
reorganize code at runtime to fill in one cycle of load delay would probably
fail to do so (causing a stall) as often as the MIPS code reorganizer has to
insert a noop.

So MIPS is already doing what you suggested.

--
Dave Christie, Control Data Canada, Mississauga, Ontario
dsc@raspail.UUCP or {backbone}!uunet!rosevax!shamash!raspail!dsc
"Any opinions expressed herein do not necessarily reflect those of CDC,
and for that matter, probably no one else."
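To make the reorganizer idea concrete, here is a toy sketch of the kind of pass described above. It is not the actual MIPS reorganizer. It assumes a made-up three-field instruction format, a single basic block with no stores or branches, a one-cycle load delay, and it only looks downward for an independent instruction to hoist into the slot after a load. When none is found it falls back to a no-op.

```python
from collections import namedtuple

# Hypothetical toy instruction: opcode, destination register (or None),
# and a tuple of source registers. This is not real MIPS encoding.
Instr = namedtuple("Instr", "op dest srcs")
NOP = Instr("nop", None, ())


def can_hoist(cand, load, skipped):
    """True if `cand` may move up into the slot right after `load`,
    jumping over the instructions in `skipped`."""
    if load.dest in cand.srcs:
        return False  # would just recreate the load-use hazard
    for s in skipped:
        if s.dest is not None and s.dest in cand.srcs:
            return False  # read-after-write on a skipped instruction
        if cand.dest is not None and cand.dest in s.srcs:
            return False  # write-after-read
        if cand.dest is not None and cand.dest == s.dest:
            return False  # write-after-write
    return True


def fill_load_delays(block):
    """Return a reordered copy of `block` in which no instruction reads
    a register in the slot immediately after the load that writes it."""
    out = list(block)
    i = 0
    while i + 1 < len(out):
        load, user = out[i], out[i + 1]
        if load.op == "load" and load.dest in user.srcs:
            for j in range(i + 2, len(out)):
                if can_hoist(out[j], load, out[i + 1:j]):
                    out.insert(i + 1, out.pop(j))  # fill the slot usefully
                    break
            else:
                out.insert(i + 1, NOP)  # nothing usable: pay with a no-op
        i += 1
    return out


# The `sub` is independent, so it can fill the slot after the load.
fillable = [
    Instr("load", "r1", ("r10",)),
    Instr("add", "r2", ("r1", "r3")),
    Instr("sub", "r4", ("r5", "r6")),
]

# Here the `sub` needs the result of the `add`, so only a no-op fits.
unfillable = [
    Instr("load", "r1", ("r10",)),
    Instr("add", "r2", ("r1", "r3")),
    Instr("sub", "r4", ("r2", "r5")),
]

if __name__ == "__main__":
    for name, block in (("fillable", fillable), ("unfillable", unfillable)):
        print(name, [ins.op for ins in fill_load_delays(block)])
```

The second block is the case the thread is arguing about: software has run out of useful work, so the cycle is lost whether it is spent on an explicit no-op or on a hardware stall.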
baum@Apple.COM (Allen J. Baum) (12/27/88)
[]
>In article <1117@raspail.UUCP> dsc@raspail.UUCP (Dave Christie) writes:
>
>In article <22745@apple.Apple.COM>, baum@Apple.COM (Allen J. Baum) writes:
>A lot of stuff that I quite agree with, and:
>
>> Specifically: Load/use interlock. The only way to avoid this is execution
>> of instructions out of order, which requires that multiple instructions be
>                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>> examined in parallel and a lot more. It this can be done, then the "Noop" can
>  ^^^^^^^^^^^^^^^^^^^^
>> effectively be executed out of order as well, and its cycle will disappear.
>
>This is precisely what the MIPS code reorganizer does...

Of course it does. And, where it fails, it must insert a no-op. SPARC doesn't;
its hardware inserts a cycle instead. The original article tried to make the
point that the architectural interlock meant that some implementations of SPARC
would not require that cycle. My point was that in these situations only a
very complex machine, one that examined multiple instructions simultaneously
in hardware and could execute instructions out of order, would be able to
avoid the interlock cycle, and that MIPS could spend that kind of hardware
and create 0-cycle no-op instructions to have the same effect.

--
		baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum
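The argument in this post can be put as a very small cycle-counting model. The sketch below is an illustration only, and it builds in the post's premise: a machine is either simple (a load-use pair costs one extra cycle and a no-op costs one cycle) or it has the hardware to hide the load-use delay, in which case the same hardware is assumed to make a no-op free. The `hazard_free` flag and the tuple format are invented for the example, and whether that premise holds is exactly what the thread disputes.

```python
def cycles(program, hazard_free=False):
    """Count cycles for a list of (op, dest, srcs) tuples on a toy
    single-issue, in-order machine with a one-cycle load delay.

    hazard_free=False: simple machine, stalls on load-use, NOP costs 1.
    hazard_free=True:  assumed machine that hides the load-use delay
                       and, per the post's premise, also issues NOPs
                       in zero cycles.
    """
    load_use_penalty = 0 if hazard_free else 1
    nop_cost = 0 if hazard_free else 1
    total = 0
    prev = None
    for op, dest, srcs in program:
        if op == "nop":
            total += nop_cost
        else:
            total += 1
            # Stall if the previous instruction loaded a register we read.
            if prev is not None and prev[0] == "load" and prev[1] in srcs:
                total += load_use_penalty
        prev = (op, dest, srcs)
    return total


# Same two-instruction computation, encoded both ways.
SPARC_STYLE = [("load", "r1", ("r10",)), ("add", "r2", ("r1", "r3"))]
MIPS_STYLE = [("load", "r1", ("r10",)), ("nop", None, ()),
              ("add", "r2", ("r1", "r3"))]

if __name__ == "__main__":
    for label, flag in (("simple machine", False), ("hazard-free", True)):
        print(label, cycles(SPARC_STYLE, flag), cycles(MIPS_STYLE, flag))
```

Under these assumptions the two encodings cost the same number of cycles on either machine. What the model does not capture is the extra instruction word in the MIPS-style sequence, which is the code density and cache footprint point raised at the start of the thread.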
mo@prisma (12/30/88)
>/* Written 3:59 pm Dec 26, 1988 by baum@apple in prisma:comp.arch */
>[]
>>In article <1117@raspail.UUCP> dsc@raspail.UUCP (Dave Christie) writes:
>>
>>In article <22745@apple.Apple.COM>, baum@Apple.COM (Allen J. Baum) writes:
>>A lot of stuff that I quite agree with, and:
>>
>>> Specifically: Load/use interlock. The only way to avoid this is execution
>>> of instructions out of order, which requires that multiple instructions be
>>                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>> examined in parallel and a lot more. It this can be done, then the "Noop" can
>>  ^^^^^^^^^^^^^^^^^^^^
>>> effectively be executed out of order as well, and its cycle will disappear.
>>
>>This is precisely what the MIPS code reorganizer does...
>
> Of course it does. And, where it fails, it must insert a no-op. SPARC doesn't,
>and its hardware will insert a cycle. The original article tried to make the
>point that the architectural interlock meant that some implementations of SPARC
>would not require that cycle. My point was that in these situations only a
>very complex machine that examined multiple instructions simultaneously in
>hardware, and could execute instructions out of order would be able to avoid
>the interlock cycle, and that MIPs could spend that kind of hardware, and
>create 0-cycle no-op instructions to have the same effect.
>
>--
>		baum@apple.com		(408)974-3385
>{decwrl,hplabs}!amdahl!apple!baum
>/* End of text from prisma:comp.arch */

Sorry folks, but you don't have to do the full-blown
superscalar "multiple instructions per cycle" to have non-stalling
loads. Further, the fact that the SPARC doesn't INSIST the
compiler put NOPs in there means that the binaries for the
existing SPARC processors will run faster, without recompilation,
on a machine which can usually avoid the load stalls.

Running the same binaries, but faster, is clearly a win.

	-Mike
baum@Apple.COM (Allen J. Baum) (01/04/89)
[]
>In article <2700003@prisma> mo@prisma writes:
>>In article <22745@apple.Apple.COM>, baum@Apple.COM (Allen J. Baum) writes:
....stuff implying that the load-use interlock can't be avoided without
looking at multiple instructions at once....

>Sorry folks, but you don't have to do the full-blown
>superscalar "multiple instructions per cycle" to have non-stalling
>loads. Further, the fact that the SPARC doesn't INSIST the
>compiler put NOPs in there means that the binaries for the
>existing SPARC processors will run faster, without recompilation,
>on a machine which can usually avoid the load stalls.
>
>Running the same binaries, but faster, is clearly a win.

Sorry, but I believe you are wrong. While non-stalling loads don't require
examining multiple instructions in general, they do require it in the
load-use interlock case, i.e. where the data is used immediately after it is
loaded. If you know a scheme that can avoid this, please post it (and patent
it) quickly. In any case, the same technique can be used to make the
architecturally required NOP instruction take zero cycles, WITHOUT
recompilation.

If you have a counterexample/argument, I'd love to hear it.

--
		baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum