hoelzle@Neon.Stanford.EDU (Urs Hoelzle) (11/13/90)
OK, here's something I've been wondering about for a long time: on the
SPARC, loads take *always* at least 2 cycles, and stores take 3 (in
all the Suns I know of). Even if the load's result isn't used by the
next instruction. Why??? If the compiler is able to fill the load
delay slots with other instructions, the load should be just one
cycle, right? E.g.
    load [l0], l1
    add  l2, l3, l2
    sub  l2, 1, l2
    add  l2, l1, l2    ; use the loaded value
*should* take at most 4 cycles.
One argument would be that it is a Good Thing to avoid pipeline
interlocks this way, speeding up the basic cycle etc. But SPARCs do
have interlocks, and in the sequence
    load [l0], l1
    add  l1, l2, l1
the load takes *3* cycles. So if the control logic can stall the
pipeline for one cycle to let the load complete, why not stall it for
1 or 2 and make the load one cycle?
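
To make the arithmetic concrete, here is a toy cycle counter in Python. It is only a sketch: the per-instruction costs are the figures quoted in this post (a 2-cycle load that costs 3 when its result is used immediately), not numbers from any datasheet, and the names `Insn`, `count_cycles`, `FILLED` and `BACK_TO_BACK` are made up for illustration. It compares that behaviour with the hypothetical 1-cycle load that stalls only when the next instruction needs the value.

```python
from typing import NamedTuple, Optional, Sequence, Tuple


class Insn(NamedTuple):
    op: str                  # "load", "add", "sub", ...
    dest: Optional[str]      # register written by the instruction
    srcs: Tuple[str, ...]    # registers read by the instruction


def count_cycles(program: Sequence[Insn], load_cost: int, use_stall: int) -> int:
    """Sum instruction costs (1 cycle for non-loads) and add `use_stall`
    cycles whenever a load's result is read by the very next instruction."""
    total = 0
    for i, insn in enumerate(program):
        total += load_cost if insn.op == "load" else 1
        if insn.op == "load" and i + 1 < len(program):
            if insn.dest in program[i + 1].srcs:
                total += use_stall
    return total


# First sequence from the post: two independent instructions follow the load.
FILLED = [
    Insn("load", "l1", ("l0",)),
    Insn("add", "l2", ("l2", "l3")),
    Insn("sub", "l2", ("l2",)),
    Insn("add", "l2", ("l2", "l1")),
]

# Second sequence from the post: the loaded value is used immediately.
BACK_TO_BACK = [
    Insn("load", "l1", ("l0",)),
    Insn("add", "l1", ("l1", "l2")),
]

for name, prog in (("filled", FILLED), ("back-to-back", BACK_TO_BACK)):
    as_described = count_cycles(prog, load_cost=2, use_stall=1)
    hypothetical = count_cycles(prog, load_cost=1, use_stall=1)
    print(name, as_described, hypothetical)
```

Under these assumptions the filled sequence costs 5 cycles as described versus 4 with a 1-cycle load, and the back-to-back sequence costs 4 versus 3.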
Similar story for stores: why not use a store buffer and make the
store 1 cycle?
Is there any SPARC implementation which does loads and stores in one
cycle, and if not, why can others do it (e.g. MIPS)? I believe that
the problems with loads & stores contribute significantly to the
relative slowness (high CPI) of SPARC vs {MIPS,R6000,...} at identical
clock rates. If I'm wrong, please correct me.
[I know this is a question of implementation, not architecture - but
why would anyone want to implement it this way?]
-Urs
------------------------------------------------------------------------------
Urs Hoelzle, CS PhD student hoelzle@cs.stanford.EDU
Center for Integrated Systems, CIS 42, Stanford University, Stanford, CA 94305
marc@fiji.cs.ucla.edu (Marc Tremblay) (11/13/90)
In article <1990Nov12.232129.21399@Neon.Stanford.EDU> hoelzle@Neon.Stanford.EDU (Urs Hoelzle) writes:
>OK, here's something I've been wondering about for a long time: on the
>SPARC, loads take *always* at least 2 cycles, and stores take 3 (in
>all the Suns I know of). Even if the load's result isn't used by the
>next instruction. Why???

Remember that SPARC offers the flexibility of base register plus index
register memory addressing. This means that at one point in the pipeline
*two* registers must be accessed just to generate the address. The
register that contains the data must also be accessed, which means that
a store requires *three* register accesses. The register file of early
implementations of the SPARC architecture had only two read ports
(per cycle). The address can be generated during one cycle and then
the register containing the data can be accessed in the following cycle.
The third cycle consists of actually storing the data off-chip
(no on-chip cache), hence the long latency.

For loads, early implementations of the SPARC (once again) had
multiplexed address and data busses shared between instruction and
data accesses. This means that while the address of the load goes out
and while the data comes back, no instruction can be fetched. This
results in an extra cycle of latency.

These implementation "features" can be overcome by dedicating a bit
more silicon and more pins to the processor chip.

_________________________________________________
Marc Tremblay
internet: marc@CS.UCLA.EDU
UUCP: ...!{uunet,ucbvax,rutgers}!cs.ucla.edu!marc
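
The store accounting in the reply above (three register reads through two read ports, then one off-chip write) can be written out as a small Python sketch. The function `store_cycles` and its parameters are hypothetical names for illustration, and the model only restates the reasoning in the post; it is not a description of any particular SPARC chip.

```python
import math


def store_cycles(address_regs: int, data_regs: int = 1,
                 read_ports: int = 2, memory_cycles: int = 1) -> int:
    """Cycles for a store in a toy model: register reads are limited by
    the number of read ports per cycle, then the data is written out."""
    reg_read_cycles = math.ceil((address_regs + data_regs) / read_ports)
    return reg_read_cycles + memory_cycles


# Base + index addressing with two read ports, as described in the post.
print(store_cycles(address_regs=2))

# Hypothetical register file with a third read port.
print(store_cycles(address_regs=2, read_ports=3))
```

With two read ports the model gives the 3 cycles described in the post; a third read port (an assumption, standing in for "a bit more silicon") would bring it down to 2 in this model.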