hoelzle@Neon.Stanford.EDU (Urs Hoelzle) (11/13/90)
OK, here's something I've been wondering about for a long time: on the SPARC, loads take *always* at least 2 cycles, and stores take 3 (in all the Suns I know of). Even if the load's result isn't used by the next instruction. Why??? If the compiler is able to fill the load delay slots with other instructions, the load should be just one cycle, right? E.g.

    load [l0], l1
    add  l2, l3, l2
    sub  l2, 1, l2
    add  l2, l1, l2    ; use the loaded value

*should* take at most 4 cycles.

One argument would be that it is a Good Thing to avoid pipeline interlocks this way, speeding up the basic cycle etc. But SPARCs do have interlocks, and in the sequence

    load [l0], l1
    add  l1, l2, l1

the load takes *3* cycles. So if the control logic can stall the pipeline for one cycle to let the load complete, why not stall it for 1 or 2 and make the load one cycle?

Similar story for stores: why not use a store buffer and make the store 1 cycle?

Is there any SPARC implementation which does loads and stores in one cycle, and if not, why can others do it (e.g. MIPS)? I believe that the problems with loads & stores contribute significantly to the relative slowness (high CPI) of SPARC vs {MIPS,R6000,...} at identical clock rates. If I'm wrong, please correct me.

[I know this is a question of implementation, not architecture - but why would anyone want to implement it this way?]

-Urs

------------------------------------------------------------------------------
Urs Hoelzle, CS PhD student                            hoelzle@cs.stanford.EDU
Center for Integrated Systems, CIS 42, Stanford University, Stanford, CA 94305
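The cycle counts in the post above can be checked with a small counting model. This is only a sketch built from the timings quoted in the post (2-cycle loads, 3-cycle stores, one extra cycle when the instruction right after a load uses its result); the function name `count_cycles`, the tuple encoding of instructions, and the 1-cycle cost assumed for ALU operations are hypothetical and not taken from any SPARC manual.

```python
# Toy cycle-counting model; all timings are the ones quoted in the post,
# except ALU_CYCLES, which is an assumption made for illustration.
LOAD_CYCLES = 2        # "loads take *always* at least 2 cycles"
STORE_CYCLES = 3       # "stores take 3"
ALU_CYCLES = 1         # assumed: simple ALU ops take one cycle
INTERLOCK_PENALTY = 1  # load followed directly by a use: "the load takes *3* cycles"


def count_cycles(program, load_cycles=LOAD_CYCLES):
    """Return total cycles for a list of (opcode, dest, sources) tuples."""
    total = 0
    for i, (op, dest, _sources) in enumerate(program):
        if op == "load":
            total += load_cycles
            # Stall if the very next instruction reads the loaded register.
            if i + 1 < len(program) and dest in program[i + 1][2]:
                total += INTERLOCK_PENALTY
        elif op == "store":
            total += STORE_CYCLES
        else:
            total += ALU_CYCLES
    return total


# The scheduled sequence from the post: the load result is used three
# instructions later, so no interlock is triggered.
scheduled = [
    ("load", "l1", ("l0",)),
    ("add", "l2", ("l2", "l3")),
    ("sub", "l2", ("l2",)),
    ("add", "l2", ("l2", "l1")),
]

# The load-use sequence from the post: the add reads l1 immediately.
load_use = [
    ("load", "l1", ("l0",)),
    ("add", "l1", ("l1", "l2")),
]

print(count_cycles(scheduled))                 # with the quoted 2-cycle load
print(count_cycles(scheduled, load_cycles=1))  # with the wished-for 1-cycle load
print(count_cycles(load_use))                  # load (2 + 1 stall) plus the add
```

Under these assumptions the scheduled sequence costs one cycle more than the four the post asks for, and that extra cycle comes entirely from the load.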
marc@fiji.cs.ucla.edu (Marc Tremblay) (11/13/90)
In article <1990Nov12.232129.21399@Neon.Stanford.EDU> hoelzle@Neon.Stanford.EDU (Urs Hoelzle) writes:
>OK, here's something I've been wondering about for a long time: on the
>SPARC, loads take *always* at least 2 cycles, and stores take 3 (in
>all the Suns I know of). Even if the load's result isn't used by the
>next instruction. Why???

Remember that SPARC offers the flexibility of base register plus index register memory addressing. This means that at one point in the pipeline *two* registers must be accessed just to generate the address. The register that contains the data must also be accessed, which means that a store requires *three* register accesses. The register file of early implementations of the SPARC architecture had only two read ports (per cycle). The address can be generated during one cycle and then the register containing the data can be accessed in the following cycle. The third cycle consists of actually storing the data off-chip (no on-chip cache), hence the long latency.

For loads, early implementations of the SPARC (once again) had multiplexed address and data busses shared between instructions and data. This means that while the address of the load goes out and while the data comes back, no instruction can be fetched. This results in an extra cycle of latency.

These implementation "features" can be overcome by dedicating a bit more silicon and more pins to the processor chip.

_________________________________________________
Marc Tremblay
internet: marc@CS.UCLA.EDU
UUCP: ...!{uunet,ucbvax,rutgers}!cs.ucla.edu!marc
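The arithmetic behind this explanation can be written down as a rough sketch. It only restates the reasoning in the post (three register reads through two read ports, one off-chip memory cycle, one instruction-fetch slot lost to a shared bus); the function names and parameters are hypothetical, and the model says nothing about how any particular SPARC chip is actually built.

```python
import math

# Assumption taken from the post: early SPARC register files had two read ports.
REGISTER_READ_PORTS = 2


def register_read_cycles(registers_needed, ports=REGISTER_READ_PORTS):
    """Cycles needed to read the given number of registers through the ports."""
    return math.ceil(registers_needed / ports)


def store_cycles(ports=REGISTER_READ_PORTS, memory_cycles=1):
    """Store = read base + index + data registers, then write off-chip."""
    registers_needed = 3  # base register, index register, data register
    return register_read_cycles(registers_needed, ports) + memory_cycles


def load_cycles(shared_bus=True):
    """Load = one cycle, plus one lost instruction fetch if the bus is shared."""
    lost_fetch_cycles = 1 if shared_bus else 0
    return 1 + lost_fetch_cycles


print(store_cycles())                 # two register-read cycles + one memory cycle
print(load_cycles())                  # shared instruction/data bus
print(load_cycles(shared_bus=False))  # hypothetical separate busses
```

With the defaults, the model reproduces the 3-cycle store and 2-cycle load from the original question; changing `ports` or `shared_bus` shows where the extra silicon and pins mentioned above would buy cycles back.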