[comp.arch] load delays on SPARC

hoelzle@Neon.Stanford.EDU (Urs Hoelzle) (11/13/90)

OK, here's something I've been wondering about for a long time: on the
SPARC, loads *always* take at least 2 cycles, and stores take 3 (in
all the Suns I know of), even if the load's result isn't used by the
next instruction.  Why???  If the compiler is able to fill the load
delay slots with other instructions, the load should be just one
cycle, right?  E.g.

	load [l0], l1
	add l2, l3, l2
	sub l2, 1, l2
	add l2, l1, l2		; use the loaded value

*should* take at most 4 cycles.

One argument would be that it is a Good Thing to avoid pipeline
interlocks this way, speeding up the basic cycle etc.  But SPARCs do
have interlocks, and in the sequence

	load [l0], l1
	add  l1, l2, l1

the load takes *3* cycles.  So if the control logic can stall the
pipeline for one cycle to let the load complete, why not stall it for
1 or 2 cycles only when needed and make the load itself one cycle?
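
The timings described above can be put into a toy cycle counter. This is only a sketch of the numbers quoted in this post, not a model of any real SPARC pipeline: it assumes every non-load instruction costs 1 cycle, a load costs 2 cycles, and one extra interlock cycle is charged when the very next instruction reads the loaded register. The function name `count_cycles` and the tuple encoding of instructions are made up for illustration.

```python
# Toy cycle model of the timings quoted in the post (assumption: not a real
# SPARC simulator). Each instruction is (opcode, dest, sources).

def count_cycles(program):
    """Return total cycles under the assumed costs described above."""
    total = 0
    for i, (op, dest, srcs) in enumerate(program):
        if op == "load":
            total += 2  # assumed base cost of a load
            nxt = program[i + 1] if i + 1 < len(program) else None
            if nxt is not None and dest in nxt[2]:
                total += 1  # assumed interlock: next instruction uses the result
        else:
            total += 1  # assumed cost of any other instruction
    return total

# First sequence from the post: the load result is used three instructions later.
scheduled = [
    ("load", "l1", ["l0"]),
    ("add", "l2", ["l2", "l3"]),
    ("sub", "l2", ["l2"]),
    ("add", "l2", ["l2", "l1"]),
]

# Second sequence: the load result is used immediately.
back_to_back = [
    ("load", "l1", ["l0"]),
    ("add", "l1", ["l1", "l2"]),
]

print(count_cycles(scheduled))     # 5 under these assumptions, versus the hoped-for 4
print(count_cycles(back_to_back))  # 4: the load accounts for 3 of them
```

Under these assumed costs the scheduled sequence pays for the second load cycle even though nothing depends on it yet, which is exactly the overhead being questioned.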

Similar story for stores: why not use a store buffer and make the
store 1 cycle?  

Is there any SPARC implementation which does loads and stores in one
cycle, and if not, why can others do it (e.g. MIPS)?  I believe that
the problems with loads & stores contribute significantly to the
relative slowness (high CPI) of SPARC vs {MIPS,R6000,...} at identical
clock rates.  If I'm wrong, please correct me.

[I know this is a question of implementation, not architecture - but
why would anyone want to implement it this way?]

-Urs

------------------------------------------------------------------------------
Urs Hoelzle, CS PhD student			       hoelzle@cs.stanford.EDU
Center for Integrated Systems, CIS 42, Stanford University, Stanford, CA 94305


marc@fiji.cs.ucla.edu (Marc Tremblay) (11/13/90)

In article <1990Nov12.232129.21399@Neon.Stanford.EDU> hoelzle@Neon.Stanford.EDU (Urs Hoelzle) writes:
>OK, here's something I've been wondering about for a long time: on the
>SPARC, loads take *always* at least 2 cycles, and stores take 3 (in
>all the Suns I know of).  Even if the load's result isn't used by the
>next instruction.  Why???

Remember that SPARC offers the flexibility of base register
plus index register memory addressing. This means that
at one point in the pipeline *two* registers must be accessed
just to generate the address.
The register that contains the data must also be accessed
which means that a store requires *three* register accesses.
The register file of early implementations of the SPARC architecture
had only two read ports (per cycle). The address can be generated
during one cycle and then the register containing the data can be
accessed in the following cycle. The third cycle consists of
actually storing the data off-chip (no on-chip cache),
hence the long latency.
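
As a rough sanity check of the store arithmetic above, here is a minimal sketch. It assumes the register reads are simply packed into as few cycles as the read ports allow and that the off-chip write adds one more cycle; the function `store_cycles` and its parameters are hypothetical names, not anything from a SPARC manual.

```python
import math

def store_cycles(register_reads=3, read_ports=2, memory_cycles=1):
    """Cycles for a store: register-read cycles plus the off-chip write.

    Assumption: reads are packed into ceil(reads / ports) cycles.
    """
    return math.ceil(register_reads / read_ports) + memory_cycles

print(store_cycles())              # 3: base + index, then data, then the write
print(store_cycles(read_ports=3))  # 2: a third read port would save one cycle
```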

For loads, early implementations of the SPARC (once again)
multiplexed the address and data busses between instructions and data.
This means that no instruction can be fetched while the address of
the load goes out and while the data comes back.
This results in an extra cycle of latency.
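
The effect of the shared bus can be sketched the same way. This is a simplification that assumes one instruction fetch per cycle and that each load steals exactly one bus cycle from instruction fetch when the bus is shared; `total_cycles` is a hypothetical helper for illustration only.

```python
def total_cycles(n_instructions, n_loads, shared_bus=True):
    """Cycles to run a straight-line sequence under the assumed bus model."""
    fetch_cycles = n_instructions                 # one fetch per instruction
    data_cycles = n_loads if shared_bus else 0    # each load blocks one fetch
    return fetch_cycles + data_cycles

print(total_cycles(4, 1))                    # 5: shared instruction/data bus
print(total_cycles(4, 1, shared_bus=False))  # 4: separate busses
```

With these assumptions, a four-instruction sequence containing one load takes 5 cycles on a shared bus and 4 with separate instruction and data busses.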

These implementation "features" can be overcome by dedicating
a bit more silicon and more pins to the processor chip.

_________________________________________________
Marc Tremblay
internet: marc@CS.UCLA.EDU
UUCP: ...!{uunet,ucbvax,rutgers}!cs.ucla.edu!marc