[comp.arch] load-store architecture

preston@titan.rice.edu (Preston Briggs) (11/12/89)

In article <1702@ncrcce.StPaul.NCR.COM> pasek@c10sd3.StPaul.NCR.COM (M. A. Pasek) writes:
>In article I wrote:
>>...Separate load-store instructions allow lower-cost access
>>to variables not held in registers.  Why?  Imagine trying to add from memory
>>to a register.  A CISC version might be

>>	ADD	r5,0(r8)	; load from address specified by r8
>>				  wait for memory, doing 
>>					nothing much in the meantime
>>				  add the value to r5
>>				  throw away the value you just waited for.
>>		cost = 1 or 2 + memory access time

>> and a risc version might be

>>	LD	r7,0(r8)	; load from address in r8 into r7
>>	.			; do some useful work while waiting for memory
>>	ADD	r5,r6,r7	; r5 := r6 + r7
>>		cost equals 2.  Additionally, the old value of r6 is still
>>		around, and the value from memory is handy if we need it.


>I don't know if I'm just stupid, but on the machine I work with, the "CISC"
>example takes a total of ONE cycle, assuming that the next instruction does
>not reference R5, and is not contained in the memory location referenced 
>by R8.  

And (perhaps) not use the adder, which is busy with the previous instruction.
I'm sure you're not stupid -- you're just pointing out some of the 
exaggeration in my posting.  On the other hand, that's a mighty impressive
cache you've got that will service memory requests in 1 cycle.
My 6809 had memory that fast, but on newer machines, I can't afford
memory that keeps up with the CPU.  On yet another hand, IBM and CDC
(and others?) have built CPUs that will allow all sorts of out-of-order
execution, but they were complex and expensive (and admittedly fast).
The point is not to waste functional resources waiting for memory.

>Finally, the "throw away" of the value we just waited for can be avoided
>by coding the "RISC" example on the "CISC" machine, with the same results:
>it takes 2 cycles (if you do that "useful work" in between the two 
>instructions).  

Of course.  But RISC people (and this compiler person) argue that
you nearly always want to preserve the value; therefore, why
provide the more complex "reg = reg op memory" (or worse) instructions?

It's kind of a question of balance.  Everything on the CPU costs money
and we'd like to keep it all busy all the time.  I think this'll be
the big challenge of chips like the i860 and the new IBM erisc.
We'll have an integer adder, an fp adder, an fp multiplier,
and perhaps more.  How can we keep it all going?  Waiting on memory
is not the answer.



An interesting paper on "balance" is
  "Estimating interlock and improving balance for pipelined architectures"
   Callahan, Cocke, Kennedy
   International Conference on Parallel Processing, 1987

They define several kinds of balance.
The "machine balance" is the rate data can be fetched (number of words
per cycle) divided the rate at which floating point operations 
can be performed (number of flops/cycle).

The "loop balance" of a given loop is the number of words accessed
divided by the number of flops performed.

If, for some particular loop, the loop balance is greater than the
machine balance, then we can say the loop is "memory-bound."
On the other hand, if the loop balance is less than the
machine balance, then the loop is "compute-bound."
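
In code, the test is just a comparison of two ratios.  Here's a minimal
C sketch (the function name and the counts plugged in are mine, purely
for illustration, not from the paper):

	#include <stdio.h>

	/* machine balance = (words fetched per cycle) / (flops per cycle)
	   loop balance    = (words accessed in loop)  / (flops in loop)
	   Both are measured in words per flop.                            */

	static const char *classify(double loop_balance, double machine_balance)
	{
	    if (loop_balance > machine_balance)
	        return "memory-bound";
	    if (loop_balance < machine_balance)
	        return "compute-bound";
	    return "balanced";
	}

	int main(void)
	{
	    double mb = 1.0 / 2.0;   /* illustrative machine: 1 word/cycle, 2 flops/cycle */
	    double lb = 4.0 / 6.0;   /* illustrative loop: 4 accesses, 6 flops per iteration */

	    printf("loop is %s\n", classify(lb, mb));   /* 0.67 > 0.50 => memory-bound */
	    return 0;
	}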

Consider a simple version of matrix multiply.
	do i = 1, n
	  do j = 1, n
	    C(i, j) = 0
	    do k = 1, n
	      C(i, j) = C(i, j) + A(i, k) * B(k, j)
	    end
	  end
	end

In this case the loop balance of the inner loop is 3 accesses/2 flops.
For the i860, we could say its balance is 1 access/2 flops.
(This is all complicated by cache, load double, load quad, non-pipelined
 operations, etc.  Since the chip can only fetch off-chip every other cycle,
 we could claim the balance is much lower.  On the other hand, if we don't
 use pipelined FP, then the balance is raised.)
So, the i860 is memory-bound in this case, even if all the data
fits in cache.
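
To make the counting explicit, here is the same inner statement in C with
the per-iteration accesses and flops marked.  (This is just a sketch with
a made-up size N; the count treats C(i,j) as a single access per iteration,
i.e. the running sum isn't reloaded and restored every trip.)

	#define N 64    /* made-up size, for illustration */

	void matmul_naive(double C[N][N], double A[N][N], double B[N][N])
	{
	    int i, j, k;
	    for (i = 0; i < N; i++)
	        for (j = 0; j < N; j++) {
	            C[i][j] = 0.0;
	            for (k = 0; k < N; k++)
	                /* accesses: A[i][k], B[k][j], C[i][j]  -> 3 words
	                   flops:    1 multiply + 1 add         -> 2 flops */
	                C[i][j] += A[i][k] * B[k][j];
	        }
	}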

The paper points out refinements of these balance computations
plus transformations that help match machine balance and loop
balance.  For example, we can rewrite the above loop, giving
	do i = 1, n
	  do j = 1, n
	    t = 0
	    do k = 1, n
	      t = t + A(i, k) * B(k, j)
	    end
	    C(i, j) = t
	  end
	end
where t will end up in a register.

Now the loop balance is 2 accesses/2 flops which is better,
but the i860 would still be memory-bound.
If we "unroll and jam" (as described in the paper), we get
	do i = 1, n
	  do j = 1, n, 4
	    t0 = 0
	    t1 = 0
	    t2 = 0
	    t3 = 0
	    do k = 1, n
	      ta = A(i, k)
	      t0 = t0 + ta * B(k, j+0)
	      t1 = t1 + ta * B(k, j+1)
	      t2 = t2 + ta * B(k, j+2)
	      t3 = t3 + ta * B(k, j+3)
	    end
	    C(i, j+0) = t0
	    C(i, j+1) = t1
	    C(i, j+2) = t2
	    C(i, j+3) = t3
	  end
	end
where n is assumed to be divisible by 4 for clarity.

Now the loop balance is 5 accesses/8 flops.  This is a lot better
(comparing 5/8 to 1/2).  Of course, there are a lot of possible 
variations; more are explored in the paper.
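
For what it's worth, here's a C sketch of the unrolled-and-jammed version
that also handles the leftover columns when n isn't a multiple of 4 (the
cleanup loop is my addition, not part of the paper's example; N is a
made-up bound with n <= N assumed):

	/* Unroll-and-jam of the j loop by 4, plus a scalar cleanup loop for
	   the leftover columns.  Accumulators t0..t3 and ta are the five
	   values we hope end up in fp registers.                           */
	#define N 64

	void matmul_uj4(double C[N][N], double A[N][N], double B[N][N], int n)
	{
	    int i, j, k;
	    int limit = n - (n % 4);      /* last column handled 4 at a time */

	    for (i = 0; i < n; i++) {
	        for (j = 0; j < limit; j += 4) {
	            double t0 = 0.0, t1 = 0.0, t2 = 0.0, t3 = 0.0;
	            for (k = 0; k < n; k++) {
	                double ta = A[i][k];      /* 1 access feeds 4 products */
	                t0 += ta * B[k][j + 0];
	                t1 += ta * B[k][j + 1];
	                t2 += ta * B[k][j + 2];
	                t3 += ta * B[k][j + 3];
	            }                             /* 5 accesses, 8 flops per iteration */
	            C[i][j + 0] = t0;
	            C[i][j + 1] = t1;
	            C[i][j + 2] = t2;
	            C[i][j + 3] = t3;
	        }
	        for (; j < n; j++) {              /* cleanup: 0 to 3 leftover columns */
	            double t = 0.0;
	            for (k = 0; k < n; k++)
	                t += A[i][k] * B[k][j];
	            C[i][j] = t;
	        }
	    }
	}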

Note that we suddenly need 5 fp registers to hold intermediate results.
If we continue unrolling and jamming, we'll want even more registers;
hence our violent reaction to people who suggest that 4 registers
are plenty.  Note that it isn't the loop that requires the registers,
or even the transformation; it's the mismatch between the 
machine's balance and the loop's balance.


Hillis (in The Connection Machine) takes another tack.
He points out that the cpu is usually kept wonderfully busy,
but that the bulk of memory (often the bulk of a machine) is
usually idle.  That is, only a small fraction is being accessed
at a time.  This leads him to a design with many processors, each with
proportionally less memory.  His hope is that with an adequate number of processors,
he can keep *all* the memory reasonably busy.

Preston Briggs
preston@titan.rice.edu