preston@titan.rice.edu (Preston Briggs) (11/12/89)
In article <1702@ncrcce.StPaul.NCR.COM> pasek@c10sd3.StPaul.NCR.COM (M. A. Pasek) writes:
>In article I wrote:
>>...Separate load-store instructions allow lower cost of access
>>to non-registered variables.  Why?  Imagine trying to add from memory
>>to a register.  A CISC version might be
>>
>>      ADD  r5,0(r8)    ; load from address specified by r8
>>                         wait for memory, doing
>>                         nothing much in the meantime
>>                         add the value to r5
>>                         throw away the value you just waited for
>>
>>      cost = 1 or 2 + memory access time
>>
>>and a RISC version might be
>>
>>      LD   r7,0(r8)    ; load from address in r8 into r7
>>      .                ; do some useful work while waiting for memory
>>      ADD  r5,r6,r7    ; r5 := r6 + r7
>>
>>      cost equals 2.  Additionally, the old value of r6 is still
>>      around, and the value from memory is handy if we need it.
>
>I don't know if I'm just stupid, but on the machine I work with, the "CISC"
>example takes a total of ONE cycle, assuming that the next instruction does
>not reference R5, and is not contained in the memory location referenced
>by R8.

And (perhaps) doesn't use the adder, which is busy with the previous
instruction.

I'm sure you're not stupid -- you're just pointing out some of the
exaggeration in my posting.

On the other hand, that's a mighty impressive cache you've got that will
service memory requests in 1 cycle.  My 6809 had memory that fast, but on
newer machines, I can't afford memory to keep up with the CPU.

On yet another hand, IBM and CDC (and others?) have built CPUs that allow
all sorts of out-of-order execution, but they were complex and expensive
(and admittedly fast).  The point is not to waste functional units waiting
for memory.

>Finally, the "throw away" of the value we just waited for can be avoided
>by coding the "RISC" example on the "CISC" machine, with the same results:
>it takes 2 cycles (if you do that "useful work" in between the two
>instructions).

Of course.  But RISC people (and this compiler person) argue that you
nearly always want to preserve the value; therefore, why provide the more
complex "reg = reg op memory" (or worse) instructions?

It's kind of a question of balance.  Everything on the CPU costs money and
we'd like to keep it all busy all the time.  I think this will be the big
challenge of chips like the i860 and the new IBM erisc.  We'll have an
integer adder, an FP adder, an FP multiplier, and perhaps more.  How can
we keep it all going?  Waiting on memory is not the answer.

An interesting paper on "balance" is

    "Estimating Interlock and Improving Balance for Pipelined Architectures"
    Callahan, Cocke, and Kennedy
    International Conference on Parallel Processing, 1987

They define several kinds of balance.  The "machine balance" is the rate
at which data can be fetched (number of words per cycle) divided by the
rate at which floating point operations can be performed (number of flops
per cycle).  The "loop balance" of a given loop is the number of words
accessed divided by the number of flops performed.  If, for some particular
loop, the loop balance is greater than the machine balance, then we can say
the loop is "memory-bound."  On the other hand, if the loop balance is less
than the machine balance, then the loop is "compute-bound."

Consider a simple version of matrix multiply.

    do i = 1, n
      do j = 1, n
        C(i, j) = 0
        do k = 1, n
          C(i, j) = C(i, j) + A(i, k) * B(k, j)
        end
      end
    end

In this case the loop balance of the inner loop is 3 accesses/2 flops
(counting A(i, k), B(k, j), and C(i, j) once each, against one multiply
and one add).  For the i860, we could say its balance is 1 access/2 flops.
(This is all complicated by cache, load double, load quad, non-pipelined
operations, etc.  Since the chip can only fetch off-chip every other
cycle, we could claim the balance is much lower.  On the other hand, if we
don't use pipelined FP, then the balance is raised.)  So, the i860 is
memory-bound in this case, even if all the data fits in cache.
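To make the comparison concrete, here's a tiny sketch in C (my own
illustration, not from the paper); the access and flop counts are simply
the ones claimed above for the inner loop and for the i860.

    /* Balance arithmetic: loop balance = words accessed per iteration
       divided by flops per iteration; machine balance = words fetched
       per cycle divided by flops per cycle.  */
    #include <stdio.h>

    int main(void)
    {
        double loop_accesses = 3.0;  /* A(i,k), B(k,j), C(i,j)   */
        double loop_flops    = 2.0;  /* one multiply, one add    */
        double mach_words    = 1.0;  /* words fetched per cycle  */
        double mach_flops    = 2.0;  /* flops started per cycle  */

        double loop_balance = loop_accesses / loop_flops;  /* 1.5 */
        double mach_balance = mach_words / mach_flops;     /* 0.5 */

        if (loop_balance > mach_balance)
            printf("memory-bound (%g > %g)\n", loop_balance, mach_balance);
        else
            printf("compute-bound (%g <= %g)\n", loop_balance, mach_balance);
        return 0;
    }

The same comparison applies to the transformed loops below; only the
counts change.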
The paper points out refinements of these balance computations plus
transformations that help match machine balance and loop balance.  For
example, we can rewrite the above loop, giving

    do i = 1, n
      do j = 1, n
        t = 0
        do k = 1, n
          t = t + A(i, k) * B(k, j)
        end
        C(i, j) = t
      end
    end

where t will end up in a register.  Now the loop balance is 2 accesses/2
flops, which is better, but the i860 would still be memory-bound.

If we "unroll and jam" (as described in the paper), we get

    do i = 1, n
      do j = 1, n, 4
        t0 = 0
        t1 = 0
        t2 = 0
        t3 = 0
        do k = 1, n
          ta = A(i, k)
          t0 = t0 + ta * B(k, j+0)
          t1 = t1 + ta * B(k, j+1)
          t2 = t2 + ta * B(k, j+2)
          t3 = t3 + ta * B(k, j+3)
        end
        C(i, j+0) = t0
        C(i, j+1) = t1
        C(i, j+2) = t2
        C(i, j+3) = t3
      end
    end

where n is assumed divisible by 4 for clarity.  Now the loop balance is
5 accesses/8 flops (one load of A(i, k) and four loads of B per iteration
of the k loop, against four multiplies and four adds).  This is a lot
better (comparing 5/8 to 1/2).  Of course, there are a lot of possible
variations; more are explored in the paper.

Note that we suddenly need 5 FP registers to hold intermediate results.
If we continue unrolling and jamming, we'll want even more registers;
hence our violent reaction to people who suggest that 4 registers are
plenty.  Note that it isn't the loop that requires the registers, or even
the transformation; it's the mismatch between the machine's balance and
the loop's balance.

Hillis (in The Connection Machine) takes another tack.  He points out
that the CPU is usually kept wonderfully busy, but that the bulk of memory
(often the bulk of a machine) is usually idle.  That is, only a small
fraction is being accessed at a time.  This leads him to many processors,
each with proportionally less memory.  His hope is that with an adequate
number of processors, he can keep *all* the memory reasonably busy.

Preston Briggs
preston@titan.rice.edu