w-colinp@microsoft.UUCP (Colin Plumb) (11/30/88)
In article <8939@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>Note, of course, that this is a model of the world likely to go away,
>for anybody who is serious about scalar floating point.  The structure
>is only likely to happen when you have a long-cycle-count FP device,
>which is still fast enough to be desirable, but where the addition of a few
>cycles' overhead doesn't clobber the performance.  As micros with
>tight-coupled FPUs (MIPS R3000/R3010, Moto 88K, Cypress SPARC + TI 8847)
>and low-cycle count operations (2-10 cycles for 64-bit add & mult)
>become more common, you just lose too much performance moving data
>around to afford that kind of interface, at least in a scalar unit.

Gee, this is interesting, considering that the MIPS integer multiply/divide
is done basically the same way: move operands to special registers, issue
the instruction, and if a read is attempted before the results are ready,
stall.  The only difference is that the registers are on-chip and that the
instruction that starts the multiply isn't a store.

On, say, a 68020, you simply can't have loads and stores complete in one
cycle, so this wouldn't work, and the address computation on the MIPS might
cost a cycle, but on the 29000, where the address is known by the end of
the decode stage, the load/store can leave the chip during the execute
phase and complete that same cycle, so you're only losing one cycle
transferring operands (less, if you make setup-time assumptions).
This seems pretty tightly coupled to me.

(For the confused: the 29000 has some extra address modifier bits, some of
which are used to address the floating-point unit, and the others are
used to indicate what sort of transaction this is - are you writing
operand 1, operand 2, or an instruction?  You can use both address and
data buses to send 64 bits of data in one cycle.  Reads are only 32
bits at a time, sorry.)
Basically, I claim that this model of FPU operation is very close in speed
to one where the CPU has more knowledge of the FPU, if done properly.

(P.S. Question to MIPS: I think you only need to back up one extra
instruction on an exception if that instruction is a *taken* branch.
Do you do it this way, or on all branches?)
--
-Colin (microsof!w-colinp@sun.com)
mash@mips.COM (John Mashey) (12/01/88)
In article <1044@microsoft.UUCP> w-colinp@microsoft.UUCP (Colin Plumb) writes:
>In article <8939@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>Note, of course, that this is a model of the world likely to go away,
>>for anybody who is serious about scalar floating point.  The structure
>>is only likely to happen when you have a long-cycle-count FP device,...
>Gee, this is interesting, considering that the MIPS integer multiply/divide
>is done basically the same way: move operands to special registers, issue
>instruction, if a read is attempted before the results are ready, stall.

Actually, it doesn't work this way: the mult instruction picks up the values
from the regular registers, i.e., one instruction both fetches the data and
initiates the operation.  The only "extra" instruction is the one that
moves the result back.

>On, say, a 68020, you simply can't have loads and stores complete in one
>cycle, so this wouldn't work, and the address computation on the MIPS might
>cost a cycle, but on the 29000, where the address is known by the end of
>the decode stage, the load/store can leave the chip during the execute
>phase and complete that same cycle, so you're only losing one cycle
>transferring operands (less, if you make setup-time assumptions).
>This seems pretty tightly coupled to me.
>(For the confused: the 29000 has some extra address modifier bits, some of
>which are used to address the floating-point unit, and the others are
>used to indicate what sort of transaction this is - are you writing
>operand 1, operand 2, or an instruction.  You can use both address and
>data buses to send 64 bits of data in one cycle.  Reads are only 32
>bits at a time, sorry.)
>Basically, I claim that this model of FPU operation is very close
>in speed to one where the CPU has more knowledge of the FPU, if
>done properly.

Here's an example.  Maybe this is close in speed, or maybe not, or maybe
I don't understand the 29K FPU interface well enough.
Here's a small example (always be wary of small examples; I don't have
time right now for anything more substantive):

	main() {
		double x, y, z;
		x = y + z;
	}

This generates (for an R3000, and assuming data & instructions in-cache):

	 #   2		double x, y, z;
	 #   3		x = y + z;
		l.d	$f4, 8($sp)
		l.d	$f6, 0($sp)
		add.d	$f8, $f4, $f6
		s.d	$f8, 16($sp)

The l.d and s.d actually turn into pairs of loads & stores, and this
sequence takes 9 cycles: 4 lwc1's, a nop (which might be scheduled away),
2 cycles for the add.d, and 2 swc1's.

Assume a 29K with cache, so it has loads like the R3000's.  As far as I
can tell, a 29K would use 17:

	4 load cycles (best case: in some cases, offset calculations, or
	  use of load multiple with count-setup, would take another one or two)
	2 writes (get the data over to the FPU)
	1 write (start the add-double)
	6 cycles (do the add-double)
	2 reads (get the 64-bit result back)
	2 stores (put the result back in memory; again assume no offset
	  calculations)

Now, the biggest difference is the 6 cycles versus 2, and if this were *
instead of +, it would be 6 versus 5.  Still, as it stands, making
worst-case assumptions about the R3000 (that the nop gets left in), and
some best-case assumptions about the 29K, you get a 9:17 ratio for this
case, and 12:17 for y*z; if you do the single-precision case with the same
assumptions, you get a 6:12 ratio for x+y, and 8:12 for x*y.  So, we get:

		DP +	DP *	SP +	SP *
	R3000	9	12	6	8	(subtract 1 if the nop gets scheduled)
	29K	17	17	12	12

Also, it is interesting to see what would happen if you compare two
non-existent machines: an R3000* whose FP operation times increase to
match the 29K's, and a 29K* whose FP op times decrease to match the
R3000's:

	R3000*	13	15	10	10	An R3000 with 29K FP cycle times
	29K*	13	16	8	10	A 29K whose FP operations had
						R3000 cycle times

What does this mean?  Answer: statistically, not much, except that the
extra cycles of overhead getting data in and out would cancel the
advantage of decreasing the FP cycle times.
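The cycle tallies above can be written out as plain arithmetic; a minimal
sketch (the per-step counts are the assumptions stated in the post, not
measurements, and the function names are mine):

```c
/* R3000 double-add sequence: 4 lwc1's, 1 nop (which might be
 * scheduled away), 2 cycles for the add.d, 2 swc1's. */
int r3000_dadd_cycles(void)
{
    return 4 + 1 + 2 + 2;
}

/* 29K estimate from the post: 4 loads, 2 operand writes to the FPU,
 * 1 write to start the add-double, 6 cycles for the add itself,
 * 2 reads of the 64-bit result, 2 stores back to memory. */
int am29k_dadd_cycles(void)
{
    return 4 + 2 + 1 + 6 + 2 + 2;
}
```

These reproduce the 9 and 17 entries in the DP+ column of the table.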
A better comparison might be between R3000 and 29K*, which shows that even
with the same FP operation times, one is paying the following % cycle-count
penalties: 44%, 33%, 33%, 25%, at a minimum.  If you guess that you
schedule the nop away 70% of the time, you get:

		DP +	DP *	SP +	SP *
	R3000	8.3	11.3	5.3	7.3
	29K*	13	16	8	10
	% hit	57%	42%	51%	37%

Now: THE REAL PROOF IS IN RUNNING REAL PROGRAMS, THROUGH COMPILERS.
This is a microscopic example, and one should never believe in them too
much.  However, the general effect is to add at least 1 cycle for every FP
variable load/store, just by the difference in interface.  Having stood on
their heads to get things like 2-cycle adds, MIPS designers would roll in
their graves before adding a cycle to each variable load/store!  (Of
course, AMD and we aim at somewhat different markets, and each company has
its own tradeoffs, so this tradeoff is not inherently evil, and there are
certainly classes of programs where this is less of a problem.  Still...)
If I've misunderstood the 29K interface, somebody (Tim?) correct me.

>(P.S. Question to MIPS: I think you only need to back up one extra
>instruction on an exception if that instruction is a *taken* branch.
>Do you do it this way, or on all branches?)

I'm not sure what you mean.  When there is an exception, the Exception
Program Counter (EPC) points at the instruction that should be resumed to
restart execution.  If the exception occurred in the branch-delay slot,
the cause register's Branch Delay bit is set, so that the OS can analyze
those cases where the exception is caused by an instruction in the
BD-slot, and where you want to do something different, like emulating an
FP operation on a system that has no FP.  There is normally no difference
in handling taken or untaken branches; the only place you'd need to figure
that out is in the case where you want to go back to the user and skip the
instruction in the delay slot.  Then you figure out whether or not the
branch is being taken, and either take it, or not.
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
w-colinp@microsoft.UUCP (Colin Plumb) (12/01/88)
In article <9061@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>Actually, it doesn't work this way: the mult instruction picks up the values
>from the regular registers, i.e., one instruction both fetches the data and
>initiates the instruction.  The only "extra" instruction is the one that
>moves the result back.

I stand corrected.  Still, if I use the 29027 simply as an integer
multiplier, I can do exactly the same thing.

>Here's an example.  Maybe this is close in speed, or maybe not, or maybe
>I don't understand the 29K FPU interface well enough.  here's a small
>example (always be wary of small examples; I don't have time right now
>for anything more substantive):
>main() {
>	double x, y, z;
>	x = y + z;
>}

This is a *bit* heavy on the loads and stores - register allocation,
anyone?  While people do add up vectors (which is 1 access per add, not
3), the heaviest thing I can think of that's popular is dot products.  Now
if I'm allowed to put the 29027 into pipeline mode, it can eat 4 words of
data every 3 clocks, which is gonna tax anybody's memory system, but I'll
factor the time to do the f.p. op out of the comparison.

You need to get 4 words per step of the dot product from memory to the
floating-point chip.  That's 4 loads, either way, and an additional 2
stores for the 29027.  That's only 50% overhead, assuming all cache hits.
On the 29027, I can just leave the multiply-accumulate opcode in the
instruction register and keep feeding it operands, while on the MIPS chip
I have to issue add.d and mul.d, which is an extra 2 cycles of penalty it
pays.

But sigh, I'm losing track of the point of the argument.  If I try to code
up a dot-product loop, it gets worse, as I can unroll the pointer
incrementing on the MIPS chip by using the base-plus-offset addressing
mode, which the 29000 doesn't have.  Of course, if it's a small array (up
to 30 64-bit words each, or so), I'll just keep the whole thing in
registers on the 29000...
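For reference, the dot-product kernel being argued about is just this (a
generic C sketch, not code from either party's compiler):

```c
/* Each iteration moves two doubles (four 32-bit words) from memory
 * toward the FPU and performs one multiply-accumulate.  On the
 * 29027 path, those four words additionally cost two stores to
 * cross the coprocessor bus, which is the 50%-overhead figure
 * claimed above. */
double dot(const double *x, const double *y, int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```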
>Now: THE REAL PROOF IS IN RUNNING REAL PROGRAMS, THROUGH COMPILERS.
>This is a microscopic example, and one should never believe in them
>too much.  However, the general effect is to add at least 1 cycle
>for every FP variable load/store, just by the difference in interface.
>Having stood on their heads to get things like 2-cycle adds, MIPS designers
>would roll in their graves before adding a cycle to each variable load/store!

Actually, that's half a cycle on loads and a cycle on stores.  But you're
right, this is a silly sort of thing to benchmark.  How about some figures
on the frequencies of the various ops in floating-point programs, O great
benchmark oracle? :-)  Just how often do I need to load and store?

And don't forget that on the 29000, if it's just a local variable, I store
it in the register file/stack cache (guaranteed one cycle) and the actual
memory move may be obviated entirely.

Even without all this logic, I think I can safely say that for vector
operations, the memory->fpu->memory speed is essential, thus all the
tricky things they do in Crays, avoiding the two steps of mem->reg and
reg->fpu.  For all-register work, like Mandelbrot kernels, it doesn't
matter.  And in between, I dunno.  I still think it doesn't hurt *that*
bad.  What's the silicon cost for the coprocessor interface on the
R2000/R3000?

>>(P.S. Question to MIPS: I think you only need to back up one extra
>>instruction on an exception if that instruction is a *taken* branch.
>>Do you do it this way, or on all branches?)
>
>I'm not sure what you mean.

Quote from the R2000 architecture manual:

	[The Exception Program Counter register]
	This register contains the virtual address of the instruction
	that caused the exception.  When that instruction resides in a
	branch delay slot, the EPC register contains the virtual address
	of the immediately preceding Branch or Jump instruction.
What I'm wondering is, in the instruction sequence

	foo
	bar
	jump	(untaken)
	baz
	quux

where the jump is not taken, is "baz" considered to be in the jump's delay
slot?  I.e., if baz faults, will the EPC point to it, or to the jump?
Of course, if the jump *is* taken, then EPC will point to the jump, but
I'm not sure if a "branch delay slot" means the instruction after a
change-flow-of-control instruction, or after a change in the flow of
control.  I.e., is the labelling static or dynamic?  If dynamic, an
instruction emulator wouldn't have to recompute the condition; it would
know that the branch should be taken to find the correct return address.

(It's late... er, early.  Sorry if this could use a little polishing.)
--
-Colin (microsof!w-colinp@sun.com)
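The static/dynamic distinction matters to a trap handler roughly as
follows; a hedged sketch assuming the static interpretation (which Mashey
confirms later in the thread), with all names illustrative rather than
taken from any real kernel:

```c
#include <stdint.h>

/* Resume-address logic for a handler that has just emulated a
 * faulting instruction.  Under static labelling, the delay-slot
 * indication is set for the instruction after any branch/jump,
 * taken or not, so to *skip* a delay-slot instruction the handler
 * must evaluate the branch condition itself - the recomputation
 * Colin is asking about.  Assumes 4-byte instructions. */
uint32_t resume_pc(uint32_t epc, int in_delay_slot,
                   int branch_taken, uint32_t branch_target)
{
    if (!in_delay_slot)
        return epc + 4;       /* continue after the emulated instruction */
    if (branch_taken)
        return branch_target; /* EPC points at the branch; follow it */
    return epc + 8;           /* fall through past branch and delay slot */
}
```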
tim@crackle.amd.com (Tim Olson) (12/02/88)
In article <9061@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
| Here's an example.  Maybe this is close in speed, or maybe not, or maybe
| I don't understand the 29K FPU interface well enough.  here's a small
| example (always be wary of small examples; I don't have time right now
| for anything more substantive):
| main() {
|	double x, y, z;
|	x = y + z;
| }
|
| this generates (for an R3000), and assuming data & instrs in-cache:
| #   2		double x, y, z;
| #   3		x = y + z;
|	l.d	$f4, 8($sp)
|	l.d	$f6, 0($sp)
|	add.d	$f8, $f4, $f6
|	s.d	$f8, 16($sp)
| The l.d and s.d actually turn into pairs of loads & stores, and this
| sequence takes 9 cycles: 4 lwc1's, a nop (which might be scheduled away),
| 2 cycles for the add.d, and 2 swc1's.
| Assume a 29K with cache, so it has loads like the R3000's.
| As far as I can tell, a 29K would use 17:
| 4 load cycles (best case: in some cases, offset calculations, or use of
|   load multiple with count-setup would take another one or two)
| 2 writes (get the data over to the FPU)
| 1 write (start the add-double)
| 6 cycles (do the add-double)
| 2 reads (get the 64-bit result back)
| 2 stores (put the result back in memory; again assume no offset
|   calculations)

No, local doubles are kept in the Am29000 register file, so no loads/stores
will occur to the memory stack.  The Am29000 has two methods of generating
floating-point code: either emitting floating-point instructions (which
trap in the current Am29000 implementation) or emitting inline '027 code
directly.  The fp instruction code for:

	double
	g(double x, double y)
	{
		return x+y;
	}

(essentially the same as your test case, but I had to revise it to make it
emit *any* code) is:

	jmpi	lr0
	dadd	gr96,lr4,lr2

(i.e., return, adding the incoming parameters (lr2-3, lr4-5) into the
return-result registers (gr96, gr97) in the delay slot of the return).
The in-line '027 code for this looks like:

	const	gr96,1		; (0x1)
	consth	gr96,65536	; (0x10000)
	store	1,38,gr96,gr96
	store	1,32,lr4,lr5
	store	1,97,lr2,lr3
	load	1,1,gr97,gr96
	load	1,0,gr96,gr96

The first two instructions "build" the '027 instruction that is to be
performed (in this case, a DADD).  The first store sends that instruction
to the '027 coprocessor.  The second store transfers the 'y' parameter to
the coprocessor in a single cycle, and the third store transfers the 'x'
parameter, as well as starting the coprocessor operation.  The Am29000
then stalls on the load of the result lsb's (5 cycles), then grabs the
msb's and returns.  This takes 12 cycles total, counting the building of
the add instruction (which would be cached in a local register if it were
to be used again).

| Now, the biggest difference is the 6 cycles versus 2, and if this were
| * instead of +, it would be 6 versus 5.  Still, as it stands, making worst
| case assumptions about the R3000 (that the nop gets left in), and some
| best-case assumptions about the 29K, you get a 9:17 ratio for this case,
| a 12:17 for y*z; if you do the single-precision case with the same
| assumptions, you get a 6:12 ratio for x+y, and 8:12 for x*y.
| So, we get:
|		DP +	DP *	SP +	SP *
|	R3000	9	12	6	8	(subtract 1 if the nop gets scheduled)
|	29K	17	17	12	12

Nope, it is:

		DP +	DP *	SP +	SP *
	R3000	9	12	6	8
	29K	12	12	9	9

This again assumes that the '027 instruction is not reused (which it would
be in "real" code).  If it were reused, the counts would drop by 2 cycles.

| Now: THE REAL PROOF IS IN RUNNING REAL PROGRAMS, THRU COMPILERS.

Agreed.
--
	Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)
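Tim's 12-cycle count decomposes as follows; a sketch of the arithmetic
only, with the per-step costs taken from his description (the function
name and the cached/uncached switch are mine):

```c
/* '027 double-add sequence: 2 instructions to build the DADD word
 * (skipped if it is already cached in a local register), 3 stores
 * (the op word plus two 64-bit operand transfers), a 5-cycle stall
 * waiting on the result lsb's, and 2 loads to read the result back. */
int am29k_027_dadd_cycles(int op_word_cached)
{
    int build  = op_word_cached ? 0 : 2;  /* const + consth */
    int stores = 3;
    int stall  = 5;
    int loads  = 2;
    return build + stores + stall + loads;
}
```

With the op word cached, the count drops by the 2 cycles Tim mentions.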
mash@mips.COM (John Mashey) (12/02/88)
In article <1054@microsoft.UUCP> w-colinp@microsoft.UUCP (Colin Plumb) writes:
>But sigh, I'm losing track of the point of the argument.
>...  Of course, if it's a small array
>(up to 30 64-bit words each, or so), I'll just keep the whole thing in
>registers on the 29000...

It would be very interesting to see whether realistic high-level-language
programs end up allocating floating-point arrays in the registers...
especially given FORTRAN call-by-reference.....

>>Now: THE REAL PROOF IS IN RUNNING REAL PROGRAMS, THROUGH COMPILERS.
>Actually, that's half a cycle to loads and a cycle to stores.  But you're
>right, this is a silly sort of thing to benchmark.  How about some
>figures on the frequencies of the various ops in floating point programs,
>O great benchmark oracle? :-)  Just how often do I need to load and store?

Here's a few numbers, real quick: % of instructions that are FP load/store:

			load	store
	spice		15.2%	8.8%	(scalar)
	doduc		26.1%	8%	(scalar)
	linpack, 64bit	34.5%	18.6%	(vector)

>And don't forget that on the 29000, if it's just a local variable, I
>store it in the register file/stack cache (guaranteed one cycle) and
>the actual memory move may be obviated entirely.
>
>Even without all this logic, I think I can safely say that for vector
>operations, the memory->fpu->memory speed is essential, thus all the
>tricky things they do in Crays, avoiding the two steps of mem->reg
>and reg->fpu.  For all-register work, like Mandelbrot kernels, it
>doesn't matter.  And in between, I dunno.  I still think it doesn't
>hurt *that* bad.  What's the silicon cost for the coprocessor interface
>on the R2000/R3000?

There are algorithms where FP values stick in the registers.  Many very
scalar real programs do many loads & stores that simply will not go away
with zillions of on-chip registers [unless they're stack-cached like
CRISP's, where the registers have addresses just like memory].
Even then, it appears that typical allocatable-on-the-stack arrays blow
away any reasonable on-chip register caches, for a while.

>Quote from the R2000 architecture manual:
>
>	[The Exception Program Counter register]
>	This register contains the virtual address of the instruction
>	that caused the exception.  When that instruction resides in a
>	branch delay slot, the EPC register contains the virtual address
>	of the immediately preceding Branch or Jump instruction.
>
>What I'm wondering is, in the instruction sequence
>
>	foo
>	bar
>	jump	(untaken)
>	baz
>	quux
>
>where the jump is not taken, is "baz" considered to be in the jump's delay
>slot?  I.e., if baz faults, will the EPC point to it, or to the jump?
>Of course, if the jump *is* taken, then EPC will point to the jump, but
>I'm not sure if a "branch delay slot" is the instruction after a
>change-flow-of-control instruction, or a change in the flow of control.
>I.e., is the labelling static or dynamic?  If dynamic, an instruction
>emulator wouldn't have to recompute the condition; it would know that the
>branch should be taken to find the correct return address.

It's static, i.e., it's irrelevant whether the jump is taken or not.
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
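The load/store frequencies Mashey quotes translate into an overall hit
roughly as follows; a back-of-the-envelope sketch assuming the interface
adds one extra cycle to each FP load and store (the base CPI here is a
free parameter of mine, not a number from the thread):

```c
/* Ratio of new to old cycle counts when a fraction fp_ls of all
 * executed instructions are FP loads/stores and each one picks up
 * one extra cycle crossing the coprocessor interface. */
double interface_slowdown(double base_cpi, double fp_ls)
{
    return (base_cpi + fp_ls) / base_cpi;
}
```

For linpack's 34.5% loads + 18.6% stores at an assumed base CPI of 1.0,
that would be roughly a 1.53x cycle count, which is the scale of penalty
the "at least 1 cycle per FP load/store" argument is pointing at.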
mash@mips.COM (John Mashey) (12/02/88)
In article <23656@amdcad.AMD.COM> tim@crackle.amd.com (Tim Olson) writes:
>In article <9061@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>| Here's an example.  Maybe this is close in speed, or maybe not, or maybe
>| I don't understand the 29K FPU interface well enough.  here's a small
>| example (always be wary of small examples; I don't have time right now
>| for anything more substantive):
>| main() {
>|	double x, y, z;
>|	x = y + z;
>| }

>No, local doubles are kept in the Am29000 register file, so no
>loads/stores will occur to the memory stack.  The Am29000 has two

Sorry, I was using a simple example, without optimization, to create code
typical of references to variables in memory.  The optimizer of course
erases this code.

>methods of generating floating-point code, either emitting floating
>point instructions (which trap in the current Am29000 implementation) or
>emitting inline '027 code directly.  The fp instruction code for:
>double
>g(double x, double y)
>{
>	return x+y;
>}
>
>(essentially the same as your test case, but I had to revise it to make
>it emit *any* code) is:
>
>	jmpi	lr0
>	dadd	gr96,lr4,lr2

As I understand it, this is a trap to a routine that ends up issuing
something like the code in your next example....  The R3000 code for this
routine is of course:

	[y.c:   3] 0x0:	03e00008	jr	ra
	[y.c:   3] 0x4:	462e6000	add.d	f0,f12,f14

>The in-line '027 code for this looks like:
>
>	const	gr96,1		; (0x1)
>	consth	gr96,65536	; (0x10000)
>	store	1,38,gr96,gr96
>	store	1,32,lr4,lr5
>	store	1,97,lr2,lr3
>	load	1,1,gr97,gr96
>	load	1,0,gr96,gr96

>Nope, it is:
>
>		DP +	DP *	SP +	SP *
>	R3000	9	12	6	8
>	29K	12	12	9	9
>
>This again assumes that the '027 instruction is not reused (which it
>would be in "real" code).  If it were reused, the counts would drop by 2
>cycles.
Given the same assumptions (variables already in registers, and we have
enough of them to make that happen occasionally also), but giving the 29K
the benefit of the doubt on the constants, we end up with:

		DP +	DP *	SP +	SP *
	R3000	2	5	2	4
	29K	10	10	7	7

This, of course, is one of the most favorable-to-MIPS comparisons, which
is why I didn't use it in the first case.  Needless to say, 29K code
generators will want to allocate FP variables to the FP unit whenever
possible, to avoid this kind of hit.

>| Now: THE REAL PROOF IS IN RUNNING REAL PROGRAMS, THRU COMPILERS.
>Agreed.
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
tim@crackle.amd.com (Tim Olson) (12/03/88)
In article <9139@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
| Given the same assumptions (variables already in registers,
| and we have enough of them to make that happen occasionally also),
| but giving the 29K the benefit of the doubt on the constants, we end
| up with:
|		DP +	DP *	SP +	SP *
|	R3000	2	5	2	4
|	29K	10	10	7	7
| This, of course, is one of the most favorable-to-MIPS comparisons,
| which is why I didn't use it in the first case.  Needless to say,
| 29K code generators will want to allocate FP variables to the FP unit
| whenever possible, to avoid this kind of hit.

Of course.  I was trying to make a fair comparison.  Our local scalar
stack *is* in the local register file, which is what we were comparing.
If you want to compare operations where the operands are already in the
FP registers, then:

		DP +	DP *	SP +	SP *
	R3000	2	5	2	4
	29K	5	5	5	5

(assuming non-pipelined '027 operation).
--
	Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)