firth@sei.cmu.edu.UUCP (05/06/87)
I've just finished reading about the AM29000 processor. Naturally, I have lots of questions, but here's one I'd appreciate a rapid answer to: The machine has comparison instructions that yield a Boolean result in a register. The processor description says that TRUE is represented by a 1 in the MOST significant bit. Is this a typo? Thanks in advance
brian@neptune.AMD.COM (Brian McMinn) (05/06/87)
In article <1270@aw.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes: > [regarding Am29000] > > The machine has comparison instructions that > yield a Boolean result in a register. The > processor description says that TRUE is > represented by a 1 in the MOST significant > bit. Is this a typo? No, this is not a typo. All negative numbers are regarded as Boolean TRUE. Although this does not correspond with the "C" language definition of TRUE, it did gain some speed in the hardware. Since most booleans are produced by compare instructions (things like while(i--) to the contrary, but look at JMPFDEC before you complain too loudly), we could define a boolean to be whatever was convenient. In a hardware comparator, the "result" of the compare is generated from the carry out of the two most significant bits. Rather than routing the result bit back to the lsb or propogating it to all 32 bits, we just stuffed it into the msb. Another nice side effect is that we only need to test ONE bit of the condition register on a jump. -- Brian McMinn 1-(512)-462-5389 Advanced Micro Devices domainLand: brian@neptune.AMD.COM Austin, Texas bangLand: ...!amdcad!neptune!brian
tim@amdcad.UUCP (05/06/87)
In article <1270@aw.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes: +----- |I've just finished reading about the AM29000 |processor. Naturally, I have lots of questions, |but here's one I'd appreciate a rapid answer to: | | The machine has comparison instructions that | yield a Boolean result in a register. The | processor description says that TRUE is | represented by a 1 in the MOST significant | bit. Is this a typo? +----- No, it is not a typo. One of the restrictions to the Boolean representation is that it had to be a single bit to allow quick determination of the target of a conditional jump. Given this, it could either be placed in the most-significant bit (msb) or the least-significant bit (lsb). The code generated for either of these representations is, in general, equivalent in terms of code space and cycles, except for two cases: the msb representation has the benefit of a quick "jump on negative", while the lsb representation "looks like a C Boolean", i.e. it has the correct value when a Boolean is assigned to a variable. For example, the code sequences that are generated for these cases are: if (x < 0) ..... msb lsb --------------------------- -------------------------- jmpt x_reg,if_failed cplt bool_reg,x_reg,0 jmpf bool_reg,if_failed x = (y < z); msb lsb ---------------------------- -------------------------- cplt bool_reg,y_reg,z_reg cplt x_reg,y_reg,z_reg srl x_reg,bool_reg,31 Since the first code sequence is *much* more prevalent than the second in typical C code, it is better to "optimize" the first sequence at the expense of the second. -- Tim Olson Advanced Micro Devices Processor Strategic Development
lm@cottage.UUCP (05/07/87)
In article <138@neptune.AMD.COM> brian@neptune.AMD.COM (Brian McMinn) writes: > >No, this is not a typo. All negative numbers are regarded as Boolean >TRUE. Although this does not correspond with the "C" language >definition of TRUE, it did gain some speed in the hardware. Since Could you please show the assembly language generated for the following, { int x; if (x) <stmt1>; else <stmt2>; if (!x) <stmt1>; else <stmt2>; } And tell me which will be executed for x = -1, 0, 1. Thanks much... --- Larry McVoy lm@cottage.wisc.edu or uwvax!mcvoy "What a wonderful world it is that has girls in it!" -L.L.
tim@amdcad.AMD.COM (Tim Olson) (05/07/87)
In article <3540@spool.WISC.EDU>, lm@cottage.WISC.EDU (Larry McVoy) writes: > In article <138@neptune.AMD.COM> brian@neptune.AMD.COM (Brian McMinn) writes: > > > >No, this is not a typo. All negative numbers are regarded as Boolean > >TRUE. Although this does not correspond with the "C" language > >definition of TRUE, it did gain some speed in the hardware. Since > > Could you please show the assembly language generated for the following, > > { int x; > > if (x) > <stmt1>; > else > <stmt2>; > > if (!x) > <stmt1>; > else > <stmt2>; > } > > And tell me which will be executed for x = -1, 0, 1. I think you may be confusing Am29000 Booleans with 'C' Booleans. They are two different animals, and when code is generated from C programs, comparison instructions (usually) must be used to create the correct Am29000 Boolean. For example, the above code fragment compiles to: --------------------------------------- .global _main _main: .align ; if (x) cpneq gr72,gr70,0 <-- sets the msb of gr72 if gr70 != 0 jmpf gr72,$16 <-- jumps if the msb of gr72 is not set nop ; y = 0; jmp $17 <-- remember those delayed branches! const gr71,0 $16: ; else ; y = 1; const gr71,1 $17: ; if (!x) cpeq gr72,gr70,0 jmpf gr72,$18 nop ; y = 0; jmp $19 const gr71,0 $18: ; else ; y = 1; const gr71,1 $19: jmpi lr00 nop ---------------------------------------------- The code execution would be just what you expect (and what is required of C) for all values of x. -- Tim Olson Advanced Micro Devices Processor Strategic Planning
socha@drivax.UUCP (05/08/87)
In article <16587@amdcad.AMD.COM>, tim@amdcad.AMD.COM (Tim Olson) writes: >In article <3540@spool.WISC.EDU>, lm@cottage.WISC.EDU (Larry McVoy) writes: >> Could you please show the assembly language generated for the following, >> >> { int x; >> >> if (x) >> <stmt1>; >> else >> <stmt2>; >> >> if (!x) >> <stmt1>; >> else >> <stmt2>; >> } Well, to add some fuel to the fire about how a machine's performance can depend on the smarts in the compiler used, I hand modified the example code given in the referenced article. The changed code is shown below. The changes were limited to taking better advantage of the delayed branching. I only re-arranged and removed some code. --------------------------------------- .global _main _main: .align ; if (x) cpneq gr72,gr70,0 <-- sets the msb of gr72 if gr70 != 0 jmpf gr72,$16 <-- jumps if the msb of gr72 is not set ; else ; y = 1; const gr71,1 <-- waisted assign on fall through case ; y = 0; const gr71,0 <-- remember those delayed branches! $16: ; if (!x) cpeq gr72,gr70,0 jmpf gr72,$18 ; else ; y = 1; const gr71,1 ; THEN y = 0; const gr71,0 $18: jmpi lr00 nop > -- Tim Olson > Advanced Micro Devices The savings were: nop jmp $17 <-- remember those delayed branches! nop jmp $19 And that is not so shabby for such a small programme especially if the hardware is smart enough to recognize that the jumped to address is probably in the pipeline. Now, don't try to say that the simple case of 1 instruction after the branch doesn't occur (or occurs rarely). I've seen it often enough to make me think that this is a worthwhile optimization by the compiler. BTW I like the fact that non-arithmetic instructions DO NOT change (affect) the condition code (status) register. This can develop other optimizations. But, I can't understand why the following sequence is needed for multiply. mul gr72,gr71,#0 ;step 1 repeat 30 ; (not in assembler? my addition) mul gr72,gr71,gr72 ;steps 2 to 31 (that's right, 30 words!) mull gr72,gr71,gr72 ; step 32 The above from Am29000 user's manual page 7-10 for a 32 bit integer multiply. It takes 32 instructions to do it (similar for divide). Now, I can understand the advantages of a RISC processor but this is going to far. Should I put 32 instructions in the processing stream each time I need to multiply two numbers? Should I use a subroutine? Seems to me a perfect time/space tradeoff decision. But, what are the costs? -- UUCP:...!amdahl!drivax!socha WAT Iron'75 "Everything should be made as simple as possible but not simpler." A. Einstein
tim@amdcad.AMD.COM (Tim Olson) (05/09/87)
In article <1512@drivax.UUCP> socha@drivax.UUCP (Henri J. Socha (x6251)) writes: +----- | Well, to add some fuel to the fire about how a machine's performance can | depend on the smarts in the compiler used, I hand modified the example code | given in the referenced article. The changed code is shown below. | The changes were limited to taking better advantage of the delayed branching. | I only re-arranged and removed some code. ... changed code +----- | The savings were: | nop | jmp $17 <-- remember those delayed branches! | nop | jmp $19 +----- The optimizations you performed were valid. Our internal compiler is conservative in the instructions it selects for placement after a delayed branch -- it just didn't recognize the case you found. The standard "commercial compiler" for the Am29000 should be able to perform this, though. +----- | BTW I like the fact that non-arithmetic instructions DO NOT change (affect) | the condition code (status) register. This can develop other optimizations. +----- Actually, the only condition codes used by the processor are the carry (C) and the divide flag (DF). The others are there for "completeness" and to enhance performance in emulations. ... discussion of multiply taking 32 steps +----- | Now, I can understand the advantages of a RISC processor but this is going | to far. Should I put 32 instructions in the processing stream each time | I need to multiply two numbers? Should I use a subroutine? | Seems to me a perfect time/space tradeoff decision. But, what are the costs? +----- Yes, it is a time/space tradeoff. Note, however, that the Am29000 has a relatively low overhead cost associated with a subroutine call. Our C runtime library code we use for simulations has a multiply subroutine which is a variant of the one shown in the user's manual. It performs a quick check at the beginning to see how many steps are really required, then performs only that many steps. This has been shown to reduce the entire cost of a runtime multiply (including procedure-call overhead) to around 25 cycles (when used on a range of multiply-intensive code). -- Tim Olson Advanced Micro Devices
henry@utzoo.UUCP (Henry Spencer) (05/10/87)
> The above from Am29000 user's manual page 7-10 for a 32 bit integer multiply. > It takes 32 instructions to do it (similar for divide). > > Now, I can understand the advantages of a RISC processor but this is going > to far. Should I put 32 instructions in the processing stream each time > I need to multiply two numbers? Should I use a subroutine? Many multiplies (at least in non-numerical code) are by small integer constants. For those, a sensible compiler will generate whatever small number of steps is actually needed to do the particular multiply, as a function of the binary representation of the constant. -- "If you want PL/I, you know Henry Spencer @ U of Toronto Zoology where to find it." -- DMR {allegra,ihnp4,decvax,pyramid}!utzoo!henry
tim@amdcad.AMD.COM (Tim Olson) (05/11/87)
In article <8012@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes: +----- | Many multiplies (at least in non-numerical code) are by small integer | constants. For those, a sensible compiler will generate whatever small | number of steps is actually needed to do the particular multiply, as a | function of the binary representation of the constant. +----- If the multiply is by a constant known at compile time, it is usually better to perform the multiply as a series of shifts and adds, rather than a sequence of multiply-steps, since (on the average) half of the significant multiplier bits are a '1'. For example, the following code int x; main() { x = x*57; } Compiles to: const lr05,_x+(0) consth lr05,(_x+(0))>>16 load 16,lr02,lr05 sll lr04,lr02,3 add lr03,lr02,lr04 sll lr04,lr02,4 add lr03,lr03,lr04 sll lr04,lr02,5 add lr03,lr03,lr04 store 16,lr03,lr05 for a multiply time of 6 cycles. The equivalent multiply-step code would be: const lr05,_x+(0) consth lr05,(_x+(0))>>16 load 16,lr02,lr05 mtsr Q,lr02 const lr03,57 mul lr02,lr03,0 mul lr02,lr03,lr02 mul lr02,lr03,lr02 mul lr02,lr03,lr02 mul lr02,lr03,lr02 mul lr02,lr03,lr02 mull lr02,lr03,lr02 mtsrim FS,7 mfsr gr60,Q extract lr02,lr02,gr60 ; extract the result store 16,lr02,lr05 for a multiply time of 12 cycles. -- Tim Olson Advanced Micro Devices
hansen@mips.UUCP (05/12/87)
In article <16640@amdcad.AMD.COM>, tim@amdcad.AMD.COM (Tim Olson) writes: > int x; > > main() > { > > x = x*57; > } > > Compiles to: <code deleted> > for a multiply time of 6 cycles. The MIPS compiler system does the same operation in 4 cycles, by performing subtracts as well as adds. The same code above compiles (unoptimized) to: main: 0x0: lw t6,0(gp) 0x4: nop 0x8: sll t7,t6,3 0xc: subu t7,t7,t6 0x10: sll t7,t7,3 0x14: addu t7,t7,t6 0x18: jr ra 0x1c: sw t7,0(gp) The code above uses the fact that 57 = (8-1)*8 + 1. For the other AMD example, the MIPS compiler system can do all the branches directly, without the compares, and places the single-statement paths directly into the branch delay slots as was demonstated in the hand-tweaked version. In fact, I had to change the code to generate two different results and combine them, because otherwise the optimizer would remove the first set of statements (they generate a dead value). The code vanishes entirely if no value is returned by the routine. The source code: int bool(x) int x; { int y, z; if (x) y=0; else y=1; if (!x) z=0; else z=1; return(y+z); } Compiles to: bool: 0x0: beq a0,zero,0x14 0x4: li v1,1 0x8: b 0x14 0xc: move v1,zero 0x10: li v1,1 0x14: bne a0,zero,0x28 0x18: li a0,1 0x1c: b 0x28 0x20: move a0,zero 0x24: li a0,1 0x28: jr ra 0x2c: addu v0,v1,a0 It turns out that the movement of some code into branch delay slots occurs as a peephole optimization after the code has been reorganized once, so the same instruction appears at address 0x4 and 0x10, though the instruction at address 0x10 is never used. (The same thing occurs at 0x18 and 0x24.) If a post-pass was employed to clean up the code, the unconditional branches could also be removed (addresses 0x8 and 0x1c). We don't bother because, in practice, single-instruction branch paths are rare, and the unused instructions don't cause any serious problems. -- Craig Hansen Manager, Architecture Development MIPS Computer Systems, Inc. ...decwrl!mips!hansen
henry@utzoo.UUCP (Henry Spencer) (05/13/87)
> If the multiply is by a constant known at compile time, it is usually > better to perform the multiply as a series of shifts and adds, rather > than a sequence of multiply-steps... That's what I meant, actually; I just didn't say it very well. -- "The average nutritional value Henry Spencer @ U of Toronto Zoology of promises is roughly zero." {allegra,ihnp4,decvax,pyramid}!utzoo!henry