[comp.arch] AM29000 Booleans

firth@sei.cmu.edu.UUCP (05/06/87)

I've just finished reading about the AM29000
processor.  Naturally, I have lots of questions,
but here's one I'd appreciate a rapid answer to:

  The machine has comparison instructions that
  yield a Boolean result in a register.  The
  processor description says that TRUE is
  represented by a 1 in the MOST significant
  bit.  Is this a typo?

Thanks in advance

brian@neptune.AMD.COM (Brian McMinn) (05/06/87)

In article <1270@aw.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
> [regarding Am29000]
>
>  The machine has comparison instructions that
>  yield a Boolean result in a register.  The
>  processor description says that TRUE is
>  represented by a 1 in the MOST significant
>  bit.  Is this a typo?

No, this is not a typo.  All negative numbers are regarded as Boolean
TRUE.  Although this does not correspond with the "C" language
definition of TRUE, it did gain some speed in the hardware.  Since
most booleans are produced by compare instructions (things like
while(i--) to the contrary, but look at JMPFDEC before you complain
too loudly), we could define a boolean to be whatever was convenient.

In a hardware comparator, the "result" of the compare is generated
from the carry out of the two most significant bits.  Rather than
routing the result bit back to the lsb or propogating it to all 32
bits, we just stuffed it into the msb.  Another nice side effect is
that we only need to test ONE bit of the condition register on a
jump.
-- 
	Brian McMinn		1-(512)-462-5389
	Advanced Micro Devices	domainLand:  brian@neptune.AMD.COM
	Austin, Texas		bangLand:   ...!amdcad!neptune!brian

tim@amdcad.UUCP (05/06/87)

In article <1270@aw.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
+-----
|I've just finished reading about the AM29000
|processor.  Naturally, I have lots of questions,
|but here's one I'd appreciate a rapid answer to:
|
|  The machine has comparison instructions that
|  yield a Boolean result in a register.  The
|  processor description says that TRUE is
|  represented by a 1 in the MOST significant
|  bit.  Is this a typo?
+-----

No, it is not a typo.  One of the restrictions to the Boolean
representation is that it had to be a single bit to allow quick
determination of the target of a conditional jump.  Given this, it could
either be placed in the most-significant bit (msb) or the
least-significant bit (lsb).  The code generated for either of these
representations is, in general, equivalent in terms of code space and
cycles, except for two cases: the msb representation has the benefit of
a quick "jump on negative", while the lsb representation "looks like a C
Boolean", i.e.  it has the correct value when a Boolean is assigned to a
variable. 

For example, the code sequences that are generated for these cases are:

			if (x < 0) .....

	msb					lsb
	---------------------------		--------------------------
	jmpt	x_reg,if_failed			cplt	bool_reg,x_reg,0
						jmpf	bool_reg,if_failed



			x = (y < z);

	msb					lsb
	----------------------------		--------------------------
	cplt	bool_reg,y_reg,z_reg		cplt	x_reg,y_reg,z_reg
	srl	x_reg,bool_reg,31


Since the first code sequence is *much* more prevalent than the second in
typical C code, it is better to "optimize" the first sequence at the
expense of the second.


	-- Tim Olson
	Advanced Micro Devices
	Processor Strategic Development

lm@cottage.UUCP (05/07/87)

In article <138@neptune.AMD.COM> brian@neptune.AMD.COM (Brian McMinn) writes:
>
>No, this is not a typo.  All negative numbers are regarded as Boolean
>TRUE.  Although this does not correspond with the "C" language
>definition of TRUE, it did gain some speed in the hardware.  Since

Could you please show the assembly language generated for the following,

{	int x;
	
	if (x)
	    <stmt1>;
	else
	    <stmt2>;

	if (!x)
	    <stmt1>;
	else
	    <stmt2>;
}

And tell me which will be executed for x = -1, 0, 1.

Thanks much...
---
Larry McVoy 	        lm@cottage.wisc.edu  or  uwvax!mcvoy

"What a wonderful world it is that has girls in it!"  -L.L.

tim@amdcad.AMD.COM (Tim Olson) (05/07/87)

In article <3540@spool.WISC.EDU>, lm@cottage.WISC.EDU (Larry McVoy) writes:
> In article <138@neptune.AMD.COM> brian@neptune.AMD.COM (Brian McMinn) writes:
> >
> >No, this is not a typo.  All negative numbers are regarded as Boolean
> >TRUE.  Although this does not correspond with the "C" language
> >definition of TRUE, it did gain some speed in the hardware.  Since
> 
> Could you please show the assembly language generated for the following,
> 
> {	int x;
> 	
> 	if (x)
> 	    <stmt1>;
> 	else
> 	    <stmt2>;
> 
> 	if (!x)
> 	    <stmt1>;
> 	else
> 	    <stmt2>;
> }
> 
> And tell me which will be executed for x = -1, 0, 1.

I think you may be confusing Am29000 Booleans with 'C' Booleans.  They are
two different animals, and when code is generated from C programs, comparison
instructions (usually) must be used to create the correct Am29000 Boolean.
For example, the above code fragment compiles to:

---------------------------------------
	.global	_main
_main:
	.align

;	if (x)
	cpneq	gr72,gr70,0		<-- sets the msb of gr72 if gr70 != 0
	jmpf	gr72,$16		<-- jumps if the msb of gr72 is not set
	nop

;	    y = 0;
	jmp	$17			<-- remember those delayed branches!
	const	gr71,0
$16:
;	else
;	    y = 1;
	const	gr71,1
$17:

;	if (!x)
	cpeq	gr72,gr70,0
	jmpf	gr72,$18
	nop

;	    y = 0;
	jmp	$19
	const	gr71,0
$18:
;	else
;	    y = 1;
	const	gr71,1
$19:
	jmpi	lr00
	nop
----------------------------------------------
The code execution would be just what you expect (and what is required of C)
for all values of x.


	-- Tim Olson
	Advanced Micro Devices
	Processor Strategic Planning

socha@drivax.UUCP (05/08/87)

In article <16587@amdcad.AMD.COM>, tim@amdcad.AMD.COM (Tim Olson) writes:
>In article <3540@spool.WISC.EDU>, lm@cottage.WISC.EDU (Larry McVoy) writes:
 
>> Could you please show the assembly language generated for the following,
>> 
>> {	int x;
>> 	
>> 	if (x)
>> 	    <stmt1>;
>> 	else
>> 	    <stmt2>;
>> 
>> 	if (!x)
>> 	    <stmt1>;
>> 	else
>> 	    <stmt2>;
>> }

Well, to add some fuel to the fire about how a machine's performance can
depend on the smarts in the compiler used, I hand modified the example code
given in the referenced article.  The changed code is shown below.
The changes were limited to taking better advantage of the delayed branching.
I only re-arranged and removed some code.

---------------------------------------
	.global	_main
_main:
	.align

;	if (x)
	cpneq	gr72,gr70,0		<-- sets the msb of gr72 if gr70 != 0
	jmpf	gr72,$16		<-- jumps if the msb of gr72 is not set
;	else
;	    y = 1;
	const	gr71,1			<-- waisted assign on fall through case

;	    y = 0;
	const	gr71,0			<-- remember those delayed branches!
$16:
;	if (!x)
	cpeq	gr72,gr70,0
	jmpf	gr72,$18
;	else
;	    y = 1;
	const	gr71,1

; THEN	    y = 0;
	const	gr71,0
$18:
	jmpi	lr00
	nop
>	-- Tim Olson
>	Advanced Micro Devices

The savings were:
	nop
	jmp	$17			<-- remember those delayed branches!
	nop
	jmp	$19

And that is not so shabby for such a small programme especially if the hardware
is smart enough to recognize that the jumped to address is probably in the
pipeline.

Now, don't try to say that the simple case of 1 instruction after the branch
doesn't occur (or occurs rarely).  I've seen it often enough to make me
think that this is a worthwhile optimization by the compiler.

BTW I like the fact that non-arithmetic instructions DO NOT change (affect)
the condition code (status) register. This can develop other optimizations.
But, I can't understand why the following sequence is needed for multiply.

	mul	gr72,gr71,#0		;step 1
	repeat	30			; (not in assembler? my addition)
	mul	gr72,gr71,gr72		;steps 2 to 31 (that's right, 30 words!)

	mull	gr72,gr71,gr72		; step 32

The above from Am29000 user's manual page 7-10 for a 32 bit integer multiply.
It takes 32 instructions to do it  (similar for divide).

Now, I can understand the advantages of a RISC processor but this is going
to far.  Should I put 32 instructions in the processing stream each time
I need to multiply two numbers?  Should I use a subroutine?
Seems to me a perfect  time/space tradeoff decision.  But, what are the costs?

-- 
UUCP:...!amdahl!drivax!socha                                      WAT Iron'75
"Everything should be made as simple as possible but not simpler."  A. Einstein

tim@amdcad.AMD.COM (Tim Olson) (05/09/87)

In article <1512@drivax.UUCP> socha@drivax.UUCP (Henri J. Socha (x6251)) writes:

+-----
| Well, to add some fuel to the fire about how a machine's performance can
| depend on the smarts in the compiler used, I hand modified the example code
| given in the referenced article.  The changed code is shown below.
| The changes were limited to taking better advantage of the delayed branching.
| I only re-arranged and removed some code.

	... changed code
+-----
| The savings were:
| 	nop
| 	jmp	$17			<-- remember those delayed branches!
| 	nop
| 	jmp	$19
+-----
The optimizations you performed were valid.  Our internal compiler is
conservative in the instructions it selects for placement after a
delayed branch -- it just didn't recognize the case you found.  The
standard "commercial compiler" for the Am29000 should be able to perform
this, though.
+-----
| BTW I like the fact that non-arithmetic instructions DO NOT change (affect)
| the condition code (status) register. This can develop other optimizations.
+-----
Actually, the only condition codes used by the processor are the carry
(C) and the divide flag (DF).  The others are there for "completeness"
and to enhance performance in emulations.

	... discussion of multiply taking 32 steps

+-----
| Now, I can understand the advantages of a RISC processor but this is going
| to far.  Should I put 32 instructions in the processing stream each time
| I need to multiply two numbers?  Should I use a subroutine?
| Seems to me a perfect  time/space tradeoff decision.  But, what are the costs?
+-----
Yes, it is a time/space tradeoff.  Note, however, that the Am29000 has a
relatively low overhead cost associated with a subroutine call.  Our C
runtime library code we use for simulations has a multiply subroutine
which is a variant of the one shown in the user's manual.  It performs a
quick check at the beginning to see how many steps are really required,
then performs only that many steps.  This has been shown to reduce the
entire cost of a runtime multiply (including procedure-call overhead) to
around 25 cycles (when used on a range of multiply-intensive code).

	-- Tim Olson
	Advanced Micro Devices
	

henry@utzoo.UUCP (Henry Spencer) (05/10/87)

> The above from Am29000 user's manual page 7-10 for a 32 bit integer multiply.
> It takes 32 instructions to do it  (similar for divide).
> 
> Now, I can understand the advantages of a RISC processor but this is going
> to far.  Should I put 32 instructions in the processing stream each time
> I need to multiply two numbers?  Should I use a subroutine?

Many multiplies (at least in non-numerical code) are by small integer
constants.  For those, a sensible compiler will generate whatever small
number of steps is actually needed to do the particular multiply, as a
function of the binary representation of the constant.
-- 
"If you want PL/I, you know       Henry Spencer @ U of Toronto Zoology
where to find it." -- DMR         {allegra,ihnp4,decvax,pyramid}!utzoo!henry

tim@amdcad.AMD.COM (Tim Olson) (05/11/87)

In article <8012@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
+-----
| Many multiplies (at least in non-numerical code) are by small integer
| constants.  For those, a sensible compiler will generate whatever small
| number of steps is actually needed to do the particular multiply, as a
| function of the binary representation of the constant.
+-----
If the multiply is by a constant known at compile time, it is usually
better to perform the multiply as a series of shifts and adds, rather
than a sequence of multiply-steps, since (on the average) half of the
significant multiplier bits are a '1'.  For example, the following code

	int x;

	main()
	{

		x = x*57;
	}

Compiles to:
	const	lr05,_x+(0)
	consth	lr05,(_x+(0))>>16
	load	16,lr02,lr05
	sll	lr04,lr02,3
	add	lr03,lr02,lr04
	sll	lr04,lr02,4
	add	lr03,lr03,lr04
	sll	lr04,lr02,5
	add	lr03,lr03,lr04
	store	16,lr03,lr05

for a multiply time of 6 cycles.  The equivalent multiply-step code
would be:

	const	lr05,_x+(0)
	consth	lr05,(_x+(0))>>16
	load	16,lr02,lr05
	mtsr	Q,lr02
	const	lr03,57
	mul	lr02,lr03,0
	mul	lr02,lr03,lr02
	mul	lr02,lr03,lr02
	mul	lr02,lr03,lr02
	mul	lr02,lr03,lr02
	mul	lr02,lr03,lr02
	mull	lr02,lr03,lr02
	mtsrim	FS,7
	mfsr	gr60,Q
	extract	lr02,lr02,gr60		; extract the result
	store	16,lr02,lr05

for a multiply time of 12 cycles.


	-- Tim Olson
	Advanced Micro Devices

hansen@mips.UUCP (05/12/87)

In article <16640@amdcad.AMD.COM>, tim@amdcad.AMD.COM (Tim Olson) writes:
> 	int x;
> 
> 	main()
> 	{
> 
> 		x = x*57;
> 	}
> 
> Compiles to:
	<code deleted>
> for a multiply time of 6 cycles.

The MIPS compiler system does the same operation in 4 cycles, by performing
subtracts as well as adds.  The same code above compiles (unoptimized) to:

	main:
0x0:	lw	t6,0(gp)
0x4:	nop
0x8:	sll	t7,t6,3
0xc:	subu	t7,t7,t6
0x10:	sll	t7,t7,3
0x14:	addu	t7,t7,t6
0x18:	jr	ra
0x1c:	sw	t7,0(gp)

The code above uses the fact that 57 = (8-1)*8 + 1.

For the other AMD example, the MIPS compiler system can do all the branches
directly, without the compares, and places the single-statement paths
directly into the branch delay slots as was demonstated in the hand-tweaked
version. In fact, I had to change the code to generate two different results
and combine them, because otherwise the optimizer would remove the first set
of statements (they generate a dead value). The code vanishes entirely if no
value is returned by the routine.  The source code:

int bool(x)
	int x;
{	int y, z;
	
	if (x)
	    y=0;
	else
	    y=1;

	if (!x)
	    z=0;
	else
	    z=1;
	return(y+z);
}

Compiles to:

	bool:
0x0:	beq	a0,zero,0x14
0x4:	li	v1,1
0x8:	b	0x14
0xc:	move	v1,zero
0x10:	li	v1,1
0x14:	bne	a0,zero,0x28
0x18:	li	a0,1
0x1c:	b	0x28
0x20:	move	a0,zero
0x24:	li	a0,1
0x28:	jr	ra
0x2c:	addu	v0,v1,a0

It turns out that the movement of some code into branch delay slots occurs
as a peephole optimization after the code has been reorganized once, so the
same instruction appears at address 0x4 and 0x10, though the instruction at
address 0x10 is never used. (The same thing occurs at 0x18 and 0x24.)  If a
post-pass was employed to clean up the code, the unconditional branches
could also be removed (addresses 0x8 and 0x1c).  We don't bother because, in
practice, single-instruction branch paths are rare, and the unused
instructions don't cause any serious problems.

-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen

henry@utzoo.UUCP (Henry Spencer) (05/13/87)

> If the multiply is by a constant known at compile time, it is usually
> better to perform the multiply as a series of shifts and adds, rather
> than a sequence of multiply-steps...

That's what I meant, actually; I just didn't say it very well.
-- 
"The average nutritional value    Henry Spencer @ U of Toronto Zoology
of promises is roughly zero."     {allegra,ihnp4,decvax,pyramid}!utzoo!henry