[comp.arch] a style question

roy@phri.nyu.edu (Roy Smith) (10/02/90)

davis@pacific.mps.ohio-state.edu  (John E. Davis) writes:
> Which ['x != 100' or 'x < 100' RHS] generates faster code?  It seems to
> me that it is easier to tell if two values are unequal than to tell if
> one is greater than the other.  I'd rather save the machine a few
> micro-seconds than myself since I only do the comparison once whereas the
> machine must do it many times.

	Are there actually any machines in which a compare-and-branch for
inequality is any faster or slower than a compare-and-branch for less-than?
It seems to me that either should take one pass through the ALU to do the
comparison and set some flags, so they should both take the same amount of
time.  I'm basing my assumption on experience with pdp-11 type machines,
but I find it hard to imagine any other machine being significantly
different.  Maybe if you had an asynchronous ALU?

	The only scenario I could think of would be a RISC machine which
has only two branches; branch-on-equal, and branch-on-less-than.  The
compiler could generate an appropriate stream of instructions to simulate
any possible branch condition from just those two, and some streams might
end up being longer than others, but that sounds pretty strange, and very
un-orthogonal.
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy
"Arcane?  Did you say arcane?  It wouldn't be Unix if it wasn't arcane!"

mutchler@zule.EBay.Sun.COM (Dan Mutchler) (10/03/90)

In article <1990Oct2.151644.1581@phri.nyu.edu> roy@phri.nyu.edu (Roy Smith) writes:

   davis@pacific.mps.ohio-state.edu  (John E. Davis) writes:
   > Which ['x != 100' or 'x < 100' RHS] generates faster code?  It seems to
   > me that it is easier to tell if two values are unequal than to tell if
   > one is greater than the other.  I'd rather save the machine a few
   > micro-seconds than myself since I only do the comparison once whereas the
   > machine must do it many times.

	   Are there actually any machines in which a compare-and-branch for
   inequality is any faster or slower than a compare-and-branch for less-than?
   It seems to me that either should take one pass through the ALU to do the
   comparison and set some flags, so they should both take the same amount of
   time.  I'm basing my assumption on experience with pdp-11 type machines,
   but I find it hard to imagine any other machine being significantly
   different.  Maybe if you had an asynchronous ALU?

	   The only scenario I could think of would be a RISC machine which
   has only two branches; branch-on-equal, and branch-on-less-than.  The
   compiler could generate an appropriate stream of instructions to simulate
   any possible branch condition from just those two, and some streams might
   end up being longer than others, but that sounds pretty strange, and very
   un-orthogonal.

My experience has been that a compare is essentially a subtract (some
early machines didn't actually have a compare) that sets the
appropriate condition flags.  So X != 100 is subtract 100 from X and
branch if not zero, and X < 100 is subtract 100 from X and branch if
negative.  If I remember correctly, the cmp and sub instructions use
the same number of clock cycles on the 8086.

--

Dan Mutchler                       | ARPA/Internet:  mutchler@zule.EBay.Sun.COM
Sun Federal System Engineer        | UUCP:           ...!sun!mutchler
--------------------------------------------------------------------------
Flying back from Lubbock, I saw Jesus on the plane
Or maybe it was Elvis, You know they kind of look the same.
                     -- Don Henley

meissner@osf.org (Michael Meissner) (10/03/90)

In article <1990Oct2.151644.1581@phri.nyu.edu> roy@phri.nyu.edu (Roy
Smith) writes:

| 	Are there actually any machines in which a compare-and-branch for
| inequality is any faster or slower than a compare-and-branch for less-than?
| It seems to me that either should take one pass through the ALU to do the
| comparison and set some flags, so they should both take the same amount of
| time.  I'm basing my assumption on experience with pdp-11 type machines,
| but I find it hard to imagine any other machine being significantly
| different.  Maybe if you had an asynchronous ALU?

Yes, any computer based on the MIPS chipset (MIPS, DECstation, SGI) is
faster to do a branch on both equality and inequality than on the
other comparison operators.

| 	The only scenario I could think of would be a RISC machine which
| has only two branches; branch-on-equal, and branch-on-less-than.  The
| compiler could generate an appropriate stream of instructions to simulate
| any possible branch condition from just those two, and some streams might
| end up being longer than others, but that sounds pretty strange, and very
| un-orthogonal.

Mips does not have a branch on a < b (unless b is 0).  It has a set
register to 1 if a < b instruction (& 0 otherwise).  Thus to do the
branch, you set the scratch register to be the value of a < b, and
then do a branch on that register being zero.  It does have a direct
instruction to branch if two registers are equal or not equal.

For example, consider the following program:

	int i, j, k, l, m, n;

	void foo(){
	  if (i < j)
	    k++;

	  if (l == m)
	    n++;
	}

It produces the following code after running it through the compiler
and assembler:

		foo:
	File 'test-branch.c':
	0: int i, j, k, l, m, n;
	1: 
	2: void foo(){
	3:   if (i < j)
	  [test-branch.c:   4] 0x0:	8f8e0000	lw	r14,0(gp)
	  [test-branch.c:   4] 0x4:	8f8f0000	lw	r15,0(gp)
	  [test-branch.c:   4] 0x8:	00000000	nop
	  [test-branch.c:   4] 0xc:	01cf082a	slt	r1,r14,r15
	  [test-branch.c:   4] 0x10:	10200005	beq	r1,r0,0x28
	  [test-branch.c:   4] 0x14:	00000000	nop
	4:     k++;
	  [test-branch.c:   5] 0x18:	8f980000	lw	r24,0(gp)
	  [test-branch.c:   5] 0x1c:	00000000	nop
	  [test-branch.c:   5] 0x20:	27190001	addiu	r25,r24,1
	  [test-branch.c:   5] 0x24:	af990000	sw	r25,0(gp)
	5: 
	6:   if (l == m)
	  [test-branch.c:   7] 0x28:	8f880000	lw	r8,0(gp)
	  [test-branch.c:   7] 0x2c:	8f890000	lw	r9,0(gp)
	  [test-branch.c:   7] 0x30:	00000000	nop
	  [test-branch.c:   7] 0x34:	15090005	bne	r8,r9,0x4c
	  [test-branch.c:   7] 0x38:	00000000	nop
	7:     n++;
	  [test-branch.c:   8] 0x3c:	8f8a0000	lw	r10,0(gp)
	  [test-branch.c:   8] 0x40:	00000000	nop
	  [test-branch.c:   8] 0x44:	254b0001	addiu	r11,r10,1
	  [test-branch.c:   8] 0x48:	af8b0000	sw	r11,0(gp)
	8: }
	  [test-branch.c:   9] 0x4c:	03e00008	jr	r31
	  [test-branch.c:   9] 0x50:	00000000	nop
	  0x54:	00000000	nop
	  0x58:	00000000	nop
	  0x5c:	00000000	nop

From looking at the way the bits are set down, the blez (branch on
less than or equal zero) and bgtz (branch on greater than zero)
instructions could have been defined as ble and bgt, since the second
register field is required to be 0 (and register $0 is hardwired to
0).  I suspect the decision may have been due to chip real estate, and
the fact that equality comparisons happen more frequently in real
programs.
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Do apple growers tell their kids money doesn't grow on bushes?

mash@mips.COM (John Mashey) (10/03/90)

In article <MEISSNER.90Oct2140411@osf.osf.org> meissner@osf.org (Michael Meissner) writes:
>In article <1990Oct2.151644.1581@phri.nyu.edu> roy@phri.nyu.edu (Roy
>Smith) writes:
>
>| 	Are there actually any machines in which a compare-and-branch for
>| inequality is any faster or slower than a compare-and-branch for less-than?
>| It seems to me that either should take one pass through the ALU to do the
>| comparison and set some flags, so they should both take the same amount of
>| time.  I'm basing my assumption on experience with pdp-11 type machines,
>| but I find it hard to imagine any other machine being significantly
>| different......
Well, they are different....

>Yes, any computer based on the MIPS chipset (MIPS, DECstation, SGI) is
>faster to do a branch on both equality and inequality than on the
>other comparison operators.

>Mips does not have a branch on a < b (unless b is 0).  It has a set
>register to 1 if a < b instruction (& 0 otherwise).  Thus to do the
>branch, you set the scratch register to be the value of a < b, and
>then do a branch on that register being zero.  It does have a direct
>instruction to branch if two registers are equal or not equal.
....example....
>From looking at the way the bits are set down, the blez (branch on
>less than or equal zero) and bgtz (branch on greater than zero)
>instructions could have been defined as ble and bgt, since the second
>register field is required to be 0 (and register $0 is hardwired to
>0).  I suspect the decision may have been due to chip real estate, and
>the fact that equality comparisons happen more frequently in real
>programs.

OK, now the real reason is the cycle time tradeoff, which may or
may not be relevant to other chips, because it depends on other decisions
you have made.  The R3000 worked pretty hard to get 1-cycle
compare-and-branch for everything that it could, including such things
as a micro-tlb because it couldn't do a lookup in the main tlb quite
fast enough.  SO, WHY does it have:
1.	branch on a == b	2 registers
2.	branch on a != b	""
3.	branch on a <= 0	register & 0
4.	branch on a > 0		""
5.	branch on a < 0		""
6.	branch on a >= 0	""

But not
7,8.	branch on a < b, or a <= b

ANS:
	7-8 require a 32-bit subtract.
	1-6 do not require any subtract, and can be implemented with logic
	that is faster than subtract, in less space.
	Note that 1 & 2 can be implemented by xoring a & b, and seeing
	if the resulting bits are all zero (1), or not all zero (2).
	Cases 5 & 6 need only test the sign bit.
	Cases 3 and 4 test the sign bit plus whether or not all other
	bits are 0.
	
	The timing was tight enough, for us, that a subtract as part of
	a compare-and-branch would have lengthened the cycle time,
	or would have added another branch delay cycle, losing
	more than the compensation gained by having the extra instruction.
	Deciding this took simulation, including noting that the most
	frequent (>50%) test is a == 0 or a != 0  (1 or 2), and that smart
	compilers can often turn some loop terminations from a <= N
	into a == N.  Note that one must still include a compare
	instruction (in our case Set Less Than), that is essentially
	a subtract, but delivers the 0/1 indication to another register,
	and easily fits in the same timing as a normal subtract, but does
	not have to obtain the result quite as quickly as the compare-branch
	instructions must.

See pages 106-107 in Hennessy & Patterson for further discussion.

Again, note that this reasoning may or may not be valid for everybody,
as it depends on the other architectural predecessors & expected
implementation technologies.  For example, I think HP PA chose the other
direction. 

However, such choices are hardly random events, or people leaving things
out for no particular reason.  I recall this one got studied a lot,
because conditional branches are important, and they tend not to go
away, even with good optimizers.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

Don_A_Corbitt@cup.portal.com (10/03/90)

roy@phri.nyu.edu writes:
> davis@pacific.mps.ohio-state.edu  (John E. Davis) writes:
> > Which ['x != 100' or 'x < 100' RHS] generates faster code?  It seems to
> > me that it is easier to tell if two values are unequal than to tell if
> > one is greater than the other.  I'd rather save the machine a few
> > micro-seconds than myself since I only do the comparison once whereas the
> > machine must do it many times.
> 
> 	Are there actually any machines in which a compare-and-branch for
> inequality is any faster or slower than a compare-and-branch for less-than?
> It seems to me that either should take one pass through the ALU to do the
> comparison and set some flags, so they should both take the same amount of
> time.  I'm basing my assumption on experience with pdp-11 type machines,
> but I find it hard to imagine any other machine being significantly
> different.  Maybe if you had an asynchronous ALU?

[goes on to theorize about RISC machine with limited branch types]

Actually, the i860 is faster when comparing for equality than greater-than.
For example:
	xor	r16, r17, r0
	bc	somewhere
takes two clocks, plus one annulled after the branch if taken.
The sequence
	subu	r16, r17, r0
	bc	somewhere
takes three clocks, since there is a freeze before the conditional 
branch while the condition code is updated.  On this chip, adds/addu and
subs/subu followed by a conditional branch costs an extra cycle.

I haven't seen this in print, but I assume it is because there is extra 
delay waiting for carry propagation to set the condition code.  XOR can
test all the bits in parallel.
---
Don_A_Corbitt@cup.portal.com      Not a spokesperson for CrystalGraphics, Inc.
Mail flames, post apologies.       Support short .signatures, three lines max.
(Soon leaving this account - flame me now, while you have the chance!)

marcb@hp-ptp.HP.COM (Marc Brandis) (10/03/90)

meissner@osf.org (Michael Meissner) writes:

>In article <1990Oct2.151644.1581@phri.nyu.edu> roy@phri.nyu.edu (Roy
>Smith) writes:

>| 	Are there actually any machines in which a compare-and-branch for
>| inequality is any faster or slower than a compare-and-branch for less-than?
>| It seems to me that either should take one pass through the ALU to do the
>| comparison and set some flags, so they should both take the same amount of
>| time.  I'm basing my assumption on experience with pdp-11 type machines,
>| but I find it hard to imagine any other machine being significantly
>| different.  Maybe if you had an asynchronous ALU?

>Yes, any computer based on the MIPS chipset (MIPS, DECstation, SGI) is
>faster to do a branch on both equality and inequality than on the
>other comparison operators.

>Mips does not have a branch on a < b (unless b is 0).  It has a set
>register to 1 if a < b instruction (& 0 otherwise).  Thus to do the
>branch, you set the scratch register to be the value of a < b, and
>then do a branch on that register being zero.  It does have a direct
>instruction to branch if two registers are equal or not equal.

I do not know the actual reasons for which the MIPS has been designed
like this.    

A possible reason may be that a compare for equality or inequality can
in fact be implemented faster than an arithmetic compare, as you
basically need an adder to do the arithmetic comparison.  To compare
for inequality, you need only a bunch of XOR gates wired together.  As
this is so fast, the comparison can be done in an early pipeline stage
(e.g. just after the register fetch).  The data does not have to be
fed through the ALU.  You can save one cycle this way, as the result
of the comparison, and therefore the outcome of the branch, is known
earlier.

I remember having read about one machine being designed like this, but
I do not remember which one.


(* I speak only for myself.
	Marc-Michael Brandis
	Institut fuer Computersysteme
	ETH Zentrum
	CH-8092 Zuerich, Switzerland
	e-mail: brandis@inf.ethz.ch
		brandis@iis.ethz.ch
   Temporarily at HP, marcb@hp-ptp.ptp.hp.com
*)

dswartz@bigbootay.sw.stratus.com (Dan Swartzendruber) (10/04/90)

Can you say "nitpicking over whether to do == or < is silly micro-level
optimization which can't even be relied on to behave the same from one
machine to the next"?  I knew you could!

--

Dan S.

mash@mips.COM (John Mashey) (10/04/90)

In article <2572@lectroid.sw.stratus.com> dswartz@bigbootay.sw.stratus.com (Dan Swartzendruber) writes:
>Can you say "nitpicking over whether to do == or < is silly micro-level
>optimization which can't even be relied on to behave the same from one
>machine to the next"?  I knew you could!

At the code level, this is probably true, although it certainly is the
case that there are fundamental computer system design reasons why
== might be faster than <.  Whether these reasons show up in any
given computer architecture is another issue, but certainly, if you
care, the == or != tests are likely to be faster on some machines and
equal on others, but seldom slower.  It probably doesn't matter in
most places, except in some technical code with short loops and many
iterations.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

dswartz@bigbootay.sw.stratus.com (Dan Swartzendruber) (10/04/90)

In article <41898@mips.mips.COM> mash@mips.COM (John Mashey) writes:
:In article <2572@lectroid.sw.stratus.com> dswartz@bigbootay.sw.stratus.com (Dan Swartzendruber) writes:
::Can you say "nitpicking over whether to do == or < is silly micro-level
::optimization which can't even be relied on to behave the same from one
::machine to the next"?  I knew you could!
:
:At the code level, this is probably true, although it certainly is the
:case that there are fundamental computer system design reasons why
:== might be faster than <.  Whether these reasons show up in any
:given computer architecture is another issue, but certainly, if you care,
:the == or != tests are likely to be equal or faster on many machines,
:or equal on others, but seldom slower.
:It probably doesn't matter most places, except in some technical code
:with short loops with many iterations.

My point was only that I felt any performance advantage (one or two
clocks was bandied about) is grossly outweighed by the rest of the
loop, and given that, I tend to prefer the < syntax (since it is
pretty much required for floating-point variables).


--

Dan S.

aglew@crhc.uiuc.edu (Andy Glew) (10/08/90)

..> Q: Are there machines where test for equality is faster than test for <, etc.?
..> A: yes.  John Mashey explains, wrt. the MIPS R3000:
>
>1.	branch on a == b	2 registers
>2.	branch on a != b	""
>3.	branch on a <= 0	register & 0
>4.	branch on a > 0		""
>5.	branch on a < 0		""
>6.	branch on a >= 0	""
>
>But not
>7,8.	branch on a < b, or a <= b
>
>...
>	
>	The timing was tight enough, for us, that a subtract as part of
>	a compare-and-branch would have lengthened the cycle time,
>	or would have added another branch delay cycle, losing
>	more than the compensation gained by having the extra instruction.


Let me harp on one of my favorite harping points: carry propagation is
evil, and gets more evil as we move towards 64-bit machines.

Besides carry, there is some small potential for making

1z.	branch on a == 0	1 register
2z.	branch on a != 0	""

as well as cases 3-6 above, faster than cases 1 and 2, and faster
than cases 7 and 8.

1z, 2z, and 3-6 can be precomputed, either at the time the value is
created, or in a bit of slack time a little bit thereafter[*], and the
result encoded in only a few bits - 6 if you want to be really slack,
and only have to select a single bit at test time, fewer if you are
willing to do a bit of encoding (more if you worry about
signed/unsigned distinctions).

These precomputed branch conditions can be stored as a few bits
associated with the register.  A logic function which is a function of
only a few bits is usually faster than a function of many bits (as is
required at branch time when comparing two variables for
(in)equality).
    Moreover, these few bits can be associated with the branch logic, 
rather than with the rest of the register next to the ALU.

Needless to say, you wouldn't do this unless you really had to.  It
can shave a bit off your cycle time for branches, but you don't really
need to do that unless that is your critical time.

As usual, simulate the tradeoffs. (Hint: scientific code does okay,
pattern matching code tends to have tests for equality more often).



[*] you wouldn't want to increase the latency of the results 
    in order to do this precomputation.
--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]