roy@phri.nyu.edu (Roy Smith) (10/02/90)
davis@pacific.mps.ohio-state.edu (John E. Davis) writes:
> Which ['x != 100' or 'x < 100' RHS] generates faster code?  It seems to
> me that it is easier to tell if two values are unequal than to tell if
> one is greater than the other.  I'd rather save the machine a few
> micro-seconds than myself since I only do the comparison once whereas the
> machine must do it many times.

	Are there actually any machines in which a compare-and-branch for
inequality is any faster or slower than a compare-and-branch for less-than?
It seems to me that either should take one pass through the ALU to do the
comparison and set some flags, so they should both take the same amount of
time.  I'm basing my assumption on experience with pdp-11 type machines,
but I find it hard to imagine any other machine being significantly
different.  Maybe if you had an asynchronous ALU?

	The only scenario I could think of would be a RISC machine which
has only two branches: branch-on-equal and branch-on-less-than.  The
compiler could generate an appropriate stream of instructions to simulate
any possible branch condition from just those two, and some streams might
end up being longer than others, but that sounds pretty strange, and very
un-orthogonal.
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy
"Arcane?  Did you say arcane?  It wouldn't be Unix if it wasn't arcane!"
mutchler@zule.EBay.Sun.COM (Dan Mutchler) (10/03/90)
In article <1990Oct2.151644.1581@phri.nyu.edu> roy@phri.nyu.edu (Roy Smith) writes:

> davis@pacific.mps.ohio-state.edu (John E. Davis) writes:
> > Which ['x != 100' or 'x < 100' RHS] generates faster code?  It seems to
> > me that it is easier to tell if two values are unequal than to tell if
> > one is greater than the other.  I'd rather save the machine a few
> > micro-seconds than myself since I only do the comparison once whereas the
> > machine must do it many times.
>
> 	Are there actually any machines in which a compare-and-branch for
> inequality is any faster or slower than a compare-and-branch for less-than?
> It seems to me that either should take one pass through the ALU to do the
> comparison and set some flags, so they should both take the same amount of
> time.  I'm basing my assumption on experience with pdp-11 type machines,
> but I find it hard to imagine any other machine being significantly
> different.  Maybe if you had an asynchronous ALU?
>
> 	The only scenario I could think of would be a RISC machine which
> has only two branches; branch-on-equal, and branch-on-less-than.  The
> compiler could generate an appropriate stream of instructions to simulate
> any possible branch condition from just those two, and some streams might
> end up being longer than others, but that sounds pretty strange, and very
> un-orthogonal.

My experience has been that a compare is essentially a subtract (some
early machines didn't actually have a compare) that sets the appropriate
condition flags.  So "X != 100" is "subtract 100 from X and branch if not
zero", and "X < 100" is "subtract 100 from X and branch if negative".  If
I remember correctly, the cmp and sub instructions use the same number of
clock cycles on the 8086.
--
Dan Mutchler                  | ARPA/Internet: mutchler@zule.EBay.Sun.COM
Sun Federal System Engineer   | UUCP: ...!sun!mutchler
--------------------------------------------------------------------------
Flying back from Lubbock, I saw Jesus on the plane
Or maybe it was Elvis, You know they kind of look the same.
                                   -- Don Henley
meissner@osf.org (Michael Meissner) (10/03/90)
In article <1990Oct2.151644.1581@phri.nyu.edu> roy@phri.nyu.edu (Roy Smith) writes:

| 	Are there actually any machines in which a compare-and-branch for
| inequality is any faster or slower than a compare-and-branch for less-than?
| It seems to me that either should take one pass through the ALU to do the
| comparison and set some flags, so they should both take the same amount of
| time.  I'm basing my assumption on experience with pdp-11 type machines,
| but I find it hard to imagine any other machine being significantly
| different.  Maybe if you had an asynchronous ALU?

Yes, any computer based on the MIPS chipset (MIPS, DECstation, SGI) is
faster to do a branch on both equality and inequality than for the other
comparison operators.

| 	The only scenario I could think of would be a RISC machine which
| has only two branches; branch-on-equal, and branch-on-less-than.  The
| compiler could generate an appropriate stream of instructions to simulate
| any possible branch condition from just those two, and some streams might
| end up being longer than others, but that sounds pretty strange, and very
| un-orthogonal.

MIPS does not have a branch on a < b (unless b is 0).  It has a set
register to 1 if a < b instruction (& 0 otherwise).  Thus to do the
branch, you set the scratch register to the value of a < b, and then
branch on that register being zero.  It does have a direct instruction to
branch if two registers are equal or not equal.
For example, consider the following program:

	int i, j, k, l, m, n;

	void foo(){
	  if (i < j)
	    k++;

	  if (l == m)
	    n++;
	}

It produces the following code after running it through the compiler and
assembler:

foo:
File 'test-branch.c':
 0: int i, j, k, l, m, n;
 1:
 2: void foo(){
 3:   if (i < j)
 [test-branch.c: 4] 0x0:	8f8e0000	lw	r14,0(gp)
 [test-branch.c: 4] 0x4:	8f8f0000	lw	r15,0(gp)
 [test-branch.c: 4] 0x8:	00000000	nop
 [test-branch.c: 4] 0xc:	01cf082a	slt	r1,r14,r15
 [test-branch.c: 4] 0x10:	10200005	beq	r1,r0,0x28
 [test-branch.c: 4] 0x14:	00000000	nop
 4:     k++;
 [test-branch.c: 5] 0x18:	8f980000	lw	r24,0(gp)
 [test-branch.c: 5] 0x1c:	00000000	nop
 [test-branch.c: 5] 0x20:	27190001	addiu	r25,r24,1
 [test-branch.c: 5] 0x24:	af990000	sw	r25,0(gp)
 5:
 6:   if (l == m)
 [test-branch.c: 7] 0x28:	8f880000	lw	r8,0(gp)
 [test-branch.c: 7] 0x2c:	8f890000	lw	r9,0(gp)
 [test-branch.c: 7] 0x30:	00000000	nop
 [test-branch.c: 7] 0x34:	15090005	bne	r8,r9,0x4c
 [test-branch.c: 7] 0x38:	00000000	nop
 7:     n++;
 [test-branch.c: 8] 0x3c:	8f8a0000	lw	r10,0(gp)
 [test-branch.c: 8] 0x40:	00000000	nop
 [test-branch.c: 8] 0x44:	254b0001	addiu	r11,r10,1
 [test-branch.c: 8] 0x48:	af8b0000	sw	r11,0(gp)
 8: }
 [test-branch.c: 9] 0x4c:	03e00008	jr	r31
 [test-branch.c: 9] 0x50:	00000000	nop
 0x54:	00000000	nop
 0x58:	00000000	nop
 0x5c:	00000000	nop

From looking at the way the bits are set down, the blez (branch on less
than or equal zero) and bgtz (branch on greater than zero) instructions
could have been defined as ble and bgt, since the second register field is
required to be 0 (and register $0 is hardwired to 0).  I suspect the
decision may have been due to chip real estate, and the fact that equality
comparisons happen more frequently in real programs.
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Do apple growers tell their kids money doesn't grow on bushes?
mash@mips.COM (John Mashey) (10/03/90)
In article <MEISSNER.90Oct2140411@osf.osf.org> meissner@osf.org (Michael Meissner) writes:
>In article <1990Oct2.151644.1581@phri.nyu.edu> roy@phri.nyu.edu (Roy Smith) writes:
>
>| 	Are there actually any machines in which a compare-and-branch for
>| inequality is any faster or slower than a compare-and-branch for less-than?
>| It seems to me that either should take one pass through the ALU to do the
>| comparison and set some flags, so they should both take the same amount of
>| time.  I'm basing my assumption on experience with pdp-11 type machines,
>| but I find it hard to imagine any other machine being significantly
>| different......

Well, they are different....

>Yes, any computer based on the MIPS chipset (MIPS, DECstation, SGI) is
>faster to do a branch on both equality and inequality, than for the
>other comparison operators.

>Mips does not have a branch on a < b (unless b is 0).  It has a set
>register to 1 if a < b instruction (& 0 otherwise).  Thus to do the
>branch, you set the scratch register to be the value of a < b, and
>then do a branch on that register being zero.  It does have a direct
>instruction to branch if two registers are equal or not equal.

	....example....

>From looking at the way the bits are set down, the blez (branch on
>less than or equal zero) and bgtz (branch on greater than zero)
>instructions could have been defined as ble and bgt, since the second
>register field is required to be 0 (and register $0 is hardwired to
>0).  I suspect the decision may have been due to chip real estate, and
>the fact that equality comparisons happen more frequently in real
>programs.

OK, now the real reason is the cycle time tradeoff, which may or may not
be relevant to other chips, because it depends on other decisions you
have made.  The R3000 worked pretty hard to get 1-cycle compare-and-branch
for everything that it could, including such things as a micro-tlb,
because it couldn't do a lookup in the main tlb quite fast enough.

SO, WHY does it have:

1. branch on a == b	2 registers
2. branch on a != b	""
3. branch on a <= 0	register & 0
4. branch on a > 0	""
5. branch on a < 0	""
6. branch on a >= 0	""

But not
7,8. branch on a < b, or a <= b

ANS: 7-8 require a 32-bit subtract.  1-6 do not require any subtract, and
can be implemented with logic that is faster than subtract, in less space.
Note that 1 & 2 can be implemented by xoring a & b, and seeing if the
resulting bits are all zero (1), or not all zero (2).  Cases 5 & 6 need
only test the sign bit.  Cases 3 and 4 test the sign bit plus whether or
not all other bits are 0.

The timing was tight enough, for us, that a subtract as part of a
compare-and-branch would have lengthened the cycle time, or would have
added another branch delay cycle, losing more than the compensation
gained by having the extra instruction.  Deciding this took simulation,
including noting that the most frequent (>50%) test is a == 0 or a != 0
(1 or 2), and that smart compilers can often turn some loop terminations
from a <= N into a == N.

Note that one must still include a compare instruction (in our case Set
Less Than) that is essentially a subtract, but delivers the 0/1
indication to another register; it easily fits in the same timing as a
normal subtract, but does not have to obtain the result quite as quickly
as the compare-branch instructions must.  See pages 106-107 in Hennessy &
Patterson for further discussion.

Again, note that this reasoning may or may not be valid for everybody, as
it depends on the other architectural predecessors & expected
implementation technologies.  For example, I think HP PA chose the other
direction.  However, such choices are hardly random events, or people
leaving things out for no particular reason.  I recall this one got
studied a lot, because conditional branches are important, and they tend
not to go away, even with good optimizers.
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
Don_A_Corbitt@cup.portal.com (10/03/90)
roy@phri.nyu.edu writes:
> davis@pacific.mps.ohio-state.edu (John E. Davis) writes:
> > Which ['x != 100' or 'x < 100' RHS] generates faster code?  It seems to
> > me that it is easier to tell if two values are unequal than to tell if
> > one is greater than the other.  I'd rather save the machine a few
> > micro-seconds than myself since I only do the comparison once whereas the
> > machine must do it many times.
>
> 	Are there actually any machines in which a compare-and-branch for
> inequality is any faster or slower than a compare-and-branch for less-than?
> It seems to me that either should take one pass through the ALU to do the
> comparison and set some flags, so they should both take the same amount of
> time.  I'm basing my assumption on experience with pdp-11 type machines,
> but I find it hard to imagine any other machine being significantly
> different.  Maybe if you had an asynchronous ALU?

[goes on to theorize about RISC machine with limited branch types]

Actually, the i860 is faster when comparing for equality than for
greater-than.  For example:

	xor	r16, r17, r0
	bc	somewhere

takes two clocks, plus one annulled after the branch if taken.  The
sequence

	subu	r16, r17, r0
	bc	somewhere

takes three clocks, since there is a freeze before the conditional branch
while the condition code is updated.  On this chip, adds/addu and
subs/subu followed by a conditional branch cost an extra cycle.  I
haven't seen this in print, but I assume it is because there is extra
delay waiting for carry propagation to set the condition code, while XOR
can test all the bits in parallel.
---
Don_A_Corbitt@cup.portal.com      Not a spokesperson for CrystalGraphics, Inc.
Mail flames, post apologies.      Support short .signatures, three lines max.
(Soon leaving this account - flame me now, while you have the chance!)
marcb@hp-ptp.HP.COM (Marc Brandis) (10/03/90)
meissner@osf.org (Michael Meissner) writes:
>In article <1990Oct2.151644.1581@phri.nyu.edu> roy@phri.nyu.edu (Roy Smith) writes:
>| 	Are there actually any machines in which a compare-and-branch for
>| inequality is any faster or slower than a compare-and-branch for less-than?
>| It seems to me that either should take one pass through the ALU to do the
>| comparison and set some flags, so they should both take the same amount of
>| time.  I'm basing my assumption on experience with pdp-11 type machines,
>| but I find it hard to imagine any other machine being significantly
>| different.  Maybe if you had an asynchronous ALU?

>Yes, any computer based on the MIPS chipset (MIPS, DECstation, SGI) is
>faster to do a branch on both equality and inequality, than for the
>other comparison operators.

>Mips does not have a branch on a < b (unless b is 0).  It has a set
>register to 1 if a < b instruction (& 0 otherwise).  Thus to do the
>branch, you set the scratch register to be the value of a < b, and
>then do a branch on that register being zero.  It does have a direct
>instruction to branch if two registers are equal or not equal.

I do not know the actual reasons for which the MIPS has been designed
like this.  A possible reason may be that a compare for equality or
inequality can in fact be implemented faster than an arithmetic compare,
as you basically need an adder to do the arithmetic comparison.  To
compare for inequality, you need only a bunch of XOR gates wired
together.  As this is so fast, the comparison can be done in an early
pipeline stage (e.g. just after the register fetch); the data does not
have to be fed through the ALU.  You can win one cycle this way, as the
result of the comparison, and therefore the outcome of the branch, is
known earlier.  I remember having read about one machine designed like
this, but I do not remember which one.

(* I speak only for myself.
Marc-Michael Brandis
Institut fuer Computersysteme, ETH Zentrum
CH-8092 Zuerich, Switzerland
e-mail: brandis@inf.ethz.ch   brandis@iis.ethz.ch
Temporarily at HP, marcb@hp-ptp.ptp.hp.com *)
dswartz@bigbootay.sw.stratus.com (Dan Swartzendruber) (10/04/90)
Can you say "nitpicking over whether to do == or < is silly micro-level optimization which can't even be relied on to behave the same from one machine to the next"? I knew you could! -- Dan S.
mash@mips.COM (John Mashey) (10/04/90)
In article <2572@lectroid.sw.stratus.com> dswartz@bigbootay.sw.stratus.com (Dan Swartzendruber) writes:
>Can you say "nitpicking over whether to do == or < is silly micro-level
>optimization which can't even be relied on to behave the same from one
>machine to the next"?  I knew you could!

At the code level, this is probably true, although it certainly is the
case that there are fundamental computer system design reasons why ==
might be faster than <.  Whether these reasons show up in any given
computer architecture is another issue, but certainly, if you care, the
== or != tests are likely to be faster on some machines, equal on
others, but seldom slower.  It probably doesn't matter most places,
except in some technical code with short loops with many iterations.
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
dswartz@bigbootay.sw.stratus.com (Dan Swartzendruber) (10/04/90)
In article <41898@mips.mips.COM> mash@mips.COM (John Mashey) writes:
:In article <2572@lectroid.sw.stratus.com> dswartz@bigbootay.sw.stratus.com (Dan Swartzendruber) writes:
::Can you say "nitpicking over whether to do == or < is silly micro-level
::optimization which can't even be relied on to behave the same from one
::machine to the next"?  I knew you could!
:
:At the code level, this is probably true, although it certainly is the
:case that there are fundamental computer system design reasons why
:== might be faster than <.  Whether these reasons show up in any
:given computer architecture is another issue, but certainly, if you care,
:the == or != tests are likely to be equal or faster on many machines,
:or equal on others, but seldom slower.
:It probably doesn't matter most places, except in some technical code
:with short loops with many iterations.

My point was only that I felt that any performance advantage (one or two
clocks was bandied about) is grossly outweighed by the rest of the loop,
and given that, I tend to prefer the < syntax (since it is pretty much
required for floating-point variables).
--
Dan S.
aglew@crhc.uiuc.edu (Andy Glew) (10/08/90)
..> Q: Are there machines where test for equality is faster than test for <, etc.?
..> A: yes.

John Mashey explains, wrt. the MIPS R3000:
>
>1. branch on a == b	2 registers
>2. branch on a != b	""
>3. branch on a <= 0	register & 0
>4. branch on a > 0	""
>5. branch on a < 0	""
>6. branch on a >= 0	""
>
>But not
>7,8. branch on a < b, or a <= b
>
>...
>
>	The timing was tight enough, for us, that a subtract as part of
>	a compare-and-branch would have lengthened the cycle time,
>	or would have added another branch delay cycle, losing
>	more than the compensation gained by having the extra instruction.

Letting me harp on one of my favorite harping points: carry propagation
is evil, and gets more evil as we move towards 64 bit machines.

Besides carry, there is some small potential for making

1z. branch on a == 0	1 register
2z. branch on a != 0	""

as well as 3-6 above, faster than the class of 1 and 2, as well as faster
than 7 and 8.

1z, 2z, and 3-6 can be precomputed, either at the time the value is
created, or in a bit of slack time a little bit thereafter[*], and the
result encoded in only a few bits - 6 if you want to be really slack, and
only have to select a single bit at test time, fewer if you are willing
to do a bit of encoding (more if you worry about signed/unsigned
distinctions).  These precomputed branch conditions can be stored as a
few bits associated with the register.

A logic function which is a function of only a few bits is usually faster
than a function of many bits (as is required at branch time when
comparing two variables for (in)equality).  Moreover, these few bits can
be associated with the branch logic, rather than with the rest of the
register next to the ALU.

Needless to say, you wouldn't do this unless you really had to.  It can
shave a bit off your cycle time for branches, but you don't really need
to do that unless that is your critical time.  As usual, simulate the
tradeoffs.
(Hint: scientific code does okay, pattern matching code tends to have
tests for equality more often.)

[*] you wouldn't want to increase the latency of the results in order to
do this precomputation.
--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]