crowl@cs.rochester.edu (Lawrence Crowl) (02/16/88)
Several authors have been arguing over conditional branching using condition codes versus using booleans in general-purpose registers. Why not store the condition codes in a general-purpose register? This would result in a scheduable two-instruction sequence with multi-way branching at the cost of using a general purpose register and slightly lower code density. For example, compare R1 with R2 place condition code in R3 branch to ADDRESS if condition code in R3 is GREATER branch to ADDRESS if condition code in R3 is EQUAL branch to ADDRESS if condition code in R3 is LESS Will this approach work? Will the condition code comparison in the branch cost too much? (It is an instruction constant.) Comments please! -- Lawrence Crowl 716-275-9499 University of Rochester crowl@cs.rochester.edu Computer Science Department ...!{allegra,decvax,rutgers}!rochester!crowl Rochester, New York, 14627
mash@mips.COM (John Mashey) (02/17/88)
In article <6834@sol.ARPA> crowl@cs.rochester.edu (Lawrence Crowl) writes: >Several authors have been arguing over conditional branching using condition >codes versus using booleans in general-purpose registers. Why not store the >condition codes in a general-purpose register? .... > compare R1 with R2 place condition code in R3 > branch to ADDRESS if condition code in R3 is GREATER > branch to ADDRESS if condition code in R3 is EQUAL > branch to ADDRESS if condition code in R3 is LESS >Will this approach work? Will the condition code comparison in the branch cost >too much? (It is an instruction constant.) Comments please! In some flavor or other, at least several RISCs do this, such as: MIPS R2000 (has set-less-than op that generates 0 or 1) AMD 29000 (has compare insts that set sign bit of arb. reg) Moto 78000 (has compares that set cond codes in regs, I think) and I think HP Precision does this also. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
brooks@lll-crg.llnl.gov (Eugene D. Brooks III) (02/17/88)
In article <6834@sol.ARPA> crowl@cs.rochester.edu (Lawrence Crowl) writes: >Several authors have been arguing over conditional branching using condition >codes versus using booleans in general-purpose registers. Why not store the >condition codes in a general-purpose register? This would result in a >scheduable two-instruction sequence with multi-way branching at the cost of >using a general purpose register and slightly lower code density. For example, > > compare R1 with R2 place condition code in R3 > branch to ADDRESS if condition code in R3 is GREATER > branch to ADDRESS if condition code in R3 is EQUAL > branch to ADDRESS if condition code in R3 is LESS > >Will this approach work? Will the condition code comparison in the branch cost >too much? (It is an instruction constant.) Comments please! Yes, it works. An example of such an engine is the Ridge 32 manufactured by Ridge Computers Inc., of Santa Clara, CA. No only are there instructions to put a "condition code" in a register, there are instructions to conditional branch with the registers involved in the comparison direction. For example: branch if the value in R1 is greater than the value in R2, along with a branch prediction bit as well. The branch takes one clock if the prediction is correct, 4 clocks if it the prediciton bit is wrong. It worketh, and it worketh well. There is a very BIG lesson here as well for those who are picking an instruction set to "simulate". RISC is the way to fly, but in addition one finds that the cost of computing condition codes can be far greater than the cost of computing the results of an instruction itself, only to either ignore the condition codes or destroy them in the next instruction. Choosing an instruction set with "no condition codes" is a major kick in the butt with respect to simulator speed. This is one of the key reasons why the Ridge 32 instruction set, suitably generalized for the purposes of the shared memory multiprocessor environment, was chosen for the Cerberus multiprocessor simulator.
jmd@granite.dec.com (John Danskin) (02/18/88)
I am a fan of condition codes in general registers because I would like to be able to turn the following code sequence: fcmp r0, r1 nop nop br GE, somewhere fcmp r0, r2 nop nop br GE, somewhere1 fcmp r0, r3 nop nop br GE, somewhere2 into fcmp r0, r1, c1 fcmp r0, r2, c2 fcmp r0, r3, c3 br GE, c1, somewhere br GE, c2, somewhere2 br GE, c3, somewhere3 Obviously two operand compare and branch is niftier, but hardly anyone can do that in one (reasonably fast) risc cycle. I have no idea how much this sort of thing comes up in average code, but it makes a fair bit of difference in 3D graphics clipping computations. I would like to say as well that it seems like a waste to put the condition codes in GENERAL registers. The little suckers are only one or two bits depending on how you set things up, and it would be relatively inexpensive to have a whole bank of special condition code registers. Some vector instruction sets do this already. Anybody care to flame on how incredibly useless this is for the real world? After all, I make MY living writing microcode, and we all know that nobody has been doing that since 1843 8-). -- John Danskin | decwrl!jmd DEC Workstation Systems Engineering | (415)853-6724 100 Hamilton Avenue | My comments are my own. Palo Alto, CA 94306 | I do not speak for DEC.
baum@apple.UUCP (Allen J. Baum) (02/18/88)
-------- [] >In article <1585@winchester.mips.COM> mash@winchester.UUCP(John Mashey)writes: re: conditional branching using condition codes vs. booleans in GPRs: > >In some flavor or other, at least several RISCs do this, such as: .... >and I think HP Precision does this also. HP does have {Compare,Move,Bittest,Add}&Branch instructions. It can't set an arbitrary condition in any bit of a register in one instruction; however, there is a special instruction (compare&clear) that allows construction of a 2 instruction sequence that will; it can also be used for other things, since the instruction following compare&clear (it 'skips' on condition) can be any instruction, not just the obvious 'load immediate 1'. There was a load condition inst. at one time, but it was removed because it looked like it added a gate to the critical path length. -- {decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385
reeder@ut-emx.UUCP (02/18/88)
In article <6834@sol.ARPA>, crowl@cs.rochester.edu (Lawrence Crowl) writes: > using a general purpose register and slightly lower code density. For example, > > compare R1 with R2 place condition code in R3 > branch to ADDRESS if condition code in R3 is GREATER > branch to ADDRESS if condition code in R3 is EQUAL > branch to ADDRESS if condition code in R3 is LESS > > Will this approach work? Will the condition code comparison in the branch cost > too much? (It is an instruction constant.) Comments please! That is exactly how you do things on a Cray, for example: A0 A1-A2 Compare A1 and A2 JAP addr1 Branch to addr1 if A1 > A2 JAZ addr2 Branch to addr2 if A1 = A2 JAN addr3 Branch to addr3 if A1 < A2 William "Wills" Reeder -- honk honk honk
cik@l.cc.purdue.edu (Herman Rubin) (02/19/88)
In article <868@ut-emx.UUCP>, reeder@ut-emx.UUCP writes: > In article <6834@sol.ARPA>, crowl@cs.rochester.edu (Lawrence Crowl) writes: ...... > > compare R1 with R2 place condition code in R3 > > branch to ADDRESS if condition code in R3 is GREATER > > branch to ADDRESS if condition code in R3 is EQUAL > > branch to ADDRESS if condition code in R3 is LESS ...... > That is exactly how you do things on a Cray, for example: > > A0 A1-A2 Compare A1 and A2 > JAP addr1 Branch to addr1 if A1 > A2 > JAZ addr2 Branch to addr2 if A1 = A2 > JAN addr3 Branch to addr3 if A1 < A2 This is not the same. The first instruction computes a value and inserts it in A0. The other instructions perform a test on the value in A0 and use _that_, not the bits in a condition code, to decide whether to perform the branch. To clearly see the difference, suppose that several condition codes were involved in a branch situation. On a Cray, the appropriate values would have to be stored in various A or S registers, and _for each situation for which a jump is contemplated_, the appropriate value would have to be stored in A0 or S0 and then tested. It is not possible to directly move A0 or S0 to another A or S register, either. On machines which can do the more common type of tests, it is no faster to compute the difference and transfer on it rather than transferring on the comparison. Furthermore, this does not allow the condition codes to be used for overflow or carry. Of course, the Cray is not too well suited for integer multiplication or any kind of division, but it is well suited for integer addition and shifts, and that is where overflow and carry are important. Let me make some additional suggestions apropos of Crowl's posting. It should be possible to address an arbitrary byte of a register (this is useful for other purposes). One way of doing this is to make a small address a reference to the register file; on virtual machines this would be costless, and the amount of memory lost on non-virtual machines would be negligeable. Some machines have done this, but I do not believe that they have done it well; either this can only be used in certain situations, or the user's registers are in blocks and/or separated from main memory. If we do this, we can keep many condition codes active for transfers. -- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907 Phone: (317)494-6054 hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet
ok@quintus.UUCP (Richard A. O'Keefe) (02/19/88)
In article <184@granite.dec.com>, jmd@granite.dec.com (John Danskin) writes: > I would like to say as well that it seems like a waste to put the condition > codes in GENERAL registers. The little suckers are only one or two bits > depending on how you set things up, and it would be relatively inexpensive > to have a whole bank of special condition code registers. Some vector > instruction sets do this already. > What's wrong with putting the condition codes in general registers? If you have as many registers as taking a RISC is supposed to buy you, putting a condition code in a general register isn't going to hurt, and there is an interesting advantage. Having the conditional branches be of the form B<cc> <reg>,<label> means that you have a fast way of dispatching on some combinations of bits in a register, which means dispatching on tag bits can be *faster* than it would in a condition-code architecture. It also means that if you want to set up a particular set of conditions (e.g. clearing the carry before starting a multi-precision add) you just use an ordinary constant->register move, and don't need a special class of instructions for setting condition registers.
mash@mips.COM (John Mashey) (02/19/88)
In article <184@granite.dec.com> jmd@granite.UUCP (John Danskin) writes: >I am a fan of condition codes in general registers because I would like to be.. >I would like to say as well that it seems like a waste to put the condition >codes in GENERAL registers. The little suckers are only one or two bits >depending on how you set things up, and it would be relatively inexpensive >to have a whole bank of special condition code registers. ... Unfortunately, this probably is not a good idea, for both hardware and software reasons: In a heavily-pipeline machine (like any of the current RISCs), the output of an ALU cycle isn't actually written back into a register into one or more cycles later. Nevertheless, the output must be available as input to the next cycle, without a delay cycle, or else you take a large performance hit. Hence, such machines use "register bypass networks" to let a cycle grab the previous cycle's output on the way to the registers. If you want to have a single condition code whose use doesn't delay a branch, you have to bypass the condition code also, and of course, that takes more hardware, which must be slightly different than the register bypassing network. Compilers MUST deal with the CC, and that has been a notable pain on some machines, as it is a register that doesn't act like any other register. If you want to have multiple condition codes: An optimizer must allocate and deallocate them, i.e., it better be doing live/dead analysis, perhaps keeping a set safe across functions calls, etc, i.e., everything it needs to do for registers! BTW, the better global optimizers tend to be space-intensive, with at least some effect from the number of registers. Having a bunch of CCs that need to be tracked is NOT doing a favor for the compiler writer. Your instruction set needs to be able to address the CC register file, which thus ends up being a bunch of (different) sorts of registers, and of course, they need their own bypass network. In any case, in most normal code, the lifetime of most condition-code settings is very short, i.e., you very seldom have more than one or two lying around. When you do, if you have a machine with enough registers, you just use the registers, and all is well. Thus, having "weird" registers not only doesn't help anybody much, but it's more likely to get in the way of both hardware and software. I can't remember the references offhand, but Bill Wulf had an article many years ago, I think in CACM, that described the attributres of architectures that were good/bad for optimizers, and I think Steve Johnson had a similar article, w.r.t. the C language. It's been a long time; the latter may have been a Bell Labs internal memo. Moral: it's REAL easy for hardware to do "favors" for software that can be done without. Separate CCs are good examples. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
kers@otter.hple.hp.com (Christopher Dollin) (02/19/88)
In view of this discussion on the merits or otherwise of condition codes, especially as applied to RISCy machines, I wonder what the assembled multitudes think about the architecture of the Acorn Risc Machine (ARM)? The critical points about the ARM from this point of view are (a) it is based on condition codes rather than compare-and-branch; (b) the register operations set the CCs *optionally*; (c) *every* instruction is conditional, so some IF statements can be implemented with no branching at all; for example abs(x) -> x can be implemented as TEQ x, #0 ; tell me about (register) x RSBMI x, #0 ; if it's <0 then 0-x -> x (reverse subtract) Your turn guys ...................... Regards, Kers | "Why Lisp if you can talk Poperly?"
viggy@hpcupt1.HP.COM (Viggy Mokkarala) (02/20/88)
In article <7430@apple.UUCP> baum@apple.UUCP (Allen J. Baum) writes: >HP does have {Compare,Move,Bittest,Add}&Branch instructions. It can't set an >arbitrary condition in any bit of a register in one instruction; however, there >is a special instruction (compare&clear) that allows construction of a 2 >instruction sequence that will; it can also be used for other things, since >the instruction following compare&clear (it 'skips' on condition) can be any >instruction, not just the obvious 'load immediate 1'. There was a load >condition inst. at one time, but it was removed because it looked like it added >a gate to the critical path length. >{decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385 Thanks Allen, for pointing this out. At the risk of restating facts, I want to elaborate on the Compare and Clear instruction (for those who don't have a HPPA handbook). The Compare and Clear instruction compares two registers, clears a register, and conditionally nullifies the execution of the following instruction (based on the result of the comparison). There are 16 conditions that can be tested against. This instruction can be used to produce, in a general register, the logical value (0 or 1) of the result of a comparison. The two instruction sequence COMCLR, <> rb, rc, ra LDO 1(0),ra sets ra to 1 if rb and rc are equal and to 0 otherwise.
bcase@apple.UUCP (Brian Case) (02/20/88)
In article <682@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes: >Let me make some additional suggestions apropos of Crowl's posting. >It should be possible to address an arbitrary byte of a register >(this is useful for other purposes). One way of doing this is to >make a small address a reference to the register file; on virtual >machines this would be costless, and the amount of memory lost on >non-virtual machines would be negligeable. Some machines have done >this, but I do not believe that they have done it well; either this >can only be used in certain situations, or the user's registers are >in blocks and/or separated from main memory. If we do this, we can >keep many condition codes active for transfers. Mapping the register set of a machine into the memory address space of a simple, pipelined machine is difficult and costly (in space and time) to implement and, I staunchly claim, vitually useless (except in the case of the CRISP where the concept is central to the architecture, but in that case there "is no register file."). Believe me, even if your machine did allow memory-mapped, byte access to the register file of a fast, modern processor, you wouldn't use it because it would be so much slower than any other kind of register access (you can't know it is really a register access until the address is computed (and maybe translated), but the register file is way back there in pipeline stage two, so the machine has to sit and wait while you insert at least one bubble to perform the unexpected (to the lock-step pipeline) register access).
cik@l.cc.purdue.edu (Herman Rubin) (02/21/88)
The vaious examples of how well their computer does on the problem assumes that the branch is immediate. However, suppose that xyz is computed, and several instructions later it is desired to branch on xyz = 0, and even later, if xyz != 0, to utilize the sign? If condition codes only go to the condition code register, clearly additional work must be done. I have also encountered situations where a shift is performed, and later it is desired to branch on an overflow occurring in the process. Only having the condition codes stored and branching on the stored condition code values avoids unnecessary comparisons. It would take little hardware modification for a processor which generates condition codes to store them in a register when that is wanted, and to use them from a user-controlled register instead of the condition code register. -- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907 Phone: (317)494-6054 hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet
ccplumb@watmath.waterloo.edu (Colin Plumb) (02/22/88)
kers@otter.hple.hp.com (Christopher Dollin) wrote: >The critical points about the ARM from this point of view are (a) it is based >on condition codes rather than compare-and-branch; (b) the register operations >set the CCs *optionally*; (c) *every* instruction is conditional, so some IF >statements can be implemented with no branching at all; for example > > abs(x) -> x > >can be implemented as > > TEQ x, #0 ; tell me about (register) x > RSBMI x, #0 ; if it's <0 then 0-x -> x (reverse subtract) > >Your turn guys ...................... Well, something similar can be done with result flags in general-purpose registers. For example, if the Am29000 used -1 for true instead of 80000000, you could do that in three instructions: cplt temp, x, #0 ; temp = (x < 0) xor x, x, temp ; x ^= temp; sub x, x, temp ; x -= temp If temp is 0 after the test, this has no effect on x, but if it's -1, this inverts x and adds one, i.e. negates it. The 29000 doesn't quite do things this way, but isn't there a processor out there somewhere that does? Things like z = (x > y) ? a : b; become three instructions: cpgt z, x, y ; z = (x > y) /* z now holds 0 or -1 */ and z, z, #a-b ; z &= (a-b) /* z now holds 0 or a-b */ add z, z, #b ; z += b /* z now holds b or a */ In this case, the ARM can do no better, although it seems that it can in the general case. As pipelines grow longer, we may see skip instructions come back. They may "waste" a cycle on a no-op, but they keep the pipeline full. -- -Colin (watmath!ccplumb) Zippy says: I own seven-eighths of all the artists in downtown Burbank!
bcase@apple.UUCP (Brian Case) (02/22/88)
In article <780004@otter.hple.hp.com> kers@otter.hple.hp.com (Christopher Dollin) writes: >In view of this discussion on the merits or otherwise of condition codes, >especially as applied to RISCy machines, I wonder what the assembled multitudes >think about the architecture of the Acorn Risc Machine (ARM)? > >The critical points about the ARM from this point of view are (a) it is based >on condition codes rather than compare-and-branch; (b) the register operations >set the CCs *optionally*; (c) *every* instruction is conditional, so some IF >statements can be implemented with no branching at all; for example > > abs(x) -> x > >can be implemented as > > TEQ x, #0 ; tell me about (register) x > RSBMI x, #0 ; if it's <0 then 0-x -> x (reverse subtract) Ah! A very good question indeed. The ARM does indeed gain some advantage in some cases. Architecting (some cringe at such language) arithmetic ops this way (i.e., setting them only if the instruction says to do so) fixes one of the major problems with condition codes: One or several instructions can separate the instruction that sets the CCs and the instruction that uses the CCs. (The Berkely RISCs have this feature, don't they?) If you look at real code, you'll see that the ability to conditionally execute any instruction saves some jumps. My biggest complaints about the ARM are: it can name only 16 registers, part of the PC is taken up by some status bits (to allow much simplified internal interrupt return sequencing), and the use of traditional CCs may cause problems for future aggressive implementations, i.e. the kind of implementation that exploits out-of-order or multiple-instruction-at-a- time techniques. Other little complaints center around the somewhat longer critical path in the execute pipeline stage (barrel shift in series with ALU) that might very well cause the ARM to lag other processors in clock rate and the addressing modes (post-/pre- increment/decrement will cause a little implementation complexity in a more heavily pipelined implementation; abort/restart may be a pain). The ARM is probably one of the absolute best processor choices available for really low-cost, medium-performance (where medium now means 2x to 3x cheap 68020 system) embedded applications. However, with only 16 registers and no clear commitment to "aggressive" implementations by anyone (where are these implementations? the ARM has been available for a _long time_ by most RISC standards), I can't imagine choosing it for the main CPU of anything that has to last through several generations.
bcase@apple.UUCP (Brian Case) (02/23/88)
In article <17031@watmath.waterloo.edu> ccplumb@watmath.waterloo.edu (Colin Plumb) writes: >Well, something similar can be done with result flags in general-purpose >registers. For example, if the Am29000 used -1 for true instead of >80000000, you could do that in three instructions: > >cplt temp, x, #0 ; temp = (x < 0) >xor x, x, temp ; x ^= temp; >sub x, x, temp ; x -= temp > >If temp is 0 after the test, this has no effect on x, but if >it's -1, this inverts x and adds one, i.e. negates it. > >The 29000 doesn't quite do things this way, but isn't there a processor >out there somewhere that does? This brings up an interesting architectural point: Why didn't the 29000 use -1 for true? The answer is two part: (1) it would cost more to force 32 bus lines to one than to force only one and sink the other 31 lines. Since the computation of the one-bit boolean is probably the most critical path in the ALU part of the machine, it seemed silly to increase this path length. (2) Setting all 32 lines to 1 or zero based on the outcome of a comparison gives no more information than setting only one. Thus, the desired -1 or 0 in the cases mentioned here can be created by an arithmetic right shift by 31 places after the boolean is computed. Yes, this adds one cycle to the above sequences, which is significant here, but at least the same *algorithm* can be used with its attendant jump elimination. >Things like z = (x > y) ? a : b; become three instructions: >cpgt z, x, y ; z = (x > y) /* z now holds 0 or -1 */ >and z, z, #a-b ; z &= (a-b) /* z now holds 0 or a-b */ >add z, z, #b ; z += b /* z now holds b or a */ So make the above sequence: cpgt z,x,y sra z,z,31 and z,z,#a-b add z,z,#b (except that if a and b are constants, the 29000 requires more instructions to form them into registers since only 8-bit constants can be named for arithmetic instructions. If a and b are in registers already, then only one additional instruction is needed to form a-b.) >In this case, the ARM can do no better, although it seems that it can >in the general case. As pipelines grow longer, we may see skip >instructions come back. They may "waste" a cycle on a no-op, but they >keep the pipeline full. Skip instructions are already back! See the HP Spectrum. The idea of condition execution (as on ARM) is really just skip in disguise..