[comp.arch] Condition Codes in General Registers

crowl@cs.rochester.edu (Lawrence Crowl) (02/16/88)

Several authors have been arguing over conditional branching using condition
codes versus using booleans in general-purpose registers.  Why not store the
condition codes in a general-purpose register?  This would result in a
scheduable two-instruction sequence with multi-way branching at the cost of
using a general purpose register and slightly lower code density.  For example,

    compare R1 with R2 place condition code in R3
    branch to ADDRESS if condition code in R3 is GREATER
    branch to ADDRESS if condition code in R3 is EQUAL
    branch to ADDRESS if condition code in R3 is LESS

Will this approach work?  Will the condition code comparison in the branch cost
too much?  (It is an instruction constant.)  Comments please!
-- 
  Lawrence Crowl		716-275-9499	University of Rochester
		      crowl@cs.rochester.edu	Computer Science Department
...!{allegra,decvax,rutgers}!rochester!crowl	Rochester, New York,  14627

mash@mips.COM (John Mashey) (02/17/88)

In article <6834@sol.ARPA> crowl@cs.rochester.edu (Lawrence Crowl) writes:
>Several authors have been arguing over conditional branching using condition
>codes versus using booleans in general-purpose registers.  Why not store the
>condition codes in a general-purpose register?  ....

>    compare R1 with R2 place condition code in R3
>    branch to ADDRESS if condition code in R3 is GREATER
>    branch to ADDRESS if condition code in R3 is EQUAL
>    branch to ADDRESS if condition code in R3 is LESS

>Will this approach work?  Will the condition code comparison in the branch cost
>too much?  (It is an instruction constant.)  Comments please!

In some flavor or other, at least several RISCs do this,
such as:
	MIPS R2000  (has set-less-than op that generates 0 or 1)
	AMD 29000 (has compare insts that set sign bit of arb. reg)
	Moto 78000 (has compares that set cond codes in regs, I think)
and I think HP Precision does this also.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

brooks@lll-crg.llnl.gov (Eugene D. Brooks III) (02/17/88)

In article <6834@sol.ARPA> crowl@cs.rochester.edu (Lawrence Crowl) writes:
>Several authors have been arguing over conditional branching using condition
>codes versus using booleans in general-purpose registers.  Why not store the
>condition codes in a general-purpose register?  This would result in a
>scheduable two-instruction sequence with multi-way branching at the cost of
>using a general purpose register and slightly lower code density.  For example,
>
>    compare R1 with R2 place condition code in R3
>    branch to ADDRESS if condition code in R3 is GREATER
>    branch to ADDRESS if condition code in R3 is EQUAL
>    branch to ADDRESS if condition code in R3 is LESS
>
>Will this approach work?  Will the condition code comparison in the branch cost
>too much?  (It is an instruction constant.)  Comments please!
Yes, it works.  An example of such an engine is the Ridge 32 manufactured
by Ridge Computers Inc., of Santa Clara, CA.  No only are there instructions
to put a "condition code" in a register, there are instructions to conditional
branch with the registers involved in the comparison direction.  For example:
branch if the value in R1 is greater than the value in R2, along with a
branch prediction bit as well.  The branch takes one clock if the prediction
is correct, 4 clocks if it the prediciton bit is wrong.  It worketh, and it
worketh well.

There is a very BIG lesson here as well for those who are picking an instruction
set to "simulate".  RISC is the way to fly, but in addition one finds that the
cost of computing condition codes can be far greater than the cost of computing
the results of an instruction itself, only to either ignore the condition codes
or destroy them in the next instruction.

Choosing an instruction set with "no condition codes" is a major kick in the
butt with respect to simulator speed.  This is one of the key reasons why
the Ridge 32 instruction set, suitably generalized for the purposes of the
shared memory multiprocessor environment, was chosen for the Cerberus
multiprocessor simulator.

jmd@granite.dec.com (John Danskin) (02/18/88)

I am a fan of condition codes in general registers because I would like to be
able to turn the following code sequence:

	fcmp	r0, r1
	nop
	nop
	br	GE, somewhere

	fcmp	r0, r2
	nop
	nop
	br	GE, somewhere1

	fcmp	r0, r3
	nop
	nop
	br	GE, somewhere2

into
	fcmp	r0, r1, c1
	fcmp	r0, r2, c2
	fcmp	r0, r3, c3

	br	GE, c1, somewhere
	br	GE, c2, somewhere2
	br	GE, c3, somewhere3

Obviously two operand compare and branch is niftier, but hardly anyone can
do that in one (reasonably fast) risc cycle.

I have no idea how much this sort of thing comes up in average code,
but it makes a fair bit of difference in 3D graphics clipping computations.

I would like to say as well that it seems like a waste to put the condition
codes in GENERAL registers. The little suckers are only one or two bits
depending on how you set things up, and it would be relatively inexpensive
to have a whole bank of special condition code registers. Some vector
instruction sets do this already.

Anybody care to flame on how incredibly useless this is for the real world?
After all, I make MY living writing microcode, and we all know that nobody
has been doing that since 1843 8-).
-- 
John Danskin				| decwrl!jmd
DEC Workstation Systems Engineering	| (415)853-6724 
100 Hamilton Avenue			| My comments are my own.
Palo Alto, CA  94306			| I do not speak for DEC.

baum@apple.UUCP (Allen J. Baum) (02/18/88)

--------
[]
>In article <1585@winchester.mips.COM> mash@winchester.UUCP(John Mashey)writes:
re: conditional branching using condition codes vs. booleans in GPRs:
>
>In some flavor or other, at least several RISCs do this, such as:
....
>and I think HP Precision does this also.

HP does have {Compare,Move,Bittest,Add}&Branch instructions. It can't set an
arbitrary condition in any bit of a register in one instruction; however, there
is a special instruction (compare&clear) that allows construction of a 2
instruction sequence that will; it can also be used for other things, since
the instruction following compare&clear (it 'skips' on condition) can be any
instruction, not just the obvious 'load immediate 1'. There was a load 
condition inst. at one time, but it was removed because it looked like it added
a gate to the critical path length.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

reeder@ut-emx.UUCP (02/18/88)

In article <6834@sol.ARPA>, crowl@cs.rochester.edu (Lawrence Crowl) writes:
> using a general purpose register and slightly lower code density.  For example,
> 
>     compare R1 with R2 place condition code in R3
>     branch to ADDRESS if condition code in R3 is GREATER
>     branch to ADDRESS if condition code in R3 is EQUAL
>     branch to ADDRESS if condition code in R3 is LESS
> 
> Will this approach work?  Will the condition code comparison in the branch cost
> too much?  (It is an instruction constant.)  Comments please!

That is exactly how you do things on a Cray, for example:

          A0      A1-A2                 Compare A1 and A2
          JAP     addr1                 Branch to addr1 if A1 > A2
          JAZ     addr2                 Branch to addr2 if A1 = A2
          JAN     addr3                 Branch to addr3 if A1 < A2

William "Wills" Reeder

-- 

honk honk honk

cik@l.cc.purdue.edu (Herman Rubin) (02/19/88)

In article <868@ut-emx.UUCP>, reeder@ut-emx.UUCP writes:
> In article <6834@sol.ARPA>, crowl@cs.rochester.edu (Lawrence Crowl) writes:
	......
> >     compare R1 with R2 place condition code in R3
> >     branch to ADDRESS if condition code in R3 is GREATER
> >     branch to ADDRESS if condition code in R3 is EQUAL
> >     branch to ADDRESS if condition code in R3 is LESS
	......
> That is exactly how you do things on a Cray, for example:
> 
>           A0      A1-A2                 Compare A1 and A2
>           JAP     addr1                 Branch to addr1 if A1 > A2
>           JAZ     addr2                 Branch to addr2 if A1 = A2
>           JAN     addr3                 Branch to addr3 if A1 < A2

This is not the same.  The first instruction computes a value and 
inserts it in A0.  The other instructions perform a test on the value
in A0 and use _that_, not the bits in a condition code, to decide
whether to perform the branch.

To clearly see the difference, suppose that several condition codes
were involved in a branch situation.  On a Cray, the appropriate values
would have to be stored in various A or S registers, and _for each
situation for which a jump is contemplated_, the appropriate value
would have to be stored in A0 or S0 and then tested.  It is not possible
to directly move A0 or S0 to another A or S register, either.  On 
machines which can do the more common type of tests, it is no faster
to compute the difference and transfer on it rather than transferring
on the comparison.

Furthermore, this does not allow the condition codes to be used for
overflow or carry.  Of course, the Cray is not too well suited for
integer multiplication or any kind of division, but it is well suited
for integer addition and shifts, and that is where overflow and carry
are important.

Let me make some additional suggestions apropos of Crowl's posting.
It should be possible to address an arbitrary byte of a register
(this is useful for other purposes).  One way of doing this is to
make a small address a reference to the register file; on virtual
machines this would be costless, and the amount of memory lost on
non-virtual machines would be negligeable.  Some machines have done
this, but I do not believe that they have done it well; either this
can only be used in certain situations, or the user's registers are
in blocks and/or separated from main memory.  If we do this, we can
keep many condition codes active for transfers.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet

ok@quintus.UUCP (Richard A. O'Keefe) (02/19/88)

In article <184@granite.dec.com>, jmd@granite.dec.com (John Danskin) writes:
> I would like to say as well that it seems like a waste to put the condition
> codes in GENERAL registers. The little suckers are only one or two bits
> depending on how you set things up, and it would be relatively inexpensive
> to have a whole bank of special condition code registers. Some vector
> instruction sets do this already.
> 
What's wrong with putting the condition codes in general registers?
If you have as many registers as taking a RISC is supposed to buy you,
putting a condition code in a general register isn't going to hurt,
and there is an interesting advantage.  Having the conditional
branches be of the form
	B<cc> <reg>,<label>
means that you have a fast way of dispatching on some combinations of
bits in a register, which means dispatching on tag bits can be *faster*
than it would in a condition-code architecture.  It also means that if
you want to set up a particular set of conditions (e.g. clearing the
carry before starting a multi-precision add) you just use an ordinary
constant->register move, and don't need a special class of instructions
for setting condition registers.

mash@mips.COM (John Mashey) (02/19/88)

In article <184@granite.dec.com> jmd@granite.UUCP (John Danskin) writes:
>I am a fan of condition codes in general registers because I would like to be..
>I would like to say as well that it seems like a waste to put the condition
>codes in GENERAL registers. The little suckers are only one or two bits
>depending on how you set things up, and it would be relatively inexpensive
>to have a whole bank of special condition code registers. ...

Unfortunately, this probably is not a good idea, for both hardware and
software reasons:

In a heavily-pipeline machine (like any of the current RISCs),
the output of an ALU cycle isn't actually written back into a register
into one or more cycles later.  Nevertheless, the output must be available
as input to the next cycle, without a delay cycle, or else you take a large
performance hit.  Hence, such machines use "register bypass networks"
to let a cycle grab the previous cycle's output on the way to the
registers.

If you want to have a single condition code whose use doesn't delay a
branch, you have to bypass the condition code also, and of course, that
takes more hardware, which must be slightly different than the register
bypassing network.  Compilers MUST deal with the CC, and that has been
a notable pain on some machines, as it is a register that doesn't
act like any other register.

If you want to have multiple condition codes:
An optimizer must allocate and deallocate them, i.e., it
better be doing live/dead analysis, perhaps keeping a set
safe across functions calls, etc, i.e., everything it needs
to do for registers!  BTW, the better global optimizers tend
to be space-intensive, with at least some effect from the
number of registers.  Having a bunch of CCs that need to be
tracked is NOT doing a favor for the compiler writer.
Your instruction set needs to be able to address the CC register
file, which thus ends up being a bunch of (different) sorts of
registers, and of course, they need their own bypass network.

In any case, in most normal code, the lifetime of most condition-code
settings is very short, i.e., you very seldom have more than one or
two lying around. When you do, if you  have a machine with enough
registers, you just use the registers, and all is well.

Thus, having "weird" registers not only doesn't help anybody much,
but it's more likely to get in the way of both hardware and software.

I can't remember the references offhand, but Bill Wulf had an article
many years ago, I think in CACM, that described the attributres of
architectures that were good/bad for optimizers, and I think Steve
Johnson had a similar article, w.r.t. the C language.  It's been a long
time; the latter may have been a Bell Labs internal memo.

Moral: it's REAL easy for hardware to do "favors" for software that
can be done without.  Separate CCs are good examples.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

kers@otter.hple.hp.com (Christopher Dollin) (02/19/88)

In view of this discussion on the merits or otherwise of condition codes,
especially as applied to RISCy machines, I wonder what the assembled multitudes
think about the architecture of the Acorn Risc Machine (ARM)?

The critical points about the ARM from this point of view are (a) it is based
on condition codes rather than compare-and-branch; (b) the register operations 
set the CCs *optionally*; (c) *every* instruction is conditional, so some IF
statements can be implemented with no branching at all; for example

    abs(x) -> x

can be implemented as

    TEQ     x, #0       ; tell me about (register) x 
    RSBMI   x, #0       ; if it's <0 then 0-x -> x (reverse subtract)

Your turn guys ......................


Regards,
Kers                                    | "Why Lisp if you can talk Poperly?"

viggy@hpcupt1.HP.COM (Viggy Mokkarala) (02/20/88)

In article <7430@apple.UUCP> baum@apple.UUCP (Allen J. Baum) writes:


>HP does have {Compare,Move,Bittest,Add}&Branch instructions. It can't set an
>arbitrary condition in any bit of a register in one instruction; however, there
>is a special instruction (compare&clear) that allows construction of a 2
>instruction sequence that will; it can also be used for other things, since
>the instruction following compare&clear (it 'skips' on condition) can be any
>instruction, not just the obvious 'load immediate 1'. There was a load 
>condition inst. at one time, but it was removed because it looked like it added
>a gate to the critical path length.

>{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

Thanks Allen, for pointing this out.  At the risk of restating facts, I want to
elaborate on the Compare and Clear instruction (for those who don't have a HPPA
handbook).

The Compare and Clear instruction compares two registers, clears a register,
and conditionally nullifies the execution of the following instruction (based
on the result of the comparison).  There are 16 conditions that can be tested
against.

This instruction can be used to produce, in a general register, the logical
value (0 or 1) of the result of a comparison.

The two instruction sequence

	COMCLR, <>	rb, rc, ra
	LDO		1(0),ra

sets ra to 1 if rb and rc are equal and to 0 otherwise.

bcase@apple.UUCP (Brian Case) (02/20/88)

In article <682@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>Let me make some additional suggestions apropos of Crowl's posting.
>It should be possible to address an arbitrary byte of a register
>(this is useful for other purposes).  One way of doing this is to
>make a small address a reference to the register file; on virtual
>machines this would be costless, and the amount of memory lost on
>non-virtual machines would be negligeable.  Some machines have done
>this, but I do not believe that they have done it well; either this
>can only be used in certain situations, or the user's registers are
>in blocks and/or separated from main memory.  If we do this, we can
>keep many condition codes active for transfers.

Mapping the register set of a machine into the memory address space of a
simple, pipelined machine is difficult and costly (in space and time)
to implement and, I staunchly claim, vitually useless (except in the case
of the CRISP where the concept is central to the architecture, but in that
case there "is no register file.").  Believe me, even if your machine did
allow memory-mapped, byte access to the register file of a fast, modern
processor, you wouldn't use it because it would be so much slower than
any other kind of register access (you can't know it is really a register
access until the address is computed (and maybe translated), but the
register file is way back there in pipeline stage two, so the machine has
to sit and wait while you insert at least one bubble to perform the
unexpected (to the lock-step pipeline) register access).

cik@l.cc.purdue.edu (Herman Rubin) (02/21/88)

The vaious examples of how well their computer does on the problem assumes
that the branch is immediate.  However, suppose that xyz is computed, and
several instructions later it is desired to branch on xyz = 0, and even
later, if xyz != 0, to utilize the sign?  If condition codes only go to
the condition code register, clearly additional work must be done.  I have
also encountered situations where a shift is performed, and later it is
desired to branch on an overflow occurring in the process.  Only having
the condition codes stored and branching on the stored condition code values
avoids unnecessary comparisons.

It would take little hardware modification for a processor which generates
condition codes to store them in a register when that is wanted, and to use
them from a user-controlled register instead of the condition code register.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (ARPA or UUCP) or hrubin@purccvm.bitnet

ccplumb@watmath.waterloo.edu (Colin Plumb) (02/22/88)

kers@otter.hple.hp.com (Christopher Dollin) wrote:
>The critical points about the ARM from this point of view are (a) it is based
>on condition codes rather than compare-and-branch; (b) the register operations 
>set the CCs *optionally*; (c) *every* instruction is conditional, so some IF
>statements can be implemented with no branching at all; for example
>
>    abs(x) -> x
>
>can be implemented as
>
>    TEQ     x, #0       ; tell me about (register) x 
>    RSBMI   x, #0       ; if it's <0 then 0-x -> x (reverse subtract)
>
>Your turn guys ......................

Well, something similar can be done with result flags in general-purpose
registers.  For example, if the Am29000 used -1 for true instead of
80000000, you could do that in three instructions:

cplt	temp, x, #0	; temp = (x < 0)
xor	x, x, temp	; x ^= temp;
sub	x, x, temp	; x -= temp

If temp is 0 after the test, this has no effect on x, but if
it's -1, this inverts x and adds one, i.e. negates it.

The 29000 doesn't quite do things this way, but isn't there a processor
out there somewhere that does?

Things like z = (x > y) ? a : b; become three instructions:
cpgt	z, x, y		; z = (x > y)	/* z now holds 0 or -1	*/
and	z, z, #a-b	; z &= (a-b)	/* z now holds 0 or a-b */
add	z, z, #b	; z += b	/* z now holds b or a	*/

In this case, the ARM can do no better, although it seems that it can
in the general case.  As pipelines grow longer, we may see skip
instructions come back.  They may "waste" a cycle on a no-op, but they
keep the pipeline full.
--
	-Colin (watmath!ccplumb)

Zippy says:
I own seven-eighths of all the artists in downtown Burbank!

bcase@apple.UUCP (Brian Case) (02/22/88)

In article <780004@otter.hple.hp.com> kers@otter.hple.hp.com (Christopher Dollin) writes:
>In view of this discussion on the merits or otherwise of condition codes,
>especially as applied to RISCy machines, I wonder what the assembled multitudes
>think about the architecture of the Acorn Risc Machine (ARM)?
>
>The critical points about the ARM from this point of view are (a) it is based
>on condition codes rather than compare-and-branch; (b) the register operations 
>set the CCs *optionally*; (c) *every* instruction is conditional, so some IF
>statements can be implemented with no branching at all; for example
>
>    abs(x) -> x
>
>can be implemented as
>
>    TEQ     x, #0       ; tell me about (register) x 
>    RSBMI   x, #0       ; if it's <0 then 0-x -> x (reverse subtract)

Ah!  A very good question indeed.  The ARM does indeed gain some advantage
in some cases.  Architecting (some cringe at such language) arithmetic ops
this way (i.e., setting them only if the instruction says to do so) fixes
one of the major problems with condition codes:  One or several instructions
can separate the instruction that sets the CCs and the instruction that uses
the CCs.  (The Berkely RISCs have this feature, don't they?)  If you look
at real code, you'll see that the ability to conditionally execute any
instruction saves some jumps.

My biggest complaints about the ARM are:  it can name only 16 registers,
part of the PC is taken up by some status bits (to allow much simplified
internal interrupt return sequencing), and the use of traditional CCs may
cause problems for future aggressive implementations, i.e. the kind of
implementation that exploits out-of-order or multiple-instruction-at-a-
time techniques.  Other little complaints center around the somewhat
longer critical path in the execute pipeline stage (barrel shift in series
with ALU) that might very well cause the ARM to lag other processors in
clock rate and the addressing modes (post-/pre- increment/decrement will
cause a little implementation complexity in a more heavily pipelined
implementation; abort/restart may be a pain).

The ARM is probably one of the absolute best processor choices available
for really low-cost, medium-performance (where medium now means 2x to 3x
cheap 68020 system) embedded applications.  However, with only 16
registers and no clear commitment to "aggressive" implementations by
anyone (where are these implementations?  the ARM has been available for
a _long time_ by most RISC standards), I can't imagine choosing it for the
main CPU of anything that has to last through several generations.

bcase@apple.UUCP (Brian Case) (02/23/88)

In article <17031@watmath.waterloo.edu> ccplumb@watmath.waterloo.edu (Colin Plumb) writes:
>Well, something similar can be done with result flags in general-purpose
>registers.  For example, if the Am29000 used -1 for true instead of
>80000000, you could do that in three instructions:
>
>cplt	temp, x, #0	; temp = (x < 0)
>xor	x, x, temp	; x ^= temp;
>sub	x, x, temp	; x -= temp
>
>If temp is 0 after the test, this has no effect on x, but if
>it's -1, this inverts x and adds one, i.e. negates it.
>
>The 29000 doesn't quite do things this way, but isn't there a processor
>out there somewhere that does?

This brings up an interesting architectural point:  Why didn't the 29000
use -1 for true?  The answer is two part:  (1) it would cost more to force
32 bus lines to one than to force only one and sink the other 31 lines.
Since the computation of the one-bit boolean is probably the most critical
path in the ALU part of the machine, it seemed silly to increase this
path length.  (2) Setting all 32 lines to 1 or zero based on the outcome
of a comparison gives no more information than setting only one.  Thus,
the desired -1 or 0 in the cases mentioned here can be created by an
arithmetic right shift by 31 places after the boolean is computed.  Yes,
this adds one cycle to the above sequences, which is significant here,
but at least the same *algorithm* can be used with its attendant jump
elimination.

>Things like z = (x > y) ? a : b; become three instructions:
>cpgt	z, x, y		; z = (x > y)	/* z now holds 0 or -1	*/
>and	z, z, #a-b	; z &= (a-b)	/* z now holds 0 or a-b */
>add	z, z, #b	; z += b	/* z now holds b or a	*/

So make the above sequence:
cpgt	z,x,y
sra	z,z,31
and	z,z,#a-b
add	z,z,#b
(except that if a and b are constants, the 29000 requires more instructions
to form them into registers since only 8-bit constants can be named for
arithmetic instructions.  If a and b are in registers already, then only
one additional instruction is needed to form a-b.)

>In this case, the ARM can do no better, although it seems that it can
>in the general case.  As pipelines grow longer, we may see skip
>instructions come back.  They may "waste" a cycle on a no-op, but they
>keep the pipeline full.

Skip instructions are already back!  See the HP Spectrum.  The idea of
condition execution (as on ARM) is really just skip in disguise..