[comp.arch] conditional branches

roger@telesoft.UUCP (Roger Arnold @prodigal) (02/11/88)

There's been a lot of discussion lately on the utility of FREQUENCY
statements and the subject of branch prediction, etc.  But nobody has 
been addressing an issue that looks (to me) to offer more potential
for improving performance.

A significant fraction of real code always seems to be devoted to
repeating tests whose outcome is "already known".  I'm thinking of
things like tests on input parameters, when the tests occur multiple
times within the same subprogram activation.  There may be a standard
technical term for what I'm talking about, but I'm not aware of it.
"Loop invariant conditional expressions" are a major subcategory, but
the category I want includes any test whose outcome can be determined
by another test or tests preceding it in the data flow--recognizing
that "another test" could include a prior execution of the same test.

It's easy for an optimizer that does global data flow analysis to
recognize such tests, but no architecture that I know of gives it  
anything very interesting to do with them.  Sure, for a complex loop 
invariant test, the optimizer can move the test into the loop pre-
header, and substitute a simple test on a boolean variable inside the 
loop.  Big deal.  The test on the variable and the conditional branch 
still end up sitting in the loop, taking up code space and execution 
cycles.  Or the optimizer can get wild, and set up one copy of the 
loop as the THEN part of an IF THEN ELSE, and another as the ELSE 
part.  That way the invariant tests, and any code that becomes dead 
as a result of the test hoisting, are eliminated from the loops.
Execution efficiency is optimized, but the high cost in code space 
makes it practical only for small loops.
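
In C, the two alternatives look something like this (just a sketch;
the names are invented):

    /* Before: the invariant test sits inside the loop. */
    void scale(int a[], int n, int up, int step)
    {
        int i;
        for (i = 0; i < n; i++) {
            if (up)                 /* invariant: tested n times */
                a[i] += step;
            else
                a[i] -= step;
        }
    }

    /* After the "wild" transformation: one copy of the loop per
       outcome.  The test runs once, but the code nearly doubles. */
    void scale2(int a[], int n, int up, int step)
    {
        int i;
        if (up)
            for (i = 0; i < n; i++)
                a[i] += step;
        else
            for (i = 0; i < n; i++)
                a[i] -= step;
    }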

What I'd like to see in an architecture, as a minimum, would be a set
of one bit registers into which boolean values can be written.  It's
silly to have to tie up an entire 32-bit register to hold a boolean
value.  Especially since the register will still have to be tested to
make its value accessible to the branch unit.  With addressable flag
registers, conditional branch instructions could directly reference 
the flag of interest, with no preceding test instruction.  The flag 
registers could be available to an early stage of the instruction 
decode pipe, so that the branch instructions referencing them would 
be promptly converted to no-ops or unconditional branches, as
appropriate.  Five of the flag registers could be reserved for the
standard ALU flags (zero, negative, positive, carry, and overflow), 
so that only one general form of conditional branch would be needed.
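
To make the idea concrete, here's a rough C model of the proposal (the
flag numbering and the macro names are invented for illustration):

    /* Thirty-two 1-bit flag registers, modeled as bits of one word;
       the first five are reserved for the standard ALU flags. */
    #define F_ZERO  0
    #define F_NEG   1
    #define F_POS   2
    #define F_CARRY 3
    #define F_OVFL  4

    static unsigned flags;      /* the flag register file */

    /* write a boolean into flag register f */
    #define SETFLAG(f, b) \
        (flags = (flags & ~(1u << (f))) | ((unsigned)((b) != 0) << (f)))

    /* a conditional branch references FLAG(f) directly, with no
       preceding test instruction */
    #define FLAG(f) ((flags >> (f)) & 1u)

A repeated parameter test then costs one SETFLAG up front and a plain
branch on FLAG everywhere else.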

Now, who's going to tell me that I've just described their favorite
machine from out of the past?  Surely SOMEONE has used this idea
before?

- Roger Arnold				..sdcsvax!telesoft!roger

earl@mips.COM (Earl Killian) (02/12/88)

Roger Arnold proposes that branches be implemented as computing
booleans (to one of 32 1-bit homes to be specific) and testing them in
a separate instruction, for the purpose of optimizing tests.

The problem with this is that it makes a conditional a two instruction
sequence.  With what we know about instruction frequencies, this is
not a good idea.  The abstract concept of "compare two values and
branch accordingly" represents around 15% of the instructions executed
on a computer.  To make this take two instructions increases your
instruction count by 15% (and thus, in many cases, your cycle count by
15%).

Packaging the abstract concept of conditional branching into one
instruction rather than two is one of the numerous ways that some RISC
machines go faster than other architectures.  What I've found quite
surprising is that some recent architectures (e.g. Berkeley RISC and
its clone, SPARC), used condition codes after studying instruction
stream statistics.  I suppose compare and branch in one instruction
was considered too difficult, at the time.

daryl@hpcllcm.HP.COM (Daryl Odnert) (02/13/88)

The Hewlett-Packard Precision Architecture (a.k.a. Spectrum) includes
some conditional branch instructions that may be of interest.

There are instructions that allow you to
   -- compare two registers and conditionally branch based on the
      result of the comparison

   -- compare a register vs. an immediate and conditionally branch
      based on the result of the comparison

   -- add two registers and conditionally branch based on certain
      relationships between the operands

   -- add an immediate to a register and conditionally branch based
      on certain relationships between the operands

Perhaps more directly related to Roger's suggestion are the following
two instructions:
   
   BB -- branch on bit: tests a bit at a fixed position in a register
      and branches based on whether the bit is on or off (depending on
      a bit set in the instruction word).

   BVB -- branch on variable bit: conditionally branches based on a bit
       at a position in a register specified by the value in another register.

These last two instructions allow an optimizing compiler to perform
the transformation suggested.  I'm not sure whether the test is
done early enough in the pipeline for the hardware to treat the
instruction as a no-op or an unconditional branch.  The answer may be
different depending on which hardware implementation you look at.
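
As an illustration of the transformation (a C sketch, not actual
Precision Architecture code; all the names are invented):

    extern void report(int x);

    void process(int a[], int n, int x, int limit, int mode)
    {
        unsigned flags = 0;
        int i;

        /* compute each predicate once, parking it in a known bit */
        if (x > limit)  flags |= 1;     /* bit 0 */
        if (mode == 2)  flags |= 2;     /* bit 1 */

        for (i = 0; i < n; i++) {
            if (flags & 1)              /* candidate for a single BB */
                a[i] = 0;
            if (flags & 2)              /* likewise */
                report(a[i]);
        }
    }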

Daryl Odnert
Hewlett-Packard Computer Language Lab
{outside world}!hplabs!hpcllcm!daryl

baum@apple.UUCP (Allen J. Baum) (02/13/88)

--------
[]
>In <191@telesoft.UUCP> roger@telesoft.UUCP (Roger Arnold @prodigal) writes:
>What I'd like to see in an architecture, as a minimum, would be a set
>of one bit registers into which boolean values can be written.
>With addressable flag registers, conditional branch instructions could 
>directly reference the flag of interest, with no preceding test instruction.
>  Five of the flag registers could be reserved for the
>standard ALU flags (zero, negative, positive, carry, and overflow), 
>so that only one general form of conditional branch would be needed.
>
>Now, who's going to tell me that I've just described their favorite
>machine from out of the past?  Surely SOMEONE has used this idea
>before?

IBM uses exactly this scheme in the RT/PC, although they only have one
general purpose bit that can hold any boolean result. The instruction
encodings will allow conditional branches on the state of the low 16
(sometimes 8) bits of the Condition Status Register. Of these 16 bits,
9 are reserved, one is permanently 0, there are <, =, >, C, and V conditions,
and one general purpose 'test bit' that can be set with a move-to-testbit
instruction. Presumably, the reserved bits could be used for more of the
'testbits', although more  move-to-testbit instructions would be required.
Only bits in registers can be moved to/from the testbit.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

bcase@apple.UUCP (Brian Case) (02/13/88)

In article <1556@gumby.mips.COM> earl@mips.COM (Earl Killian) writes:
>The problem with this is that it makes a conditional a two instruction
>sequence.  With what we know about instruction frequencies, this is
>not a good idea.  The abstract concept of "compare two values and
>branch accordingly" represents around 15% of the instructions executed
>on a computer.  To make this take two instructions increases your
>instruction count by 15% (and thus, in many cases, your cycle count by
>15%).
>
>Packaging the abstract concept of conditional branching into one
>instruction rather than two is one of the numerous ways that some RISC
>machines go faster than other architectures.  What I've found quite
>surprising is that some recent architectures (e.g. Berkeley RISC and
>its clone, SPARC), used condition codes after studying instruction
>stream statistics.  I suppose compare and branch in one instruction
>was considered too difficult, at the time.

The Am29000 provides at least one other variation:  conditional branch
instructions look at only the most significant bit of a register to decide
whether or not to branch.  Thus, this architecture requires two instructions
to accomplish compare/branch, but the result of the compare is a data value
like any other.  (Note that test for 32-bit twos-complement negative is
"free."  This comes in handy, very handy, for simulation of other
architectures!)  The computation of the boolean is subject to many
optimizations, not the least of which is scheduling.  The compare
instruction often ends up in a branch or load/store delay slot (sorry, I
don't have numbers for this).
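
Roughly, in C (a sketch of the convention; the particular TRUE/FALSE
values shown are illustrative):

    /* A compare yields an ordinary data word whose sign bit is the
       boolean; the branch looks only at that sign bit. */
    #define TRUE29K  0x80000000u
    #define FALSE29K 0x00000000u

    unsigned cplt(long a, long b)       /* "compare less than" */
    {
        return (a < b) ? TRUE29K : FALSE29K;
    }

    /* any negative number already has its sign bit set, which is
       why the test for two's-complement negative comes free */
    int branch_taken(unsigned boolean)
    {
        return (boolean >> 31) != 0;
    }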

As Earl points out, however, there is no question that a savings can be
had by providing compare/branch; the savings is mostly space, in my opinion.
I can tell you that there isn't time in the current Am29000 implementation
for a compare in the register read pipeline stage.  There might be time for
a zero-detection, but I doubt it.  As I understand it, the MIPS pipeline
does have a zero-detection in the register read stage, and I can believe
having it is a win.  For the Am29000, compare/branch would require either:
(1) a slower clock and unbalanced pipe, or (2) an extra delay slot or pipe
bubble for branches.  Neither of these options (I believe) is better than
a two-instruction compare and branch.  Different architectures require
different trade-offs to ensure "good" implementations.  The Am29000 register
file, just like the MIPS compare/branch, is a win.  Who wins more?  Oh please,
not that again!  :-) :-)

    bcase

roger@telesoft.UUCP (Roger Arnold @prodigal) (02/13/88)

In article <1556@gumby.mips.COM>, earl@mips.COM (Earl Killian) writes:
> Roger Arnold proposes [..] (boolean registers)
> 
> The problem with this is that it makes a conditional a two instruction
> sequence.  [..]  The abstract concept of "compare two values and
> branch accordingly" represents around 15% of the instructions executed
> on a computer.  To make this take two instructions increases your
> instruction count by 15% (and thus, in many cases, your cycle count by
> 15%).
> 
> [..] What I've found quite
> surprising is that some recent architectures (e.g. Berkeley RISC and
> its clone, SPARC), used condition codes after studying instruction
> stream statistics.  I suppose compare and branch in one instruction
> was considered too difficult, at the time.

I'm no authority, but it doesn't surprise me at all that a RISC would
use condition codes.  Compare and branch, in one instruction, strikes 
me as more at home in a CISC than a RISC architecture.  For one thing, 
if the instruction cycle is closely tuned to the basic ALU cycle, there 
won't be time to feed back the result of a compare to the branch in one
cycle.  To allow it, the cycle would have to be extended, at a major
performance cost to the rest of the instruction set.  

Of course, marvelous things are possible with pipelining and parallel
lookahead, but there's still a second consideration: there aren't
enough bits to specify two operands and a target address in one fixed 
size instruction.  Not, at least, without imposing restrictions that do 
ugly things to the instruction set's orthogonality.  (E.g., both operands 
in a register, target address a limited displacement PC relative mode).  
Then, too, condition codes enable instruction scheduling; very good for 
pipelining and low level parallelism in top end hardware.

As to the 15% "compare two values" and branch, I've no trouble at all
believing that number.  In fact, it feels conservative, if it includes
comparisons to zero.  But it was consideration of how many of those
comparisons only repeat tests previously made that prompted me to 
suggest boolean registers.  Sorry, no numbers.  Just a programmer's 
notoriously unreliable "gut feeling", based on looking at a LOT of code 
over the years.

- Roger Arnold				..ucsd!telesoft!roger

mac3n@babbage.acc.virginia.edu (Alex Colvin) (02/14/88)

> >What I'd like to see in an architecture, as a minimum, would be a set
> >of one bit registers into which boolean values can be written.

> IBM uses exactly this scheme in the RT/PC, although they only have one
> general purpose bit that can hold any boolean result.

So does the Honeywell Level 6 mini, now called the DPS6.  This gives you a
convenient place to return a status bit, but no substatus.  A problem is
that many languages aren't prepared to use such a register.  Seen any BIT
types in C?

aglew@ccvaxa.UUCP (02/15/88)

..> Comparisons, branches, and booleans

The test x = 0, and therefore x = y, can almost certainly be done with
minimum delay in a pipeline, as can x < 0, looking just at the sign bit.
A full arithmetic compare might be a problem because of carry.

When storing the result of a compare into a register, it might be better
to store all the CC bits, rather than just a boolean corresponding to
what you wanted to test.  E.g.:

	compare result := x ? y
	if result was < branch foo
	if result was > branch bar
	if result was unordered branch foobar
	...

And have a loose encoding so that the "if result was ..." instruction is cheap.
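
For instance, a one-hot encoding (invented here) makes every "if result
was ..." a single AND, sketched in C:

    /* loose CC encoding: each outcome owns a bit */
    #define CC_LT    1u
    #define CC_EQ    2u
    #define CC_GT    4u
    #define CC_UNORD 8u

    unsigned fcompare(double x, double y)
    {
        if (x < y)  return CC_LT;
        if (x > y)  return CC_GT;
        if (x == y) return CC_EQ;
        return CC_UNORD;                /* NaN on either side */
    }

    /* "branch if result was <=" tests two bits at once:
       if (fcompare(x, y) & (CC_LT | CC_EQ)) goto foo;  */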

This is similar to what some vector processors do with vector compares: create
a vector of CCs, and be able to do multiple tests with them without rescanning
the vector. 


Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   
    aglew@gould.com     	- preferred, if you have nameserver
    aglew@gswd-vms.gould.com    - if you don't
    aglew@gswd-vms.arpa 	- if you use DoD hosttable
    aglew%mycroft@gswd-vms.arpa - domains are supposed to make things easier?
   
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

mash@mips.COM (John Mashey) (02/15/88)

In article <208@telesoft.UUCP> roger@telesoft.UUCP (Roger Arnold @prodigal) writes:
>In article <1556@gumby.mips.COM>, earl@mips.COM (Earl Killian) writes:
>> Roger Arnold proposes [..] (boolean registers)

>I'm no authority, but it doesn't surprise me at all that a RISC would
>use condition codes.  Compare and branch, in one instruction, strikes 
>me as more at home in a CISC than a RISC architecture.  For one thing, 
>if the instruction cycle is closely tuned to the basic ALU cycle, there 
>won't be time to feed back the result of a compare to the branch in one
>cycle.  To allow it, the cycle would have to be extended, at a major
>performance cost to the rest of the instruction set.  

It is difficult to do a general compare ( a < b ) and branch in one cycle,
without the effects noted.  If you're careful, you can do:
	register a = register b 
	register a != register b 
		compares against 0 fall out by using reg b == 0
	register a < 0  [i.e., sign bit == 1]
	register a >= 0 [sign bit == 0]
	register a > 0 [i.e., sign bit == 0, at least one nonzero bit elsewhere]
	register a <= 0 [sign bit == 1, or sign == 0 and rest == 0]
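
In C terms, assuming 32-bit operands held in unsigneds, each of these
needs only zero-detection or a look at the sign bit, never a full
subtract:

    int beq (unsigned a, unsigned b) { return (a ^ b) == 0; }
    int bne (unsigned a, unsigned b) { return (a ^ b) != 0; }
    int bltz(unsigned a) { return (a >> 31) != 0; }               /* a <  0 */
    int bgez(unsigned a) { return (a >> 31) == 0; }               /* a >= 0 */
    int bgtz(unsigned a) { return (a >> 31) == 0 && a != 0; }     /* a >  0 */
    int blez(unsigned a) { return (a >> 31) != 0 || a == 0; }     /* a <= 0 */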

>
>Of course, marvelous things are possible with pipelining and parallel
>lookahead, but there's still a second consideration: there aren't
>enough bits to specify two operands and a target address in one fixed 
>size instruction.  Not, at least, without imposing restrictions that do 
>ugly things to the instruction set's orthogonality.  (E.g., both operands 
>in a register, target address a limited displacement PC relative mode).  
>Then, too, condition codes enable instruction scheduling; very good for 
>pipelining and low level parallelism in top end hardware.

Note also the difference between using a "compare" or "set" command that
sets one or more bits in an arbitrary GP register, and a compare that sets
a real condition code register.  The former gives compilers much more latitude
for code scheduling, and actually turns out to simplify the hardware,
apparently: the "bypassing" hardware that lets you grab results in the next
instruction, before the results have actually gotten back to the registers,
is simpler because all it ever has to handle is a uniformly-sized register,
not register + condition code.  [Probably not a huge deal, but every bit less
unnecessary complexity...]
Compilers do NOT like condition codes at all: a lot of very clean algorithms
get uglified by having to deal with magically different condition codes.
The restriction noted isn't particularly ugly: actually, it fit into
our scheme quite easily.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

sjc@mips.COM (Steve "The" Correll) (02/15/88)

In article <208@telesoft.UUCP>, roger@telesoft.UUCP (Roger Arnold @prodigal) writes:
> ...there aren't
> enough bits to specify two operands and a target address in one fixed 
> size instruction.  Not, at least, without imposing restrictions that do 
> ugly things to the instruction set's orthogonality.  (E.g., both operands 
> in a register, target address a limited displacement PC relative mode).  
> Then, too, condition codes enable instruction scheduling; very good for 
> pipelining and low level parallelism in top end hardware.

As an illustration, the MIPS compare-and-branch allocates 6 bits to an opcode,
5 bits to each register field, and 16 bits to a word-offset target.

The register-operand-only restriction seems painless in practice (if a value
is being tested often, the optimizer is happy to keep it in a register, and
one of the registers acts like a literal zero) and we have immediate "set
less than" instructions which set the low-order bit of a register for testing
later on. This is similar to a condition code, but uses any available
general-purpose register rather than a dedicated boolean. Sometimes that
really pays off. For example, I like to avoid branches by implementing the
three-way comparison used by Unix "qsort" (yielding -1/0/+1) as:

  (t0 > t1) - (t0 < t1)

This translates into two "set less than"s and a "subtract"; it's possible
precisely because the result of each comparison goes into a general register
rather than a dedicated boolean. One can similarly use arithmetic in lieu of
branches for certain range-checks, and one can use "set less than" followed
by an ordinary add to propagate a carry in multiple-precision arithmetic; I
can't prove it's superior to a carry flag plus "addc" and "subc" instructions,
but it's satisfyingly simple and elegant.
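
Spelled out in C (a sketch; 32-bit unsigneds assumed for the carry
trick):

    /* branch-free three-way comparison for qsort; each relational
       compiles to one "set less than" */
    int intcmp(const void *pa, const void *pb)
    {
        int t0 = *(const int *)pa;
        int t1 = *(const int *)pb;
        return (t0 > t1) - (t0 < t1);
    }

    /* carry propagation with "set less than": if the low-word sum
       wrapped around, it is smaller than either addend */
    void add64(unsigned *hi, unsigned *lo,
               unsigned ahi, unsigned alo,
               unsigned bhi, unsigned blo)
    {
        *lo = alo + blo;
        *hi = ahi + bhi + (*lo < alo);  /* "sltu" feeding an add */
    }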

The 16-bit offset restriction is almost never a problem, but loops with more
than 128k bytes of code inside them do exist, and they are indeed a pain.

Condition codes can be good or bad for instruction-scheduling: in some CISC
architectures, so many instructions bash so many condition codes that you
have to keep the test and the branch adjacent anyway.

-- 
...decwrl!mips!sjc						Steve Correll

bcase@apple.UUCP (Brian Case) (02/16/88)

In article <208@telesoft.UUCP> roger@telesoft.UUCP (Roger Arnold @prodigal) writes:
>In article <1556@gumby.mips.COM>, earl@mips.COM (Earl Killian) writes:
>> Roger Arnold proposes [..] (boolean registers)
>> 
>> The abstract concept of "compare two values and
>> branch accordingly" represents around 15% of the instructions executed
>> on a computer.  To make this take two instructions increases your
>> instruction count by 15% (and thus, in many cases, your cycle count by
>> 15%).
>I'm no authority, but it doesn't surprise me at all that a RISC would
>use condition codes.  Compare and branch, in one instruction, strikes 
>me as more at home in a CISC than a RISC architecture.  For one thing, 
>if the instruction cycle is closely tuned to the basic ALU cycle, there 
>won't be time to feed back the result of a compare to the branch in one
>cycle.  To allow it, the cycle would have to be extended, at a major
>performance cost to the rest of the instruction set.  

Er, not really true since condition codes come from the ALU and have to meet
a setup time for the condition code register anyway.  Adding compare and
branch has its worst effects by either putting the register file PLUS the
ALU in a critical path or delaying the effect of conditional branches by
one more cycle (at least this is how I see it; did I goof?).  Yes, there
is some cost, but adding a gate or two to the ALU stage is probably not
going to cause it to be the most critical path (cache access (TLB is a
cache) is probably the criticalest path).

>Of course, marvelous things are possible with pipelining and parallel
>lookahead, but there's still a second consideration: there aren't
>enough bits to specify two operands and a target address in one fixed 
>size instruction.  Not, at least, without imposing restrictions that do 
>ugly things to the instruction set's orthogonality.  (E.g., both operands 
>in a register, target address a limited displacement PC relative mode).  
>Then, too, condition codes enable instruction scheduling; very good for 
>pipelining and low level parallelism in top end hardware.

Wait a minute, condition codes HINDER instruction scheduling, not "enable"
it.  The problem is that, even if each instruction can decide to set the
cond. codes based on a bit in the instruction, there is only one condition
code register; what if the result of a compare needs to be saved for later?
What if you (or the compiler) want to move a compare to some distant spot
to fill up a branch or load delay slot but there is an intervening instruction
that sets the condition codes?  Outa luck.  Your "boolean registers" idea
definitely mitigates this problem, perhaps eliminating it (but it creates
others).

>As to the 15% "compare two values" and branch, I've no trouble at all
>believing that number.  In fact, it feels conservative, if it includes
>comparisons to zero.  But it was consideration of how many of those
>comparisons only repeat tests previously made that prompted me to 
>suggest boolean registers.  Sorry, no numbers.  Just a programmer's 
>notoriously unreliable "gut feeling", based on looking at a LOT of code 
>over the years.

Your boolean register idea is simply a less general, special case of saving
the result of a comparison in a general data register.  I suppose we could
argue about whether or not it is really necessary to have the full generality
of booleans in data registers, but I can say that I (personally, this is
just me) would rather write a compiler for a machine that has either one (1)
condition code register or booleans in data registers.  It's the old
"there should be zero, one, or (large) n of something."

jesup@pawl1.pawl.rpi.edu (Randell E. Jesup) (02/16/88)

In article <1556@gumby.mips.COM> earl@mips.COM (Earl Killian) writes:
>Roger Arnold proposes that branches be implemented as computing
>booleans (to one of 32 1-bit homes to be specific) and testing them in
>a separate instruction, for the purpose of optimizing tests.
...
>Packaging the abstract concept of conditional branching into one
>instruction rather than two is one of the numerous ways that some RISC
>machines go faster than other architectures.  What I've found quite
>surprising is that some recent architectures (e.g. Berkeley RISC and
>its clone, SPARC), used condition codes after studying instruction
>stream statistics.  I suppose compare and branch in one instruction
>was considered too difficult, at the time.

Think about compare & branch from a hardware point of view.  To do it in
one cycle, you must fetch two values, run them through the ALU, and get the
result.  Now you have the information that allows you to determine whether
to branch.  You must also determine the branch destination.  This may also
require some computation, an addition to the PC (though it might be sped up
a little by knowing the offset is some small number of bits, which it has
to be, given a 32-bit instruction).  If you're willing to build another
fast adder for this computation and run it in parallel, you MIGHT be able
to pull it off, though I doubt it.  It would cost LOTS of chip area, and
would probably be your critical path that determines your cycle time (certainly
it would be if you didn't have a parallel adder!)

I'm not saying that conditions codes are the answer, just that the one-
instruction compare & branch creates some big problems for chip designers
at the hardware level.  If you go to the ISSCC (?), check out the solution
that the GE RPM-40 uses (not that it's perfect either).  More I think I
should not say before the conference.

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

aglew@ccvaxa.UUCP (02/18/88)

..> Compare and branch.

In case it needs to be restated, special cases of compare and branch
*DO* *NOT* need to be run through the ALU: x = y, x != y, x < 0, etc.

earl@mips.COM (Earl Killian) (02/18/88)

In article <375@imagine.PAWL.RPI.EDU> jesup@pawl1.pawl.rpi.edu (Randell E. Jesup) writes:

   Think about compare & branch from a hardware point of view.  To do
   it in one cycle, you must fetch two values, run them through the
   ALU, and get the result.  Now you have the information that allows
   you to determine whether to branch.  You must also determine the
   branch destination.  This may also require some computation, an
   addition to the PC (though it might be sped up a little by knowing
   the offset is some small number of bits, which it has to be, given a
   32-bit instruction).  If you're willing to build another fast adder
   for this computation and run it in parallel, you MIGHT be able to
   pull it off, though I doubt it.  It would cost LOTS of chip area,
   and would probably be your critical path that determines your cycle
   time (certainly it would be if you didn't have a parallel adder!)

Hardware makes things go faster.  That's why RISC machines tend to
have more hardware in them than CISCs (they find room for the extra
hardware by tossing out the firmware, for a net savings).  It is
perfectly reasonable to dedicate an adder to computing
PC+branchdisplacement on every instruction (not just branch
instructions), and selecting between that and PC+1 based on the branch
decision.  Perfectly reasonable because that one adder just added 10%
to your performance.

Branch decisions can have practically the same timing constraints as
load/store instructions in a simple pipeline; if you can do the
address add for the load/stores, then you can do the branch decision.
The details depend on your pipeline.  The MIPS R2000 pipeline is not
quite as generous to branch decisions as a simple pipeline because it
has virtual to physical translation in series with cache access, which
is why it leaves out the X < Y compare and branch.  It does do X = Y,
and X ? 0, which are most of the compare and branches.  The end result
is that the MIPS architecture is about 10% more efficient than
condition-code architectures from branches alone (i.e. needs to
execute 10% fewer instructions).

firth@sei.cmu.edu (Robert Firth) (02/18/88)

Some of our work here confirms the view that condition codes are
not a very good idea.  Compiling the same code for MIPS M/500
and Vax-11, using the same compiler technology, gave these results:

MIPS: 6.1% of instructions were conditional branches
      1.2% were 'set condition' instructions

VAX: 10.0% conditional branches
      8.9% tests or comparisons

MIPS provides only comparison against zero in the conditional
branch.  This took care of about 80% of branches; the other 20%
required a prior set condition instruction.

Vax branches against the condition codes, but loses because
nearly all the branches then required a prior instruction to
set them.  Here is an extract from our report, discussing the
Vax statistics:

	"The code generator slaves the condition codes
	 religiously, both through linear code and across
	 control transfers.  Nevertheless, of 415 conditional
	 branches, 371 (almost 90%) required a prior test
	 or compare to set the condition codes; of 2169
	 normal instructions that set the condition codes,
	 only 44 (about 2%) did so to any purpose.  It is
	 hard to avoid the conclusion that condition codes
	 are a waste of time, effort, and silicon."

bcase@apple.UUCP (Brian Case) (02/20/88)

In article <4252@aw.sei.cmu.edu> firth@bd.sei.cmu.edu.UUCP (Robert Firth) writes:
>Some of our work here confirms the view that condition codes are
>not a very good idea.
>	"The code generator slaves the condition codes
>	 religiously, both through linear code and across
>	 control transfers.  Nevertheless, of 415 conditional
>	 branches, 371 (almost 90%) required a prior test
>	 or compare to set the condition codes; of 2169
>	 normal instructions that set the condition codes,
>	 only 44 (about 2%) did so to any purpose.  It is
>	 hard to avoid the conclusion that condition codes
>	 are a waste of time, effort, and silicon."

(Here we go again...)  There is a paper by Robert D. Russell, "The PDP-11:
A Case Study of How _NOT_ To Design Condition Codes," that, although for
a different architecture, says exactly the same thing.  Excerpts from this
old, old paper:

"A sample of several hand-coded PDP-11 modules was taken at random from the
Oh my God------------^^^^^^^^^^
RSX-11M operating system, the DEC-10 communications front end (which is
handled by dedicated PDP-11s), and the PL-11 compiler.  These are typical
of "systems" programs found on the PDP-11.  Out of a total of 1044 assembled
instructions, 671 (64%) set the condition codes as a principal or side
effect of their operation.  Of these, only 159 (24%) were followed by
instruction sequences that actually utilized the condition code setting.
Hence, in 76% of the instructions the effort required to set the condition
codes was completely wasted.  Furthermore, of the 159 instructions that set
the condition codes for subsequent meaningful utilization, 116 (73%) were
instructions (17 TST, 51 CMP, 42 BIT, 4 SEC, and 2 CLC instructions) whose
primary purpose is to set condition codes. ...  This means that the side
effect of setting the condition codes is wasted in 512 (92%) of the 555
instructions whose primary function is not condition code manipulation.
(Of the 43 that did set the condition codes for later use, 36 (84%) were MOV
instructions.)  Of course, these are measurements of "static" instruction
sequences, not a dynamic trace during execution.  Nonetheless, it would
appear that in practice very few explicit tests are eliminated by having
data operation instructions set the condition code bits as a side effect, 
and the increased complexity with accompanying performance degradation of
having every such instruction set the condition codes is indeed not justified."

This paper was published sometime around 1975 (I can look up the reference
for interested parties).  This paper, together with optimizing compiler
papers and Wulf's "Compilers and Computer Architecture" and others
hammered home the point that condition codes are not the best way to
propagate relational information between instructions.

robinson@dewey.soe.berkeley.edu (Michael Robinson) (02/20/88)

In article <7445@apple.UUCP> bcase@apple.UUCP (Brian Case) writes:
>(Here we go again...)  There is a paper by Robert D. Russell, "The PDP-11:
>A Case Study of How _NOT_ To Design Condition Codes," that, although for
>a different architecture, says exactly the same thing.  Excerpts from this
>old, old paper:
>[...]
>This means that the side
>effect of setting the condition codes is wasted in 512 (92%) of the 555
>instructions whose primary function is not condition code manipulation.
>(Of the 43 that did set the condition codes for later use, 36 (84%) were MOV
>instructions.)

But:

>Of course, these are measurements of "static" instruction
>sequences, not a dynamic trace during execution.

Maybe I'm just stupid, but aren't the dynamic measurements far more
significant?

In my own assembly language programming, my experience has been that 
condition-code-setting instructions are used to advantage most often in
tight, processor-intensive loops (copies, sorts, searches, etc.)--the type
of instructions, therefore, that tend to get executed with much greater 
frequency than usual code.

Surely, someone out there must have done dynamic analysis of condition
code utilization.  I won't be convinced on this issue until I've seen the
results.

------------------------------------------------------------------------------
Michael Robinson                              USENET:  ucbvax!ernie!robinson
                                              ARPA: robinson@ernie.berkeley.edu

jesup@pawl22.pawl.rpi.edu (Randell E. Jesup) (02/21/88)

In article <1610@gumby.mips.COM> earl@mips.COM (Earl Killian) writes:
>In article <375@imagine.PAWL.RPI.EDU> jesup@pawl1.pawl.rpi.edu (Randell E. Jesup) writes:
>
>   Think about compare & branch from a hardware point of view.  To do
...
>   for this computation and run it in parallel, you MIGHT be able to
>   pull it off, though I doubt it.  It would cost LOTS of chip area,
>   and would probably be your critical path that determines your cycle
>   time (certainly it would be if you didn't have a parallel adder!)

>Hardware makes things go faster.  That's why RISC machines tend to
>have more hardware in them than CISCs (they find room for the extra
>hardware by tossing out the firmware, for a net savings).  It is
>perfectly reasonable to dedicate an adder to computing
>PC+branchdisplacement on every instruction (not just branch
>instructions), and selecting between that and PC+1 based on the branch
>decision.  Perfectly reasonable because that one adder just added 10%
>to your performance.

	You may well be right.  It's always worth checking into when doing
a design, because it IS a potential win, if you can pull it off.  However,
when working at the cutting edge, a large adder is a BIG amount of chip
space to use.  A full ALU (which is just an adder and a shifter) can take
20+% of the entire chip.  The bigger the chip, the lower the yield, and
the more delays in intra-chip runs.
	There also are timing considerations.  If you have your cycle timed
out to be the same as the time through a 32-bit adder, you may have trouble
getting the result of the comparison out in time.  Also, the calculation in a
pipelined machine is between PC+N+1 and PC+N+disp, but that's minor.  Another
thing is that running those bits and the PC off to the adder might slow down
one of the decode stages.  Lastly, you have to find the bits in the instruction
to specify both registers and the displacement, and this extra format may also
slow down decoding.
	The point here isn't that it's impossible to get a win with conditional
branches, but that there are a LOT of side-effects that have to be dealt
with in doing it.

>Branch decisions can have practically the same timing constraints as
>load/store instructions in a simple pipeline; if you can do the
>address add for the load/stores, then you can do the branch decision.

"simple pipeline"?  Many chips use the ALU stage of loads/stores for the
address computation.  Once again, you can throw an address-calculator in here
to speed them up by a cycle or so, but here it might impact your register
file design, for indexed load/stores.

The moral:  Chip design is a wondrously complicated place, full of hidden
connections and fragilities.  Very few ideas are easy to implement.

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

bcase@apple.UUCP (Brian Case) (02/22/88)

In article <400@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
>In article <1610@gumby.mips.COM> earl@mips.COM (Earl Killian) writes:
>>In article <375@imagine.PAWL.RPI.EDU> jesup@pawl1.pawl.rpi.edu (Randell E. Jesup) writes:
> A full ALU (which is just an adder and a shifter) can take
>20+% of the entire chip.  The bigger the chip, the lower the yield, and
>the more delays in intra-chip runs.

WOW!  What technology are you talking about?  Looking at the die photo of
the 29K (1.25 micron CMOS), the ALU/shifter is between 5% and 7% of the
die.  The PC+relative offset adder is probably more like 3%.  Eliminating
the PC adder would probably have had no effect on the yield-influencing
die size (although this analysis does not include internal bus considerations;
but I would guess that sharing the main ALU for PC+offset would have
*increased* the internal bus overhead).

andrew@frip.gwd.tek.com (Andrew Klossner) (02/23/88)

	"The test x = 0, and therefore x = y, can almost certainly be
	done with minimum delay in a pipeline, as can x < 0, looking
	just at the sign bit.  A full arithmetic compare might be a
	problem because of carry."

I've seen this assertion a couple of times, but it ain't so.  You need
to worry about carry when doing a subtract, but not when just doing a
compare.  If both operands are positive, a simple bit-for-bit
comparison is sufficient, analogous to the way that C programs compare
strings.  When the operands can be signed, a little extra logic is
required, but nothing as fancy as a full adder/subtractor.
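
In C, the unsigned case might look like the sketch below (for signed
operands, flip the sign bits of both first):

    /* a < b iff, at the highest bit position where they differ, b
       has the 1.  No arithmetic borrow is computed; the work is a
       priority scan from the high bits down. */
    int ult(unsigned a, unsigned b)     /* 32-bit unsigneds assumed */
    {
        unsigned d = a ^ b;             /* positions that differ */
        unsigned m;

        if (d == 0)
            return 0;                   /* equal */
        d |= d >> 1;  d |= d >> 2;      /* smear the top set bit down */
        d |= d >> 4;  d |= d >> 8;
        d |= d >> 16;
        m = d ^ (d >> 1);               /* isolate highest differing bit */
        return (b & m) != 0;
    }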

  -=- Andrew Klossner   (decvax!tektronix!tekecs!andrew)       [UUCP]
                        (andrew%tekecs.tek.com@relay.cs.net)   [ARPA]

roger@telesoft.UUCP (Roger Arnold @prodigal) (02/24/88)

In article <28200099@ccvaxa>, aglew@ccvaxa.UUCP writes:
> 
> ..> Compare and branch.
> 
> In case it needs to be restated, special cases of compare and branch
> *DO* *NOT* need to be run through the ALU: x = y, x != y, x < 0, etc.

Grumph!  What's being said here?  It's hard to carry on architectural 
discussions in general terms; everyone (me included) makes unstated
assumptions about the context.  

"x = y" has to be run through SOMEthing.  The values of x and y must 
be accessed, and compared bitwise.  If the point being made is that it
doesn't take a full ALU to do it, well, of course.  All it takes is 
the OR of 32 parallel XOR gates, or something equivalent.  The delay
through such a circuit is less than it is through an adder.  But the
register file accesses are probably going to use the same internal
busses that feed the ALU.  Not that it HAS to be that way, but the
alternatives seem very expensive.  Am I missing something?

- Roger Arnold				..ucsd!telesoft!roger

doug@edge.UUCP (Doug Pardee) (02/26/88)

]I've seen this assertion a couple of times, but it ain't so.  You need
]to worry about carry when doing a subtract, but not when just doing a
]compare.  If both operands are positive, a simple bit-for-bit
]comparison is sufficient, analogous to the way that C programs compare
]strings.  When the operands can be signed, a little extra logic is
]required, but nothing as fancy as a full adder/subtractor.

Not only is bit-for-bit comparison sufficient, it should be considered
*necessary*.  Using subtraction for comparison only makes a mess of
things by introducing (unnecessarily) the possibility of overflow when
comparing a quite negative number with a quite positive one.
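
A concrete case, assuming 32-bit two's-complement arithmetic that wraps
on overflow:

    #include <stdio.h>

    int main(void)
    {
        long a = -2000000000, b = 2000000000;

        /* a - b is about -4 billion, which won't fit in 32 bits; it
           wraps to a positive value, so the sign test lies */
        if (a - b < 0)
            printf("subtraction says a < b\n");
        else
            printf("subtraction says a >= b  -- WRONG\n");

        if (a < b)                      /* direct comparison is right */
            printf("comparison says a < b\n");
        return 0;
    }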
-- 
Doug Pardee         {ames,hplabs,sun,amdahl,ihnp4,allegra}!oliveb!edge!doug
Edge Computer Corp., Scottsdale, AZ                 uunet!ism780c!edge!doug

hansen@mips.COM (Craig Hansen) (02/27/88)

]I've seen this assertion a couple of times, but it ain't so.  You need
]to worry about carry when doing a subtract, but not when just doing a
]compare.  If both operands are positive, a simple bit-for-bit
]comparison is sufficient, analogous to the way that C programs compare
]strings.  When the operands can be signed, a little extra logic is
]required, but nothing as fancy as a full adder/subtractor.

However, to do the bit-for-bit comparison in parallel (most C programs
compare strings one byte at a time), you need a structure that looks
_remarkably_ like a carry chain. In fact, the structure's just as
"fancy" as the carry chain in a subtractor, because it is functionally
identical to the carry chain in a subtractor. Remember that the
comparison of the high-order bits has priority over the comparison of
low-order bits....

-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...{ames,decwrl,prls}!mips!hansen or hansen@mips.com   408-991-0234

aglew@ccvaxa.UUCP (03/01/88)

>]If both operands are positive, a simple bit-for-bit
>]comparison is sufficient, analogous to the way that C programs compare
>]strings.
>
>However, to do the bit-for-bit comparison in parallel (most C programs
>compare strings one byte at a time), you need a structure that looks
>_remarkably_ like a carry chain. In fact, the structure's just as
>"fancy" as the carry chain in a subtractor, because it is functionally
>identical to the carry chain in a subtractor. Remember that the
>comparison of the high-order bits has priority over the comparison of
>low-order bits....
>
>Craig Hansen

Thanks, Craig, I was beginning to get worried.