roger@telesoft.UUCP (Roger Arnold @prodigal) (02/11/88)
There's been a lot of discussion lately on the utility of FREQUENCY statements and the subject of branch prediction, etc. But nobody has been addressing an issue that looks (to me) to offer more potential for improving performance. A significant fraction of real code always seems to be devoted to repeating tests whose outcome is "already known". I'm thinking of things like tests on input parameters, when the tests occur multiple times within the same subprogram activation. There may be a standard technical term for what I'm talking about, but I'm not aware of it. "Loop invariant conditional expressions" are a major subcategory, but the category I want includes any test whose outcome can be determined by another test or tests preceding it in the data flow--recognizing that "another test" could include a prior execution of the same test. It's easy for an optimizer that does global data flow analysis to recognize such tests, but no architecture that I know of gives it anything very interesting to do with them. Sure, for a complex loop invariant test, the optimizer can move the test into the loop pre- header, and substitute a simple test on a boolean varible inside the loop. Big deal. The test on the variable and the conditional branch still end up sitting in the loop, taking up code space and execution cycles. Or the optimizer can get wild, and set up one copy of the loop as the THEN part of an IF THEN ELSE, and another as the ELSE part. That way the invariant tests, and any code that becomes dead as a result of the test hoisting, are eliminated from the loops. Execution efficiency is optimized, but the high cost in code space makes it practical only for small loops. What I'd like to see in an architecture, as a minimum, would be a set of one bit registers into which boolean values can be written. It's silly to have to tie up an entire 32-bit register to hold a boolean value. Especially since the register will still have to be tested to make its value accessible to the branch unit. With addressable flag registers, conditional branch instructions could directly reference the flag of interest, with no preceding test instruction. The flag registers could be available to an early stage of the instruction decode pipe, so that the branch instructions referencing them would be promptly converted to no-ops or unconditional branches, as appropriate. Five of the flag registers could be reserved for the standard ALU flags (zero, negative, positive, carry, and overflow), so that only one general form of conditional branch would be needed. Now, who's going to tell me that I've just described their favorite machine from out of the past? Surely SOMEONE has used this idea before? - Roger Arnold ..sdcsvax!telesoft!roger
earl@mips.COM (Earl Killian) (02/12/88)
Roger Arnold proposes that branches be implemented as computing booleans (to one of 32 1-bit homes to be specific) and testing them in a separate instruction, for the the purpose of optimizing tests. The problem with this is that it makes a conditional a two instruction sequence. With what we know about instruction frequencies, this is not a good idea. The abstract concept of "compare two values and branch accordingly" represents around 15% of the instructions executed on a computer. To make this take two instructions increases your instruction count by 15% (and thus, in many cases, your cycle count by 15%). Packaging the abstract concept of conditional branching into one instruction rather than two is one of the numerous ways that some RISC machines go faster than other architectures. What I've found quite surprizing is that some recent architectures (e.g. Berkeley RISC and its clone, SPARC), used condition codes after studying instruction stream statistics. I suppose compare and branch in one instruction was considered too difficult, at the time.
daryl@hpcllcm.HP.COM (Daryl Odnert) (02/13/88)
The Hewlett-Packard Precision Architecture (a.k.a. Spectrum) includes some conditional branch instructions that may be of interest. There are instructions that allow you to -- compare two registers and conditionally branch based on the result of the comparison -- compare a register vs. an immediate and conditionally branch based on the result of the comparison -- add two registers and conditionally branch based on certain relationships between the operands -- add an immediate to a register and conditionally branch based on certain relationships between the operands Perhaps more directly related to Roger's suggestion are the following two instructions: BB -- branch on bit: tests a bit at a fixed position in a register and branch based if the bit is on or off (depending on a bit set in the instruction word.) BVB -- branch on variable bit: conditionally branches based on a bit at a position in a register specified by the value in another register. These last two instructions allow an optimizing compiler to perform the transformation suggested. I'm not sure whether the test is done early enough in the pipeline for the hardware to treat the instruction as a no-op or an unconditional branch. The answer may be different depending on which hardware implementation you look at. Daryl Odnert Hewlett-Packard Computer Language Lab {outside world}!hplabs!hpcllcm!daryl
baum@apple.UUCP (Allen J. Baum) (02/13/88)
-------- [] >In <191@telesoft.UUCP> roger@telesoft.UUCP (Roger Arnold @prodigal) writes: >What I'd like to see in an architecture, as a minimum, would be a set >of one bit registers into which boolean values can be written. >With addressable flag registers, conditional branch instructions could >directly reference the flag of interest, with no preceding test instruction. > Five of the flag registers could be reserved for the >standard ALU flags (zero, negative, positive, carry, and overflow), >so that only one general form of conditional branch would be needed. > >Now, who's going to tell me that I've just described their favorite >machine from out of the past? Surely SOMEONE has used this idea >before? IBM uses exactly this scheme in the RT/PC, although they only have one general purpose bit that can hold any boolean result. The instruction encodings will allow conditional branches on the state of the low 16 (sometimes 8) bits of the Condition Status Register. Of these 16 bits, 9 are reserved, one is permanently 0, there are <,=,>,C,& V conditions, and one general purpose 'test bit' that can be set with a move-to-testbit instruction. Presumably, the reserved bits could be used for more of the 'testbits', although more move-to-testbit instructions would be required. Only bits in registers can be moved to/from the testbit. -- {decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385
bcase@apple.UUCP (Brian Case) (02/13/88)
In article <1556@gumby.mips.COM> earl@mips.COM (Earl Killian) writes: >The problem with this is that it makes a conditional a two instruction >sequence. With what we know about instruction frequencies, this is >not a good idea. The abstract concept of "compare two values and >branch accordingly" represents around 15% of the instructions executed >on a computer. To make this take two instructions increases your >instruction count by 15% (and thus, in many cases, your cycle count by >15%). > >Packaging the abstract concept of conditional branching into one >instruction rather than two is one of the numerous ways that some RISC >machines go faster than other architectures. What I've found quite >surprizing is that some recent architectures (e.g. Berkeley RISC and >its clone, SPARC), used condition codes after studying instruction >stream statistics. I suppose compare and branch in one instruction >was considered too difficult, at the time. The Am29000 provides at least one other variation: conditional branch instructions look at only the most significant bit of a register to decide whether or not to branch. Thus, this architecture requires two instructions to accomplish compare/branch, but the result of the compare is a data value like any other. (Note that test for 32-bit twos-complement negative is "free." This comes in handy, very handy, for simulation of other architectures!) The computation of the boolean is subject to many optimizations, not the least of which is scheduling. The compare instruction often ends up in a branch or load/store delay slot (sorry, I don't have numbers for this). As Earl points out, however, there is no question that a savings can be had by providing compare/branch; the savings is mostly space, in my opinion. I can tell you that there isn't time in the current Am29000 implementation for a compare in the register read pipeline stage. There might be time for a zero-detection, but I doubt it. As I understand it, the MIPS pipeline does have a zero-detection in the register read stage, and I can believe having it is a win. For the Am29000, compare/branch would require either: (1) a slower clock and unbalanced pipe, or (2) an extra delay slot or pipe bubble for branches. Neither of these options (I believe) is better than a two-instruction compare and branch. Different architectures require different trade-offs to ensure "good" implementations. The Am29000 register file, just like the MIPS compare/branch, is a win. Who wins more? Oh please, not that again! :-) :-) bcase
roger@telesoft.UUCP (Roger Arnold @prodigal) (02/13/88)
In article <1556@gumby.mips.COM>, earl@mips.COM (Earl Killian) writes: > Roger Arnold proposes [..] (boolean registers) > > The problem with this is that it makes a conditional a two instruction > sequence. [..] The abstract concept of "compare two values and > branch accordingly" represents around 15% of the instructions executed > on a computer. To make this take two instructions increases your > instruction count by 15% (and thus, in many cases, your cycle count by > 15%). > > [..] What I've found quite > surprizing is that some recent architectures (e.g. Berkeley RISC and > its clone, SPARC), used condition codes after studying instruction > stream statistics. I suppose compare and branch in one instruction > was considered too difficult, at the time. I'm no authority, but it doesn't surprise me at all that a RISC would use condition codes. Compare and branch, in one instruction, strikes me as more at home in a CISC than a RISC architecture. For one thing, if the instruction cycle is closely tuned to the basic ALU cycle, there won't be time to feed back the result of a compare to the branch in one cycle. To allow it, the cycle would have to be extended, at a major performance cost to the rest of the instruction set. Of course, marvelous things are possible with pipelining and parallel lookahead, but there's still a second consideration: there aren't enough bits to specify two operands and a target address in one fixed size instruction. Not, at least, without imposing restrictions that do ugly things to the instuction set's orthogonality. (E.g., both operands in a register, target address a limited displacement PC relative mode). Then, too, condition codes enable instruction scheduling; very good for pipelining and low level parallelism in top end hardware. As to the 15% "compare two values" and branch, I've no trouble at all believing that number. In fact, it feels conservative, if it includes comparisons to zero. But it was consideration of how many of those compaisions only repeat tests previously made that prompted me to suggest boolean registers. Sorry, no numbers. Just a programmer's notoriously unreliable "gut feeling", based on looking a LOT of code over the years. - Roger Arnold ..ucsd!telesoft!roger
mac3n@babbage.acc.virginia.edu (Alex Colvin) (02/14/88)
> >What I'd like to see in an architecture, as a minimum, would be a set > >of one bit registers into which boolean values can be written. > IBM uses exactly this scheme in the RT/PC, although they only have one > general purpose bit that can hold any boolean result. So does the Honeywell Level 6 mini, now called the DPS6. This gives you a convenient place to return a status bit, but no substatus. A problem is that many langages aren't prepared to use such a register. Seen any BIT types in C?
aglew@ccvaxa.UUCP (02/15/88)
..> Comparisons, branches, and booleans The test x = 0, and therefore x = y, can almost certainly be done with minimum delay in a pipeline, as can x < 0, looking just at the sign bit. A full arithmetic compare might be a problem because of carry. When storing the result of a compare into a register, it might be better to store all the CC bits, rather than just a boolean corresponding to what you wanted to test. Eg. compare result := x ? y if result was < branch foo if result was > branch bar if result was unordered branch foobar ... And have a loose encoding so that the "if result was ..." instruction is cheap. This is similar to what some vector processors do with vector compares: create a vector of CCs, and be able to do multiple tests with them without rescanning the vector. Andy "Krazy" Glew. Gould CSD-Urbana. 1101 E. University, Urbana, IL 61801 aglew@gould.com - preferred, if you have nameserver aglew@gswd-vms.gould.com - if you don't aglew@gswd-vms.arpa - if you use DoD hosttable aglew%mycroft@gswd-vms.arpa - domains are supposed to make things easier? My opinions are my own, and are not the opinions of my employer, or any other organisation. I indicate my company only so that the reader may account for any possible bias I may have towards our products.
mash@mips.COM (John Mashey) (02/15/88)
In article <208@telesoft.UUCP> roger@telesoft.UUCP (Roger Arnold @prodigal) writes: >In article <1556@gumby.mips.COM>, earl@mips.COM (Earl Killian) writes: >> Roger Arnold proposes [..] (boolean registers) >I'm no authority, but it doesn't surprise me at all that a RISC would >use condition codes. Compare and branch, in one instruction, strikes >me as more at home in a CISC than a RISC architecture. For one thing, >if the instruction cycle is closely tuned to the basic ALU cycle, there >won't be time to feed back the result of a compare to the branch in one >cycle. To allow it, the cycle would have to be extended, at a major >performance cost to the rest of the instruction set. It is difficult to do a general compare ( a < b ) and branch in one cycle, without the effects noted. If you're careful, you can do: register a = register b register a != register b compares against 0 fall out by using reg b == 0 register a < 0 [i.e., sign bit == 1] register a >= 0 [sign bit == 0] register a > 0 [i.e., sign bit == 0, at least one nonzero bit elsewise] register a <= 0 [sign bit == 1, or sign == 0 and rest == 0] > >Of course, marvelous things are possible with pipelining and parallel >lookahead, but there's still a second consideration: there aren't >enough bits to specify two operands and a target address in one fixed >size instruction. Not, at least, without imposing restrictions that do >ugly things to the instuction set's orthogonality. (E.g., both operands >in a register, target address a limited displacement PC relative mode). >Then, too, condition codes enable instruction scheduling; very good for >pipelining and low level parallelism in top end hardware. Note also the difference between using a "compare" or "set" command that sets one or bits in an arbitrary GP register, and a compare that sets a real condition code register. The former gives compilers much more latitude for code scheduling, and actually turns out to simplify the hardware, apparently: the "bypassing" hardware that lets you grab results in the next instruction, before the results have actually gotten back to the registers, is simpler because all it ever has to do is a uniformly-sized register, not register + condition code. [Probably not a huge deal, but every bit less unnecessary complexity...] Compilers do NOT like condition codes at all: a lot of very clean algorithms get uglified by having to deal with magicly different condition-codes. The restriction noted isn't particularly ugly: actually, it fit into our scheme quite easily. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
sjc@mips.COM (Steve "The" Correll) (02/15/88)
In article <208@telesoft.UUCP>, roger@telesoft.UUCP (Roger Arnold @prodigal) writes: > ...there aren't > enough bits to specify two operands and a target address in one fixed > size instruction. Not, at least, without imposing restrictions that do > ugly things to the instuction set's orthogonality. (E.g., both operands > in a register, target address a limited displacement PC relative mode). > Then, too, condition codes enable instruction scheduling; very good for > pipelining and low level parallelism in top end hardware. As an illustration, the MIPS compare-and-branch allocates 6 bits to an opcode, 5 bits to each register field, and 16 bits to a word-offset target. The register-operand-only restriction seems painless in practice (if a value is being tested often, the optimizer is happy to keep it in a register, and one of the registers acts like a literal zero) and we have immediate "set less than" instructions which set the low-order bit of a register for testing later on. This is similar to a condition code, but uses any available general-purpose register rather than a dedicated boolean. Sometimes that really pays off. For example, I like to avoid branches by implementing the three-way comparison used by Unix "qsort" (yielding -1/0/+1) as: (t0 > t1) - (t0 < t1) This translates into two "set less than"s and a "subtract"; it's possible precisely because the result of each comparison goes into a general register rather than a dedicated boolean. One can similarly use arithmetic in lieu of branches for certain range-checks, and one can use "set less than" followed by an ordinary add to propagate a carry in multiple-precision arithmetic; I can't prove it's superior to a carry flag plus "addc" and "subc" instructions, but it's satisfyingly simple and elegant. The 16-bit offset restriction is almost never a problem, but loops with more than 128k bytes of code inside them do exist, and they are indeed a pain. Condition codes can be good or bad for instruction-scheduling: in some CISC architectures, so many instructions bash so many condition codes that you have to keep the test and the branch adjacent anyway. -- ...decwrl!mips!sjc Steve Correll
bcase@apple.UUCP (Brian Case) (02/16/88)
In article <208@telesoft.UUCP> roger@telesoft.UUCP (Roger Arnold @prodigal) writes: >In article <1556@gumby.mips.COM>, earl@mips.COM (Earl Killian) writes: >> Roger Arnold proposes [..] (boolean registers) >> >> The abstract concept of "compare two values and >> branch accordingly" represents around 15% of the instructions executed >> on a computer. To make this take two instructions increases your >> instruction count by 15% (and thus, in many cases, your cycle count by >> 15%). >I'm no authority, but it doesn't surprise me at all that a RISC would >use condition codes. Compare and branch, in one instruction, strikes >me as more at home in a CISC than a RISC architecture. For one thing, >if the instruction cycle is closely tuned to the basic ALU cycle, there >won't be time to feed back the result of a compare to the branch in one >cycle. To allow it, the cycle would have to be extended, at a major >performance cost to the rest of the instruction set. Er, not really true since condition codes come from the ALU and have to meet a set up time for the condition code register anyway. Adding compare and branch has its worst effects by either putting the register file PLUS the ALU in a critical path or delaying the effect of conditional branches by one more cycle (at least this is how I see it; did I goof?). Yes, there is some cost, but adding a gate or two to the ALU stage is probably not going to cause it to be the most critical path (cache access (TLB is a cache) is probably the criticalest path). >Of course, marvelous things are possible with pipelining and parallel >lookahead, but there's still a second consideration: there aren't >enough bits to specify two operands and a target address in one fixed >size instruction. Not, at least, without imposing restrictions that do >ugly things to the instuction set's orthogonality. (E.g., both operands >in a register, target address a limited displacement PC relative mode). >Then, too, condition codes enable instruction scheduling; very good for >pipelining and low level parallelism in top end hardware. Wait a minute, condition codes HINDER instruction scheduling, not "enable" it. The problem is that, even if each instruction can decide to set the cond. codes based on a bit in the instruction, there is only one condition code register; what if the result of a compare needs to be saved for later? What if you (or the compiler) want to move a compare to some distant spot to fill up a branch or load delay slot but there is an intervening instruction that sets the condition codes? Outa luck. Your "boolean registers" idea definitely mittigates this problem, perhaps eliminating it (but it creates others). >As to the 15% "compare two values" and branch, I've no trouble at all >believing that number. In fact, it feels conservative, if it includes >comparisons to zero. But it was consideration of how many of those >compaisions only repeat tests previously made that prompted me to >suggest boolean registers. Sorry, no numbers. Just a programmer's >notoriously unreliable "gut feeling", based on looking a LOT of code >over the years. Your boolean register idea is simply a less general, special case of saving the result of a comparison in a general data register. I suppose we could argue about whether or not it is really necessary to have the full generality of booleans in data registers, but I can say that I, personally now, this is just me, would rather write a compiler for a machine that has either one (1) condition code register or uses booleans in data registers. It's the old "there should be zero, one, or (large) n of something."
jesup@pawl1.pawl.rpi.edu (Randell E. Jesup) (02/16/88)
In article <1556@gumby.mips.COM> earl@mips.COM (Earl Killian) writes: >Roger Arnold proposes that branches be implemented as computing >booleans (to one of 32 1-bit homes to be specific) and testing them in >a separate instruction, for the the purpose of optimizing tests. ... >Packaging the abstract concept of conditional branching into one >instruction rather than two is one of the numerous ways that some RISC >machines go faster than other architectures. What I've found quite >surprizing is that some recent architectures (e.g. Berkeley RISC and >its clone, SPARC), used condition codes after studying instruction >stream statistics. I suppose compare and branch in one instruction >was considered too difficult, at the time. Think about compare & branch from a hardware point of view. To do it in one cycle, you must fetch two values, run them through the ALU, and get the result. Now you have the information that allows you to determine whether to branch. You must also determine the branch destination. This may also require some computation, an addition to the PC (though it might be speeded a little by knowing the offset if some small number of bits, which it has to be given a 32-bit instruction.) If you're willing to build another fast adder for this computation and run it in parallel, you MIGHT be able to pull it off, though I doubt it. It would cost LOTS of chip area, and would probably be your critical path that determines your cycle time (certainly it would be if you didn't have a parallel adder!) I'm not saying that conditions codes are the answer, just that the one- instruction compare & branch creates some big problems for chip designers at the hardware level. If you go to the ISSCC (?), check out the solution that the GE RPM-40 uses (not that it's perfect either). More I think I should not say before the conference. // Randell Jesup Lunge Software Development // Dedicated Amiga Programmer 13 Frear Ave, Troy, NY 12180 \\// beowulf!lunge!jesup@steinmetz.UUCP (518) 272-2942 \/ (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup
aglew@ccvaxa.UUCP (02/18/88)
..> Compare and branch. In case it needs to be restated, special cases of compare and branch *DO* *NOT* need to be run through the ALU: x = y, x != y, x < 0, etc.
earl@mips.COM (Earl Killian) (02/18/88)
In article <375@imagine.PAWL.RPI.EDU> jesup@pawl1.pawl.rpi.edu (Randell E. Jesup) writes:
Think about compare & branch from a hardware point of view. To do
it in one cycle, you must fetch two values, run them through the
ALU, and get the result. Now you have the information that allows
you to determine whether to branch. You must also determine the
branch destination. This may also require some computation, an
addition to the PC (though it might be speeded a little by knowing
the offset if some small number of bits, which it has to be given a
32-bit instruction.) If you're willing to build another fast adder
for this computation and run it in parallel, you MIGHT be able to
pull it off, though I doubt it. It would cost LOTS of chip area,
and would probably be your critical path that determines your cycle
time (certainly it would be if you didn't have a parallel adder!)
Hardware makes things go faster. That's why RISC machines tend to
have more hardware in them than CISCs (they find room the extra
hardware by tossing out the firmware, for a net savings). It is
perfectly reasonable to dedicate an adder to computing
PC+branchdisplacement on every instruction (not just branch
instructions), and selecting between that and PC+1 based on the branch
decision. Perfectly reasonable because that one adder just added 10%
to your performance.
Branch decisions can have practically the same timing constraints as
load/store instructions in a simple pipeline; if you can do the
address add for the load/stores, then you can do the branch decision.
The details depend on your pipeline. The MIPS R2000 pipeline is not
quite as generous to branch decisions as a simple pipeline because it
has virtual to physical translation in series with cache access, which
is why it leaves out the X < Y compare and branch. It does do X = Y,
and X ? 0, which are most of the compare and branches. The end result
is that the MIPS architecture is about 10% more efficient than
condition-code architectures from branches alone (i.e. needs to
execute 10% fewer instructions).
firth@sei.cmu.edu (Robert Firth) (02/18/88)
Some of our work here confirms the view that condition codes are not a very good idea. Compiling the same code for MIPS M/500 and Vax-11, using the same compiler technology, gave the result MIPS: 6.1% of instructions were conditional branches 1.2% were 'set condition' instructions VAX: 10.0% conditional branches 8.9% tests or comparisons MIPS provides only comparison against zero in the conditional branch This took care of about 80% of branches; the other 20% required a prior set condition instruction. Vax branches against the condition codes, but loses because nearly all the branches then required a prior instruction to set them. Here is an extract from our report, discussing the Vax statistics: "The code generator slaves the condition codes religiously, both through linear code and across control transfers. Nevertheless, of 415 conditional branches, 371 (almost 90%) required a prior test or compare to set the condition codes; of 2169 normal instructions that set the condition codes, only 44 (about 2%) did so to any purpose. It is hard to avoid the conclusion that condition codes are a waste of time, effort, and silicon."
bcase@apple.UUCP (Brian Case) (02/20/88)
In article <4252@aw.sei.cmu.edu> firth@bd.sei.cmu.edu.UUCP (Robert Firth) writes: >Some of our work here confirms the view that condition codes are >not a very good idea. > "The code generator slaves the condition codes > religiously, both through linear code and across > control transfers. Nevertheless, of 415 conditional > branches, 371 (almost 90%) required a prior test > or compare to set the condition codes; of 2169 > normal instructions that set the condition codes, > only 44 (about 2%) did so to any purpose. It is > hard to avoid the conclusion that condition codes > are a waste of time, effort, and silicon." (Here we go again...) There is a paper by Robert D. Russell, "The PDP-11: A Case Study of How _NOT_ To Design Condition Codes," that, although for a different architecture, says exactly the same thing. Excerpts from this old, old paper: "A sample of several hand-coded PDP-11 modules was taken at random from the Oh my God------------^^^^^^^^^^ RSX-11M operating system, the DEC-10 communications front end (which is handled by dedicated PDP-11s), and the PL-11 compiler. These are typical of "systems" programs found on the PDP-11. Out of a total of 1044 assembled instructions, 671 (64%) set the condition codes as a principle or side effect of their operation. Of these, only 159 (24%) were followed by instruction sequences that actually utilized the condition code setting. Hence, in 76% of the instructions the effort required to se the condition codes was completely wasted. Furthermore, of the 159 instructions that set the condition codes for susequent meaningful utilization, 116 (73%) were instructions (17 TST, 51 CMP, 42 BIT, 4 SEC, and 2 CLC instructions) whose primary purpose is to set condition codes. ... This means that the side effect of setting the condition codes is wasted in 512 (92%) of the 555 instructions whose primary function is not condition code manipulation. (Of the 43 that did set the condition codes for later use, 36 (84%) were MOV instructions.) Of course, these are measurements of "static" instruction sequences, not a dynamic trace during execution. Nonetheless, it would appear that in practice very few explicit tests are eliminated by having data operation instructions set the condition code bits as a side effect, and the increased complexity with accompanying performance degredataion of having every such instruction set the condition codes is indeed not justified." This paper was published sometime around 1975 (I can look up the reference for interested parties). This paper, together with optimizing compiler papers and Wulf's "Compilers and Computer Architecture" and others hammered home the point that condition codes are not the best way to propagate relational information between instructions.
robinson@dewey.soe.berkeley.edu (Michael Robinson) (02/20/88)
In article <7445@apple.UUCP> bcase@apple.UUCP (Brian Case) writes: >(Here we go again...) There is a paper by Robert D. Russell, "The PDP-11: >A Case Study of How _NOT_ To Design Condition Codes," that, although for >a different architecture, says exactly the same thing. Excerpts from this >old, old paper: >[...] >This means that the side >effect of setting the condition codes is wasted in 512 (92%) of the 555 >instructions whose primary function is not condition code manipulation. >(Of the 43 that did set the condition codes for later use, 36 (84%) were MOV >instructions.) But: >Of course, these are measurements of "static" instruction >sequences, not a dynamic trace during execution. Maybe I'm just stupid, but aren't the dynamic measurements far more significant? In my own assembly language programming, my experience has been that condition-code-setting instructions are used to advantage most often in tight, processor-intensive loops (copies, sorts, searches, etc.)--the type of instructions, therefore, that tend to get executed with much greater frequency than usual code. Surely, someone out there must have done dynamic analysis of condition code utilization. I won't be convinced on this issue until I've seen the results. ------------------------------------------------------------------------------ Michael Robinson USENET: ucbvax!ernie!robinson ARPA: robinson@ernie.berkeley.edu
jesup@pawl22.pawl.rpi.edu (Randell E. Jesup) (02/21/88)
In article <1610@gumby.mips.COM> earl@mips.COM (Earl Killian) writes: >In article <375@imagine.PAWL.RPI.EDU> jesup@pawl1.pawl.rpi.edu (Randell E. Jesup) writes: > > Think about compare & branch from a hardware point of view. To do ... > for this computation and run it in parallel, you MIGHT be able to > pull it off, though I doubt it. It would cost LOTS of chip area, > and would probably be your critical path that determines your cycle > time (certainly it would be if you didn't have a parallel adder!) >Hardware makes things go faster. That's why RISC machines tend to >have more hardware in them than CISCs (they find room the extra >hardware by tossing out the firmware, for a net savings). It is >perfectly reasonable to dedicate an adder to computing >PC+branchdisplacement on every instruction (not just branch >instructions), and selecting between that and PC+1 based on the branch >decision. Perfectly reasonable because that one adder just added 10% >to your performance. You may well be right. It's always worth checking into when doing a design, because it IS a potential win, if you can pull it off. However, when working at the cutting edge, a large adder is a BIG amount of chip space to use. A full ALU (which is just an adder and a shifter) can take 20+% of the entire chip. The bigger the chip, the lower the yield, and the more delays in intra-chip runs. There also are timing considerations. If you have your cycle timed out to be the same as the time through a 32-bit adder, you may have trouble getting result of the comparison out in time. Also, the calculation in a pipelined machine is between PC+N+1 and PC+N+disp, but thats minor. Another thing is that running those bits and the PC off to the adder might slow down one of the decode stages. Lastly, you have to find the bits in the instruction to specify both registers and the displacement, and this extra format may also slow down decoding. The point here isn't that it's impossible to get a win with conditional branches, but that there are a LOT of side-effects that have to be dealt with in doing it. >Branch decisions can have practically the same timing constraints as >load/store instructions in a simple pipeline; if you can do the >address add for the load/stores, then you can do the branch decision. "simple pipeline"? Many chips use the ALU stage of loads/stores for the address computation. Once again, you can throw an address-calculator in here to speed them up by a cycle or so, but here it might impact your register file design, for indexed load/stores. The moral: Chip design is a wonderously complicated place, full of hidden connections and fragilites. Very few ideas are easy to implement. // Randell Jesup Lunge Software Development // Dedicated Amiga Programmer 13 Frear Ave, Troy, NY 12180 \\// beowulf!lunge!jesup@steinmetz.UUCP (518) 272-2942 \/ (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup
bcase@apple.UUCP (Brian Case) (02/22/88)
In article <400@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes: >In article <1610@gumby.mips.COM> earl@mips.COM (Earl Killian) writes: >>In article <375@imagine.PAWL.RPI.EDU> jesup@pawl1.pawl.rpi.edu (Randell E. Jesup) writes: > A full ALU (which is just an adder and a shifter) can take >20+% of the entire chip. The bigger the chip, the lower the yield, and >the more delays in intra-chip runs. WOW! What technology are you talking about? Looking at the die photo of the 29K (1.25 micron CMOS), the ALU/shifter is between 5% and 7% of the die. The PC+relative offset adder is probably more like 3%. Eliminating the PC adder would probably have had no effect on the yield-influencing die size (although this analysis does not include internal bus considerations; but I would guess that sharing the main ALU for PC+offset would have *increased* the internal bus overhead).
andrew@frip.gwd.tek.com (Andrew Klossner) (02/23/88)
"The test x = 0, and therefore x = y, can almost certainly be done with minimum delay in a pipeline, as can x < 0, looking just at the sign bit. A full arithmetic compare might be a problem because of carry." I've seen this assertion a couple of times, but it ain't so. You need to worry about carry when doing a subtract, but not when just doing a compare. If both operands are positive, a simple bit-for-bit comparison is sufficient, analogous to the way that C programs compare strings. When the operands can be signed, a little extra logic is required, but nothing as fancy as a full adder/subtractor. -=- Andrew Klossner (decvax!tektronix!tekecs!andrew) [UUCP] (andrew%tekecs.tek.com@relay.cs.net) [ARPA]
roger@telesoft.UUCP (Roger Arnold @prodigal) (02/24/88)
In article <28200099@ccvaxa>, aglew@ccvaxa.UUCP writes: > > ..> Compare and branch. > > In case it needs to be restated, special cases of compare and branch > *DO* *NOT* need to be run through the ALU: x = y, x != y, x < 0, etc. Grumph! What's being said here? It's hard to carry on architectural discussions in general terms; everyone (me included) makes unstated assumptions about the context. "x = y" has to be run through SOMEthing. The values of x and y must be accessed, and compared bitwise. If the point being made is that it doesn't take a full ALU to do it, well, of course. All it takes is the OR of 32 parallel XOR gates, or something equivalent. The delay through such a circuit is less than it is through an adder. But the register file accesses are probably going to use the same internal busses that feed the ALU. Not that it HAS to be that way, but the alternatives seem very expensive. Am I missing something? - Roger Arnold ..ucsd!telesoft!roger
doug@edge.UUCP (Doug Pardee) (02/26/88)
]I've seen this assertion a couple of times, but it ain't so. You need ]to worry about carry when doing a subtract, but not when just doing a ]compare. If both operands are positive, a simple bit-for-bit ]comparison is sufficient, analogous to the way that C programs compare ]strings. When the operands can be signed, a little extra logic is ]required, but nothing as fancy as a full adder/subtractor. Not only is bit-for-bit comparison sufficient, it should be considered *necessary*. Using subtraction for comparison only makes a mess of things by introducing (unnecessarily) the possibility of overflow when comparing a quite negative number with a quite positive one. -- Doug Pardee {ames,hplabs,sun,amdahl,ihnp4,allegra}!oliveb!edge!doug Edge Computer Corp., Scottsdale, AZ uunet!ism780c!edge!doug
hansen@mips.COM (Craig Hansen) (02/27/88)
]I've seen this assertion a couple of times, but it ain't so. You need ]to worry about carry when doing a subtract, but not when just doing a ]compare. If both operands are positive, a simple bit-for-bit ]comparison is sufficient, analogous to the way that C programs compare ]strings. When the operands can be signed, a little extra logic is ]required, but nothing as fancy as a full adder/subtractor. However, to do the bit-for-bit comparison in parallel (most C programs compare strings one byte at a time), you need a structure that looks _remarkably_ like a carry chain. In fact, the structure's just as "fancy" as the carry chain in a subtractor, because it is functionally identical to the carry chain in a subtractor. Remember that the comparison of the high-order bits has priority over the comparison of low-order bits.... -- Craig Hansen Manager, Architecture Development MIPS Computer Systems, Inc. ...{ames,decwrl,prls}!mips!hansen or hansen@mips.com 408-991-0234
aglew@ccvaxa.UUCP (03/01/88)
>]If both operands are positive, a simple bit-for-bit >]comparison is sufficient, analogous to the way that C programs compare >]strings. > >However, to do the bit-for-bit comparison in parallel (most C programs >compare strings one byte at a time), you need a structure that looks >_remarkably_ like a carry chain. In fact, the structure's just as >"fancy" as the carry chain in a subtractor, because it is functionally >identical to the carry chain in a subtractor. Remember that the >comparison of the high-order bits has priority over the comparison of >low-order bits.... > >Craig Hansen Thanks, Craig, I was beginning to get worried.