hascall@atanasoff.cs.iastate.edu (John Hascall) (09/04/89)
I was thinking about filling the delay slot behind a delayed branch and wondered if there has been any work/research/thought/implementations that put the branch instruction BEFORE the instruction that sets the condition codes for the branch.  For example:

   AGAIN:  JSUB  FOO_RTN    ; return FOO in R0
           BEQL  AGAIN      ; try again if we
           TEST  R0         ; get zero back
           JSUB  BAR_RTN    ; turn the FOO into a BAR

Although not without complications, it would seem an excellent way to have a high branch delay slot fill ratio.  Comments?  Flames?

John Hascall

(p.s. Sorry for discussing architecture here of all places)
mash@mips.COM (John Mashey) (09/05/89)
In article <1432@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu.UUCP (John Hascall) writes:

> I was thinking about filling the delay slot behind a delayed branch and
> wondered if there has been any work/research/thought/implementations
> that put the branch instruction BEFORE the instruction that sets the
> condition codes for the branch.  For example:
>
>    AGAIN:  JSUB  FOO_RTN    ; return FOO in R0
>            BEQL  AGAIN      ; try again if we
>            TEST  R0         ; get zero back
>            JSUB  BAR_RTN    ; turn the FOO into a BAR
>
> Although not without complications, it would seem an excellent way to
> have a high branch delay slot fill ratio.  Comments?  Flames?

The issue is doing the test far enough in advance of the conditional branch that the branch can use the results without stalls.  Most machines can set condition codes in one instruction and do a conditional branch in the next following instruction.  Some machines can do a compare-and-branch in one instruction.  No machine I've ever seen can do a stall-free conditional branch based on the results of the instruction after the branch!

Put another way: as much as computer architects would like pipestages whose results are available in advance of their execution, such things are only found in science fiction...  (Perhaps people recall the old thiotimoline stories in Analog?  Thiotimoline was a chemical that showed a reaction a few seconds BEFORE you poured water in it.  These are like -5ns SRAMs, a commodity long sought after by designers.)

-- 
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:  {ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:   408-991-0253 or 408-720-1700, x253
USPS:  MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
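[Editor's aside: Mashey's point about doing the test far enough ahead of the branch can be put as a toy cycle-count model.  Everything here is illustrative; the two-cycle latency is an assumed number, not any real machine's figure.]

```python
# Toy model (assumption: a compare's result reaches the branch logic
# RESULT_LATENCY cycles after the compare issues).  A branch issued
# `distance` instructions after its compare stalls for whatever part
# of that latency is not hidden by intervening instructions.

RESULT_LATENCY = 2  # assumed pipeline latency, not a real machine's number

def branch_stall_cycles(distance):
    """Stall cycles for a conditional branch issued `distance`
    instructions after the compare that feeds it."""
    return max(0, RESULT_LATENCY - distance)

# A compare immediately before the branch (distance 1) still stalls a
# cycle; spreading the compare two instructions ahead (distance 2) hides
# the latency entirely.  A branch depending on the instruction AFTER it
# (distance 0 or less) is exactly the impossible case described above.
```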
hankd@pur-ee.UUCP (Hank Dietz) (09/05/89)
In article <1432@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu.UUCP (John Hascall) writes:

> I was thinking about filling the delay slot behind a delayed branch and
> wondered if there has been any work/research/thought/implementations
> that put the branch instruction BEFORE the instruction that sets the
> condition codes for the branch.  For example:

This is called "early targeting," "branch-target buffering," or simply instruction prefetch.  It can easily be done on a number of machines; for example, the CARP machine design REQUIRES the user to explicitly prefetch each instruction block (it can only execute instructions from instruction registers).

The problem is that if you don't know the condition when you initiate the branch, then you have to consider the branch going both ways.  This is the primary reason that this technique isn't very popular.  A more popular alternative is to get around the one-condition-code-register code motion constraint by having multiple condition registers (e.g., using general registers for condition evaluation).

-hankd@ecn.purdue.edu
hascall@atanasoff.cs.iastate.edu (John Hascall) (09/05/89)
In article <26859> mash@mips.COM (John Mashey) writes:
}In article <1432> hascall@atanasoff.cs.iastate.edu.UUCP (John Hascall) writes:
}
}>    AGAIN:  JSUB  FOO_RTN    ; return FOO in R0
}>            BEQL  AGAIN      ; try again if we
}>            TEST  R0         ; get zero back
}>            JSUB  BAR_RTN    ; turn the FOO into a BAR
}> Although not without complications, it would seem an
}> excellent way to have a high branch delay slot fill ratio.
}
}The issue is doing the test far enough in advance of the conditional
}branch that the branch can use the results without stalls.

I was thinking of doing it the other way around.

}Put another way: as much as computer architects would like
}pipestages whose results are available in advance of their execution,
}such things are only found in science-fiction......

No.  What I was alluding to was "starting down both paths" of the branch and then "dumping the loser".

One (simple-minded?) way of doing this would be to have two parallel fetch&instruction-decoders in the pipe (perhaps with suitable memory interleaving):

   T   EXECUTE        F&I-D 1        F&I-D 2
   I   -------------  -------------  -------------
   M   BEQL AGAIN     TEST R0        -idle-
   E   TEST R0        JSUB BAR_RTN   JSUB FOO_RTN
   |   JSUB ???_RTN   ...            ...
   V

John Hascall
Signature void where prohibited by law.
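[Editor's aside: Hascall's two-decoder diagram can be sketched as a toy simulation.  All names here are illustrative; the point is only that both successor paths are decoded and one decoder's work is always thrown away.]

```python
def run_both_paths(fallthrough_path, taken_path, branch_taken):
    """Toy model of dual fetch/decode units: both successor paths of a
    conditional branch are decoded speculatively; once the condition
    resolves, keep the winner and count the loser's decodes as waste."""
    decoder1 = list(fallthrough_path)   # F&I-D 1: fall-through path
    decoder2 = list(taken_path)         # F&I-D 2: branch-target path
    if branch_taken:
        return decoder2, len(decoder1)  # dump F&I-D 1's work
    return decoder1, len(decoder2)      # dump F&I-D 2's work
```

With `BEQL AGAIN` taken, F&I-D 2's `JSUB FOO_RTN` stream wins and F&I-D 1's decode of `JSUB BAR_RTN` is dumped.  Either way one decoder's effort is wasted every conditional branch; that is the price paid for never stalling the execute stage.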
johnl@esegue.segue.boston.ma.us (John R. Levine) (09/05/89)
In article <1437@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu.UUCP (John Hascall) writes:

> No.  What I was alluding to was "starting down both paths" of the
> branch and then "dumping the loser".

Back in 1969, the 360/91 prefetched both paths.  Since it had a prefetch queue but no cache, it had to prefetch both in order to be sure it had something to do after the branch.  Like a lot of other stuff the /91 did (most notoriously executing instructions out of order and punting if one of them caused an exception), people seem to have found other ways to get high performance with less complication.  Much of the /91 technology was inherited from the STRETCH, so it may have prefetched both ways as well.

Of course, "high performance" is relative.  The /91 I used had a 60ns cycle time and 2MB of RAM, while the desktop machine on which I am typing this has a 40ns cycle time and 4MB, soon to be 8MB, of RAM.  It has a cache, too.

-- 
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 492 3869
{ima|lotus}!esegue!johnl, johnl@ima.isc.com, Levine@YALE.something
Massachusetts has 64 licensed drivers who are over 100 years old.  -The Globe
slackey@bbn.com (Stan Lackey) (09/06/89)
In article <1437@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu.UUCP (John Hascall) writes:
>In article <26859> mash@mips.COM (John Mashey) writes:
>}In article <1432> hascall@atanasoff.cs.iastate.edu.UUCP (John Hascall) writes:
>}>    AGAIN:  JSUB  FOO_RTN    ; return FOO in R0
>}>            BEQL  AGAIN      ; try again if we
>}>            TEST  R0         ; get zero back
>}> Although not without complications, it would seem an
>}> excellent way to have a high branch delay slot fill ratio.
>}Put another way: as much as computer architects would like
>}pipestages whose results are available in advance of their execution,
>}such things are only found in science-fiction......
> No.  What I was alluding to was "starting down both paths" of the
> branch and then "dumping the loser".

Another way is to use branch prediction: guess at the direction using some algorithm (they range from "terrible" to "pretty good") and start fetching instructions and operands.  At least you only prefetch one path (single instruction cache port, single instruction decoder, etc.).  You need to be careful about doing things that are hard to undo when you turn out to be wrong, though.

Other than in the Multiflow Trace, all the algorithms that I know about have been implemented in hardware.  I wonder if there have been studies done (or implementations?) of a more conventional architecture where the branch instruction has information in it (inserted by the compiler, possibly using runtime statistics) to tell the hardware which way to predict the branch.

Most of the hardware algorithms work well for in-loop situations, where prediction is done either by looking at the direction of the branch to keep execution in the loop, or by caching recently-executed branches and using history analysis.  I am wondering about cases like a long piece of code (such as a system call) where state is tested to control flow, and very little is in a loop.
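[Editor's aside: one of the history-based hardware schemes mentioned above can be sketched as a table of two-bit saturating counters indexed by branch address.  This is a textbook mechanism; the details here are illustrative, not any particular machine's.]

```python
class TwoBitPredictor:
    """Per-branch 2-bit saturating counter: values 0-1 predict
    not-taken, 2-3 predict taken; each outcome moves the counter
    one step toward the observed direction."""

    def __init__(self):
        self.counters = {}  # branch address -> counter value in 0..3

    def predict(self, addr):
        # Unseen branches start at 1 (weakly not-taken).
        return self.counters.get(addr, 1) >= 2

    def update(self, addr, taken):
        c = self.counters.get(addr, 1)
        self.counters[addr] = min(3, c + 1) if taken else max(0, c - 1)
```

For the in-loop case this does well: a loop branch taken ten times and then falling through mispredicts only twice (once warming up, once at loop exit).  For the long system-call-style code, where each branch may execute only once, the table never warms up, which is exactly the worry raised above.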
Of course, lots of machines which depend on an instruction cache are going to perform dismally on this type of code anyway.

-Stan
ad@macbeth.cs.duke.edu (Apostolos Dollas) (09/08/89)
From article <1437@atanasoff.cs.iastate.edu>, by hascall@atanasoff.cs.iastate.edu (John Hascall):

> No.  What I was alluding to was "starting down both paths" of the
> branch and then "dumping the loser".
>
> One (simple minded?) way of doing this would be to have two
> parallel fetch&instruction-decoders in the pipe (perhaps with
> suitable memory interleaving).
>
>    T   EXECUTE        F&I-D 1        F&I-D 2
>    I   -------------  -------------  -------------
>    M   BEQL AGAIN     TEST R0        -idle-
>    E   TEST R0        JSUB BAR_RTN   JSUB FOO_RTN
>    |   JSUB ???_RTN   ...            ...
>    V

A great idea!  In fact, we have been doing research in this area for the last several years.  We have developed a working prototype system with multiple instruction decoding units (IDUs) and have filed for a patent.  The system uses program flow graph information to prefetch instructions so that the IDUs are kept full at all times.  This extra hardware enables *all* branch instructions to execute without any breaks (bubbles) in the instruction pipeline.  As a result, the sustained performance of the execution unit is maximized.

We are preparing an article for publication and will be glad to send copies to anyone interested.  (We have not forgotten those of you who requested more information in April!)

=========================================================================
Robert F. Krick              PHONE: (919) 684-3123 x61
Dept. of Electrical Eng.     CSNET: rfk@dukee
Duke University              UUCP:  decvax!duke!dukee!rfk
Durham, NC 27706             ARPA:  rfk@dukee.egr.duke.edu
jmk@alice.UUCP (Jim McKie) (09/11/89)
To paraphrase from 'Architectural Innovations in the CRISP Microprocessor', Spring COMPCON 87 Proceedings, by Berenbaum, Ditzel and McLellan:

The CRISP CPU has 'branch folding': the Decoded Instruction Cache contains a 'next address' field and also, in the case of conditional branches, an 'alternate address' field.  Logic in the PDU recognises when a non-branching instruction is followed by a branching instruction and 'folds' the two instructions together.  This single instruction is then placed in the Decoded Instruction Cache.  The separate branch instruction disappears entirely from the execution pipeline and the program behaves as if the branch were executed in zero time.

The 'alternate address' field is used to hold the second possible next instruction address for folded conditional branch instructions.  When an instruction folded with a conditional branch is read from the instruction cache, one of the two paths for the branch is selected for the next instruction address, and the address that was not used is retained with the instruction as it proceeds down the execution pipeline.  The alternate address field is retained with each pipeline stage only until the logic can determine whether the selected branch path was correct or not.  When the outcome of the branch condition is known, if the wrong next address was selected, any instructions in progress in the pipeline following the conditional branch are flushed and the alternate address from the folded conditional branch is re-introduced as the next instruction address at the beginning of the execution pipeline.

Determining which of the two possible next addresses of a conditional branch is likely to be taken is aided in CRISP with static branch prediction.  The encoding for the conditional branch instruction contains a single branch prediction bit which can be set by the compiler.
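[Editor's aside: following the COMPCON 87 description paraphrased above, a folded decoded-instruction-cache entry might be sketched like this.  The class, field names, and addresses are illustrative, not CRISP's actual layout.]

```python
class FoldedEntry:
    """A non-branch op folded with the conditional branch that followed
    it: the entry carries both possible successor addresses plus the
    compiler's static prediction bit."""

    def __init__(self, op, next_addr, alt_addr=None, predict_taken=False):
        self.op = op                        # the folded non-branch operation
        self.next_addr = next_addr          # fall-through successor address
        self.alt_addr = alt_addr            # branch target (None: no folded branch)
        self.predict_taken = predict_taken  # static prediction bit from the compiler

    def selected(self):
        """Address fetched next, before the branch condition resolves."""
        if self.alt_addr is not None and self.predict_taken:
            return self.alt_addr
        return self.next_addr

    def alternate(self):
        """Address re-introduced at the head of the pipeline if the
        selected path turns out to be wrong (the pipe behind the
        branch is flushed at that point)."""
        return self.next_addr if self.predict_taken else self.alt_addr
```

A plain instruction with no folded branch simply flows to `next_addr`; a folded conditional carries both addresses down the pipe until the condition is known.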
Branch prediction is useful when a conditional branch instruction in the pipeline can alter the flow of instructions before the result of a comparison can be computed.  If, however, there are no compare instructions in the pipeline, then there is no need for branch prediction: since only the compare instruction may set the condition code flag, the outcome of the conditional branch is known with certainty.  CRISP intentionally has separate compare and conditional branch instructions so that a compiler or optimizer may insert instructions between the compare and conditional branch (this is 'branch spreading').  By combining branch folding and code motion in CRISP there can be no cost at all for either conditional or unconditional branches.

Jim McKie    research!jmk  -or-  jmk@research.att.com
rfk@dukee.egr.duke.edu (Robert F Krick) (09/13/89)
Indeed a variety of techniques have been incorporated in CRISP to reduce, but not eliminate, the performance degradation caused by conditional branch instructions in an instruction pipeline.  CRISP, or any other architecture with a single prefetch and decode pipeline, cannot *guarantee* that the sustained performance of the execution unit will equal its theoretical maximum performance for any program.

From article <9881@alice.UUCP>, by jmk@alice.UUCP (Jim McKie):

> To paraphrase from 'Architectural Innovations in the CRISP
> Microprocessor', Spring COMPCON 87 Proceedings, by
> Berenbaum, Ditzel and McLellan:
>
> The CRISP CPU has 'branch folding': the Decoded Instruction Cache
> contains a 'next address' field and also, in the case of conditional
> branches, an 'alternate address' field.  Logic in the PDU recognises
> when a non-branching instruction is followed by a branching
> instruction and 'folds' the two instructions together.  This single
> instruction is then placed in the Decoded Instruction Cache.  The
> separate branch instruction disappears entirely from the execution
> pipeline and the program behaves as if the branch were executed in
> zero time.

Clearly branch folding is an excellent hardware solution to relieve the execution unit of the trivial operation of determining the new program counter.  However, this scheme breaks down for consecutive conditional branch instructions!

> The 'alternate address' field is used to hold the second possible
> next instruction address for folded conditional branch instructions.
> When an instruction folded with a conditional branch is read from
> the instruction cache, one of the two paths for the branch is
> selected for the next instruction address, and the address that was
> not used is retained with the instruction as it proceeds down
> the execution pipeline.  The alternate address field is retained with
> each pipeline stage only until the logic can determine whether the
> selected branch path was correct or not.
> When the outcome of the
> branch condition is known, if the wrong next address was selected,
> any instructions in progress in the pipeline following the conditional
> branch are flushed and the alternate address from the folded
> conditional branch is re-introduced as the next instruction address
> at the beginning of the execution pipeline.

The flush of the instruction pipeline in this scheme results from retaining only the alternate *address* field, and not the alternate *instruction*, through the instruction pipeline.  Whenever the instruction pipeline must be flushed, the performance of the execution unit falls below its theoretical maximum.

> Determining which of the two possible next addresses of a conditional
> branch is likely to be taken is aided in CRISP with static branch
> prediction.  The encoding for the conditional branch instruction
> contains a single branch prediction bit which can be set by the
> compiler.

By definition, branch prediction (either static or dynamic) cannot always guess the direction of the branch.  Furthermore, the need for branch prediction highlights the fact that branch folding cannot guarantee that the performance of the execution unit will not be degraded.

> Branch prediction is useful when a conditional branch instruction
> in the pipeline can alter the flow of instructions before the
> result of a comparison can be computed.  If, however, there are no
> compare instructions in the pipeline then there is no need for
> branch prediction.  Since only the compare instruction may set the
> condition code flag, the outcome of the conditional branch is known
> with certainty.  CRISP intentionally has separate compare and
> conditional branch instructions so that a compiler or optimizer may
> insert instructions between the compare and conditional branch
> (this is 'branch spreading').  By combining branch folding and code
> motion in CRISP there can be no cost at all for either conditional
> or unconditional branches.
Statistics have shown that changes in program flow occur as frequently as every five instructions (on average!), which limits the benefit of 'branch spreading'.  Furthermore, I believe that the technique of 'branch spreading' can be proved to be equivalent (at best) to filling the delay slots associated with delayed branches (generally acknowledged to be successful only 70% of the time for the first delay slot, and much less for subsequent slots).

> Jim McKie    research!jmk  -or-  jmk@research.att.com

Last but not least, any cache misses which occur when prefetching an instruction will idle the execution unit while the instruction is brought from the slower main memory.

In contrast, we have developed an architecture and have constructed a functional prototype with multiple instruction prefetch and decode units to guarantee that the next instruction will *always* be prefetched and decoded for execution, even in the presence of conditional branch instructions.  Furthermore, i-cache misses are completely eliminated by making use of a graph of the program flow.  As a result, the execution unit never has to wait for a decoded instruction when executing any program.

=========================================================================
Robert F. Krick              PHONE: (919) 684-3123 x61
Dept. of Electrical Eng.     CSNET: rfk@dukee
Duke University              UUCP:  decvax!duke!dukee!rfk
Durham, NC 27706             ARPA:  rfk@dukee.egr.duke.edu

"When the branching gets conditional, the prefetching gets tough,
 and the tough get going!"
=========================================================================
rod@venera.isi.edu (Rodney Doyle Van Meter III) (09/19/89)
I came in in the middle of this conversation, but it seems to me that I was once told that a relatively old machine (early 60s) had a user- or compiler-settable bit in conditional branch instructions to indicate which was the more likely path. Keeps your hardware easy, you only need one prefetch unit, but complicates the life of SW guys a bit. It's an easy guess in loops; a little harder in IF statements. --Rod