[comp.arch] Filling branch delay slot with test

hascall@atanasoff.cs.iastate.edu (John Hascall) (09/04/89)

     I was thinking about filling the delay slot behind
     a delayed branch and wondered if there has been any
     work/research/thought/implementations that put the
     branch instruction BEFORE the instruction that sets
     the condition codes for the branch.  For example:

         AGAIN:   JSUB   FOO_RTN        ; return FOO in R0
                  BEQL   AGAIN          ; try again if we
                  TEST   R0             ;   get zero back
                  JSUB   BAR_RTN        ; turn the FOO into a BAR

     Although not without complications, it would seem an
     excellent way to have a high branch delay slot fill ratio.
     Comments?  Flames?

John Hascall
(p.s. Sorry for discussing architecture here of all places)

mash@mips.COM (John Mashey) (09/05/89)

In article <1432@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu.UUCP (John Hascall) writes:

>     I was thinking about filling the delay slot behind
>     a delayed branch and wondered if there has been any
>     work/research/thought/implementations that put the
>     branch instruction BEFORE the instruction that sets
>     the condition codes for the branch.  For example:
>
>         AGAIN:   JSUB   FOO_RTN        ; return FOO in R0
>                  BEQL   AGAIN          ; try again if we
>                  TEST   R0             ;   get zero back
>                  JSUB   BAR_RTN        ; turn the FOO into a BAR

>     Although not without complications, it would seem an
>     excellent way to have a high branch delay slot fill ratio.
>     Comments?  Flames?

The issue is doing the test far enough in advance of the conditional
branch that the branch can use the results without stalls.
Most machines can set condition codes in one instruction and do a
conditional branch in the next following instruction.  Some machines
can do a compare-and-branch in one instruction.  No machine I've ever
seen can do a stall-free conditional branch based on the
results of the instruction after the branch!
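
The timing constraint can be put as a toy formula.  This is a sketch for a hypothetical single-issue pipeline, not any real machine; the one-cycle latency is an assumption for illustration:

```python
# Toy timing model: a test's result is ready one cycle after the test
# issues, and a conditional branch needs that result no later than the
# cycle the branch itself issues.

def branch_resolves_without_stall(test_issue_cycle, branch_issue_cycle):
    """True if the condition is ready by the time the branch needs it."""
    condition_ready = test_issue_cycle + 1
    return condition_ready <= branch_issue_cycle

# Test scheduled one cycle before the branch: no stall.
assert branch_resolves_without_stall(test_issue_cycle=0, branch_issue_cycle=1)
# Test placed AFTER the branch (in its delay slot): the branch would need
# the result before it exists.
assert not branch_resolves_without_stall(test_issue_cycle=2, branch_issue_cycle=1)
```
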
Put another way: as much as computer architects would like
pipestages whose results are available in advance of their execution,
such things are only found in science fiction...... (Perhaps people
recall the old Thiotimoline stories in Analog?  Thiotimoline was a
chemical that showed a reaction a few seconds BEFORE you poured water
on it...  These are like -5ns SRAMs, a commodity long sought after by
designers.)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

hankd@pur-ee.UUCP (Hank Dietz) (09/05/89)

In article <1432@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu.UUCP (John Hascall) writes:
>     I was thinking about filling the delay slot behind
>     a delayed branch and wondered if there has been any
>     work/research/thought/implementations that put the
>     branch instruction BEFORE the instruction that sets
>     the condition codes for the branch.  For example:

This is called "early targeting," "branch-target buffering," or simply
instruction prefetch.  It can easily be done on a number of machines; for
example, the CARP machine design REQUIRES the user to explicitly prefetch
each instruction block (it can only execute instructions from instruction
registers).

The problem is that if you don't know the condition when you initiate the
branch, then you have to consider the branch going both ways.  This is the
primary reason that this technique isn't very popular.  A more popular
alternative is to get around the one-condition-code-register code motion
constraint by having multiple condition registers (e.g., using general
registers for condition evaluation).
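
A minimal sketch of the multiple-condition-register idea.  The tiny interpreter and instruction tuples below are invented for illustration; `cmp_lt` is meant in the spirit of a MIPS-style set-on-less-than into a general register:

```python
# With a single condition-code register, hoisting both compares above both
# branches would make the second compare clobber the first result.  With
# general registers as condition registers, both results coexist, so the
# compiler is free to move the compares early.

def run(instrs):
    """Interpret a tiny instruction list; condition names are dict keys."""
    regs = {"r1": 3, "r2": 7}
    cond = {}
    taken = []
    for op, *args in instrs:
        if op == "cmp_lt":                 # cmp_lt dest, a, b
            dest, a, b = args
            cond[dest] = regs[a] < regs[b]
        elif op == "br_if":                # br_if cond, label
            name, label = args
            if cond[name]:
                taken.append(label)
    return taken

# Both compares hoisted early, each into its own condition register:
prog = [("cmp_lt", "c0", "r1", "r2"),
        ("cmp_lt", "c1", "r2", "r1"),
        ("br_if", "c0", "L_less"),
        ("br_if", "c1", "L_greater")]
assert run(prog) == ["L_less"]
```
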

						-hankd@ecn.purdue.edu

hascall@atanasoff.cs.iastate.edu (John Hascall) (09/05/89)

In article <26859> mash@mips.COM (John Mashey) writes:
}In article <1432> hascall@atanasoff.cs.iastate.edu.UUCP (John Hascall) writes:
}
}>
}>         AGAIN:   JSUB   FOO_RTN        ; return FOO in R0
}>                  BEQL   AGAIN          ; try again if we
}>                  TEST   R0             ;   get zero back
}>                  JSUB   BAR_RTN        ; turn the FOO into a BAR
 
}>     Although not without complications, it would seem an
}>     excellent way to have a high branch delay slot fill ratio.
 
}The issue is doing the test far enough in advance of the conditional
}branch that the branch can use the results without stalls.

    I was thinking of doing it the other way around.

}Put another way: as much as computer architects would like
}pipestages whose results are available in advance of their execution,
}such things are only found in science-fiction......

     No.  What I was alluding to was "starting down both paths" of the
     branch and then "dumping the loser".

     One (simple-minded?) way of doing this would be to have two
     parallel fetch&instruction-decoders in the pipe (perhaps with
     suitable memory interleaving).

     T     EXECUTE        F&I-D 1        F&I-D 2
     I     -------------  -------------  -------------
     M     BEQL AGAIN     TEST R0        -idle-
     E     TEST R0        JSUB BAR_RTN   JSUB FOO_RTN
     |     JSUB ???_RTN   ...            ...
     V
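
In Python pseudocode, the "dump the loser" step of the table above might look like this (instruction streams as plain lists, purely illustrative):

```python
# Two decoders fetch speculatively in parallel; when the condition
# resolves, the losing decoder's queue is discarded.

def resolve(decoder_fallthrough, decoder_taken, branch_taken):
    """Keep the winner's queue, dump the loser's."""
    winner = decoder_taken if branch_taken else decoder_fallthrough
    loser = decoder_fallthrough if branch_taken else decoder_taken
    loser.clear()                         # "dump the loser"
    return winner

d1 = ["TEST R0", "JSUB BAR_RTN"]          # fall-through path
d2 = ["JSUB FOO_RTN"]                     # taken path (back to AGAIN)
assert resolve(d1, d2, branch_taken=True) == ["JSUB FOO_RTN"]
assert d1 == []                           # the loser's queue was dumped
```
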


John Hascall
Signature void where prohibited by law.

johnl@esegue.segue.boston.ma.us (John R. Levine) (09/05/89)

In article <1437@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu.UUCP (John Hascall) writes:
>     No.  What I was alluding to was "starting down both paths" of the
>     branch and then "dumping the loser".

Back in 1969, the 360/91 prefetched both paths.  Since it had a prefetch
queue but no cache, it had to prefetch both in order to be sure it had
something to do after the branch.  As with a lot of the other stuff the /91
did (most notoriously executing instructions out of order and punting if one
of them caused an exception), people seem to have found other ways to get high
performance with less complication.  Much of the /91 technology was
inherited from the STRETCH, so it may have prefetched both ways as well.

Of course, "high performance" is relative.  The /91 I used had a 60ns cycle
time and 2MB of RAM, while the desktop machine on which I am typing this has
a 40ns cycle time and 4MB, soon to be 8MB, of RAM.  It has a cache, too.
-- 
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 492 3869
{ima|lotus}!esegue!johnl, johnl@ima.isc.com, Levine@YALE.something
Massachusetts has 64 licensed drivers who are over 100 years old.  -The Globe

slackey@bbn.com (Stan Lackey) (09/06/89)

In article <1437@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu.UUCP (John Hascall) writes:
>In article <26859> mash@mips.COM (John Mashey) writes:
>}In article <1432> hascall@atanasoff.cs.iastate.edu.UUCP (John Hascall) writes:
>}>         AGAIN:   JSUB   FOO_RTN        ; return FOO in R0
>}>                  BEQL   AGAIN          ; try again if we
>}>                  TEST   R0             ;   get zero back
>}>     Although not without complications, it would seem an
>}>     excellent way to have a high branch delay slot fill ratio.

>}Put another way: as much as computer architects would like
>}pipestages whose results are available in advance of their execution,
>}such things are only found in science-fiction......
>     No.  What I was alluding to was "starting down both paths" of the
>     branch and then "dumping the loser".

Another way is to use branch prediction; guess at the direction using
some algorithm (they range from "terrible" to "pretty good") and start
fetching instr's and operands.  At least you only prefetch one path
(single instruction cache port, single instruction decoder, etc).  You
need to be careful about doing things that are hard to undo when you
turn out to be wrong, though.

Other than in the Multiflow Trace, all the algorithms that I know
about have been implemented in hardware.  I wonder if there have been
studies done (or implementations?) of a more conventional architecture
where the branch instruction has information in it (inserted by the
compiler, possibly using runtime statistics) to tell the hardware
which way to predict the branch.
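
The compiler-hint idea might be sketched like this, with the predict-taken bit set from profile statistics (the data, addresses, and encoding here are invented for illustration):

```python
def set_hint_bits(profile):
    """profile maps a branch's address to its observed outcomes; set the
    predict-taken bit when the branch was taken at least half the time."""
    return {pc: sum(outcomes) * 2 >= len(outcomes)
            for pc, outcomes in profile.items()}

def mispredict_rate(profile, hints):
    total = wrong = 0
    for pc, outcomes in profile.items():
        for taken in outcomes:
            total += 1
            wrong += (taken != hints[pc])
    return wrong / total

profile = {0x100: [True, True, True, False],   # loop-closing branch
           0x200: [False, False, True]}        # error check, rarely taken
hints = set_hint_bits(profile)
assert hints[0x100] and not hints[0x200]
assert abs(mispredict_rate(profile, hints) - 2 / 7) < 1e-9
```
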

Most of the hardware algorithms work well for in-loop situations, where
prediction is done either by looking at the direction of the branch to
keep execution in the loop, or by caching recently-executed branches
and using history analysis.  I am wondering about cases like a long
piece of code (such as a system call) where state is tested to control
flow, and very little is in-loop.  Of course, lots of machines which
depend on an instruction cache are going to perform dismally on this
type of code anyway.
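
The "caching recently-executed branches" scheme is commonly realized as a saturating two-bit counter per branch; a sketch, with all details (initial state, table form) hypothetical:

```python
class TwoBitPredictor:
    def __init__(self):
        self.table = {}                 # branch_pc -> counter in 0..3

    def predict(self, pc):
        # Counters 2 and 3 mean "predict taken"; unseen branches start
        # at 1 (weakly not-taken).
        return self.table.get(pc, 1) >= 2

    def update(self, pc, taken):
        c = self.table.get(pc, 1)
        self.table[pc] = min(c + 1, 3) if taken else max(c - 1, 0)

p = TwoBitPredictor()
hits = 0
for taken in [True] * 9 + [False]:      # a loop branch: taken 9x, then exit
    hits += (p.predict(0x40) == taken)
    p.update(0x40, taken)
assert hits == 8                        # one warm-up miss, one exit miss
```
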
-Stan

ad@macbeth.cs.duke.edu (Apostolos Dollas) (09/08/89)

From article <1437@atanasoff.cs.iastate.edu>, by hascall@atanasoff.cs.iastate.edu (John Hascall):
> 
>      No.  What I was alluding to was "starting down both paths" of the
>      branch and then "dumping the loser".
> 
>      One (simple minded?) way of doing this would be to have two
>      parallel fetch&instruction-decoders in the pipe (perhaps with
>      suitable memory interleaving).
> 
>      T     EXECUTE        F&I-D 1        F&I-D 2
>      I     -------------  -------------  -------------
>      M     BEQL AGAIN     TEST R0        -idle-
>      E     TEST R0        JSUB BAR_RTN   JSUB FOO_RTN
>      |     JSUB ???_RTN   ...            ...
>      V
> 

A great idea!  In fact, we have been doing research in this area for the 
last several years.  We have developed a working prototype system with 
multiple instruction decoding units (IDUs) and have filed for a patent. 
The system uses program flow graph information to prefetch instructions
so that the IDUs are kept full at all times.  This extra hardware enables 
*all* branch instructions to execute without any breaks (bubbles) in the 
instruction pipeline.  As a result, the sustained performance of the 
execution unit is maximized.  

We are preparing an article for publication and will be glad to send
copies to anyone interested.  (We have not forgotten those of you who
requested more information in April!)

=========================================================================
        Robert F. Krick  		PHONE: (919) 684-3123 x61
    	Dept. of Electrical Eng. 	CSNET: rfk@dukee
    	Duke University			UUCP:  decvax!duke!dukee!rfk
        Durham, NC    27706  		ARPA:  rfk@dukee.egr.duke.edu

jmk@alice.UUCP (Jim McKie) (09/11/89)

To paraphrase from 'Architectural Innovations in the CRISP
Microprocessor', Spring COMPCON 87 Proceedings, by
Berenbaum, Ditzel and McLellan:

The CRISP CPU has 'branch folding': the Decoded Instruction Cache
contains a 'next address' field and also, in the case of conditional
branches, an 'alternate address' field.
Logic in the PDU (the prefetch and decode unit) recognises when a
non-branching instruction is followed
by a branching instruction and 'folds' the two instructions together.
This single instruction is then placed in the Decoded Instruction Cache.
The separate branch instruction disappears entirely from the execution
pipeline and the program behaves as if the branch were executed in
zero time.
The 'alternate address' field is used to hold the second possible 
next instruction address for folded conditional branch instructions.
When an instruction folded with a conditional branch is read from
the instruction cache, one of the two paths for the branch is
selected for the next instruction address, and the address that was
not used is retained with the instruction as it proceeds down
the execution pipeline. The alternate address field is retained with
each pipeline stage only until the logic can determine whether the 
selected branch path was correct or not. When the outcome of the 
branch condition is known, if the wrong next address was selected,
any instructions in progress in the pipeline following the conditional
branch are flushed and the alternate address from the folded
conditional branch is re-introduced as the next instruction address
at the beginning of the execution pipeline.
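
As I read the description above, the folding step might be sketched like this.  The entry fields and encoding below are invented for illustration, not the actual CRISP formats:

```python
# Fold each "op followed by conditional branch" pair into one decoded
# entry that carries both possible successors; the branch itself never
# enters the execution pipeline.

def fold(instructions):
    decoded, i = [], 0
    while i < len(instructions):
        op = instructions[i]
        nxt = instructions[i + 1] if i + 1 < len(instructions) else None
        if nxt and nxt[0] == "bcond":           # ("bcond", target_index)
            # "next" holds the taken target, "alt" the fall-through;
            # prediction selects one and the other rides along the pipe.
            decoded.append({"op": op, "next": nxt[1], "alt": i + 2})
            i += 2                              # the branch disappears
        else:
            decoded.append({"op": op, "next": i + 1, "alt": None})
            i += 1
    return decoded

prog = [("test", "r0"), ("bcond", 0), ("jsub", "BAR_RTN")]
folded = fold(prog)
assert len(folded) == 2                 # the branch folded away entirely
assert folded[0] == {"op": ("test", "r0"), "next": 0, "alt": 2}
```
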

Determining which of the two possible next addresses of a conditional
branch is likely to be taken is aided in CRISP with static branch
prediction. The encoding for the conditional branch instruction
contains a single branch prediction bit which can be set by the
compiler.
Branch prediction is useful when a conditional branch instruction
in the pipeline can alter the flow of instructions before the 
result of a comparison can be computed. If, however, there are no
compare instructions in the pipeline then there is no need for
branch prediction. Since only the compare instruction may set the
condition code flag, the outcome of the conditional branch is known
with certainty.
CRISP intentionally has separate compare and conditional branch 
instructions so that a compiler or optimizer may insert instructions
between the compare and conditional branch (this is 'branch spreading').
By combining branch folding and code motion in CRISP there can be
no cost at all for either conditional or unconditional branches.
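
A naive sketch of that code motion (instruction tuples invented for illustration; a real scheduler would also check read dependences, which this one ignores):

```python
# "Branch spreading": move instructions that don't write the compared
# register into the gap between the compare and its conditional branch,
# so the condition is resolved by the time the branch issues.

def spread_branch(cmp_reg, branch, candidates):
    """Hoist candidates that neither write cmp_reg nor are branches."""
    fillers = [c for c in candidates if c[0] != "bcond" and c[1] != cmp_reg]
    rest = [c for c in candidates if c not in fillers]
    return [("cmp", cmp_reg)] + fillers + [branch] + rest

sched = spread_branch("r0", ("bcond", "AGAIN"),
                      [("add", "r1"), ("mul", "r0"), ("sub", "r2")])
assert sched == [("cmp", "r0"), ("add", "r1"), ("sub", "r2"),
                 ("bcond", "AGAIN"), ("mul", "r0")]
```
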

Jim McKie	research!jmk -or- jmk@research.att.com

rfk@dukee.egr.duke.edu (Robert F Krick) (09/13/89)

Indeed a variety of techniques have been incorporated in CRISP to reduce
but not eliminate the performance degradation caused by conditional branch 
instructions in an instruction pipeline.  CRISP or any other architecture 
with a single prefetch and decode pipeline cannot *guarantee* that the 
sustained performance of the execution unit will equal its theoretical 
maximum performance for any program.  

From article <9881@alice.UUCP>, by jmk@alice.UUCP (Jim McKie):
> 
> To paraphrase from 'Architectural Innovations in the CRISP
> Microprocessor', Spring COMPCON 87 Proceedings, by
> Berenbaum, Ditzel and McLellan:
> 
> The CRISP CPU has 'branch folding': the Decoded Instruction Cache
> contains a 'next address' field and also, in the case of conditional
> branches, an 'alternate address' field.
> Logic in the PDU recognises when a non-branching instruction is followed
> by a branching instruction and 'folds' the two instructions together.
> This single instruction is then placed in the Decoded Instruction Cache.
> The separate branch instruction disappears entirely from the execution
> pipeline and the program behaves as if the branch were executed in
> zero time.
Clearly branch folding is an excellent hardware solution to relieve the
execution unit of the trivial operation of determining the new program
counter.  However, this scheme breaks down for consecutive conditional
branch instructions!

> The 'alternate address' field is used to hold the second possible 
> next instruction address for folded conditional branch instructions.
> When an instruction folded with a conditional branch is read from
> the instruction cache, one of the two paths for the branch is
> selected for the next instruction address, and the address that was
> not used is retained with the instruction as it proceeds down
> the execution pipeline. The alternate address field is retained with
> each pipeline stage only until the logic can determine whether the 
> selected branch path was correct or not. When the outcome of the 
> branch condition is known, if the wrong next address was selected,
> any instructions in progress in the pipeline following the conditional
> branch are flushed and the alternate address from the folded
> conditional branch is re-introduced as the next instruction address
> at the beginning of the execution pipeline.
The flush of the instruction pipeline in this scheme results from only 
retaining the alternate *address* field and not the alternate *instruction* 
through the instruction pipeline.  Whenever the instruction pipeline must 
be flushed, the performance of the execution unit falls below its theoretical 
maximum.

> Determining which of the two possible next addresses of a conditional
> branch is likely to be taken is aided in CRISP with static branch
> prediction. The encoding for the conditional branch instruction
> contains a single branch prediction bit which can be set by the
> compiler.
By definition, branch prediction (either static or dynamic) cannot always 
guess the direction of the branch.  Furthermore, the need for branch 
prediction highlights the fact that branch folding cannot guarantee that 
the performance of the execution unit will not be degraded.

> Branch prediction is useful when a conditional branch instruction
> in the pipeline can alter the flow of instructions before the 
> result of a comparison can be computed. If, however, there are no
> compare instructions in the pipeline then there is no need for
> branch prediction. Since only the compare instruction may set the
> condition code flag, the outcome of the conditional branch is known
> with certainty.
> CRISP intentionally has separate compare and conditional branch 
> instructions so that a compiler or optimizer may insert instructions
> between the compare and conditional branch (this is 'branch spreading').
> By combining branch folding and code motion in CRISP there can be
> no cost at all for either conditional or unconditional branches.
Statistics have shown that changes in program flow occur as frequently 
as every five instructions (on average!), which limits the benefit of 
'branch spreading'.  Furthermore, I believe that the technique of 'branch 
spreading' can be shown to be equivalent (at best) to filling the delay 
slots associated with delayed branches (generally acknowledged to be 
successful only about 70% of the time for the first delay slot and much 
less often for subsequent slots).

> 
> Jim McKie	research!jmk -or- jmk@research.att.com

Last but not least, any cache misses which occur when prefetching an 
instruction will idle the execution unit as the instruction is brought 
from the slower main memory.

In contrast, we have developed an architecture and have constructed a 
functional prototype with multiple instruction prefetch and decode units 
to guarantee that the next instruction will *always* be prefetched and 
decoded for execution even in the presence of conditional branch instructions.
Furthermore, i-cache misses are completely eliminated by making use of 
a graph of the program flow.  As a result, the execution unit never
has to wait for a decoded instruction when executing any program.

=========================================================================
        Robert F. Krick  		PHONE: (919) 684-3123 x61
    	Dept. of Electrical Eng. 	CSNET: rfk@dukee
    	Duke University			UUCP:  decvax!duke!dukee!rfk
        Durham, NC    27706  		ARPA:  rfk@dukee.egr.duke.edu

	  "When the branching gets conditional, the prefetching 
	   gets tough, and the tough get going!"


rod@venera.isi.edu (Rodney Doyle Van Meter III) (09/19/89)

I came in in the middle of this conversation, but it seems to me that
I was once told that a relatively old machine (early '60s) had a user-
or compiler-settable bit in conditional branch instructions to
indicate which was the more likely path.  It keeps your hardware simple
(you only need one prefetch unit) but complicates life for the software
guys a bit.  It's an easy guess in loops; a little harder in IF
statements.

		--Rod