[comp.arch] Branch prediction in the 532

hansen@mips.UUCP (Craig Hansen) (04/21/87)

I've been reading two publications from National, one the infamous "Series
32000: NS32532 Competitive Comparison" that lists the eight separate patent
applications we've already heard about, the other titled "Series 32000:
NS32532 Technical Introduction."

The feature that most caught my eye was that of branch prediction,
for which I quote from the "Technical Introduction":

	The selection is based on the branch condition and
	direction; forward branches are predicted as taken,
	backward branches as non-taken.

Well, I don't see how the branch prediction can depend on the branch
condition, since waiting for the condition corresponds to no branch
prediction at all.  (Certainly novel and non-obvious...)

But when I was a young sprout, they always told me that backward branches
often represented loops and were most frequently taken. Forward branches
often represent conditional expressions, and are more ambivalent. As far as
I can see from the description, there is no prediction based on past
behaviour from previous invocations of the branch instruction.

If the description in National's literature is entirely accurate, the branch
prediction scheme used is seriously broken and should predict the wrong
answer more often than the right answer. Then again, if National hasn't seen
silicon yet, and uses the opportunity to fix it now, I may have just saved
them a few tens of thousands of dollars in wafer costs and hundreds in
otherwise lost revenue....
-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen

amos@instable.UUCP (04/22/87)

Looks like the person who wrote the brochure about the 32532 didn't
look at the spec; anyway it's certain that:

1. The whole point of branch prediction is not waiting for the condition
  codes; therefore the prediction cannot, and does not, rely on them.

2. Forward branches are predicted as not taken, backward branches are
  predicted as taken, no matter who says otherwise. (We have tried
  several methods of state machines, caches, etc. and found out that
  this simple-minded method had the best ratio of relative performance
  increase vs. cost in chip area and overhead.)
-- 
	Amos Shapir
National Semiconductor (Israel)
6 Maskit st. P.O.B. 3007, Herzlia 46104, Israel  Tel. (972)52-522261
amos%nsta@nsc.com {hplabs,pyramid,sun,decwrl} 34.48'E 32.10'N

robison@uiucdcsb.cs.uiuc.edu (04/22/87)

Are there any good papers on branch prediction?  I've seen two schemes:

	1) Put an extra bit in the instruction indicating which way the branch
	   will probably go.  This is a compile-time prediction.

	2) Keep a record of past branches, i.e. run-time prediction.   

Though I've heard of examples of machines with each scheme, I've never
seen any studies of their efficiency.

Arch D. Robison
University of Illinois at Urbana-Champaign
	
CSNET: robison@UIUC.CSNET
UUCP: {ihnp4,pur-ee,convex}!uiucdcs!robison
ARPA: robison@B.CS.UIUC.EDU (robison@UIUC.ARPA)

roger@nsc.nsc.com (Roger Thompson) (04/23/87)

In article <324@dumbo.UUCP>, hansen@mips.UUCP (Craig Hansen) writes:

> 
> 	The selection is based on the branch condition and
> 	direction; forward branches are predicted as taken,
> 	backward branches as non-taken.
> 
YES the brochure has misplaced a "non" in the above paragraph.
Reverse the usage.

> Well, I don't see how the branch prediction can depend on the branch
> condition, since waiting for the condition corresponds to no branch
> prediction at all.  (Certainly novel and non-obvious...)

The condition factor that is used is that which is specified by the
opcode, not the actual result from the operation being tested.  The
concept is that the predictor has an algorithm that has multiple
inputs.  Under certain conditions, the predictor determines one answer or
the other.  The prediction should not be confussed with the result.
Our prediction algorithm is accurate to 70 - 80%.

> 
> But when I was a young sprout, they always told me that backward branches
> often represented loops and were most frequently taken. Forward branches
> often represent conditional expressions, and are more ambivalent. 
> 
Agreed.

shebanow@ji.Berkeley.EDU (Mike Shebanow) (04/23/87)

In article <165100008@uiucdcsb> Arch Robison writes:
>
>Are there any good papers on branch prediction?  I've seen two schemes:
>
>	1) Put an extra bit in the instruction indicating which way the branch
>	   will probably go.  This is a compile-time prediction.
>
>	2) Keep a record of past branches, i.e. run-time prediction.   
>
>Though I've heard of examples of machines with each scheme, I've never
>seen any studies of their efficiency.

There have been many good papers on the subject. Here is a partial
bibliography:

	(Note: ISCA = International Symposium on Computer Architecture,
	IBM TDB = IBM Technical Disclosure Bulletin)

[1] S. McFarling and J. Hennesy, "Reducing the Cost of Branches," Proc.
    13th ISCA, June 1986, pp. 396-403.

[2] J. K. Lee and A. J. Smith, "Branch Prediction Strategies and Branch Target
    Buffer Designs," IEEE Computer, vol. 17, #1, Jan 1984.

[3] J. E. Smith, "A Study of Branch Prediction Strategies," 8th ISCA, May 1981,
    pp. 135-148.

[4] J. Losq, "Generalized History Table for Branch Prediction," IBM TDB,
    vol 25, #1, June 1982, pp. 99-101.

[5] A. Liles Jr. and B. E. Wilner, "Branch Prediction Mechanism." IBM TDB,
    vol 22, #7, Dec 1979, pp. 3013-3016.

[6] G. Rao, "Technique for Minimizing Branch Delay Due to Incorrect Branch
    History Table Predictions," IBM TDB, vol 25, #1, June 1982, pp. 97-98.

Based on the results published, the best prediction algorithms use run-time
information. As developed by IBM, a branch target buffer is kept in the
machine (BTB). The BTB is an associative table (direct mapped, n-way set
associative, fully associative, whatever) which uses the address of the
BRANCH as an index. Associated with the branch address are two pieces
of information (usually):

	- target address of the branch (where to go if branch taken -- OR,
	    perhaps the first few instructions at the target address (eg.,
	    AM29000))
	- some prediction information (see below).

The type of prediction information used depends on the algorithm. A simple
scheme which works well is to use a 2 bit counter. The counter is incremented
(until it hits the value 3) when the branch is taken, and decremented (until
it hits the value 0) when the branch is not taken. When a branch is encountered,
it is predicted taken if the counter value is 2 or 3, and predicted not taken
if the counter value is 0 or 1. Lee and Smith describe a more complex scheme
in their paper which provides even better accuracy. Using these runtime
prediction algorithms, accuracies better than 90% have been achieved. With
the opcode method National is using, accuracies in the 70-80% are usual (as
they report).

I want to point out that accuracy is a poor indicator of performance. A much
better measure (for any particular architecture) is instruction block size
(IBS). IBS is defined to be the average number of instructions executed
between branch prediction mistakes. If we have a pipeline of depth 'd',
then an !APPROXIMATE! measure of the efficiency of this pipe (in the presence
of branch stalls, ignoring other types of stalls) is (IBS/(IBS + d)). For
small pipe depths, IBS tends not to matter. For deep pipes though, a small
value for IBS will kill you. I would be curious if anyone on comp.arch 
has these numbers.

Mike Shebanow

marc@ucla-cs.UUCP (04/23/87)

In article <165100008@uiucdcsb> robison@uiucdcsb.cs.uiuc.edu writes:
>
>Are there any good papers on branch prediction?  I've seen two schemes:
>
>	1) Put an extra bit in the instruction indicating which way the branch
>	   will probably go.  This is a compile-time prediction.
>
>	2) Keep a record of past branches, i.e. run-time prediction.   
>
>Though I've heard of examples of machines with each scheme, I've never
>seen any studies of their efficiency.
>
>Arch D. Robison
>University of Illinois at Urbana-Champaign
>	

The best paper I have read about branch prediction is:

"Branch Prediction Strategies and Branch Target Buffer Design"
Johnny K. F. Lee, Hewlett-Packard
Alan Jay Smith, UCB.
Computer (IEEE), January 1984, pp 6-22.

Also you might want to look at the architecture issues concerning the
Branch Target Cache of the Am29000, they claim a 60% hit ratio.

			Enjoy,
					Marc Tremblay
					UCLA Computer Science Dept.

kck@dana.UUCP (Ken Klingman) (04/24/87)

> 
> 2. Forward branches are predicted as not taken, backward branches are
>   predicted as taken, no matter who says otherwise. (We have tried
>   several methods of state machines, caches, etc. and found out that
>   this simple-minded method had the best ratio of relative performance
>   increase vs. cost in chip area and overhead.)
> -- 
> 	Amos Shapir

I kind of like the way branch prediction works on the Ridge 32 (and 3200)
architecture.  The conditional branch instructions have a branch predic-
tion bit.  If set, the instruction pre-fetch logic starts fetching from
the target of the branch; if not set, the pre-fetch logic continues
fetching along the same instruction stream.

This allows code generators and assembly language programmers to predict
whether or not branches will be/won't be taken.  In general the Ridge
compilers adopt the same philosophy of predict backward branches and don't
predict forward branches, but all the assembly coded library routines like
strcpy and friends make extensive use of it.

One iteresting possibility when the code generator has control over branch
prediction is how to predict if-then-else constructs.  Do you predict the
then and not predict the else?  Or do you predict the else and not the then?
Or maybe even more conterintuitively, maybe you predict/don't predict both?
I never had time to run an analysis to see which possibility was best.

Ken Klingman
Dana Computer
hplabs!dana!kck

firth@sei.cmu.edu (Robert Firth) (04/24/87)

In article <18537@ucbvax.BERKELEY.EDU> shebanow@ji.Berkeley.EDU.UUCP (Mike Shebanow) writes:
...
>Based on the results published, the best prediction algorithms use run-time
>information. As developed by IBM, a branch target buffer is kept in the
>machine (BTB). The BTB is an associative table (direct mapped, n-way set
>associative, fully associative, whatever) which uses the address of the
>BRANCH as an index. Associated with the branch address are two pieces
>of information (usually):
>
>	- target address of the branch (where to go if branch taken -- OR,
>	    perhaps the first few instructions at the target address (eg.,
>	    AM29000))
>	- some prediction information (see below).

Mike, could you please elaborate on this, because after 2 days I find
myself no nearer to understanding it.  The purpose of the "prediction"
information seems to be to allow the processor to prefetch whichever
of the branch continuations (successor or destination) is more likely
to be needed.

But if you have a branch cache, surely BOTH continuations are available
already - one in the normal pipeline and the other in the cache.  So
who cares which is needed? - and why bother to predict it?

shebanow@ji.Berkeley.EDU (Mike Shebanow) (04/25/87)

In article <1123@aw.sei.cmu.edu> firth@bd.sei.cmu.edu.UUCP writes:
>Mike, could you please elaborate on this, because after 2 days I find
>myself no nearer to understanding it.  The purpose of the "prediction"
>information seems to be to allow the processor to prefetch whichever
>of the branch continuations (successor or destination) is more likely
>to be needed.
>
>But if you have a branch cache, surely BOTH continuations are available
>already - one in the normal pipeline and the other in the cache.  So
>who cares which is needed? - and why bother to predict it?

Sure. For example, consider an instruction sequence like this:

(#1)	op1	a, b	{some don't care operation}
(#2)	op2	b, c	{some don't care operation}
(#3)	op3	a, c	{this one sets the condition codes}
(#4)	bequ	loc	{use the 'zero' bit in the condition codes}
(#5)	.....

Now assume that this is to run on a machine implemented using a 4 stage
pipe, with instruction #1 at the back of the pipe:

    +---------------+---------------+---------------+---------------+
--->|   prefetch    |    decode	    |    execute    |    retire	    |--->
    +---------------+---------------+---------------+---------------+

cycle #
1	   (#1)
2	   (#2)		   (#1)
3	   (#3)		   (#2)		   (#1)
4	   (#4)		   (#3)		   (#2)		   (#1)

The question is what to do now, load in the target of the branch (instruction
at loc) or load in the instruction following the branch (#5)?:

5	   (loc)	   (#4)		   (#3)		   (#2)
OR!!!
5	   (#5)		   (#4)		   (#3)		   (#2)

You don't know which, as the branch depends on instruction #3, which is in
decode.  Without prediction, the pipeline would have to stall until #3
executes (at least). With prediction, the pipeline is kept full. The downside
of branch prediction is if you predict wrong. In that case, all instructions
fetched after the predicted branch must be flushed.

sorry for the long winded explanation,

Mike Shebanow

jfh@killer.UUCP (04/28/87)

In article <324@dumbo.UUCP>, hansen@mips.UUCP (Craig Hansen) writes:
[ Stuff ]
> 
> The feature that most caught my eye was that of branch prediction,
> for which I quote from the "Technical Introduction":
> 
> 	The selection is based on the branch condition and
> 	direction; forward branches are predicted as taken,
> 	backward branches as non-taken.
> 
[ Alot more stuff ]
I too had wondered about this branch prediction mess, and I agree with this
poster about this being the WRONG way to do things.  I was prepared to make
the argument that trying to find the branch direction would require fetching
the offset (assuming relative branches), which would obligate the processor
to a possibly unneeded fetch, along with running this offset through some logic
(one or two more micro instructions) to figure out the direction, which
sounded like a real loser.  However, I suspect that someone at AMD would tell
me how wonderfully pipelined their machine was and so on ...

[ Now for the revised posting ]

Craig stated that, like the rest of us, he was raised to believe that
backwards branches make nice branches after while loops and such, and
forward branches make nice branches before if-thens.  Has anyone bothered
to think about adding branch instructions that tell the CPU explicitly
whether or not they expect to be taken?  For example,

	MBLEQ - most of the time branch less or equal
	SBLEQ - seldom branch less or equal

Compilers would be able to pick the instructions they wanted.  (Speaking of
patents, if this idea is patentable and has not yet been so, anyone but
Nat-Semi can have it ... :-) :-) so long as I get credit ... :-) :-)

- John		(jfh@killer.UUCP)

Disclaimer:
	No disclaimer.  Whatcha gonna do, sue me?

pop@mtu.UUCP (05/01/87)

In article <5662@shemp.UCLA.EDU> marc@CS.UCLA.EDU (Marc Tremblay) writes:
>In article <165100008@uiucdcsb> robison@uiucdcsb.cs.uiuc.edu writes:
>>
>>Are there any good papers on branch prediction?  I've seen two schemes:
>>
>>	1) Put an extra bit in the instruction indicating which way the branch
>>	   will probably go.  This is a compile-time prediction.
>>
>>	2) Keep a record of past branches, i.e. run-time prediction.   
>>
>>Though I've heard of examples of machines with each scheme, I've never
>>seen any studies of their efficiency.
>>
>>Arch D. Robison
>>University of Illinois at Urbana-Champaign
>>	

You might also check out my article about branch prediction which combines
the two:

%A D. A. Poplawski
%T Low cost branch prediction
%J Proceedings of the Twenty-Third Allerton Conference on Communication,
   Control and Computing
%P 979-983
%D October, 1985

Basically the idea is to monitor the execution of a program for a (very) few
data sets, recording the number of times each unique branch instruction
either branched and didn't branch, then predict all future executions of the
same branch instruction according to the dominate outcome.  It turns out that
while there are always a few branch instructions which are unpredictable (i.e,
branch half the time and don't branch half the time), the VAST majority are
very predictible, and from my observations of programs like the C compiler,
which you probably wouldn't think has very predictable branching behavior,
the prediction accuracy is very high, even when just one sample of the
execution is made and then used on other VERY different data sets.
Basically the prediction accuracies are comparable with dynamic (i.e.,
expensive) schemes while the implementation cost is comparable with static
(i.e., cheap) schemes.

lm@cottage.WISC.EDU (Larry McVoy) (05/01/87)

In article <269@mtu.UUCP> pop@mtu.UUCP (Dave Poplawski) writes:
>In article <5662@shemp.UCLA.EDU> marc@CS.UCLA.EDU (Marc Tremblay) writes:
>>In article <165100008@uiucdcsb> robison@uiucdcsb.cs.uiuc.edu writes:
>>>
>>>Are there any good papers on branch prediction?  I've seen two schemes:

I don't know whether anyone mentioned this one or not, but a very complete
paper is 

    A Study of Branch Prediction Strategies by J.E. Smith, appeared
    in IEEE ??? in 1981.

7 different strategies, 6 "real", static [i.e., all branches taken], and
dynamic [i.e., cache recently taken branches] are explained, simulated, and
the results compared.  The bottom line from this paper is that just about ANY
branch prediction is a win (lowest is 35 % hit ratio, with average in the
80's) and that some very simple branch predictions are extremely effective.

The one I liked the best is a dynamic one that maintains an extra bit in
the icache that is set iff the cache entry is a branch instruction that was
taken on the last execution.  Quite easy to implement, stingy on chip real
estate, and on the average was right about 90.3% of the time (highest 98.9,
lowest was 76.1).

Mail me if you need more.
Larry McVoy 	        mcvoy@rsch.wisc.edu  or  uwvax!mcvoy

"What a wonderful world it is that has girls in it!"

adb@homxa.UUCP (A.BERENBAUM) (05/06/87)

In article <809@killer.UUCP>, jfh@killer.UUCP (John Haugh) writes:
> [ Stuff ]
> Has anyone bothered to think about adding branch instructions that tell
> the CPU explicitly whether or not they expect to be taken?  For example,
> 
> 	MBLEQ - most of the time branch less or equal
> 	SBLEQ - seldom branch less or equal
> 
> Compilers would be able to pick the instructions they wanted.  (Speaking of
> patents, if this idea is patentable and has not yet been so, anyone but
> Nat-Semi can have it ... :-) :-) so long as I get credit ... :-) :-)
> 
> - John		(jfh@killer.UUCP)

The CRISP Microprocessor has a single flag bit (call it True/False), and four
conditional branch instructions: 
	if True jump, predicted taken
	if True jump, predicted not taken
	if False jump, predicted taken
	if False jump, predicted not taken
I think this covers what you ask for, but I'm sorry, you don't get credit.
There's a paper in the 1987 COMPCON about compiling for CRISP that discusses
how a compiler can exploit the prediction bits (it can do quite well -- we get
correct prediction of nearly 90% as I recall).

	Alan Berenbaum, AT&T-something-or-other
	research!adb

pdg@ihdev.ATT.COM (Joe Isuzu) (05/06/87)

In article <809@killer.UUCP> jfh@killer.UUCP (John Haugh) writes:
>Craig stated that, like the rest of us, he was raised to believe that
>backwards branches make nice branches after while loops and such, and
>forward branches make nice branches before if-thens.  Has anyone bothered
>to think about adding branch instructions that tell the CPU explicitly
>whether or not they expect to be taken?  For example,
>
>	MBLEQ - most of the time branch less or equal
>	SBLEQ - seldom branch less or equal

The AT&T CRISP processor does this, with instructions like

	ifFjmpn, ifFjmpy (if false, expect to jump no or yes)
	ifTjmpn, ifTjmpy (if true, expect to jump no or yes)

These take 0 cycles for a correct prediction, and a max of 3 cycles
for an incorrect prediction (-1 cycle per instruction between the
compare and the branch).

For more info, see any of the CRISP articles from IEEE Compcon '87.
-- 

Paul Guthrie
ihnp4!ihdev!pdg			This Brain left intentionally blank.

csg@pyramid.UUCP (Carl S. Gutekunst) (05/10/87)

>> Has anyone bothered to think about adding branch instructions that tell
>> the CPU explicitly whether or not they expect to be taken?  For example,
>> Compilers would be able to pick the instructions they wanted.
>
>The CRISP Microprocessor has a single flag bit (call it True/False), and four
>conditional branch instructions: 

The Pyramid CPU uses a similar arrangement, with two instructions: "Branch on
False Condition" and "Branch on True Condition." Each has the usual assortment
of condition code tests. I never remember which assumes branch taken and which
assumes it's not; fortunately the compilers don't get similarly confused. :-)

<csg>