[comp.arch] Branch Delay Annullment

tim@amdcad.AMD.COM (Tim Olson) (06/15/88)

In a recent posting discussing the Motorola 88000 architecture, the
section on branching contained the following information:

| Which forms of branch delay are present in instruction set
| [execute N if no branch, execute N if branch, execute N always]?
| 	execute 1 always and execute 1 if no branch
| What are the taken and not-taken cycle counts for each branch type,
| not including the N delayed instructions, if executed?
| 	execute 1 always: 1 cycle, taken or not
| 	execute 1 if no branch: 1 cycle untaken, 2 cycles taken

The last statement says that branch delay annullment ("squashing") takes
place if the branch is taken, causing the branch to take 2 cycles.

This is the opposite of what the SPARC annulled branch does -- it
squashes untaken branches.  Squashing the untaken branches seems more
effective to me.  Take, for example, a simple loop:

loop:
	load	r0, addr
	add	r0, r0, 1
	store	r0, addr
	add	addr, addr, 4
	add	count, count, 1
	cpge	bool, count, MAX
	  jmpf	bool, loop
	  nop

With untaken branch-delay squashing, we can rewrite this as

	load	r0, addr
loop:
	add	r0, r0, 1
	store	r0, addr
	add	addr, addr, 4
	add	count, count, 1
	cpge	bool, count, MAX
	  jmpf	bool, loop		/* squashed on fall-through */
	  load	r0, addr

and perform the load for the subsequent loop iteration in the delay slot
of the jump.  Since loops are usually executed many times, the
annul-untaken form would seem to give the best overall performance.

Any thoughts as to the benefits of annul-taken form?

	-- Tim Olson
	Advanced Micro Devices
	(tim@delirun.amd.com)

baum@Apple.COM (Allen J. Baum) (06/15/88)

[]
>In article <22065@amdcad.AMD.COM> tim@amdcad.AMD.COM (Tim Olson) writes:
>...this says that m88000 branch delay annullment ("squashing") takes
>place if the branch is taken, causing the branch to take 2 cycles.
>
>This is the opposite of what the SPARC annulled branch does -- it
>squashes untaken branches.  Squashing the untaken branches seems more
>effective to me. (..example showing top of loop duplicated in branch shdw)
>
>Any thoughts as to the benefits of annul-taken form?

In the looping case, obviously squash-on-non-taken works. However, in the
standard if-then-else case (which can occur inside a loop remember, so it
can occur even more frequently than loop-closers dynamically) just the
opposite is true. The HP Precision squashes according to the sign of the
branch displacement in conditional branches. Backwards branches (loop
closers) squash on non-taken, forward branches (if-then-else types, typically)
squash on taken. The best of both worlds, I think.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

tom@nud.UUCP (Tom Armistead) (06/15/88)

In article <22065@amdcad.AMD.COM> tim@amdcad.AMD.COM (Tim Olson) writes:

[Concerning 88k branch delay slot handling ]

>This is the opposite of what the SPARC annulled branch does -- it
>squashes untaken branches.  Squashing the untaken branches seems more
>effective to me.  Take, for example, a simple loop:
>	load	r0, addr
>loop:
>	add	r0, r0, 1
>	store	r0, addr
>	add	addr, addr, 4
>	add	count, count, 1
>	cpge	bool, count, MAX
>	  jmpf	bool, loop		/* squashed on fall-through */
>	  load	r0, addr
>of the jump.  Since loops are usually executed many times, the
>annul-untaken form would seem to give the best overall performance.
>Any thoughts as to the benefits of annul-taken form?

   The same loop can be written in 88K asm as:  (I'm not familiar with
SPARC code so I hope this is equivalent - it illustrates the point 
anyway).

(addr, count, bool are registers I presume.)

loop:
	ld	r2,addr,0
	add	r2,r2,1
	st	r2,addr,0
	add	count,count,1
	cmp	bool,count,MAX
	bb0.n	eq,bool,loop		; This branch effectively takes
	add	addr,addr,4		; only 1 tick.


    The number of loop instructions is equivalent in either the 
annul-taken form or the always executed form (for this example
anyway).  The only slight difference is that no "cleanup" instruction
was required in the always executed form.  There might be some instances
where annul-taken is better but I don't know of any specific ones. 

    As an aside, the 88k has some addressing modes which will allow the above
code to be written more efficiently as:

	add	count,r0,MAX-1
loop:
	ld	r2,addr[count]
	add	r2,r2,1
	st	r2,addr[count]
	bcnd.n	ne0,count,loop		; This branch effectively takes
	sub	count,count,1		; only 1 tick.
-- 
Just a few more bits in the stream.

The Sneek

daryl@hpcldko.HP.COM (Daryl Odnert) (06/16/88)

The Hewlett Packard Precision Architecture uses the following approach:
When a branch instruction is executed, the delay slot instruction executed
before control flow is transferred to the branch target.  A bit is provided
in each branch instruction, however, that can cause the nullification of
the delay slot instruction.  Exactly what happens depends on the direction
of the branch and whether or not it is a conditional branch.

       When nullification is selected in a branch instruction...

   Branch type         Branch direction    Branch taken?   Is the delay slot
							   instruction
							   executed?
   conditional         forward             yes                    no
   conditional         forward             no                     yes
   conditional         backward            yes                    yes
   conditional         backward            no                     no
   unconditional       forward             always                 no
   unconditional       backward            always                 no


Daryl Odnert
HP Computer Language Lab
hplabs!hpcllcm!daryl

markhall@pyramid.pyramid.com (Mark Hall) (06/16/88)

In article <12211@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>[]
>>In article <22065@amdcad.AMD.COM> tim@amdcad.AMD.COM (Tim Olson) writes:
>>
>>Any thoughts as to the benefits of annul-taken form?
>
>In the looping case, obviously squash-on-non-taken works. However, in the
>standard if-then-else case (which can occur inside a loop remember, so it
>can occur even more frequently than loop-closers dynamically) just the
							       ^^^^^^^^
>opposite is true. 
 ^^^^^^^^^^^^^^^^

Let's see if I have this right.  I assume when dynamic annulment happens
it means ``this branch took the unexpected path''.  For a loop then,
branching out is the unexpected thing, and so, using squash-on-not-taken,
we annul on exit only.  Fine.

By saying ``squash-on-taken'' works for if-then-else statements, you 
imply that it is unexpected to branch to the else part.  I can live 
with that.

But doesn't ``squash-on-taken'' also mean we annul when branching
around a simple IF statement (with no ELSE part)?  Do you feel
that it is unexpected to branch around such IF's?  You don't really
say so in your posting, and I am just curious.  Given the number of 
such IF-THENs I see in code that do fatal error checking
(like malloc returning NULL),  I think there would be a LOT of 
dynamic annulment for squash-on-taken.  

Anyone have any simulations or traces showing whether it is more 
common for branches to be taken or not taken?  Is it very lopsided
one way or the other, or is it about even? How about stats on
the number of IF-THENs in source code as opposed to IF-THEN-ELSEs?

-Mark Hall (smart mailer): markhall@pyramid.pyramid.com
	   (uucp paths  ): {decwrl|sun|seismo|lll-lcc}!pyramid!markhall

tim@amdcad.AMD.COM (Tim Olson) (06/16/88)

In article <1081@nud.UUCP> tom@nud.UUCP (Tom Armistead) writes:
|     The number of loop instructions is equivalent in either the 
| annul-taken form or the always executed form (for this example
| anyway).  The only slight difference is that no "cleanup" instruction
| was required in the always executed form.  There might be some instances
| where annul-taken is better but I don't know of any specific ones. 

I agree -- delayed branches are a win for the cases in which you can
find a delay instruction which is beneficial in both cases, or is
beneficial in the most-taken direction and is "innocuous" in the other. 
The case I was discussing was when one of these could not be found.  In
that case, an "annul-untaken" jump would allow you to always fill the
branch delay slot in a loop.

This is admittedly infrequent (we didn't add annulling branches in the
29000, because we felt it wouldn't buy much, and could cause
restartability problems when a trap occured between a jump and its
annulled delay slot), I was just wondering what was the intended use of
the "annul-taken" form.  It would seem to benefit "if" statements the
most.

	-- Tim Olson
	Advanced Micro Devices
	(tim@delirun.amd.com)

aglew@urbsdc.Urbana.Gould.COM (06/16/88)

..> Squashing

Let's see, the full set is squash-never, squash-always,
squash-if-taken, squash-if-not-taken. HP Precision has 
squash-depending-on-sign-of-displacement.

Was just reading the latest IEEE MICRO, with the ARM chip and
its conditions on every instruction. If you are willing to waste
the bits on every instruction, and the condition in the branch
can be evaluated fast enough, then the conditions on every 
instruction give you all the squashing you need in the delay slot(s).

baum@Apple.COM (Allen J. Baum) (06/17/88)

[]
>In article<27527@pyramid.pyramid.com>markhall@pyramid.UUCP (Mark Hall) writes:
>In article <12211@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>>
>>In the looping case, obviously squash-on-non-taken works. However, in the
>>standard if-then-else case (which can occur inside a loop remember, so it
>>can occur even more frequently than loop-closers dynamically) just the
>							       ^^^^^^^^
>>opposite is true. 
> ^^^^^^^^^^^^^^^^
>
>....But doesn't ``squash-on-taken'' also mean we annul when branching
>around a simple IF statement (with no ELSE part)?  Do you feel
>that it is unexpected to branch around such IF's?  You don't really
>say so in your posting, and I am just curious.  Given the number of 
>such IF-THENs I see in code that do fatal error checking
>(like malloc returning NULL),  I think there would be a LOT of 
>dynamic annulment for squash-on-taken.  
>

Oops. Its been so long since I've looked at this that I forgot about a second 
part of the argument. You're absolutely right that there is a lot of code of
the form:     If (error-condition( {handle it};  and that this code is executed
infrequently. If this is the case, it would pay to have this code be executed
out of line; that is, you would like to branch to it, and then branch back
to the main line code. The most frequently executed code is then straight-line,
a real win in situtations where you benefit from bust mode i-fetches. 
This costs nothing in the frequent case, possibly saving some cache misses,
while costing cache misses, and an extra cycle, in the infrequent case.
This assumes that your compiler can predict the probability of branching.
The best I've seen is about ~80%+- with static branch prediction.

>Anyone have any simulations or traces showing whether it is more 
>common for branches to be taken or not taken?  Is it very lopsided
>one way or the other, or is it about even? How about stats on
>the number of IF-THENs in source code as opposed to IF-THEN-ELSEs?

I do not have any other good numbers besides these observations. Most
numbers that I've seen suggest that branches are taken roughly 50% (maybe
as high as 60% for some code. Often it is unclear whether these figures include
unconditional branches or not- sometimes they do, and sometimes not. I don't
recall seeing any numbers at all on branching that separates forwards and
backwards branching. If there is some advantage one way or the other,
the compiler code rearrange the code to favor one or the other, which would
skew the statistics in any case. The IF-THEN vs. IF-THEN-ELSE numbers I have
are static, and from too small a sample.
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

jesup@cbmvax.UUCP (Randell Jesup) (06/18/88)

In article <12300@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>Oops. Its been so long since I've looked at this that I forgot about a second 
>part of the argument. You're absolutely right that there is a lot of code of
>the form:     If (error-condition( {handle it};  and that this code is executed
>infrequently. If this is the case, it would pay to have this code be executed
>out of line; that is, you would like to branch to it, and then branch back
>to the main line code. The most frequently executed code is then straight-line,
>a real win in situtations where you benefit from bust mode i-fetches. 

	I don't know about you, but I tend to write code that looks more
like:
	if (!error) {
		...
	} [else { ... }]

	Especially useful when doing resource allocations, like "if (!(ptr =
(struct ptr *) malloc(xyzzy))) { use ptr } else { return error }".

	It comes down to both the type of application and the individual
coding style (and of course language).

Randell Jesup, Commodore Engineering {uunet|rutgers|ihnp4|allegra}!cbmvax!jesup

jgk@speech2.cs.cmu.edu (Joe Keane) (06/20/88)

In article <4053@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>	I don't know about you, but I tend to write code that looks more
>like:
>	if (!error) {
>		...
>	} [else { ... }]
>
>	Especially useful when doing resource allocations, like "if (!(ptr =
>(struct ptr *) malloc(xyzzy))) { use ptr } else { return error }".

I think that's less common, especially with multiple errors.  A lot of
my code looks like:

	if ((... = open (...)) < 0)
	  quit (...);
	if (!(... = fopen (...)))
	  quit (...);
	if (!(... = malloc (...)))
	  quit (...);
	if ((... = read (...)) < 0)
	  quit (...);
	if (fscanf (...) != 1)
	  quit (...);
	...

It'd be nice if the compiler knew quit (same as `fprintf (stderr,
...); exit (...);') is a `dead' function, so it knows the branch is
usually taken.

--Joe

colwell@mfci.UUCP (Robert Colwell) (06/20/88)

In article <1995@pt.cs.cmu.edu> jgk@speech2.cs.cmu.edu (Joe Keane) writes:
>In article <4053@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>>	I don't know about you, but I tend to write code that looks more
>>like:
>>	if (!error) {
>>		...
>>	} [else { ... }]
>>
>I think that's less common, especially with multiple errors.  A lot of
>my code looks like:
>
>	if ((... = open (...)) < 0)
>	  quit (...);
>	if (!(... = fopen (...)))
>	  quit (...);
>	if (!(... = malloc (...)))
>	  quit (...);
>	if ((... = read (...)) < 0)
>	  quit (...);
>
>It'd be nice if the compiler knew quit (same as `fprintf (stderr,
>...); exit (...);') is a `dead' function, so it knows the branch is
>usually taken.

All true, but I suspect that there is no practical way for my code
generator to produce code so bad that I'd ever care on code sequences
like these.  This stuff isn't where my programs spend their time, so
optimizing branch latency and filling delay slots in this vicinity
probably constitutes nano-tweaking.  Unless code gets executed a lot
of times, I don't especially care how optimal it is, and those system
calls are going to swamp the conditionals by orders of magnitude
anyway.

The interesting cases are in or near the inner loops, where the
conditionals are going to be evaluated zillions of times.  In there,
I care a lot about how good my code is.  But remember (for scientific
code) jumps are very predictable.  Of all branch instructions
executed, typically something like 90% of them will take their
predicted directions.  Or to put it another way, "The more
predictable jumps are much more frequently executed." (Joseph Fisher,
"Wide Instruction Word Architectures:  Solving the Supercomputer
Software Problem", Proceedings of the INRIA Int'l Seminar on
Scientific Supercomputers, Feb., 1987).


Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090