[comp.arch] delayed branch

cdddta@tasu76.UUCP (David Deitcher) (07/30/89)

"Delayed branch" is a technique used by RISC machines to make use of the
extra cycle needed to calculate branch targets. The compiler will put
an instruction after the branch to be executed by the CPU while the
branch target is being calculated. Does anyone have information as to
how often the compiler is able to put a useful instruction after the
branch as opposed to filling it with a NOP?
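
For concreteness, a sketch on a hypothetical machine with one branch delay
slot (the mnemonics and registers are made up).  A naive code generator
would emit

        st      r5,(r6)         ; independent of the test below
        cmp     r1,r2
        blt     loop            ; delayed branch
        nop                     ; delay slot wasted

whereas a scheduler can move the independent store down into the slot:

        cmp     r1,r2
        blt     loop
        st      r5,(r6)         ; executed while the branch target is fetched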

-----------------------------------------------------
David Deitcher        (National Semiconductor Israel)
UUCP:    ...!{amdahl,hplabs,decwrl}!nsc!taux01!cdddta
Domain:  cdddta@taux01.nsc.com

cik@l.cc.purdue.edu (Herman Rubin) (08/01/89)

In article <2246@taux01.UUCP>, cdddta@tasu76.UUCP (David Deitcher) writes:
> "Delayed branch" is a technique used by RISC machines to make use of the
> extra cycle needed to calculate branch targets. The compiler will put
> an instruction after the branch to be executed by the CPU while the
> branch target is being calculated. Does anyone have information as to
> how often the compiler is able to put a useful instruction after the
> branch as opposed to filling it with a NOP?

This has nothing to do with the capabilities of the compiler, but I have
often wanted to put one, or even several, instructions between a branch
instruction and its execution.  As most problems have a substantial amount
of order independence of operations, I would be surprised if a useful
instruction could not be inserted in well over 90% of the cases.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

verwer@ruuinf.cs.ruu.nl (Nico Verwer) (08/01/89)

In article <2246@taux01.UUCP>, cdddta@tasu76.UUCP (David Deitcher) writes:
> "Delayed branch" is a technique used by RISC machines to make use of the
> extra cycle needed to calculate branch targets. The compiler will put
> an instruction after the branch to be executed by the CPU while the
> branch target is being calculated. Does anyone have information as to
> how often the compiler is able to put a useful instruction after the
> branch as opposed to filling it with a NOP?
In "reduced instruction set computers", Communications of the ACM, jan. 1985,
(pp. 8 -- 21), D.A Patterson writes (on page 13):
" RISC optimizing compilers are able to succesfully rearrange instructions
" to use the cycle after the delayed branch more than 90 percent of the
" time. Hennesy has found that more than 20 percent of all instructions
" are executed in the delay after the branch.
The article he is referring to is in IEEE Trans.Comput., but had to be
published yet at that time.

=====================================================================
 Nico Verwer        University of Utrecht, Dept. of Computer Science
 VOICE: +31-30-533921           PHYSICAL MAIL: || Postbus 80.089  ||
 INTERNET: verwer@cs.ruu.nl                    || 3508 TB Utrecht ||
 UUCP: ...!hp4nl!ruuinf!verwer                 || the Netherlands ||
=====================================================================

frazier@oahu.cs.ucla.edu (Greg Frazier) (08/01/89)

In article <1462@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
->In article <2246@taux01.UUCP>, cdddta@tasu76.UUCP (David Deitcher) writes:
->> branch target is being calculated. Does anyone have information as to
->> how often the compiler is able to put a useful instruction after the
->> branch as opposed to filling it with a NOP?
->
->This has nothing to do with the capabilities of the compiler, but I have
->often wanted to put one, or even several, instructions between a branch
->instruction and its execution.  As most problems have a substantial amount
->of order independence of operations, I would be surprised if a useful
->instruction could not be inserted in well over 90% of the cases.

Studies have shown (I love that phrase) that one delay can be filled
70% of the time, and a second 30% of the time (approx, of course).
There are two problems with filling delay slots.  The first is that
branches occur on the order of every 3 to 5 instructions - that's
a lot of slots to fill.  The second is that one often branches on
a value immediately after calculating it (i++; if i > 5 ...).  Oh,
you want a reference for those numbers?  The branch frequency comes
from some of the RISC papers from Berkeley (Patterson, et. al.).  I
think the slot filling numbers come from the same, but I'm not sure.
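
The second problem in miniature (made-up mnemonics, one delay slot):

        add     r1,r1,1         ; i++
        cmp     r1,5            ; sets the condition the branch needs
        bgt     toobig          ; if i > 5 ...
        nop                     ; neither the add nor the cmp may move here

Unless some other independent instruction happens to be lying around, that
slot stays a NOP.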

Greg
&&&&&&&&&&&&&&&&&&&&&@@@@@@@@@@@@@@@@@@@@@@######################
Greg Frazier	    o	Internet: frazier@CS.UCLA.EDU
CS dept., UCLA	   /\	UUCP: ...!{ucbvax,rutgers}!ucla-cs!frazier
	       ----^/----
		   /

raj@hpindwa.HP.COM (Richard Jones) (08/02/89)

  On the subject of how often the branch delay slot is useful, an informal
look at some code (compiler optimised) for the HP3000 PA (precision arch.)
would lead me to believe that the slot can frequently be used for useful
work - one obvious example is a branch taking you to another procedure -
the delay slot can be used to load up a parameter...
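
For instance (a sketch only - loosely PA-flavoured, not exact syntax):

        bl      subr,rp         ; delayed branch-and-link to the procedure
        ldi     5,arg0          ; delay slot: load up the first parameter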

of course, your mileage may vary,
rick 'by no means a hardware type' jones

baum@Apple.COM (Allen J. Baum) (08/03/89)

[]
>In article blahblahblah frazier@cs.ucla.edu (Greg Frazier) writes:
>
>Studies have shown (I love that phrase) that one delay can be filled
>70% of the time, and a second 30% of the time (approx, of course).
>There are two problems with filling delay slots.  The first is that
>branches occur on the order of every 3 to 5 instructions - that's
>a lot of slots to fill.  The second is that one often branches on
>a value immediately after calculating it (i++; if i > 5 ...).  Oh,
>you want a reference for those numbers?  The branch frequency comes
>from some of the RISC papers from Berkeley (Patterson, et. al.).  I
>think the slot filling numbers come from the same, but I'm not sure.

An interesting point to ponder is that delayed loads also have to schedule
something into their shadows, and they are (very roughly) as frequent as
branches. Now, there are only so many instructions that can be re-arranged,
so it is possible that although branch shadows can be filled 70% of the time,
and load shadows can be filled 70% of the time, it may not be true that you can
fill both 70% of the time. Any comments from someone who has actually
measured this? It might be interesting to turn off filling of one, and see how
the percentage of filling the other increases.
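
To make the competition concrete - a sketch on a hypothetical machine with
one load delay slot and one branch delay slot (made-up mnemonics), where a
single movable instruction can plug one hole but not both:

        mov     r8,r7           ; the only independent instruction in the block
        ld      r1,(r2)         ; delayed load
        nop                     ; load shadow - could hold the mov ...
        add     r3,r1,4         ; first use of r1
        cmp     r3,r0
        bne     loop            ; delayed branch
        nop                     ; ... or the branch shadow, but not both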

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

lord@se-sd.NCR.COM (Dave Lord ) (08/03/89)

In article <2246@taux01.UUCP> cdddta@tasu76.UUCP (David Deitcher) writes:
>"Delayed branch" is a technique used by RISC machines to make use of the
>extra cycle needed to calculate branch targets. The compiler will put
>an instruction after the branch to be executed by the CPU while the
>branch target is being calculated. Does anyone have information as to
>how often the compiler is able to put a useful instruction after the
>branch as opposed to filling it with a NOP?

You mean in theory or in real life? :-) I've looked at code generated
by three different compilers for the 88K (GreenHills, GNU, & LPI) and
I don't believe any of them EVER put a useful instruction in the
delayed branch slot. Admittedly the 88K is still pretty new and these
were all early compilers. I suspect that the reason
the delayed branch slots are not used is that the register allocators
are not smart enough to hold a register after a branch. 
Hopefully this will change. Anyone have
any idea what percentage of typical code is branches? It would be
interesting to know how much performance could be gained 
by filling those slots.

glenn@csri.toronto.edu (Glenn Mackintosh) (08/03/89)

frazier@oahu.cs.ucla.edu (Greg Frazier) writes:
>There are two problems with filling delay slots.  The first is that
>branches occur on the order of every 3 to 5 instructions - that's
>a lot of slots to fill.  The second is that one often branches on
>a value immediately after calculating it (i++; if i > 5 ...).

Hmmm. That just made me think about something. Has anyone looked at taking
something like the above and having the compiler recognize that it could
rearrange things a bit in cases like this? For example the compiler could
change this to do a compare against 4 and move the increment into the delay
slot after the test. I have a feeling that a lot of loop code has this kind
of termination structure. If you just add/subtract some fixed number from
your loop control variable and then test to see if it has gone beyond a
fixed range then you could easily adjust the range so that the test could be
done first.
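
Something like this, say (made-up mnemonics; assume the delay slot is
executed whether or not the branch is taken).  The straightforward code

        add     r1,r1,1         ; i++
        cmp     r1,5
        ble     loop            ; keep looping while i <= 5
        nop                     ; wasted: both instructions feed the test

could become

        cmp     r1,4            ; old i <= 4 is the same as i+1 <= 5
        ble     loop
        add     r1,r1,1         ; increment moved into the delay slot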

Has anybody done this kind of thing and if so how did it come out?

                    Glenn Mackintosh
                    Univ. of Toronto

CSNET/ARPA:   glenn@eecg.{toronto.edu,utoronto.ca}
CDNNET:       glenn@eecg.toronto.cdn
BITNET:       glenn@eecg.utoronto.bitnet (may not work from all sites)
UUCP:         uunet!utai!eecg!glenn

preston@titan.rice.edu (Preston Briggs) (08/03/89)

In article <33669@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>An interesting point to ponder is that delayed loads also have to schedule
>something into their shadows, and they are (very roughly) as frequent as 
>branches. Now, there are only so many instructions that can be re-arranged,
>so it is possible that although branches shadows can be filled 70% of the time,
>and load shadows can be filled 70% of the time, it may not be true that you can
>fill both of the 70% of the time. Any comments from someone who has actually
>measured this? It might be interesting to turn off filling of one, and see how
>the percentage of filling the other increases.

Martin Hopkins, writing about the PL.8 compiler and the 801,
points out that it's especially nice if you can schedule a
load in the branch "shadow" (great term by the way).
That way the load completes while the branch is branching.
The PL.8 compiler does some non-trivial rearrangement 
(really rewriting) to achieve this.

I schedule for both cases, but I don't have any good numbers I can
give you now (haven't measured the results recently; wish I had a pixie).
My impression is that I do better on branches than loads.

On the RT, the "flavor" of the two problems is a little
different.  Loads cast a long shadow and I try to avoid
scheduling instructions that use the loaded register during the shadow.
For branches, I try and find an instruction that doesn't
interfere with the CC and can execute at the end of the block.
They aren't necessarily competing goals.

An easy example is a mem-mem copy, that is a LOAD followed by a STORE.
The STORE can be scheduled in the branch slot, and the LOAD
as early as possible (up to 6 cycles anyway).
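
Roughly, on a hypothetical machine with one load delay and one branch delay
(made-up mnemonics):

    loop:
        ld      r1,(r2)         ; load issued as early as possible
        add     r2,r2,4         ; pointer bumps help cover the load shadow
        add     r3,r3,4
        cmp     r2,r4
        blt     loop            ; delayed branch
        st      r1,-4(r3)       ; the store fills the branch slot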

On the other hand, your point seems valid.
I'm just not sure of an easy way to measure the effect.

Regards,
Preston Briggs

hankd@pur-ee.UUCP (Hank Dietz) (08/03/89)

In article <33669@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>An interesting point to ponder is that delayed loads also have to schedule
>something into their shadows, and they are (very roughly) as frequent as 
>branches. Now, there are only so many instructions that can be re-arranged,
>so it is possible that although branches shadows can be filled 70% of the time,
>and load shadows can be filled 70% of the time, it may not be true that you can
>fill both of the 70% of the time. Any comments from someone who has actually
>measured this? It might be interesting to turn off filling of one, and see how
>the percentage of filling the other increases.

I think delayed loads are MORE frequent than delayed branches for most code
and most target machines; however, the two are essentially independent
pipelines.  In other words, on a lot of target machines, one can put a
delayed branch in the delay slot of a delayed load (and vice versa).  Hence,
with appropriate compiler algorithms, the two don't interfere that much.
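
For instance, on a hypothetical machine with one load delay and one branch
delay, and no restriction on branches in load shadows (made-up mnemonics):

        ld      r1,(r2)         ; delayed load
        br      join            ; the branch sits in the load's shadow (no use of r1)
        add     r3,r1,4         ; branch delay slot: first use of r1, now ready

Both shadows get filled, and neither slot steals an instruction the other
one needed.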

I have a student, Ashar Nissar, who is doing his MS thesis on compile-time
scheduling for multiple pipelines.  It's a little early to give results, and
numerical results are far too dependent on the precise target machine, but
qualitatively it is safe to say that having multiple INDEPENDENT pipelines
doesn't seem to cause much trouble.

Notice that I didn't say that the usual pipeline scheduling techniques work
all that well...  we are currently using an algorithm based on trying ALL
legal & useful permutations of instruction sequences prior to register
allocation.  Right now, we can quickly get an optimal schedule for up to
about 16 instructions, but have to truncate the search for larger schedules.

						-hankd@ecn.purdue.edu

roelof@idca.tds.PHILIPS.nl (R. Vuurboom) (08/03/89)

In article <1996@se-sd.NCR.COM> lord@se-sd.UUCP (Dave Lord (SSP)) writes:
>Hopefully this will change. Anyone have
>any idea what percentage of typical code is branches? It would be
>interesting to know how much performance could be gained 
>by filling those slots.

For the Motorola 68k I've seen one study quoting 23% of executed code
being branches. I seem to remember hearing somewhere else a figure of
1 in 6, so somewhere between 1 in 4 and 1 in 6 would be my guess.

Which explains why (to answer an unasked question :-) you'll see instruction
cache line sizes for 68k architectures getting up to 16 bytes but no more:
average instruction length is about 3 bytes, and with a branch every ~5
instructions that's 3x5 = 15 bytes; pulling in extra instructions
would just waste memory bandwidth.
-- 
I don't know what the question means, but the answer is yes...
KLM - Koninklijke Luchtvaart Maatschappij => coneenclicker lughtfart matscarpie
Roelof Vuurboom  SSP/V3   Philips TDS Apeldoorn, The Netherlands   +31 55 432226
domain: roelof@idca.tds.philips.nl             uucp:  ...!mcvax!philapd!roelof

hankd@pur-ee.UUCP (Hank Dietz) (08/04/89)

In article <1989Aug2.160111.9324@jarvis.csri.toronto.edu> glenn@csri.toronto.edu (Glenn Mackintosh) writes:
>... For example the compiler could
>change this to do a compare against 4 and move the increment into the delay
>slot after the test. ...

I haven't seen loop transforms done to optimize FINE GRAIN pipelines...
although it shouldn't be all that hard to do.  A few years ago, somebody (at
Stanford?) had a paper out on converting between equality and inequality
comparisons so that better instruction sequences would apply.  That's the
closest I recall seeing.

						-hankd@ecn.purdue.edu

pauls@apple.com (Paul Sweazey) (08/04/89)

In article <221@ssp1.idca.tds.philips.nl> roelof@idca.tds.PHILIPS.nl (R. 
Vuurboom) writes:
> For the motorola 68k I've seen one study quoting 23% of executed code 
> being branches. I seem to remember hearing somewhere else a figure of
> 1 in 6 so somewhere between 1 in 4 and 1 in 6 would be my guess.
> 
> Which explains why (to answer an unasked question :-) you'll see instruction
> cache line sizes for 68k architectures getting up to 16 bytes but no more: 
> average instruction length is 3 bytes  3x5=15 pulling in extra instructions 
> would just waste memory bandwidth.

However, there is a high probability that the branch is only a few
instructions away, or that execution will soon run into or branch to the
remainder of the larger cache line.  Other issues, such as memory operation
or interconnect bandwidth and latency, may be the real deciding factors.

Paul Sweazey
Apple Computer, Inc.
pauls@apple.com
(408)-974-0253

jesup@cbmvax.UUCP (Randell Jesup) (08/05/89)

In article <26139@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
>Studies have shown (I love that phrase) that one delay can be filled
>70% of the time, and a second 30% of the time (approx, of course).
>There are two problems with filling delay slots.  The first is that
>branches occur on the order of every 3 to 5 instructions - that's
>a lot of slots to fill.  The second is that one often branches on
>a value immediately after calculating it (i++; if i > 5 ...).  Oh,
>you want a reference for those numbers?  The branch frequency comes
>from some of the RISC papers from Berkeley (Patterson, et. al.).  I
>think the slot filling numbers come from the same, but I'm not sure.

	Note that a) branch frequency depends a lot on the architecture, and
also on the compiler's optimizer; b) slot-filling percentages (often taken
as a "universal" figure) depend highly on the instruction set of the machine,
and even the type of code being compiled.  In particular, 2-address instruction
machines are better at filling branches than 3-address machines (they're more
likely to have an instruction pair that can fill a 2-instruction delay, and
more likely to have pairs of instructions they can split up to fill branch and
load delays).

	One has to be careful about assumptions.  It's easy to take someone
else's research as being directly applicable to a different architecture, but
often it isn't, or is only partially applicable.

	The acid test is to simulate the damn thing with real code and
real compilers and gather real statistics.

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup  
Common phrase heard at Amiga Devcon '89: "It's in there!"

jonah@db.toronto.edu (Jeffrey Lee) (08/07/89)

In article <7543@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>        Note that a) branch frequency depends a lot on the architecture, and
>also on the compiler's optimizer; b) slot-filling percentages (often taken
>as a "universal" figure) depend highly on the instruction set of the machine,
>and even the type of code being compiled.

Take a look at the article by Dan Klein about CISC/RISC compiler
construction in the proceedings of the Winter 89 USENIX (San Diego).
He surveyed a number of CISC and RISC compilers and took polls of the
usage frequency of the various instructions.  He found a surprising
degree of similarity of instruction frequency in code generated by C
compilers on various CISC/RISC platforms.  The following table is not
found in the article, but is derived from the information in it:

I/KVax move  arith log   cmp cmpbra  bra  call   (excluding noop)
 1000   402    80   12   133   157    90   126   vax/pcc
 1033   420    81   14   130   158    98   130   vax/tartan
 1158   472    96   15   166   165   103   136   vax/dec
 1262   588   156   20   135   155    83   122   mc68020/gnu
 1331   616   187   21   138   147    86   131   mc68020/sun
 1470   801   127   66*  151   148    67   164   sparc/sun
 1582   829   193   66*   72   170   117   131   88100/greenhills
 1763   969   290   37    37   206   114   153   r2000/mips

 [*: ori is often used to load the lower portion of a register.]

Most of the extra operations added in the RISC code seem to be loads
and stores (lack of memory-memory operations) and arithmetic/logical
operations (adds/shifts to make up for the reduced addressing modes).
Somewhat surprisingly, although the RISC code has more instructions
than the CISC code, it is less than a two-fold expansion in code size.
Adjusting for average instruction size and adding in the data space,
these RISC programs were only 25% to 50% larger than the VAX programs.
(The 68020 programs were often 25% smaller than the VAX code.)

In article <26139@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
>Studies have shown (I love that phrase) that one delay can be filled
>70% of the time, and a second 30% of the time (approx, of course).

This paper showed a measured branch and call (static) frequency ranging
from 22.6% to 27.8% for R2000/SPARC/88100 and from 27.5% to 37.4% for
MC68020/VAX.  Loads made up 22% to 31% of the RISC code.  The delay
slot filling situation is somewhat confusing since some of these
processors have interlocked loads while others do not.  Similarly, some
processors conditionally execute the branch delay slot instruction
while others always execute it.  [GOS==Grain-Of-Salt: I may be wrong in
the following.]  The MIPS/R2000 compiler used there [check the paper
for the version number--I don't have a copy handy] filled
load/branch/call [GOS] delay slots about 76% of the time leading to a
14% noop frequency.  The Sun/SPARC compiler has interlocked loads [GOS]
and filled its branch/call [GOS] delay slots about 75% of the time
leading to a 4% noop frequency.  The 88100 has interlocked loads and
conditional delay slots [GOS] leading to 0% NOPs.

Interlocked loads make it easier on the compiler writer:  It means they
don't have to worry about filling delay slots after load instructions.
Unfortunately it also means that they might not try to prefetch data
and end up with slower code as a result (with stalls instead of NOPs).
However, if they do try to prefetch data (and if the interlocks do not
lengthen the cycle time) they could be ahead: a one cycle stall should
be slightly faster than a NOP because it never can cause a cache miss.
Additionally, the lack of NOPs means that a few more ``useful''
instructions can be stored in the same sized cache.

In article <26139@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
>There are two problems with filling delay slots.  The first is that
>branches occur on the order of every 3 to 5 instructions - that's
>a lot of slots to fill.  ...

Having conditional delay slots for conditional branches takes some of
the sting out of this.  Branch slots are harder to fill when they are
unconditional:  the delay instruction must be ``useful'' (or harmless)
for both the branch-taken and branch-not-taken cases.  The delay action
can be made ``conditional'' by adding silicon to nullify the action
(e.g. prevent writeback of the results to the register cells or abort a
load/store instruction) in the case of the branch not taken (or vice
versa).  In this situation the branch slot can always be filled for
static branch targets: simply duplicate the branch target instruction
in the delay slot and move the target forward by one instruction.
Alternatively, the delay instruction could be aborted when the branch
IS taken and used if the branch falls through.  Which choice to
implement depends on the expected probability of branch taken versus
not taken.  [Sorry, I have no data on this.  John??]  Of course in
these cases the delay instruction is only useful when the branch is
taken (or not taken).
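
A sketch with made-up syntax, where ",t" marks a delay slot that is
nullified unless the branch is taken:

        bne     r1,r0,L         ; suppose nothing nearby is useful on both paths
        nop
        ...
    L:  ld      r2,(r3)
        add     r4,r2,1

becomes

        bne,t   r1,r0,L2        ; slot annulled if the branch falls through
        ld      r2,(r3)         ; duplicate of the old target instruction
        ...
    L:  ld      r2,(r3)         ; still here for any other paths to L
    L2: add     r4,r2,1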

It is better to fill an unconditional branch slot than a conditional
one, assuming that the delay action is useful in both cases; however, it
is also useful to have the option of conditional delay slots to
satisfy (half of) the remaining cases.  Whether or not the extra
silicon is justified depends on (a) how often conditional branches are
expected, (b) how often we expect unconditional delay slots to remain
unfilled, and (c) how hard it is to ``nullify'' the delay instructions
without increasing the cycle time (among other things).  [N.B. One
might need some circuitry to nullify partially executed instructions anyway,
so that we can abort execution and restart again in case of a page fault.]

In article <26139@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
>  ...The second is that one often branches on
>a value immediately after calculating it (i++; if i > 5 ...).

As to using a value immediately after calculating it: I don't know
about other processors, but on the MIPS/R2000 there is little problem
in this.  The result of any register/immediate integer instruction
(other than multiply or divide) is always available for use as an
operand in the next instruction without any delay.  Having just altered
the value is actually advantageous: it means that the value will be
stored-in or cached-in a register.  Since many branches tend to be
preceded by increment or decrement instructions rather than multiplies
or divides, there is little problem.  If the value being tested is
coming from memory, you are stuck with the load delay (but may be able
to pre-fetch it some of the time.)
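
Roughly, in R2000-style assembly (register names simplified), the
(i++; if i > 5) case from above looks like:

        addiu   r4,r4,1         # i++
        slti    r2,r4,6         # r2 = (i < 6), i.e. not (i > 5)
        beq     r2,r0,toobig    # result of the previous instruction, no stall
        nop                     # (the branch delay slot is a separate matter)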

Perhaps someone who has worked on designing RISC chips would care to
comment on the cost of adding interlocked loads and conditional delay
slots.

--
Jeff Lee <jonah@db.toronto.edu> or <...!utzoo!utai!jonah>

paul@moncam.co.uk (Paul Hudson) (08/07/89)

For some reason known only to the designers, the ARM doesn't have a
delayed branch slot (blowing away the pipeline instead). What it does
have is conditional execution of every instruction. This means that
many (conditional) branches can be replaced by conditionally executing
the next few instructions instead, thus side-stepping the scheduling
problem. Since you can also choose whether an arithmetic instruction
affects the condition codes, it's easy to arrange (usually in a
peephole pass). IMHO, they should have had a delay slot too, though.
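
For example (ARM-flavoured syntax; a rough sketch, not real compiler
output), instead of branching around two instructions:

        CMP     r0,#5
        BLE     skip            ; a taken branch blows away the pipeline
        SUB     r0,r0,#5
        ADD     r1,r1,#1
    skip:

you can write

        CMP     r0,#5
        SUBGT   r0,r0,#5        ; executed only if r0 > 5
        ADDGT   r1,r1,#1        ; ditto; no branch, so nothing to flush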

The ARM instruction set is one of the nicest to deal with, both for
hand-coding and compiler writing - well worth a look.




--
Paul Hudson	These opinions void where prohibited by law.
Until 23 August (but mail will be forwarded for a while)
		 MAIL: Monotype ADG, Science Park, Cambridge, CB4 4FQ, UK.
		PHONE: +44 (223) 420018	  EMAIL: paul@moncam.co.uk,
		  FAX: +44 (223) 420911		 ...!ukc!acorn!moncam!paul
On vacation until September 6, then
		EMAIL:  ..!mcvax!i2unix!iconet!trzdor1!paul,
			paul@trzdor1.ico.olivetti.com	    

jrc@ukc.ac.uk (John) (08/07/89)

In article <PAUL@marvin.moncam.co.uk> paul@moncam.co.uk (Paul Hudson) writes:
>For some reason known only to the designers, the ARM doesnt have a
>delayed branch slot (blowing away the pipeline instead). What it does
>have is conditional execution of every instruction ...

I believe it's because the ARM's designers are all assembly code freaks and
didn't like the idea of having to do all that reordering by hand. Something of
a copout!

>... IMHO, they should have had a delay slot too, though.

Yes, it would have added another few % of performance at no cost (except to
the mental health of Roger Wilson).

>--
>Paul Hudson	These opinions void where prohibited by law.

John Cupitt

wilken@husky.ece.cmu.edu (Kent Wilken) (08/08/89)

In article <1989Aug6.164210.11976@jarvis.csri.toronto.edu> jonah@db.toronto.edu (Jeffrey Lee) writes:

>... Having conditional delay slots for conditional branches takes
>some of the sting out of this. ...
>The delay action can be made ``conditional'' by adding silicon to
>nullify the action (e.g. prevent writeback of the results to the
>register cells or abort a load/store instruction) in the case of the
>branch not taken (or vice versa). ...
>Which choice to implement depends on the expected probability of
>branch taken versus not taken.  [Sorry, I have no data on this.] ...

The paper "Reducing the Cost of Branches" by McFarling and Hennessy (13th
Comp. Arch. Conf. pp 396-403) addresses this topic.  Data is presented that
shows the branch-taken probability to be roughly 65%.  The paper shows that
the nullify action, called "squashing", can significantly reduce the cost of
branches for a profiled program.  ("Cost" is a performance metric that
includes the performance cost of NOPs).  A following paper "Architectural
Tradeoffs in the Design of the MIPS-X" by Chow and Horowitz (14th Comp.
Arch. Conf. pp 300-308) shows that squashing melds nicely with exception
handling (i.e., the amount of added silicon is small).  They also report that
the MIPS-X, which has two branch delay slots, uses from 16-18% NOPs.

Kent Wilken <wilken@husky.ece.cmu.edu>

rcpieter@eutrc4.urc.tue.nl (Pieter Schoenmakers) (08/09/89)

Just wondering---
 - What happens on existing processors which use delayed branches when the
instruction put in the branch instruction's shadow is also a branch?
 - Do existing processors have a separate adder for use by branches only, or
are there restrictions on the possible instructions which can be put in the
shadow?
 - Are there any processors which use branch delaying and don't have a
separate address calculation adder?

Tiggr

tim@cayman.amd.com (Tim Olson) (08/09/89)

In article <828@eutrc3.urc.tue.nl> rcpieter@rc4.urc.tue.nl writes:
| Just wondering---
|  - What happens on existing processors which use delayed branches when the
| instruction put in the branch instruction's shadow is also a branch?

You get what we term a "visit".  This executes the single instruction
at the first branch target, then continues with the second (delay-slot)
branch's target.  The second branch does not have a physical delay slot
following it; rather, the instruction at the first branch's target acts
as the second branch's delay slot.

This does have its uses -- for example, a debugger can replace an
instruction with a specific breakpoint instruction, saving the replaced
instruction elsewhere.  When it comes time to restart execution, the
debugger can execute the replaced instruction "out-of-line" with a
visit, allowing the breakpoint instruction to physically remain at the
breakpoint location.
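
A sketch of that trick (made-up syntax; "B" is the breakpointed address and
"next" labels the instruction after it):

    resume:
        jmp     saved           ; its target instruction gets "visited" once
        jmp     next            ; sits in the shadow: supplies where to continue
        ...
    saved:
        <original instruction from B>   ; executed out-of-line, then on to "next"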

|  - Do existing processors have a separate adder for use by branches only, or
| are there restrictions on the possible instructions which can be put in the
| shadow?
|  - Are there any processors which use branch delaying and don't have a
| separate address calculation adder?

I know of no processors that have delayed branches and restrictions on
delay instruction types.  However, there really isn't any resource
contention between a branch and its delay slot.  The delay slot is just
being decoded by the time the branch executes.  The reason that a
processor may have a separate adder to calculate PC-relative branch
offset addresses is that they must perform an instruction cache lookup
during the branch execution, and thus must have the address calculated
by the end of the decode stage (at least, this is the way it works on
the Am29000).

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

jk3k+@andrew.cmu.edu (Joe Keane) (08/09/89)

In article <26667@amdcad.AMD.COM> tim@cayman.amd.com (Tim Olson) writes:
>I know of no processors that have delayed branches and restrictions on
>delay instruction types.

The IBM RT manual says (rather verbosely IMHO):

Certain instructions are not allowed to be the subject of a branch with execute
instruction.  Since the branch with execute instructions change the normal
sequential execution of instructions, the subject instructions cannot be an
instruction that also changes the instruction sequencing; otherwise the
processor may be put in an unpredictable state.  Thus, all branches, jumps,
traps, Load Program Status, Supervisor Call, and Wait instructions cannot be
subject instructions.  The system processor provides hardware that detects
these illegal instructions and causes a program check interrupt if any of these
instructions occur as the subject of a branch with execute instruction.

jjb@sequent.UUCP (Jeff Berkowitz) (08/09/89)

In article <828@eutrc3.urc.tue.nl> rcpieter@rc4.urc.tue.nl writes:
>Just wondering---
> - What happens on existing processors which use delayed branches when the
>instruction put in the branch instruction's shadow is also a branch?

If you haven't thought this through, it's quite amusing.  Given such a
machine, the typical CISC code (erroneous in this case, of course) -

    looptop:
	...
	jsr	subr
	br	looptop

causes exactly *one* instruction at "subr" to be executed before going
back to "looptop".  (When executing the jsr, you fetch the branch; when
executing the branch, you fetch @subr; when executing @subr, you fetch
at looptop, etc).

On one machine in my past, the architects were disturbed enough by this
behavior to "fix" it in hardware: if a branch occurred in the shadow,
the machine automatically converted the "shadowed" branch into a noop!
If the first branch happened to be a jsr, as in the example above, the
machine would not only "noop" it but would remember that it should be
the return addrss of "subr", so it would be executed when you got back.
This allowed code like

	jsr syscall_enter
	bcs error		# executed on return from syscall_enter

to work "just like a CISC".

The machine actually had two instructions in the "shadow" (we called
these "trailers").  So there were actually three possible return
addresses of any jsr instruction -

    the next instruction, if it was a branch, jump, or jsr
    else the instruction after that, if *it* was a branch, jump, or jsr
    else the instruction after *that*.

The instruction pipeline had to determine which to "push" on-the-fly.

Was this complexity worth it?  Well, presumably it improved code density.
I am not aware of any measurements.  It made breakpoint debugging a
serious pain; you needed a zillion types of breakpoint instructions
so that if you set a breakpoint on a branch that was the second trailer
of a jsr, the breakpoint instruction itself would *also* look like a
branch.  Anyway the machine was not successful, in part because of the
complexity of trying to implement a lot of little rules like this in
hardware.
-- 
Jeff Berkowitz N6QOM			uunet!sequent!jjb
Sequent Computer Systems		Custom Systems Group

tim@cayman.amd.com (Tim Olson) (08/09/89)

In article <0YrsEMy00V4GQ=CJZw@andrew.cmu.edu> jk3k+@andrew.cmu.edu (Joe Keane) writes:
| The IBM RT manual says (rather verbosely IMHO):
| 
| Certain instructions are not allowed to be the subject of a branch with execute
| instruction.  Since the branch with execute instructions change the normal
| sequential execution of instructions, the subject instructions cannot be an
| instruction that also changes the instruction sequencing; otherwise the
| processor may be put in an unpredictable state.  Thus, all branches, jumps,
| traps, Load Program Status, Supervisor Call, and Wait instructions cannot be
| subject instructions.  The system processor provides hardware that detects
| these illegal instructions and causes a program check interrupt if any of these
| instructions occur as the subject of a branch with execute instruction.

Thanks for the information -- I didn't have my RT arch. manual with me to look it
up.   Does anyone else know of other processors with such restrictions?


	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

marvin@oakhill.UUCP (Marvin Denman) (08/09/89)

In article <26667@amdcad.AMD.COM> tim@amd.com writes:
 >In article <828@eutrc3.urc.tue.nl> rcpieter@rc4.urc.tue.nl writes:
 >| Just wondering---
 >|  - What happens on existing processors which use delayed branches when the
 >| instruction put in the branch instruction's shadow is also a branch?

 >You get what we term a "visit".  This executes the single instruction
 >at the first branch target, then continues with the second (delay-slot)
 >branch's target.  The second branch does not have a physical delay slot
 >following it; rather, the instruction at the first branch's target acts
 >as the second branch's delay slot.

This is a good explanation of what occurs with delayed branches.  The 88100 
has a few more variations of this effect due to the ability of the instruction 
set to either execute the instruction in the delay slot or to waive the
instruction in the delay slot.  This for instance allows a "visit" that does
not execute the instruction.  Some of the other variations of execute/no 
execute may produce other obscure code sequencing.  Before using any of 
these tricks see the warning below.

 >|  - Do existing processors have a separate adder for use by branches only, or
 >| are there restrictions on the possible instructions which can be put in the
 >| shadow?
 >|  - Are there any processors which use branch delaying and don't have a
 >| separate address calculation adder?
 >
 >I know of no processors that have delayed branches and restrictions on
 >delay instruction types.  However, there really isn't any resource
 >contention between a branch and its delay slot.  The delay slot is just
 >being decoded by the time the branch executes.  The reason that a
 >processor may have a separate adder to calculate PC-relative branch
 >offset addresses is that they must perform an instruction cache lookup
 >during the branch execution, and thus must have the address calculated
 >by the end of the decode stage (at least, this is the way it works on
 >the Am29000).
 >
 >-- Tim Olson
 >Advanced Micro Devices
 >(tim@amd.com)

The 88100 does have a separate adder for branches like most other chips with
delayed branching, but the User's Manual explicitly states that for future
compatibility the delay slot instruction cannot be a trap, jump, branch, or
any other instruction that modifies the instruction pointer.  In other words
this behavior, while it may be useful for at least one special case 
application, is not guaranteed to work the same on all implementations of the
architecture.  This allows the possibility of different pipelining schemes in
the future without unduly tying the architect's hands.

Marvin Denman
Motorola 88000 Design

joeg@oakhill.UUCP (Joe Gutierrez) (08/10/89)

>In article <2246@taux01.UUCP> cdddta@tasu76.UUCP (David Deitcher) writes:
>>"Delayed branch" is a technique used by RISC machines to make use of the
>>extra cycle needed to calculate branch targets. The compiler will put
>>an instruction after the branch to be executed by the CPU while the
>>branch target is being calculated. Does anyone have information as to
>>how often the compiler is able to put a useful instruction after the
>>branch as opposed to filling it with a NOP?

In article <1996@se-sd.NCR.COM> lord@se-sd.NCR.COM (Dave Lord ) writes:
>You mean in theory or in real life? :-) I've looked at code generated
>by three different compilers for the 88K (GreenHills, GNU, & LPI) and
>I don't believe any of them EVER put a useful instruction in the
>delayed branch slot. Admittedly the 88K is still pretty new and these
>were all early compilers. I suspect that the reason
>the delayed branch slots are not used is that the register allocators
>are not smart enough to hold a register after a branch. 
>Hopefully this will change. Anyone have
>any idea what percentage of typical code is branches? It would be
>interesting to know how much performance could be gained 
>by filling those slots.

Current 88000 compilers are quite effective in scheduling instructions in
the delay slot of branches if the compiler optimization flags are enabled.
But if the optimizers are not used, the delayed branch instructions are not
emitted.

In addition to tons of other statistics, we have measured the frequency of
branches and the branch delay slot utilization on several real applications,
including the C compiler itself, a Lisp interpreter, and several Unix
utilities.  The applications were compiled with the Green Hills C compiler
using the -OLM level of optimization.  The measurements were made dynamically
while running the applications on an instruction set simulator.  The sample
comprised over 400 million instructions executed.  Of these, 13% were
branches which were taken.  Of all the branches taken, 70% executed an
instruction in the delay slot.  Interestingly enough, the compiler doesn't
always try to schedule instructions in the delay slot of branches.  If it has
to choose between scheduling an instruction in the delay slot of a
conditional branch or scheduling the same instruction in the delay slots of
an unconditional operation such as a load, a multiply, or a floating point
operation, it will choose the unconditional operation.

                              Jim Klingshirn
                              Motorola Inc. (88000 Design Group)

hsit@smdvx1.intel.com (Hon P. Sit) (08/11/89)

In article <2284@yogi.oakhill.UUCP> cs.utexas.edu!oakhill!marvin (Marvin Denman) writes:
>
>The 88100 does have a separate adder for branches like most other chips with
>delayed branching, but the User's Manual explicitly states that for future
>compatibility the delay slot instruction cannot be a trap, jump, branch, or
>any other instruction that modifies the instruction pointer.  In other words
>this behavior, while it may be useful for at least one special case 
>application, is not guaranteed to work the same on all implementations of the
>architecture.  This allows the possibility of different pipelining schemes in
>the future without unduly tying the architect's hands.
>

The Intel i860 has similar restrictions.

gnu@hoptoad.uucp (John Gilmore) (08/11/89)

tim@cayman.amd.com (Tim Olson) wrote:
> Does anyone else know of other processors with such restrictions?

I'm surprised that nobody mentioned the SPARC.  It has restrictions on
which types of branches can sit in the delay slot of which other
types.  I think in the first draft of the architecture I was the one
who noticed that the intended "return from interrupt" sequence was one
of the invalid ones!  I don't have my SPARC manual handy but as I
recall the invalid combinations are defined to "keep you executing in
the same address space but otherwise jump to an undefined location"...
I found this quite a botch for a CPU architecture but I'm not a chip
designer -- I got into this business via software.  Then again, it's
been on the market for a few years and nobody seems to be screaming
about it.

A case where this bit me came up in the observation below:

While examining the function block profiler code (cc -a) I noticed an
interesting thing.  If you do:

	bcond,a	foo		[,a means annul]
	instruction
foo:

What you have is a "skip on not condition" instruction.  If the
condition is true, it does a delayed branch to foo, executing the
instruction in the delay slot.  If the condition is false, it falls
thru, but annuls the instruction.  In either case, the execution time
is the same (two cycles) and you end up at foo.

This reminds me of the "skip" instructions on the old DG Nova and Eclipse.
Quite nice on machines with a single size instruction.

You can also think of it as a "conditionally execute one instruction"
instruction; in this case you don't have to mentally reverse the condition.
E.g.   blt,a foo; insn; foo:   executes insn if less than.

There's a serious catch to it on the SPARC:  the second instruction
cannot be a delayed control transfer [i.e. a branch with a delay
slot].  If it is, what the CPU does is undefined!

I was hoping to use this to shorten the block profiling code, but it doesn't
work because the second instruction is a CALL.  Still, there are probably
places where the optimizers can use it.
-- 
John Gilmore      {sun,pacbell,uunet,pyramid}!hoptoad!gnu      gnu@toad.com
      "And if there's danger don't you try to overlook it,
       Because you knew the job was dangerous when you took it"

jgh@root.co.uk (Jeremy G Harris) (08/11/89)

In article <26676@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>Does anyone else know of other processors with such restrictions?

The Motorola MC88100 User's Manual disallows traps, jumps, branches, etc.,
from use in branch delay slots "to ensure future compatibility".

I'm intrigued that AMD didn't (not displeased, note).  The first time
I was reading a 29K manual and realised you might be able to do this
sort of thing, I said to myself "Aha! The PDP-10 EXEC instruction!"

Jeremy
-- 
Jeremy Harris			jgh@root.co.uk

rtrauben@cortex.Sun.COM (Richard Trauben) (08/11/89)

In article <8266@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>tim@cayman.amd.com (Tim Olson) wrote:
>> Does anyone else know of other processors with such restrictions?
>I'm surprised that nobody mentioned the SPARC. 
Tim was informed off-line.

> SPARC has restrictions on which types of branches can sit in the delay 
> slot of which other types. I think in the first draft of the architecture 
> I was the one who noticed that the intended "return from interrupt" 
> sequence was one of the invalid ones!  

	The first draft of the architecture spec predates me. However, a
	branch couple where a return-from-trap instruction resides in the
	delay slot is perfectly legal SPARC code. The primary use of the
	branch couple is to restart faulted instructions that reside in
	the delay slot. Thus the sequence:

		a1:	 JMPL t1 (t1= address of instruction to restart)
		a2:  RETT t2 (t2= address of successor to restarted instruction)
		.
		.
		t1:	 I1		 (restarted fault instruction with exception)
		.
		.
		t2:  I2 	 (successor PC to faulted instruction) 
		t2+: I2+	 (successor to successor)
		.
		.
	will be executed as a1, a2, t1, t2, t2+.

>There's a serious catch to it on the SPARC:  the second instruction
>cannot be a delayed control transfer [i.e. a branch with a delay
>slot].  If it is, what the CPU does is undefined!

	Your memory is in error. The SPARC architecture has no such requirement.
	Branch couples require the first branch to be unconditional; the second
	is unconstrained. If the first branch does not meet the above requirement,
	the behavior is implementation dependent. Table 4-12 on page 47 of the
	rev 50 SPARC architecture manual (August 8, 1987) identifies the behavior
	for all types of branch couples.

- Richard Trauben
  Sun Microsystems
  Mt View, California

tim@cayman.amd.com (Tim Olson) (08/11/89)

In article <8266@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
| tim@cayman.amd.com (Tim Olson) wrote:
| > Does anyone else know of other processors with such restrictions?
| 
| I'm surprised that nobody mentioned the SPARC.

Well, from the previous postings and email conversations, it appears
that nearly every RISC processor besides the Am29000 has a restriction
on what can go in a branch delay slot, including SPARC, MIPS, 88000,
i860, and ROMP.  Most of the restrictions are advisory (don't do this; the
result is undefined), but the ROMP has hardware to detect and trap this
condition.

One interesting thing to think about if control transfers are allowed in
branch delay slots is how a delay-slot call should work:

loop:
	.
	.
	jmp	loop
	call	lr0, function
exit:
	.
	.


Calls are typically defined in RISC processors to save the return
address in a register.  Since calls themselves have delay slots, the
return address is normally the second instruction after the call.

The action that a delay-slot call takes depends upon how the return
address is calculated in the processor.  It could either be the address
of the call + 2 (words), or the address of the call's delay slot
instruction + 1.  These normally result in the same value, but if the
call is itself in a delay slot, they work differently:

	ret <- call+2			ret <- call_delay+1

	jmp	loop			jmp	loop
	call	lr0, function		call	lr0, function
	<inst at loop>			<inst at loop>
	<inst at function>		<inst at function>
	.				.
	<return from function>		<return from function>
	<inst at exit+1>		<inst at loop+1>

In the former case, the jmp/call pair acts as a visit to the jmp's
target, and does not execute the instruction at exit (it substitutes the
jmp's target for the call's delay slot).  In the latter case, the
jmp/call pair continues the loop, executing the first instruction of the
loop just before the call target is executed, and returns to the second
instruction in the loop.  The Am29000 exhibits the second behavior.


	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

andrew@frip.WV.TEK.COM (Andrew Klossner) (08/12/89)

Dave Lord (lord@se-sd.NCR.COM) writes:

	"I've looked at code generated by three different compilers for
	the 88K (GreenHills, GNU, & LPI) and I don't believe any of
	them EVER put a useful instruction in the delayed branch slot."

The GNU-88k compiler as developed at Data General does a pretty good
job of filling branch delay slots, if you remember to include the -O
(optimize) option.  In a random sample of one big file (malloc.c), I
found that 37 out of 82 branches did useful work in the delay slot
(i.e., the "opcode.n" form of the branch was emitted).

  -=- Andrew Klossner   (uunet!tektronix!frip.WV.TEK!andrew)    [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]

mccoy@pixar.UUCP (Daniel McCoy) (08/17/89)

In article <26676@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
> (Response to other articles about branch shadow restrictions.)
>
>  Does anyone else know of other processors with such restrictions?

Weitek XL.
I left the manuals at my previous job, but, if I remember correctly,
some instruction types make other uses of the field required to control 
neutralization (their term for non-execution of shadow instructions).
Their neutralization control allows shadow instructions to be:
     always executed
     executed only if branch taken
     executed only if branch not taken
     executed and return address incremented for calls
They provide a post-processor for compiled code that does a good job
of filling slots.
In hand coding I was able to fill nearly all of the slots.
The control over execution for branch taken/not taken is especially
good for hand coding.  You can execute the instructions on the
most critical path and put the delays on the least critical paths.

Not a bad processor for embedded controller stuff.
The floating point pipeline is certainly easier to use than the i860.

Dan McCoy	Pixar

pc@eecg.toronto.edu (Paul Chow) (08/31/89)

In article <26716@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>Well, from the previous postings and email conversations, it appears
>that nearly every RISC processor besides the Am29000 has a restriction
>on what can go in a branch delay slot, including SPARC, MIPS, 88000,
>i860, and ROMP.  Most of the restrictions are advisory (don't do this; the
>result is undefined), but the ROMP has hardware to detect and trap this
>condition.
>
>One interesting thing to think about if control transfers are allowed in
>branch delay slots is how a delay-slot call should work:

I missed the beginning of all this but I thought this might be interesting.

The Stanford MIPS-X processor has a branch/jump delay of 2.  The new
PC is computed by adding the offset to the value of the current PC when
the instruction reaches its ALU stage.  The pipeline is: IF RF ALU MEM WB.
So for the following:

	0	br offset=5
	1	nop (delay slot 1)
	2	nop (delay slot 2)
	3	...
	...
	7	nop (br destination)

the sequence would be 0, 1, 2, 7.

Now consider this interesting loop:

	0	nop
	1	nop
	2	nop
	3	br offset=-1
	4	br offset=-1
	5	br offset=-1
	6	nop
	7	nop

The sequence of the PC is 0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0, 1, 2, 3, ...

One morning I walked in and I saw someone running the simulator with
the PC running backwards!  This worried me because I had designed the PC,
and had never expected that it could `increment' backwards...

On the question of calls in delay slots, MIPS-X does the same as Tim explains
for the Am29000, except that there are two delay slots.

Paul Chow
Dept. of Electrical Engineering
University of Toronto

pc@eecg.toronto.edu