cdddta@tasu76.UUCP (David Deitcher) (07/30/89)
"Delayed branch" is a technique used by RISC machines to make use of the extra cycle needed to calculate branch targets. The compiler will put an instruction after the branch to be executed by the CPU while the branch target is being calculated. Does anyone have information as to how often the compiler is able to put a useful instruction after the branch as opposed to filling it with a NOP? ----------------------------------------------------- David Deitcher (National Semiconductor Israel) UUCP: ...!{amdahl,hplabs,decwrl}!nsc!taux01!cdddta Domain: cdddta@taux01.nsc.com
cik@l.cc.purdue.edu (Herman Rubin) (08/01/89)
In article <2246@taux01.UUCP>, cdddta@tasu76.UUCP (David Deitcher) writes:
> "Delayed branch" is a technique used by RISC machines to make use of the
> extra cycle needed to calculate branch targets.  The compiler will put
> an instruction after the branch to be executed by the CPU while the
> branch target is being calculated.  Does anyone have information as to
> how often the compiler is able to put a useful instruction after the
> branch as opposed to filling it with a NOP?

This has nothing to do with the capabilities of the compiler, but I have
often wanted to put one, or even several, instructions between a branch
instruction and its execution.  As most problems have a substantial amount
of order independence of operations, I would be surprised if a useful
instruction could not be inserted in well over 90% of the cases.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317) 494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
verwer@ruuinf.cs.ruu.nl (Nico Verwer) (08/01/89)
In article <2246@taux01.UUCP>, cdddta@tasu76.UUCP (David Deitcher) writes:
> "Delayed branch" is a technique used by RISC machines to make use of the
> extra cycle needed to calculate branch targets.  The compiler will put
> an instruction after the branch to be executed by the CPU while the
> branch target is being calculated.  Does anyone have information as to
> how often the compiler is able to put a useful instruction after the
> branch as opposed to filling it with a NOP?

In "Reduced Instruction Set Computers", Communications of the ACM,
Jan. 1985 (pp. 8-21), D. A. Patterson writes (on page 13):

"  RISC optimizing compilers are able to successfully rearrange instructions
"  to use the cycle after the delayed branch more than 90 percent of the
"  time.  Hennessy has found that more than 20 percent of all instructions
"  are executed in the delay after the branch.

The article he is referring to is in IEEE Trans. Comput., but had yet to
be published at that time.
=====================================================================
Nico Verwer, University of Utrecht, Dept. of Computer Science
VOICE: +31-30-533921           PHYSICAL MAIL: Postbus 80.089,
INTERNET: verwer@cs.ruu.nl                    3508 TB Utrecht,
UUCP: ...!hp4nl!ruuinf!verwer                 the Netherlands
=====================================================================
frazier@oahu.cs.ucla.edu (Greg Frazier) (08/01/89)
In article <1462@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
->In article <2246@taux01.UUCP>, cdddta@tasu76.UUCP (David Deitcher) writes:
->> branch target is being calculated. Does anyone have information as to
->> how often the compiler is able to put a useful instruction after the
->> branch as opposed to filling it with a NOP?
->
->This has nothing to do with the capabilities of the compiler, but I have
->often wanted to put one, or even several, instructions between a branch
->instruction and its execution.  As most problems have a substantial amount
->of order independence of operations, I would be surprised if a useful
->instruction could not be inserted in well over 90% of the cases.

Studies have shown (I love that phrase) that one delay can be filled 70%
of the time, and a second 30% of the time (approx, of course).  There are
two problems with filling delay slots.  The first is that branches occur
on the order of every 3 to 5 instructions - that's a lot of slots to
fill.  The second is that one often branches on a value immediately after
calculating it (i++; if i > 5 ...).

Oh, you want a reference for those numbers?  The branch frequency comes
from some of the RISC papers from Berkeley (Patterson et al.).  I think
the slot filling numbers come from the same, but I'm not sure.

Greg Frazier
CS dept., UCLA
Internet: frazier@CS.UCLA.EDU
UUCP: ...!{ucbvax,rutgers}!ucla-cs!frazier
raj@hpindwa.HP.COM (Richard Jones) (08/02/89)
On the subject of how often the branch delay slot is useful: an informal
look at some code (compiler optimised) for the HP3000 PA (precision arch.)
would lead me to believe that that slot can be used frequently for useful
work.  One obvious example is the branch taking you to another procedure;
the delay slot can be used to load up a parameter...

Of course, your mileage may vary,

rick 'by no means a hardware type' jones
baum@Apple.COM (Allen J. Baum) (08/03/89)
>In article blahblahblah frazier@cs.ucla.edu (Greg Frazier) writes:
>
>Studies have shown (I love that phrase) that one delay can be filled
>70% of the time, and a second 30% of the time (approx, of course).
>There are two problems with filling delay slots.  The first is that
>branches occur on the order of every 3 to 5 instructions - that's
>a lot of slots to fill.  The second is that one often branches on
>a value immediately after calculating it (i++; if i > 5 ...).  Oh,
>you want a reference for those numbers?  The branch frequency comes
>from some of the RISC papers from Berkeley (Patterson et al.).  I
>think the slot filling numbers come from the same, but I'm not sure.

An interesting point to ponder is that delayed loads also have to schedule
something into their shadows, and they are (very roughly) as frequent as
branches.  Now, there are only so many instructions that can be re-arranged,
so it is possible that although branch shadows can be filled 70% of the
time, and load shadows can be filled 70% of the time, it may not be true
that you can fill both of them 70% of the time.  Any comments from someone
who has actually measured this?  It might be interesting to turn off
filling of one, and see how the percentage of filling the other increases.
--
baum@apple.com  (408) 974-3385
{decwrl,hplabs}!amdahl!apple!baum
lord@se-sd.NCR.COM (Dave Lord ) (08/03/89)
In article <2246@taux01.UUCP> cdddta@tasu76.UUCP (David Deitcher) writes:
>"Delayed branch" is a technique used by RISC machines to make use of the
>extra cycle needed to calculate branch targets.  The compiler will put
>an instruction after the branch to be executed by the CPU while the
>branch target is being calculated.  Does anyone have information as to
>how often the compiler is able to put a useful instruction after the
>branch as opposed to filling it with a NOP?

You mean in theory or in real life? :-)

I've looked at code generated by three different compilers for the 88K
(GreenHills, GNU, & LPI) and I don't believe any of them EVER put a
useful instruction in the delayed branch slot.  Admittedly the 88K is
still pretty new and these were all early compilers.  I suspect that the
reason the delayed branch slots are not used is that the register
allocators are not smart enough to hold a register after a branch.

Hopefully this will change.  Anyone have any idea what percentage of
typical code is branches?  It would be interesting to know how much
performance could be gained by filling those slots.
glenn@csri.toronto.edu (Glenn Mackintosh) (08/03/89)
frazier@oahu.cs.ucla.edu (Greg Frazier) writes:
>There are two problems with filling delay slots.  The first is that
>branches occur on the order of every 3 to 5 instructions - that's
>a lot of slots to fill.  The second is that one often branches on
>a value immediately after calculating it (i++; if i > 5 ...).

Hmmm.  That just made me think about something.  Has anyone looked at
taking something like the above and having the compiler recognize that it
could rearrange things a bit in cases like this?  For example, the
compiler could change this to do a compare against 4 and move the
increment into the delay slot after the test.

I have a feeling that a lot of loop code has this kind of termination
structure.  If you just add/subtract some fixed number from your loop
control variable and then test to see if it has gone beyond a fixed
range, then you could easily adjust the range so that the test could be
done first.  Has anybody done this kind of thing, and if so, how did it
come out?

Glenn Mackintosh
Univ. of Toronto

CSNET/ARPA: glenn@eecg.{toronto.edu,utoronto.ca}
CDNNET:     glenn@eecg.toronto.cdn
BITNET:     glenn@eecg.utoronto.bitnet (may not work from all sites)
UUCP:       uunet!utai!eecg!glenn
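[The rewrite Glenn is describing can be shown in Python terms.  The two loops below are a sketch of the idea, not real compiler output: the original tests the counter *after* incrementing it (so the branch depends on the instruction just before it, and the slot gets a NOP); the transformed version tests against the adjusted bound *before* incrementing, so the increment can sit in the delay slot, executing whether or not the branch is taken.]

```python
def original(n):
    """body; i += 1; branch back if i <= n -- slot unfillable."""
    i, trace = 0, []
    while True:
        trace.append(i)   # loop body
        i += 1            # increment...
        if i > n:         # ...then test: branch depends on the i += 1
            break
    return i, trace

def transformed(n):
    """body; branch back if i <= n-1, with i += 1 in the delay slot."""
    i, trace = 0, []
    while True:
        trace.append(i)      # loop body
        taken = i <= n - 1   # compare against the adjusted bound first
        i += 1               # "delay slot": executes either way
        if not taken:
            break
    return i, trace

# Same iterations, same final counter value, for n = 5
# (i.e. the "compare against 4" case from the post).
assert original(5) == transformed(5)
print(transformed(5))   # -> (6, [0, 1, 2, 3, 4, 5])
```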
preston@titan.rice.edu (Preston Briggs) (08/03/89)
In article <33669@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>An interesting point to ponder is that delayed loads also have to schedule
>something into their shadows, and they are (very roughly) as frequent as
>branches.  Now, there are only so many instructions that can be re-arranged,
>so it is possible that although branch shadows can be filled 70% of the time,
>and load shadows can be filled 70% of the time, it may not be true that you
>can fill both of them 70% of the time.  Any comments from someone who has
>actually measured this?  It might be interesting to turn off filling of one,
>and see how the percentage of filling the other increases.

Martin Hopkins, writing about the PL.8 compiler and the 801, points out
that it's especially nice if you can schedule a load in the branch
"shadow" (great term, by the way).  That way the load completes while the
branch is branching.  The PL.8 compiler does some non-trivial
rearrangement (really rewriting) to achieve this.

I schedule for both cases, but I don't have any good numbers I can give
you now (haven't measured the results recently; wish I had a pixie).  My
impression is that I do better on branches than loads.

On the RT, the "flavor" of the two problems is a little different.  Loads
cast a long shadow and I try to avoid scheduling instructions that use
the loaded register during the shadow.  For branches, I try to find an
instruction that doesn't interfere with the CC and can execute at the end
of the block.  They aren't necessarily competing goals.  An easy example
is a mem-mem copy, that is, a LOAD followed by a STORE.  The STORE can be
scheduled in the branch slot, and the LOAD as early as possible (up to 6
cycles anyway).

On the other hand, your point seems valid.  I'm just not sure of an easy
way to measure the effect.

Regards,
Preston Briggs
hankd@pur-ee.UUCP (Hank Dietz) (08/03/89)
In article <33669@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>An interesting point to ponder is that delayed loads also have to schedule
>something into their shadows, and they are (very roughly) as frequent as
>branches.  Now, there are only so many instructions that can be re-arranged,
>so it is possible that although branch shadows can be filled 70% of the time,
>and load shadows can be filled 70% of the time, it may not be true that you
>can fill both of them 70% of the time.  Any comments from someone who has
>actually measured this?  It might be interesting to turn off filling of one,
>and see how the percentage of filling the other increases.

I think delayed loads are MORE frequent than delayed branches for most
code and most target machines; however, the two are essentially
independent pipelines.  In other words, for a lot of target machines, one
can use a delayed branch in a delay slot for a delayed load (and vice
versa).  Hence, with appropriate compiler algorithms, the two don't
interfere that much.

I have a student, Ashar Nissar, who is doing his MS thesis on
compile-time scheduling for multiple pipelines.  It's a little early to
give results, and numerical results are far too dependent on the precise
target machine, but qualitatively it is safe to say that having multiple
INDEPENDENT pipelines doesn't seem to cause much trouble.  Notice that I
didn't say that the usual pipeline scheduling techniques work all that
well... we are currently using an algorithm based on trying ALL legal &
useful permutations of instruction sequences prior to register
allocation.  Right now, we can quickly get an optimal schedule for up to
about 16 instructions, but have to truncate the search for larger
schedules.

-hankd@ecn.purdue.edu
roelof@idca.tds.PHILIPS.nl (R. Vuurboom) (08/03/89)
In article <1996@se-sd.NCR.COM> lord@se-sd.UUCP (Dave Lord (SSP)) writes:
>Hopefully this will change.  Anyone have
>any idea what percentage of typical code is branches?  It would be
>interesting to know how much performance could be gained
>by filling those slots.

For the Motorola 68k I've seen one study quoting 23% of executed code
being branches.  I seem to remember hearing somewhere else a figure of
1 in 6, so somewhere between 1 in 4 and 1 in 6 would be my guess.

Which explains why (to answer an unasked question :-) you'll see
instruction cache line sizes for 68k architectures getting up to 16 bytes
but no more: average instruction length is 3 bytes, 3x5=15, so pulling in
extra instructions would just waste memory bandwidth.
--
I don't know what the question means, but the answer is yes...
KLM - Koninklijke Luchtvaart Maatschappij => coneenclicker lughtfart matscarpie
Roelof Vuurboom  SSP/V3  Philips TDS Apeldoorn, The Netherlands  +31 55 432226
domain: roelof@idca.tds.philips.nl   uucp: ...!mcvax!philapd!roelof
hankd@pur-ee.UUCP (Hank Dietz) (08/04/89)
In article <1989Aug2.160111.9324@jarvis.csri.toronto.edu> glenn@csri.toronto.edu (Glenn Mackintosh) writes:
>... For example the compiler could
>change this to do a compare against 4 and move the increment into the delay
>slot after the test. ...

I haven't seen loop transforms done to optimize FINE GRAIN pipelines...
although it shouldn't be all that hard to do.  A few years ago, somebody
(at Stanford?) had a paper out on converting between equality and
inequality comparisons so that better instruction sequences would apply.
That's the closest I recall seeing.

-hankd@ecn.purdue.edu
pauls@apple.com (Paul Sweazey) (08/04/89)
In article <221@ssp1.idca.tds.philips.nl> roelof@idca.tds.PHILIPS.nl (R. Vuurboom) writes:
> For the motorola 68k I've seen one study quoting 23% of executed code
> being branches.  I seem to remember hearing somewhere else a figure of
> 1 in 6 so somewhere between 1 in 4 and 1 in 6 would be my guess.
>
> Which explains why (to answer an unasked question :-) you'll see instruction
> cache line sizes for 68k architectures getting up to 16 bytes but no more:
> average instruction length is 3 bytes 3x5=15 pulling in extra instructions
> would just waste memory bandwidth.

However, there is a high probability that the branch target is only a few
instructions away, or that execution will soon run into or branch to the
remainder of the larger cache line.  Other issues such as memory
operation or interconnect bandwidth and latency may be the real deciding
factors.

Paul Sweazey
Apple Computer, Inc.
pauls@apple.com
(408) 974-0253
jesup@cbmvax.UUCP (Randell Jesup) (08/05/89)
In article <26139@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
>Studies have shown (I love that phrase) that one delay can be filled
>70% of the time, and a second 30% of the time (approx, of course).
>There are two problems with filling delay slots.  The first is that
>branches occur on the order of every 3 to 5 instructions - that's
>a lot of slots to fill.  The second is that one often branches on
>a value immediately after calculating it (i++; if i > 5 ...).  Oh,
>you want a reference for those numbers?  The branch frequency comes
>from some of the RISC papers from Berkeley (Patterson et al.).  I
>think the slot filling numbers come from the same, but I'm not sure.

	Note that a) branch frequency depends a lot on the architecture,
and also on the compiler's optimizer; b) slot-filling percentages (often
taken as a "universal" figure) depend highly on the instruction set of
the machine, and even the type of code being compiled.  In particular,
2-address instruction machines are better at filling branches than
3-address machines (they're more likely to have an instruction pair that
can fill a 2-instruction delay, and more likely to have pairs of
instructions they can split up to fill branch and load delays).

	One has to be careful about assumptions.  It's easy to take
someone else's research as being directly applicable to a different
architecture, but often it isn't, or is only partially applicable.  The
acid test is to simulate the damn thing with real code and real
compilers and gather real statistics.
--
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup
Common phrase heard at Amiga Devcon '89: "It's in there!"
jonah@db.toronto.edu (Jeffrey Lee) (08/07/89)
In article <7543@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>	Note that a) branch frequency depends a lot on the architecture, and
>also on the compiler's optimizer; b) slot-filling percentages (often taken
>as a "universal" figure) depend highly on the instruction set of the machine,
>and even the type of code being compiled.

Take a look at the article by Dan Klein about CISC/RISC compiler
construction in the proceedings of the Winter 89 USENIX (San Diego).  He
surveyed a number of CISC and RISC compilers and took polls of the usage
frequency of the various instructions.  He found a surprising degree of
similarity of instruction frequency in code generated by C compilers on
various CISC/RISC platforms.  The following table is not found in the
article, but is derived from the information in it:

  I/KVax  move  arith  log  cmp  cmpbra  bra  call    (excluding noop)
    1000   402     80   12  133     157   90   126    vax/pcc
    1033   420     81   14  130     158   98   130    vax/tartan
    1158   472     96   15  166     165  103   136    vax/dec
    1262   588    156   20  135     155   83   122    mc68020/gnu
    1331   616    187   21  138     147   86   131    mc68020/sun
    1470   801    127  66*  151     148   67   164    sparc/sun
    1582   829    193  66*   72     170  117   131    88100/greenhills
    1763   969    290   37   37     206  114   153    r2000/mips

  [*: ori is often used to load the lower portion of a register.]

Most of the extra operations added in the RISC code seem to be loads and
stores (lack of memory-memory operations) and arithmetic/logical
operations (adds/shifts to make up for the reduced addressing modes).
Somewhat surprisingly, although the RISC code has more instructions than
the CISC code, it is less than a two-fold expansion in code size.
Adjusting for average instruction size and adding in the data space,
these RISC programs were only 25% to 50% larger than the VAX programs.
(The 68020 programs were often 25% smaller than the VAX code.)
In article <26139@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
>Studies have shown (I love that phrase) that one delay can be filled
>70% of the time, and a second 30% of the time (approx, of course).

This paper showed a measured branch and call (static) frequency ranging
from 22.6% to 27.8% for R2000/SPARC/88100 and from 27.5% to 37.4% for
MC68020/VAX.  Loads made up 22% to 31% of the RISC code.

The delay slot filling situation is somewhat confusing since some of
these processors have interlocked loads while others do not.  Similarly,
some processors conditionally execute the branch delay slot instruction
while others always execute it.  [GOS == Grain-Of-Salt: I may be wrong
in the following.]  The MIPS/R2000 compiler used there [check the paper
for the version number -- I don't have a copy handy] filled
load/branch/call [GOS] delay slots about 76% of the time, leading to a
14% noop frequency.  The Sun/SPARC compiler has interlocked loads [GOS]
and filled its branch/call [GOS] delay slots about 75% of the time,
leading to a 4% noop frequency.  The 88100 has interlocked loads and
conditional delay slots [GOS], leading to 0% NOPs.

Interlocked loads make it easier on the compiler writer: it means they
don't have to worry about filling delay slots after load instructions.
Unfortunately it also means that they might not try to prefetch data and
end up with slower code as a result (with stalls instead of NOPs).
However, if they do try to prefetch data (and if the interlocks do not
lengthen the cycle time) they could be ahead: a one-cycle stall should
be slightly faster than a NOP because it can never cause a cache miss.
Additionally, the lack of NOPs means that a few more ``useful''
instructions can be stored in the same sized cache.

In article <26139@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
>There are two problems with filling delay slots.
>The first is that
>branches occur on the order of every 3 to 5 instructions - that's
>a lot of slots to fill. ...

Having conditional delay slots for conditional branches takes some of
the sting out of this.  Branch slots are harder to fill when they are
unconditional: the delay instruction must be ``useful'' (or harmless)
for both the branch-taken and branch-not-taken cases.

The delay action can be made ``conditional'' by adding silicon to
nullify the action (e.g. prevent writeback of the results to the
register cells, or abort a load/store instruction) in the case of the
branch not taken (or vice versa).  In this situation the branch slot can
always be filled for static branch targets: simply duplicate the branch
target instruction in the delay slot and move the target forward by one
instruction.  Alternatively, the delay instruction could be aborted when
the branch IS taken and used if the branch falls through.  Which choice
to implement depends on the expected probability of branch taken versus
not taken.  [Sorry, I have no data on this.  John??]  Of course in these
cases the delay instruction is only useful when the branch is taken (or
not taken).

It is better to fill an unconditional branch slot than a conditional
one, assuming that the delay action is useful in both cases; however, it
is also useful to have the option of conditional delay slots to satisfy
(half of) the remaining cases.  Whether or not the extra silicon is
justified depends on (a) how often conditional branches are expected,
(b) how often we expect unconditional delay slots to remain unfilled,
and (c) how hard it is to ``nullify'' the delay instructions without
increasing the cycle time (among other things).  [N.B. One might need
some circuitry to nullify partial instructions anyway, so that execution
can be aborted and restarted in case of a page fault.]
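[The duplicate-the-target trick above can be sketched with a toy machine, not any real processor: the delay slot is annulled when the branch is NOT taken, so the slot holds a copy of the original target instruction and the branch is retargeted one instruction forward.  Instruction names and encodings here are invented.]

```python
def run(prog, fuel=20):
    """Toy annulling-delay-slot machine.  prog entries are ('op', name),
    ('bt', target) for a taken branch, or ('bnt', target) for the same
    branch not taken.  The delay slot executes only if the branch is
    taken (annul-if-not-taken).  Returns the trace of executed ops."""
    pc, trace = 0, []
    while pc < len(prog) and fuel:
        fuel -= 1
        kind, arg = prog[pc]
        if kind == "op":
            trace.append(arg); pc += 1
        else:
            taken = kind == "bt"
            if taken and pc + 1 < len(prog):
                trace.append(prog[pc + 1][1])   # slot executes
            pc = arg if taken else pc + 2        # slot squashed otherwise

# Original target was "X"; the slot duplicates "X" and the branch is
# retargeted to the instruction after it ("Y").
prog = [("bt", 3), ("op", "X"), ("op", "fall"), ("op", "Y")]
# Taken: we execute the duplicated "X" in the slot, then "Y" -- exactly
# what jumping to the original target would have done.
# Not taken: the slot is annulled and fall-through code runs untouched.
```

[Running the taken case gives `["X", "Y"]`; flipping the branch to `bnt` gives the fall-through trace with the duplicated "X" squashed, which is why this filling strategy is always available for static targets.]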
In article <26139@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
> ...The second is that one often branches on
>a value immediately after calculating it (i++; if i > 5 ...).

As to using a value immediately after calculating it: I don't know about
other processors, but on the MIPS/R2000 there is little problem in this.
The result of any register/immediate integer instruction (other than
multiply or divide) is always available for use as an operand in the
next instruction without any delay.  Having just altered the value is
actually advantageous: it means that the value will be stored-in or
cached-in a register.  Since many branches tend to be preceded by
increment or decrement instructions rather than multiplies or divides,
there is little problem.  If the value being tested is coming from
memory, you are stuck with the load delay (but may be able to pre-fetch
it some of the time).

Perhaps someone who has worked on designing RISC chips would care to
comment on the cost of adding interlocked loads and conditional delay
slots.
--
Jeff Lee <jonah@db.toronto.edu> or <...!utzoo!utai!jonah>
paul@moncam.co.uk (Paul Hudson) (08/07/89)
For some reason known only to the designers, the ARM doesn't have a
delayed branch slot (blowing away the pipeline instead).  What it does
have is conditional execution of every instruction.  This means that
many (conditional) branches can be replaced by conditionally executing
the next few instructions instead, thus side-stepping the scheduling
problem.  Since you can also choose whether an arithmetic instruction
affects the condition codes, it's easy to arrange (usually in a
peepholing pass).

IMHO, they should have had a delay slot too, though.  The ARM
instruction set is one of the nicest to deal with, both for hand-coding
and compiler writing - well worth a look.
--
Paul Hudson        These opinions void where prohibited by law.
Until 23 August (but mail will be forwarded for a while):
MAIL:  Monotype ADG, Science Park, Cambridge, CB4 4FQ, UK.
PHONE: +44 (223) 420018   EMAIL: paul@moncam.co.uk   FAX: +44 (223) 420911
On vacation until September 6, then
EMAIL: ..!mcvax!i2unix!iconet!trzdor1!paul, paul@trzdor1.ico.olivetti.com
jrc@ukc.ac.uk (John) (08/07/89)
In article <PAUL@marvin.moncam.co.uk> paul@moncam.co.uk (Paul Hudson) writes:
>For some reason known only to the designers, the ARM doesn't have a
>delayed branch slot (blowing away the pipeline instead).  What it does
>have is conditional execution of every instruction ...

I believe it's because the ARM's designers are all assembly code freaks
and didn't like the idea of having to do all that reordering by hand.
Something of a copout!

>... IMHO, they should have had a delay slot too, though.

Yes, it would have added another few % of performance at no cost (except
to the mental health of Roger Wilson).

John Cupitt
wilken@husky.ece.cmu.edu (Kent Wilken) (08/08/89)
In article <1989Aug6.164210.11976@jarvis.csri.toronto.edu> jonah@db.toronto.edu (Jeffrey Lee) writes:
>... Having conditional delay slots for conditional branches takes
>some of the sting out of this. ...
>The delay action can be made ``conditional'' by adding silicon to
>nullify the action (e.g. prevent writeback of the results to the
>register cells or abort a load/store instruction) in the case of the
>branch not taken (or vice versa). ...
>Which choice to implement depends on the expected probability of
>branch taken versus not taken.  [Sorry, I have no data on this.] ...

The paper "Reducing the Cost of Branches" by McFarling and Hennessy
(13th Comp. Arch. Conf., pp. 396-403) addresses this topic.  Data is
presented that shows the branch-taken probability to be roughly 65%.
The paper shows that the nullify action, called "squashing", can
significantly reduce the cost of branches for a profiled program.
("Cost" is a performance metric that includes the performance cost of
NOPs.)

A following paper, "Architectural Tradeoffs in the Design of the MIPS-X"
by Chow and Horowitz (14th Comp. Arch. Conf., pp. 300-308), shows that
squashing melds nicely with exception handling (i.e., the amount of
added silicon is small).  They also report that the MIPS-X, which has
two branch delay slots, uses from 16-18% NOPs.

Kent Wilken <wilken@husky.ece.cmu.edu>
rcpieter@eutrc4.urc.tue.nl (Pieter Schoenmakers) (08/09/89)
Just wondering---
 - What happens on existing processors which use delayed branches when
   the instruction put in the branch instruction's shadow is also a
   branch?
 - Do existing processors have a separate adder for use by branches
   only, or are there restrictions on the possible instructions which
   can be put in the shadow?
 - Are there any processors which use branch delaying and don't have a
   separate address calculation adder?

Tiggr
tim@cayman.amd.com (Tim Olson) (08/09/89)
In article <828@eutrc3.urc.tue.nl> rcpieter@rc4.urc.tue.nl writes:
| Just wondering---
| - What happens on existing processors which use delayed branches when the
| instruction put in the branch instruction's shadow is also a branch?

You get what we term a "visit".  This executes the single instruction at
the first branch target, then continues with the second (delay-slot)
branch's target.  The second branch does not have a physical delay slot
following it; rather, the instruction at the first branch's target acts
as the second branch's delay slot.

This does have its uses -- for example, a debugger can replace an
instruction with a specific breakpoint instruction, saving the replaced
instruction elsewhere.  When it comes time to restart execution, the
debugger can execute the replaced instruction "out-of-line" with a
visit, allowing the breakpoint instruction to physically remain at the
breakpoint location.

| - Do existing processors have a separate adder for use by branches only, or
| are there restrictions on the possible instructions which can be put in the
| shadow?
| - Are there any processors which use branch delaying and don't have a
| separate address calculation adder?

I know of no processors that have delayed branches and restrictions on
delay instruction types.  However, there really isn't any resource
contention between a branch and its delay slot.  The delay slot is just
being decoded by the time the branch executes.  The reason that a
processor may have a separate adder to calculate PC-relative branch
offset addresses is that it must perform an instruction cache lookup
during the branch execution, and thus must have the address calculated
by the end of the decode stage (at least, this is the way it works on
the Am29000).
--
Tim Olson
Advanced Micro Devices
(tim@amd.com)
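[The "visit" is easy to see with a toy simulator of a one-slot delayed-branch pipeline (an invented machine, not the Am29000): a branch's target takes effect only after the following instruction executes, so when the slot itself holds a branch, exactly one instruction at the first target runs before control continues at the second target.]

```python
def run(prog, steps=10):
    """prog entries are ('op', name) or ('br', target).  The
    instruction after a branch (its delay slot) always executes;
    the branch target takes effect one instruction later."""
    trace, pc, pending = [], 0, None   # pending: target to jump to next
    for _ in range(steps):
        if pc >= len(prog):
            break
        kind, arg = prog[pc]
        if kind == "op":
            trace.append(arg)
        next_pc = pc + 1
        if pending is not None:        # a branch resolved last cycle
            next_pc, pending = pending, None
        if kind == "br":
            pending = arg              # takes effect after the slot
        pc = next_pc
    return trace

# br A with another branch (br B) in its delay slot: one instruction at
# A executes, then control continues at B -- a "visit".
prog = [
    ("br", 3),    # 0: branch to A
    ("br", 5),    # 1: delay slot is itself a branch, to B
    ("op", "skipped"),
    ("op", "A0"), # 3: first target -- acts as the second branch's slot
    ("op", "A1"), # never reached
    ("op", "B0"), # 5: second target
    ("op", "B1"),
]
print(run(prog))   # -> ['A0', 'B0', 'B1']
```

[This is also the mechanism behind the out-of-line breakpoint restart described above: the "visited" instruction at the first target is the saved original, and execution then resumes past the breakpoint.]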
jk3k+@andrew.cmu.edu (Joe Keane) (08/09/89)
In article <26667@amdcad.AMD.COM> tim@cayman.amd.com (Tim Olson) writes:
>I know of no processors that have delayed branches and restrictions on
>delay instruction types.

The IBM RT manual says (rather verbosely IMHO):

    Certain instructions are not allowed to be the subject of a branch
    with execute instruction.  Since the branch with execute
    instructions change the normal sequential execution of
    instructions, the subject instructions cannot be an instruction
    that also changes the instruction sequencing; otherwise the
    processor may be put in an unpredictable state.  Thus, all
    branches, jumps, traps, Load Program Status, Supervisor Call, and
    Wait instructions cannot be subject instructions.  The system
    processor provides hardware that detects these illegal instructions
    and causes a program check interrupt if any of these instructions
    occur as the subject of a branch with execute instruction.
jjb@sequent.UUCP (Jeff Berkowitz) (08/09/89)
In article <828@eutrc3.urc.tue.nl> rcpieter@rc4.urc.tue.nl writes:
>Just wondering---
> - What happens on existing processors which use delayed branches when the
>instruction put in the branch instruction's shadow is also a branch?

If you haven't thought this through, it's quite amusing.  Given such a
machine, the typical CISC code (erroneous in this case, of course) -

	looptop:
		...
		jsr	subr
		br	looptop

causes exactly *one* instruction at "subr" to be executed before going back
to "looptop".  (When executing the jsr, you fetch the branch; when executing
the branch, you fetch @subr; when executing @subr, you fetch at looptop, etc.)

On one machine in my past, the architects were disturbed enough by this
behavior to "fix" it in hardware: if a branch occurred in the shadow, the
machine automatically converted the "shadowed" branch into a noop!  If the
first branch happened to be a jsr, as in the example above, the machine would
not only "noop" it but would remember that it should be the return address
of "subr", so it would be executed when you got back.  This allowed code like

	jsr	syscall_enter
	bcs	error		# executed on return from syscall_enter

to work "just like a CISC".

The machine actually had two instructions in the "shadow" (we called these
"trailers").  So there were actually three possible return addresses of any
jsr instruction -

	the next instruction, if it was a branch, jump, or jsr
	else the instruction after that, if *it* was a branch, jump, or jsr
	else the instruction after *that*.

The instruction pipeline had to determine which to "push" on-the-fly.

Was this complexity worth it?  Well, presumably it improved code density.
I am not aware of any measurements.  It made breakpoint debugging a serious
pain; you needed a zillion types of breakpoint instructions so that if you
set a breakpoint on a branch that was the second trailer of a jsr, the
breakpoint instruction itself would *also* look like a branch.

Anyway, the machine was not successful, in part because of the complexity of
trying to implement a lot of little rules like this in hardware.
-- 
Jeff Berkowitz N6QOM			uunet!sequent!jjb
Sequent Computer Systems		Custom Systems Group
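The fetch bouncing Jeff describes can be sketched with a toy one-delay-slot
fetch model (the addresses below are invented for illustration, and jsr is
treated as a plain branch, ignoring the return-address push):

```python
def fetch_trace(branches, start=0, steps=9):
    """Toy one-delay-slot machine: a branch fetched on cycle t redirects
    the fetch on cycle t+2, i.e. the fetch after its delay slot.
    branches maps address -> branch target; other addresses are ordinary."""
    trace, pc, prev = [], start, None
    for _ in range(steps):
        trace.append(pc)
        nxt = branches[prev] if prev in branches else pc + 1
        prev, pc = pc, nxt
    return trace

# looptop = 0: jsr subr;  1: br looptop;  subr starts at 10
print(fetch_trace({0: 10, 1: 0}))
# -> [0, 1, 10, 0, 1, 10, 0, 1, 10]
```

Exactly one instruction of "subr" (address 10) is fetched per trip around
the loop; subr's second instruction is never reached, just as described
above.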
tim@cayman.amd.com (Tim Olson) (08/09/89)
In article <0YrsEMy00V4GQ=CJZw@andrew.cmu.edu> jk3k+@andrew.cmu.edu (Joe Keane) writes:
| The IBM RT manual says (rather verbosely IMHO):
|
| Certain instructions are not allowed to be the subject of a branch with execute
| instruction.  Since the branch with execute instructions change the normal
| sequential execution of instructions, the subject instructions cannot be an
| instruction that also changes the instruction sequencing; otherwise the
| processor may be put in an unpredictable state.  Thus, all branches, jumps,
| traps, Load Program Status, Supervisor Call, and Wait instructions cannot be
| subject instructions.  The system processor provides hardware that detects
| these illegal instructions and causes a program check interrupt if any of these
| instructions occur as the subject of a branch with execute instruction.

Thanks for the information -- I didn't have my RT arch. manual with me to
look it up.  Does anyone else know of other processors with such
restrictions?

-- 
	Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
marvin@oakhill.UUCP (Marvin Denman) (08/09/89)
In article <26667@amdcad.AMD.COM> tim@amd.com writes:
>In article <828@eutrc3.urc.tue.nl> rcpieter@rc4.urc.tue.nl writes:
>| Just wondering---
>| - What happens on existing processors which use delayed branches when the
>| instruction put in the branch instruction's shadow is also a branch?
>You get what we term a "visit".  This executes the single instruction
>at the first branch target, then continues with the second (delay-slot)
>branch's target.  The second branch does not have a physical delay slot
>following it; rather, the instruction at the first branch's target acts
>as the second branch's delay slot.

This is a good explanation of what occurs with delayed branches.  The 88100
has a few more variations of this effect due to the ability of the
instruction set to either execute the instruction in the delay slot or to
waive the instruction in the delay slot.  This for instance allows a "visit"
that does not execute the instruction.  Some of the other variations of
execute/no-execute may produce other obscure code sequencing.  Before using
any of these tricks, see the warning below.

>| - Do existing processors have a separate adder for use by branches only, or
>| are there restrictions on the possible instructions which can be put in the
>| shadow?
>| - Are there any processors which use branch delaying and don't have a
>| separate address calculation adder?
>
>I know of no processors that have delayed branches and restrictions on
>delay instruction types.  However, there really isn't any resource
>contention between a branch and its delay slot.  The delay slot is just
>being decoded by the time the branch executes.  The reason that a
>processor may have a separate adder to calculate PC-relative branch
>offset addresses is that they must perform an instruction cache lookup
>during the branch execution, and thus must have the address calculated
>by the end of the decode stage (at least, this is the way it works on
>the Am29000).
>
>-- Tim Olson
>Advanced Micro Devices
>(tim@amd.com)

The 88100 does have a separate adder for branches, like most other chips
with delayed branching, but the User's Manual explicitly states that for
future compatibility the delay-slot instruction cannot be a trap, jump,
branch, or any other instruction that modifies the instruction pointer.
In other words, this behavior, while it may be useful for at least one
special-case application, is not guaranteed to work the same on all
implementations of the architecture.  This allows the possibility of
different pipelining schemes in the future without unduly tying the
architect's hands.

Marvin Denman
Motorola 88000 Design
joeg@oakhill.UUCP (Joe Gutierrez) (08/10/89)
>In article <2246@taux01.UUCP> cdddta@tasu76.UUCP (David Deitcher) writes:
>>"Delayed branch" is a technique used by RISC machines to make use of the
>>extra cycle needed to calculate branch targets.  The compiler will put
>>an instruction after the branch to be executed by the CPU while the
>>branch target is being calculated.  Does anyone have information as to
>>how often the compiler is able to put a useful instruction after the
>>branch as opposed to filling it with a NOP?

In article <1996@se-sd.NCR.COM> lord@se-sd.NCR.COM (Dave Lord) writes:
>You mean in theory or in real life? :-)  I've looked at code generated
>by three different compilers for the 88K (GreenHills, GNU, & LPI) and
>I don't believe any of them EVER put a useful instruction in the
>delayed branch slot.  Admittedly the 88K is still pretty new and these
>were all early compilers.  I suspect that the reason the delayed branch
>slots are not used is that the register allocators are not smart enough
>to hold a register after a branch.  Hopefully this will change.  Anyone
>have any idea what percentage of typical code is branches?  It would be
>interesting to know how much performance could be gained by filling
>those slots.

Current 88000 compilers are quite effective at scheduling instructions in
the delay slot of branches if the compiler optimization flags are enabled.
But if the optimizers are not used, the delayed-branch instructions are not
emitted.

In addition to tons of other statistics, we have measured the frequency of
branches and the branch delay-slot utilization on several real applications,
including running the C compiler, a Lisp interpreter, and several Unix
utilities.  The applications were compiled with the Green Hills C compiler
using the -OLM level of optimization.  The measurements were made
dynamically while running the applications on an instruction set simulator.
The sample comprised over 400 million instructions executed.  Of these, 13%
were branches which were taken.  Of all the branches taken, 70% executed an
instruction in the delay slot.

Interestingly enough, the compiler doesn't always try to schedule
instructions in the delay slot of branches.  If it has to choose between
scheduling an instruction in the delay slot of a conditional branch or
scheduling the same instruction in the delay slots of an unconditional
operation such as a load, a multiply, or a floating-point operation, it
will choose the unconditional operation.

Jim Klingshirn
Motorola Inc. (88000 Design Group)
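As a quick sanity check on those 88000 figures (a back-of-envelope
calculation using only the numbers quoted above):

```python
total = 400_000_000           # dynamic instructions in the sample
taken = 0.13 * total          # 13% were taken branches
useful_slots = 0.70 * taken   # 70% of taken branches filled their delay slot

# Fraction of all dynamic instructions that were useful delay-slot work:
print(round(useful_slots / total, 3))  # -> 0.091
```

So roughly 9% of all executed instructions were useful delay-slot work,
somewhat below the "more than 20 percent of all instructions are executed
in the delay after the branch" figure attributed to Hennessy earlier in the
thread (though the two measures may not count quite the same thing).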
hsit@smdvx1.intel.com (Hon P. Sit) (08/11/89)
In article <2284@yogi.oakhill.UUCP> cs.utexas.edu!oakhill!marvin (Marvin Denman) writes:
>
>The 88100 does have a separate adder for branches like most other chips with
>delayed branching, but the User's Manual explicitly states that for future
>compatibility the delay slot instruction cannot be a trap, jump, branch, or
>any other instruction that modifies the instruction pointer.  In other words
>this behavior, while it may be useful for at least one special case
>application, is not guaranteed to work the same on all implementations of the
>architecture.  This allows the possibility of different pipelining schemes in
>the future without unduly tying the architect's hands.
>

The Intel i860 has similar restrictions.
gnu@hoptoad.uucp (John Gilmore) (08/11/89)
tim@cayman.amd.com (Tim Olson) wrote:
> Does anyone else know of other processors with such restrictions?
I'm surprised that nobody mentioned the SPARC. It has restrictions on
which types of branches can sit in the delay slot of which other
types. I think in the first draft of the architecture I was the one
who noticed that the intended "return from interrupt" sequence was one
of the invalid ones! I don't have my SPARC manual handy but as I
recall the invalid combinations are defined to "keep you executing in
the same address space but otherwise jump to an undefined location"...
I found this quite a botch for a CPU architecture but I'm not a chip
designer -- I got into this business via software. Then again, it's
been on the market for a few years and nobody seems to be screaming
about it.
A case where this bit me came up in the observation below:
While examining the function block profiler code (cc -a) I noticed an
interesting thing. If you do:
bcond,a foo [,a means annul]
instruction
foo:
What you have is a "skip on not condition" instruction. If the
condition is true, it does a delayed branch to foo, executing the
instruction in the delay slot. If the condition is false, it falls
thru, but annuls the instruction. In either case, the execution time
is the same (two cycles) and you end up at foo.
This reminds me of the "skip" instructions on the old DG Nova and Eclipse.
Quite nice on machines with a single size instruction.
You can also think of it as a "conditionally execute one instruction"
instruction; in this case you don't have to mentally reverse the condition.
E.g. blt,a foo; insn; foo: executes insn if less than.
There's a serious catch to it on the SPARC: the second instruction
cannot be a delayed control transfer [i.e. a branch with a delay
slot]. If it is, what the CPU does is undefined!
I was hoping to use this to shorten the block profiling code, but it doesn't
work because the second instruction is a CALL. Still, there are probably
places where the optimizers can use it.
--
John Gilmore {sun,pacbell,uunet,pyramid}!hoptoad!gnu gnu@toad.com
"And if there's danger don't you try to overlook it,
Because you knew the job was dangerous when you took it"
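The "conditionally execute one instruction" reading of the annulled branch
can be modeled in a few lines (a sketch of the annul-bit semantics as
described in the post, not of any real SPARC implementation):

```python
def annulled_branch(cond, annul=True):
    """Model a SPARC-style conditional branch whose target is branch+2,
    i.e. the instruction right after the delay slot.  Returns a pair:
    (was the delay-slot instruction executed?, next PC relative to branch).
    """
    if cond:
        return True, 2           # taken: delay slot executes, land at target
    return (not annul), 2        # not taken: the annul bit squashes the slot

# With the annul bit set this is "execute the delay-slot insn iff cond";
# either way control ends up at branch+2 after the same two cycles.
print(annulled_branch(True))     # -> (True, 2)
print(annulled_branch(False))    # -> (False, 2)
```

Note that with the annul bit clear (`annul=False`) the delay-slot
instruction executes whether or not the branch is taken, which is the
ordinary delayed-branch behavior.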
jgh@root.co.uk (Jeremy G Harris) (08/11/89)
In article <26676@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>Does anyone else know of other processors with such restrictions?

The Motorola MC88100 User's Manual disallows traps, jumps, branches etc.
from use in branch delay slots "to ensure future compatibility".

I'm intrigued that AMD didn't (not displeased, note).  The first time I was
reading a 29K manual and realised you might be able to do this sort of
thing, I said to myself "Aha!  The PDP-10 XCT instruction!"

Jeremy
-- 
Jeremy Harris			jgh@root.co.uk
rtrauben@cortex.Sun.COM (Richard Trauben) (08/11/89)
In article <8266@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>tim@cayman.amd.com (Tim Olson) wrote:
>> Does anyone else know of other processors with such restrictions?
>I'm surprised that nobody mentioned the SPARC.

Tim was informed off-line.

> SPARC has restrictions on which types of branches can sit in the delay
> slot of which other types.  I think in the first draft of the architecture
> I was the one who noticed that the intended "return from interrupt"
> sequence was one of the invalid ones!

The first draft of the architecture spec predates me.  However, branch
couples where a return-from-trap instruction resides in the delay slot are
perfectly legal SPARC code.  The primary use of the branch couple is to
restart faulted instructions that reside in the delay slot.  Thus the
sequence:

	a1:	JMPL	t1	(t1 = address of instruction to restart)
	a2:	RETT	t2	(t2 = address of successor to restarted instruction)
	.
	.
	t1:	I1	(restarted fault instruction with exception)
	.
	.
	t2:	I2	(successor PC to faulted instruction)
	t2+:	I2+	(successor to successor)
	.
	.

will be executed as a1, a2, t1, t2, t2+.

>There's a serious catch to it on the SPARC: the second instruction
>cannot be a delayed control transfer [i.e. a branch with a delay
>slot].  If it is, what the CPU does is undefined!

Your memory is in error.  The SPARC architecture has no such requirement.
Branch couples require the first branch to be unconditional; the second is
unconstrained.  If the first branch does not meet the above requirement,
the behavior is implementation-dependent.  Table 4-12 on page 47 of the
rev 50 SPARC architecture manual (August 8, 1987) identifies the behavior
for all types of branch couples.

	- Richard Trauben
	  Sun Microsystems
	  Mt View, California
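The JMPL/RETT restart sequence above can be traced with a toy one-delay-slot
fetch model (addresses are invented; JMPL and RETT are both treated simply
as unconditional delayed transfers, ignoring RETT's trap-state side effects):

```python
def fetch_trace(branches, start=0, steps=5):
    """One-delay-slot fetch model: a branch fetched on cycle t redirects
    the fetch on cycle t+2, after its delay slot.  branches: addr -> target."""
    trace, pc, prev = [], start, None
    for _ in range(steps):
        trace.append(pc)
        nxt = branches[prev] if prev in branches else pc + 1
        prev, pc = pc, nxt
    return trace

# a1 = 0: JMPL t1;   a2 = 1: RETT t2 (sitting in the JMPL's delay slot)
# t1 = 100: faulted instruction to restart;   t2 = 200: its successor
print(fetch_trace({0: 100, 1: 200}))
# -> [0, 1, 100, 200, 201], i.e. a1, a2, t1, t2, t2+ as stated above
```

The RETT's "delay slot" is the single restarted instruction at t1, which is
exactly what makes the couple useful for restarting a faulted delay-slot
instruction.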
tim@cayman.amd.com (Tim Olson) (08/11/89)
In article <8266@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
| tim@cayman.amd.com (Tim Olson) wrote:
| > Does anyone else know of other processors with such restrictions?
|
| I'm surprised that nobody mentioned the SPARC.

Well, from the previous postings and email conversations, it appears that
nearly every RISC processor besides the Am29000 has a restriction on what
can go in a branch delay slot, including SPARC, MIPS, 88000, i860, and
ROMP.  Most of the restrictions are advisory (don't do this; the result is
undefined), but the ROMP has hardware to detect and trap this condition.

One interesting thing to think about if control transfers are allowed in
branch delay slots is how a delay-slot call should work:

	loop:	.
		.
		jmp	loop
		call	lr0, function
	exit:	.
		.

Calls are typically defined in RISC processors to save the return address
in a register.  Since calls themselves have delay slots, the return address
is normally the second instruction after the call.  The action that a
delay-slot call takes depends upon how the return address is calculated in
the processor.  It could either be the address of the call + 2 (words), or
the address of the call's delay-slot instruction + 1.  These normally
result in the same value, but if the call is itself in a delay slot, they
work differently:

	ret <- call+2			ret <- call_delay+1

	jmp  loop			jmp  loop
	call lr0, function		call lr0, function
	<inst at loop>			<inst at loop>
	<inst at function>		<inst at function>
	   .				   .
	<return from function>		<return from function>
	<inst at exit+1>		<inst at loop+1>

In the former case, the jmp/call pair acts as a visit to the jmp's target,
and does not execute the instruction at exit (it substitutes the jmp's
target for the call's delay slot).  In the latter case, the jmp/call pair
continues the loop, executing the first instruction of the loop just before
the call target is executed, and returns to the second instruction in the
loop.  The Am29000 exhibits the second behavior.

-- 
	Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
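The two return-address rules can be made concrete with a small sketch (the
addresses below are invented for illustration):

```python
def return_address(call_addr, delay_slot_addr, rule):
    """Two candidate return-address rules for a delayed call:
    'call+2' : return to the address of the call plus 2 words
    'slot+1' : return to the address of the call's delay-slot insn plus 1
    """
    if rule == 'call+2':
        return call_addr + 2
    return delay_slot_addr + 1

# Ordinary call at 10 with its delay slot at 11: the two rules agree.
assert return_address(10, 11, 'call+2') == return_address(10, 11, 'slot+1') == 12

# Call at 21 sitting in the delay slot of "jmp loop" at 20, with loop = 0:
# the jmp's target substitutes for the call's delay slot.
print(return_address(21, 0, 'call+2'))  # -> 23, i.e. "exit+1"
print(return_address(21, 0, 'slot+1'))  # -> 1,  i.e. "loop+1" (the Am29000's choice)
```

The rules only diverge when the call itself occupies a delay slot, which is
exactly the case discussed above.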
andrew@frip.WV.TEK.COM (Andrew Klossner) (08/12/89)
Dave Lord (lord@se-sd.NCR.COM) writes:
"I've looked at code generated by three different compilers for
the 88K (GreenHills, GNU, & LPI) and I don't believe any of
them EVER put a useful instruction in the delayed branch slot."
The GNU-88k compiler as developed at Data General does a pretty good
job of filling branch delay slots, if you remember to include the -O
(optimize) option. In a random sample of one big file (malloc.c), I
found that 37 out of 82 branches did useful work in the delay slot
(i.e., the "opcode.n" form of the branch was emitted).
-=- Andrew Klossner (uunet!tektronix!frip.WV.TEK!andrew) [UUCP]
(andrew%frip.wv.tek.com@relay.cs.net) [ARPA]
mccoy@pixar.UUCP (Daniel McCoy) (08/17/89)
In article <26676@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
> (Response to other articles about branch shadow restrictions.)
>
> Does anyone else know of other processors with such restrictions?

Weitek XL.  I left the manuals at my previous job, but, if I remember
correctly, some instruction types make other uses of the field required to
control neutralization (their term for non-execution of shadow
instructions).  Their neutralization control allows shadow instructions to
be:

	always executed
	executed only if branch taken
	executed only if branch not taken
	executed and return address incremented for calls

They provide a post-processor for compiled code that does a good job of
filling slots.  In hand coding I was able to fill nearly all of the slots.
The control over execution for branch taken/not taken is especially good
for hand coding.  You can execute the instructions on the most critical
path and put the delays on the least critical paths.

Not a bad processor for embedded controller stuff.  The floating-point
pipeline is certainly easier to use than the i860's.

Dan McCoy
Pixar
pc@eecg.toronto.edu (Paul Chow) (08/31/89)
In article <26716@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>Well, from the previous postings and email conversations, it appears
>that nearly every RISC processor besides the Am29000 has a restriction
>on what can go in a branch delay slot, including SPARC, MIPS, 88000,
>i860, and ROMP.  Most of the restrictions are advisory (don't do this; the
>result is undefined), but the ROMP has hardware to detect and trap this
>condition.
>
>One interesting thing to think about if control transfers are allowed in
>branch delay slots is how a delay-slot call should work:

I missed the beginning of all this, but I thought this might be
interesting.  The Stanford MIPS-X processor has a branch/jump delay of 2.
The new PC is computed by adding the offset to the value of the current PC
when the instruction reaches its ALU stage.  The pipeline is:

	IF  RF  ALU  MEM  WB

So for the following:

	0	br	offset=5
	1	nop		(delay slot 1)
	2	nop		(delay slot 2)
	3	...
	...
	7	nop		(br destination)

the sequence would be 0, 1, 2, 7.  Now consider this interesting loop:

	0	nop
	1	nop
	2	nop
	3	br	offset=-1
	4	br	offset=-1
	5	br	offset=-1
	6	nop
	7	nop

The sequence of the PC is 0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0, 1, 2, 3, ...

One morning I walked in and I saw someone running the simulator with the PC
running backwards!  This worried me because I had designed the PC, and had
never expected that it could `increment' backwards...

On the question of calls in delay slots, MIPS-X does the same as Tim
explains for the Am29000, except that there are two delay slots.

Paul Chow
Dept. of Electrical Engineering
University of Toronto
pc@eecg.toronto.edu
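The backwards-running PC can be reproduced with a small fetch simulator (a
sketch of the two-delay-slot, offset-relative-to-the-ALU-stage-PC scheme
described in the post; the instruction encoding is invented):

```python
def mipsx_trace(branches, steps):
    """MIPS-X-style fetch model: a branch fetched on cycle t reaches the
    ALU on cycle t+2, and its target is (the PC being fetched on cycle t+2)
    plus the branch offset.  branches maps address -> offset; every other
    address is treated as a nop."""
    trace, pc, fetched = [], 0, []
    for _ in range(steps):
        trace.append(pc)
        fetched.append(pc)
        # The instruction fetched two cycles ago is now in the ALU stage.
        if len(fetched) >= 3 and fetched[-3] in branches:
            pc = pc + branches[fetched[-3]]
        else:
            pc = pc + 1
    return trace

print(mipsx_trace({0: 5}, 4))                  # -> [0, 1, 2, 7]
print(mipsx_trace({3: -1, 4: -1, 5: -1}, 14))
# -> [0, 1, 2, 3, 4, 5, 4, 3, 2, 1, 0, 1, 2, 3]: the PC runs backwards
```

Each br's target is computed against a PC that earlier branches have
already pulled backwards, which is what lets three offset=-1 branches walk
the PC all the way back to 0.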