chris@mimsy.umd.edu (Chris Torek) (12/23/90)
In article <14692@goofy.megatest.UUCP> megatest!djones@decwrl.dec.com (Dave Jones) writes: >Okay, yet one more thing: If you have a lot of keywords, and you compile this >on an old, brain-damaged BSD compiler, remember to use the -J flag, which >means, "Do not generate pseudo-random jumps." :-( Minor correction: this is not what -J means. The VAX architecture is particularly maddening when it comes to branches. It has four different kinds of branches: conditional byte branches: b<cond> <relative displacement> (the condition can be `unconditional') unconditional word branches: brw <relative displacement> unconditional longword branches (`jumps'): jmp <address> and `special' branches (built into instructions like acbl and sobgeq) which are sometimes byte displacements and sometimes word displacements. The BSD VAX assembler offers `pseudo branch' instructions for b<cond> and for unconditional byte-or-word branches: j<cond> or `jbr'. These expand as necessary: a tstl r9 jlssu label will assemble to a `branch if less than (unsigned)' if `label' is within a byte displacement; otherwise it will assemble to a `branch if greater than or equal to unsigned; branch word to label', where the (new, reversed) conditional branch merely branches AROUND the unconditional branch. As a special optimization, the assembler will sometimes% use something called `branch tunneling', in which: tstl r9 jlssu label ... sneaky: jbr label ... label: assembles to the sequence `test, conditional branch to sneaky, ..., unconditional branch to label'. Here `sneaky' is within range of the desired conditional branch but `label' is not. ----- % I have never actually seen any tunneled branches, but there *is* code in the `as' source to do it. ----- Now, all this is rather complicated, and quite difficult to do well. If you are not careful, a branch optimizer of this sort runs in O(n^3) time. Sun Microsystems used to have such a branch optimizer in their assembler. This was fixed after the assembler was observed to take more than 30 hours of CPU time on a Sun-3 when assembling Pascal compiler output. (To wit, TeX. The compiler itself had only run for perhaps 20 minutes.) The BSD branch optimizer runs in O(n log n) time, as I recall. It does not do a perfect job. In particular, it can only expand j<cond> branches to b<not(cond)>+brw, and not to b<not(cond)>+jmp. It cannot handle the special instruction branches at all. The `-J' flag tells the assembler to turn *all* j<cond> branches into b<not(cond)>+jmp sequences, and to turn all jbr branches into jmp branches. This is rarely optimal. The assembler should be improved to handle distant branches, someday; but the need for this is rare, and the cost of doing it always seems unwarranted. (Some people are hoping the VAX architecture will die before they have to fix `as -J' :-) .) -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163) Domain: chris@cs.umd.edu Path: uunet!mimsy!chris [Hmmn. I wrote the original AIX assembler for the RT PC, another machine with span-dependent jumps, by mutating the System III Vax assembler. The code to do span-dependent jumps was pretty straightforward, in pass 1 it assumed they'd all be the long form and made a table of all of them, then between passes 1 and 2 iterated over that table looking for branches that could be converted to the short form. Worked pretty well, better than IBM's assember which only complained if you guessed wrong. There were only two formats rather than 3, but I don't see that 3 should have been that much harder. I didn't try to make the tunnel code work, time was limited and the win fairly low. -John] -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
henry@zoo.toronto.edu (Henry Spencer) (12/30/90)
In article <28765@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes: >a special optimization, the assembler will sometimes% use something called >`branch tunneling'... Properly known as branch chaining; it's been around since at least the Bliss-11 compiler. It wins on code size but typically loses on speed, especially on modern machines where extra pipeline breaks really hurt. Our moderator writes: >[... I wrote the original AIX assembler for the RT PC, another machine with >span-dependent jumps, by mutating the System III Vax assembler. The code to >do span-dependent jumps was pretty straightforward, in pass 1 it assumed >they'd all be the long form and made a table of all of them, then between >passes 1 and 2 iterated over that table looking for branches that could be >converted to the short form... General comment: this is theoretically the wrong approach, since there are sequences where the decisions interact so that branch X can be short only if branch Y is also short, and a one-at-a-time algorithm that starts with all of them long won't shorten them. In practice, almost all branches are short and *any* reasonable algorithm will work well enough. Things like iterative scanning of tables are overkill; a simple FIFO buffer in the output stage, with branches made long if if they are pushed out the end before their target is known, will get almost all of them. -- Henry Spencer at U of Toronto Zoology, henry@zoo.toronto.edu utzoo!henry [It's true, the scheme that I used in the AIX assembler won't produce optimal results, but since the code already was there and worked, I went with the flow. On the other hand, perfect branch optimization is NP complete so barring major positive news in the P=NP department, any perfect method is very slow. A paper by Tom Szymanski in the CACM ten or so years ago tells more than you'd ever want to know about the topic, I can dig up the reference if there's interest. -John] -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
lehotsky@aries.osf.org (Alan Lehotsky) (01/02/91)
Henry Spenser writes... >Our moderator writes: >>[... I wrote the original AIX assembler for the RT PC, another machine with >>span-dependent jumps, by mutating the System III Vax assembler. The code to >>do span-dependent jumps was pretty straightforward, in pass 1 it assumed >>they'd all be the long form and made a table of all of them, then between >>passes 1 and 2 iterated over that table looking for branches that could be >>converted to the short form... > >General comment: this is theoretically the wrong approach, since there are >sequences where the decisions interact so that branch X can be short only >if branch Y is also short, and a one-at-a-time algorithm that starts with >all of them long won't shorten them. Actually, the algorithm that Bliss uses is exactly the opposite. The assumption is that ALL span-dependent instructions [SDI's] will be short. For each SDI, two span values are calculated: - one assuming that all intervening SDI's are resolved to their smallest size. [Call this the MIN] - the other assuming that all SDI's are resolved long. [Call this the MAX] Then you just iterate over the list of SDIs, and consider the three cases: 1. The MIN and MAX are both small enough to reach with the short form instruction. [Resolve this SDI as short and adjust all intervening SDI MAXs down] 2. The MIN and the MAX are both larger than the short form. [ Resolve this SDI as long and adjust all intervening SDI MINs up] 3. The MIN is small enough, but the MAX is larger than the short form would allow. [Wait for more information] If at the end of an iteration, you have not found any candidates for case 1 or 2, then you are in a mutual dependency situation in which if all intervening SDIs were resolved short, then each SDI could be resolved short. So, just resolve the rest of the SDIs as short. In the worst case, this algorithm could converge very slowly. In practice (measured on Bliss-32 and Bliss-16) it terminated after 3 or fewer iterations. (Of course, the Bliss-32 algorithm was actually more complex, as it combined 3 different possible lengths AND the ability to branch chain as well. Back when B32 was written, memory was still more important than speed [remember, the 11/780 was supposed to be able to run with ONLY 256KB!!!!], so you could choose to optimize for space at the expense of the speed of branching to a branch.) There are several papers on SDI and branch-chaining in the first 2 volumes of TOPLAS. [See the next message for a short bibliography. -John] -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
ccplumb@spurge.uwaterloo.ca (Colin Plumb) (01/04/91)
Well, the nastiest case in production is the Inmos Transputer, which has a different instruction length for each 4 bits of offset. 4 bits is 1 byte of instruction, 8 bits is 2 bytes, 12 bits is 3, etc. A worst-case jump is 8 bytes. Since every constant in the program is expressed this way, the problem is put off until the linker. The linker has a nasty habit of being slow. Eventually Cogent Research (where I used to work) rewrote it. Basically, they build a list of all the non-constant offsets and the labels they're relative to. Keep track of the distance between labels, beginning by assuming they're all one byte long. Then run over the list extending things until it stabilizes. I'm sure someone could come up with an example which only lengthens one instruction per scan, and each instruction lengthens a few times, but as long as it's only label1-label2, it's monotonically increasing and you have reasonable worst-case run-time (and a lot faster than the wierd heuristics that were in use before, in practice). If you allow stranger expressions, I suppose it's possible for this to produce non-optimal results, but it won't happen very often in practice. -- -Colin -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.