[comp.compilers] Vax span-dependent jumps

chris@mimsy.umd.edu (Chris Torek) (12/23/90)

In article <14692@goofy.megatest.UUCP> megatest!djones@decwrl.dec.com
(Dave Jones) writes:
>Okay, yet one more thing: If you have a lot of keywords, and you compile this
>on an old, brain-damaged BSD compiler, remember to use the -J flag, which
>means, "Do not generate pseudo-random jumps." :-(

Minor correction: this is not what -J means.

The VAX architecture is particularly maddening when it comes to branches.
It has four different kinds of branches:

	conditional byte branches: b<cond> <relative displacement>
		(the condition can be `unconditional')
	unconditional word branches: brw <relative displacement>
	unconditional longword branches (`jumps'): jmp <address>

and `special' branches (built into instructions like acbl and sobgeq)
which are sometimes byte displacements and sometimes word displacements.
The BSD VAX assembler offers `pseudo branch' instructions for b<cond>
and for unconditional byte-or-word branches: j<cond> or `jbr'.  These
expand as necessary: a

		tstl	r9
		jlssu	label

will assemble to a `branch if less than (unsigned)' if `label' is within
a byte displacement; otherwise it will assemble to a `branch if greater
than or equal to unsigned; branch word to label', where the (new, reversed)
conditional branch merely branches AROUND the unconditional branch.  As
a special optimization, the assembler will sometimes% use something called
`branch tunneling', in which:

		tstl	r9
		jlssu	label
		...
	sneaky:	jbr	label
		...
	label:

assembles to the sequence `test, conditional branch to sneaky, ...,
unconditional branch to label'.  Here `sneaky' is within range of the
desired conditional branch but `label' is not.
-----
% I have never actually seen any tunneled branches, but there *is* code
  in the `as' source to do it.
-----

Now, all this is rather complicated, and quite difficult to do well.  If
you are not careful, a branch optimizer of this sort runs in O(n^3) time.
Sun Microsystems used to have such a branch optimizer in their assembler.
This was fixed after the assembler was observed to take more than 30 hours
of CPU time on a Sun-3 when assembling Pascal compiler output.  (To wit,
TeX.  The compiler itself had only run for perhaps 20 minutes.)

The BSD branch optimizer runs in O(n log n) time, as I recall.  It does
not do a perfect job.  In particular, it can only expand j<cond> branches
to b<not(cond)>+brw, and not to b<not(cond)>+jmp.  It cannot handle the
special instruction branches at all.  The `-J' flag tells the assembler
to turn *all* j<cond> branches into b<not(cond)>+jmp sequences, and to
turn all jbr branches into jmp branches.  This is rarely optimal.

The assembler should be improved to handle distant branches, someday; but
the need for this is rare, and the cost of doing it always seems unwarranted.
(Some people are hoping the VAX architecture will die before they have to
fix `as -J' :-) .)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris
[Hmmn.  I wrote the original AIX assembler for the RT PC, another machine with
span-dependent jumps, by mutating the System III Vax assembler.  The code to
do span-dependent jumps was pretty straightforward, in pass 1 it assumed
they'd all be the long form and made a table of all of them, then between
passes 1 and 2 iterated over that table looking for branches that could be
converted to the short form.  Worked pretty well, better than IBM's assember
which only complained if you guessed wrong.  There were only two formats
rather than 3, but I don't see that 3 should have been that much harder.  I
didn't try to make the tunnel code work, time was limited and the win fairly
low.  -John]
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

henry@zoo.toronto.edu (Henry Spencer) (12/30/90)

In article <28765@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
>a special optimization, the assembler will sometimes% use something called
>`branch tunneling'...

Properly known as branch chaining; it's been around since at least the
Bliss-11 compiler.  It wins on code size but typically loses on speed,
especially on modern machines where extra pipeline breaks really hurt.

Our moderator writes:
>[... I wrote the original AIX assembler for the RT PC, another machine with
>span-dependent jumps, by mutating the System III Vax assembler.  The code to
>do span-dependent jumps was pretty straightforward, in pass 1 it assumed
>they'd all be the long form and made a table of all of them, then between
>passes 1 and 2 iterated over that table looking for branches that could be
>converted to the short form...

General comment:  this is theoretically the wrong approach, since there are
sequences where the decisions interact so that branch X can be short only
if branch Y is also short, and a one-at-a-time algorithm that starts with
all of them long won't shorten them.

In practice, almost all branches are short and *any* reasonable algorithm
will work well enough.  Things like iterative scanning of tables are
overkill; a simple FIFO buffer in the output stage, with branches made
long if if they are pushed out the end before their target is known,
will get almost all of them.
-- 
Henry Spencer at U of Toronto Zoology, henry@zoo.toronto.edu   utzoo!henry
[It's true, the scheme that I used in the AIX assembler won't produce optimal
results, but since the code already was there and worked, I went with the
flow.  On the other hand, perfect branch optimization is NP complete so
barring major positive news in the P=NP department, any perfect method is
very slow.  A paper by Tom Szymanski in the CACM ten or so years ago tells
more than you'd ever want to know about the topic, I can dig up the reference
if there's interest. -John]
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

lehotsky@aries.osf.org (Alan Lehotsky) (01/02/91)

Henry Spenser writes...

>Our moderator writes:
>>[... I wrote the original AIX assembler for the RT PC, another machine with
>>span-dependent jumps, by mutating the System III Vax assembler.  The code to
>>do span-dependent jumps was pretty straightforward, in pass 1 it assumed
>>they'd all be the long form and made a table of all of them, then between
>>passes 1 and 2 iterated over that table looking for branches that could be
>>converted to the short form...
>
>General comment:  this is theoretically the wrong approach, since there are
>sequences where the decisions interact so that branch X can be short only
>if branch Y is also short, and a one-at-a-time algorithm that starts with
>all of them long won't shorten them.

	Actually, the algorithm that Bliss uses is exactly the
	opposite.  The assumption is that ALL span-dependent
	instructions [SDI's] will be short.  For each SDI, two span
	values are calculated:

		- one assuming that all intervening SDI's are resolved
		  to their smallest size.  [Call this the MIN]

		- the other assuming that all SDI's are resolved long.
		  [Call this the MAX]

	Then you just iterate over the list of SDIs, and consider the
	three cases:

		1. The MIN and MAX are both small enough to reach with
		   the short form instruction.  [Resolve this SDI as
		   short and adjust all intervening SDI MAXs down]

		2. The MIN and the MAX are both larger than the short
		   form. [ Resolve this SDI as long and adjust all
		   intervening SDI MINs up]

		3. The MIN is small enough, but the MAX is larger than
		   the short form would allow.  [Wait for more
		   information]

	If at the end of an iteration, you have not found any
	candidates for case 1 or 2, then you are in a mutual
	dependency situation in which if all intervening SDIs were
	resolved short, then each SDI could be resolved short.  So,
	just resolve the rest of the SDIs as short.

	In the worst case, this algorithm could converge very slowly.
	In practice (measured on Bliss-32 and Bliss-16) it terminated
	after 3 or fewer iterations.

	(Of course, the Bliss-32 algorithm was actually more complex,
	as it combined 3 different possible lengths AND the ability to
	branch chain as well.  Back when B32 was written, memory was
	still more important than speed [remember, the 11/780 was
	supposed to be able to run with ONLY 256KB!!!!], so you could
	choose to optimize for space at the expense of the speed of
	branching to a branch.)

There are several papers on SDI and branch-chaining in the first 2
volumes of TOPLAS.
[See the next message for a short bibliography. -John]
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

ccplumb@spurge.uwaterloo.ca (Colin Plumb) (01/04/91)

Well, the nastiest case in production is the Inmos Transputer, which has a
different instruction length for each 4 bits of offset.  4 bits is 1 byte
of instruction, 8 bits is 2 bytes, 12 bits is 3, etc.  A worst-case jump is
8 bytes.

Since every constant in the program is expressed this way, the problem is
put off until the linker.  The linker has a nasty habit of being slow.
Eventually Cogent Research (where I used to work) rewrote it.  Basically,
they build a list of all the non-constant offsets and the labels they're
relative to.  Keep track of the distance between labels, beginning by assuming
they're all one byte long.  Then run over the list extending things until
it stabilizes.  I'm sure someone could come up with an example which only
lengthens one instruction per scan, and each instruction lengthens a few times,
but as long as it's only label1-label2, it's monotonically increasing
and you have reasonable worst-case run-time (and a lot faster than the wierd
heuristics that were in use before, in practice).

If you allow stranger expressions, I suppose it's possible for this to
produce non-optimal results, but it won't happen very often in practice.
-- 
	-Colin
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.