[net.arch] Delayed Branches

firth@sei.cmu.edu (Robert Firth) (09/03/86)

To respond to some issues that have been raised.

First, I don't believe the Assembler should
reorganise code.  The purpose of assembler code
is to allow the user total control over what
the machine will do, and the assembler has no business
taking away any of that control.

Moreover, why would you WANT the assembler to
reorganise?  Presumably you are writing in
assembler to do things the compiler can't, such
as access interrupts, devices &c, in which case you
probably need tight control.  If you are writing
in assembler because you don't trust the optimising
compiler, well, would you trust an optimising
assembler?

Filling the hole after a delayed branch, and similar
transformations, are done at a late stage in the
compilation process, because it is only at that
late stage that the program representation has
branch instructions in it.  While being compiled,
a program usually goes through several intermediate
transformations, gradually becoming more and more
like machine code.  What starts as a Pascal WHILE
loop becomes a tree node with the key token "while",
then maybe a flowgraph node with two entries and
one exit, then maybe a code node with "branch false"
token.  Only at the last stage is it possible to
do instruction reordering.

Unfortunately, this way of building compilers misses
a lot of optimisations.  To reuse a previous example,
filling the hole by loop rotation may be impossible,
because at the stage where the compiler knows about
branches, it has forgotten about WHILE statements,
and must laboriously reanalyse the flow by scanning the
linear machine code.  The lesson here is that one can
get good code with successive optimisations based on
partial information and heuristics, but to get truly
"optimal" code one must never throw information away.
Hence, the program representation must contain the
fact that there is a branch, and SIMULTANEOUSLY the
fact that this branch is the branch back at the
end of a pure WHILE loop.  

mash@mips.UUCP (John Mashey) (09/04/86)

In article <304@sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes as
shown below, generally saying that a reorganizing assembler is a bad idea.
First, I give some data about use of assembler at MIPS, then repond to
the specific comments:

SOME DATA: the 4.3BSD version of the MIPS UNIX kernel has 10105 lines of
assembler, of which 5944 are IEEE floating-point emulation (for systems
without fltpt hardware), and the remaining 4161 are the traditional
"machine" directory routines like locore.s, etc.  The 4161 equates to the
VAX's 3122 lines; our coding style is less dense and has more comments.
Thus, to first approximation, there is a similar amount of assembler code.
Of the 10105 lines of code, about 800  (i.e., 8%) are done ".noreorder",
i.e., "leave this code alone".  Interestingly, the FP emulation  never
uses .noreorder, so of the code that uses it, 800/4161, or about 20%
is .noreorder.
Outside the kernel, assembler is used for the usual things.  I didn't
gather the stats, but I'd say there's a typical amount of assembler for a
UNIX port.
Diagnostics, especially AVTs (Architectural Verification Tests) often use
assembler, and they use .noreorder quite heavily.

Now, for comments:
>To respond to some issues that have been raised.
>
>First, I don't believe the Assembler should
>reorganise code.  The purpose of assembler code
>is to allow the user total control over what
>the machine will do, and the assembler has no business
>taking away any of that control.
No.  Purposes of assembler also include:
	a) Writing things that cannot be expressed otherwise.
	b) Getting the utmost speed for truly time-critical routines.
The data above speaks for itself: most of our code doesn't need to
turn reordering off. The comment above is like aying that one should
write in machine languauge, and not let an assembler do "nasty" things
like assigning addresses to symbols, or Span-Dependent-Instruction
generation.
>
>Moreover, why would you WANT the assembler to
>reorganise?  Presumably you are writing in
>assembler to do things the compiler can't, such
>as access interrupts, devices &c, in which case you
>probably need tight control.  If you are writing
>in assembler because you don't trust the optimising
>compiler, well, would you trust an optimising
>assembler?
I WANT it to reorganize our code (except when we tell it not to) because:
	a) it runs faster.
	b) It suppresses a level of detail that we don't need to be bothered
	with all of the time.
	c) Sometimes it does optimizations that would be very tricky to code
	and maintain, i.e., when it fills a delay slot by grabbing the
	branch's target instruction into the delay slot and then branching
	1 instruction beyond the original target.
	d) If the assembler doesn't do reorganization, EVERY compiler must
	do it.  If customers want to move other compilers to our system,
	they can either generate U-code (intermediate code) as input to
	the optimizer, or assembler as input to the assembler.  If the
	natural choice is to generate assembler, and if the assembler doesn't
	reorganize, then these optimizations are lost, unless the compiler
	is made to do it, which is usually not desirable.  I.e., people have
	compilers with simple back-end code generation that absolutely
	don't want to know much about specific machien properties.
People write assembler, when necessary, for the reasons above, NOT
because they don't trust optimizing compilers.  In our case, the assembler
does all sorts of things, like generating optimal sequences for multiplies.
(Is this useful? Well, I suppose that if you want multiply by 7, you could
turn that into shift-left-3 & subtract yourself, but that's ugly when you
have:	mul	$1,SIZE	where SIZE is a #define shared with C programs.)
>
>Filling the hole after a delayed branch, and similar
>transformations, are done at a late stage in the
>compilation process, because it is only at that
>late stage that the program representation has
>branch instructions in it.  While being compiled,
>a program usually goes through several intermediate
>transformations, gradually becoming more and more
>like machine code.... Only at the last stage is it possible to
>do instruction reordering.
Yes.  Those who do reordering in the compiler per se do it at the end.
>
>Unfortunately, this way of building compilers misses
>a lot of optimisations.  To reuse a previous example,
>filling the hole by loop rotation may be impossible,
>because at the stage where the compiler knows about
>branches, it has forgotten about WHILE statements,
>and must laboriously reanalyse the flow by scanning the
----------^^^^^^^^^^^  it's not that laborious.
>linear machine code.  The lesson here is that one can
>get good code with successive optimisations based on
>partial information and heuristics, but to get truly
>"optimal" code one must never throw information away.
>Hence, the program representation must contain the
>fact that there is a branch, and SIMULTANEOUSLY the
>fact that this branch is the branch back at the
>end of a pure WHILE loop.  
There is some truth here, in the sense that generating truly optimal code
means that some compiler phase knows "everything", and does everything
at once, sort of.  Unfortunately, such compilers are hard to build and
get right, hence, most people don't.  As it turns out, there are certainly
interactions amongst optimizer/code generator/reorganizer that one must
design carefully, in the same way that people design code generators
knowing that there will (or will not) be a peephole optimizer tweaking the
generated code.  Good optimizing compilers are hard, and as usual, a
reasonable engineering solution is to divide up the complexity into manageable
hunks.  Although I can't offer hard data, I have examined at least 10K lines
of disassembled code (usually of UNIX kernels) generated this way:
	a) There are a few tweaks that can be done to improve the code,
	where interactions of optimizer, code generator, and reorganizer
	could be tuned up.
	b) There isn't much in a), however. There's maybe 2-4% improvement,
	at best (in terms of code space, anyway), that I could identify.
	That doesn't mean there isn't a bunch that I couldn't find, but I
	did look pretty carefully.
	c) I've looked very hard at pices of code that contained higher
	than usual fractions of unfillable delay slots.  In most cases,
	the code was already as good as could reasonably be gotten, i.e.,
	the nops were intrinsic to the structure of the code, and doing
	better would require mind-reading compilers.

This discussion seems to be caught in an infinite loop: a) reorganizing
assemblers are claimed to be unworkable, a Bad Idea, or otherwise awful.
b) Somebody who actually uses one (me!) cites data and experience that
says otherwise.  A while later, a) happens again, and somebody else tells
me that this kind of software doesn't work, which contradicts what I see.
Maybe my comments are getting lost in the net, or maybe I'm just not
communicating them properly, but it's getting depressingly repetitive.

To move this discussion along, it would help to get more data
from people who actually use reorganizers, whether implemented in
compiler or assembler.  For example, I'd love to hear from Spectrum folks,
for example.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

henry@utzoo.UUCP (Henry Spencer) (09/07/86)

> First, I don't believe the Assembler should
> reorganise code.  The purpose of assembler code
> is to allow the user total control over what
> the machine will do...

John Mashey has already rebutted this in some detail.  I'd like to add one
thing:  I think we have an unrecognized cultural difference here.  The Unix
community assumes that compilers generate assembler, while a lot of other
people assume the compiler goes direct to binary.  Reorganization at the
assembler level makes rather more sense when you realize that *all* code
goes through the assembler.  I.e., most of the code going into the assembler
was *not* written directly by humans.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry