[comp.compilers] SPARC code generation references

ressler@cs.cornell.edu (Gene Ressler) (05/25/91)

A couple of months ago I posted asking for SPARC code gen references.  Sorry
for the delay.  Here are the non-redundant replies I received.  In fact our
need went away, so none of these have been thoroughly tested.  We've since
heard that the SUN manual mentioned in the first reply is good albeit scarce.
Someone here also reviewed the Oberon material suggested, finding it to be
useful, but certainly not a basic source.  There have also been posts in
response or otherwise related; they aren't repeated here.

Gene

-----

From mwb2c@euclid.acc.virginia.edu Thu Mar 21 02:09:31 1991

With regard to your post requesting SUN-specific references for optimizing
code for a SPARC, have you seen the SPARC Architecture Manual which SUN
publishes (Version 8 is the most recent)?  It's a good place to start and
includes a great deal of assistance to those writing software (Appendix D,
Software Considerations).  I just finished a code generator for the SUN IV
and found this manual REALLY useful.  It was rather difficult to get hold of,
however, since it was just published in December (and as of mid-January,
no one at SUN had any idea what it was).  If you want more information on who
to call, part #, etc., let me know.

-----

From larry@titan.tsd.arlut.utexas.edu Thu Mar 21 08:45:12 1991

There is a SPARC version of OBERON available via anonymous ftp at
129.132.101.33.  It discusses optimization on the SPARC and the tricks they
used to beat the native C compiler.  You might download their
documentation.

-----

From cargo@cherry.cray.com Thu Mar 21 09:50:24 1991

I note (in the recent BYTE) that there exists SPARC International, 535
Middlefield Road, Menlo Park, CA 94025, (415) 321-8692, Internet:
info@SPARC.COM.

-----

From pardo@cs.washington.edu Thu Mar 21 16:46:02 1991

I have heard that every existing SPARC predicts branches taken.

%T The SPARC Architecture Manual Version 8
%R Sun Microsystems Part Number 800-1399-12

Page 237: fill delay slots, space out fp instructions, make consecutive
instructions independent, avoid consecutive stores.
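
A minimal sketch of the first of those tips, delay-slot filling, in the style
of the code later in this digest (the routine _f here is made up):

	! unscheduled: the delay slot after the call is wasted on a nop
		mov	%i0, %o0	! first argument for _f
		mov	%i1, %o1	! second argument for _f
		call	_f
		 nop			! nothing useful in the delay slot

	! scheduled: setting up the second argument fills the delay slot
		mov	%i0, %o0	! first argument for _f
		call	_f
		 mov	%i1, %o1	! executed in the delay slot, before _f starts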

Pages 287-294: Implementation Characteristics for Fujitsu0, Cypress0, Cypress1,
BIT0, Matsushita0.

Look in back issues of comp.compilers -- sometime in the past 1-2 months
there was a brief analysis of ld vs. ldd performance.  It might have come
from the above manual.

The LSI Logic windowed dataflow SPARC chip will perform immediate branches
(call) much better than indirect (jmpl) branches because it prefetches a long
way.  The point that conditional branches are predicted as taken applies
doubly here -- more dependence on prefetching means higher penalties for
mis-predicted branches.  The LSI probably has a branch cache.

Cache size is an important consideration if you are e.g., unrolling loops.
Consult your manufacturer.
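
For instance, unrolling a word-copy loop by two halves the per-word
bookkeeping but doubles the loop body's footprint in the instruction cache.
A sketch, assuming the count in %o3 is positive and even:

	copy1:				! rolled: four bookkeeping instructions per word
		ld	[%o0], %o2
		st	%o2, [%o1]
		add	%o0, 4, %o0
		subcc	%o3, 1, %o3
		bnz	copy1
		 add	%o1, 4, %o1

	copy2:				! unrolled by two: same bookkeeping covers two words
		ld	[%o0], %o2
		st	%o2, [%o1]
		ld	[%o0 + 4], %o2
		st	%o2, [%o1 + 4]
		add	%o0, 8, %o0
		subcc	%o3, 2, %o3
		bnz	copy2
		 add	%o1, 8, %o1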

Non-leaf procedures can be optimized if non-leaf behavior is uncommon.  That
is,

    int
    foo (int *a, int *n)
    {
      if (*n == 0)
	reload (a, BUFSIZE);
      --*n;
      return (a[*n]);
    }

is compiled by all compilers I know of as:

    save
    test
    branch
    call
  norefill:
    sub
    index
    ret
    restore

or some such.  Faster on average is:

    test
    branch
    save	// moved
    call
    restore	// moved
  norefill:
    sub
    ret
    index

but harder to determine when it is profitable.
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

torek@elf.ee.lbl.gov (Chris Torek) (05/27/91)

[Apologies for all the quoting, but I found it difficult to edit this
down any further.]

In article <91-05-100@comp.compilers> ressler@cs.cornell.edu (Gene Ressler)
quotes pardo@cs.washington.edu (David Keppel):
>Non-leaf procedures can be optimized if non-leaf behavior is uncommon.  That
>is,
>
>    int foo(int *a, int *n) {
>      if (*n == 0) reload(a, BUFSIZE);
>      --*n;
>      return (a[*n]);
>    }
>
>is compiled by all compilers I know of as:
>
>    save
>    test
>    branch
>    call
>  norefill:
>    sub
>    index
>    ret
>    restore
>
>or some such.  Faster on average is:
>
>    test
>    branch
>    save	// moved
>    call
>    restore	// moved
>  norefill:
>    sub
>    ret
>    index
>
>but harder to determine when it is profitable.

Oddly enough, a day or two before this appeared I was talking to John Gilmore
and Sean Fagan about this very issue.  The answer is that this is profitable
more often than might appear at first, and the technique can be (and I claim
should be) applied to every function.  The general idea is this:

	Move save and restore instructions `inward', renumbering
	registers, until they `bump into' call instructions or until
	the registers do not fit in the free %o set.  (Then move the
	restore outward to fit in a delay slot, if necessary.)  If
	there are no subcalls, or a variable is not live across a
	subcall, %g temporaries may be used as well to reduce the
	register pressure.

This technique might be called `leaf crushing' (I wanted to call it `leaf
burning' but that was a bit too much of a stretch :-) ).

The trick is that the input parameters to each function show up in `%o0'
through `%o5', which `change over' to `%i0' through `%i5' when a SAVE
instruction is done.  The change is done inside the CPU by renumbering the
registers (changing the register window pointer) and can be done the same way
in the compiler.
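
A minimal sketch of that change-over (the label _callee is made up, and the
frame size shown is just the conventional minimum):

	! caller:
		mov	7, %o0		! argument goes out in %o0
		call	_callee
		 nop
		! ... and any result comes back in %o0

	! callee:
	_callee:
		save	%sp, -96, %sp	! slide the window; caller's %o0 is now %i0
		add	%i0, 1, %i0	! work on the argument as %i0
		ret
		 restore		! slide back; %i0 reappears as the caller's %o0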

You might think that you would often run into register allocation problems,
but the `not live across subcall' condition helps out quite a bit.

Let us use the original function as an example.  The obvious naive expansion
of

	int foo(int *a, int *n) {
		if (*n == 0)
			reload(a, BUFSIZE);
		--*n;
		return (a[*n]);
	}

is

	_foo:
		save			! get a new window
		ld	[%i1], %o0	! *n
		tst	%o0		! 0?
		bnz	L1		! no, skip call
		 nop			! nothing to do in delay slot
		mov	%i0, %o0	! first parameter for reload()
		call	_reload		! reload(a, BUFSIZE)
		 mov	1024, %o1	! second parameter for reload()
	L1:
		ld	[%i1], %o0	! *n
		dec	%o0		! - 1
		st	%o0, [%i1]	! fix *n
	!	ld	[%i1], %o0	! (only if terribly stupid)
		sll	%o0, 2, %o0	! * 4
		add	%i0, %o0, %o0	! a + *n
		ld	[%o0], %o0	! return value
		ret			! jump back to caller
		 restore		! but first go back to previous window

Some fairly simple alias analysis should reduce this to:

	_foo:
		save
		ld	[%i1], %o0	! *n
		tst	%o0		! 0?
		bnz	L1		! no, skip call
		 nop
		mov	%i0, %o0	! as before
		call	_reload
		 mov	1024, %o1
		ld	[%i1], %o0	! *n may have changed
	L1:
		dec	%o0		! compute *n - 1
		st	%o0, [%i1]	! update *n
		sll	%o0, 2, %o0	! as before
		add	%i0, %o0, %o0
		ld	[%o0], %o0
		ret
		 restore

but this is not all that much better.  (The obvious delay-slot fill, copying
the `dec %o0' up and using an annulling branch, and moving L1 down one line,
can be done later.)

Now let us try `leaf-crushing':

	int foo(int *a, int *n) {
		if (*n == 0)
	---------------------------------------
			reload(a, BUFSIZE);
			[[note: *n is unknown]]
	---------------------------------------
		--*n;
		return (a[*n]);
	}

Here I have marked the last `no subroutine calls' line and the first `end of
all subroutine calls' line.  These mark the barriers for `easy' moving of
save/restore.  I have also noted where the value of *n becomes unknown and
must be reloaded (typically, in C at least, compilers assume that anything
pointed-to is unknown after any subroutine call, so this is not an unusual
case).  If we can estimate register needs, we can see that, before and after
the call, all we need are the two parameters `a' and `n', and a temporary to
hold *n; this easily fits in the leaf set.  So now we generate this:

	_foo:
		ld	[%o1], %o2	! *n
		tst	%o2		! 0?
		bnz	L1		! no, skip call
		 nop
		save			! not a leaf
--->		mov	%i0, %o0	! first parameter for reload()
		call	_reload		! etc.
		 mov	1024, %o1
--->		ld	[%i1], %i2	! *n may have changed
		restore
	L1:
		dec	%o2		! compute *n - 1
		st	%o2, [%o1]	! update *n
		sll	%o2, 2, %o2	! as before
		add	%o0, %o2, %o2
		ld	[%o2], %o0
		retl			! leaf return
		 nop

The lines marked with arrows ---> show the register renumbering effect.
Within the `non-leaf' part of the function, everything that was called `%o'
is called `%i' instead.  Other temporaries that do not die across the `save'
or `restore' boundaries may be kept in %g registers as well; these are not
renumbered, but do die across the calls themselves (unlike the %o/%i
temporaries---this makes %o/%i temporaries `more valuable').

If we do the usual delay-slot filling on the last example, we get just what
is desired:

	_foo:
		ld	[%o1], %o2
		tst	%o2		! if *n != 0 ...
		bnz,a	L1		! compute *n - 1 and skip call
		 dec	%o2
		save
		mov	%i0, %o0
		call	_reload
		 mov	1024, %o1
		ld	[%i1], %i2	! update register holding *n
		restore
		dec	%o2		! compute *n - 1
	L1:
		st	%o2, [%o1]	! update *n
		sll	%o2, 2, %o2
		add	%o0, %o2, %o2	! compute &a[n]
		retl			! leaf return
		 ld	[%o2], %o0	! value = a[n]

As long as (6 - n_parameters [%o registers] + 7 [%g registers]) is enough
registers, `save' and `restore' instructions can be moved inward at no cost
whatsoever---it may not help, but it will not hurt.  David Keppel is,
however, correct in pointing out that once these registers run out, or if the
`save' and `restore' instructions are to be duplicated (see below), the
profitability of moving the save and restore inward is murkier.

The typical case that does not benefit here, but might from a more
sophisticated technique, is code of the form:

	hard(parms) {
		if (something)
			a();
		compute;
		if (somethingelse)
			b();
	}

where the `something' and `somethingelse' are rare.  It would be possible for
a programmer with a leaf-crushing compiler to rewrite this as:

	uglier(parms) {
		if (!something && !somethingelse) {
			compute;
			return;
		}
		if (something)
			a();
		compute;
		if (somethingelse)
			b();
	}

and in fact this sort of duplication is useful if the `something's do
not change across an inner loop.  Common-code-path-analyzing compilers
(such as MIPS', using pixie feedback) might do this automatically
anyway; if this is done before leaf-crushing, functions like `hard'
above will be properly optimized.
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

compilers-sender@iecc (05/28/91)

There's a more general technique that, at the expense of code space, is
faster most of the time.  You compile one version of the code that assumes
it's wrapped in a save/restore.  Then you start compiling, outside-in, a
version that does not use save-restore.  Every time you find you need to use
save-restore, you insert it and branch to the corresponding location in the
wrapped version.

This copes with such annoying cases as

void bufferchars(const char *s)
{
	register char c;
	while (c = *s++) {
		*buf++ = c;
		if (buf >= buflim) {
			flushbuf(buf, buflim-bufbase);
			buf = bufbase;
		}
	}
}

If you have a string that fits in the buffer, no save/restore is executed;
but if the string is many times larger than the buffer, you still only get
one save/restore.

I have glossed over many issues, mostly in that word "corresponding", as if
the optimal code didn't depend on how many registers you have free.  But the
idea is there - postpone the save as long as possible, then do it and use an
alternate code sequence.
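
In the same schematic style as the earlier postings, the code for
bufferchars might be shaped roughly like this (labels invented; only the
placement of the save and the branch into the wrapped copy matter):

    bufferchars:		! leaf copy: no window, returns with retl
    loop:
	load *s, test, retl if zero
	store into buffer, advance buf
	compare buf, buflim
	branch back to loop while there is room
	save			! buffer full for the first time: take a window
	branch to flush		! ... and continue in the wrapped copy

    bufferchars_wrapped:	! same code compiled as if save were done on entry
    loop2:
	load *s, test, branch to done if zero
	store into buffer, advance buf
	compare buf, buflim
	branch back to loop2 while there is room
    flush:
	call flushbuf
	reset buf to bufbase
	branch back to loop2
    done:
	ret
	restore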

(Of course, do dead-code elimination, and move code around to turn jumps into
fallthroughs.)
-- 
	-Colin

-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

preston@ariel.rice.edu (Preston Briggs) (05/28/91)

>Oddly enough, a day or two before this appeared I was talking to John Gilmore
>and Sean Fagan about this very issue.
...
>	Move save and restore instructions `inward', renumbering
>	registers, until they `bump into' call instructions or until
>	the registers do not fit in the free %o set.  (Then move the
...
>This technique might be called `leaf crushing' (I wanted to call it `leaf
>burning' but that was a bit too much of a stretch :-) ).

Nevertheless, it's already been done as part of the MIPS compiler.
It's called "shrink wrapping" and is described in Chow's 1988 paper

	Minimizing Register Usage at Procedure Calls
	by Fred Chow
	in Proceedings of the SIGPLAN '88 Conference on Programming
	Language Design and Implementation
	pages 85-94
	published as SIGPLAN Notices, Volume 23, Number 7
	July 1988

See section 5, "Shrink-wrapping Callee-saved Registers".

Preston Briggs
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

pardo@june.cs.washington.edu (David Keppel) (05/29/91)

torek@elf.ee.lbl.gov (Chris Torek) writes:
>[You can `leaf crush' on the SPARC as long as there are enough
> registers: non-parameter (sp/ret-pc) `o' registers and 7 `g'
> registers.]

Be careful with this.  According to the SPARC ABI (Applications Binary
Interface) standard, an ABI conforming program does something like:

	%g0 is always zero
	%g1 is a caller-save `very temp'
	%g2..%g4 are reserved for application code and may not
		be modified by libraries
	%g5..%g7 may be read but not written by an application
		(i.e., the analog of %g2..%g4 for libraries)

Eek!

I don't sit on the committee.  I think some of this is useful, but they
have reserved too much, since it potentially leaves (so to speak :-) only
one `very temp'.  As an aside, you can't manually spill/restore reserved
`g' registers because they might be needed by an asynchronous signal.  Ah,
omniscience by committee is just as hard as ever :-)

		;-D on  ( An opinion a day... )  Pardo
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.