ressler@cs.cornell.edu (Gene Ressler) (05/25/91)
A couple of months ago I posted asking for SPARC code gen references. Sorry
for the delay. Here are the non-redundant replies I received. In fact our
need went away, so none of these have been thoroughly tested. We've since
heard that the SUN manual mentioned in the first reply is good albeit scarce.
Someone here also reviewed the Oberon material suggested, finding it to be
useful, but certainly not a basic source. There have also been posts in
response or otherwise related; they aren't repeated here.
Gene
-----
From mwb2c@euclid.acc.virginia.edu Thu Mar 21 02:09:31 1991
With regard to your post requesting SUN specific references for optimizing
code for a SPARC, have you seen the SPARC Architecture Manual which SUN
publishes (Version 8 is the most recent). It's a good place to start and
includes a great deal of assistance to those writing software (Appendix D,
Software Considerations). I just finished a code generator for the SUN IV
and found this manual REALLY useful. It was rather difficult to get a hold
of however, since it was just published in December (and as of mid-January,
no one at SUN had any idea what it was). If you want more information on who
to call, part #, etc., let me know.
-----
From larry@titan.tsd.arlut.utexas.edu Thu Mar 21 08:45:12 1991
Theere is a SPARC version of OBERON available via anonymous ftp at
129.132.101.33. It discusses optomization on the SPARC and tricks they used
to beat the the native C compiler on the SPARC. You might download their
documentation.
-----
From cargo@cherry.cray.com Thu Mar 21 09:50:24 1991
I note (in the recent BYTE) that there exists SPARC International, 535
Middlefield Road, Menlo Park, CA 94025, (415) 321-8692, Internet:
info@SPARC.COM.
-----
From pardo@cs.washington.edu Thu Mar 21 16:46:02 1991
I have heard that every existing SPARC predicts branches taken.
%T The SPARC Architecture Manual Version 8
%R Sun Microsystems Part Number 800-1399-12
Page 237: Fill delay slots, Space out fp instructions, make consecutive
instructions independent, avoid consecutive stores.
Page 287-294: Implemetation Characteristics for Fujitsu0, Cypress0, Cypress1,
BIT0, Matsushita0.
Look in back issues of comp.compilers -- sometime in the past 1-2 months
there was a brief analysis of ld vs. ldd performance. It might have come
from the above manual.
The LSI logic windowed dataflow SPARC chip will perform immediate branches
(call) much better than indirect (jmpl) branches because it prefetches a long
way. Conditional branches predicted as taken applies doubly here -- more
(dependence on) prefetching means higher penalties for mis-predicted
branches. The LSI probably has a branch cache.
Cache size is an important consideration if you are e.g., unrolling loops.
Consult your manufacturer.
Non-leaf procedures can be optimized if non-leaf behavior is uncommon. That
is,
int
foo (int *a, int *n)
{
if (*n == 0)
reload (a, BUFSIZE);
--*n;
return (a[*n]);
}
is compiled by all compilers I know of as:
save
test
branch
call
norefill:
sub
index
ret
restore
or some such. Faster on average is:
test
branch
save // moved
call
restore // moved
norefill:
sub
ret
index
but harder to determine when it is profitable.
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
torek@elf.ee.lbl.gov (Chris Torek) (05/27/91)
[Apologies for all the quoting, but I found it difficult to edit this down any further.] In article <91-05-100@comp.compilers> ressler@cs.cornell.edu (Gene Ressler) quotes pardo@cs.washington.edu (David Keppel): >Non-leaf procedures can be optimized if non-leaf behavior is uncommon. That >is, > > int foo(int *a, int *n) { > if (*n == 0) reload(a, BUFSIZE); > --*n; > return (a[*n]); > } > >is compiled by all compilers I know of as: > > save > test > branch > call > norefill: > sub > index > ret > restore > >or some such. Faster on average is: > > test > branch > save // moved > call > restore // moved > norefill: > sub > ret > index > >but harder to determine when it is profitable. Oddly enough, a day or two before this appeared I was talking to John Gilmore and Sean Fagan about this very issue. The answer is that this is profitable more often than might appear at first, and the technique can be (and I claim should be) applied to every function. The general idea is this: Move save and restore instructions `inward', renumbering registers, until they `bump into' call instructions or until the registers do not fit in the free %o set. (Then move the restore outward to fit in a delay slot, if necessary.) If there are no subcalls, or a variable is not live across a subcall, %g temporaries may be used as well to reduce the register pressure. This technique might be called `leaf crushing' (I wanted to call it `leaf burning' but that was a bit too much of a stretch :-) ). The trick is that the input parameters to each function show up in `%o0' through `%o5', which `change over' to `%i0' through `%i5' when a SAVE instruction is done. The change is done inside the CPU by renumbering the registers (changing the register window pointer) and can be done the same way in the compiler. You might think that you would often run into register allocation problems, but the `not live across subcall' condition helps out quite a bit. Let us use the original function as an example. The obvious naive expansion of int foo(int *a, int *n) { if (*n == 0) reload(a, BUFSIZE); --*n; return (a[*n]); } is _foo: save ! get a new window ld [%i1], %o0 ! *n tst %o0 ! 0? bnz L1 ! no, skip call nop ! nothing to do in delay slot mov %i0, %o0 ! first parameter for reload() call _reload ! reload(a, BUFSIZ) mov 1024, %o1 ! second parameter for reload() L1: ld [%i1], %o0 ! *n dec %o0 ! - 1 st %o0, [%i1] ! fix *n ! ld [%i1], %o0 ! (only if terribly stupid) sll %o0, 2, %o0 ! * 4 add %i0, %o0, %o0 ! a + *n ld [%o0], %o0 ! return value ret ! jump back to caller restore ! but first go back to previous window Some fairly simple alias analysis should reduce this to: _foo: save ld [%i1], %o0 ! *n tst %o0 ! 0? bnz L1 ! no, skip call nop mov %i0, %o0 ! as before call _reload mov 1024, %o1 ld [%i1], %o0 ! *n may have changed L1: dec %o0 ! compute *n - 1 st %o0, [%i1] ! update *n sll %o0, 2, %o0 ! as before add %i0, %o0, %o0 ld [%o0], ret restore but this is not all that much better. (The obvious delay-slot fill, copying the `dec %o0' up and using an annulling branch, and moving L1 down one line, can be done later.) Now let us try `leaf-crushing': int foo(int *a, int *n) { if (*n == 0) --------------------------------------- reload(a, BUFSIZE); [[note: *n is unknown]] --------------------------------------- --*n; return (a[*n]); } Here I have marked the last `no subroutine calls' line and the first `end of all subroutine calls' lines. These mark the barriers for `easy' moving of save/restore. I have also noted where the value of *n becomes unknown and must be reloaded (typically, in C at least, compilers assume that anything pointed-to is unknown after any subroutine call, so this is not an unusual case). If we can estimate register needs, we can see that, before and after the call, all we need are the two parameters `a' and `n', and a temporary to hold *n; this easily fits in the leaf set. So now we generate this: _foo: ld [%o1], %o2 ! *n tst %o2 ! 0? bnz L1 ! no, skip call nop save ! not a leaf ---> mov %i0, %o0 ! first parameter for reload() call _reload ! etc. mov 1024, %o1 ---> ld [%i1], %i2 ! *n may have changed restore L1: dec %o2 ! compute *n - 1 st %o2, [%o1] ! update *n sll %o2, 2, %o2 ! as before add %o0, %o2, %o2 ld [%o2], %o0 retl ! leaf return nop The lines marked with arrows ---> show the register renumbering effect. Within the `non-leaf' part of the function, everything that was called `%o' is called `%i' instead. Other temporaries that do not die across the `save' or `restore' boundaries may be kept in %g registers as well; these are not renumbered, but do die across the calls themselves (unlike the %o/%i temporaries---this makes %o/%i temporaries `more valuable'). If we do the usual delay-slot filling on the last example, we get just what is desired: _foo: ld [%o1], %o2 tst %o2 ! if *n != 0 ... bnz,a L1 ! compute *n - 1 and skip call dec %o2 save mov %i0, %o0 call _reload mov 1024, %o1 ld [%i1], %i2 ! update register holding *n restore dec %o2 ! compute *n - 1 L1: st %o2, [%o1] ! update *n sll %o2, 2, %o2 add %o0, %o2, %o2 ! compute &a[n] retl ! leaf return ld [%o2], %o0 ! value = a[n] As long as (6 - n_parameters [%o registers] + 7 [%g registers]) is enough registers, `save' and `restore' instructions can be moved inward at no cost whatsoever---it may not help, but it will not hurt. David Keppel is, however, correct in pointing out that once these registers run out, or if the `save' and `restore' instructions are to be duplicated (see below), the profitability of moving the save and restore inward is murkier. The typical case that does not benefit here, but might from a more sophisticated technique, is code of the form: hard(parms) { if (something) a(); compute; if (somethingelse) b(); } where the `something' and `somethingelse' are rare. It would be possible for a programmer with a leaf-crushing compiler to rewrite this as: uglier(parms) { if (!something && !somethingelse) { compute; return; } if (something) a(); compute; if (somethingelse) b(); } and in fact this sort of duplication is useful if the `something's do not change across an inner loop. Common-code-path-analyzing compilers (such as MIPS', using pixie feedback) might do this automatically anyway; if this is done before leaf-crushing, functions like `hard' above will be properly optimized. -- In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427) Berkeley, CA Domain: torek@ee.lbl.gov -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
compilers-sender@iecc (05/28/91)
There's a more general technique that, at the expense of code space, is faster most of the time. You compile one version of the code that assumes it's wrapped in a save/restore. Then you start compiling, outside-in, a version that does not use save-restore. Every time you find you need to use save-restore, you insert it and branch to the corresponding location in the wrapped version. This copes with such annoying cases as void bufferchars(const char *s) { register char c; while (c = *s++) { *buf++ = c; if (buf >= buflim) { flushbuf(buf, buflim-bufbase); buf = bufbase; } } } If you have a string that fits in the buffer, no save/restores are generated; but if the string is many times larger than the buffer, you still only get one save/restore. I have glossed over many issues, mostly in that word "corresponding", as if the optimal code didn't depend on how many registers you have free. But the idea is there - postpone the save as long as possible, then do it and use an alternate code sequence. (Of course, do dead-code elimination, and move code around to turn jumps into fallthroughs.) -- -Colin -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
preston@ariel.rice.edu (Preston Briggs) (05/28/91)
>Oddly enough, a day or two before this appeared I was talking to John Gilmore >and Sean Fagan about this very issue. ... > Move save and restore instructions `inward', renumbering > registers, until they `bump into' call instructions or until > the registers do not fit in the free %o set. (Then move the ... >This technique might be called `leaf crushing' (I wanted to call it `leaf >burning' but that was a bit too much of a stretch :-) ). Nevertheless, it's already been done as part of the MIPS compiler. It's called "shrink wrapping" and is described in Chow's 1988 paper Minimizing Register Usage at Procedure Calls by Fred Chow in Proceedings of the Sigplan 88 Conference on Programming Language Design and Implementation pages 85-94 published as Siplan Notices, Volume 23, Number 7 July 1988 See section 5, "Shrink-wrapping Callee-saved Registers". Preston Briggs -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
pardo@june.cs.washington.edu (David Keppel) (05/29/91)
torek@elf.ee.lbl.gov (Chris Torek) writes: >[You can `leaf crush' on the SPARC as long as there are enough > registers: non-parameter (sp/ret-pc) `o' registers and 7 `g' > registers.] Be careful with this. According to the SPARC ABI (Applications Binary Interface) standard, an ABI conforming program does something like: %g0 is always zero %g1 is a caller-save `very temp' %g2..%g4 are reserved for application code and may not be modified by libraries %g5..%g7 may be read but not written by an application (e.g., the analog of %g2..%g4 for libraries) Eek! I don't sit on the committee. I think some of this is useful but as much as they've done is too much, since it potentially leaves (so to speak :-) only one `very temp'. As an aside, you can't manually spill/restore reserved `g' registers because they might be needed by an asynchronous signal. Ah, omniscience by committee is just as hard as any other time :-) ;-D on ( An opinion a day... ) Pardo -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.