[comp.compilers] SPARC references

ressler@cs.cornell.edu (Gene Ressler) (03/21/91)

We are developing code generators for the Sun SPARCstations and need to
compile a bibliography of applicable (i.e.  SPARC and SPARC
optimization-specific) references.  If you'll e-mail your suggestions,
I'll post a summary.  I've seen the discussion referring to chip
manufacturers' literature.

Is there anything better?  Anything Sun-specific?  _Any_ suggestion would
be helpful, as even our assembler references are mighty thin.

Gene Ressler
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

salomon@ccu.umanitoba.ca (Dan Salomon) (03/25/91)

Check out the book "Microprocessors: A Programmer's View"
by Dewar & Smosna.  They compare the SPARC and the MIPS and give
clues to optimizations that didn't occur to me when reading the
SPARC architecture manual.  The book is very informative, even though
the writing is flat-footed.
-- 

Dan Salomon -- salomon@ccu.UManitoba.CA
               Dept. of Computer Science / University of Manitoba
	       Winnipeg, Manitoba, Canada  R3T 2N2 / (204) 275-6682

jpff@maths.bath.ac.uk (John ffitch) (03/28/91)

There was reference to a book by Dewar and Smosna which compared SPARC and
MIPS. Does anyone know the publisher?  I have been unable to trace it in
our library.  So far my experience of SPARC has been less than good, as
the explanation in the Architecture manual leaves a lot to the
imagination.  Something that really explained what happens to the stack
or window roll, how va_args is supposed to work, and so on would be really
helpful.  In my code someone keeps trampling on the stack, and honest guv'
it's not me!

==John
 

salomon@ccu.umanitoba.ca (Dan Salomon) (03/30/91)

In article <1991Mar28.115715.3545@maths.bath.ac.uk> John ffitch <jpff@maths.bath.ac.uk> writes:
>There was reference to a book by Dewar and Smosna which compared SPARC and
>MIPS. Does anyone know the publisher?  I have been unable to trace it in
>our library.

I was the original poster of the information.  Unfortunately my copy was
lent out at the time, so all I could provide were the authors' last names
and the title.  Here is the full reference:

    Microprocessors: A Programmer's View
    Robert B.K. Dewar & Matthew Smosna
    McGraw-Hill (c) 1990
    ISBN 0-07-016639-0


>So far my experience of SPARC has been less than good, as
>the explanation in the Architecture manual leaves a lot to the
>imagination.  Something that really explained what happens to the stack
>or window roll, how va_args is supposed to work, and so on would be really
>helpful.

You won't find that kind of detail in Dewar and Smosna.  Check out the
manual "Porting Software to SPARC Systems" that comes bound with the
"Assembler Language Reference Manual".  It has a little information on
this topic in the section "Porting C Programs" and the subsection on
varargs().

>In my code someone keeps trampling on the stack, and honest guv'
>it's not me!

This is indeed a "gotcha" of programming on the SPARC.  If you read the
specifications of the calling conventions carefully (in the appendix
called "Software Considerations" of the architecture manual) you will find
that the top of the stack must at all times contain 23 words that can be
clobbered by called procedures.  So any space that you use on the stack
must have been allocated below the stack top by the SAVE instruction.

The difficulty that I have had is in finding specific details on the
calling conventions that the C compiler uses.  I believe, for instance,
that it will say that the caller, not the callee, is responsible for
saving the floating-point registers.
-- 

Dan Salomon -- salomon@ccu.UManitoba.CA
               Dept. of Computer Science / University of Manitoba
	       Winnipeg, Manitoba, Canada  R3T 2N2 / (204) 275-6682

chased@Eng.Sun.COM (David Chase) (04/02/91)

> The difficulty that I have had is in finding specific details on the
> calling conventions that the C compiler uses.  I believe, for instance,
> that it will say that the caller, not the callee, is responsible for
> saving the floating-point registers.

Get your hands on a copy of the Application Binary Interface.  There
are two volumes pertinent to your problems: the Generic ABI, and the
SPARC Processor Supplement.  These are published by Prentice-Hall, and
describe the "AT&T System V Application Binary Interface".

And yes, the caller is responsible for saving the floating-point registers.

David Chase
Sun

pardo@cs.washington.edu (David Keppel) (04/02/91)

>John ffitch <jpff@maths.bath.ac.uk> writes:
>>[Somebody's trampling the stack, and honest, it isn't me!]

Back when I was trying to port a threads package to the SPARC I had an
opportunity to learn all about the stack layout conventions.  I wrote up
my experience, which is mostly concerned with the interaction between
register windows and stack layout.  Nonetheless, it might prove
instructive for anybody who's interested in the stack layout.

For the next two weeks or so, you can get a copy of my ``what I learned''
writeup via anonymous ftp from `cs.washington.edu' (128.95.1.4) in
`pub/pardo/README.SPARC'.  

	;-D on  ( SPARC dereferences, too! )  Pardo

torek@elf.ee.lbl.gov (Chris Torek) (04/12/91)

(I will try to stick to `language' issues here.)

>In article <1991Mar28.115715.3545@maths.bath.ac.uk> John ffitch
><jpff@maths.bath.ac.uk> writes:
[about SPARC machines]
>>Something that really explained what happens to the stack or window
>>roll, how va_args is supposed to work, and so on would be really helpful.

In article <1991Mar29.214751.3045@ccu.umanitoba.ca> salomon@ccu.umanitoba.ca
(Dan Salomon) writes:
>... the top of the stack must at all times contain 23 words that can be
>clobbered by called procedures.  So any space that you use on the stack
>must have been allocated below the stack top by the SAVE instruction.

No and yes: you must typically reserve at least 96 bytes, but the reason
given above is incomplete.

The SPARCstation has `register windows': there are some number of
registers arranged in a circular fashion, with overlap.  A five-bit
field in the Processor Status Register (PSR), called the Current
Window Pointer (CWP), tells which window is `current'.  References
to Input, Local, and Output registers are really references to
registers in the current window.

The five-bit field guarantees that no more than 32 windows will ever
exist.  Actual SPARC implementations have fewer windows (e.g., this
SparcStation-1 has 7).  Call the actual number `nwindows'.  Two special
unprivileged instructions allow you to alter the CWP field:

 -- SAVE: this decrements CWP.  If the result is 31, it is changed to
    nwindows - 1.  In other words, this computes:
	psr<4:0> <- (psr<4:0> - 1) mod nwindows;

 -- RESTORE: this increments CWP.  If the result is nwindows, it is
    changed to 0.  In other words, this computes:
	psr<4:0> <- (psr<4:0> + 1) mod nwindows;
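The wrap-around above amounts to arithmetic modulo nwindows.  A minimal
C sketch, assuming the hypothetical helper names save_cwp and
restore_cwp and ignoring the WIM check described next:

```c
#include <assert.h>

/* Hypothetical model of the CWP update done by SAVE and RESTORE.
 * nwindows is the implementation's window count (7 on the machine in
 * the text); cwp must already be in the range 0..nwindows-1. */

/* SAVE: CWP <- (CWP - 1) mod nwindows.  Adding nwindows before
 * subtracting 1 keeps the unsigned arithmetic from wrapping past 0,
 * so 0 becomes nwindows - 1, as the hardware does. */
unsigned save_cwp(unsigned cwp, unsigned nwindows)
{
    assert(cwp < nwindows);
    return (cwp + nwindows - 1) % nwindows;
}

/* RESTORE: CWP <- (CWP + 1) mod nwindows, so nwindows - 1 becomes 0. */
unsigned restore_cwp(unsigned cwp, unsigned nwindows)
{
    assert(cwp < nwindows);
    return (cwp + 1) % nwindows;
}
```

With nwindows = 7, save_cwp(0, 7) gives 6 and restore_cwp(6, 7) gives 0,
matching the two wrap-around rules.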

Another privileged register, the Window Invalid Mask (WIM), holds a
bitmask of `invalid' windows.  This is used, e.g., to keep subroutines
from `stepping on' the contents of some other subroutine's window.
SAVE and RESTORE trap to the operating system, rather than doing their
usual job, if the bit corresponding to the new CWP field is set in the
WIM.  That is, SAVE and RESTORE really do this:

	new_cwp <- (psr<4:0> OP 1) mod nwindows; // OP => + or -
	if ((1 << new_cwp) & WIM) then trap; else psr<4:0> <- new_cwp;
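The same check can be sketched in C.  This is a model only, assuming
that a false return stands in for the trap; window_move is a made-up
name, and is_save selects SAVE (decrement) or RESTORE (increment):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of SAVE or RESTORE with the Window Invalid Mask
 * check.  Bit n of wim set means window n is invalid.  If the new
 * window is valid, *cwp is updated and true is returned; otherwise
 * *cwp is left alone and false is returned, standing in for the window
 * overflow (SAVE) or underflow (RESTORE) trap. */
bool window_move(unsigned *cwp, uint32_t wim, unsigned nwindows, bool is_save)
{
    assert(nwindows <= 32 && *cwp < nwindows);

    unsigned new_cwp = is_save ? (*cwp + nwindows - 1) % nwindows
                               : (*cwp + 1) % nwindows;

    if ((UINT32_C(1) << new_cwp) & wim)
        return false;           /* the real hardware traps here */
    *cwp = new_cwp;
    return true;
}
```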

Every trap begins by doing an implicit SAVE (even if the result makes
CWP indicate an invalid window) and writing some trap recovery
information into the Local registers in the new window, thus the
operating system must always maintain at least one invalid window (for
traps).  (Trap handlers must either run entirely within their special
window, or else go through some fairly major gyrations, which makes
writing the trap code very interesting, but this is mainly an
architectural issue....)

For simplicity, let us assume that the machine has 7 windows, leaving
at most 6 to user programs.  Suppose that a user program is started
with CWP=6 and window 0 marked invalid.  This means the user program
can use windows 6, 5, 4, 3, 2, and 1 without causing a trap.  Let us
also assume that nothing else uses any windows (e.g., all interrupts
are disabled), and that each subroutine uses one new window.  Then
we might have a situation like this:

	_startup	window = 6
	main()		window = 5
	init()		window = 4
	initobj()	window = 3
	initobjtab()	window = 2
	emalloc()	window = 1
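The table can be replayed with a small simulation.  As a sketch
(saves_before_trap is a hypothetical helper, and each call is assumed
to do exactly one SAVE, as above), it counts how many SAVEs succeed
from window 6 when only window 0 is invalid:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical simulation of nested calls: starting in window
 * start_cwp, count how many SAVEs (one per call) succeed before a SAVE
 * would enter a window marked invalid in wim and trap.  At least one
 * of windows 0..nwindows-1 must be marked invalid, or the loop would
 * never end. */
unsigned saves_before_trap(unsigned start_cwp, uint32_t wim, unsigned nwindows)
{
    unsigned cwp = start_cwp, n = 0;

    assert(nwindows <= 32 && start_cwp < nwindows);
    assert((wim & ((UINT64_C(1) << nwindows) - 1)) != 0);
    for (;;) {
        unsigned next = (cwp + nwindows - 1) % nwindows;  /* SAVE */
        if ((UINT32_C(1) << next) & wim)
            return n;                                     /* this SAVE traps */
        cwp = next;
        n++;
    }
}
```

saves_before_trap(6, 1, 7) returns 5: the SAVEs for main() through
emalloc() succeed, and the next one traps.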

Now emalloc() calls malloc(), which attempts a SAVE instruction.
Window 0 is invalid and this therefore traps.  What must the trap
handler do?

Somehow, the trap handler must make window 0 `available'.  For window 0
to be available, window 6 must not be in use (it must be available for
traps to scribble into)---but window 6 contains values that the C
library startup code may need.  These must be saved somewhere.

SunOS, Sprite, and 4BSD all use the same technique: they write the
contents of window 6 into the place to which window 6's stack pointer
points.  Clearly window 6's registers must be saved into some location
unique to this invocation of _startup.  One technique, used (I believe)
on the Pyramid, is to have a separate `control stack'.  The advantage
here is that if the control stack pointer is not user-modifiable, the
O/S can be sure that it points to a valid place.  Existing SPARC window
save code goes through the above-noted gyrations in order to verify the
user stack pointer.  Things are particularly exciting when the user
stack happens to have been paged out.  With a control stack, the O/S
can guarantee a minimal in-core region whenever the user process is
runnable.  Depending on time pressures and compatibility bugaboos, we
may investigate using a control stack instead of, or in addition to,
the user's stack, in 4BSD, assuming I ever get 4BSD going (maybe if I
stopped working on this news article... :-) ).  Control stacks have the
disadvantage that one must partition the virtual space in advance.  If
the partitioning is a mismatch for the process, this may put an
artificially low limit on the number of stack frames.  One can
move the control stack in virtual space, but this quickly becomes
complex.

In any case, current systems want to do the following within the trap
handler:

	(change to window 6)
	std	%l0, [%sp + (0*8)]	! store Local registers into stack
	std	%l2, [%sp + (1*8)]
	std	%l4, [%sp + (2*8)]
	std	%l6, [%sp + (3*8)]
	std	%i0, [%sp + (4*8)]	! store Input registers into stack
	std	%i2, [%sp + (5*8)]
	std	%i4, [%sp + (6*8)]
	std	%i6, [%sp + (7*8)]
	(change back to window 0, set WIM to 1 << 6, return from trap)

This whole sequence imposes one constraint, and the `std' instructions
impose another:

 A. There must be at least 64 bytes at each window's %sp that are
    otherwise unused.
 B. Each window's %sp must be doubleword (8 byte) aligned.
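The two constraints can be seen in a small model of the overflow
spill.  This is a sketch under stated assumptions: a byte array stands
in for user memory, sp is an offset into it, spill_window is a
hypothetical name, and memcpy uses host byte order rather than SPARC's
big-endian order.  The "otherwise unused" part of constraint A cannot
be checked this way; the model only checks that the 64 bytes exist.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the spill above: the 8 Local and 8 Input
 * registers of the window being freed are written into the 64 bytes
 * starting at that window's %sp.  Returns false in the cases where
 * SunOS or Sprite would kill the process. */
bool spill_window(uint8_t *memory, size_t memsize, uint32_t sp,
                  const uint32_t locals[8], const uint32_t ins[8])
{
    assert(memory != NULL);
    if (sp % 8 != 0)                        /* B: std needs 8-byte alignment */
        return false;
    if (sp > memsize || memsize - sp < 64)  /* A: 64 bytes must exist at %sp */
        return false;
    memcpy(memory + sp, locals, 32);        /* %l0..%l7 -> [%sp + 0 .. 31]  */
    memcpy(memory + sp + 32, ins, 32);      /* %i0..%i7 -> [%sp + 32 .. 63] */
    return true;
}
```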

If these conditions are not met, SunOS and Sprite kill the process.
(My kernel uses a special per-process save area to hold the values
until C code can store them into the user stack, and I do not bother to
check for 8-byte alignment in this code, so in theory your program will
continue to run, albeit slowly, if you goof up the alignment.)  The
obvious inverse sequence occurs when RESTORE instructions trap (the
trap handler is somewhat peculiar since the implicit SAVE on each trap
moves the CWP in the wrong direction).

This takes care of the first 64 bytes, or 16 words, that Dan Salomon
mentioned.  What about the other 7 words?

Sun defined their stack frame format to include another 8 words on
every stack frame, partitioned as follows (see <machine/frame.h>):

	1*4 bytes:	fr_stret, `struct return addr'
	6*4 bytes:	fr_argd[6], `arg dump area'
	1*4 bytes:	fr_argx[1], `array of args past the sixth'
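Added to the 16 words of window save area, these 8 words give the
96-byte minimum frame.  The sketch below writes that layout as a C
struct so the offsets can be checked.  It is modeled on
<machine/frame.h>, but the names of the first four fields (fr_local,
fr_arg, fr_savfp, fr_savpc) are assumptions, and every field is
declared as a 32-bit integer so the offsets match the 32-bit SPARC
even when compiled on a 64-bit host:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C model of the minimum SPARC stack frame. */
struct sparc_frame {
    uint32_t fr_local[8];  /* %l0-%l7, written on window overflow     */
    uint32_t fr_arg[6];    /* %i0-%i5, written on window overflow     */
    uint32_t fr_savfp;     /* %i6, this window's frame pointer        */
    uint32_t fr_savpc;     /* %i7, address of the caller's call insn  */
    uint32_t fr_stret;     /* `struct return addr'                    */
    uint32_t fr_argd[6];   /* `arg dump area'                         */
    uint32_t fr_argx[1];   /* args past the sixth (usually padding)   */
};

_Static_assert(offsetof(struct sparc_frame, fr_stret) == 64, "window save area is 64 bytes");
_Static_assert(offsetof(struct sparc_frame, fr_argd) == 68, "arg dump area follows fr_stret");
_Static_assert(offsetof(struct sparc_frame, fr_argx) == 92, "23 words precede fr_argx");
_Static_assert(sizeof(struct sparc_frame) == 96, "24 words in all");
```

The 23 words counted earlier in the thread end just before fr_argx;
the one-word fr_argx brings the total to 24 words, keeping the frame
doubleword aligned.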

The `struct return addr' field is normally unused.  For C functions
that return a structure object, however, Sun's compiler does the
following.  Suppose function f() returns a structure.  Then:

 1. Routines that call f() set their own fr_stret frame element to
    point to the place in which f() should store its return value.

 2. Routines that call f() do so with the sequence:

	call	_f
	nop		! or pass an argument
	unimp	SIZE

    Here SIZE is the number of bytes the caller expects f() to store
    through fr_stret; this is stored in an otherwise unused bit field
    within the UNIMP instruction.

 3. f() returns not with the usual `ret; restore' sequence, but rather
    through a jump to .stret1, .stret2, .stret4, or .stret8.  These
    library routines are given the address of f()'s return value (which
    f() has built somewhere in its own stack frame) and the size of
    this value.  They then:

     A. Check the instruction at the return location.  If it is not an
	UNIMP, they just return (thus discarding f()'s return value).
	Otherwise:

     B. Read the `size' field out of the UNIMP instruction.  If this
	matches the number of bytes f() wants to return, they copy that
	many bytes from f()'s return structure to wherever fr_stret
	points, and then advance the return address over the UNIMP
	instruction.  If the sizes do not match, they leave the return
	address alone.  They then return from f(); if the sizes did
	not match, this causes a runtime error (a core dump) because
	of the UNIMP instruction.

    The only difference between the four `.stret' routines is the loop
    used to copy the return value: .stret1 copies bytes, .stret2 copies
    halfwords, .stret4 and .stret8 copy words (.stret8 could copy
    doublewords but on SparcStation-1s this is no faster).  (I have to
    wonder why Sun do not simply have f() do the work inline and call
    bcopy() or memmove().)
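The check in steps A and B can be modeled in C.  This is a sketch,
not Sun's library code: the instruction stream is an array of words,
memcpy stands in for the byte, halfword, or word copy loops, and
is_unimp and stret_return are hypothetical names.  The UNIMP test
relies on the SPARC format-2 encoding, in which UNIMP has op = 0 and
op2 = 0 and the low 22 bits are ignored by the hardware:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define UNIMP_MASK  UINT32_C(0xC1C00000)   /* op (bits 31:30), op2 (bits 24:22) */
#define CONST22(i)  ((i) & UINT32_C(0x003FFFFF))

int is_unimp(uint32_t insn) { return (insn & UNIMP_MASK) == 0; }

/* Hypothetical model of a .stretN routine.  code[call_idx] is the
 * caller's `call _f' (the word %i7 points at), so code[call_idx + 2]
 * is the word after the delay slot.  Returns the index at which the
 * caller resumes execution. */
size_t stret_return(const uint32_t *code, size_t call_idx,
                    void *fr_stret, const void *retval, uint32_t size)
{
    size_t next = call_idx + 2;           /* normal return point: %i7 + 8 */
    uint32_t insn = code[next];

    if (!is_unimp(insn))
        return next;                      /* A: caller discards the value */
    if (CONST22(insn) == size) {
        memcpy(fr_stret, retval, size);   /* B: copy the structure ...     */
        return next + 1;                  /* ... and skip UNIMP: %i7 + 12  */
    }
    return next;                          /* size mismatch: UNIMP will trap */
}
```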

There are a number of reasons why this is the wrong approach, but I
will not go into them here.  GCC uses the correct technique: callers of
f() pass an extra `hidden' argument which points to the place f()
should write its return value, and small structures are returned
entirely within registers, without any copying.  This does not catch
runtime errors but is considerably more efficient (oops, I said I was
not going to go into this :-) ).
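At the C source level, the hidden-argument convention amounts to
treating a struct-returning function as one that takes a pointer to
the caller's result.  The sketch below only illustrates that idea,
with made-up names; it is not GCC's code generator and does not show
the register return of small structures:

```c
#include <assert.h>

/* Hypothetical example type and functions. */
struct point { int x, y; };

/* What the programmer writes. */
struct point make_point(int x, int y)
{
    struct point p = { x, y };
    return p;
}

/* Roughly what the hidden-argument scheme passes and does: the caller
 * supplies `ret', and the callee writes its result straight into it,
 * with no UNIMP size word and no separate copy routine. */
void make_point_lowered(struct point *ret, int x, int y)
{
    ret->x = x;
    ret->y = y;
}
```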

This leaves the arg dump area and arg extension space.  fr_argx is
simply an array of `arguments that did not fit into the 6 Input
registers'; its size actually varies, and is usually 0 (which is then
rounded up to 1 because of the stack alignment constraints).  The arg
dump area has two essentially separate uses:

 A. Functions that run out of registers may spill some of them into
    the arg dump area.  Sun's compiler only ever spills input registers
    0 through 5 into the corresponding space (probably because of the
    next use).  Functions whose arguments' addresses are taken can
    likewise use the arg dump area to store these.

 B. Functions that take variable arguments can write their input
    registers here.  Since the arg dump area immediately precedes the
    extension space, this puts all parameters into a single contiguous
    region on the stack.  This means that old (broken) code that
    assumes contiguous addressable arguments continues to work.
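Use B can be modeled by viewing the caller's frame as an array of
32-bit words.  In this sketch the word indices are derived from the
96-byte layout (fr_argd at byte offset 68, fr_argx at 92), the frame
array is assumed to be what the varargs callee's %fp points at, and
dump_register_args and nth_arg are hypothetical names:

```c
#include <assert.h>
#include <stdint.h>

/* Word 17 = byte offset 68 (fr_argd[0]); word 23 = offset 92 (fr_argx). */
enum { ARGD_WORD = 17, ARGX_WORD = 23 };
_Static_assert(ARGX_WORD == ARGD_WORD + 6, "fr_argx immediately follows fr_argd");

/* What the varargs callee does on entry: store %i0..%i5 into fr_argd. */
void dump_register_args(uint32_t *frame, const uint32_t in_regs[6])
{
    for (int k = 0; k < 6; k++)
        frame[ARGD_WORD + k] = in_regs[k];
}

/* Argument n (0-based) is now just ap[n]; n >= 6 runs on into the
 * fr_argx words, which the caller filled before the call. */
uint32_t nth_arg(const uint32_t *frame, int n)
{
    const uint32_t *ap = &frame[ARGD_WORD];
    return ap[n];
}
```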

Both of these can be (and, I claim, should be) done differently.
Functions that must spill registers, or take addresses of parameters,
can allocate their own stack space.  Functions with variable argument
lists can write the register arguments into local stack space and
retrieve arguments using something like:

	next_arg = --regarg >= 0 ? *regblock++ : *argx++;

(note that large objects such as arguments of type `double' must be
handled differently).  This slows down varargs functions, but they are
rare.
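A slightly fuller C version of that expression, as a sketch under
stated assumptions: arg_cursor is a hypothetical structure holding the
three variables from the expression, regarg starts at the number of
register arguments not yet consumed, and only word-sized arguments
are handled:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical cursor: the register arguments were copied into a block
 * in the callee's own frame, the rest are read from fr_argx in the
 * caller's frame, so the two regions need not be adjacent. */
struct arg_cursor {
    int regarg;                 /* register arguments not yet consumed */
    const uint32_t *regblock;   /* next saved register argument        */
    const uint32_t *argx;       /* next stack argument past the sixth  */
};

uint32_t next_arg(struct arg_cursor *c)
{
    /* The expression from the text, applied to the cursor's fields. */
    return --c->regarg >= 0 ? *c->regblock++ : *c->argx++;
}
```

A function with two named arguments would start with regarg = 4 and
regblock pointing at the third saved register.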

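At the least, the `fr_stret' field should not exist.  It puts the arg
dump area at offset 68, which is not doubleword aligned; without it,
input registers could be stored with `std' instructions, which *are*
faster on some implementations.  Dropping it would also allow the
argument extension area to occupy no space in the usual case, since
the remaining 22 words (88 bytes) already meet the 8-byte alignment.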
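Both claims can be checked with a hypothetical variant of the frame
layout in which fr_stret is simply deleted.  As a sketch, this writes
every field as a 32-bit integer to match the 32-bit SPARC on any host,
uses field names in the style of <machine/frame.h> (fr_local, fr_arg,
fr_savfp and fr_savpc are assumed names), and declares fr_argx as a
C99 flexible array member so it takes no space unless used:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical frame layout without fr_stret. */
struct frame_no_stret {
    uint32_t fr_local[8];  /* %l0-%l7                                */
    uint32_t fr_arg[6];    /* %i0-%i5                                */
    uint32_t fr_savfp;     /* %i6                                    */
    uint32_t fr_savpc;     /* %i7                                    */
    uint32_t fr_argd[6];   /* now at offset 64: std-able pairs       */
    uint32_t fr_argx[];    /* args past the sixth; zero size if none */
};

_Static_assert(offsetof(struct frame_no_stret, fr_argd) % 8 == 0, "arg dump area doubleword aligned");
_Static_assert(sizeof(struct frame_no_stret) % 8 == 0, "no padding word needed");
```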

Anyway, this is where the 96 bytes per stack frame go.
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab CSE/EE (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov