[comp.arch] Number of registers, MIPS, GNU cc

pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/05/91)

New Year, New Life. Here is, incredibly enough, some (little) real data.

I have had some time over the past few days to make GNU CC 1.37.93
generate code under different assumptions on a MIPS machine (DECsystem
5830).

GNU CC's register allocation is driven by a couple of #define's in its
machine dependent header (in this case tm-mips.h). The first tells the
allocator which registers it can use as scratch registers, and the
second tells it which are not available across function calls.

Normally a MIPS machine has 32 registers (I am mainly interested in the
non floating point ones), of which 25 can be used as spare, and of these
only 9 are preserved across function calls. I provided alternative
definitions as follows (see source snippet at the end):

	name	total	spare	preserved

	U12	12*	5	5
	U16	16*	9	9
	U24	24*	17	9
	U32	32	25	9

The numbers with an asterisk don't mean that I have actually lowered the
number of registers visisble to GNU cc to that level -- for convenience
I have just marked the balance to 32 "unavailable".  It may also be
noted that the percentage of spare registers preserved by the callee
rises steadly; I might have kept the proportion to about 1/3, but this
was not deemed terribly important (I did check and it does not make a
lot of difference).

So I first compiled GNU CC (this results in three executables, gcc, cpp,
and cc1) with the MIPS 'cc -O2', and called the resulting compiler
version UMIPS. I then recompiled (various times, actually) the compiler
itself with GNU 'cc -O2', and called the resulting version U32G32 (the
executable uses 32 registers and generates code for 32 registers).

I then compiled regclass.c another three times, for the U12, U16, and
U24 options, and obtained another three compiler versions, U32G12,
U32G16, and U32G24; all such versions are identical except for the two
tables of registers.

At this point I could take some benchmark and compile it with U32G12..32
and compare the results. Just to make things interesting I decided to
use GNU CC itself as benchmark -- on a whim. So, I recompiled GNU CC
another three times, with U32G12, U32G16 and U32G24, and I obtained
three new compiler versions called U12G32, U16G32, U24G32. Note that
these three compiler versions all generate code for 32 registers to make
their work equivalent, but each version uses only 12, 16, 24 registers
to do the compilation.

Having got all these version, I wanted to look at their text sizes with:

	size U12G32/gcc U12G32/cpp U12G32/cc1

and here are the results:

			gcc		cpp		cc1

	U12G32		37,864		90,112		733,184
	U16G32		32,768		77,824		655,360
	U24G32		32,768		73,728		614,400
	U32G32		32,768		73,728		602,112
	UMIPS		32,768		77,824		634,880

I then took a few largish source modules of GNU CC itself, and compiled
them with the various compilers, with a command line like

	time ./gcc -BU12G32/ -S -O2 -I. -Iconfig loop.c

Note that I said -O2, because all compiles so far had been done with
this level of optimization (which is _broadly_ equivalent between GNU cc
and MIPS cc), and -S, because I wanted to measure only the speed of gcc,
cpp and cc1, not that of the MIPS as. I also measured separately the
speed of cpp, for some reason. Here is a table of the user times for
compiling loop.c, in seconds (with a variability of say +/- 0.5
seconds), as reported by time(1):

			cc1	spare	normalized

	U12G32		39.4	5	153%
	U16G32		30.9	9	120%
	U24G32		26.4	17	103%
	U32G32		26.7	25	104%
	UMIPS		25.6	(?25)	100%

Note that I have added the number of spare registers avaiable on the
smae line.  Compilation times for other sources, or for gcc and cpp, are
more or less in the same pattern.


There are a number of observations to make; some are:

	1)  The GNU cc is an exceptionally large 'nonfloat' benchmark;
	    it is not the only possible one, and it is not 'typical'.

	2)  The GNU cc source is exceptionally intensive in fairly
	    complicated structure manipulations, and has normally quite
	    large function bodies.

	3)  Even if it has 'register' declarations here and there,
	    the GNU cc source is written in a somewhat haphazard way
	    and with the idea that an optimizer will sort things out.

	4)  The GNU cc register allocation algorithm makes no attempt
	    to take into account the dynamic frequency of use, even if
	    it does pay some regard to loops. It disregards by default,
	    when optimization is enabled, all 'register' indications.

	5)  Statistics exist to show that almost never in any stretch of
	    (nonfloat) code there are more than a couple dozen live
	    variables -- the frequently used ones are usually much less
	    than that.

	6)  The GNU cc is not especially tuned to the MIPS architecture;
	    the MIPS cc is, after all, even if its roots are in some
	    architecture dependent technology.

It would be nice if anybody could run a similar test on a 68030 or 68040, or
on a SPARC. In any case here is the modified bit of "tm-mips.h":

/* 1 for registers that have pervasive standard uses
   and are not available for the register allocator.

   On the MIPS, see conventions, page D-2

   I have chosen not to  take Multiply/Divide HI,LO or PC into
   account.  */

#ifdef REGS12
#define FIXED_REGISTERS {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\
		         1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1,\
			 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\
		         1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0	\
}
#define CALL_USED_REGISTERS {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\
		             1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1,\
		             1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\
		             1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0\
}
#endif

#ifdef REGS16
#define FIXED_REGISTERS {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\
		         0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1,\
			 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\
		         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0	\
}
#endif

#ifdef REGS24
#define FIXED_REGISTERS {1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,\
		         0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1,\
			 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0,\
		         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0	\
}
#endif

#ifdef REGS32
#define FIXED_REGISTERS {1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\
		         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1,\
		         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,\
		         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0	\
}
#endif

/* 1 for registers not available across function calls.
   These must include the FIXED_REGISTERS and also any
   registers that can be used without being saved.
   The latter must include the registers where values are returned
   and the register where structure-value addresses are passed.
   Aside from that, you can include as many other registers as you like.  */

#ifndef CALL_USED_REGISTERS
#define CALL_USED_REGISTERS {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\
		             0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1,\
		             1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,\
		             1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0\
}
#endif
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

ts@cup.portal.com (Tim W Smith) (01/05/91)

<	name	total	spare	preserved
<
<	U12	12*	5	5
<	U16	16*	9	9
<	U24	24*	17	9
<	U32	32	25	9

How come spare + preserved - total is +-2 all the time rather
than 0?  What are the other -+2 registers?

					Tim Smith

pcg@aber-cs.UUCP (Piercarlo Grandi) (01/10/91)

    I (pcg) had written:

  	name	total	spare	preserved
  
  	U12	12*	5	5
  	U16	16*	9	9
  	U24	24*	17	9
  	U32	32	25	9

In article <37605@cup.portal.com> ts@cup.portal.com (Tim W Smith) writes:
  
  How come spare + preserved - total is +-2 all the time rather
  than 0?  What are the other -+2 registers?

Sorry for not being clear here: 'preserved' and 'spare' overlap. For example
U24 has 32 real registers, 8 are marked unavailable, 7 are marked fixed use
(really the same as the previous 8), 17 are marked available, and 9 of these
17 are preserved across function calls (callee saves). Have a look at the
tail of the original message, where the definitions of the relevant arrays
are, and everything will be clear.

I have done some further experiments that show that the number of registers
saved by the callee or by the caller does not matter, at least in this case.

Incidentally: I have measured, for fun, also the time taken by the GNU cpp
on the same source. There number of registers seems to make a far more
pronounced difference than for GNU cc1. This is highly perplexing, so I am
looking into it -- will report, eventually.

Finally: if comebody has some substantial floating point program(s), it
would be interesting to see hwo number of registers affect them.
-- 
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk