[comp.arch] 80486 vs. 68040 code size

davidsen@steinmetz.ge.com (Wm. E. Davidsen Jr) (04/29/89)

  Someone posted an opinion that images on the 68040 will be smaller
than the 80386. I didn't think this sounded right, so I made a few
measurements. Note that this is not presented as a complete study, but
as a starting point which reflects my experience with 286/386 and
680[023]0.

  I checked the size of image files compiled from the same source on
Xenix/386 v2.3.1 and SunOS 3.4 (on a 3/280 if anyone cares). I noted
the size of the actual (stripped) file, and the values reported by
"size." I used available source to insure that differences in the
functions of SysV and BSD programs in bin would not skew the results.
All compiles with -O.

source		CPU	 file	text	data	  bss
zoo 2.01	386	 55368	43560	10312	 20904
		68k	 73728	57344	16384	 24136
compress 4.0	386	 20300	13236	 5964	442596
		68k	 32768	24576	 8192	419456
memacs 3.9n	386	 93080	75768	16280	 19592
		68k	114688	90112	24576	 18524

2nd order effects:

  Some people may claim that the Microsoft compiler is better than the
Sun compiler. The granularity of the library may be different. I am
assuming that the code for the 486/68040 will be the same size as the
386/68020 code measured here (I don't believe there are any new
instructions in the 486, and I don't remember any in the '40 -- certainly
none which might drop the code size by more than a percent or so).

What I think is happening:

  Although the 68k family has more registers than the 386, which should
decrease the number of load/store cycles, I believe that this is
outweighed by the many single-byte opcodes of the 386. While the 286 was
limited in its number of registers, the 386 is far better than the 286
because it has more registers, and 32-bit registers. This prevents wasting
two registers on one 32-bit value.

  When the 386 runs out of registers, many instructions can be done
directly to memory, so the time goes up but not the number of instructions.
Also, the 386 uses short displacements (1-2 bytes) which are sign-extended
at run time. This also helps cut the size of the code.
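
  As a rough, hand-worked illustration of those two effects (the byte
counts are typical encodings, not actual compiler output, and a real
compiler may of course do something different):

    /* bump() is a made-up example, not taken from either measured program */
    void bump(int *p)
    {
        ++*p;
        /* With p already in a register:
         *   386:    inc dword ptr [eax]     2 bytes
         *   68020:  addq.l #1,(a0)          2 bytes
         * For a local variable in the stack frame instead:
         *   386:    inc dword ptr [ebp-4]   3 bytes (opcode, ModRM, and an
         *                                   8-bit displacement sign-extended
         *                                   at run time)
         *   68020:  addq.l #1,-4(a6)        4 bytes (opword plus a 16-bit
         *                                   displacement extension word)
         * A byte here and there is where a few percent of text size
         * comes from. */
    }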

  I would welcome some data resulting from compilation using the gcc
compiler for both systems, assuming that someone has a working copy for
Xenix/386 (I'd like a copy, too). I hope to find time to measure the
bytes/instruction of the code, since I assume this is where the 386
wins.
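
  A back-of-the-envelope way to get that number (the instruction count
below is a made-up placeholder, not a measurement; a real count would come
from a disassembly listing):

    #include <stdio.h>

    int main(void)
    {
        long text_bytes = 43560;  /* zoo 2.01 text size on the 386 (table above) */
        long insn_count = 14000;  /* hypothetical count from a disassembly */

        printf("bytes/instruction: %.2f\n", (double)text_bytes / insn_count);
        return 0;
    }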

  Since I am not making any claims of "better", let's not bring them
into the discussion, if any. Obviously I use machines of both types, so I
don't think I had any particular bias about code size.

-- 
	bill davidsen		(wedu@crd.GE.COM)
  {uunet | philabs}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

matloff@tinman.Berkeley.EDU (Norman Matloff) (04/29/89)

In article <13699@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:

>  Someone posted an opinion that images on the 68040 will be smaller
>than the 80386. I didn't think this sounded right, so I made a few
>measurements. 

Everything else being equal, a machine with a more orthogonal
structure -- most opcodes can be used with most addressing modes
and many registers -- should produce larger programs.  Thus the
68000-family machines, being more orthogonal than those in the
8086 family, should produce larger code, as shown by your data.

   Norm

elg@killer.Dallas.TX.US (Eric Green) (04/29/89)

in article <13699@steinmetz.ge.com>, davidsen@steinmetz.ge.com (Wm. E. Davidsen Jr) says:
>   Someone posted an opinion that images on the 68040 will be smaller
> than the 80386. I didn't think this sounded right, so I made a few

For programs with less than 64K of data and 64K of text, the
generated 8086 code is more compact than 68000 code. This is because
a) it's using 16-bit pointers, and b) all those one-byte opcodes.

On the other hand, for programs with more than 64K of data (and, to
some extent, programs with more than 64K of text), the 68000 is 
more efficient, because it doesn't have to be loading those damned
segment registers all the time.

It's probably a tossup as to whether the 80386 vs. 68040 comparison
will continue that trend. I suspect that for small codes, the 80386
will continue having more compact object code. But once the need for
32-bit segments occurs, the 80386 and 680xx are likely to be about the
same as far as code density goes. 

>   I checked the size of image files compiled from the same source on
> Xenix/386 v2.3.1 and SunOS 3.4 (on a 3/280 if anyone cares). I noted

A possible flaw: I seem to recall that SunOS's old compilers, at
least, were based on the PCC. The PCC isn't exactly the world's most
sophisticated compiler. The code it produces is, to put it charitably,
merely adequate. On the other hand, I suspect that the Xenix compiler
has been much tweaked by Microsoft's compiler experts....

--
|    // Eric Lee Green              P.O. Box 92191, Lafayette, LA 70509     |
|   //  ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg     (318)989-9849     |
| \X/           Newsflash: DP director fired for buying IBM!                |

khb@fatcity.Sun.COM (Keith Bierman Sun Tactical Engineering) (05/01/89)

In article <7952@killer.Dallas.TX.US> elg@killer.Dallas.TX.US (Eric Green) writes:
>
>A possible flaw: I seem to recall that SunOS's old compilers, at
>least, were based on the PCC. The PCC isn't exactly the world's most
>sophisticated compiler. The code it produces is, to put it charitably,
>merely adequate. On the other hand, I suspect that the Xenix compiler
>has been much tweaked by Microsoft's compiler experts....

Well, Sun's compilers haven't exactly stood still. OS 3.4 is quite
old, and the compilers are nowhere close to our current ones (in
terms of robustness, code quality, etc.)




Keith H. Bierman      |*My thoughts are my own. Only my work belongs to Sun*
It's Not My Fault     |	Marketing Technical Specialist 
I Voted for Bill &    |   Languages and Performance Tools. 
Opus            (* strange as it may seem, I do more engineering now     *)

pb@idca.tds.PHILIPS.nl (Peter Brouwer) (05/01/89)

In article <13699@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
>
>  I checked the size of image files compiled from the same source on
>Xenix/386 v2.3.1 and SunOS 3.4 (on a 3/280 if anyone cares).
....... lots deleted
>  Some people may claim that the Microsoft compiler is better than the
>Sun compiler. The granularity of the library may be different.

Is Microsoft delivering a C compiler for Unix/Xenix?



-- 
#  Peter Brouwer,                     ##
#  Philips TDS, Dept SSP-V2           ## voice +31 55 432523
#  P.O. Box 245                       ## UUCP address ..!mcvax!philapd!pb
#  7300 AE Apeldoorn, The Netherlands ## Internet pb@idca.tds.philips.nl

pb@idca.tds.PHILIPS.nl (Peter Brouwer) (05/01/89)

In article <7952@killer.Dallas.TX.US> elg@killer.Dallas.TX.US (Eric Green) writes:
>in article <13699@steinmetz.ge.com>, davidsen@steinmetz.ge.com (Wm. E. Davidsen Jr) says:
>>   Someone posted an opinion that images on the 68040 will be smaller
>> than the 80386. I didn't think this sounded right, so I made a few
>
>For programs with less than 64K of data and 64K of text, the
>generated 8086 code is more compact than 68000 code. 
>
>On the other hand, for programs with more than 64K of data (and, to
>some extent, programs with more than 64K of text), the 68000 is 
>more efficient, because it doesn't have to be loading those damned
>segment registers all the time.
>
>>   I checked the size of image files compiled from the same source on
>> Xenix/386 v2.3.1 and SunOS 3.4 (on a 3/280 if anyone cares). I noted
>
>A possible flaw: I seem to recall that SunOS's old compilers, at
From this and previous postings I get the impression that one is comparing
apples and oranges.
I think that comparing code size should be done on code that is generated
with the same operating system, probably unix.
If that is the case, then the 386 compiler does not use the segment registers
in the user code; all memory references will be 32-bit references.
The compiler we have for the 386 on Unix is from AT&T UNIX System V Release 3.1.
It does not use the segment registers, which leaves 8 registers. Of
these registers, 3 are used for the C register declarations (4 in 3.2).
For the Motorola, the compiler can use ten registers for register declarations.
This makes a big difference in code size and execution speed.
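
As a sketch of what I mean (which variables actually end up in registers
is of course compiler dependent; the function is made up):

    int copy_sum(int *dst, int *src, int n)
    {
        register int *d = dst;      /* the first few "register" variables   */
        register int *s = src;      /* get registers on both compilers      */
        register int i;
        register int sum = 0;       /* with only 3 (or 4) honoured on the   */
        register int t;             /* 386, later ones fall back to the     */
                                    /* stack frame                          */
        for (i = 0; i < n; i++) {
            t = s[i];
            d[i] = t;
            sum += t;
        }
        return sum;
    }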

So before more comparisons are made, let's do it in the same operating
environment and not compare code generated under MS-DOS with code generated
under UNIX.

-- 
#  Peter Brouwer,                     ##
#  Philips TDS, Dept SSP-V2           ## voice +31 55 432523
#  P.O. Box 245                       ## UUCP address ..!mcvax!philapd!pb
#  7300 AE Apeldoorn, The Netherlands ## Internet pb@idca.tds.philips.nl

davidsen@steinmetz.ge.com (Wm. E. Davidsen Jr) (05/02/89)

In article <405@ssp2.idca.tds.philips.nl> pb@idca.tds.PHILIPS.nl (Peter Brouwer) writes:
| elg@killer.Dallas.TX.US (Eric Green) writes:
| >davidsen@steinmetz.ge.com (Wm. E. Davidsen Jr) says:

	[ ... my introductory remarks stripped ... ]

| >>   I checked the size of image files compiled from the same source on
| >> Xenix/386 v2.3.1 and SunOS 3.4 (on a 3/280 if anyone cares). I noted
| >
| >A possible flaw: I seem to recall that SunOS's old compilers, at
| From this and previous postings I get the impression that one is comparing
| apples and oranges.

| I think that comparing code size should be done on code that is generated
| with the same operating system, probably unix.

  I realize that BSD bigots hate to recognize any other flavor of unix,
but for most of us SysV, Xenix, Ultrix, SunOS and BSD are all called
UNIX.

| If that is the case, then the 386 compiler does not use the segment registers
| in the user code; all memory references will be 32-bit references.
| The compiler we have for the 386 on Unix is from AT&T UNIX System V Release 3.1.
| It does not use the segment registers, which leaves 8 registers. Of
| these registers, 3 are used for the C register declarations (4 in 3.2).
| For the Motorola, the compiler can use ten registers for register declarations.
| This makes a big difference in code size and execution speed.
| 
| So before more comparisons are made, let's do it in the same operating
| environment and not compare code generated under MS-DOS with code generated
| under UNIX.

  This totally mystifies me. Did someone bring in some MS-DOS figures?
My figures were clearly for Xenix, and I clearly stated the versions and
stuff (I think the info was quoted and appears above).
-- 
	bill davidsen		(wedu@crd.GE.COM)
  {uunet | philabs}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

pb@idca.tds.PHILIPS.nl (Peter Brouwer) (05/02/89)

In article <13718@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
>
>| >>   I checked the size of image files compiled from the same source on
>| >> Xenix/386 v2.3.1 and SunOS 3.4 (on a 3/280 if anyone cares). I noted
>My figures were clearly for Xenix, and I clearly stated the versions and
>stuff (I think the info was quoted and appears above).
A table of executable sizes, as reported by "size", deleted.....

To get a good feeling for the difference in size one should also compare
the sizes of the objects from which the executable is linked. This eliminates
the difference in size of the linked-in library parts of the operating system.
Of course those also indicate something about the difference in size, but
there might be system-dependent code in them.

My experience comparing sizes this way is that small functions in C code
tend to get 5-10% bigger in object code for the 386, up to 50% for very large
functions, using the same source code with the same ifdefs and compiler flags.

Another way of comparing sizes is to use the output of nm (it lists the
function sizes) or the output of ld with the -m flag to generate a
link map.
-- 
#  Peter Brouwer,                     ##
#  Philips TDS, Dept SSP-V2           ## voice +31 55 432523
#  P.O. Box 245                       ## UUCP address ..!mcvax!philapd!pb
#  7300 AE Apeldoorn, The Netherlands ## Internet pb@idca.tds.philips.nl

chip@vector.Dallas.TX.US (Chip Rosenthal) (05/02/89)

pb@idca.tds.PHILIPS.nl (Peter Brouwer) writes:
>Is microsoft delivering a c compiler for unix/xenix?

The C compiler for XENIX *is* Microsoft C.  (asm for XENIX is masm, which
creates horrible problems for gcc, I'm told.)
-- 
Chip Rosenthal / chip@vector.Dallas.TX.US / Dallas Semiconductor / 214-450-5337

pcg@aber-cs.UUCP (Piercarlo Grandi) (05/03/89)

In article <13718@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
    
      This totally mystifies me. Did someone bring in some MS-DOS figures?
    My figures were clearly for Xenix, and I clearly stated the versions and
    stuff (I think the info was quoted and appears above).

But the comparisons you made may still be apples vs. oranges... I have the
strong suspicion that you compared executable sizes, which typically include
libraries that are very different. For example, stdio with/without printf is
very different between SunOS and System V/386. Also, you have shared
libraries in some systems (SunOS 4, UNIX 5.3), and not in others. There are
many other factors that influence executable size; e.g. SunOS uses 8k
buffers for stdin/stdout, etc...

The comparison ought to be made between the .o files of fairly large
modules.  As to this, my impression is that 386 objects are a bit smaller
than 68020 objects, thanks to many common 386 instructions being short (as
somebody correctly remarked about unorthogonality). I would venture to say
(based on a very small sample) that 386 code is about 5-10% smaller than
68020 code. Both compilers used are highly optimizing.

By the way, both the 386 and the 68020 (which are supposedly very similar
to the 486 and 68040) use linear 32-bit addresses (virtually nobody uses the
48-bit pointer large model of the 386), so this is not a factor.

The 68020 has more registers than the 386, but this ought to count for
little, as according to literature the knee of the curve for C is before or
at 4 registers (important note: for typical Unix utilities, e.g. excluding
floating point) because of the average simplicity of C expressions.
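
To make the "simple expressions" point concrete (the function is mine, not
taken from the literature): the Sethi-Ullman number of an expression tree is
the minimum number of scratch registers needed to evaluate it without
spilling, and even a moderately busy C expression rarely needs more than
three or four.

    int su_demo(int a, int b, int c, int d)
    {
        /* Leaves need 1 register each; a binary node needs
         * max(left, right), or left + 1 when the two sides tie.
         * Here (a + b) -> 2, (c - d) -> 2, so the product -> 3. */
        return (a + b) * (c - d);
    }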
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

cquenel@polyslo.CalPoly.EDU (24 more school days) (05/04/89)

In 9681 pcg@cs.aber.ac.uk (Piercarlo Grandi) sez:
>Both compilers used are highly optimizing.
>
>The 68020 has more registers than the 386, but this ought to count for
>little, as according to literature the knee of the curve for C is before or
>at 4 registers (important note: for typical Unix utilities, e.g. excluding
>floating point) because of the average simplicity of C expressions.

	Hahahahah ha ha ha ha ha. 
	Oh. excuse me. (chuckle)
	I can't help myself.

	Highly optimizing compilers have long been able to make very
	good use of more than four registers.

>Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
>Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
>Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk



-- 
@---@  -----------------------------------------------------------------  @---@
\. ./  | Chris (The Lab Rat) Quenelle      cquenel@polyslo.calpoly.edu |  \. ./
 \ /   |  You can keep my things, they've come to take me home -- PG   |   \ / 
==o==  -----------------------------------------------------------------  ==o==

davidsen@sungod.steinmetz (William Davidsen) (05/04/89)

In article <902@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
|In article <13718@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
|    
|      This totally mystifies me. Did someone bring in some MS-DOS figures?
|    My figures were clearly for Xenix, and I clearly stated the versions and
|    stuff (I think the info was quoted and appears above).
|
|But the comparisons you made may still be apples vs. oranges... I have the
|strong suspicion that you compared executable sizes, which typically include

  The conditions I used were stated in the original posting in great
detail. I leave it to the reader to decide the value of the results
posted, but I clearly didn't try to compare MS-DOS and SunOS as the
poster implied. I used the best 386 and best 68k compiler to which I
have access, and made no claims or deductions from the results.

  Everything I have seen indicates that you are correct in assuming
5-10% smaller code (text segment) for the 386 *using compilers available
to me*.

pcg@aber-cs.UUCP (Piercarlo Grandi) (05/07/89)

In article <10979@polyslo.CalPoly.EDU> cquenel@polyslo.CalPoly.EDU (24 more
school days) writes:
    
    	Hahahahah ha ha ha ha ha. 
    	Oh. excuse me. (chuckle)
    	I can't help myself.

This is fairly evident :-] :-] ....
    
    	Highly optimizing compilers have long been able to make very
    	good use of more than four registers.

Well, in a long discussion a few months ago in comp.lang.c, nobody has been
able (in a period of two months) to quote any FIGURES that support this and
other urban legends (e.g. Elvis is alive and designed the Z80000 :->). There
are plenty of figures though about the average extreme simplicity of actual
statements/expressions/constructs in algorithmic level languages, which does
not bode well for the usefulness of large register files.

You may become less amused after having read some old and half-forgotten
papers, e.g. one by Knuth on the complexity of Fortran programs, and one that
appeared in SIGPLAN on the average complexity of C expressions and the
number of registers needed to optimize them (one dated early seventies and
the other late seventies, if I remember well -- this is not the machine on
which I keep all my references, unfortunately).

Even some of the RISC guys, and some that do a great deal of inter-expression
register assignments (which I don't like at all -- long live "register"!),
like MIPS, don't see a great advantage in extravagant register file sizes.

	If John Mashey could give us some of the numbers that MIPS have
	surely got on where they estimate the knees of the curves for
	registers for intra and inter statement optimization to be, we would
	be greatly enlightened (e.g. as to why their register file is so
	different from that of SPARC and 29K); maybe he has already, and I
	have missed them.

Sixteen registers in total, of which some are dedicated, some are assigned
to global values (I remember having seen papers that say that keeping the
constants 0 and 1 in two registers may pay off big on some/many architectures),
and some are used for "register" variables, may be reckoned to be plenty,
and they don't leave much more than four for scratch usage. Eight (e.g. the
486) starts to be a bit constraining, admittedly, for getting four scratch
registers and some for "register" variables.

As to some small bit of available evidence, it is not very controversial
that the 386 is usually a tad faster than the 68020 with roughly equivalent
system technology (e.g. a cached 386 at 20 MHz typically beats a cached 68020
at 25 MHz under gcc), and code size is a tad smaller as well.  This may not
mean much, and may not be just *because* it has fewer registers, but it seems
to indicate that at least the smaller number of registers does not hurt too
much.

Moreover, on CISC machines that have on-chip caches (like the 486 and the
68040), large register files (whose main use is as a kind of statically
managed cache -- or stack, a la SPARC) would probably not make a whole lot of
difference (except perhaps to [understatement of the year] the not
unimportant issue of chip component count).
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

mash@mips.COM (John Mashey) (05/07/89)

In article <902@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
....
>The 68020 has more registers than the 386, but this ought to count for
>little, as according to literature the knee of the curve for C is before or
>at 4 registers (important note: for typical Unix utilities, e.g. excluding
>floating point) because of the average simplicity of C expressions.

Could you post the specific reference, with detail about the circumstances
that led to this conclusion?
	1) It is very easy to generate such a conclusion for a specific
	machine & compiler combination, such that the conclusion is totally
	irrelevant to other combinations.  For example, I once saw a statement
	that data on a PDP-10 said that the compilers seldom used more than
	15 registers.....well, a PDP-10 has 16 registers....
	2) Good optimizing compilers make the number go up; in any case,
	the number 4 seems a bit low for a generic number.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

tim@crackle.amd.com (Tim Olson) (05/08/89)

In article <907@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
| In article <10979@polyslo.CalPoly.EDU> cquenel@polyslo.CalPoly.EDU writes:
|     
|     	Highly optimizing compilers have long been able to make very
|     	good use of more than four registers.
| 
| Well, in a long discussion a few months ago in comp.lang.c, nobody has been
| able (in a period of two months) to quote any FIGURES that support this and
| other urban legends (e.g. Elvis is alive and designed the Z80000 :->). There
| are plenty of figures though about the average extreme simplicity of actual
| statements/expressions/constructs in algorithmic level languages, which does
| not bode well for the usefulness of large register files.

OK, here are some figures to play with:

The Am29000 calling convention says that global temporary registers are
killed across a procedure call, while values in local registers remain
alive.  When the Am29000 compilers perform lifetime analysis for
register assignment, they note which values must be alive across a
procedure call and assign those to local registers; the others are
assigned to global registers.

A static analysis of 495 functions shows that an average of 6.6 global
registers and an average of 7.0 local registers are used per function,
with the following register-use histogram: (ain't 'awk' wonderful? ;-)

	495 total functions
	ave globals: 6.5596     ave locals: 7.04242
	--- globals ---         --- locals ---
	  0:  1.82% (9)           0:  1.01% (5)
	  1:  0.81% (4)           1:  5.66% (28)
	  2:  7.27% (36)          2:  9.29% (46)
	  3:  4.24% (21)          3: 13.94% (69)
	  4: 10.51% (52)          4: 14.95% (74)
	  5:  9.90% (49)          5:  8.08% (40)
	  6: 17.58% (87)          6: 10.71% (53)
	  7: 15.56% (77)          7:  7.47% (37)
	  8:  8.08% (40)          8:  3.84% (19)
	  9: 11.72% (58)          9:  3.84% (19)
	 10:  4.85% (24)         10:  4.85% (24)
	 11:  2.83% (14)         11:  3.03% (15)
	 12:  1.41% (7)          12:  1.01% (5)
	 13:  0.40% (2)          13:  1.01% (5)
	 14:  1.21% (6)          14:  1.41% (7)
	 15:  0.00% (0)          15:  0.81% (4)
	 16:  0.81% (4)          16:  0.40% (2)
	 17:  0.20% (1)          17:  0.81% (4)
	 18:  0.00% (0)          18:  0.40% (2)
	 19:  0.20% (1)          19:  1.41% (7)
	 20:  0.40% (2)          20:  0.61% (3)
	 21:  0.00% (0)          21:  1.62% (8)
	 22:  0.00% (0)          22:  0.61% (3)
	 23:  0.00% (0)          23:  0.20% (1)
	 24:  0.20% (1)          24:  0.00% (0)
	>24:  0.00% (0)         >24:  3.03% (15)

With a stack cache, the cost of saving and restoring the live registers
across procedure calls is constant up to the spill/fill boundary, while
with a simple global register file, the cost depends upon the number of
live registers (1 store + 1 load per live register).

If we assume an average of 7 live registers and 1.5% to 2.5% calls in
the dynamic instruction mix, live register save and restore would
account for an extra 21% ([7 stores + 7 loads]/66 instructions between
calls) to 35% ([7 stores + 7 loads]/40 instructions between calls) of
execution time, assuming that loads and stores take 1 cycle.  However,
leaf-procedure optimizations (which we also perform in the current
Am29000 compilers) can reduce the number of calls which must save live
registers by perhaps 30%, resulting in an overall execution time
increase of from 14.7% to 24.5%.
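
The arithmetic above, spelled out (same assumptions as stated there; this
just restates the numbers, it is not new data):

    #include <stdio.h>

    /* extra execution time due to saving/restoring live registers at calls */
    static double overhead(double call_fraction, double leaf_savings)
    {
        double insns_between_calls = 1.0 / call_fraction;   /* ~66 or 40    */
        double save_restore = 7 + 7;                        /* stores+loads */
        return (1.0 - leaf_savings) * save_restore / insns_between_calls;
    }

    int main(void)
    {
        printf("no leaf opt:   %4.1f%% .. %4.1f%%\n",
               100 * overhead(0.015, 0.0), 100 * overhead(0.025, 0.0));
        printf("with leaf opt: %4.1f%% .. %4.1f%%\n",
               100 * overhead(0.015, 0.3), 100 * overhead(0.025, 0.3));
        return 0;
    }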

Now, the Am29000 compilers perform CSE at a very low level (we assume
that registers are cheap, loads and stores expensive), so we probably
have a somewhat higher number of live registers across function calls
than other architectures.  I'm sure someone at MIPS can supply us with
their equivalent numbers.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

seibel@cgl.ucsf.edu (George Seibel) (05/08/89)

In article <25546@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>In article <907@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>| In article <10979@polyslo.CalPoly.EDU> cquenel@polyslo.CalPoly.EDU writes:

>|     	Highly optimizing compilers have long been able to make very
>|     	good use of more than four registers.

>| Well, in a long discussion a few months ago in comp.lang.c, nobody has been
>| able (in a period of two months) to quote any FIGURES that support this and

>OK, here are some figures to play with:
 [...]
>A static analysis of 495 functions shows that an average of 6.6 global
>registers and an average of 7.0 local registers are used per function,
>with the following register-use histogram: (ain't 'awk' wonderful? ;-)

Thanks for the data, Tim.   I applaud the analysis of real world code
in computer design, but I'm concerned that any large database of code
might be heavily stacked with code that just doesn't matter.   Many
millions of dollars of supercomputer time are burned yearly at the
NSF centers; a handful of application programs account for much of
this time.   If you profile these applications you typically find a
small fraction of the code is responsible for a large fraction of the
cpu time.   I have looked at the assembly code for some of these
computational kernels, and said, "Damn! look at all these spills!"
Eight (vector) registers was not "enough" in the cases I looked at.

My main point, however, is that modern micros are already fast enough
for much of what we do.  If someone designs a machine to run vi ten
times faster than a 68030, I'm not going to be impressed.  As machines
get faster, the code database from which one should design either gets
smaller, since there are fewer existing functions that still represent
bottlenecks, or it gets harder to build, because it becomes more heavily
weighted with the code we would *like* to write, except that machines
aren't yet fast enough to run it.

George Seibel, UCSF

peter@ficc.uu.net (Peter da Silva) (05/08/89)

In article <907@aber-cs.UUCP>, pcg@aber-cs.UUCP (Piercarlo Grandi) writes:
> As to some small bit of available evidence, it is not very controversial
> that the 386 is usually a tad faster than the 68020 with roughly equivalent
> system technology (e.g. a cached 386 at 20 MHz typically beats a cached 68020
> at 25 MHz under gcc), and code size is a tad smaller as well.

Of course you realise you're comparing intel's latest production technology
with Motorola's last-but-one generation. How does the 80386 compare to the
68030?
-- 
Peter da Silva, Xenix Support, Ferranti International Controls Corporation.

Business: uunet.uu.net!ficc!peter, peter@ficc.uu.net, +1 713 274 5180.
Personal: ...!texbell!sugar!peter, peter@sugar.hackercorp.com.

bcase@cup.portal.com (Brian bcase Case) (05/09/89)

>Well, in a long discussion a few months ago in comp.lang.c, nobody has been
>able (in a period of two months) to quote any FIGURES that support this [that
>lots of registers can be used to advantage by compilers.]
>There are plenty of figures though about the average extreme simplicity of
>actual statements/expressions/constructs in algorithmic level languages, which
>does not bode well for the usefulness of large register files.

Um, I, as well as many who read here, can give you at least a few references
that talk about the number of registers that can be put to use.  The average
"extreme simplicity" of individual arithmetic statements is only a part of
what determines the number of registers that can be used.

>You may become less amused after having read some old and half-forgotten
>papers, e.g. one by Knuth on the complexity of Fortran programs, and one that
>appeared in SIGPLAN on the average complexity of C expressions and the
>number of registers needed to optimize them (one dated early seventies and
>the other late seventies, if I remember well -- this is not the machine on
>which I keep all my references, unfortunately).

These findings are relevant only if you want always to load the operands
from memory before executing the expression and always to store the result
to memory after executing the expression.  If you would like to keep the
source operands and the result around for a while to reduce the number of
memory references, and thereby create an opportunity to increase performance,
then you will want more registers.

>	If John Mashey could give us some of the numbers that MIPS have
>	surely got on where they estimate the knees of the curves for
>	registers for intra and inter statement optimization to be, we would
>	be greatly enlightened (e.g. as to why their register file is so
>	different from that of SPARC and 29K); maybe he has already, and I
>	have missed them.

I believe he has in the past given such numbers, and they are something like
24 to 28 registers, depending on a few variables.  The reason the 29K and
SPARC have larger register files is not because we thought there was vastly
more opportunity for common subexpression elimination or that intra-procedure
register lifetimes are longer than MIPS thinks they are.  The reason has to
do with decreasing procedure call overhead.  For the 29K, in which, unlike
the SPARC, all registers are simultaneously addressable, we (or at least I)
thought that since there is no substitute for registers (that is, they are
much faster, in an earlier pipestage, and have 3x the bandwidth of a plain
old cache), it is not a bad idea to have lots of them.  This might sound
unscientific, but we didn't just say "this feels good, let's do it."

>Moreover, on CISC machines that have on-chip caches (like the 486 and the
>68040), large register files (whose main use is as a kind of statically
>managed cache -- or stack, a la SPARC) would probably not make a whole lot of
>difference (except perhaps to [understatement of the year] the not
>unimportant issue of chip component count).

Having the needed operands in the register file is a lot better than having
them in the data cache, unless the cache is triple-ported and in the decode
stage of the pipe.  But then, it is just a large register file....

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (05/09/89)

In article <25546@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>
>The Am29000 calling convention says that global temporary registers are
>killed across a procedure call, while values in local registers remain
>alive. 

This certainly is a good way to do it.  Is the terminology you
use for local and global standard? (It is the opposite of what I have
usually heard before, and seems counterintuitive - it seems that the
temporaries should be local temporaries, although as you have used it
the registers are "global temporary registers" because they are assigned
to handle temporaries in all procedures- hence "global".)

>A static analysis of 495 functions shows that an average of 6.6 global
>registers and an average of 7.0 local registers are used per function,
:
>	 11:  2.83% (14)         11:  3.03% (15)
:

An interesting set of figures.  Using 32 general purpose registers, with
16 for local and 16 for temporaries, would certainly seem to fit, given where
the knee of the curve is.

Anyway, I wonder what the results look like for things like double
precision: Linpack, the Livermore Loops, the NAS kernels, etc.
(In other words, 64 bit floating point numeric codes...)  ...?



  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

tim@crackle.amd.com (Tim Olson) (05/09/89)

In article <25127@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
| In article <25546@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
| >
| >The Am29000 calling convention says that global temporary registers are
| >killed across a procedure call, while values in local registers remain
| >alive. 
| 
| This certainly is a good way to do it.  Is the terminology you
| use for local and global standard? (It is the opposite of what I have
| usually heard before, and seems counterintuitive - it seems that the
| temporaries should be local temporaries, although as you have used it
| the registers are "global temporary registers" because they are assigned
| to handle temporaries in all procedures- hence "global".)

The terms "local" and "global" refer to the different types of registers
on the Am29000 -- there are 64 global registers, which are addressed
absolutely, and 128 local registers, which are addressed relative to an
internal stack pointer (used to implement a software stack cache).  32
of the globals are reserved for the OS for important things (like TLB
miss page table pointers, etc.); there are 24 globals reserved for
temporary registers -- the rest are for stack pointers, etc.
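
Written down as a table (the counts come from the paragraph above; the
names are mine, not AMD's):

    enum am29000_regfile_split {
        GLOBAL_REGS   = 64,     /* addressed absolutely                     */
        OS_RESERVED   = 32,     /*   of the globals: TLB-miss pointers etc. */
        GLOBAL_TEMPS  = 24,     /*   of the globals: killed across calls    */
        OTHER_GLOBALS =  8,     /*   of the globals: stack pointers etc.    */
        LOCAL_REGS    = 128     /* addressed relative to an internal stack  */
                                /* pointer (the software stack cache)       */
    };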

| >A static analysis of 495 functions shows that an average of 6.6 global
| >registers and an average of 7.0 local registers are used per function,
| :
| >	 11:  2.83% (14)         11:  3.03% (15)
| :
| 
| An interesting set of figures.  Using 32 general purpose registers, with
| 16 for local and 16 for temporaries, would certainly seem to fit, given where
| the knee of the curve is.

Except for the constant saving and restoring of live variables which
occurs if you don't have stack-cache hysteresis...

Also, techniques used to uncover more parallelism, such as loop
unrolling, software pipelining, and trace-scheduling tend to require
many more registers.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

pcg@aber-cs.UUCP (Piercarlo Grandi) (05/09/89)

In article <25546@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:

    |     	Highly optimizing compilers have long been able to make very
    |     	good use of more than four registers.

	[ .... ]
    
    OK, here are some figures to play with:

	[ .... ]
    
    A static analysis of 495 functions shows that an average of 6.6 global
      ^^^^^^
    registers and an average of 7.0 local registers are used per function,
							^^^^
    with the following register-use histogram: (ain't 'awk' wonderful? ;-)

Too bad that these figures don't mean anything, except that your compiler
can 'make use of more than four registers'. The 'very good' after 'make' is
not proved at all. To prove that, you need to generate code assuming that you
have, say, 1 to 16 registers available, and then show that as the number of
registers increases, program speed/code size improves significantly.

The one paper I read about this (unfortunately for John Mashey I cannot
find the exact reference -- the reason is too embarrassing, even if not for
me, to state publicly) was about taking the PCC (for the PDP) and changing the
number of registers available to its Sethi-Ullman register allocator, and
then benchmarking a few Unix tools.

They found that in these conditions (CISC machine, no interexpression
optimization, virtually only fixed point computation) speed/code size did
not improve substantially with more than three scratch registers, and four
were plenty.

I can imagine that for machines not like the 386/68020, e.g. RISC machines
with a reg-reg architecture, more registers may be useful, but as far as I
know there are no figures for this situation. This is an interesting
research project: take GCC for the SPARC, and redo the exercise. Or the AMD
29k compiler, or the MIPS compilers suite, etc...

I still find it difficult to believe that one would find a substantial difference
(especially given the abundant statistics on the simplicity of the average
expression -- expressions with more than two operators are a rarity) and
indeed the AMD data above seem to say that seven registers is about what a
compiler can use (for expression optimization). This, let me say, looks like
four registers + three for local "register" variables :->.

As to the six global registers, their contribution is hard to assess. But
about them, let me say that on one thing I agree: global "register" variables
(which unfortunately C does not have, thus forcing the compiler to intuit them) are
demonstrably good in one important case, when the program to which they are
global uses them to cache the state of some automaton, e.g. an interpreter.

In the end, I see here nothing against the idea that 16 total registers is
plenty and 8 adequate if a bit constraining, but the difference is not going
to be great... All the more so in the original context: CISC reg-mem machines,
and code size as the metric.

As for me, I am very fond of zero-address architectures, with the top of the
stack (let's make it the top four stack cells) cached. I like CRISP architectures...
:-> :->.
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

pcg@aber-cs.UUCP (Piercarlo Grandi) (05/09/89)

In article <4103@ficc.uu.net> peter@ficc.uu.net (Peter da Silva) writes:

    In article <907@aber-cs.UUCP>, pcg@aber-cs.UUCP (Piercarlo Grandi) writes:
    > As to some small bit of available evidence, it is not very controversial
    > that the 386 is usually a tad faster than the 68020 with roughly equivalent
    > system technology (e.g. a cached 386 at 20 MHz typically beats a cached 68020
    > at 25 MHz under gcc), and code size is a tad smaller as well.
    
    Of course you realise you're comparing intel's latest production technology
    with Motorola's last-but-one generation.

No, I did not realize that. Maybe I was wrong -- but what I see around me,
available for purchase, is 68020/386 workstations, for roughly the same price.
If I were going to compare the Intel/Motorola families, I would couple
them as 6809/8088, 68010/80286, 68020/386.

Also, the 68020 was introduced not much before the 386, and certainly the
386 has been around far longer than the 68030. Or am I goofing?

    How does the 80386 compare to the 68030?

You get me a 68030 and I'll tell you :->... From various papers and benchmarks
on the few 68030 machines around (NeXT, Macs, Suns), it seems that one
statement I have read (68030 == 68020 + 10-15%) is quite reasonable overall.

This would put the 68030 on a par with the 386 in speed. As to the original
issue, code size, that is not going to be influenced. Maybe having 16
registers really is better than having only 8, but then probably the
difference is not large enough to compensate for the shorter common
instructions of the 386.

Ah, by the way: I HATE the Intel architecture. Too bad that their technology
floats on a sea of dollars from all the suckers (like me :->) that buy
AT clones...
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

pcg@aber-cs.UUCP (Piercarlo Grandi) (05/10/89)

In article <25127@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
    
    >A static analysis of 495 functions shows that an average of 6.6 global
    >registers and an average of 7.0 local registers are used per function,
    :
    >	 11:  2.83% (14)         11:  3.03% (15)
    :
    
    An interesting set of figures.  Using 32 general purpose registers, with
    16 for local and 16 for temporaries, would certainly seem to fit, given where
    the knee of the curve is.

Note that this is NOT the curve '# of regs' vs. 'code size' or 'program speed'.
It is the curve '# of regs' vs. 'max # of regs that a given optimizer can make
any use of in several procedures'. Therefore 32 registers seem to be an
UPPER BOUND on the number of registers that in the worst case may be useful.
    
    Anyway, I wonder what the results look like for things like double
    precision: Linpack, the Livermore Loops, the NAS kernels, etc.
    (In other words, 64 bit floating point numeric codes...)  ...?

This would be interesting to see. I suspect that more registers would
be nice, but then all these codes are usually vectorizable, and then one
should use vector instructions on vector registers...

Hint: The number of scratch registers a compiler finds *useful* for
optimizing is more or less related directly to the maximum number of
subexpressions that can be computed concurrently at any one given time.  In
other words, if four is that number, that means that in the typical
statement/expression data dependencies are such that at most four
subexpressions could be concurrently computed. In normal programs, it is
hard to see how this implicit degree of concurrency could be raised much.
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (05/10/89)

In article <4103@ficc.uu.net> peter@ficc.uu.net (Peter da Silva) writes:
>In article <907@aber-cs.UUCP>, pcg@aber-cs.UUCP (Piercarlo Grandi) writes:
>> As to some small bit of available evidence, it is not very controversial
>> that the 386 is usually a tad faster than the 68020 with roughly equivalent
>> system technology (e.g. a cached 386 at 20 MHz typically beats a cached 68020
>> at 25 MHz under gcc), and code size is a tad smaller as well.
>
>Of course you realise you're comparing intel's latest production technology
>with Motorola's last-but-one generation. How does the 80386 compare to the
>68030?

If you take a look at the latest dhrystone (for whatever it's worth) 
results, you'll see that the 80386 does it faster than the 68030.  

There seems to be a lot of variability though -- like a 16 MHz '386 is
faster than a NeXT 33 MHz 68030 (both using GCC)?


-- 
  Jon Zeeff			zeeff@b-tech.ann-arbor.mi.us
  Ann Arbor, MI			sharkey!b-tech!zeeff

phd_ivo@gsbacd.uchicago.edu (05/10/89)

>There seems to be a lot of variability though -- like a 16 MHz '386 is
>faster than a NeXT 33 MHz 68030 (both using GCC)?

Even though some popular magazines have said so, to the best of my knowledge
and all NeXT sales prospectuses, the NeXT runs at 25 MHz, not 33 MHz.

Apropos unusual architectures (and Burroughs access checking): being a complete
outsider, I am quite startled by the fact that computers do not implement much
group checking. If we can waste a bit on parity checking (how many parity errors
have you had?), why can't we use one to signal READ-ONLY or NO-ACCESS for each
byte? This way, one can put a single NO-ACCESS byte between data structures at
link/load time, which should detect out-of-range pointers much more quickly
than the wait-and-see approach in use in our systems. (At least I have spent
days of my life searching for wild pointers.) It would also seem that this
concept would go quite well with object-oriented languages, which try to but
never can check whether non-member functions access objects.
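
A page-granularity analogue of this can be faked today on systems with
mmap() and mprotect() (a sketch under those assumptions; "electric fence"
style debugging allocators work this way, at the cost of a page of address
space per object):

    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Allocate size bytes placed flush against an inaccessible guard page,
     * so running a pointer off the end faults immediately instead of
     * silently trashing the next data structure.  (No alignment handling;
     * fine for byte arrays, illustration only.) */
    static void *guarded_alloc(size_t size)
    {
        long page = sysconf(_SC_PAGESIZE);
        size_t rounded = (size + page - 1) / page * page;
        char *base = mmap(NULL, rounded + page, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (base == MAP_FAILED || mprotect(base + rounded, page, PROT_NONE))
            return NULL;
        return base + rounded - size;
    }

    int main(void)
    {
        char *p = guarded_alloc(8);
        if (p)
            strcpy(p, "1234567");   /* fits, with its terminating NUL */
        /* p[8] = 'x'; */           /* would take a SIGSEGV at once   */
        return 0;
    }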

/ivo welch	phd_ivo@gsbacd.uchicago.edu

jesup@cbmvax.UUCP (Randell Jesup) (05/10/89)

In article <9329@b-tech.ann-arbor.mi.us> zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) writes:
>If you take a look at the latest dhrystone (for whatever it's worth) 
>results, you'll see that the 80386 does it faster than the 68030.  

	Dhrystone falls down a bit here, since it over-relies on an unusually
large number of string operations, and the string compares are unusually
long (before a difference).  I believe the x86 family has string instructions
that help it with these operations.

>There seems to be a lot of variability though -- like a 16 MHz '386 is
>faster than a NeXT 33 MHz 68030 (both using GCC)?

	Interesting.  How about this: ~8500 dhrystones (register version)
for a 25 MHz '030 Amiga (it gets a bit more if you disable task-switching --
it keeps polling the disk drives for insertion/removal, executing VBlank
interrupts, etc.).  One of the many reasons why dhrystones should be taken
with a grain of salt, especially for comparing processors (compiler tech
gets in here too -- the 8500 was with a microcomputer C compiler).

-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

lord@se-sd.sandiego.ncr.com (Dave Lord) (05/11/89)

In article <3150@tank.uchicago.edu> phd_ivo@gsbacd.uchicago.edu writes:

>byte? This way, one can put a single NO-ACCESS byte between data structures at
>link/load time, which should detect out-of-range pointers much more quickly
>than the wait-and-see approach in use in our systems. (At least I have spent
>days of my life searching for wild pointers.) It would also seem that this
>concept would go quite well with object-oriented languages, which try to but
>never can check whether non-member functions access objects.

With my Computer Scientist hat on I think this is a great idea, but with
my 'can we sell this system' hat on I see a potential problem:

How much "C" code have you seen that does something like this:

	while (a[index] != 0 and index++ <= A_BOUNDS){ ... whatever ...};

The above is lousy code (it could also be done with pointers), but the
problem is that it almost always works OK. How many people will scream
because they have to clean up their programs to get them to run on
your system? Will there be problems porting important applications?

OK, if you don't see it, we are overrunning the end of the array. The
above case is easy to fix, but other cases might not be. And before
you flame me, I assure you, I wouldn't code that way (Usually).

Dave Lord

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (05/11/89)

In article <926@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>In article <25127@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>    >	 11:  2.83% (14)         11:  3.03% (15)
:
>    the knee of the curve is.

>Note that this is NOT the curve '# of regs' vs. 'code size' or 'program speed'.
>It is the curve '# of regs' vs. 'max # of regs that a given optimizer can make
>any use of in several procedures'. Therefore 32 registers seem to be an
>UPPER BOUND on the number of registers that in the worst case may be useful.

I assume that any purpose the optimizer can use the registers for is
legitimate, as long as it significantly reduces memory traffic.  That includes
using the registers as a "cache", if you like to think of it that way.  One
Fortran engine (the constraints are different from C, of course) -- the Cyber
205 -- had 256 registers.  It turned out that this was "too many" -- often
many of the registers went unused.  I always assume that the limiting factor
is the optimizer.  The now-common rule of thumb that "32 registers is
enough" is based on basic-block optimization in C, generally.  Really smart
optimizers might be able to effectively use more registers.  The whole idea
of "RISC" is to design around what compilers can actually do, as opposed to
what they might do if they were smart enough.  Code size is generally not as
much of a factor, although it does affect speed indirectly (e.g. less dense
code requires a bigger I-cache, etc.).

The 205 optimizer assigned all non-common scalars to registers permanently.  
I doubt if this really saved memory traffic, because
it meant that all the scalar variables got swapped in even if they were not
used in that subroutine invocation.  But it did make use of a fast swap, and 
also permitted certain register to register operations involving scalars to 
take place in parallel with vector operations. Therefore, it may actually have
saved time.   In general, a dynamic analysis would be 
needed to determine the optimum register assignments for each module.

In the case of the 205, I would guess that around 64 registers would
probably have covered almost all of the cases.  Register assignments that I
looked at rarely used more than 50-60.  I note that the 205 calling
convention did define scratch registers not saved beyond procedure calls or,
if I recall correctly, beyond basic-block optimization (not sure about the
last statement).  I think 10 (?) scratch registers
were allocated, and these were not always enough.  16 would have been better,
at least for Fortran.

>   >  Anyway, I wonder what the results look like for things like double

>This would be interesting to see. I suspect that more registers would
>be nice, but then all these codes are usually vectorizable, and then one
>should use vector instructions on vector registers...

I didn't know that it was REQUIRED to have a vector machine to run
Fortran programs :-)

>Hint: The number of scratch registers a compiler finds *useful* for
>optimizing is more or less related directly to the maximum number of
>subexpressions that can be computed concurrently at any one given time.  In

The reason I brought it up is that the analyses one usually sees in this
group are based on C.  Fortran codes MAY (I don't know if they do) have
significantly different behavior in this respect for various reasons.
Therefore, it might be worth checking to make sure that the same assumptions
still apply.  It might be the case that 64 32-bit registers are a better fit
for the suggested Fortran examples.

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

hascall@atanasoff.cs.iastate.edu (John Hascall) (05/11/89)

In article <1921@se-sd.sandiego.ncr.com> lord@se-sd.sandiego.NCR.COM (Dave Lord(SSP)) writes:
>In article <3150@tank.uchicago.edu> phd_ivo@gsbacd.uchicago.edu writes:
 
>>byte? This way, one can put a single NO-ACCESS byte between data structures at
>>link/load time, which should detect out-of-range pointers much more quickly
 
>With my Computer Scientist hat on I think this is a great idea, but with
>my 'can we sell this system' hat on I see a potential problem:
 
>How much "C" code have you seen that does something like this:
 
>	while (a[index] != 0 and index++ <= A_BOUNDS){ ... whatever ...};
			     &&  (I presume)
  
>OK, if you don't see it, we are overrunning the end of the array. The
>above case is easy to fix, but other cases might not be. And before
>you flame me, I assure you, I wouldn't code that way (Usually).
 
     This can already break on many machines -- say, for example, if the
     last byte of the array just happens to fall on the last byte of
     a page (and the next page is no-access).

     Since the C operator && guarantees short-circuit evaluation, I don't
     see any reason to use such sloppy code (just re-order the expressions).
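
     For instance, with the test re-ordered (A_BOUNDS taken as the last
     valid index, a stand-in value; a sketch, not a drop-in fix):

         #define A_BOUNDS 99         /* last valid index; stand-in value  */

         int scan(int a[])           /* assumes a[] has A_BOUNDS+1 slots  */
         {
             int index = 0;
             while (index <= A_BOUNDS && a[index] != 0)
                 index++;            /* ... whatever ... */
             return index;
         }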

     John Hascall
     ISU Comp Center

grunwald@flute.cs.uiuc.edu (05/11/89)

You neglect to mention that PCC is/was an antiquated compiler; a more
meaningful measure would be to look at, e.g., `gcc', which does better
register allocation, CSE, and code hoisting. How much do these affect
register usage?
--
Dirk Grunwald
Univ. of Illinois
grunwald@flute.cs.uiuc.edu

mash@mips.COM (John Mashey) (05/11/89)

In article <25254@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:

>still apply.  It might be the case that 64 32-bit registers are a better fit
>for the suggested Fortran examples.

I don't have it handy, but a good example looking at the use of 64 regs
is:
	David W. Wall, "Global Register Allocation at Link Time,"
SIGPLAN Notices 21(7), 264-275, July 1986.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

pb@idca.tds.PHILIPS.nl (Peter Brouwer) (05/11/89)

In article <9329@b-tech.ann-arbor.mi.us> zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) writes:
>>Of course you realise you're comparing intel's latest production technology
>>with Motorola's last-but-one generation. How does the 80386 compare to the
>>68030?
>
>If you take a look at the latest dhrystone (for whatever it's worth) 
>results, you'll see that the 80386 does it faster than the 68030.  
>
>There seems to be a lot of variability though -- like a 16 MHz '386 is
>faster than a NeXT 33 MHz 68030 (both using GCC)?
Dhrystone figures cannot be used to compare system performance!
Dhrystone figures are heavily influenced by the strxxx functions.
For most Unix 386 systems those functions are coded in assembler with the
repeat-string instructions; the C compiler we use does not generate them.
What I am saying is that a Dhrystone comparison between the 386 and 680x0
cannot be used to compare systems. You need info about other benchmarks as
well. My experience is that a 68020 Unix system performs better than a 386
system, both running at the same clock frequency (16 MHz).
-- 
#  Peter Brouwer,                     ##
#  Philips TDS, Dept SSP-V2           ## voice +31 55 432523
#  P.O. Box 245                       ## UUCP address ..!mcvax!philapd!pb
#  7300 AE Apeldoorn, The Netherlands ## Internet pb@idca.tds.philips.nl

peter@ficc.uu.net (Peter da Silva) (05/11/89)

In article <922@aber-cs.UUCP>, pcg@aber-cs.UUCP (Piercarlo Grandi) writes:
> No, I did not realize that. Maybe I was wrong -- but what I see around me,
> available for purchase, is 68020/386 workstations, for roughly the same price.

And if it weren't for the sea of dollars you refer to, the 386s would cost
as much as, if not more than, the '030s. By the way, the first '030
implementation I know of was the CSA 68030 card for the Amiga.

> If I were going to compare the Intel/Motorola families, I would couple
> them as 6809/8088, 68010/80286, 68020/386.

6809-8088? I love it. However, the 6809 really fell into the cracks... the
68000 was very close to beating it out and was contemporary with the 8086.

68000 -- 8086
68010 -- 80186	# Neither really took off, both just fixed some flaws
		  in the original, at the cost of some compatibility.
68020 -- 80286
68030 -- 80386
68040 -- 80486	# Both announced but not available.

> Also, the 68020 was introduced not much before the 386, and certainly the
> 386 has been around far longer than the 68030. Or am I goofing?

My memory is that the 68020 has been out nearly as long as the 286, and that
the 68030 has been around nearly as long as the 386.
-- 
Peter da Silva, Xenix Support, Ferranti International Controls Corporation.

Business: uunet.uu.net!ficc!peter, peter@ficc.uu.net, +1 713 274 5180.
Personal: ...!texbell!sugar!peter, peter@sugar.hackercorp.com.

slackey@bbn.com (Stan Lackey) (05/11/89)

In article <921@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>In article <25546@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
>    |     	Highly optimizing compilers have long been able to make very
>    |     	good use of more than four registers.
>    A static analysis of 495 functions shows that an average of 6.6 global
>    registers and an average of 7.0 local registers are used per function,
>Too bad that these figures don't mean anything, except that your compiler
>can 'make use of more than four registers'. The 'very good' after 'make' is
>not proved at all. To prove that, you need to generate code assuming that you
>have, say, 1 to 16 registers available, and then show that as the number of
>registers increases, program speed/code size improves significantly.

A static analysis of registers-used can be very misleading.  The way
the Alliant compiler assigned registers was by cycling through all
those available.  This was because the machine was designed with a
scalar pipeline, and the allocator was trying to give the peepholer
the best chance at scheduling.  With this algorithm, the number of
registers used is "all of them".  To derive the number NEEDED requires
lots more work, and is certain to be not only application dependent,
but likely even more dependent upon the writing style of the software
engineer.
-Stan

tim@crackle.amd.com (Tim Olson) (05/11/89)

In article <39803@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
| A static analysis of registers-used can be very misleading. The way
| the Alliant compiler assigned registers was by cycling through all
| those available.  This was because the machine was designed with a
| scalar pipeline, and the allocator was trying to give the peepholer
| the best chance at scheduling.  With this algorithm, the number of
| registers used is "all of them".  To derive the number NEEDED requires
| lots more work, and is certain to be not only application dependent,
| but likely even more dependent upon the writing style of the software
| engineer.

Agreed.  I just posted the static analysis because it was all I had time
for.  However, the compiler that was used does attempt to minimize
register usage through coloring, and does not use the cycling technique
you mention.
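
For readers who haven't met the term, a bare-bones sketch of coloring (mine,
not AMD's compiler; the graph size and register count are invented). Real
allocators add spill-cost priorities, coalescing, and so on:

    /* Values that are live at the same time "interfere" and must get
     * different registers; assigning registers is then coloring the
     * interference graph with K colors.  Greedy version for brevity. */
    #define NVALS   6           /* values (graph nodes) */
    #define NCOLORS 4           /* machine registers available */

    int interferes[NVALS][NVALS];  /* 1 if values i and j are live together */
    int color[NVALS];              /* assigned register, or -1 if spilled */

    void color_graph(void)
    {
        int i, j, c, used[NCOLORS];

        for (i = 0; i < NVALS; i++) {
            for (c = 0; c < NCOLORS; c++)
                used[c] = 0;
            for (j = 0; j < i; j++)     /* already-colored neighbors */
                if (interferes[i][j] && color[j] >= 0)
                    used[color[j]] = 1;
            color[i] = -1;              /* spill if nothing is free */
            for (c = 0; c < NCOLORS; c++)
                if (!used[c]) {
                    color[i] = c;
                    break;
                }
        }
    }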



	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

bcase@cup.portal.com (Brian bcase Case) (05/12/89)

>...you need to generate code assuming that you
>have, say, 1 to 16 registers available, and then show that as the number of
>registers increases, program speed/code size improves significantly.

Well, it's old and CISCy stuff, but the paper:

Chow and Hennessy, "Register Allocation by Priority-based Coloring,"
Proc. SIGPLAN Symp. on Compiler Construction, SIGPLAN notices vol. 19,
No. 6, June 1984.

shows some performance numbers for a variable number of registers.  The
architectures were the PDP-10 and the 68000.  A max. of 9 registers
was available.  The fastest performance was achieved when the max. number
of regs. was used.

>changing the number of registers available to its Sethi-Ullman register
>allocator, and then benchmarking a few Unix tools.
>They found that in these conditions (CISC machine, no interexpression
>optimization, virtually only fixed point computation) speed/code size did
>not improve substantially with more than three scratch registers, and four
>were plenty.

But of course!  Is this a surprise?  It isn't to me.

>I can imagine that for machines not like the 386/68020, e.g. RISC machines
>with a reg-reg architecture, more registers may be useful, but as far as I
>know there are no figures for this situation. This is an interesting
>research project: take GCC for the SPARC, and redo the exercise. Or the AMD
>29k compiler, or the MIPS compilers suite, etc...

The paper quoted above concluded that for the CISC-ish PDP-10 and 68000,
using all 9 available registers was the best (the rest were reserved for 
exclusive use by the code generator, I think).

>I still find it difficult that one would find a substantial difference
>(especially given the abundant statistics on the simplicity of the average
>expression -- expressions with more than two operators are a rarity) and
>indeed the AMD data above seem to say that seven registers is about what a
>compiler can use (for expression optimization). This, let me say, looks like
>four registers + three for local "register" variables :->.

You are forgetting, maybe?, that registers are used in clever ways to
avoid save/restore overhead on procedure calls?  The following paper:

Wall, "Global Register Allocation At Link Time," ACM SIGPLAN conf. on
Compiler Construction, June 1986 (sorry, I can get a more complete ref.
if anyone wants it).

talks about using 52 registers with link-time allocation (the machine,
the DECWRL Titan, has 64 GPRs).  The allocator tried to keep as many
procedure contexts in registers as possible.  Neat stuff.

>As to the six global registers, their contribution is hard to assess. But on
>them let me say that on one thing I agree: global "register" variables (that
>unfortunately C does not have, thus forcing the compiler to intuit them) are
>demonstrably good in one important case, when the program to which they are
>global uses them to cache the state of some automaton, e.g. an interpreter.

For the 29K, they are not used for storing data declared to be in the C
global scope. They are temporaries used for expression evaluation, etc.
If the compiler could put globally-scoped data in global registers (such
as can be done by the DECWRL "at-link-time" stuff), many more could be used.

henry@utzoo.uucp (Henry Spencer) (05/12/89)

In article <921@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>They found that in these conditions (CISC machine, no interexpression
>optimization, virtually only fixed point computation) speed/code size did
>not improve substantially with more than three scratch registers, and four
>were plenty.

You forgot one condition:  ancient and stupid compiler.  Do remember, also,
that most any compiler for the 11 was full of implicit assumptions to the
effect that very few registers were available.
-- 
Mars in 1980s:  USSR, 2 tries, |     Henry Spencer at U of Toronto Zoology
2 failures; USA, 0 tries.      | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

ddb@ns.network.com (David Dyer-Bennet) (05/12/89)

In article <420@ssp2.idca.tds.philips.nl> pb@idca.tds.PHILIPS.nl (Peter Brouwer) writes:
:Dhrystone figures cannot be used to compare systems performance!!!!!!!!
:Dhrystone figures are heavily influenced by the strxxx functions.
:For most unix 386 systems those functions are coded in assembler with
:the repeat instruction. 

Possibly the Dhrystone is TOO heavily influenced by strxxx functions.
However, if *all* 386 C libraries handle the strxxx functions much
better than *all* 680x0 C libraries do, I think something pretty
interesting is showing.
-- 
David Dyer-Bennet, ddb@terrabit.fidonet.org, or ddb@ns.network.com
or ddb@Lynx.MN.Org, ...{amdahl,hpda}!bungia!viper!ddb
or ...!{rutgers!dayton | amdahl!ems | uunet!rosevax}!umn-cs!ns!ddb
or Fidonet 1:282/341.0, (612) 721-8967 9600hst/2400/1200/300

elg@killer.Dallas.TX.US (Eric Green) (05/12/89)

in article <922@aber-cs.UUCP>, pcg@aber-cs.UUCP (Piercarlo Grandi) says:
> Also, the 68020 was introduced not much before the 386, and certainly the
> 386 has been around far longer than the 68030. Or I am goofing?

You're goofing. 68020's have been around at least since 1986 (which is
when I saw my first one, a CCA add-on processor for the Amiga). And
Sun 3s had them before then.  The '020 became available a bit over a
year after the '286, I seem to recall.  '386 machines only recently
have become widely available, i.e., the current availability of the
'386 is about the same as the availability of the '020 in 1986.


The '030, on the other hand, came out slightly after the '386. It's
only now coming into widespread use, about a year after the '386 did.

As for the price difference: Apollo's low-end '030 machine is
advertised for around $6k, which is about what a competitive '386
machine (4 megs of RAM, super-VGA, etc.) would cost. Remember, these
'030 machines you're seeing for sale are Unix workstations, with
multimegabytes of RAM, and a high-speed network card or hard drive,
running Unix, with large (expensive) displays. These aren't the "price
buster" CGA 1 megabyte MS-DOS '386 machines you can get for $3000 out
of Computer Shopper.... to compare, you have to add a bit (and you
still won't get there, because '386 Unixes generally don't do
full-blown windowing and networking). It's interesting to note that
Apollo's low-end '030 machine costs about the same as Apple's '020
machine (is this a case of comparing Apples and, uh, Apollos? ;-).


> on the few 68030 machines around (Next, Macs, Suns), it seems that one
> statement I have read (68030 == 68020+10/15%) is quite reasonable

Figures I've seen indicate about a 20-30% improvement. Note that
performance gains can be sapped by high-overhead operating systems
(e.g. Mach on the NeXT) or by anemic memory systems (e.g. Mac IIx).
You've benchmarked 68030-based Suns??? Considering that they only
recently introduced them, you've been working fast! (the Sun's got
about the same overhead as a NeXT anyhow, except it doesn't have the
object-oriented layer to the user interface). 
    A fair comparison would be, hmm, a generic '386 clone with AT&T's
Sys V and Unix compiler against a generic '020/'030 VME Unix system
running generic AT&T Sys V. I've never seen such a comparison done.
All I've ever seen is comparisons between apples and oranges, uh, Sys
V/Microsoft C vs. BSD4.2/PCC. Each machine would have to have the same
amount of main memory, the same amount of cache, and the same clock
speed. Then we could decide.

> This would put the 68030 on a par with the 386 in speed. As to the

It would, sort of -- I believe they both do a fetch in 2 cycles?
EXCEPT -- 33 MHz 68030s are available NOW. For the '386, it's "Real
Soon Now". In general, Motorola has been ramping the clock up on the
'030 faster than Intel has on the '386. Of course there's a limit to
this -- how fast can your cache be accessed? But by that time, the
'040 will be out, hopefully with increased bus bandwidth, e.g. a Harvard
memory architecture.

--
|    // Eric Lee Green              P.O. Box 92191, Lafayette, LA 70509     |
|   //  ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg     (318)989-9849     |
|  //    Join the Church of HAL, and worship at the altar of all computers  |
|\X/   with three-letter names (e.g. IBM and DEC). White lab coats optional.|

rpw3@amdcad.AMD.COM (Rob Warnock) (05/12/89)

One thing Tim Olson did leave out was the *dynamic* use of the Am29000
local register file (a.k.a. stack cache). Since with the Am29000, "spilling"
a few regs to memory to make room for a new subroutine context usually
does not cause a matching "fill" when the routine returns [those 128 local
regs give you a *lot* of hysteresis], the interesting "knee of the curve"
is in the tradeoff between cache spill/fill traffic and register cache size.
That is, how much incremental spill/fill traffic do you save (thus incremental
bus bandwidth saved and thus performance gained) for each additional register
you put in the stack cache?

I forget the exact numbers, but as I recall, at 128 local registers (which
at ~7.5 local regs/routine would mean ~17 subroutine contexts "live" in the
cache) the slope was still about 0.1% of the CPU gained for each additional
reg you added to the cache. [Correct me if I'm way off, Tim.] That's not
enough to be worth doubling the local register file to 256, but is certainly
enough to not want to cut the size of the cache back to 64 (especially since
as you started cutting back the cost/reg would go up).

This is also why -- given the very large register stack cache -- the
compilers bother to reuse locals: If you can keep the average subroutine
context small, you can keep a lot of live (parent) contexts in the cache
without spilling/filling. This has to be traded off against performance
lost by *not* using another reg when it could save CPU time.  The "cost"
of 0.1% (due to spill/fill traffic) for another reg is a parameter to the
compiler's register allocation algorithm.

And sometimes the C compiler *does* use more than 32 locals (for really
messy routines with lots of nested loops); it's just that the overall
number of registers spilled/filled per subroutine call is *very* low.
[In fact, much less than 1 for any of Dhrystone-1/-2/diff/grep/as/nroff].
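
One back-of-the-envelope way to see that knee is to drive a toy stack-cache
model with a random call/return trace and count the spill/fill traffic at
different cache sizes. Everything below (frame sizes, trace, names) is
invented for illustration; it shows the windowing idea, not the Am29000
mechanism itself.

    #include <stdio.h>
    #include <stdlib.h>

    #define MAXDEPTH 100000

    static long spills, fills;

    /* The cache holds the registers between base and base+cache_size
     * on the register stack.  Calls push a frame, returns pop one;
     * the window moves (spilling or filling) only when it must. */
    static void simulate(int cache_size, long events)
    {
        static int frame[MAXDEPTH];
        long sp = 0, base = 0, depth = 0, i;

        spills = fills = 0;
        srand(1);
        for (i = 0; i < events; i++) {
            if (depth == 0 || (depth < MAXDEPTH && rand() % 2)) {
                int f = 4 + rand() % 8;         /* call: 4..11 regs */
                frame[depth++] = f;
                sp += f;
                if (sp - base > cache_size) {   /* overflow: spill */
                    spills += sp - base - cache_size;
                    base = sp - cache_size;
                }
            } else {                            /* return */
                sp -= frame[--depth];
                if (sp < base) {                /* underflow: fill */
                    fills += base - sp;
                    base = sp;
                }
            }
        }
    }

    int main(void)
    {
        int size;

        for (size = 32; size <= 256; size *= 2) {
            simulate(size, 1000000L);
            printf("cache %3d regs: %ld spilled, %ld filled\n",
                   size, spills, fills);
        }
        return 0;
    }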


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

irf@kuling.UUCP (Bo Thide') (05/12/89)

In article <8081@killer.Dallas.TX.US> elg@killer.Dallas.TX.US (Eric Green) writes:
>in article <922@aber-cs.UUCP>, pcg@aber-cs.UUCP (Piercarlo Grandi) says:
>> Also, the 68020 was introduced not much before the 386, and certainly the
>> 386 has been around far longer than the 68030. Or I am goofing?
>
>You're goofing. 68020's have been around at least since 1986 (which is
>when I saw my first one, a CCA add-on processor for the Amiga). And
>Sun 3s had them before then.  The '020 became available a bit over a

And HP9000/300s had them before Sun....

>As for the price difference: Apollo's low-end '030 machine is
>advertised for around $6k, which is about what a competitive '386
>machine (4 megs of RAM, super-VGA, etc.) would cost. Remember, these
 
We just bought our HP9000/340 with 68030/16.7 MHz for the equivalent
of $5.2 k.

>> on the few 68030 machines around (Next, Macs, Suns), it seems that one
                                                 ^^^^
Sun just announced theirs.  Have they delivered yet?  I got my HP9000/370
half a year ago.  It runs at 8+ MIPS .....

>> statement I have read (68030 == 68020+10/15%) is quite reasonable
>Figures I've seen indicate about a 20-30% improvement. Note that

.. and runs almost twice as fast as my old HP9000/350 with 68020/25 MHz
(about 12000 Dhrys(2.1)/sec vs. 6500).



   ^   Bo Thide'--------------------------------------------------------------
  | |       Swedish Institute of Space Physics, S-755 91 Uppsala, Sweden
  |I|    [In Swedish: Institutet f|r RymdFysik, Uppsalaavdelningen (IRFU)]
  |R|  Phone: (+46) 18-403000.  Telex: 76036 (IRFUPP S).  Fax: (+46) 18-403100 
 /|F|\        INTERNET: bt@irfu.se       UUCP: ...!uunet!sunic!irfu!bt
 ~~U~~ -----------------------------------------------------------------sm5dfw


jed4885@ultb.UUCP (J.E. Dyer) (05/12/89)

In article <1921@se-sd.sandiego.ncr.com> lord@se-sd.sandiego.NCR.COM (Dave Lord(SSP)) writes:
>In article <3150@tank.uchicago.edu> phd_ivo@gsbacd.uchicago.edu writes:
>
>>byte? This way, one can put a single No-ACCESS byte between datastructures at
>>link/load time, which should detect out of pointer ranges much more quickly
>>than the wait-and-see in use in our systems. (At least I have spent days of my
>
>	while (a[index] != 0 && index++ <= A_BOUNDS) { ... whatever ... };

	The problem I see with this method is that not all array access
	is sequential.  I might, for example, have another array which
	contains the index I want to use:

	for (i = 0; i < A_Bounds; i++) {
		x = a[b[i]];
		....
	}

	In this case, if the value in b[i] is invalid, the chances are
	that it's not going to be equal to A_Bounds, which is the only
	error that a no-access byte would detect.  IMHO, parity checking
	is much more valuable.
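
	The only general software fix is to validate the fetched index
	itself, something like the sketch below (array sizes and names
	invented for illustration):

    #include <stdio.h>
    #include <stdlib.h>

    #define A_BOUNDS 100
    #define B_BOUNDS 50

    int a[A_BOUNDS], b[B_BOUNDS];

    /* A no-access guard byte after a[] only catches references just
     * past the end; a wild b[i] lands in some other object entirely,
     * so check the index before using it. */
    int sum_indirect(void)
    {
        int i, x = 0;

        for (i = 0; i < B_BOUNDS; i++) {
            if (b[i] < 0 || b[i] >= A_BOUNDS) {
                fprintf(stderr, "bad index %d at b[%d]\n", b[i], i);
                abort();
            }
            x += a[b[i]];
        }
        return x;
    }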

						--Jason

-sig-of-the-day-
"Kludge??? That's not a Kludge, it's _ART_!" 
BITNET: JED4885@RITVAX			UUCP: jed4885@ultb

nelson@berlioz (Ted Nelson) (05/12/89)

In article <18235@cup.portal.com> bcase@cup.portal.com (Brian Case) writes:
>
>Well, it's old and CISCy stuff, but the paper:
>
>Chow and Hennessy, "Register Allocation by Priority-based Coloring,"
>Proc. SIGPLAN Symp. on Compiler Construction, SIGPLAN notices vol. 19,
>No. 6, June 1984.
>
>shows some performance numbers for a variable number of registers.  The
>architectures were to the PDP-10 and the 68000.  A max. of 9 registers
>was available.  The fastest performance was achieved when the max. number
>of regs. was used.
>

It should be noted that in that article the effect of a variable number of
  registers was measured only on the PDP-10, and furthermore only on short/simple
  programs such as the Queens Problem, Quicksort, etc.

The overall results were:
  0 registers:  1.0  (basis of comparison)
  2 registers:  0.88
  4 registers:  0.84
  6 registers:  0.75
  9 registers:  0.73

For most of the runs, the difference between 6 and 9 registers was minimal
  (0.01 or less), but differences > 0.01 showed up only in Quicksort, Fast
  Fourier Transform, Sieve, and inverse of a matrix.

It's an article worth reading.  If you're really looking for a wealth of
  data (applicable to CISC as well), peek at "Strategies for Managing the
  Register File in RISC" by Tamir and Sequin (IEEE Transactions on Comp,
  Vol C-32, # 11, November 1983, p. 977).

-- Ted.

les@unicads.UUCP (Les Milash) (05/13/89)

In article <25602@amdcad.AMD.COM> rpw3@amdcad.UUCP (Rob Warnock) writes:
>[..] Since with the Am29000, "spilling"
>a few regs to memory to make room for a new subroutine context usually
>does not cause a matching "fill" when the routine returns [those 128 local
>regs give you a *lot* of hysteresis], the interesting "knee of the curve"
>is in the tradeoff between cache spill/fill traffic and register cache size.
>That is, how much incremental spill/fill traffic do you save (thus incremental
>bus bandwidth saved and thus performance gained) for each additional register
>you put in the stack cache?

relatedly i had to admire sun installing compiler tricks to turn tail-
recursive C into iterative C.  

henry@utzoo.uucp (Henry Spencer) (05/14/89)

In article <436@unicads.UUCP> les@unicads.UUCP (Les Milash) writes:
>relatedly i had to admire sun installing compiler tricks to turn tail-
>recursive C into iterative C.  

Optimizing tail recursion into iteration is a win no matter what the
architecture is.  Admittedly the register-window faction has greater
incentive, but it's not a "compiler trick" (unless it's implemented
in some bizarre way).
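
For concreteness, the textbook transformation looks like this (a sketch, not
any particular compiler's output):

    /* Tail-recursive: the recursive call is the last thing the
     * function does, so nothing from this frame is needed afterward. */
    long sum_to(long n, long acc)
    {
        if (n == 0)
            return acc;
        return sum_to(n - 1, acc + n);
    }

    /* What the optimization turns it into: reuse the frame and branch
     * back instead of calling -- no stack growth, no call overhead. */
    long sum_to_iter(long n, long acc)
    {
        for (;;) {
            if (n == 0)
                return acc;
            acc += n;
            n -= 1;
        }
    }
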
-- 
Subversion, n:  a superset     |     Henry Spencer at U of Toronto Zoology
of a subset.    --J.J. Horning | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

pcg@aber-cs.UUCP (Piercarlo Grandi) (05/16/89)

In article <1989May11.210653.2125@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
     
    You forgot one condition:  ancient and stupid compiler.  Do remember, also,
    that most any compiler for the 11 was full of implicit assumptions to the
    effect that very few registers were available.

Now I may be totally wrong, but I seem to remember that they used PCC and tuned
its (bastardized) Sethi Ullman register allocator, which is both pretty
good (even the [S]CC one was quite good) and with no assumptions at all.

They were as surprised as anybody at their results, and the obvious
conclusion was the simple "most C expressions are very simple, even those in
critical paths".  (My memory supplies me with something like: 80% of all
C expressions have fewer than two operators, and about half have just one,
on the usual sample of UNIX/C code.)
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

chris@mimsy.UUCP (Chris Torek) (05/17/89)

In article <949@aber-cs.UUCP> pcg@aber-cs.UUCP (Piercarlo Grandi) writes:
>... I seem to remember that they used PCC and tuned its (bastardized)
>Sethi Ullman register allocator, which is both pretty good (even the
>[S]CC one was quite good) and with no assumptions at all.

(why is the `distribution' `eunet,world'?)

Anyway:  The Sethi-Ullman algorithm is intended to solve the problem
of computing the number of registers required to evaluate a single
expression.  That is, given any arbitrary chunk of code, one breaks
it into individual expressions, feeds that expression through a S-U
numberer, uses the result to decide what registers get what, when to
spill to memory, and so forth, generates one expression, then frees
up all the registers for the next.  None of the registers carry their
values from one expression to the next.
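
The core of it is tiny. Here is a sketch of the numbering for a pure register
machine, binary operators only (a toy tree type, not PCC's own):

    /* Sethi-Ullman number: the minimum number of registers needed to
     * evaluate the expression tree without spilling.  Each leaf costs
     * one register to load; binary operators only. */
    struct node {
        struct node *left, *right;      /* both NULL for a leaf */
    };

    int su_number(struct node *e)
    {
        int l, r;

        if (e->left == NULL)            /* leaf */
            return 1;
        l = su_number(e->left);
        r = su_number(e->right);
        if (l == r)
            return l + 1;       /* one extra to hold the first result */
        return l > r ? l : r;   /* evaluate the needier side first */
    }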

Modern compilers, on the other hand, usually take entire functions (or,
increasingly, take entire programs, typically in the `link' phase) and
allocate registers globally across the thing.  If a() calls b() and b()
calls c() and c() puts arr[17] in r33 and a() happens to need arr[17],
the compiler leaves the value in r33 by arranging for b() to use r32
and r34 so that when b() returns to a(), a() need not reload r33 (or
any other register).  PCC does not now and never has done this sort of
analysis; it forgets everything between one expression and the next.
(Typically a peephole optimiser attempts to make up for this later.)

In compiler circles, Sethi-Ullman numbers are considered rather passe' :-)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

henry@utzoo.uucp (Henry Spencer) (05/18/89)

In article <949@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>    You forgot one condition:  ancient and stupid compiler.  Do remember, also,
>    that most any compiler for the 11 was full of implicit assumptions to the
>    effect that very few registers were available.
>
>Now I may be totally wrong, but I seem to remember that they used PCC and tuned
>its (bastardized) Sethi Ullman register allocator, which is both pretty
>good (even the [S]CC one was quite good) and with no assumptions at all.

PCC is almost a toy by modern standards.  The code it generates for a single
expression isn't bad -- it's not always perfect, but it's not bad, if the
person who ported the compiler did a good job -- but the near-total lack of
more global optimization is a performance disaster.  PCC was designed for
portability first and code quality second, at a time when compiler porting
was not nearly as well understood as it is today.

And note that I said "*implicit* assumptions" (emphasis added).  I'm not
talking about "#define MAXREGS 6", but about things like choices of
strategy based on the notion that registers are scarce and costly and it
is very important not to occupy them unnecessarily.
-- 
Subversion, n:  a superset     |     Henry Spencer at U of Toronto Zoology
of a subset.    --J.J. Horning | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

pcg@aber-cs.UUCP (Piercarlo Grandi) (05/18/89)

In article <17556@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
    In article <949@aber-cs.UUCP> pcg@aber-cs.UUCP (Piercarlo Grandi) writes:

    >... I seem to remember that they used PCC and tuned its (bastardized)
    >Sethi Ullman register allocator, which is both pretty good (even the
    >[S]CC one was quite good) and with no assumptions at all.
    
    (why is the `distribution' `eunet,world'?)

Reason again too embarrassing to mention. Will send it to you by mail :-(.
[Actually it is no longer necessary, so will change it]
    
    Anyway:  The Sethi-Ullman algorithm is intended to solve the problem
    of computing the number of registers required to evaluate a single
    expression. [ ... ]

Entirely correct. Indeed I had clarified that 'four registers is plenty' was
about C programs, integer, utilities, expression optimization, on a reg-mem
machine (like the 386-68020 to which my comment originally applied).
    
    Modern compilers, on the other hand, usually take entire functions (or,
    increasingly, take entire programs, typically in the `link' phase) and
    allocate registers globally across the thing.

Indeed the debate had moved on to discuss the number of registers useful for
inter expression optimizers, and/or for reg-reg, floating point, numeric
cases. As to this I have been guarded, not recollecting any DATA; I have
only said that my *hunch* is that a few more registers may be useful for
reg-reg and floating/numeric, and that four to eight additional registers
for inter expression optimization would probably be enough (depending on the
architecture and type of code), based on my experience of designing programs
using "register".

Somebody then offered some DATA points; John Mashey has offered his
recollection that 24-28 is a good number to have on a MIPS (hence their 32),
and some AMD fellow (Tim Olson?) has posted statistics that, using a static
analysis, their compiler can on average fill at most 6+7 registers after
global optimization (I am not clear whether this includes intra expression
optimization or not), thus providing an upper bound to the number of *useful*
registers.

And we have been given by some kind person at WRL some additional data
points, for FORTRAN, floating, numeric, global optimization, on a reg-reg
machine, and a diabolically clever compiler.  These numbers (for that is
probably the worst case) say that 4 to 7 are enough for expression
optimization, and 8 for global optimization, IFF the selection of variables
to cache is done on the basis of dynamic frequency of use; otherwise the
compiler has to use about six or seven times that number of registers (i.e. has
to cache virtually everything in sight) to get about the same improvement,
which, while good (up to 30%), is not world-shattering either.

    PCC does not now and never has done this sort of analysis; it forgets
    everything between one expression and the next.

That is why I like it; with PCC I can specify ("performance-by-design") what
I know to be dynamically important global variables to retain in registers
across expressions, where it matters, without having a very complex compiler
that implements "performance-by-second-guessing-the-designer", which is not
(according to the WRL figures) terribly effective if done using static
analysis (no surprise), and is OK if done using dynamic figures; but then why
not let the programmer do it, rather than pay a huge cost in compiler
complexity and bugginess?

Ironically, large register sets, which have some pretty important costs
(e.g. context switch), may be of benefit in reducing the complexity of
compilers; in a machine with 64-256 registers you can essentially cache
everything in sight, which may mean a less sophisticated optimizer can do
the work (you still have a problem of register *allocation*, maybe across
basic blocks, as a stack etc..., but that is much more mechanistic than
trying to do optimizations -- or not?).
    
    (Typically a peephole optimiser attempts to make up for this later.)

When the PCC peephole optimizer limited itself mostly to cleaning up
redundant code sequences and other truly peephole optimizations it was as
reliable as the compiler (i.e. quite); now that people (e.g. SUN, AT&T Sys V
5.3.2) do global analysis in the c2 "peephole optimizer", you have to mark
those sources that should not be compiled with -O... Progress!

    In compiler circles, Sethi-Ullman numbers are considered rather passe' :-)

Fads... :-). It's old, but it works... And Sethi-Ullman is almost an order of
magnitude smaller than the very sophisticated algorithms being used now (my,
in PCC the Sethi-Ullman routine is a few dozen lines long!!).

APOLOGIES:

	In some past posting I expounded on why what IMNHO are "competent"
	programmers don't need to rely on global optimizers/many registers.

	This did not imply any judgement on the competence of the person
	on whose posting I was commenting (in which he was suggesting that
	it was incomp... oops, lack of familiarity with the relevant
	literature and frequentation of "goggle-eyed" grad students that
	made me disagree with him).

	I cannot resist also apologizing, in a different vein :-), to John
	Mashey for his opinion that I may not be aware of what has happened
	in the last 15-20 years of compiler research, and for later stating
	that I spout falsehoods (numbers, please...); and to Chris Torek
	for his assumption that somebody among the learned participants in
	this forum still needed details on the "limits" of Sethi-Ullman and
	of the PCC, after my having actually extolled them.

-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

pcg@aber-cs.UUCP (Piercarlo Grandi) (05/18/89)

In article <959@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
   
    	I cannot resist also apologizing, in a different vein :-), to John
    	Mashey for his opinion that I may not be aware of what has happened
    	in the last 15-20 years of compiler research, and for later stating
    	that I spout falsehoods (numbers, please...);

I have to apologize, more seriously, for the latter attribution. It was Tim
Olson, not John Mashey (sorry, John):

    From: tim@crackle.amd.com (Tim Olson)
    Message-ID: <25651@amdcad.AMD.COM>

    In article <950@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
    | there is the little matter of a few FACTs, called CRISP, NOVIX and
    | TRANSPUTER, that seem to be always forgotten (not to mention the 32532,
    | which has the extremely embarrassing property of being a simple, well
    | designed, reg-mem CISC that outruns most RISCs around...) by reg-reg
    | and otherwise RISC designs.
    
    You speak of FACTS, then spout FALSEHOODS.  Rather than guessing, why
    don't you really study the performance of the processors you mention as
    compared to current RISC processors.
		^^^^^^^

Probably my memory fails me here, but I seem to remember that in
architectural comparisons you don't compare processors that have been around
for years with processors that have just been released; you compare them at
equivalent levels of technology (we have even been arguing as to whether the
386 is the technological equivalent of the 68020 or 68030 -- on which I may
have changed my opinion, incidentally, so far, to the '68025' :-]).

As far as I know, *current* RISC processors are not the same as those that
you can find *around* yourself. MIPS are not embarrassed because their most
successful product so far, the R2000, which has been *around* for a while,
is vastly inferior to the R3000, which is their *current* implementation
(Sun may be a bit, but then they are expecting the new vastly improved SPARC
chips from Cypress and Fujitsu to take over...).

I am curious to see if National will want to do a successor to the 532, and
if NOVIX will be able to, and if INMOS will descend on earth and do the
same... :-) (CRISP is a no-goer, of course, unfortunately).
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

ron@motmpl.UUCP (Ron Widell) (05/19/89)

In article <922@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>
>Also, the 68020 was introduced not much before the 386, and certainly the
>386 has been around far longer than the 68030. Or I am goofing?
>
You're goofing.
The 68020 was in customers' hands (with some errata, i.e. bugs) through
normal distribution channels in 1984, about two years before the 386, and
was bug-free (as far as we could tell) in 1985.

The 030 and 386 are contemporaries (within about six months).

>-- 
>Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
>Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
>Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

Regards,

-- 
Ron Widell, Field Applications Eng.	|UUCP: {...}mcdchg!motmpl!ron
Motorola Semiconductor Products, Inc.,	|Voice:(612)941-6800
9600 W. 76th St., Suite G		| I'm from Silicon Tundra,
Eden Prairie, Mn. 55344 -3718		| what could I know?

daveh@cbmvax.UUCP (Dave Haynie) (05/19/89)

in article <922@aber-cs.UUCP>, pcg@aber-cs.UUCP (Piercarlo Grandi) says:
> Summary: 68030 == 68020+10/15%

> You get me a 68030 and I tell you :->... From various papers and benchmarks
> on the few 68030 machines around (Next, Macs, Suns), it seems that one
> statement I have read (68030 == 68020+10/15%) is quite reasonable overall.

Apparently, if you do nothing more than drop a 68030 into an existing 68020
system, this is almost exactly what you get.  The Mac II vs. IIx is a good
example of this situation.  However, once you design a real 68030 system,
it's much more like (68030 = 68020+50/100%).  Especially in systems that
use the Motorola MMU.  A 68030 system can access memory exactly twice as fast
as the 68020/68851 combo, at the same clock speed.  It also has a cache line
"burst" fetch mode that allows much more efficient operation between an '030
and a medium speed memory system.  NeXT, for instance, uses this burst mode,
and gets decent performance out of slow memory.  They could be accessing
memory around 50% faster than they are, though -- a machine in the price range
of the NeXT should have an external cache.  Some decent examples of tight
'030 systems are the Apollo 4500 and the HP 9000/370.  They ain't cheap, but
they're examples of what the CPU can do.  Now if only the bean counters would
let us '030 PC designers build 'em like the Workstation guys get to....

> Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
> Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
> Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
-- 
Dave Haynie  "The 32 Bit Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
              Amiga -- It's not just a job, it's an obsession

daveh@cbmvax.UUCP (Dave Haynie) (05/19/89)

in article <4152@ficc.uu.net>, peter@ficc.uu.net (Peter da Silva) says:

> And if it weren't for the sea of dollars you refer to, the 386es would cost
> as much as, if not more than, the 030s. By the way, the first 030 implementation
> I know of was the CSA 68030 card for the Amiga.

I think you could get a VME card from Motorola before that, but I can't think
of anything else.

>> If I were going to compare the Intel/Motorola families, I would couple
>> them as 6809/8088, 68010/80286, 68020/386.

> 68020 -- 80286

Time frame wise, fer shure; the 68020 came rapidly on the heels of the actual
release of the 68010; one of the reasons the '010 fell into the cracks.  As
for performance, the '286 pales in comparison, though the marketplace's great
acceptance of the '286 led to real fast versions of it long before any fast
68000s or 68010s came along.  Neither the '286 nor the 68000/10 had all you'd
want in a CPU.

> 68030 -- 80386

For what you get, this is a good pairing.  They're in at least the same 
league, speed wise, both have built-in paging MMUs, etc.  The first 20MHz
68030s I got my hands on were last May or so.  In fact, I got a 20MHz part,
hot off the presses, one Friday morning, and a 25 MHz part hand-carried
later that afternoon.  Both were of the "it works at room temperature", no
marks on the package, variety.  Moto underestimates themselves; the 20MHz
part has been running daily for over a year...

> 68040 -- 80486	# Both announced but not available.

These seem to track each other feature-wise, though from what I've read of
each, the '040 looks much more sophisticated in many respects.  That's not to
say Intel didn't do their homework; they did.  But they seem to have spent
lots of thought on how to make the '486 a much better engine for MS-DOS
applications, rather than how they might make it the fastest possible thing
that eats up '386 instructions.  Considering what '486 machines will be 
doing, I think that was the right decision.

> My memory is that the 68020 has been out nearly as long as the 286, and that
> the 68030 has been around nearly as long as the 68030.

I could hardly argue that last one.  First big Motorola splash on the '030
was at Fall Comdex a year and a half ago.  I think folks were building '386
machines then, but they weren't significant.  Time flies...

> Peter da Silva, Xenix Support, Ferranti International Controls Corporation.
-- 
Dave Haynie  "The 32 Bit Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
              Amiga -- It's not just a job, it's an obsession

tim@crackle.amd.com (Tim Olson) (05/19/89)

In article <959@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
| Somebody then offered some DATA points; John Mashey has offered his
| recollection that 24-28 is a good number to have on a MIPS (hence their 32),
| and some AMD fellow (Tim Olson?) has posted statistics that, using a static
| analysis, their compiler can on average fill at most 6+7 registers after
| global optimization (I am not clear whether this includes intra expression
							    ^^^^^
Don't you mean "inter" here?

| optimization or not), thus providing an upper bound to the number of *useful*
| registers.

The statistics posted said that, on average, 6.5 global registers (not
live across a function call) and 7.0 local registers (live across a
function call) were used per function. However, this is certainly not an
"upper bound to the number of *useful* registers."  As others have
pointed out, registers can also be used for parameter passing and return
values, emulating a stack cache, etc.

| 	I cannot resist also apologizing, in a different vein :-), to John
| 	Mashey for his opinion that I may not be aware of what has happened
| 	in the last 15-20 years of compiler research, and for later stating
| 	that I spout falsehoods (numbers, please...); and to Chris Torek
	^^^^^^^^^^^^^^^^^^^^^^^
That was me.  I sort of flew off the handle there, but you made certain
assertions that *you* have not backed up by any data.



	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)