[comp.arch] how many regs

mash@mips.COM (John Mashey) (05/09/89)

In article <907@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
....
>Well, in a long discussion a few months ago in comp.lang.c, nobody has been
>able (in a period of two months) to quote any FIGURES that support this and
>other urban legends (e.g. Elvis is alive and designed the Z80000 :->). There
>are plenty of figures though about the average extreme simplicity of actual
>statements/expressions/constructs in algorithmic level languages, which does
>not bode well for the usefulness of large register files.

The simplicity of source statements has little to do with the number of
registers desirable, unless the only compiler you have generates code
on a statement-by-statement basis only, i.e., no optimization.
For example, consider a typical RISC (i.e., load/store), and the C stmts:
	a = b + 5;
	c = b + 7;
If you generate code for each statement at a time, and don't do register
allocation, you need just one register:
	load r1,b; add r1,r1,5; store r1,a
	load r1,b; add r1,r1,7; store r1,c
Some simple inter-statement optimization would want 2 regs:
	load r1,b; add r2,r1,5; store r2,a
	add r2,r1,7; store r2,c
And if you were in the middle of code that had more references to these, and
had allocated a: r2, b: r1, c: r3, you now have 3 regs:
	add r2,r1,5; add r3,r1,7
and if these statements were in the middle of a loop someplace where the 
optimizer had already identified these as useful expressions to have around,
you could actually get 5 regs used (a,b,c,b+5,b+7), although the resulting
code for these statements might then look like:
	(nothing: later references to a or c would reference the regs where
	b+5 and b+7 were stored)
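To make that concrete, here is a hedged C sketch (the function and names
are invented for illustration, not taken from any real benchmark): once the
optimizer keeps a and c live across the loop, the "later references" inside
the loop are just register reads, and the number of registers in use grows
well beyond what the isolated statements would suggest.

	static int use_both(int b, int n)
	{
		int a = b + 5;	/* lives only in a register after optimization */
		int c = b + 7;	/* likewise; never stored to memory */
		int sum = 0, i;

		for (i = 0; i < n; i++)
			sum += a * i + c;	/* later references: register reads */
		return sum;
	}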
	
>Even some of the RISC guys, and some that do a great deal of inter-expression
>register assignments (which I don't like at all -- long live "register"!),
>like MIPS, don't see a great advantage in extravagant register file sizes.
Why don't you like inter-expression register assignments?

>	If John Mashey could give us some of the numbers that MIPS have
>	surely got on where they estimate the knees of the curves for
>	registers for intra and inter statement optimization to be, we would
>	be greatly enlightened (e.g. as to why their register file is so
>	different from that of SPARC and 29K); maybe he has already, and I
>	have missed them.

A few years ago, we did the experiment of running the number of registers
up and down to see what happened.  For our machine, for our compilers,
for whichever benchmarks we did (large programs, but I don't recall which),
the knee of the curve was in the 24-28 range, for generally-allocatable
registers.  Both HP and IBM found the same range in independent studies,
although I don't think this is published anywhere, it being the kind of
data obtained in bars arguing over architecture.  We haven't kept such
analyses around, having already made the decisions relevant thereto.
However, I would observe that I've looked at tons of object code, and
the registers get used.  Note that they are useful in two distinct ways:
	1) To evaluate expressions, including global optimizations.
	2) To have enough scratch registers that many functions need
	0 (leaf) or 1 register, unless the optimizer decides it's really
	worth having a bunch of registers.  Note that if you only have
	X registers available, and you generally need approx. X to do
	reasonable expression evaluation, you must save/restore a healthy
	percentage of X registers across function calls, or go completely
	to callee-save.  Most people with this kind of architecture
	have found it best to split the registers between callee-save
	and caller-save.  In our case, we save about 1.6 regs/average
	function call, across a wide range of benchmarks, and that is due to
	having ENOUGH registers to allow both safe and scratch registers,
	and still have enough scratch registers to do plenty of evaluation.
Note that 2) is a subtle issue, easily overlooked; but is very important,
especially in the "register-window vs non-register-window" wars.

>....
>As to some small bit of available evidence, it is not very controversial
>that the 386 is usually a tad faster than the 68020 with roughly equivalent
>system technology (e.g. a cached 386 at 20Mhz typically beats a cached 68020
>at 25 Mhz under gcc), and code size is a tad smaller as well.  This may not
>mean much, may not be just *because* it has less registers, but it seems to
>indicate that at least the smaller number of registers does not hurt too
>much.
I'm not sure I necessarily believe the relative performance claim; in any
case, I would bet the biggest difference is attributable to the
2-cycle bus access (386) versus 3-cycle access (68020).
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

pcg@aber-cs.UUCP (Piercarlo Grandi) (05/10/89)

In article <19063@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
    
    The simplicity of source statements has little to do with the number of
    registers desirable, unless the only compiler you have generates code
    on a statement-by-statement basis only, i.e., no optimization.
					    ^^^^^^^^^^^^^^^^^^^^^
Optimization is not just (and maybe even not most importantly) inter
statement...

    For example, consider a typical RISC (i.e., load/store), and the C stmts:
    	a = b + 5;
    	c = b + 7;
	[ .... ]

Your example works, but under special case assumptions: that you are working on
a reg-reg architecture, whereas we were discussing reg-mem ones; that putting
all three a,b,c in registers is worthwhile because they are going to be used
heavily in other parts of the program.

The reg-reg assumption actually may point at one of their weaknesses: since
computing with operands left in memory carries a high fixed cost, you tend to
want to store everything in regs, even values that are used little. In a
reg-mem architecture, little-used variables kept in memory do not carry costs
as high when you use them.
    	
    Why don't you like inter-expression register assignments?

Well, I like them, as long as the compiler does not do them, but the
programmer does, by using explicit "register" declarations. But let's not
resurrect the comp.lang.c debate here, and not now (it will restart in
comp.lang.c, as soon as I can reload my notes from then... :-/ :-/).
    
    A few years ago, we did the experiment of running the number of registers
    up and down to see what happened.  For our machine, for our compilers,
    for whichever benchmarks we did (large programs, but I don't recall which),
    the knee of the curve was in the 24-28 range, for generally-allocatable
    registers.  Both HP and IBM found the same range in independent studies,
    [ .... ]

Uh? This really astonishes me. I would have bet that even for a RISC, even
doing inter statement optimization, the number was about 8-12 rather than 24
(rationale: 4 scratches + 4 for register variables automatically chosen by
the compiler+4 for RISC'iness at most).

    However, I would observe that I've looked at tons of object code, and
    the registers get used.

Disclaimer: I have only worked extensively on reg-mem machines so far. For
such machines I beg to differ; my impression is that for intra statement
optimization four scratch regs is enough, and for inter statement
optimization ("register" variables) another four is enough. Hence my hunch
that 8 (386), or 3+3 (plus 2 for system work) is a bit tight, but still
tolerable, and 16 (68020) is even overabundant.

    Note that they are useful in two distinct ways:

    	1) To evaluate expressions, including global optimizations.

Conceded (as long as the global optimizations are done by the programmer,
or, ahem, are implicit in suitably designed language constructs).

    	2) To have enough scratch registers that many functions need
    	0 (leaf) or 1 register, unless the optimizer decides it's really
    	worth having a bunch of registers.  Note that if you only have
    	X registers available, and you generally need approx. X to do
    	reasonable expression evaluation, you must save/restore a healthy
    	percentage of X registers across function calls, or go completely
    	to callee-save.  Most people with this kind of architecture
    	have found it best to split the registers between callee-save
    	and caller-save.  In our case, we save about 1.6 regs/average
    	function call, across wide range of benchmarks, and that is due to
    	having ENOUGH registers to allow both safe and scratch registers,
    	and still have enough scratch registers to do plenty of evaluation.

    Note that 2) is a subtle issue, easily overlooked; but is very important,
    especially in the "register-window vs non-register-window" wars.

Ahhhhhhhh. What you are saying is that you are using registers as a
statically allocated cache, and that this is good not because they are
frequently used, but because they would otherwise be frequently
saved/restored... Well, well, well. If you want a reg-reg architecture, you
pay the price, you take your chances. Me, my idea of RISC is a (mostly) zero
address architecture with 8/12 bit instructions, and four (to avoid extra
push/pop pairs in multiplexing a single one for the up to four independent
computations) arith stacks.

Note that it is assumed that RISC == reg-reg, and that load-store == reg-reg;
neither of these equations is necessarily true, as one could have RISC ==
stack-stack or load-store == stack-stack... 
    
	[ ... me saying that the 386 is faster than 68020 at same Mhz ... ]
    I'm not sure I necessarily believe the relative performance claim;

Well, I admit I exxxagerated a bit :->; e.g., while I get about 5% more
Dhrystones from my home 386@20Mhz than from the Sun3/280@25Mhz at work,
the difference is not very significant... I would reckon that overall the
386 is (conservatively) 10-15% faster than the 68020 at the same Mhz.
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

mash@mips.COM (John Mashey) (05/11/89)

In article <927@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>In article <19063@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>    
>    The simplicity of source statements has little to do with the number of
>    registers desirable, unless the only compiler you have generates code
>    on a statement-by-statement basis only, i.e., no optimization.
>					    ^^^^^^^^^^^^^^^^^^^^^
>Optimization is not just (and maybe even not most importantly) inter
>statement...
>
>    For example, consider a typical RISC (i.e., load/store), and the C stmts:
>    	a = b + 5;
>    	c = b + 7;
>	[ .... ]
>
>Your example works, but under special case assumptions: that you are working on
>a reg-reg architecture, whereas we were discussing reg-mem ones; that putting
>all three a,b,c in registers is worthwhile because they are going to be used
>heavily in other parts of the program.
	These are not special case assumptions, although, of course the
	example was put together to illustrate the point.
	1) These days, many machines are reg-reg architectures.
	2) Optimizers do find this kind of stuff, especially in FORTRAN
	codes or some of the larger hunks of C.  They don't have to be used
	heavily, they just have to be used enough to make it worthwhile.
	3) I didn't realize the discussion was limited to reg-mem
	architectures: I recall the following part of a posting:
-------
>Even some of the RISC guys, and some that do a great deal of inter-expression
>register assignments (which I don't like at all -- long live "register"!),
>like MIPS, don't see a great advantage in extravagant register file sizes.
>
>	If John Mashey could give us some of the numbers that MIPS have
>	surely got on where they estimate the knees of the curves for
>	registers for intra and inter statement optimization to be, we would
>	be greatly enlightened (e.g. as to why their register file is so
>	different from that of SPARC and 29K); maybe he has already, and I
>	have missed them.
--------
I did, although I'm sorry it wasn't more.  It's not something we have
much motivation to keep current, unlike the plethora of other statistics
that we keep around.  Should be a good paper topic for somebody.
It IS important to account for other issues, and there's no reason
that the answer should be the same for other architectural designs.
I.e., if you end up with memory ops, rather than load-store, you're less
motivated to have more registers [you'd think].  On the other hand,
even there, you sometimes win because of cycle-count (not instruction-count)
issues, i.e., the latency cycle(s) you get on most machines from fetching
data from memory (even cache memory).

>The reg-reg assumption actually may point at one of their weaknesses, that
>since the cost of computing with parts of your operands in memory has a high
>fixed cost, you tend to want to store everything in regs, even if they are
>used little. In a reg-mem architecture little use variables in memory do not
>carry costs as high when you use them.
You'd be surprised, especially in heavily-pipelined machines.  You must be
thinking of counting INSTRUCTIONS, not cycles: most fast (i.e., seriously
pipelined) machines cost you a stall cycle if you want to fetch something
from memory and use it right away, so even on a machine with mem->reg
operations, you might choose to sometimes generate a load, followed by an
op, because you might be able to rearrange code and get something in to
cover the load latency. [people sometimes found this on the S/370s].
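As a hedged illustration (the instruction sequences in the comment are
invented, not real compiler output): even when a memory operand could be
folded into the op, splitting it into a load plus an op lets independent
work cover the latency.

	/* folded:     add  r1,r1,mem_x    ; may stall waiting for x
	 * scheduled:  load r2,mem_x
	 *             add  r3,ra,rb       ; independent work covers the latency
	 *             add  r1,r3,r2
	 */
	int covered(int a, int b, const int *x)
	{
		int t = *x;	/* the load is issued early...                  */
		int u = a + b;	/* ...this independent add can hide its latency */
		return u + t;	/* the loaded value is used one op later        */
	}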
    	
>    Why don't you like inter-expression register assignments?

>Well, I like them, as long as the compiler does not do them, but the
>programmer does, by using explicit "register" declarations....

OPINION: the above statement sends me back to 15-20 years ago....
really, if you believe this, you are not keeping up with what's
happening in the computer business.

>Disclaimer: I have only worked extensively on reg-mem machines so far. For
>such machines I beg to differ.
As noted, there is no "right" number for every architecture and
language; you'll get away with fewer registers in C than some of the
others.
>
>    Note that they are useful in two distinct ways:
>
>    	1) To evaluate expressions, including global optimizations.
>
>    	2) To have enough scratch registers that many functions need
>    	0 (leaf) or 1 register, unless the optimizer decides it's really
>    	worth having a bunch of registers.  Note that if you only have
>    	X registers available, and you generally need approx. X to do
>    	reasonable expression evaluation, you must save/restore a healthy
>    	percentage of X registers across function calls, or go completely
>    	to callee-save.  Most people with this kind of architecture
>    	have found it best to split the registers between callee-save
>    	and caller-save.  In our case, we save about 1.6 regs/average
>    	function call, across wide range of benchmarks, and that is due to
>    	having ENOUGH registers to allow both safe and scratch registers,
>    	and still have enough scratch registers to do plenty of evaluation.
>
>    Note that 2) is a subtle issue, easily overlooked; but is very important,
>    especially in the "register-window vs non-register-window" wars.
>
>Ahhhhhhhh. What you are saying is that you are using registers as a
>statically allocated cache, and that this is good not because they are
>frequently used, but because they would otherwise be frequently
>saved/restored... Well, well, well. If you want a reg-reg architecture, you
No.  The registers are frequently used. I said the issue was subtle.
In a leaf routine,  (on an R3000, but also, very similar on others')
	1) One need not save/restore the return address
	2) Most (or usually) all of the local variables get grabbed into
		scratch registers that need not be saved.
	3) Now, the stack frame has evaporated, and so we need not move
	the stack pointer around, and we already usually didn't have
	a frame pointer.
Since leaf routines are often about 50% of the dynamic function calls, 
this is relevant, and a similar, albeit less strong effect happens on others.
Having plenty of scratch registers also means you can pass a reasonable
number of arguments in registers, avoiding doing stores in the caller,
and loads in the callee.  The point is, that a lot of load/store
traffic around function calls disappears if you have enough registers
and smart compilers (whether or not you have windows, which of course,
can get rid of a few more).  Fast machines hate loads, because they
usually cost you stall cycles.
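As a hedged illustration of the leaf-routine case (this is just the shape of
the argument, not actual MIPS compiler output): the routine below calls
nothing, so its arguments and temporaries can live entirely in caller-save
scratch registers; there is nothing to save, no return address to spill, and
no stack frame to build.

	static int clamp(int v, int lo, int hi)	/* a leaf: calls nothing */
	{
		if (v < lo)
			return lo;	/* arguments arrive in registers         */
		if (v > hi)
			return hi;	/* temporaries stay in scratch registers */
		return v;		/* the result leaves in a register       */
	}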
>pay the price, you take your chances. Me, my idea of RISC is a (mostly) zero
>address architecture with 8/12 bit instructions, and four (to avoid extra
>push/pop pairs in multiplexing a single one for the up to four independent
>computations) arith stacks.
This is a fine OPINION; the current round of new computer architectures has
voted widely, and decisively, for load-store machines with "plenty"
of registers addressable at any point in the program. (plenty = usually
32, as in HP PA, MIPS, SPARC, MC88000, i860).  In particular, although
I've always admired the old B5500, it seems that zero-address architectures
are difficult to build to really go fast...
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

preston@titan.rice.edu (Preston Briggs) (05/12/89)

Why I like registers

1) My code generators have three parts:
instruction selection, register allocation, and instruction scheduling.

These are all difficult problems; each is NP-Complete.
Additionally, they interfere with each other.
The best instruction depends on the register class of the operands,
pipeline scheduling increases register lifetimes, and so forth.

Lots of registers provides a simplifying separation of concerns.

With adequate registers: 
  I can choose instructions
    without worrying about running out of a particular class;
  I can schedule instructions before register allocation without 
    the artificial anti-dependences introduced by register allocation
    and without worrying about excessively long register lifetimes
    (a small sketch follows this list); and
  I can use some sort of global (intra-routine) coloring allocator
    to avoid ad hoc local methods.
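Here is a hedged sketch of the anti-dependence point above (the register
names in the comment are invented): if allocation runs first and forces both
loaded values into one register, the scheduler can no longer overlap the two
loads.

	/* allocated first, r1 reused:     allocated with registers to spare:
	 *   load r1,(a); add r2,r1,1        load r1,(a); load r3,(b)
	 *   load r1,(b); add r4,r1,1        add  r2,r1,1; add r4,r3,1
	 */
	int sum_next(const int *a, const int *b)
	{
		int x = *a + 1;
		int y = *b + 1;	/* with a spare register, this load can be */
		return x + y;	/* hoisted above the first add             */
	}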

2) The above arguments generally apply to integer registers.
I also like lots of FP registers.  These are harder to use;
generally FP values are hidden in arrays where optimizers
can't safely get at them.  By using (more expensive) dependence-based
optimization, a variety of transformations can be applied
to use (very profitably) almost any number of FP registers,
particularly with "typical, numeric Fortran."  From local
experiments, we want more than 16 FP registers, perhaps 32
is adequate.

For examples, see "Estimating and improving interlock for
pipelined architectures" by Callahan, Cocke, and Kennedy.
Proceedings of the 1987 International Conference on Parallel
Processing, August 87.  Or "Compiling C for Vectorization,
Parallelization, and Inline Expansion" by Allen and Johnson,
SIGPLAN 88.  Or (more to the point, but hard to get?) "Why even
scalar machines need vector compilers" by Allen and Lew,
TR, Ardent Computer, January 88.

3) An experiment.  An integer Fortran(!) program.  A non-recursive
version of quicksort.  An experimental, optimizing compiler
for the IBM RT (16 integer registers, 1 is stack pointer).
Compiler does partial redundancy elimination (global common
subexpressions and loop invariants), value numbering over
extended basic blocks, and dead code elimination.
(No strength reduction or global constant propagation).
Uses a Chaitin-style, graph coloring, global register allocator.

With 16 registers, it sorts 200,000 integers in 8.2 seconds.

regs		spills		run-time	obj-size
---------------------------------------------------------
16		3 live ranges	8.2 seconds	360 bytes
14		5		8.3		368
12		8		8.7		400
10		13		10.0		440
8		17		13.2		464

Some caveats: I only tried this one integer program
(as a part of another study), maybe our register allocator
isn't the best for just a few registers, and so forth...

On the other hand, our allocator does a good job and lots of FP
intensive programs suggest that we could often effectively use many
more than 16 integer registers.




Finally, let's talk about optimization.  
Generally, optimization competes with register allocation --
optimization lengthens live ranges and increases register pressure.

I think that optimization by the compiler, rather than by the programmer, is a Good Thing.
It lets the programmer write good code more quickly and compactly.

For example, consider the simple C statements

	for (i=0; i<length; i++)
	    array[i] = brray[i] + crray[i];

Well, we could strength reduce by hand (and save a very important
loop test (sarcasm)), giving

	ap = &array[0];
	bp = &brray[0];
	cp = &crray[0];
	do {
	    *ap = *bp + *cp;
	    ap++;
	    bp++;
	    cp++;
	} while (ap < &array[length]);

But, I'd like to see the optimizer produce

	i = 0;
	if (i < length) {
	    do {
		*(array+i) = *(brray+i) + *(crray+i);
		i++;
	    } while (i < length);
	}

This version saves registers, saves branches, is safe,
and creates opportunities for hoisting loop invariants.

The first example is, I suspect, easier to write and maintain than
the other examples.  Additionally, it's also easier for the
optimizer to understand.  

Finally, the straightforward style
is more portable (if you do all your optimization at the source level,
you must know how many registers are available, ...).
For (a final) example, consider DMXPY, from LINPACK.
The basic computation is:
	do 1 j = 1, n2
	    do 1 i = 1, n1
1		y(i) = y(i) + x(j) * m(i, j)

In LINPACK though,
it's been carefully hand coded to produce nice code
on some machine.  Many loops have been unrolled and
the results are probably fabulous on a Cray.
The main loop looks like
	do 1 j = jmin, n2, 16
	    do 1 i = 1, n1
1		y(i) = ((((((((((((((( (y(i))
		     + x(j-15) * m(i, j-15)) + x(j-14) * m(i, j-14))
		     + x(j-13) * m(i, j-13)) + x(j-12) * m(i, j-12))
		     + x(j-11) * m(i, j-11)) + x(j-10) * m(i, j-10))
		     + x(j- 9) * m(i, j- 9)) + x(j- 8) * m(i, j- 8))
		     + x(j- 7) * m(i, j- 7)) + x(j- 6) * m(i, j- 6))
		     + x(j- 5) * m(i, j- 5)) + x(j- 4) * m(i, j- 4))
		     + x(j- 3) * m(i, j- 3)) + x(j- 2) * m(i, j- 2))
		     + x(j- 1) * m(i, j- 1)) + x(j   ) * m(i, j   )

A fairly complex expression.  The results weren't
very fabulous on my RT.
I count 16 floating point values
that are loop invariant in the i loop.  Tough to handle
with only 8 FP registers.  It would also take a healthy optimizer
to generate a minimal set of addressing expressions for all the
array references.

On the other hand, the basic code is (by comparison) crystal clear.
A fancy, dependence-based optimizer could rework it to run quickly
on an RT, a MIPS, or a Cray; the optimizations used depending
upon the architecture of the target.

So, ...
lots of inflammatory material I guess.

	Regards,
	Preston Briggs

bcase@cup.portal.com (Brian bcase Case) (05/12/89)

[A response pointing out that inter-statement optimization creates
opportunities for increased register use.]

>Your example works, but under special case assumptions: that you are working on
>a reg-reg architecture, whereas we were discussing reg-mem ones; that putting
>all three a,b,c in registers is worthwhile because they are going to be used
>heavily in other parts of the program.

A value needn't be "heavily" used in order to benefit; if it is reused
once and kept in a register instead of in memory, then the program has been
sped up.  On a CISC or a RISC, this is true.  A well-designed CISC (like
the 486 or the 040) can get as much benefit from lots of registers as
can a RISC.  (This results from the fact that the cores of these well-
designed CISCs are essentially RISC pipelines....)

>The reg-reg assumption actually may point at one of their weaknesses, that
>since the cost of computing with parts of your operands in memory has a high
>fixed cost, you tend to want to store everything in regs, even if they are
>used little. In a reg-mem architecture little use variables in memory do not
>carry costs as high when you use them.

A memory-resident value on a reg-mem machine might not carry as high a cost
in *code size,* but the same number of memory references must be executed
on a reg-reg machine and a reg-mem machine.  The same number of processor
cycles will be used.  If the 30% code size penalty of a reg-reg machine is
of concern to you, then, as discussed a while back, the use of a reg-mem or
mem-mem machine is obviously warranted!
    	
>>    the knee of the curve was in the 24-28 range, for generally-allocatable

>Uh? This really astonishes me. I would have bet that even for a RISC, even
>doing inter statement optimization, the number was about 8-12 rather than 4
>(rationale: 4 scratches + 4 for register variables automatically chosen by
>the compiler+4 for RISC'iness at most).

The difference comes from the clever things that can be done for procedure
calls when enough registers are available.

>Disclaimer: I have only worked extensively on reg-mem machines so far. For
>such machines I beg to differ; my impression is that for intra statement
>optimization four scratch regs is enough, and for inter statement
>optimization ("register" variables) another four is enough. Hence my hunch
>that 8 (386), or 3+3 (plus 2 for system work) is a bit tight, but still
>tolerable, and 16 (68020) is even overabundant.

This is if you are willing to keep values in memory and pay the attendant
memory access penalty.

>Ahhhhhhhh. What you are saying is that you are using registers as a
>statically allocated cache, and that this is good not because they are
>frequently used, but because they would otherwise be frequently
>saved/restored... Well, well, well. If you want a reg-reg architecture, you
>pay the price, you take your chances. Me, my idea of RISC is a (mostly) zero
>address architecture with 8/12 bit instructions, and four (to avoid extra
>push/pop pairs in multiplexing a single one for the up to four independent
>computations) arith stacks.

This is a strange idea of RISC (my opinion).  But you are entitled to your
own definition.  However, don't be surprised if few other people share
your idea.  (There is, by now, a large body of literature that uses a
definition of RISC, albeit sometimes a weakly stated one, different from
yours.  Precedent is not on your side.)

>Note that It is assumed that RISC == reg-reg, and that load-store == reg-reg;
>neither these equations are necessarily true, as one could have RISC ==
>stack-stack or load-store == stack-stack... 

Again, few people, I think, will understand that you are so liberally
interpreting RISC unless you explicitly say so.  Maybe your idea of RISC
comes partly from the claims that the Transputer is a RISC?

elg@killer.Dallas.TX.US (Eric Green) (05/12/89)

in article <927@aber-cs.UUCP>, pcg@aber-cs.UUCP (Piercarlo Grandi) says:
> In article <19063@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>     
>>     The simplicity of source statements has little to do with the number of
>>     registers desirable, unless the only compiler you have generates code
>>     on a statement-by-statement basis only, i.e., no optimization.
>> 					    ^^^^^^^^^^^^^^^^^^^^^
>> Optimization is not just (and maybe even not most importantly) inter
>> statement...
>> 
>>     For example, consider a typical RISC (i.e., load/store), and the C stmts:
>>     	a = b + 5;
>>     	c = b + 7;
>> 	[ .... ]
>> 
> Your example works, but under special case assumptions: that you are working on
> a reg-reg architecture, whereas we were discussing reg-mem ones; that putting
> all three a,b,c in registers is worthwhile because they are going to be used
> heavily in other parts of the program.

I program a 68000 a lot. I suspect a 68000 is a fairly typical
reg-memory machine. I write a lot of "C" code. The "C" compiler I'm
using is fairly PCC-like, i.e. loses all register values between
expressions. In performance-critical portions of the code, I end up
modifying the assembly language output to keep as many values in
registers as possible. This is especially important with globals,
because they can't be put into registers normally.  Presto, fewer
instruction bytes, as much as 20% improvement in performance. If only
the compiler did it for me, eh?

But wait -- I've used such a machine -- a Pyramid 90x. It has a fairly
state-of-the-art compiler with a global optimizer, and basically
ignores "register" declarations. When I try to go into the assembly
code there and speed it up, guess what? I can't improve the register
allocation one iota. 

Anecdotal evidence, certainly. But good enough for me to conclude
that having lots of registers and a good global optimizer is a Good
Thing. People who disagree with that must have never looked at the
assembler-language output of their code on different machines....

> used little. In a reg-mem architecture little use variables in memory do not
> carry costs as high when you use them.

Foo. If you use the variable three times, you've saved 4 memory
fetches (2 addresses, 2 data) as vs. keeping it in a register. No matter
what kind of machine you're using.
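As a rough sketch of that arithmetic (the counts assume a 68000-style
absolute access with extension words for the address, and are illustrative
only, not measured):

	extern int level;			/* a global, absolute-addressed     */

	int scaled(int a)
	{
		register int l = level;	/* one address fetch + one data fetch */
		return (l + a) * (l - a) + l;	/* three uses, all register reads; */
	}					/* keeping it in memory would pay   */
						/* 3 address + 3 data fetches       */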


>>     Why don't you like inter-expression register assignments?
> Well, I like them, as long as the compiler does not do them, but the
> programmer does, by using explicit "register" declarations. But

Have you ever looked at the output of a "C" compiler, and compared the
output of GCC to PCC (i.e. optimizing compiler vs. non-optimizing)?
Re-compiling a program with GCC is a sure-fire way of speeding it up
by 20% ;-).

> pay the price, you take your chances. Me, my idea of RISC is a (mostly) zero
> address architecture with 8/12 bit instructions, and four (to avoid extra
> push/pop pairs in multiplexing a single one for the up to four independent
> computations) arith stacks.

Uh, excuse me, have you ever read the various RISC papers? Reaching
over to my handy boxful of 6x5 cards.... urk, can't find the one I
wanted, the 1986 David Patterson overview in CACM, but it should be
easy enough to find. I assume you know how to use the indexes in your
library? Look up a few RISC papers, then come back. Then you'll be
able to argue convincingly  about the various merits of register-register vs.
register-memory vs. stack models (hints: pipelining, locality of
reference >80% for code memory, cache size, program-fetch bandwidth
vs. data fetch bandwidth, ...). If you have something to say that
wasn't said in the latest CISC/RISC wars, I'm sure everybody would
appreciate hearing it -- but a word of warning, not much was left
UNSAID.

(the biggest argument of the RISC guys is that, because of locality of
reference program-wise, program memory bandwidth is almost
unlimited... it's data-memory bandwidth that's now the main limit,
which is why they want lots of registers and mostly
register-to-register operations. CISC folks, of course, say that those
qualities certainly aren't restricted only to RISC.... but, anyhow,
your reg-mem architecture is blown all to heaven by what the RISC
folks have actually PRODUCED IN SILICON, i.e. it is NOT the bemused
speculations of a goggle-eyed grad student).

> Note that It is assumed that RISC == reg-reg, and that load-store == reg-reg;
> neither these equations are necessarily true, as one could have RISC ==
> stack-stack or load-store == stack-stack... 

stack-stack must be reg-reg in order to be adequately fast. AT&T had a
novel architecture some years back called CRISP, which, if I recall,
was a stack-oriented RISC machine. AT&T eventually decided to use
SPARC for their RISC processor instead, in the event they built a
RISC-based computer... I don't really know what became of all that,
alas. It's possible to organize registers in a number of different,
novel ways. stack-stack saves on program-memory bandwidth, at the cost
of reducing flexibility of register use. But note that program-memory
bandwidth is the one thing there's no shortage of.

--
|    // Eric Lee Green              P.O. Box 92191, Lafayette, LA 70509     |
|   //  ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg     (318)989-9849     |
|  //    Join the Church of HAL, and worship at the altar of all computers  |
|\X/   with three-letter names (e.g. IBM and DEC). White lab coats optional.|

slackey@bbn.com (Stan Lackey) (05/13/89)

In article <8082@killer.Dallas.TX.US> elg@killer.Dallas.TX.US (Eric Green) writes:
>in article <927@aber-cs.UUCP>, pcg@aber-cs.UUCP (Piercarlo Grandi) says:
>> used little. In a reg-mem architecture little use variables in memory do not
>> carry costs as high when you use them.
>
>Foo. If you use the variable three times, you've saved 4 memory
>fetches (2 addresses, 2 data) as vs. keeping it in a register. No matter
>what kind of machine you're using.

Don't confuse old microprocessors (the discussion context was 68000)
with real computers.  Yes I agree the 68000 serially did its memory
accesses as it needed them.  Many current CISCs (1) have
instruction caches to supply the addresses in zero time and (2)
pipeline the memory data fetches in an earlier stage than the
execution.

As an example, on the Alliant, the instruction:

	ADDF <ea>, fp0

(1 cycle) takes the same amount of time as

	ADDF fp1, fp0

for displacement (16 and 32-bit), absolute, and auto-inc and -dec
addressing modes.  Two cycles if the memory data is unaligned.

Not to claim you don't need as many registers; pipelining actually does
make you want more, so you can take advantage of fetch-ahead and
store-behind.  Which uses lots of registers.
-Stan

pcg@aber-cs.UUCP (Piercarlo Grandi) (05/15/89)

In article <8082@killer.Dallas.TX.US> elg@killer.Dallas.TX.US (Eric Green) writes:
    
    I program a 68000 a lot. I suspect a 68000 is a fairly typical
    reg-memory machine. I write a lot of "C" code. The "C" compiler I'm
    using is fairly PCC-like, i.e. loses all register values between
    expressions. In performance-critical portions of the code, I end up
    modifying the assembly language output to keep as many values in
    registers as possible.

This is because you are one of the many C programmers who do not understand
"register" variables. Too bad for you and your work.

    This is especially important with globals, because they can't be put
    into registers normally.

If you use them intensively in a small section of code, you can assign them
to local register variables. But, a good programmer knows that if a program
uses global variables intensively, then probably the program structure is wrong.
There is one good documented case where global "register" variables are
useful, and it is for interpreters, where you want to keep some of the state
of the simulated machine in global registers, because by definition they are
used/modified continuously.

    Presto, fewer instruction bytes, as much as 20% improvement in
    performance. If only the compiler did it for me, eh?

Because you are not good enough to do it yourself. Admittedly, a compiler
probably generates code that is more reliable and better than that done
by programmers who understand their language and the realities of their
codes well.

    > used little. In a reg-mem architecture little use variables in memory do not
    > carry costs as high when you use them.
    
    Foo. If you use the variable three times, you've saved 4 memory
    fetches (2 addresses, 2 data) as vs. keeping it in a register. No matter
    what kind of machine you're using.

The cost is in adding instructions. John Mashey, who does understand
architectures, said that this is not so important because you can use delay
slots etc... This is true, but the resulting code expansion still has its cost.

Moreover your statement demonstrates a shallow understanding of codes and
chips:

Point 1: competent programmers know that programs spend most of their time
in tiny spots of the code, typically in loops. Optimizing the rest usually
does not matter much (except for size).

Point 2: competent architecture designers (save for those that work for Intel
and other companies that can afford 1.2 million transistors :-/) know
that adding registers beyond what is needed has a cost, both (small) in chip
complexity, and in speed (because if you use them you have to save/restore
them at some point in time), which is well known to those that use C
competently:  using "register" inappropriately can *slow* your program (on a
reg-mem machine).

Again, as to the last point, if you have enough registers, the register file
becomes a kind of first level memory, so you can keep your stack and globals
there. But then you pay in system terms at context switching time.

    >>     Why don't you like inter-expression register assignments?
    > Well, I like them, as long as the compiler does not do them, but the
    > programmer does, by using explicit "register" declarations. But
    
    Have you ever looked at the output of a "C" compiler, and compared the
    output of GCC to PCC (i.e. optimizing compiler vs. non-optimizing)?
    Re-compiling a program with GCC is a sure-fired way of speeding it up
    by 20% ;-).

This does not happen to my programs, when it matters. Why does it happen to
yours? :-( :-). Also, that 20% improvement may come at the expense of
reliability; the higher the optimization, the hairier the code generator,
the greater the risks. RMS itself said that he is disappointed at the time
it is taking to beta test a large complex, even if well written, beast like
gcc. Gcc/g++ does improve my programs however in two ways:

	1) code size. Admittedly PCC does not generate the densest code,
	because its code generator tables and pattern matcher are not quite as
	sophisticated and thorough as Gcc's. On the other hand, PCC takes one
	tenth to one fifth the memory, while being only a bit slower -- PCC's
	tables are interpreted, Gcc's are compiled -- if each is given
	unlimited memory, and very much faster in limited memory.

	2) readability. Inlining is so much better than writing cpp macros...

    [ ... ] but, anyhow, your reg-mem architecture is blown all to heaven by
    what the RISC folks have actually PRODUCED IN SILICON, i.e. it is NOT
    the bemused speculations of a goggle-eyed grad student).

I never quite realized that the 32532, Novix, CRISP, Transputer (to cite my
favourites from CISC, RISC and RISCy camps) were bemused speculations by
"goggle eyed" grad students. Thank you for letting me know. :-(.

    > Note that It is assumed that RISC == reg-reg, and that load-store == reg-reg;
    > neither these equations are necessarily true, as one could have RISC ==
    > stack-stack or load-store == stack-stack... 
    
    stack-stack must be reg-reg in order to be adequately fast.
		^^^^

    But note that program-memory bandwidth is the one thing there's no
    shortage of.

Enough of this unsupported nonsense... Try reading something about the machines
above. Don't believe every urban legend you hear :->.

-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

pcg@aber-cs.UUCP (Piercarlo Grandi) (05/16/89)

In article <19413@winchester.mips.COM> mash@mips.COM (John Mashey) writes:

    	These are not special case assumptions, although, of course the
    	example was put toghether to illustrate the point.
    	1) These days, many machines are reg-reg architectures.

Ahhh. This is a ridiculous argument. Proof-by-numbers is dangerous...
Can over 1 billion chinese be wrong in using abaci? :-) :-)

    	2) Optimizers do find this kind of stuff, especially in FORTRAN
    	codes or some of the larger hunks of C.  they don't have to be used
    	heavily, they just have to be used enough to make it worthwhile.

Again, on reg-reg machines. On Fortran, numeric code, as you say.

    	3) I didn't realize the discussion was limited to reg-mem
    	architectures: I recall the following part of a posting:
		[ ... ]
Well, well. The discussion was not *limited*, it was in the context of reg-mem
machines. In extending it to RISC machines (about which, as I said, there is
even less data), you did something good. But please don't apply my statements
from a 68020-386 debate to RISC machines. On these, my hunch is that more
registers are OK.

    I did, although I'm sorry it wasn't more.  It's not something we have
    much motivation to keep current, unlike the plethora of other statistics
    that we keep around.  Should be good paper topic for somebody.

Seconded. Might get a go myself... I have a 386 at home, and will have g++
on it soon. Who can give me a reg-reg machine for comparison :-) ?

    It IS important to account for other issues, and there's no reason
    that the answer should be the same for other architectural designs.

    I.e., if you end up with memory ops, rather than load-store, you're less
    motivated to have more registers [you'd think].  On the other hand,
    even there, you sometimes win because of cycle-count (not instruction-count)
    issues, i.e., the latency cycle(s) you get on most machines from fetching
    data from memory (even cache memory).

Wise words. Very agreed. Still, the instruction count/code density issue
has its weight in system performance, of course.
    
    You'd be surprised, especially in heavily-pipelined machines.  You must be
    thinking of counting INSTRUCTIONS, not cycles: most fast (i.e., seriously
    pipelined) machines cost you a stall cycle if you want to fetch something
    from memory and use it right away, so even on a machine with mem->reg
    operations, you might choose to sometimes generate a load, followed by an
    op, because you might be able to rearrange code and get something in to
    cover the load latency. [people sometimes found this on the S/370s].
        	
    >    Why don't you like inter-expression register assignments?
    
    >Well, I like them, as long as the compiler does not do them, but the
    >programmer does, by using explicit "register" declarations....
    
    OPINION: the above statement sends me back to 15-20 years ago....
    really, if you believe this, you are not keeping up with what's
    happening in the computer business.

This is going back to the "volatile" debate, in the wrong newsgroup. I keep
up, but occasionally I disagree, especially when none of the great and good
over a period of months was able to quote figures to support their opinion.
    
    >Ahhhhhhhh. What you are saying is that you are using registers as a
    >statically allocated cache, and that this is good not because they are
    >frequently used, but because they would otherwise be frequently
    >saved/restored... Well, well, well. If you want a reg-reg architecture, you

So far we seem to agree (give or take a few registers) that the issue
requires more research, and that C on CISC is more favourable (admittedly)
than FORTRAN on RISC, etc... But here is something very interesting (sorry
for quoting so much, I'd have summarized, but I have become wary of that):

    No.  The registers are frequently used. I said the issue was subtle.
    In a leaf routine,  (on an R3000, but also, very similar on others')
    	1) One need not save/restore the return address
    	2) Most (or usually) all of the local variables get grabbed into
    		scratch registers that need not be saved.
    	3) Now, the stack frame has evaporated, and so we need not move
    	the stack pointer around, and we already usually didn't have
    	a frame pointer.

This is a very good argument, so far, for an AMD 29k style very large
register file, that becomes a statically managed first level memory, or for
a SPARC style set of (less statically managed) windows. When the register
file is very large, you are really almost dealing with a machine with fast
and slow stores, one of which is addressed e.g. with 8 bit word addresses,
and the other with 32 bit byte addresses. The rules change dramatically. It
looks like old CYBERs (even the problem of swapping in/out the fast memory
on context switches). I happen to like the MIPS precisely because it has
NOT gone this route. I also like the Transputer because it has taken this route,
but seriously (4kbyte onchip fast memory -- "registers" if you prefer).

    Since leaf routines are often about 50% of the dynamic function calls,
    this is relevant, and a similar, albeit less strong effect happens on
    others.  Having plenty of scratch registers also means you can pass a
    reasonable number of arguments in registers, avoiding doing stores in
    the caller, and loads in the callee.

    The point is, that a lot of load/store traffic around function calls
    disappears if you have enough registers and smart compilers (whether or
    not you have windows, which of course, can get rid of a few more).  Fast
    machines hate loads, because they usually cost you stall cycles.

    	>pay the price, you take your chances. Me, my idea of RISC is a (mostly) zero
    	>address architecture with 8/12 bit instructions, and four (to avoid extra
	>push/pop pairs in multiplexing a single one for the up to four independent
	>computations) arith stacks.

    This is a fine OPINION; the current round of new computer architectures
    has voted widely, and decisively, for load-store machines with "plenty"
    of registers addressable at any point in the program. (plenty = usually
    32, as in HP PA, MIPS, SPARC, MC88000, i860).

Again, proof-by-numbers. The current round of computer users, it could be
said, have voted decisively for segmented CISC architectures :-( :-(. Also,
it is not entirely an opinion; the only novelty I am citing is having
multiple arith stacks, but, while you say:

    In particular, although I've always admired the old B5500, it seems that
    zero-address architectures are difficult to build to really go fast...

there is the little matter of a few FACTs, called CRISP, NOVIX and
TRANSPUTER, that seem to be always forgotten by reg-reg and otherwise RISC
designs (not to mention the 32532, which has the extremely embarrassing
property of being a simple, well designed, reg-mem CISC that outruns most
RISCs around...).

Declaration of prejudice: I am all (well, 80% :->) for RISC. Of these I find
MIPS more admirable than most. The idea of simple, fast, reliable, is what I
like. It is obvious that, from my armchair, I disagree that the benefits of
RISC come from ALL the design choices of most RISCs.
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

tim@crackle.amd.com (Tim Olson) (05/16/89)

In article <950@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
| there is the little matter of a few FACTs, called CRISP, NOVIX and
| TRANSPUTER, that seem to be always forgotten (not to mention the 32532,
| which has the extremely embarassing property of being a simple, well
| designed, reg-mem CISC that outruns most RISCs around...) by reg-reg
| and otherwise RISC designs.

You speak of FACTS, then spout FALSEHOODS.  Rather than guessing, why
don't you really study the performance of the processors you mention as
compared to current RISC processors?



	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

rpw3@amdcad.AMD.COM (Rob Warnock) (05/18/89)

In article <950@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
+---------------
| This is a very good argument, so far, for an AMD 29k style very large
| register file, that becomes a statically managed first level memory, or for
| a SPARC style set of (less statically managed) windows...
+---------------

Hmmm... the 29k register windowing seems to have been misunderstood again...

On the Am29000, while the local registers (the 128 "windowed" or "stack
cache" regs) are statically *named* (on entry to a routine, lr0 = return
address, lr2 = first arg, etc.), the implied "first level memory" (when
the register set is considered as a stack cache) is *dynamically* managed
according to the instantaneous depth of the call stack by the routine entry
and exit assertions. That is, there is no predetermined correlation between
subroutine calls and spill/fill activity -- some regs are spilled (saved
to memory) when a new frame is opened if the cache is full, and regs are
filled (restored from memory) when a return is made to an upper-level
routine whose frame was spilled (i.e., stack cache is "empty"). Since
most frames are much smaller than the register file, there's *lots* of
hysteresis.

Also, the decrementing of the stack pointer (gr1) on routine entry
(to open a variable-sized frame of 2 to 126 regs) accomplishes a
dynamic remapping of which underlying physical registers the static
local register names will access. [The "mapping" is of course trivial:
bits <8:2> of "gr1" are added to the local register number (modulo 128)
to give physical register number.]
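A hedged C sketch of that mapping (the bit positions follow the paragraph
above; the +128 offset for the local-register bank is my assumption, not
something stated here):

	unsigned phys_reg(unsigned gr1, unsigned lrn)
	{
		unsigned base = (gr1 >> 2) & 0x7f;	/* bits <8:2> of gr1     */
		return 128 + ((base + lrn) & 0x7f);	/* locals: regs 128..255 */
	}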

Since the overlapping windows [incoming args are in both the caller's
and callee's frames] may be of any size (2-126), the 29k's windows are
much "less statically managed" than the SPARC's. Even if the 29k had
the same number of registers available to the windowing mechanism
(instead of many more), it would still typically allow more routines'
windows to be in regs at once, reducing memory traffic.


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun}!redwood!rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

elg@killer.Dallas.TX.US (Eric Green) (05/18/89)

in article <948@aber-cs.UUCP>, pcg@aber-cs.UUCP (Piercarlo Grandi) says:
> In article <8082@killer.Dallas.TX.US> elg@killer.Dallas.TX.US (Eric Green) writes:
>     I program a 68000 a lot. I suspect a 68000 is a fairly typical
>     reg-memory machine. I write a lot of "C" code. The "C" compiler I'm
>     using is fairly PCC-like, i.e. loses all register values between
>     expressions. In performance-critical portions of the code, I end up
> This is because you are one of the many C programmers who do not understand
> "register" variables. Too bad for you and your work.

Excuse me? This is a rather outrageous accusation, considering that
you have seen neither my code, nor the results from my compiler. As a
matter of fact, I am VERY aware of the use of "register" variables. I
have to be, when writing performance-critical code. I also have to be
aware of how many registers I have, in order to keep, by hand, the
results of intermediate calculations in registers (btw: 4 pointer
registers, 5 data registers, the rest are used inside expressions). 
    The portability problems are obvious (all my carefully
common-subexpression-eliminated code is worthless on another
processor, or even with a different compiler). 
    If I could reuse data registers freely, I could write code that a
global optimizer couldn't touch (albeit with a helluva lot of work).
But there's one problem: Types. On a 68000, shorts and ints are 16
bits, longs are 32 bits. What this means is that if I declare a
register int xyz, I can't put a long into it -- the "C" compiler
generates a "move.w" instead of a "move.l". If I declare everything as
register long xyz, the "C" compiler generates an "add.l" instead of an
"add.w", i.e. I just lost all the time I'd saved. 

>     > used little. In a reg-mem architecture little use variables in memory do not
>     > carry costs as high when you use them.
>     Foo. If you use the variable three times, you've saved 4 memory
>     fetches (2 addresses, 2 data) as vs. keeping it in a register. No matter
>     what kind of machine you're using.
> The cost is in adding instructions. John Mashey, that does understand architectures,
> said that this is not so important because you can use delay slots
> etc...

Excuse me? I just said you'd save 4 memory fetches (2 address fetches,
2 data fetches). Where do you get the "added instructions"? Yes, you
have an initial "move.l" to get it into the register... but subsequent
instructions are normal 16-bit register-to-register instructions, not
16-bit plus 32-bit address (you're not counting the ADDRESS as part of
the instruction stream? SHAME!). 

> Moreover your statement demonstrates of shallow understanding of codes and
> chips:

Excuse me? Sounds as if you're acting from insufficient information.
I've used some of the same arguments that John Mashey et al. used in
the old RISC vs. CISC wars and MIPS vs. SPARC wars. I have not said
anything particularly revolutionary, just common net.knowledge, as
supported by real.knowledge (i.e., I've read most of the RISC papers
that've come down the pike). 

> Point 2: competent architecture designers (save for those that work for Intel
> and other companies that can afford 1.2 millions of transistors :-/) know
> that adding registers beyond what is needed has a cost, both (small) in chip
> complexity, and in speed (because if you use them you have to save/restore
> them at some point in time), which is well known to those that use C
> competently:  using "register" inappropriately can *slow* your program (on a
> reg-mem machine).

"Smart" compilers only save/restore registers as needed. For example,
the MIPS compiler doesn't save/restore registers for "leaf" functions,
i.e., those functions that call no other functions (I believe the MIPS
folks said that "leaf" functions account for 40% or so of all
functions, but you'll have to ask them). As for the rest of your
arguments, that's why they invented windowed register stacks, and
AMD29000 style explicit windowing. A flat register architecture isn't
the only way to go, although MIPS has found a flat register
architecture with 32 registers to be quite adequate.

> Again, as to the last point, if you have enough registers, the register file
> becomes a kind of first level memory, so you can keep your stack, and globals
> there. But then you pay in system terms at context switching time.

a hundred context switches per second vs. several thousand subroutine
calls per second? I'll optimize the subroutine calls, thank you! We
hashed all this out a couple of years ago, during the last
RISC/CISC/MIPS/SPARC wars.

>     stack-stack must be reg-reg in order to be adequately fast.
> 		^^^^
>     But note that program-memory bandwidth is the one thing there's no
>     shortage of.
> 
> Enough of this unsupported nonsense... Try read something about the machines
> above. Don't believe every urban legend you hear :->.

I've read the CRISP paper, and several other stack-stack papers. In
all of them they mention that caching the top <n> entries of the stack
in hardware registers was a Big Win performance wise. All I said was
that a register is a register, whether it is accessed as a "stack" or
explicitly as a register. As for the statement "program-memory
bandwidth is the one thing there's no shortage of", I point you
towards: a) large cache memories, b) locality of reference (>90%, with
a large cache), and c) interleaved memories, which allow you to
execute sequentially, using slow memories, with no performance hit
(but when you hit a branch, you may have a major performance hit --
which is why reducing the number of branches in a RISC is a Good
Thing). None of this is particularly new or revolutionary. Seymour Cray
has been doing it since the late 60's. The only new thing is that
these techniques are now being used in desktop computers, by, amongst
others, MIPS, AMD, Sun (SPARC) and Motorola (68040, 88000).

So while "essentially unlimited" is perhaps a bit strong, I (and most
RISC advocates) still maintain that the number of instructions, and
the size of each instruction, are NOT the limiting factor insofar as
performance goes. 

Good reference to "How fast can we fetch opcodes?": 
   "A VLIW architecture for a trace scheduling compiler"
   Robert P. Colwell et al., Computer Architecture News vol 15 #5, p. 180

--
|    // Eric Lee Green              P.O. Box 92191, Lafayette, LA 70509     |
|   //  ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg     (318)989-9849     |
|  //    Join the Church of HAL, and worship at the altar of all computers  |
|\X/   with three-letter names (e.g. IBM and DEC). White lab coats optional.|

cliff@ficc.uu.net (cliff click) (05/18/89)

In article <25651@amdcad.AMD.COM>, tim@crackle.amd.com (Tim Olson) writes:
> In article <950@aber-cs.UUCP> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
> | there is the little matter of a few FACTs, called CRISP, NOVIX and
> | TRANSPUTER, 
> You speak of FACTS, then spout FALSEHOODS.  Rather than guessing, why
> don't you really study the performance of the processors you mention as
> compared to current RISC processors.

Yes, how about some FACTS on these chips?  How about some facts on
the 32-bit version of the NOVIX, Phil Koopman's WISC chip, and
Johns Hopkins Labs (nuts! I can't remember the name) stack-oriented
chip.  All of these chips were faster (more MIPS per MHz) than the
currently available chips (680xx, 80x86) in '87 -- using fairly old
technology.  How fast might they be if they had a sustained development
effort on the order required to produce the 29000 and the 88000?

Did the big-name chip developers miss something here?  Why didn't any
of them develop a dual (4?) stack chip, with zero (ok, 1 or 2) addressing
modes, a Harvard architecture (3 data paths), and 16 (or 32) instructions
that were essentially the chip's microcode (instruction bits fed directly
into the control lines on chip, very little decode time)?  All of these
chips could do a call/branch in 1 cycle and a return in 0 cycles, and
passed parameters lived on the stack with everything else.  Usually the
stacks were cached on chip, with overflow to memory.

-- 
Cliff Click, Software Contractor at Large
Business: uunet.uu.net!ficc!cliff, cliff@ficc.uu.net, +1 713 274 5368 (w).
Disclaimer: lost in the vortices of nilspace...       +1 713 568 3460 (h).

daveh@cbmvax.UUCP (Dave Haynie) (05/19/89)

in article <950@aber-cs.UUCP>, pcg@aber-cs.UUCP (Piercarlo Grandi) says:

> Ahhh. This is a ridiculous argument. Proof-by-numbers is dangerous...
> Can over 1 billion chinese be wrong in using abaci? :-) :-)

Well, if you had to walk 5,000 miles to the nearest store that had batteries
for your calculator, an abacus would start looking mighty good.  Much better
than counting on your fingers...

The latest 68030 compiler I've been using (Lattice 5.02 for Amiga OS) does
more with registers than just optimize expressions.  First of all, for short
programs, it'll reference code and data segments relative to (PC) and a
base register.  It also hooks nicely into the way the operating system works,
where shared libraries are referenced relative to a library pointer
stored in a register, so all library calls are of the form:

	move.l	LibraryBase,a6
	jsr	function1(a6)
	jsr	function2(a6)

etc.  The function calls expect their arguments in registers as well, and
the compiler learned some time ago how to handle this for the case of
system calls.  This method of parameter passing has now been extended, and
as a result, the compiler can optionally pass all parameters in registers, 
instead of on the stack, even between C functions.  This has a speedup effect
that varies between noticeable and dramatic.

> Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
> Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
> Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
-- 
Dave Haynie  "The 32 Bit Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {uunet|pyramid|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
              Amiga -- It's not just a job, it's an obsession

pb@idca.tds.PHILIPS.nl (Peter Brouwer) (05/19/89)

In article <8125@killer.Dallas.TX.US> elg@killer.Dallas.TX.US (Eric Green) writes:
>But there's one problem: Types. On a 68000, shorts and ints are 16
>bits, longs are 32 bits. What this means is that if I declare a
Are you sure this is correct? The compilers I have used work with shorts as
16 bits ( word ) and ints and longs as 32 bits ( long word ). The situation you
describe is used on the 8086 family as far as I know. ( Lattice c on MSDOS ).
-- 
#  Peter Brouwer,                     ##
#  Philips TDS, Dept SSP-V2           ## voice +31 55 432523
#  P.O. Box 245                       ## UUCP address ..!mcvax!philapd!pb
#  7300 AE Apeldoorn, The Netherlands ## Internet pb@idca.tds.philips.nl

richard@aiai.ed.ac.uk (Richard Tobin) (05/19/89)

In article <8125@killer.Dallas.TX.US> elg@killer.Dallas.TX.US (Eric Green) writes:
>But there's one problem: Types. On a 68000, shorts and ints are 16
>bits, longs are 32 bits. What this means is that if I declare a
>register int xyz, I can't put a long into it -- the "C" compiler
>generates a "move.w" instead of a "move.l". If I declare everything as
>register long xyz, the "C" compiler generates a "add.l" instead of an
>"add.w", i.e. I just lost all the time I'd saved. 

Often you can get reasonable code out of a dumb compiler with code like
this:

    {
        register long t1;
        .....
    }
    {
        register short t2;
        .....
    }

(I.e. the compiler will use the same register for variables in different
blocks.)

But perhaps you already knew this.

-- Richard

-- 
Richard Tobin,                       JANET: R.Tobin@uk.ac.ed             
AI Applications Institute,           ARPA:  R.Tobin%uk.ac.ed@nsfnet-relay.ac.uk
Edinburgh University.                UUCP:  ...!ukc!ed.ac.uk!R.Tobin

bcase@cup.portal.com (Brian bcase Case) (05/20/89)

>How about some facts on the 32bit version of the NOVIX, Phil Koopman's
>WISC chip, and Johns Hopkins Labs stack oriented chip.  All of these
>chips were faster (more MIPS per Mhz) than the ...(680xx, 80x86) in '87 -
>using fairly old technology.

Not hard to do.  The same (old technology) could be said of the MIPS and
SPARC implementations of the time.

>How fast might they be if they had a sustained development 
>effort on the order required to produce the 29000 and the 88000?

I give up, how fast?

>Did the big name chip developers miss something here?  Why didn't any
>of them develop a dual (4?) stack chip, zero (ok 1 or 2) addressing 
>modes, harvard architecure (3 data paths), 16 (or 32) intructions that 
>were essentially the chips micro-code (instruction bits fed directly 
>into the control lines on chip, very little decode time).  All of these 
>chips could do a call/branch in 1 cycle, return in 0 cycles, and passed 
>parameters lived on the stack with everything else.  Usually stacks 
>were cached on chip with overflow to memory.

Except for the 0-cycle return, most RISC chips share the same attributes.
However, have you considered the fact that the implementations of commercial
RISCs are constrained by, for example, virtual memory?  Or the need to
support many different kinds of languages?  Having 3 or 4 memory ports is
CLEARLY a great idea if it fits in with the rest of the system design
center.  For Forth, a-ok.  For a chip that's going to have TLB(s) on chip,
not OK.  To answer your questions:

None of the commercial RISCs have stack architectures because such an
architecture defeats optimization strategies and doesn't make use of the
inherently powerful (high-bandwidth, low-latency) chunk of hardware called
the 3-port (or even more ports) register file.

All commercial RISC chips have 0 or 1 addressing mode.  (This is using my
definition of RISC.)

Most commercial RISCs have a Harvard implementation with a separate path
for instructions and data.  Implementing more 32-bit paths would have bad
consequences like slowing them all down.

While decoding a NOVIX instruction is simpler than that of most commercial
RISC instructions, the difference is not important.  What is important is
that the pipeline consisting of fetch, decode, execute, and write back (maybe
a memory stage in there too, like in MIPS) be uniform, that is, all stages
requiring essentially the same number of gate delays.  Greatly simplifying
the decoding beyond what has already been done is not productive for these
architectures.  (Analogy:  RISCs have fast procedure call mechanisms.
Speeding up the procedure call by another factor of 10 would have little
consequence.)

Commercial RISCs have 1 cycle branches and 1 cycle returns (which is really
a branch).  It might have been possible to architect and implement an
indirect branch (i.e. return) instruction that also specifies a (2-address)
arithmetic op, similar to what the NOVIX et al. have.  It would be somewhat
complex and irregular.  Scheduling the op in the delay slot is much more
general and doesn't make the ALU op different from all the rest (2-address
vs. 3-address).  The win would probably be small in these architectures.

Commercial RISCs also cache the "stack" and everything else on chip, in the
general-purpose 3-port register file.  A processor has a memory hierarchy
with the register file at the top of the pyramid.  There are different ways
to design the register file.  NOVIX et al. chose one way, everybody else
chose another.  Which is more general purpose?  Clearly, the 3-port register
file.

One bit (ok, one-half bit) of wisdom:  It is not always useful to look at
the features of one architecture and then conjecture that another would be
improved by adopting those features.  The reason is that an architecture is
a unit, the expression of a conceptual approach.  There is a complex matrix
of dependencies between the features of any architecture so that removing
one feature can invalidate all or a large fraction of the other features.
It is like saying that Chopin's piano concertos need to have a heavy beat.
The design center of the NOVIX chip, e.g., permits several memory ports and
a stack orientation.  This is fine if you can live with 64K of stack.  Chopin
is great until you want to go disco dancing....

Claim:  The best architectures are those that appear to have been designed
by one person.

gd@geovision.uucp (Gord Deinstadt) (05/20/89)

A few comments on / questions about super-global register allocation:

1.  How do you handle calls via a pointer?  Does your linker have to
be able to figure out all possible values of the function pointer?
Can this be done?  Or could you just tell the linker that these
functions are called somewhere and never mind where?  In this case it
would have to assume the worst, i.e., that each was at the root of the tree.

2.  How do you handle signals?  I suppose the O/S could save all registers
before activating the signal handler, and restore them afterward.  No,
that doesn't work - restoring the registers resets some globals that you
WANTED to change.  (A solution for this would also solve 1.)

3.  All registers must be saved when calling a routine in a shared run-time
library.  Or, the linker could know which regs the library may use for each
entry point.  This gets more complicated if one library calls another, but
it seems manageable.

4.  The linker has to know if a variable is in shared memory.  Current 
optimizers are saved in situations such as this:

	for ( ; ; ) {
		wait_for_semaphore() ;
		switch (shmemp->thing) {

by the fact that shmemp->thing is in the heap, so they assume it might
have changed during the call to wait_for_semaphore().  However, a super-
global register allocator assumes it knows ALL about the program, and so
it is CERTAIN that the value can't have changed.  You have to tell it,
somehow.  A "shared" keyword in the storage declaration would suffice,
but that's a language change, and you know how those software types are!
:-) :-)
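
A simplified sketch of the hazard (invented names; this is roughly the
code such an allocator might effectively produce, not the source as
written):

	struct shared { int thing; };
	struct shared *shmemp;          /* points into a shared segment */

	extern void wait_for_semaphore();
	extern void handle();

	void server_loop()
	{
	    int cached = shmemp->thing;    /* load hoisted into a register -- */
	    for ( ; ; ) {                  /* wrong here, because another     */
	        wait_for_semaphore();      /* process can store to            */
	        handle(cached);            /* shmemp->thing behind the        */
	    }                              /* semaphore                       */
	}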

Gord Deinstadt		gdeinstadt@geovision

javoskamp@watmum.waterloo.edu (Jeff Voskamp) (05/20/89)

In article <427@ssp2.idca.tds.philips.nl> pb@idca.tds.PHILIPS.nl (Peter Brouwer) writes:
>In article <8125@killer.Dallas.TX.US> elg@killer.Dallas.TX.US (Eric Green) writes:
>>But there's one problem: Types. On a 68000, shorts and ints are 16
>>bits, longs are 32 bits. What this means is that if I declare a
>Are you sure this is correct? The compilers I have used work with shorts as
>16 bits ( word ) and ints and longs as 32 bits ( long word ). The situation you
>describe is used on the 8086 family as far as I know. ( Lattice c on MSDOS ).

Actually it depends on which machine you're working on.  Most compilers
that I know of for the Amiga, Mac and Atari ST are as described above
(original article), although some will use 32 bit regular ints with the
right options.

Don't forget that while comparisons, moves, logical operations, and
simple math (addition and subtraction) work equally well on byte, word,
and long-word quantities, the higher math functions (multiplication and
division) are partly tied to word values.  Multiplication is 16 bits
by 16 bits giving 32 bits, while division is 32 bits divided by 16 bits,
giving a 16-bit remainder and quotient (1).  Therefore a "lazy" compiler
will default to 16-bit quantities.
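
In C terms, the widths look roughly like this (just a model of the
operand sizes, not what any 68000 compiler actually emits):

    unsigned long mul_16x16(unsigned short a, unsigned short b)
    {
        return (unsigned long)a * (unsigned long)b;    /* 16 x 16 -> 32 */
    }

    void div_32by16(unsigned long dividend, unsigned short divisor,
                    unsigned short *quot, unsigned short *rem)
    {
        *quot = (unsigned short)(dividend / divisor);  /* 32 / 16 -> 16 */
        *rem  = (unsigned short)(dividend % divisor);  /* 16-bit remainder */
    }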

Pointers are of course 32 bits.

Great way to find out about "the world is a VAX" syndrome. :-)

Jeff Voskamp

(1) this is for a "vanilla" 68000.  Mileage may vary with other members
of the 680x0 family.
It looks like we're in Trouble.
  No, I'm in trouble - you're invincible.		- My Secret Identity
bang path: ...{!uunet}!watmath!watmum!javoskamp 
domain   : javoskamp@watmum.uwaterloo.ca or javoskamp@watmum.waterloo.cdn

henry@utzoo.uucp (Henry Spencer) (05/21/89)

In article <427@ssp2.idca.tds.philips.nl> pb@idca.tds.PHILIPS.nl (Peter Brouwer) writes:
>>But there's one problem: Types. On a 68000, shorts and ints are 16
>>bits, longs are 32 bits...
>Are you sure this is correct? The compilers I have used work with shorts as
>16 bits ( word ) and ints and longs as 32 bits ( long word )...

On the 68000 and 68010, there was no consensus on this, and both choices
were found in operational compilers.  32-bit ints are generally better,
and the common operations were present, but some of the less common ones
(e.g. multiply) weren't, and most everything took an efficiency hit from
32-bit data on a 16-bit bus.  The width of int is supposed to be "the most
natural width", but for the 680[01]0 that could go either way, depending
on the implementor's prejudices.
-- 
Subversion, n:  a superset     |     Henry Spencer at U of Toronto Zoology
of a subset.    --J.J. Horning | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

pcg@aber-cs.UUCP (Piercarlo Grandi) (05/21/89)

In article <8125@killer.Dallas.TX.US> elg@killer.Dallas.TX.US (Eric Green)
writes (let's pick just one!):
    
    I've read the CRISP paper, and several other stack-stack papers. In
    all of them they mention that caching the top <n> entries of the stack
    in hardware registers was a Big Win performance wise. All I said was
    that a register is a register, whether it is accessed as a "stack" or
    explicitly as a register.

Too bad that there are far-from-irrelevant differences between an explicitly
addressed set of registers used as a stack cache and a true stack cache, as
far as the whole system and the compiler are concerned.  Also, too bad that
usually <n> is not that large :-), and is in the range of the number of
useful registers in a register bank.

As somebody never tires of pointing out (Brian Case, if my memory serves me
correctly), three-ported register files, from a CPU designer's point of view,
are a big win whether you access them from the architecture as a stack cache
or as a bank of registers -- and there is a good case for the stack cache
argument.
-- 
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcvax!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

mash@mips.COM (John Mashey) (05/22/89)

In article <18549@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>>How about some facts on the 32bit version of the NOVIX, Phil Koopman's
>>WISC chip, and Johns Hopkins Labs stack oriented chip.  All of these
>>chips were faster (more MIPS per Mhz) than the ...(680xx, 80x86) in '87 -
>>using fairly old technology.

>Not hard to do.  The same (old technology) could be said of the MIPS and
>SPARC implementations of the time.

>>Did the big name chip developers miss something here?  Why didn't any
>>of them develop a dual (4?) stack chip, zero (ok 1 or 2) addressing 
>>modes, harvard architecure (3 data paths), 16 (or 32) intructions...

>However, have you considered the fact that the implemenations of commercial
>RISCs are constrained by, for example, virtual memory?  Or the need to
>support many different kinds of languages?

Brian gives a pretty good set of answers here.  I'd add a couple more:

1) Research chips are designed to explore some ideas, and they should be
designed, evaluated, and then (if desired) thrown away to go try the next
one.  In order to do them at speed, they may well ignore all sorts of
issues beyond whatever it is they're researching.

Good examples are the Stanford MIPSs and Berkeley RISC chips (including
SOAR, SPUR, etc).  These were important and worthy research efforts,
but I can't IMAGINE taking any of them, as is, straight into the commercial
market (note that although SPARC certainly resembles the RISC II/SOAR
chips, it of course addressed many additional issues).  

2) Chips designed to be commercial, but special-purpose (like NOVIX),
have a different set of constraints.  They have to be designed to be
testable, manufacturable, reasonably scalable into new technologies;
they might have to obey various interfacing constraints; they might make
unusual tradeoffs to do an excellent job at whatever they're aimed at.

Neither of the two above needs to do what the third class does:

3) Chips designed to be general-purpose chips covering a wide
range of languages and target environments not only need to be testable,
manufacturable, etc, but they need to have good performance and usability:
	a) over a range of languages
	b) over a range of operating systems
	c) over a range of hardware-system design points
	d) over a range of technologies, both at one time, and over time,
	i.e., when you design the very first one, you'd probably better
	have an idea where the technology is going, and what later
	implementations might look like, to avoid mistakes.
All of this means tradeoffs: even with a million transistors on a chip,
you STILL have to make tradeoffs....
As a result, one can almost always pick a very specific set of choices,
and then make tradeoffs to produce a processor that will be better at that
one thing than are any of the general-purpose ones.  On the other hand,
computing history tells us that "that one thing" had better have a large
market, i.e., large enough that the special-purpose processor gets the
investment to track the technology advances, and volume enough to keep
the costs low enough to stay competitive.  There clearly are niches
where this is possible: for example, digital signal processors, or
some graphics chips. 

One must take care not to compare apples and oranges amongst the 3
classes: all 3 are important, but they sure are different.

Finally, maybe somebody can set me straight, or provide some real DATA:
I have a vague memory of seeing something claiming that NOVIX had gone
out of business in the last few months.  Can anybody either confirm
that, or the inverse? 

And while I'm at it, there are rumors on the
street of Chapter-11 time for Edgecore(?) (used to be Edge).  Anybody
have any data on that one?  There's an interesting architectural
connection: Edge's business is selling very-high-end
68K-compatible boxes built from (I think) CMOS gate arrays, i.e.,
boxes OEMed to cover the high end of your line, if you're 68K-based.
(i.e., this is a strategy somewhat similar to NexGen's for the 386,
I think).  Any postings of data might be useful.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

jbuck@epimass.EPI.COM (Joe Buck) (05/30/89)

In article <1989May20.223228.2456@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
(discussion: should 68000's and 68010's have 16-bit ints or 32-bit ints?)

>On the 68000 and 68010, there was no consensus on this, and both choices
>were found in operational compilers.  32-bit ints are generally better,
>and the common operations were present, but some of the less common ones
>(e.g. multiply) weren't, and most everything took an efficiency hit from
>32-bit data on a 16-bit bus.  The width of int is supposed to be "the most
>natural width", but for the 680[01]0 that could go either way, depending
>on the implementor's prejudices.

Henry, why do you say that 32-bit ints are "generally better" on
68000's and 68010's?  Seems to me the only reason someone would
consider such an unnatural thing is that there is so much sloppy code
around that assumes that ints are the same size as pointers.  Choosing
32-bit ints on a 68010 causes poorer performance and larger code.  It
takes cycles to fetch all those 32-bit values and memory to store
them.  While multiplies aren't all that common, the applications that
do generate them take big performance hits when library calls are
needed instead of single instructions.  With the 16-bit-int choice,
32-bit integers can be obtained where required by declaring variables
"long".  There isn't any way to get the more compact, efficient 16-bit
int operations if your compiler makes the (IMnotsoHO) wrong choice.

The most common values of integer variables are "0" and "1".  Why drag
those extra bytes everywhere?

I'm obviously in the minority on this.  I want to junk our old HP64000
development system, and all the newer emulator/cross-compiler products
want to do 32-bit ints for 68000's.  Grr.
-- 
-- Joe Buck	jbuck@epimass.epi.com, uunet!epimass.epi.com!jbuck

henry@utzoo.uucp (Henry Spencer) (05/31/89)

In article <3252@epimass.EPI.COM> jbuck@epimass.EPI.COM (Joe Buck) writes:
>Henry, why do you say that 32-bit ints are "generally better" on
>68000's and 68010's? ...

16-bit ints are a bit limiting at times.  Admittedly this is not common,
but it is a nuisance.  And then there are the sloppy programmers, the
ones who assume (as you mention) that pointers and ints are the same size,
or that ints are 32 bits, or that ints and longs are the same size and can
be used interchangeably (this is depressingly common in the Unix world today,
and last I looked, both Berkeley and AT&T were aiding and abetting this
disgusting practice).  Alas, one cannot just ignore the sloppy programmers;
sometimes one wants to run code they wrote without having to clean it up.

>... There isn't any way to get the more compact, efficient 16-bit
>int operations if your compiler makes the (IMnotsoHO) wrong choice.

Just to be heretical, one can observe that the same thing is true of
addresses.  You can run the 68k family with 16-bit addresses (although
they have to be *signed* 16-bit addresses!).  Furthermore, on the 68000
and 68010, there is a noticeable speed improvement to be had from doing
this.  And, as with ints, there isn't any way to get the more compact
and efficient 16-bit pointer operations if your compiler chooses 32 bits.
Hal Hardenbergh may be the only person on Earth who's made a serious try
at using "small model" :-) on the 68000, but it does work.
-- 
Van Allen, adj: pertaining to  |     Henry Spencer at U of Toronto Zoology
deadly hazards to spaceflight. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

wendyt@pyrps5 (Wendy Thrash) (05/31/89)

In article <3252@epimass.EPI.COM> jbuck@epimass.EPI.COM (Joe Buck) writes:
>With the 16-bit-int choice,
>32-bit integers can be obtained where required by declaring variables
>"long".  There isn't any way to get the more compact, efficient 16-bit
>int operations if your compiler makes the (IMnotsoHO) wrong choice.

Uh, Joe, can't you turn that argument around to say that 16-bit integers
can be obtained where required by declaring variables "short" (or am
I being C-chauvinistic)?

jbuck@epimass.EPI.COM (Joe Buck) (05/31/89)

I write:
>>With the 16-bit-int choice,
>>32-bit integers can be obtained where required by declaring variables
>>"long".  There isn't any way to get the more compact, efficient 16-bit
>>int operations if your compiler makes the (IMnotsoHO) wrong choice.

Wendy Thrash replies:
>Uh, Joe, can't you turn that argument around to say that 16-bit integers
>can be obtained where required by declaring variables "short" (or am
>I being C-chauvinistic)?

Sorry, Wendy, it doesn't work.  The reason it doesn't is because C
promotes shorts to ints before doing any arithmetic or function calls.
The choice is not symmetric (using short, int for the two sizes vs
int, long) because of the requirement to promote all smaller integer
sizes to int.  It does work to save space in array storage, but not
for scalar operations.  That's why I said what I did: There isn't any
way to get the more compact, efficient 16-bit int operations if your
compiler makes the wrong choice.

Why do we always get into these C discussions on comp.arch?  Why am I
complaining about something I am engaging in? :-)

-- 
-- Joe Buck	jbuck@epimass.epi.com, uunet!epimass.epi.com!jbuck

wendyt@pyrps5 (Wendy Thrash) (05/31/89)

In article <3256@epimass.EPI.COM> jbuck@epimass.EPI.COM (Joe Buck) writes:
>Sorry, Wendy, it doesn't work.  The reason it doesn't is because C
>promotes shorts to ints before doing any arithmetic or function calls. . . .

Before everyone on the net sends me mail telling me I'm an idiot, I'd like
to 1) plead temporary insanity for not thinking of function calls; and
2) remark that  C doesn't require promotion before doing arithmetic, but
requires that the results look as if it did.  For example, Vax gcc, when fed
	main() {
		short i, j, k, l;
		i = ((i+j)*(i-j));
		foo(i); }
produces
	. . .
	addw3 -4(fp),-8(fp),r0
	subw3 -8(fp),-4(fp),r1
	mulw3 r0,r1,-4(fp)
	cvtwl -4(fp),r0
	pushl r0
	calls $1,_foo
The convert is there for the call, but the arithmetic is done in short.
(Yes, there are other calculations on shorts that will require conversion
to longs as an intermediate step.)

This is in the X3J11 Rationale; e.g., in section 3.2.1.5: "Calculations
can also be performed in a 'narrower' type, by the _as if_ rule, so
long as the same end result is obtained."

>Why do we always get into these C discussions on comp.arch?  Why am I
>complaining about something I am engaging in? :-)

My apologies; I'll post no more on this topic, no matter what piece
of stupidity someone finds in this posting.

peter@ficc.uu.net (Peter da Silva) (05/31/89)

In article <1989May30.171335.473@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes:
> Hal Hardenbergh may be the only person on Earth who's made a serious try
> at using "small model" :-) on the 68000, but it does work.

The Manx C compiler on the Amiga supports small model.
-- 
Peter da Silva, Xenix Support, Ferranti International Controls Corporation.

Business: uunet.uu.net!ficc!peter, peter@ficc.uu.net, +1 713 274 5180.
Personal: ...!texbell!sugar!peter, peter@sugar.hackercorp.com.

guy@auspex.auspex.com (Guy Harris) (06/01/89)

>>Henry, why do you say that 32-bit ints are "generally better" on
>>68000's and 68010's? ...
>
>16-bit ints are a bit limiting at times.  Admittedly this is not common,
>but it is a nuisance.  And then there are the sloppy programmers ...

Note also that, at least in the case of vendors who really wanted to
make 32-bit machines, using the 680[01]0 as a 32-bit-"int" machine
meant that they could switch to the 68020 without requiring everybody to
recompile their code - Sun-2 binaries that don't use some Sun-2-only
hardware (I think the Sun-2 had a Sky Computers floating-point board
that fell into this category) can run on a Sun-3, modulo any
release-to-release binary incompatibilities, and I presume the same is
true of systems from Apollo, HP, and possibly others.

jeffa@hpmwtd.HP.COM (Jeff Aguilera) (06/02/89)

>                        And then there are the sloppy programmers, the
> ones who assume (as you mention) that pointers and ints are the same size,
> or that ints are 32 bits, or that ints and longs are the same size and can
> be used interchangeably (this is depressingly common in the Unix world today,
> and last I looked, both Berkeley and AT&T were aiding and abetting this
> disgusting practice).  Alas, one cannot just ignore the sloppy programmers;
> sometimes one wants to run code they wrote without having to clean it up.
> -- 
> Henry Spencer 
> ----------

This is a problem with poor language design, not sloppy programming.  C
provides no means for a programmer to express the intended range of an
integer.  The X Window System, for example, defines many integer subtypes:
INT16, INT32, UINT16, etc.  The short-int-long difficulties are isolated with
typedefs, but problems still linger.  36-bit architectures, for instance, are
a major headache.  On page 182 of the first edition of "The C Programming
Language," Kernighan and Ritchie summarize the characteristics of several
implementations of C:

                DEC PDP-11  Honeywell   IBM 370     Interdata
                            6000                    8/32

                ASCII       ASCII       EBCDIC      ASCII
    char        8 bits      9 bits      8 bits      8 bits
    int         16          36          32          32
    short       16          36          16          16
    long        32          36          32          32
    float       32          36          32          32
    double      64          72          64          64

When 64-bit machines are all the rage, will short-int-long refer to 16-32-64-bit
integers?  Or 32-64-64?  Or 36-36-36  :-?  The programmer needs more control
than provided by C.
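
The usual workaround looks something like this (one possible mapping --
a real port picks the widths per machine):

    typedef short           INT16;
    typedef long            INT32;
    typedef unsigned short  UINT16;
    typedef unsigned long   UINT32;
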
--------
jaa

elg@killer.DALLAS.TX.US (Eric Green) (06/02/89)

in article <4358@ficc.uu.net>, peter@ficc.uu.net (Peter da Silva) says:
> In article <1989May30.171335.473@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes:
>> Hal Hardenbergh may be the only person on Earth who's made a serious try
>> at using "small model" :-) on the 68000, but it does work.
> The Manx C compiler on the Amiga supports small model.

The Manx C compiler (on any 68K machine) supports "sort-of-small"
model.  What Henry was talking about was "true" small model, where you
have 16-bit pointers -- i.e., ALL data references are offset from a
register, not just global data references, so malloc returns a
16-bit quantity.

--
    Eric Lee Green              P.O. Box 92191, Lafayette, LA 70509     
     ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg     (318)989-9849    
"I have seen or heard 'designer of the 68000' attached to so many names that
 I can only guess that the 68000 was produced by Cecil B. DeMille." -- Bcase

henry@utzoo.uucp (Henry Spencer) (06/04/89)

In article <670005@hpmwjaa.HP.COM> jeffa@hpmwtd.HP.COM (Jeff Aguilera) writes:
>>                        And then there are the sloppy programmers, the
>> ones who assume (as you mention) that pointers and ints are the same size,
>> or that ints are 32 bits, or that ints and longs are the same size...
>
>This is a problem with poor language design, not sloppy programming.  C
>provides no means for a programmer to express the intended range of an
>integer...

Actually, what we have here is sort of a borderline case, which can be
viewed either way depending on your prejudices.  As in so many other
things, C provides tools rather than solutions, and trusts the programmer
to use them properly.  Remember, C is interested in efficiency as well
as ease of expression.  It isn't always easy for a simple compiler to
go from intended ranges to efficient code, and it often isn't obvious
from the source just how efficient the code is going to be.  There can
also be problems if you want the code to be portable and to run the
hardware as efficiently as possible on multiple machines:  sometimes
you really do want to say "as big an integer as this machine supports
efficiently".

Also, I don't see how the express-the-range argument entirely addresses
the issues I raised.  Whether or not C could be better, the programmers
are supposed to be coding in C, not Pascal.  Sloppiness is sloppiness,
regardless of what inspires it.
-- 
You *can* understand sendmail, |     Henry Spencer at U of Toronto Zoology
but it's not worth it. -Collyer| uunet!attcan!utzoo!henry henry@zoo.toronto.edu