[comp.arch] more registers for ix86, was: Let's pretend

scott@labtam.labtam.oz (Scott Colwell) (01/03/91)

rcg@lpi.liant.com (Rick Gorton) writes:
>>  What features should be put into the CPU to improve performance and
>>reduce chip count?
>>
>SOME REGISTERS!!!!!

As has been pointed out, adding more registers to the ix86 would be almost
impossible (well, impractical anyway) given the current instruction encoding.

Adding registers is not the only way of attacking the problem, though: current
compilers fall back to stack frame variables after the scarce registers have
been allocated.
On the 486, reg to reg operations take one clock, cached memory to reg take
two and reg to cached mem take three. By improving this performance and making
some changes to the cache allocation scheme, cache can and does compensate
for the lack of general purpose registers.

It becomes interesting when you ask whether the extra byte of offset from the
start of the stack frame that this requires in the instruction stream is
significant, and whether the other proposed scheme, an escape byte giving
access to new general-purpose registers, would do any better.

As microprocessor performance continues to stretch memory bandwidth and latency,
keeping the code density reasonably high will always be a worthy aim. (As long
as the data bandwidth isn't stupidly high due to lack of registers :-)

Some ideas on how to make the on-chip cache work better for stack frames:

	Allocate on writes for stack accesses. The 486 allocates on reads only
	which means that an automatic variable is guaranteed to be a cache miss
	on its first read. The CPU can tell a stack address from a data or code
	address, since it goes through a different segment register. If this
	status were propagated the way the code/data status is, the cache could
	alter its behaviour for stack accesses. (A rough sketch of this policy
	follows below.)

	As the on chip cache becomes larger, it may be worth the effort to
	have a special cache for the stack. It would be acceptable for it to
	not snoop _if_ it was guaranteed to only cache things on the stack.
	Using virtual tags might be an acceptable trade-off if the miss
	penalty was not too high after a flush. Filled from the on chip data
	cache? (someone will let me know if I have presumed too much here :-)
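
A rough sketch of the allocate-on-write idea, purely illustrative (the line
and cache sizes and the explicit "is this a stack access" flag are my own
inventions, nothing like the real 486 controller):

#include <stdio.h>

#define LINES      128               /* 128 lines x 16 bytes = 2KB toy cache */
#define LINE_BYTES 16

static unsigned long tag[LINES];
static int           valid[LINES];
static int           alloc_stack_writes;    /* the proposed policy switch */

static const char *access_cache(unsigned long addr, int is_write, int is_stack)
{
    unsigned long line = (addr / LINE_BYTES) % LINES;
    unsigned long t    = addr / LINE_BYTES / LINES;

    if (valid[line] && tag[line] == t)
        return "hit";

    /* miss: decide whether to allocate the line */
    if (!is_write || (is_stack && alloc_stack_writes)) {
        valid[line] = 1;
        tag[line]   = t;
    }
    return "miss";
}

int main(void)
{
    unsigned long local = 0x7ffff00;         /* pretend automatic variable */
    const char *w, *r;

    alloc_stack_writes = 0;                  /* 486-like: read-allocate only */
    w = access_cache(local, 1, 1);
    r = access_cache(local, 0, 1);
    printf("read-allocate only:    write %s, first read %s\n", w, r);

    valid[(local / LINE_BYTES) % LINES] = 0; /* reset the toy cache */

    alloc_stack_writes = 1;                  /* proposed: allocate stack writes too */
    w = access_cache(local, 1, 1);
    r = access_cache(local, 0, 1);
    printf("allocate stack writes: write %s, first read %s\n", w, r);
    return 0;
}

With read-allocate only, both the write and the first read of the fresh
automatic variable miss; allocating on the stack write turns that first read
into a hit.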

-- 
Scott Colwell
Senior Design Engineer
Labtam Australia		net:	scott@labtam.oz.au
Melbourne, Australia 		phone:	+61-3-587-1444

kenw@skyler.calarc.ARC.AB.CA (Ken Wallewein) (01/04/91)

  A short while ago I was going to post a message saying "Hey, I know!
Virtual registers!" :-).

  Seriously, though, I remember reading the docs for the processor in the
Texas Instruments 99/4A.  As I recall, it had three registers: a
couple for operands, and one pointer register for a "virtual" register
set in RAM.  Changing that one register's contents gave one a whole or
partially different set.  The docs claimed (of course) that this was the
way of the future, because processor <-> memory bandwidths were
increasing (hah :-).
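
  The idea is simple enough to sketch in a few lines of C (the sizes and
layout below are illustrative, not the real TMS9900): the "registers" are
ordinary memory addressed through a workspace pointer, so a context switch
is one pointer swap.

#include <stdio.h>
#include <stdint.h>

static uint16_t ram[4096];          /* all "register" space is plain memory    */
static uint16_t wp;                 /* workspace pointer                        */

static uint16_t reg_read (int n)             { return ram[wp + n]; }
static void     reg_write(int n, uint16_t v) { ram[wp + n] = v;    }

int main(void)
{
    wp = 0x100;                     /* task A's workspace                       */
    reg_write(3, 0xAAAA);

    wp = 0x200;                     /* "context switch": just move the pointer  */
    reg_write(3, 0xBBBB);

    wp = 0x100;                     /* switch back; task A's R3 is still there  */
    printf("task A R3 = %04X\n", reg_read(3));
    return 0;
}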

  It seems to me that, with a good caching scheme, this could work
out to being just about as efficient as 'real' registers.  On the
other hand, I sometimes wonder whether registers aren't actually an
optimizing kludge, and that an instruction set which made no explicit
reference to registers at all might be more efficient.

  Speaking of caches, I seem to recall that single-storage-area caches
are supposed to be more efficient than caches which have separate
areas for different things.  But this would have difficulty dealing
with addresses which have low hit rates but high impact.  One would
almost have to have "cache priority".  Has this ever been done?

/kenw
kenw@noah.arc.ab.ca

baum@Apple.COM (Allen J. Baum) (01/04/91)

[]
>In article <5827@labtam.labtam.oz> scott@labtam.labtam.oz (Scott Colwell) writes:
>On the 486, reg to reg operations take one clock, cached memory to reg take
>two and reg to cached mem take three. By improving this performance and making
>some changes to the cache allocation scheme, cache can and does compensate
>for the lack of general purpose registers.

I'm afraid that I can't let this statement of fact go unchallenged.
I'm assuming it is fact- I have no reason to believe otherwise.
But, the plain facts don't tell all the details, and those details are
significant.

For example, it is easy to make a memory (or cache) access take a single cycle
if you make your cycle slow enough. I will make an assertion here- the critical
path for most processors with cache is the cache access path.

Secondly, you can make horribly complicated instructions take a single
cycle (even for fast cycles) if you make your pipeline deep enough.
Unfortunately, deep pipes have a problem with stalling because of branches and
dependencies when the pipeline gets deep. So, the book says that some
instructions will take only one cycle. What it doesn't say is that for an
average instruction mix the pipeline stalls so much that even with 100%
cache hits you'll get 2.5 cycles per instruction.
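
To put rough numbers on that (the mix and penalties below are invented, not
measurements of any particular chip):

#include <stdio.h>

int main(void)
{
    double base_cpi      = 1.0;    /* what the book quotes                     */
    double branch_freq   = 0.20, branch_penalty = 3.0;  /* deep-pipe refill    */
    double load_use_freq = 0.30, load_use_stall = 3.0;  /* result not ready    */

    double cpi = base_cpi
               + branch_freq   * branch_penalty
               + load_use_freq * load_use_stall;

    printf("effective CPI = %.1f (with 100%% cache hits)\n", cpi);
    return 0;
}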

Be very careful when you think that cache accesses are fast & free- there
isn't a free lunch, and the cost is tricky to measure. Second order
effects (like dependency stalls) are not necessarily down in the noise.

One further point- my sense of compiler technology is that 80x86 compilers
are very good at what they do. However, because of the lack of registers,
there are all sorts of techniques they don't even attempt. This
opportunity cost is not easily measured either.

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

firth@sei.cmu.edu (Robert Firth) (01/04/91)

In article <KENW.91Jan3123808@skyler.calarc.ARC.AB.CA> kenw@skyler.calarc.ARC.AB.CA (Ken Wallewein) writes:

>  Seriously, though, I remember reading the docs for processor in the
>Texas Instruments 99/4A.  As I recall, it had three registers: a
>couple for operands, and one pointer register for a "virtual" register
>set in RAM.  Changing that one register's contents gave one a whole or
>partially different set.  The docs claimed (of course) that this was the
>way of the future, because processor <-> memory bandwidths were
>increasing (hah :-).

From my (rather large) library of computer manuals:

 "2.1.4 Workspace Concept

  The 990 register file is located in memory.  This imposes
  very little speed penalty because the 990 computers use
  fast semiconductor memory.  The classifications "hardware
  register" and "memory" are arbitrary when both are
  semiconductor devices."

[Texas Instruments: 990 Computer Family Systems Handbook, 1976]

Ah, but how else could you build a machine with a 250ns cycle time
that did a full context switch in less than 10us?  False premise,
maybe, but great engineering.

qzhe1@cs.aukuni.ac.nz (Qun Zheng ) (01/05/91)

Is there a modern processor without registers?  If there is, how does it keep
up the speed, and what sort of tricks does it play with its caches?

Chuck
-------
Dept CompSci
Auckland University
NEW ZEALAND

henry@zoo.toronto.edu (Henry Spencer) (01/06/91)

In article <KENW.91Jan3123808@skyler.calarc.ARC.AB.CA> kenw@skyler.calarc.ARC.AB.CA (Ken Wallewein) writes:
>... I sometimes wonder whether registers aren't actually an
>optimizing kludge, and that an instruction set which made no explicit
>reference to registers at all might be more efficient.

"More efficient" in what sense?  The fact is, if you have compilers that
can do intelligent and useful things like common-subexpression optimization,
there is a very high payoff for having a modest number of random-access
variables that are implemented in very fast storage and can be addressed
with very few instruction bits.  These need not be "registers" in the
traditional sense, as witness CRISP's use of special addressing modes
and a special cache for the very top of the current stack frame, but they
have to look a lot like them.

The major alternatives to explicit register references in instructions
are memory-to-memory machines and stack machines ("stack" here in the sense
of expression-evaluation stack, not call stack).  Memory-to-memory machines
lose big on the number of bulky memory addresses needed when expressions
start needing temporaries.  Stack machines lose big on the difficulty of
keeping useful values around for re-use, since the stack really wants to
throw a value away after using it.  Both lose big on the cost of memory
references unless the cache is *very* good.  Given good compilers, register
machines perform better at equivalent complexity:  that easily-accessed
fast temporary storage is very useful.
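
To make the stack-machine problem concrete, here is a toy expression-stack
interpreter (a sketch of mine, with invented encodings) evaluating
(a+b)*c + (a+b)*d.  Even with a DUP to avoid recomputing a+b, the value has
to be shuffled back into position with a SWAP before it can be used again,
and every operand still comes from memory; a register machine just keeps a+b
in a register and names it in both places.

#include <stdio.h>

enum { PUSH, ADD, MUL, DUP, SWAP };          /* minimal stack-machine ops */

static int stack[16], sp, mem_refs, shuffles;

static void run(const int *prog, int n)
{
    for (int i = 0; i < n; i++) {
        switch (prog[i]) {
        case PUSH: stack[sp++] = prog[++i]; mem_refs++;        break;
        case ADD:  sp--; stack[sp-1] += stack[sp];             break;
        case MUL:  sp--; stack[sp-1] *= stack[sp];             break;
        case DUP:  stack[sp] = stack[sp-1]; sp++; shuffles++;  break;
        case SWAP: { int t = stack[sp-1];
                     stack[sp-1] = stack[sp-2];
                     stack[sp-2] = t; shuffles++; }            break;
        }
    }
}

int main(void)
{
    int a = 2, b = 3, c = 5, d = 7;

    /* (a+b)*c + (a+b)*d, reusing (a+b) via DUP and SWAP */
    int prog[] = { PUSH,a, PUSH,b, ADD, DUP, PUSH,c, MUL, SWAP,
                   PUSH,d, MUL, ADD };

    run(prog, sizeof prog / sizeof prog[0]);
    printf("result=%d, operand loads=%d, stack-juggling ops=%d\n",
           stack[0], mem_refs, shuffles);
    return 0;
}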

>  Speaking of caches, I seem to recall that single-storage-area caches
>are supposed to be more efficient than caches which have separate
>areas for different things...

Again, "more efficient" in what sense?  Current experience strongly points
to the reverse, at least for the obvious case of instructions vs. data.
Splitting a cache loses the ability to use any free space to fulfill a new
request, but gains the ability to specialize the hardware and parallelize
data paths.
-- 
"The average pointer, statistically,    |Henry Spencer at U of Toronto Zoology
points somewhere in X." -Hugh Redelmeier| henry@zoo.toronto.edu   utzoo!henry

baum@Apple.COM (Allen J. Baum) (01/06/91)

[]
>In article <1991Jan6.014925.10935@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:

--a nice summary of why registers win--
The reason that caches are used is because memory management is
 -hard to do
 -a pain in the ass
 -best done with run time info

When these are not an issue, it can be done extremely efficiently
by the programmer. Lo and behold, for very small amounts of memory
(like <32 variables), the programmer still has problems, but some
other program (the compiler) can manage it. Hence, registers are extremely
effective.

In general, memory addresses are much bigger than register references, but
in particular they don't have to be, e.g. CRISP which uses a small offset
from some pointer.

As has been pointed out in the past, registers are the top of the
memory hierarchy. The characteristics that make them different from
the other levels: multiple ports, sub-cycle access, and extremely
small size. (generally speaking, of course)

>>  Speaking of caches, I seem to recall that single-storage-area caches
>>are supposed to be more efficient than caches which have separate
>>areas for different things...
>
>Again, "more efficient" in what sense?  Current experience strongly points
>to the reverse, at least for the obvious case of instructions vs. data.
>Splitting a cache loses the ability to use any free space to fulfill a new
>request, but gains the ability to specialize the hardware and parallelize
>data paths.

In the sense that 1 unified cache will have a better hit ratio than
separate I/D caches of half the size each, yes, it is more efficient.
However, hit ratio is not everything. The performance gained by having
split caches be accessed in parallel more than makes up for the fairly
small loss due to extra misses.
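
Rough numbers, purely invented, just to make the tradeoff concrete:

#include <stdio.h>

int main(void)
{
    double mem_ops  = 0.30;    /* loads+stores per instruction              */
    double miss_pen = 10.0;    /* cycles to refill from the next level      */

    /* unified: better hit ratio, but an instruction fetch and a data
       access can't proceed in the same cycle (structural stall)            */
    double unified_miss = 0.03;
    double unified_cpi  = mem_ops * 1.0
                        + (1.0 + mem_ops) * unified_miss * miss_pen;

    /* split I/D: each half-size cache misses a bit more often, but the
       fetch and the data access overlap                                    */
    double split_miss = 0.04;
    double split_cpi  = (1.0 + mem_ops) * split_miss * miss_pen;

    printf("memory CPI: unified %.2f, split %.2f\n", unified_cpi, split_cpi);
    return 0;
}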

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

baum@Apple.COM (Allen J. Baum) (01/06/91)

[]
>In article <qzhe1.663050569@aucs7.cs.aukuni.ac.nz> qzhe1@cs.aukuni.ac.nz (Qun Zheng) writes:
>Is there a modern processor without registers?  If there is, how does it keep
>up the speed, and what sort of tricks does it play with its caches?

The only one I'm aware of is the ATT CRISP, which is a mem-mem
architecture.  The only address modes are: immed, SP-relative,
indirect SP-relative, and absolute (? could be wrong here...PC-rel also?).

By caching the top of stack (i.e., the first 32 locations relative to the SP)
in an extremely small, dual-ported cache, these references look more
like register references, needing only five bits in the
instruction to reference them.

Unlike the usual cache tag mechanism, there is a bounds check, since
the top of stack must be contiguous. In addition, they don't allocate
on a miss- special instructions allocate and deallocate (on procedure
entry/exit). In that sense, it looks like register windows with
variable-size windows.  So, it's sorta like registers, except that it
goes to mem if you access something else. Needless to say, generally
the absolute and indirect modes will miss.

ALSO it will go to the "registers" if you access them through a pointer!
That means you don't have to worry about making sure a variable is in
memory if you might have a pointer to it.
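
Roughly, the hit test is a bounds check rather than a tag compare; a sketch
(the 32-word window matches the five-bit offset, everything else below is a
schematic guess on my part):

#include <stdio.h>
#include <stdint.h>

#define WINDOW_WORDS 32

static uint32_t sp;                 /* the stack pointer                      */

/* any access, SP-relative or through a pointer, hits the stack cache if
   the address lands inside [sp, sp + 32 words)                              */
static int hits_stack_cache(uint32_t addr)
{
    return addr >= sp && addr < sp + WINDOW_WORDS * 4;
}

int main(void)
{
    sp = 0x1000;

    uint32_t local    = sp + 3 * 4;   /* "register" slot 3                    */
    uint32_t *p_local = &local;       /* an address reached via a pointer     */
    uint32_t global   = 0x8000;       /* absolute reference: will miss        */

    printf("local: %d  via pointer: %d  global: %d\n",
           hits_stack_cache(local),
           hits_stack_cache(*p_local),
           hits_stack_cache(global));
    return 0;
}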

For example, an interrupt merely changes the SP. All references then miss,
and it starts running slow, unless the SP gets restored long enough to save
the "registers"- but you don't have to save anything if you don't want to.

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

scott@labtam.labtam.oz (Scott Colwell) (01/07/91)

baum@Apple.COM (Allen J. Baum) writes:
>>In article <5827@labtam.labtam.oz> scott@labtam.labtam.oz (Scott Colwell) writes:
>>On the 486, reg to reg operations take one clock, cached memory to reg take
>>two and reg to cached mem take three. By improving this performance and making
>>some changes to the cache allocation scheme, cache can and does compensate
>>for the lack of general purpose registers.

>I'm afraid that I can't let this statement of fact go unchallenged.
>I'm assuming it is fact- I have no reason to believe otherwise.

Thank you ;-)

>For example, it is easy to make a memory (or cache) access take a single cycle
>if you make your cycle slow enough.
..
>Secondly, you can make horribly complicated instructions take a single
>cycle (even for fast cycles) if you make your pipeline deep enough.
>Unfortunately, deep pipes have a problem with stalling because of branches and
>dependencies when the pipeline gets deep.
..

I accept that all these are valid comments but again using the example
of the 486, Intel have come up with an implementation that goes quickly.
It appears to have integer performance comparable to many of the RISC
processors (although not up with the faster ones). If we accept that the
486 at 33MHz has around 15 vax mips (Intel's quoted integer spec is 17.6,
from the Nov/Dec 1990 issue of Solutions), then we get somewhere around
2.2 clocks per 'vax instruction' (33 MHz / 15 MIPS).

I hate to appear to be supporting a less than perfect architecture but
for Intel, the name of the game is not to make a fast microprocessor
but to make the fastest x86 compatible processor. In this context, I think
that Intel have made good design tradeoffs. Imagine what might be produced if
the silicon technology and the effort that are put into the x86 and 860
groups was put into just one RISC family rather than the two RISC and one
CISC family that Intel have today. (i.e. x86, 860, 960)

-- 
Scott Colwell
Senior Design Engineer
Labtam Australia		net:	scott@labtam.oz.au
Melbourne, Australia 		phone:	+61-3-587-1444

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (01/07/91)

In article <5833@labtam.labtam.oz> scott@labtam.labtam.oz (Scott Colwell) writes:

|                                             Imagine what might be produced if
| the silicon technology and the effort that are put into the x86 and 860
| groups was put into just one RISC family rather than the two RISC and one
| CISC family that Intel have today. (i.e. x86, 860, 960)

  I really don't see that any of the lines you mention are suffering
from lack of resources (maybe the 486), and no one CPU would be likely
to fit all the markets and be priced for mass marketing.

  And you don't have to read this group long to realize that some people
will hate everything Intel does, because Intel does it. Not that this is
unique, some people feel the same about IBM or Apple, CISC, etc.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game. If you win you can use unix.

baum@Apple.COM (Allen J. Baum) (01/08/91)

[]
>In article <5833@labtam.labtam.oz> scott@labtam.labtam.oz (Scott Colwell) writes:
--my comments on using cache as a substitute for more regs--
>
>I hate to appear to be supporting a less than perfect architecture but
>for Intel, the name of the game is not to make a fast microprocessor
>but to make the fastest x86 compatible processor. In this context, I think
>that Intel have made good design tradeoffs.

Well, we may be on the same side. Given the 80x86 ISA, what Intel did
might be the best solution.

What I was arguing was that cache is not a substitute for more regs., although
marketing benchmarks and numbers might make it appear so.


--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/09/91)

On 6 Jan 91 01:49:25 GMT, henry@zoo.toronto.edu (Henry Spencer) said:

henry> The major alternatives to explicit register references in
henry> instructions are memory-to-memory machines and stack machines
henry> ("stack" here in the sense of expression-evaluation stack, not
henry> call stack).  Memory-to-memory machines lose big on the number of
henry> bulky memory addresses needed when expressions start needing
henry> temporaries.  Stack machines lose big on the difficulty of
henry> keeping useful values around for re-use, since the stack really
henry> wants to throw a value away after using it.

I'd like to repeat my usual fundamental truth on this: the latter
statement need not be true, if one has multiple stacks. A little
analysis will show why. Statistics exist that show that almost any
expression and a majority of code sequences can, on a single threaded
CPU (one with one ALU), be compiled without spills with FOUR spare
registers[1].  To add abundant inter-expression caching we need another
four.

This amounts to saying that usually not more than four things try to
happen independently in an expression. Statistics exist that show that
the average expression is exceedingly simple[2], and frequently reused
values are quite few as well.

Now, a single stack architecture loses out badly when two conditions are
true:

1) The lifetimes of the spills do not nest gracefully.

2) The CPU is not single threaded and has multiple ALUs.

A single stack loses badly because in both[3] cases operations can only
happen at the top of stack, and the separate logical (case 1) or
physical (case 2) computation streams interfere and thrash around it.

The effect of case 1) is that you see a lot of seemingly unnecessary
POP, PUSH and DUP operations used just to rearrange the elements around
the top of stack, to bring each computation thread 'in focus' on it.
This happens especially often with Fortran style codes, some of which do
have complicated expressions and address/value stream interference.

The obvious solution is to have multiple tops of stack. The CRISP
architecture does this by allowing a short form of address to the first
32 slots on the stack; this seems to work quite well.

My favourite solution would be to have multiple stacks. How many? FOUR
is the answer, because it is exceedingly rare that an expression
contains as many independent threads of computation. A superscalar
machine may attach independent ALUs to each stack, or even specialize
them[4].

In the old problem of providing accumulators and on-chip cache, the
general purpose register file solution is to hide the accumulators and
make only the on-chip cache visible (thus register renaming), the stack
solution is to have a single accumulator and a stack cache behind it,
per stack.

Note that the stack solution has the benefit that it has better
information encoding for the most common case, as demonstrated by the
impressive code densities of stack based architectures[5].

----------------

[1] Examples:

a+b+c+d		this, on a single threaded CPU, only needs one register
		to be compiled without spills.

a/b + c*d	this instead needs two registers for compiling without
		spilling, because two things are going on at the same
		time (logically).
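
For what it's worth, these counts can be made mechanical with the usual
Sethi-Ullman style numbering (my sketch, assuming a single ALU and that the
right-hand operand of an instruction may come straight from memory):

#include <stdio.h>

struct node { struct node *l, *r; };       /* leaves have l == r == NULL */

static int need(const struct node *n, int is_right)
{
    if (!n->l)                  /* a right-hand leaf can be a memory operand */
        return is_right ? 0 : 1;
    {
        int a = need(n->l, 0), b = need(n->r, 1);
        return a == b ? a + 1 : (a > b ? a : b);
    }
}

int main(void)
{
    struct node a = {0,0}, b = {0,0}, c = {0,0}, d = {0,0};

    /* a+b+c+d, left associative: ((a+b)+c)+d */
    struct node ab = {&a,&b}, abc = {&ab,&c}, abcd = {&abc,&d};

    /* a/b + c*d */
    struct node adivb = {&a,&b}, cmuld = {&c,&d}, sum = {&adivb,&cmuld};

    printf("a+b+c+d   needs %d register(s)\n", need(&abcd, 0));
    printf("a/b + c*d needs %d register(s)\n", need(&sum, 0));
    return 0;
}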

[2] Most expressions, statically, are of the form 'a = b OP c' or
simpler; dynamically things can be slightly more complicated. Note that
the operands may involve address computations.

[3] Case 1) seems to be relatively rare but important; case 2) is more
common than you think, because without being superscalar, many simple
implementations have different ALUs for computing values and addresses.

[4] Indeed in practice superscalar implementations find it exceedingly
hard to find degrees of microparallelism greater than two or three,
unless one is talking codes that are best suited to vector, SIMD or
MIMD architectures. Special purpose architectures for special purpose
codes!

[5] The reason is that most expressions are indeed stack based, i.e. the
lifetimes of spills nest nicely, while register machines assume the
worst case, i.e. all spills have completely unrelated lifetimes and have
to be managed explicitly. A multiple stack machine is a nice compromise.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

anton@bkj386.uucp (Anton Aylward) (01/09/91)

CPU's without registers?

If you mean computational registers, as opposed to internal (non-programmable)
ones, how about the Texas SBP 9900?

/anton
-- 
Anton J Aylward  |  Analysis  |  3355 Don Mills Rd

preston@ariel.rice.edu (Preston Briggs) (01/09/91)

pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

>I'd like to repeat my usual fundamental truth on this:
...

>[1] Examples:
>
>a+b+c+d		this, on a single threaded CPU, only needs one register
>		to be compiled without spills.
>
>a/b + c*d	this instead needs two registers for compiling without
>		spilling, because two things are going on at the same
>		time (logically).
>
>[2] Most expressions, statically, are of the form 'a = b OP c' or
>simpler; dynamically things can be slightly more complicated. Note that
>the operands may involve address computations.

These examples gloss over the number of registers required for addressing
and the amount of reuse available due to addressing.  We all know about
array addressing expressions.  Consider also pointer chasing in C or
Pascal and the addressing of non-local (and non-global) variables
in Algol and descendants.  A large portion of the work done by
various forms of strength reduction, common subexpression elimination
(local and global), and loop invariant code motion involves all these
implied calculations and memory accesses.  In the case of Fortran,
very little of the user's explicit code is ever touched by the optimizer
(mostly because it _is_ very simple, as mentioned above and in
Knuth's "An Empirical Study of FORTRAN Programs" paper).

>[5] The reason is that most expressions are indeed stack based, i.e. the
>lifetimes of spills nest nicely, while register machines assume the
>worst case, i.e. all spills have completely unrelated lifetimes and have
>to be managed explicitly. A multiple stack machine is a nice compromise.

Are there any papers describing how to generate code for such a machine?
It seems as though the many stacks must still be managed explicitly.
Additionally, we must solve the problem of "which items on which stack"
and "when do we pop what values from the stack".

Preston Briggs