[comp.arch] Register Allocation and Aliasing

aglew@oberon.crhc.uiuc.edu (Andy Glew) (07/05/90)

How often are quantities that *might* be allocated to registers *not*
allocated to registers, because of the remote *possibility* of
aliasing via some circuitous pointers?

    Hare brained idea: allocate quantities that *might* be aliased to
registers anyway.  Provide a register to contain the true memory
address of the aliased quantity, which causes a trap when the address
is accessed (or automagically forwards to/from the register).  Not
only are aliasing problems avoided, but you've got a set of data
address breakpoint registers as well! (ie. this technique could be
experimentally evaluated on machines that have data address
breakpoints).

Evaluation framework:
    Na is the number of such data address breakpoint registers provided.
    V(1)..V(Na) are possibly aliased values placed in registers.

    Tr is the performance (CPI) assuming V(1)..V(Na) are all in registers,
	and no aliasing occurs.

    Tm is the performance (CPI) assuming V(1)..V(Na) are placed in 
	memory whenever there is a possibility of aliasing.

    Fa is the frequency of actual aliasing

    Ca is the cost of handling the aliasing, using the data address breakpoints
	(large in SW, less but still significant in HW) 

    Obviously, Tr = Tr( V(1)..V(Na) ) = Tr(Na), and so on.

Then we compare Tr(1-Fa) + Tr(Fa)(1+Ca) to Tm for a particular Na.
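
A minimal C sketch of the comparison (all four parameter values below
are made up, just to show the arithmetic):

	#include <stdio.h>

	int main(void)
	{
	    double Tr = 1.0;   /* CPI, V(1)..V(Na) in registers, no aliasing */
	    double Tm = 1.3;   /* CPI, aliasable values kept in memory */
	    double Fa = 0.01;  /* frequency of actual aliasing */
	    double Ca = 50.0;  /* cost of handling one aliasing event */

	    /* the non-aliased fraction runs at Tr; the aliased
	       fraction pays the breakpoint-handling cost Ca */
	    double Teff = Tr*(1.0 - Fa) + Tr*Fa*(1.0 + Ca);

	    printf("Teff = %.3f vs Tm = %.3f\n", Teff, Tm);
	    return 0;
	}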

Unfortunately, I don't have any idea of what the parameters are like.
I have the impression that Fa is low, and I can ballpark Ca. But what
is Tm/Tr? Given the amount of fuss (number of papers) about aliasing
in compilers, Tm/Tr would seem to be large, but I'm not sure,
especially as Na is varied.


(Well, let's see if I've embarrassed myself as much as I did in my last
posting, where I said that 24-10=4. ;-} )
--
Andy Glew, aglew@uiuc.edu

rfg@ncd.com (Ron Guilmette) (07/07/90)

In article <1990Jul05.155937.13214@esegue.segue.boston.ma.us> you write:
>How often are quantities that *might* be allocated to registers *not*
>allocated to registers, because of the remote *possibility* of
>aliasing via some circuitous pointers?
>
>    Hare brained idea: allocate quantities that *might* be aliased to
>registers anyway.  Provide a register to contain the true memory
>address of the aliased quantity, which causes a trap when the address
>is accessed (or automagically forwards to/from the register).  Not
>only are aliasing problems avoided, but you've got a set of data
>address breakpoint registers as well! (ie. this technique could be
>experimentally evaluated on machines that have data address
>breakpoints).

Actually, this sounds like a marvelous idea to me!

I guess that the only drawback would be the cost of the additional hardware
to support the data breakpoints, but (as you noted) some machines already
have such facilities, so it can't be THAT expensive.

Offhand, I'd say that this could potentially work really well, except that
I think that you would want (a) a set of data-breakpoint registers which
had a one-to-one relationship with the actual registers (but you probably
only need to cover about half of them for a typical 32 register RISC machine)
and (b) special load and store instructions for potentially aliased data
values.

The special load instruction would not only load the data value into the
specified data register, it would also force the address used for the load
into the (associated) data breakpoint register.  Finally, it would set a
"valid" bit associated with the given data-breakpoint register.

The special store instruction would simply unset the "valid" bit for the
specified data-breakpoint register (thus preventing further data accesses
to the address indicated in the given breakpoint register from being
treated as if the data value in the data register were still the valid,
most up-to-date one).
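
In rough C, the bookkeeping these two instructions would do looks
something like this (the names, and the check applied to ordinary
accesses, are just my sketch, not a real ISA):

	#include <stdbool.h>
	#include <stdint.h>

	#define NBKPT 16	/* assumed number of breakpoint registers */

	struct bkpt { const uint32_t *addr; bool valid; };
	static struct bkpt bkpt[NBKPT];

	/* special load: fetch the value, latch its address, set "valid" */
	uint32_t load_aliased(const uint32_t *p, int b)
	{
	    bkpt[b].addr  = p;
	    bkpt[b].valid = true;
	    return *p;
	}

	/* special store: write back and clear "valid", so later accesses
	   to this address no longer forward from the data register */
	void store_aliased(uint32_t *p, uint32_t v, int b)
	{
	    *p = v;
	    bkpt[b].valid = false;
	}

	/* every ordinary access is checked against the valid entries;
	   a hit traps (or forwards to/from the register automagically) */
	bool hits_breakpoint(const uint32_t *p)
	{
	    for (int b = 0; b < NBKPT; b++)
	        if (bkpt[b].valid && bkpt[b].addr == p)
	            return true;
	    return false;
	}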

The idea of getting the hardware to automagically satisfy references to
temporarily "register resident" data values directly from the registers
would eliminate the overhead of the (software) handling of breakpoints,
and would make this idea even better.

As I said, I think that it would *not* be necessary to associate one data-
breakpoint register with each actual data register.  Keep in mind that
registers often hold computational results as well as simple copies of
variable values.  (And don't forget the registers like PC, SP, and FP.
These don't need the special treatment either.)

Having a machine that did this stuff would effectively render alias analysis
(in compilers) "impotent and obsolete".

> Given the amount of fuss (number of papers) about aliasing in compilers...

It is certainly a pain in the $#^%$$^$%$ to do good alias analysis statically
in software (as opposed to dynamically and in hardware) especially for
everybody's current favorite language, C.  Further, I'd have to guess that
static (compile-time) mechanisms for dealing with (i.e. preventing) aliasing
problems could NEVER match the performance characteristics of the hardware-
based dynamic scheme you have proposed.

In short... I like it.  Too bad I don't build chips!  I just program the
suckers. :-)

-- 
// Ron Guilmette (rfg@ncd.com)

alvitar@xavax.com (Phillip Harbison) (07/09/90)

In article <1990Jul05.155937.13214@esegue.segue.boston.ma.us>
aglew@oberon.crhc.uiuc.edu (Andy Glew) writes:
> How often are quantities that *might* be allocated to registers *not*
> allocated to registers, because of the remote *possibility* of
> aliasing via some circuitous pointers?
>
>   Hare brained idea: allocate quantities that *might* be aliased to
> registers anyway.  Provide a register to contain the true memory
> address of the aliased quantity, which causes a trap when the address
> is accessed (or automagically forwards to/from the register).

This sounds like an interesting idea.  What if we just added a tag
to each register in the CPU.  The tag would store the address which
the register represents.  A valid bit would be set if the tag was
in use, and cleared for empty or scratch registers.  A dirty bit
could be set when the register was not consistent with memory.  All
load/store instructions would have to check the tags, so the tags
would have to be fully associative.  Registers with the dirty bit
set could be automatically written back when they were reused.
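
In rough C, the tag check on a store might look like this (the field
names and the linear scan are illustrative; in hardware the scan would
be a fully associative match):

	#include <stdbool.h>
	#include <stdint.h>

	struct tagged_reg {
	    uint32_t  value;
	    uint32_t *tag;	/* address this register represents */
	    bool      valid;	/* tag in use? */
	    bool      dirty;	/* register newer than memory? */
	};

	static struct tagged_reg r[32];

	/* a store that hits a tag updates the register and marks it
	   dirty instead of going to memory */
	void store_word(uint32_t *p, uint32_t v)
	{
	    for (int i = 0; i < 32; i++)
	        if (r[i].valid && r[i].tag == p) {
	            r[i].value = v;
	            r[i].dirty = true;
	            return;
	        }
	    *p = v;
	}

	/* reusing a register for a new address writes back a dirty value */
	void reuse_reg(int i, uint32_t *newtag)
	{
	    if (r[i].valid && r[i].dirty)
	        *r[i].tag = r[i].value;
	    r[i].tag = newtag;
	    r[i].valid = true;
	    r[i].dirty = false;
	}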

-- 
Live: Phil Harbison, Xavax, P.O. Box 7413, Huntsville, AL 35807
Uucp: alvitar@xavax.com
Bell: 205-883-4233, 205-880-8951

baum@Apple.COM (Allen J. Baum) (07/10/90)

[]
>In article <1990Jul9.073203.3358@xavax.com> alvitar@xavax.com (Phillip Harbison) writes:
....
>This sounds like an interesting idea.  What if we just added a tag
>to each register in the CPU.  The tag would store the address which
>the register represents.

Then again, why not just use AT&T CRISP chips? Their 'register' file is
really a 'top-of-stack' cache. CRISP is a memory-to-memory architecture,
so there really are no registers, but one of its four addressing modes
is stack-pointer-relative, with a 5 bit displacement, which effectively
gives you 32 registers. If you go indirect through these 'registers', and
it hits in the cache, it gets it from the 'registers' automagically.
Procedure calls can advance the stack pointer, which gives you variable
size register windows.
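
In effect (a sketch; sp and mem_read are stand-ins for the real
mechanism):

	#include <stdint.h>

	extern uint32_t sp;			/* stack pointer */
	extern uint32_t mem_read(uint32_t addr);

	/* 'register' k is just the stack word at sp + 4*k, with k a
	   5-bit displacement; a hit in the top-of-stack cache makes
	   this as fast as a register read */
	uint32_t read_reg(unsigned k)		/* 0 <= k < 32 */
	{
	    return mem_read(sp + 4*k);
	}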

--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

preston@titan.rice.edu (Preston Briggs) (07/10/90)

Andy Glew writes:
>>    Hare brained idea: allocate quantities that *might* be aliased to
>>registers anyway.  Provide a register to contain the true memory
>>address of the aliased quantity, which causes a trap when the address
>>is accessed (or automagically forwards to/from the register).  Not
>>only are aliasing problems avoided, but you've got a set of data
>>address breakpoint registers as well! (ie. this technique could be
>>experimentally evaluated on machines that have data address
>>breakpoints).

And Ron Guilmette replied:
>Actually, this sounds like a marvelous idea to me!
...
>Having a machine that did this stuff would effectively render alias analysis
>(in compilers) "impotent and obsolete".

I disagree.  Alias analysis has many uses during optimization.
For example, during value numbering (with constant folding),
this example might arise:

	*p = 5;
	*q = -5;	/* does this change *p's value or not? */
	if (*p > 0) {
	    ...
	}
	else {
	    ...
	}

With adequate alias analysis, we might be able to determine that p and q
don't point at the same location; therefore, we'd be able to eliminate
the test and the entire else clause.  With inadequate alias analysis
(or none at all), we must be conservative and emit code for the test.
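
For instance, once we know p and q are distinct, *p must still be 5
(hence greater than 0) at the test, so the fragment folds to just:

	*p = 5;
	*q = -5;
	...		/* the then-clause; test and else clause are gone */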

It's easy to construct similar examples with common subexpression 
elimination, loop invariant code motion, &c.

The register allocation question is a little more difficult.
But consider that many of the values kept
in registers are the result of common subexpression elimination or 
loop invariant code motion and you can get a hint of the problem.  
Here's a (very contrived) example:

	do {
	    *p = a[*q];
	    p++;
	} while (p < pp);

Now what we'd really like is to hold a[*q] in a register throughout the
loop, giving

	r = a[*q];
	do {
	    *p = r;
	    p++;
	} while (p < pp);

But what if p = q at some point during the loop?  Boom!
Note that we're not keeping *q in a register, so there's
no hint from the proposed trap hardware.

Basically, it seems as though we can't keep the results
of any expression that includes aliased elements in a register.
More to the point, we can't do any (?) optimization for
such expressions.

Phil Harbison writes:
>This sounds like an interesting idea.  What if we just added a tag
>to each register in the CPU.  The tag would store the address which
>the register represents.  A valid bit would be set if the tag was
>in use, and cleared for empty or scratch registers.  A dirty bit
>could be set when the register was not consistent with memory.  All
>load/store instructions would have to check the tags, so the tags
>would have to be fully associative.  Registers with the dirty bit
>set could be automatically written back when they were reused.

The whole scheme strikes me as "cache."  Hank Dietz has argued
(I believe) that the decision to keep a variable in cache or
register should be based on whether or not it is aliased.

For data breakpoints, perhaps there's an easy modification
of cache hardware?

(BTW, a similar post will appear someday on comp.compilers,
but it's been delayed somewhat.)

--
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

chc@briar.Philips.Com (Chi-Hung Chi) (07/10/90)

>Phil Harbison writes:
>>This sounds like an interesting idea.  What if we just added a tag
>>to each register in the CPU.  The tag would store the address which
>>the register represents.  A valid bit would be set if the tag was
>>in use, and cleared for empty or scratch registers.  A dirty bit
>>could be set when the register was not consistent with memory.  All
>>load/store instructions would have to check the tags, so the tags
>>would have to be fully associative.  Registers with the dirty bit
>>set could be automatically written back when they were reused.

This is what my thesis advisor (Hank Dietz) and I call the "CReg"
structure (Cache Register).  Two years ago, we had a preliminary
paper about CRegs which appeared in Supercomputing '88. Some updated
technical reports about the implementation and allocation of CRegs,
and their extension to multiprocessors, are under preparation and
should be ready very soon.

Preston Briggs writes:
>The whole scheme strikes me as "cache."  Hank Dietz has argued
>(I believe) that the decision to keep a variable in cache or
>register should be based on whether or not it is aliased.
>
>For data breakpoints, perhaps there's an easy modification
>of cache hardware?

This is the primary functional difference between registers and cache
from a compiler's viewpoint. See my thesis for details.

- Chi-Hung Chi
  (chc@philabs.philips.com)

aglew@oberon.crhc.uiuc.edu (Andy Glew) (07/11/90)

I doubt that you would want to add a tag to all registers.  It could
be a distribution property: one breakpoint might be enough to cover 90%
of the values that can't be proven alias-free.

"Could be" - needs quantitative analysis. I'm hoping that some compiler
jock may already have some numbers.

--
Andy Glew, aglew@uiuc.edu

mash@mips.COM (John Mashey) (07/15/90)

In article <1990Jul06.194618.4957@esegue.segue.boston.ma.us> rfg@ncd.com (Ron Guilmette) writes:

>>    Hare brained idea: allocate quantities that *might* be aliased to
>>registers anyway.  Provide a register to contain the true memory
>>address of the aliased quantity, which causes a trap when the address
>>is accessed (or automagically forwards to/from the register).  Not
>>only are aliasing problems avoided, but you've got a set of data
>>address breakpoint registers as well! (ie. this technique could be
>>experimentally evaluated on machines that have data address
>>breakpoints).

Some of this sounds interesting, and some may be useful in the future,
for various applications.  However, one must be careful, especially
in a world of LIW, super-scalar, super-pipelined, and super-scalar-super-
pipelined multiple-issue machines (i.e., all RISCs that expect to be
competitive in the next few years), that you don't stick something
in a critical path that blows your cycle time by 50%....

Maybe this is a good time to expound a little on a related widespread fantasy
that might be called:
	When You Have A Zillion Transistors On A Chip, All Of Your
		Problems Go Away.
Most of the following is an over-simplified discussion of EE stuff from a
software guy's viewpoint; maybe some real VLSI types will correct goofs and
expound more on this topic:


It is clear that more space on a die helps a lot, and lets
you do things like:
	bigger on-chip caches
		a wonderful thing: regular, dense, and made of
		transistors rather than wires;
		this includes: I-caches, D-caches, TLBs, branch-target
		buffers, pre-decoded instruction buffers, etc.
	monster-fast FP units and other arithmetic units
		for some kinds of units (like multipliers),
		more space ==> faster, reduces latency of operation,
		always a good thing.
	more copies of functional units, or more pipelining
		increases the repeat rate for an operation, which may help
		some kinds of things.
	wider busses, increasing intra-chip bandwidth.
On the other hand, there are some nasty facts of life (for CMOS, anyway):
	1) FAN-OUT is not free.  Put another way, the more loads on a bus,
	the slower it is.  Bigger transistors help, up to a point, but
	what usually happens is that you must cascade the gates to keep
	the total delay minimized.
	2) FAN-IN is not free either.
	3) WIRES DON'T SHRINK as fast as transistors (because their resistance
	increases as they get narrower).  Hence, as you shrink the process to
	increase speed and fit more on a chip, the wires gobble up more of
	the space.
Put another way:
	1) The more things listening to you, the slower you are.
	2) The more things you listen to, the slower you are.
	3) Don't think you can run monster busses all over the place for free.
All of this says that people STILL have to think very hard about
delays in the critical paths in a CPU.  The faster you go, the more
things you're likely to be doing in parallel, but if you're not careful,
these factors can bite you badly, especially in a single-chip
design that can only dissipate so much heat.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

hankd@dynamo.ecn.purdue.edu (Hank Dietz) (07/15/90)

In article <1990Jul05.155937.13214@esegue.segue.boston.ma.us> aglew@oberon.crhc.uiuc.edu (Andy Glew) writes:
>How often are quantities that *might* be allocated to registers *not*
>allocated to registers, because of the remote *possibility* of
>aliasing via some circuitous pointers?
>
>    Hare brained idea: allocate quantities that *might* be aliased to
>registers anyway.  Provide a register to contain the true memory
>address of the aliased quantity, which causes a trap when the address
>is accessed (or automagically forwards to/from the register).

If you look back in the IEEE proceedings from Supercomputing '88, you'll see
a paper that I and Chi-Hung Chi did addressing exactly this issue (the title
is something like "CRegs: A New Type of Memory for Accessing Arrays and
Pointers").  The CReg mechanism augments registers with address fields which
are used associatively to keep the register contents coherent...  we also
pointed out that with proper compiler analysis, the associativity could be
limited to a rather small set size without ill effect (e.g., a set size of 4
rather than fully associative -- this keeps the speed of CRegs competitive
with that of ordinary registers).

>Evaluation framework:
...
>Unfortunately, I don't have any idea of what the parameters are like.

Some preliminary results on CRegs appear in the aforementioned paper.
The results were pretty good.  However, some of the most significant
benefits are not the obvious ones.  E.g., you don't need to do
explicit stores as often (ever?) -- you can even use a dirty bit and a
lazy store mechanism.  Likewise, you don't need as many loads, and you
certainly don't need to fetch as many addresses, which reduces the
INSTRUCTION FETCH bandwidth required.
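
In rough C, the limited-associativity lookup is something like this
(the names and the set-index hash are illustrative, not from the
paper):

	#include <stdint.h>

	#define SETS	8
	#define SETSIZE	4	/* the small set size mentioned above */

	struct creg { const uint32_t *addr; uint32_t value;
	              int valid, dirty; };
	static struct creg cregs[SETS][SETSIZE];

	/* only the 4 entries of one set are compared associatively,
	   which is what keeps CReg access competitive with registers */
	struct creg *creg_lookup(const uint32_t *p)
	{
	    struct creg *set = cregs[((uintptr_t)p >> 2) % SETS];
	    for (int i = 0; i < SETSIZE; i++)
	        if (set[i].valid && set[i].addr == p)
	            return &set[i];	/* hit: coherent copy in a CReg */
	    return 0;			/* miss: go to memory */
	}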

					-hankd@ecn.purdue.edu

PS: Chi and I had actually tried to get this stuff patented back in
    1987, and it looked like we had some very nice claims, but Purdue
    University basically put it on hold until it had been over a year
    since public disclosure...  oh well.

PPS: We've also looked at using CRegs for instructions (vs. data).

torbenm@diku.dk (Torben Ægidius Mogensen) (07/15/90)

rfg@ncd.com (Ron Guilmette) writes:

>In article <1990Jul05.155937.13214@esegue.segue.boston.ma.us> you write:
>>How often are quantities that *might* be allocated to registers *not*
>>allocated to registers, because of the remote *possibility* of
>>aliasing via some circuitous pointers?
>>
>>    Hare brained idea: allocate quantities that *might* be aliased to
>>registers anyway.  Provide a register to contain the true memory
>>address of the aliased quantity, which causes a trap when the address
>>is accessed (or automagically forwards to/from the register).  Not
>>only are aliasing problems avoided, but you've got a set of data
>>address breakpoint registers as well! (ie. this technique could be
>>experimentally evaluated on machines that have data address
>>breakpoints).

>Actually, this sounds like a marvelous idea to me!

>I guess that the only drawback would be the cost of the additional hardware
>to support the data breakpoints, but (as you noted) some machines already
>have such facilities, so it can't be THAT expensive.

There are two other (simpler) ways to circumvent the aliasing problem
in a processor architecture.

One is to assign fixed addresses to registers, e.g.  let the first N
addresses refer to the N registers.  The processor would recognize
that all the top bits in the address are 0, and use registers instead.
I believe this was done in some PDP 11s.  [actually, the PDP-6/10/20 -John]
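
The decode is trivial; in C, something like (N and the array names are
just placeholders):

	#include <stdint.h>

	enum { N = 16 };	/* number of register-mapped addresses */
	extern uint32_t regfile[N], memory[1 << 20];

	uint32_t read_word(uint32_t addr)
	{
	    /* all top bits zero => the address names a register */
	    return (addr < N) ? regfile[addr] : memory[addr];
	}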

The other is to have no data registers at all, but keep everything in
memory.  This is not as costly as it may sound, as cache memory can be
accessed (almost) as fast as register memory.  An added benefit is fast
process switches.  Register allocation becomes a non-issue, and register
saving/loading in procedure calls is done invisibly on a "call-by-need"
basis, as cache cells are overwritten.  The only problem is address
size in instructions, but this can be solved in several ways.  One is
to use stacktop-relative addressing similar to what Simon Peyton
Jones suggested recently.

Another way is the method used in the TI 9900/99000 series: use a pointer
to a block of memory that can be addressed using short addresses.  This
block (the "workspace") held the 16 registers.  This series of
processors was actually quite interesting, and it is a shame that TI
dropped it in favor of the 8086.  (Some people may remember the TI 99/4
home computer, which used the earlier TMS9900 processor.  It was
notoriously slow, but this was (I have heard) mainly due to bad
programming and system design.)

	Torben Mogensen (torbenm@diku.dk)

mike@vlsivie.at (Inst.f.Techn.Informatik) (07/16/90)

In article <1990Jul06.194618.4957@esegue.segue.boston.ma.us>, rfg@ncd.com (Ron Guilmette) writes:
> In article <1990Jul05.155937.13214@esegue.segue.boston.ma.us> you write:
> >...
> >    Hare brained idea: allocate quantities that *might* be aliased to
> >registers anyway.  Provide a register to contain the true memory
> >address of the aliased quantity, ...
> 
> Actually, this sounds like a marvelous idea to me!

It most certainly is a marvelous idea! And it works! And it has been
implemented. Here's a short description:
As opposed to the suggestions made in these postings, there is no overhead
from extra instructions for loading the associated addresses into breakpoint
registers. It is done automagically. It also uses a dynamic allocation
scheme for the register values and can use *LOTS* of registers
(anywhere from 8 KB to 64 KB).
IT EVEN HAS A NAME: it's called ``cache memory''. Access times are short 
and if integrated on the chip can be as fast as a register access.

				bye,
					mike :-)

Michael K. Gschwind             mike@vlsivie.at
Technical University, Vienna    mike@vlsivie.uucp
Voice: (++43).1.58801 8144      e182202@awituw01.bitnet
Fax:   (++43).1.569697

aglew@dwarfs.crhc.uiuc.edu (Andy Glew) (07/16/90)

>IT EVEN HAS A NAME: it's called ``cache memory''. Access times are short 
>and if integrated on the chip can be as fast as a register access.

As I am sure Mr. Gschwind knows, a cache implemented on-chip is not as
fast as register access, chiefly because caches are seldom multiported,
as well as for other reasons such as loading and the number of
addresses that can be specified in an instruction.

If you will, we are talking about the space formed by the
cross-product of the following parameters:

    ADDRESSING = by memory address and/or by register number

    MULTIPORTING = 1r/w .. 2r/1w ...

    SIZE/SPEED = large/slow .. small/fast

The standard configurations are 

    CACHE = addressed by memory address (associatively)
    	    single-ported
    	    large/slow (at least wrt. registers)

    REGISTER = addressed by number
    	       multiported
    	       small/fast

What I was asking about, and what Hank Dietz et al, and others, have
explored, is whether there are other configurations between these two
extremes that might be useful.  

--
Andy Glew, aglew@uiuc.edu

vestal@src.honeywell.com (Steve Vestal) (07/20/90)

In article <1990Jul15.205606.19343@esegue.segue.boston.ma.us> mike@vlsivie.at (Inst.f.Techn.Informatik) writes:
>   IT EVEN HAS A NAME: it's called ``cache memory''. Access times are short 
>   and if integrated on the chip can be as fast as a register access.

CRegs do have some appeal over caches for hard real-time applications.  Cache
access times can vary; many developers of hard real-time systems just disable
them.  In contrast, my impression is that CReg access times can be as
predictable as normal register access times.  This helps when determining the
worst-case execution time for a code fragment.

The problem of processor design for reliable hard real-time applications is
interesting (if often ignored).  Commercial chip developers tend to minimize
average execution time instead of minimizing worst-case execution time or
minimizing execution time variability.  Things like caches, the use of
data-dependent algorithms for arithmetic, register windowing in a way that
makes procedure call times variable, etc., cause some problems.

Does anyone have benchmark results for the current crop of microprocessors
where all caching has been disabled?

Steve Vestal
Mail: Honeywell S&RC MN65-2100, 3660 Technology Drive, Minneapolis MN 55418 
Phone: (612) 782-7049                    Internet: vestal@src.honeywell.com
