[comp.compilers] Register Allocation and Aliasing

aglew@oberon.crhc.uiuc.edu (Andy Glew) (07/05/90)

How often are quantities that *might* be allocated to registers *not*
allocated to registers, because of the remote *possibility* of
aliasing via some circuitous pointers?

    Hare brained idea: allocate quantities that *might* be aliased to
registers anyway.  Provide a register to contain the true memory
address of the aliased quantity, which causes a trap when the address
is accessed (or automagically forwards to/from the register).  Not
only are aliasing problems avoided, but you've got a set of data
address breakpoint registers as well! (ie. this technique could be
experimentally evaluated on machines that have data address
breakpoints).

Evaluation framework:
    Na is the number of such data address breakpoint registers provided.
    V(1)..V(Na) are possibly aliased values placed in registers.

    Tr is the performance (CPI) assuming V(1)..V(Na) are all in registers,
	and no aliasing occurs.

    Tm is the performance (CPI) assuming V(1)..V(Na) are placed in 
	memory whenever there is a possibility of aliasing.

    Fa is the frequency of actual aliasing

    Ca is the cost of handling the aliasing, using the data address breakpoints
	(large in SW, less but still significant in HW) 

    Obviously, Tr = Tr( V(1)..V(Na) ) = Tr(Na), and so on.

Then we compare Tr(1-Fa) + Tr(Fa)(1+Ca) to Tm for a particular Na.

Unfortunately, I don't have any idea of what the parameters are like.
I have the impression that Fa is low, and I can ballpark Ca. But what
is Tm/Tr? Given the amount of fuss (number of papers) about aliasing
in compilers, Tm/Tr would seem to be large, but I'm not sure.
Especially as Na is varied?
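
(For concreteness, the left-hand side above is just Tr(1 + Fa*Ca), so the
breakpoint scheme wins whenever Tm/Tr > 1 + Fa*Ca.  A throwaway C sketch of
that arithmetic -- every number in it is a made-up placeholder, not a
measurement:)

    #include <stdio.h>

    /* Expected CPI with V(1)..V(Na) kept in registers: Tr when no alias
       actually occurs (probability 1-Fa), Tr*(1+Ca) when one does. */
    double cpi_registers(double Tr, double Fa, double Ca)
    {
        return Tr * (1.0 - Fa) + Tr * Fa * (1.0 + Ca);
    }

    int main(void)
    {
        double Tr = 1.0;    /* placeholder: CPI with values in registers  */
        double Tm = 1.3;    /* placeholder: CPI with values left in memory */
        double Fa = 0.01;   /* guess: frequency of actual aliasing         */
        double Ca = 20.0;   /* guess: relative cost of handling one alias  */

        double Ts = cpi_registers(Tr, Fa, Ca);
        printf("registers+breakpoints: %.3f CPI   memory: %.3f CPI\n", Ts, Tm);
        /* Equivalently, the scheme wins whenever Tm/Tr > 1 + Fa*Ca. */
        return 0;
    }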


(Well, let's see if I've embarrassed myself as much as I did in my last
posting, where I said that 24-10=4. ;-} )
--
Andy Glew, aglew@uiuc.edu
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus| world}!esegue.  Meta-mail to compilers-request@esegue.

rfg@ncd.com (Ron Guilmette) (07/07/90)

In article <1990Jul05.155937.13214@esegue.segue.boston.ma.us> you write:
>How often are quantities that *might* be allocated to registers *not*
>allocated to registers, because of the remote *possibility* of
>aliasing via some circuitous pointers?
>
>    Hare brained idea: allocate quantities that *might* be aliased to
>registers anyway.  Provide a register to contain the true memory
>address of the aliased quantity, which causes a trap when the address
>is accessed (or automagically forwards to/from the register).  Not
>only are aliasing problems avoided, but you've got a set of data
>address breakpoint registers as well! (ie. this technique could be
>experimentally evaluated on machines that have data address
>breakpoints).

Actually, this sounds like a marvelous idea to me!

I guess that the only drawback would be the cost of the additional hardware
to support the data breakpoints, but (as you noted) some machines already
have such facilities, so it can't be THAT expensive.

Offhand, I'd say that this could potentially work really well, except that
I think that you would want (a) a set of data-breakpoint registers which
had a one-to-one relationship with the actual registers (but you probably
only need to cover about half of them for a typical 32 register RISC machine)
and (b) special load and store instructions for potentially aliased data
values.

The special load instruction would not only load the data value into the
specified data register, it would also force the address used for the load
into the (associated) data breakpoint register.  Finally, it would set a
"valid" bit associated with the given data-breakpoint register.

The special store instruction would simply unset the "valid" bit for the
specified data-breakpoint register (thus preventing further data accesses
to the address indicated in the given breakpoint-register from being treated
as if the data-value in the data-register were still the valid (and most
up-to-date) one).

The idea of getting the hardware to automagically satisfy references to
temporarily "register resident" data values directly from the registers
would eliminate the overhead of the (software) handling of breakpoints,
and would make this idea even better.

As I said, I think that it would *not* be necessary to associate one data-
breakpoint register with each actual data register.  Keep in mind that
registers often hold computational results as well as simple copies of
variable values.  (And don't forget the registers like PC, SP, and FP.
These don't need the special treatment either.)
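
To pin down what I mean, here is a rough C-level model of the special
load/store pair and the breakpoint check.  The register counts, the names,
and the fixed pairing of the first 16 data registers with breakpoint
registers are all invented for illustration; this is a sketch of the
proposal, not a real ISA:

    #define NUM_BP 16            /* covers about half of 32 registers       */

    typedef struct {
        unsigned long addr;      /* memory address being shadowed           */
        int           valid;     /* set by special load, cleared by store   */
    } bp_reg;

    static bp_reg bp[NUM_BP];
    static long   reg[32];       /* the ordinary data registers             */
    static long   memory[1024];  /* stand-in for main memory                */

    static long mem_read(unsigned long a)          { return memory[a]; }
    static void mem_write(unsigned long a, long v) { memory[a] = v; }

    /* Special load: load the value, capture the address, mark it valid.
       Breakpoint register r is hard-wired to data register r (r < NUM_BP). */
    void load_aliased(int r, unsigned long addr)
    {
        reg[r]      = mem_read(addr);
        bp[r].addr  = addr;
        bp[r].valid = 1;
    }

    /* Special store: write the value back and stop shadowing the address.  */
    void store_aliased(int r, unsigned long addr)
    {
        mem_write(addr, reg[r]);
        bp[r].valid = 0;
    }

    /* Every ordinary memory access is checked against the valid breakpoint
       registers; a hit either traps to software or is forwarded to/from
       the associated data register by the hardware.                        */
    int bp_hit(unsigned long addr)
    {
        for (int r = 0; r < NUM_BP; r++)
            if (bp[r].valid && bp[r].addr == addr)
                return r;        /* which data register holds the value     */
        return -1;
    }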

Having a machine that did this stuff would effectively render alias analysis
(in compilers) "impotent and obsolete".

> Given the amount of fuss (number of papers) about aliasing in compilers...

It is certainly a pain in the $#^%$$^$%$ to do good alias analysis statically
in software (as opposed to dynamically and in hardware) especially for
everybody's current favorite language, C.  Further, I'd have to guess that
static (compile-time) mechanisms for dealing with (i.e. preventing) aliasing
problems could NEVER match the performance characteristics of the hardware-
based dynamic scheme you have proposed.

In short... I like it.  Too bad I don't build chips!  I just program the
suckers. :-)

-- 
// Ron Guilmette (rfg@ncd.com)
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus| world}!esegue.  Meta-mail to compilers-request@esegue.

preston@rice.edu (Preston Briggs) (07/15/90)

Andy Glew writes:
>>    Hare brained idea: allocate quantities that *might* be aliased to
>>registers anyway.  Provide a register to contain the true memory
>>address of the aliased quantity, which causes a trap when the address
>>is accessed (or automagically forwards to/from the register).  Not
>>only are aliasing problems avoided, but you've got a set of data
>>address breakpoint registers as well! (ie. this technique could be
>>experimentally evaluated on machines that have data address
>>breakpoints).

And Ron Guilmette replied:
>Actually, this sounds like a marvelous idea to me!
...
>Having a machine that did this stuff would effectively render alias analysis
>(in compilers) "impotent and obsolete".

I disagree.  Alias analysis has many uses during optimization.
For example, during value numbering (with constant folding),
this example might arise:

	*p = 5;
	*q = -5;	/* does this change *p's value or not? */
	if (*p > 0) {
	    ...
	}
	else {
	    ...
	}

With awesome alias analysis, we might be able to determine that p and q
don't point at the same location; therefore, we'd be able to eliminate
the test and the entire else clause.  With inadequate alias analysis
(or none at all), we must be conservative and emit code for the test.
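
(If the analysis proves that p and q never point at the same location, the
fragment folds down to just the then-clause:)

	*p = 5;
	*q = -5;	/* cannot touch *p */
	...		/* then-clause only; the test and the else-clause are gone */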

It's easy to construct similar examples with common subexpression 
elimination, loop invariant code motion, &c.

The register allocation question is more difficult to dispose of with
a simple example.  But consider that many of the values kept
in registers are the result of common subexpression elimination or 
loop invariant code motion and you can get a hint of the problem.  
Here's a (very contrived) example:

	do {
	    *p = a[*q];
	    p++;
	} while (p < pp);

Now what we'd really like is to hold a[*q] in a register throughout the
loop, giving

	r = a[*q];
	do {
	    *p = r;
	    p++;
	} while (p < pp);

But what if p = q at some point during the loop?  Boom!
Note that we're not keeping *q in a register, so there's
no hint from the proposed trap hardware.

Basically, it seems as though we can't keep the results
of any expression that includes aliased elements in a register.
More to the point, we can't do any (?) optimization for
such expressions.

The whole scheme strikes me as "cache."  Hank Dietz has argued
(I believe) that the decision to keep a variable in cache or
register should be based on whether or not it is aliased.

For data breakpoints, perhaps there's an easy modification
of cache hardware?

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus| world}!esegue.  Meta-mail to compilers-request@esegue.

mash@mips.COM (John Mashey) (07/15/90)

In article <1990Jul06.194618.4957@esegue.segue.boston.ma.us> rfg@ncd.com (Ron Guilmette) writes:

>>    Hare brained idea: allocate quantities that *might* be aliased to
>>registers anyway.  Provide a register to contain the true memory
>>address of the aliased quantity, which causes a trap when the address
>>is accessed (or automagically forwards to/from the register).  Not
>>only are aliasing problems avoided, but you've got a set of data
>>address breakpoint registers as well! (ie. this technique could be
>>experimentally evaluated on machines that have data address
>>breakpoints).

Some of this sounds interesting, and some may be useful in the future,
for various applications.  However, one must be careful, especially
in a world of LIW, super-scalar, super-pipelined, and super-scalar-super-
pipelined multiple-issue machines (i.e., all RISCs that expect to be
competitive in the next few years), that you don't stick something
in a critical path that blows your cycle time by 50%....

Maybe this is a good time to expound a little on a related widespread fantasy
that might be called:
	When You Have A Zillion Transistors On A Chip, All Of Your
		Problems Go Away.
Most of the following is over-simplified discussion of EE stuff from a
software guy's viewpoint; maybe some real VLSI types will correct goofs and
expound more on this topic:


It is clear that more space on a die helps a lot, and it lets
you do things like:
	bigger on-chip caches
		a wonderful thing: regular, dense, and mostly
		transistors rather than wires
		this includes: I-caches, D-caches, TLBs, branch-target
		buffers, pre-decoded instruction buffers, etc.
	monster-fast FP units and other arithmetic units
		for some kinds of units (like multipliers),
		more space ==> faster, reduces latency of operation,
		always a good thing.
	more copies of functional units, or more pipelining
		increases the repeat rate for an operation, which may help
		some kinds of things.
	wider busses, increasing intra-chip bandwidth.
On the other hand, there are some nasty facts of life (for CMOS, anyway):
	1) FAN-OUT is not free.  Put another way, the more loads on a bus,
	the slower it is.  Bigger transistors help, up to a point, but
	what usually happens is that you must cascade the gates to keep
	the total delay minimized.
	2) FAN-IN is not free either.
	3) WIRES DON'T SHRINK as fast as transistors (because the resistance
	increases as they get narrower).  Hence, as you do shrinks to
	increase the speed, and get more on a chip, this means the wires
	can gobble up more of the space.
Put another way:
	1) The more things listening to you, the slower you are.
	2) The more things you listen to, the slower you are.
	3) Don't think you can run monster busses all over the place for free.
All of this says that people STILL have to think very hard about
delays in the critical paths in a CPU.  The faster you go, the more things
you're likely to be doing in parallel, but if you're not careful,
these factors can bite you badly, especially in a single-chip
design that can only dissipate so much heat.
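
A toy illustration of point (1), using nothing deeper than the textbook
buffer-chain approximation (all numbers below are invented, and real
designs care about much more than this):

    #include <math.h>
    #include <stdio.h>

    /* Buffer-chain approximation: driving a load F times bigger than a
       minimum gate through n equal stages costs roughly n * F^(1/n) gate
       delays (parasitics ignored).  Illustration only. */
    int main(void)
    {
        double F = 256.0;                        /* invented fan-out ratio */
        for (int n = 1; n <= 8; n++)
            printf("%d stage(s): %6.1f unit delays\n", n, n * pow(F, 1.0 / n));
        /* One huge gate (n = 1) is hopeless; the minimum sits near a
           per-stage fan-out of e, i.e. about ln(F) stages. */
        return 0;
    }
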
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus| world}!esegue.  Meta-mail to compilers-request@esegue.

hankd@dynamo.ecn.purdue.edu (Hank Dietz) (07/15/90)

In article <1990Jul05.155937.13214@esegue.segue.boston.ma.us> aglew@oberon.crhc.uiuc.edu (Andy Glew) writes:
>How often are quantities that *might* be allocated to registers *not*
>allocated to registers, because of the remote *possibility* of
>aliasing via some circuitous pointers?
>
>    Hare brained idea: allocate quantities that *might* be aliased to
>registers anyway.  Provide a register to contain the true memory
>address of the aliased quantity, which causes a trap when the address
>is accessed (or automagically forwards to/from the register).

If you look back in the IEEE proceedings from Supercomputing '88, you'll see
a paper that I and Chi-Hung Chi did addressing exactly this issue (the title
is something like "CRegs: A New Type of Memory for Accessing Arrays and
Pointers").  The CReg mechanism augments registers with address fields which
are used associatively to keep the register contents coherent...  we also
pointed out that with proper compiler analysis, the associativity could be
limited to a rather small set size without ill effect (e.g., a set size of 4
rather than fully associative -- this keeps the speed of CRegs competitive
with that of ordinary registers).
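
Roughly, in software terms (a crude sketch of the idea only, not the actual
design from the paper; the set size of 4 matches the suggestion above):

    #define CREGS_PER_SET 4

    typedef struct {
        unsigned long addr;           /* memory address this CReg shadows */
        long          value;
        int           valid;
        int           dirty;          /* permits lazy store-back          */
    } creg;

    static creg set0[CREGS_PER_SET];  /* one set; a real file has several */

    /* Any store through a pointer associatively checks the (small) set;
       a matching CReg is updated in place, so register and memory can
       never disagree, and the write-back to memory can be deferred.      */
    int creg_store(unsigned long addr, long v)
    {
        for (int i = 0; i < CREGS_PER_SET; i++)
            if (set0[i].valid && set0[i].addr == addr) {
                set0[i].value = v;
                set0[i].dirty = 1;    /* lazy store: memory updated later */
                return 1;
            }
        return 0;                     /* miss: ordinary memory store      */
    }

    /* A load likewise hits in the set if the address matches.            */
    int creg_load(unsigned long addr, long *out)
    {
        for (int i = 0; i < CREGS_PER_SET; i++)
            if (set0[i].valid && set0[i].addr == addr) {
                *out = set0[i].value;
                return 1;
            }
        return 0;                     /* miss: fetch from memory          */
    }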

>Evaluation framework:
...
>Unfortunately, I don't have any idea of what the parameters are like.

Some preliminary results on CRegs appear in the aforementioned paper.
The results were pretty good.  However, some of the most significant
benefits are not the obvious ones.  E.g., you don't need to do
explicit stores as often (ever?) -- you can even use a dirty bit and a
lazy store mechanism.  Likewise, you don't need as many loads, and you
certainly don't need to fetch as many addresses, which reduces the
INSTRUCTION FETCH bandwidth required.

					-hankd@ecn.purdue.edu

PS: Chi and I had actually tried to get this stuff patented back in
    1987, and it looked like we had some very nice claims, but Purdue
    University basically put it on hold until it had been over a year
    since public disclosure...  oh well.

PPS: We've also looked at using CRegs for instructions (vs. data).
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus| world}!esegue.  Meta-mail to compilers-request@esegue.

torbenm@diku.dk (Torben Ægidius Mogensen) (07/15/90)

rfg@ncd.com (Ron Guilmette) writes:

>In article <1990Jul05.155937.13214@esegue.segue.boston.ma.us> you write:
>>How often are quantities that *might* be allocated to registers *not*
>>allocated to registers, because of the remote *possibility* of
>>aliasing via some circuitous pointers?
>>
>>    Hare brained idea: allocate quantities that *might* be aliased to
>>registers anyway.  Provide a register to contain the true memory
>>address of the aliased quantity, which causes a trap when the address
>>is accessed (or automagically forwards to/from the register).  Not
>>only are aliasing problems avoided, but you've got a set of data
>>address breakpoint registers as well! (ie. this technique could be
>>experimentally evaluated on machines that have data address
>>breakpoints).

>Actually, this sounds like a marvelous idea to me!

>I guess that the only drawback would be the cost of the additional hardware
>to support the data breakpoints, but (as you noted) some machines already
>have such facilities, so it can't be THAT expensive.

There are two other (simpler) ways to circumvent the aliasing problem
in a processor architecture.

One is to assign fixed addresses to registers, e.g.  let the first N
addresses refer to the N registers.  The processor would recognize
that all the top bits in the address are 0, and use registers instead.
I believe this was done in some PDP 11s.  [actually, the PDP-6/10/20 -John]
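
In crude C terms (the register count and the decode rule below are
placeholders; the point is only the address check on every access):

    #define NREGS 16                          /* placeholder register count */
    static long regfile[NREGS];

    static long real_memory_read(unsigned long addr)
    {
        (void)addr; return 0;                 /* stand-in for a real memory */
    }

    long mem_read(unsigned long addr)
    {
        if (addr < NREGS)                     /* top bits all zero: register */
            return regfile[addr];
        return real_memory_read(addr);
    }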

The other is to have no data registers at all, but to keep everything in
memory.  This is not as costly as it may sound, as cache memory can be
accessed (almost) as fast as register memory.  An added benefit is fast
process switching.  Register allocation becomes a non-issue, and register
saving/restoring across calls is done invisibly, on a "call-by-need"
basis, as cache cells are overwritten.  The only problem is the address
size in instructions, but this can be solved in several ways.  One is to
use stack-top-relative addressing similar to what Simon Peyton Jones
suggested recently.

Another way is the method used in the TI 9900/99000 series: use a pointer
to a block of memory that can be addressed using short addresses.  This
block was called the "register block".  This series of processors was
actually quite interesting, and it is a shame that TI dropped it in favor
of the 8086.  (Some people may remember the TI 99/4 home computer, which
used the earlier TMS9900 from the same family.  It was notoriously slow,
but this was (I have heard) mainly due to bad programming and system
design.)

	Torben Mogensen (torbenm@diku.dk)
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus| world}!esegue.  Meta-mail to compilers-request@esegue.

mike@vlsivie.at (Inst.f.Techn.Informatik) (07/16/90)

In article <1990Jul06.194618.4957@esegue.segue.boston.ma.us>, rfg@ncd.com (Ron Guilmette) writes:
> In article <1990Jul05.155937.13214@esegue.segue.boston.ma.us> you write:
> >...
> >    Hare brained idea: allocate quantities that *might* be aliased to
> >registers anyway.  Provide a register to contain the true memory
> >address of the aliased quantity, ...
> 
> Actually, this sounds like a marvelous idea to me!

It most certainly is a marvelous idea! And it works! And it has been
implemented. Here is a short description:
As opposed to the suggestions made in these postings, there is no overhead
of extra instructions for loading the associated addresses into breakpoint
registers. It is done automagically. It also uses a dynamic allocation
scheme for the register values and can use *LOTS* of registers
(anywhere from 8 KB to 64 KB).
IT EVEN HAS A NAME: it's called ``cache memory''. Access times are short
and, if integrated on the chip, can be as fast as a register access.

				bye,
					mike :-)

Michael K. Gschwind             mike@vlsivie.at
Technical University, Vienna    mike@vlsivie.uucp
Voice: (++43).1.58801 8144      e182202@awituw01.bitnet
Fax:   (++43).1.569697
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus| world}!esegue.  Meta-mail to compilers-request@esegue.

aglew@dwarfs.crhc.uiuc.edu (Andy Glew) (07/16/90)

>IT EVEN HAS A NAME: it's called ``cache memory''. Access times are short 
>and if integrated on the chip can be as fast as a register access.

As I am sure Mr. Gschwind knows, cache implemented on-chip is not as
fast as register access, chiefly because cache is seldom multiported,
as well as for other reasons such as loading, and the number of
addresses that can be specified in an instruction.

If you will, we are talking about the space formed by the
cross-product of the following parameters:

    ADDRESSING = by memory address and/or by register number

    MULTIPORTING = 1r/w .. 2r/1w ...

    SIZE/SPEED = large/slow .. small/fast

The standard configurations are 

    CACHE = addressed by memory address (associatively)
    	    single-ported
    	    large/slow (at least wrt. registers)

    REGISTER = addressed by number
    	       multiported
    	       small/fast

What I was asking about, and what Hank Dietz and others have
explored, is whether there are other configurations between these two
extremes that might be useful.  

--
Andy Glew, aglew@uiuc.edu
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus| world}!esegue.  Meta-mail to compilers-request@esegue.

phorgan@cup.portal.com (Patrick Horgan) (07/17/90)

On Amdahl machines a data read or write to cache or register takes only
one cycle.  There is no difference.

Patrick Horgan                          phorgan@cup.portal.com
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus| world}!esegue.  Meta-mail to compilers-request@esegue.

heggy@cs.pitt.edu (Ben Heggy) (07/17/90)

In article <1990Jul05.155937.13214@esegue.segue.boston.ma.us> aglew@oberon.crhc.uiuc.edu (Andy Glew) writes:
>    Hare brained idea: allocate quantities that *might* be aliased to
>registers anyway.  Provide a register to contain the true memory
>address of the aliased quantity, which causes a trap when the address
>is accessed (or automagically forwards to/from the register). 

Any such technique must also take into account the possibility that quantities
allocated to registers may also be aliased (i.e. register addresses must also
be monitored and forwarded.)

A technique that solves this problem has been developed and a paper describing
the technique and its use has been submitted for publication. A copy of the
paper is available by sending me e-mail at heggy@cs.pitt.edu.

The paper is:
	B. Heggy and M.L. Soffa, "Architectural Support for Register 
		Allocation in the Presence of Aliasing", Submitted for
		publication.

Ben
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus| world}!esegue.  Meta-mail to compilers-request@esegue.

aglew@oberon.crhc.uiuc.edu (Andy Glew) (07/18/90)

>On Amdahl machines a data read or write to cache or register takes only
>one cycle.  There is no difference.

How many cache locations can you read/write per cycle?
And how many registers can you read/write per cycle?

--
Andy Glew, aglew@uiuc.edu
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus| world}!esegue.  Meta-mail to compilers-request@esegue.

rfg@uunet.UU.NET (Ron Guilmette) (07/20/90)

In article <1990Jul14.222431.13761@esegue.segue.boston.ma.us> Preston Briggs <preston@rice.edu> writes:
>Andy Glew writes:
>>>    Hare brained idea: allocate quantities that *might* be aliased to
>>>registers anyway.  Provide a register to contain the true memory
>>>address of the aliased quantity, which causes a trap when the address
>>>is accessed (or automagically forwards to/from the register).  Not
>>>only are aliasing problems avoided, but you've got a set of data
>>>address breakpoint registers as well! ...
>And Ron Guilmette replied:
>>Actually, this sounds like a marvelous idea to me!
>...
>>Having a machine that did this stuff would effectively render alias analysis
>>(in compilers) "impotent and obsolete".
>I disagree.  Alias analysis has many uses during optimization...

Who am I to disagree with somebody from Rice (especially Preston)? :-)

Preston's examples do in fact totally demolish my argument that clever
hardware could render alias analysis unnecessary.

Obviously, a good optimizing compiler will always need to do hefty analysis
of the code being compiled in order to understand which transformations can
and cannot be done (safely).

Nonetheless, I think that there might still be a place for the hardware scheme
we were discussing.  Although a compiler for such a machine would still need
to be pretty smart if it wanted to produce really good code, the hardware
scheme proposed for dealing with potential aliasing could still allow an
optimizing compiler to keep some values in registers over a larger range
than it could otherwise.

-- 

// Ron Guilmette (rfg@ncd.com)
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus| world}!esegue.  Meta-mail to compilers-request@esegue.

rfg@uunet.UU.NET (Ron Guilmette) (07/20/90)

In article <1990Jul17.124057.1688@esegue.segue.boston.ma.us> Patrick Horgan <phorgan@cup.portal.com> writes:
>On Amdahl machines a data read or write to cache or register takes only
>one cycle.  There is no difference.

Gee!  I guess that Amdahl's *never* get cache misses!

Now why didn't anybody mention that to me when I worked there? ;-)

-- 

// Ron Guilmette (rfg@ncd.com)
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus| world}!esegue.  Meta-mail to compilers-request@esegue.

vestal@src.honeywell.com (Steve Vestal) (07/20/90)

In article <1990Jul15.205606.19343@esegue.segue.boston.ma.us> mike@vlsivie.at (Inst.f.Techn.Informatik) writes:
>   IT EVEN HAS A NAME: it's called ``cache memory''. Access times are short 
>   and if integrated on the chip can be as fast as a register access.

CRegs do have some appeal over caches for hard real-time applications.  Cache
access times can vary; many developers of hard real-time systems just disable
them.  In contrast, my impression is that CReg access times can be as
predictable as normal register access times.  This helps when determining the
worst-case execution time for a code fragment.
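
A back-of-the-envelope illustration, with invented cycle counts: for a loop
doing 1000 loads, a safe worst-case bound with a cache in the path must
charge every load the full miss penalty, while a fixed-latency mechanism is
charged its one known figure:

    #include <stdio.h>

    int main(void)
    {
        int    loads = 1000;
        double hit = 1.0, miss = 20.0;   /* invented cycle counts          */
        double fixed = 1.0;              /* assumed fixed register latency */

        /* With a cache, a safe WCET bound assumes every load misses.      */
        printf("cache, WCET bound   : %.0f cycles\n", loads * miss);
        printf("fixed latency, bound: %.0f cycles\n", loads * fixed);
        printf("cache, typical run  : %.0f cycles (not guaranteed)\n",
               loads * (0.95 * hit + 0.05 * miss));
        return 0;
    }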

The problem of processor design for reliable hard real-time applications is
interesting (if often ignored).  Commercial chip developers tend to minimize
average execution time instead of minimizing worst-case execution time or
minimizing execution time variability.  Things like caches, the use of
data-dependent algorithms for arithmetic, register windowing in a way that
makes procedure call times variable, etc., cause some problems.

Does anyone have benchmark results for the current crop of microprocessors
where all caching has been disabled?

Steve Vestal
Mail: Honeywell S&RC MN65-2100, 3660 Technology Drive, Minneapolis MN 55418 
Phone: (612) 782-7049                    Internet: vestal@src.honeywell.com

-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus| world}!esegue.  Meta-mail to compilers-request@esegue.