aglew@oberon.crhc.uiuc.edu (Andy Glew) (07/05/90)
How often are quantities that *might* be allocated to registers *not*
allocated to registers, because of the remote *possibility* of aliasing
via some circuitous pointers?

Hare-brained idea: allocate quantities that *might* be aliased to
registers anyway. Provide a register to contain the true memory address
of the aliased quantity, which causes a trap when the address is
accessed (or automagically forwards to/from the register). Not only are
aliasing problems avoided, but you've got a set of data address
breakpoint registers as well! (I.e., this technique could be
experimentally evaluated on machines that have data address
breakpoints.)

Evaluation framework:

    Na          is the number of such data address breakpoint registers
                provided.
    V(1)..V(Na) are possibly aliased values placed in registers.
    Tr          is the performance (CPI) assuming V(1)..V(Na) are all in
                registers, and no aliasing occurs.
    Tm          is the performance (CPI) assuming V(1)..V(Na) are placed
                in memory whenever there is a possibility of aliasing.
    Fa          is the frequency of actual aliasing.
    Ca          is the cost of handling the aliasing, using the data
                address breakpoints (large in SW, less but still
                significant in HW).

Obviously, Tr = Tr( V(1)..V(Na) ) = Tr(Na), and so on. Then, for a
particular Na, we compare

    Tr*(1-Fa) + Tr*Fa*(1+Ca)

to Tm.

Unfortunately, I don't have any idea what the parameters are like. I
have the impression that Fa is low, and I can ballpark Ca. But what is
Tm/Tr? Given the amount of fuss (number of papers) about aliasing in
compilers, Tm/Tr would seem to be large, but I'm not sure. Especially
as Na is varied?

(Well, let's see if I've embarrassed myself as much as I did in my last
posting, where I said that 24-10=4. ;-} )
--
Andy Glew, aglew@uiuc.edu
--
Send compilers articles to compilers@esegue.segue.boston.ma.us
{spdcc | ima | lotus | world}!esegue.  Meta-mail to compilers-request@esegue.
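The comparison above can be sketched in a few lines of C. The parameter values in the comments are invented for illustration (the whole point of the post is that nobody knows them yet); the function names are mine, not Andy's.

```c
/* Effective CPI under the breakpoint scheme: aliasing-free execution
 * runs at Tr, and the fraction Fa that actually aliases pays the
 * extra handling cost Ca.  This is Andy's
 *     Tr*(1-Fa) + Tr*Fa*(1+Ca)
 * which simplifies to Tr*(1 + Fa*Ca). */
double effective_cpi(double Tr, double Fa, double Ca)
{
    return Tr * (1.0 - Fa) + Tr * Fa * (1.0 + Ca);
}

/* The scheme wins over the conservative compiler whenever the
 * effective CPI beats Tm, i.e. when Tm/Tr > 1 + Fa*Ca. */
int scheme_wins(double Tr, double Tm, double Fa, double Ca)
{
    return effective_cpi(Tr, Fa, Ca) < Tm;
}
```

With made-up ballpark numbers Tr = 1.0, Tm = 1.2 (a 20% penalty for keeping aliasable values in memory) and Ca = 100 cycles of software trap handling, the scheme wins at Fa = 0.001 (effective CPI 1.1) but loses badly at Fa = 0.01 (effective CPI 2.0), which is why the actual value of Fa matters so much.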
rfg@ncd.com (Ron Guilmette) (07/07/90)
In article <1990Jul05.155937.13214@esegue.segue.boston.ma.us> you write:
>How often are quantities that *might* be allocated to registers *not*
>allocated to registers, because of the remote *possibility* of
>aliasing via some circuitous pointers?
>
> Hare brained idea: allocate quantities that *might* be aliased to
>registers anyway. Provide a register to contain the true memory
>address of the aliased quantity, which causes a trap when the address
>is accessed (or automagically forwards to/from the register). Not
>only are aliasing problems avoided, but you've got a set of data
>address breakpoint registers as well! (ie. this technique could be
>experimentally evaluated on machines that have data address
>breakpoints).

Actually, this sounds like a marvelous idea to me!

I guess that the only drawback would be the cost of the additional
hardware to support the data breakpoints, but (as you noted) some
machines already have such facilities, so it can't be THAT expensive.

Offhand, I'd say that this could potentially work really well, except
that I think that you would want (a) a set of data-breakpoint registers
with a one-to-one relationship to the actual registers (though you
probably only need to cover about half of them for a typical
32-register RISC machine) and (b) special load and store instructions
for potentially aliased data values.

The special load instruction would not only load the data value into
the specified data register, it would also force the address used for
the load into the (associated) data-breakpoint register. Finally, it
would set a "valid" bit associated with the given data-breakpoint
register.

The special store instruction would simply unset the "valid" bit for
the specified data-breakpoint register (thus preventing further data
accesses to the address indicated in the given breakpoint register from
being treated as if the data value in the data register were still the
valid, and most up-to-date, one).

The idea of getting the hardware to automagically satisfy references to
temporarily "register resident" data values directly from the registers
would eliminate the overhead of the (software) handling of breakpoints,
and would make this idea even better.

As I said, I think that it would *not* be necessary to associate one
data-breakpoint register with each actual data register. Keep in mind
that registers often hold computational results as well as simple
copies of variable values. (And don't forget registers like PC, SP, and
FP. These don't need the special treatment either.)

Having a machine that did this stuff would effectively render alias
analysis (in compilers) "impotent and obsolete".

> Given the amount of fuss (number of papers) about aliasing in compilers...

It is certainly a pain in the $#^%$$^$%$ to do good alias analysis
statically in software (as opposed to dynamically and in hardware),
especially for everybody's current favorite language, C. Further, I'd
have to guess that static (compile-time) mechanisms for dealing with
(i.e. preventing) aliasing problems could NEVER match the performance
characteristics of the hardware-based dynamic scheme you have proposed.

In short... I like it. Too bad I don't build chips! I just program the
suckers. :-)
--
// Ron Guilmette (rfg@ncd.com)
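The load/store/valid-bit mechanism Ron describes can be modeled in software. This is a hypothetical sketch: the register counts, function names, and the choice to forward stores rather than trap are all my inventions, and real hardware would do the address match associatively in parallel rather than in a loop.

```c
#include <stddef.h>

#define NBP 16  /* number of data-breakpoint registers -- arbitrary */

struct bp_entry {
    int *addr;   /* true memory address of the register-resident value */
    int  reg;    /* data register currently caching that value */
    int  valid;  /* set by special load, cleared by special store */
};

static struct bp_entry bp[NBP];
static int regs[32];

/* Special load: fetch the value, and record its address in the
 * breakpoint register associated with the chosen data register. */
void aliased_load(int bpno, int reg, int *addr)
{
    regs[reg] = *addr;
    bp[bpno].addr = addr;
    bp[bpno].reg = reg;
    bp[bpno].valid = 1;
}

/* Special store: write the value back and clear the valid bit, so
 * later accesses through other pointers go to memory as usual. */
void aliased_store(int bpno)
{
    *bp[bpno].addr = regs[bp[bpno].reg];
    bp[bpno].valid = 0;
}

/* An ordinary store snoops the breakpoint registers; on a match the
 * hardware would trap, or (the "automagic" variant) forward the new
 * value into the shadowing register, as modeled here. */
void plain_store(int *addr, int val)
{
    int i;
    *addr = val;
    for (i = 0; i < NBP; i++)
        if (bp[i].valid && bp[i].addr == addr)
            regs[bp[i].reg] = val;
}
```

Note that a store through an alias updates both memory and the shadowing register, so the register-resident copy stays the most up-to-date one until the special store retires it.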
preston@rice.edu (Preston Briggs) (07/15/90)
Andy Glew writes:
>> Hare brained idea: allocate quantities that *might* be aliased to
>>registers anyway. Provide a register to contain the true memory
>>address of the aliased quantity, which causes a trap when the address
>>is accessed (or automagically forwards to/from the register). Not
>>only are aliasing problems avoided, but you've got a set of data
>>address breakpoint registers as well! (ie. this technique could be
>>experimentally evaluated on machines that have data address
>>breakpoints).

And Ron Guilmette replied:
>Actually, this sounds like a marvelous idea to me!
...
>Having a machine that did this stuff would effectively render alias
>analysis (in compilers) "impotent and obsolete".

I disagree. Alias analysis has many uses during optimization. For
example, during value numbering (with constant folding), this example
might arise:

    *p = 5;
    *q = -5;            /* does this change *p's value or not? */
    if (*p > 0) {
        ...
    } else {
        ...
    }

With awesome alias analysis, we might be able to determine that p and q
don't point at the same location; therefore, we'd be able to eliminate
the test and the entire else clause. With inadequate alias analysis (or
none at all), we must be conservative and emit code for the test.

It's easy to construct similar examples with common subexpression
elimination, loop invariant code motion, &c.

The register allocation question is more difficult to dispose of with a
simple example. But consider that many of the values kept in registers
are the result of common subexpression elimination or loop invariant
code motion, and you can get a hint of the problem. Here's a (very
contrived) example:

    do {
        *p = a[*q];
        p++;
    } while (p < pp);

Now what we'd really like is to hold a[*q] in a register throughout the
loop, giving

    r = a[*q];
    do {
        *p = r;
        p++;
    } while (p < pp);

But what if p = q at some point during the loop? Boom! Note that we're
not keeping *q in a register, so there's no hint from the proposed trap
hardware.

Basically, it seems as though we can't keep the results of any
expression that includes aliased elements in a register. More to the
point, we can't do any (?) optimization for such expressions.

The whole scheme strikes me as "cache." Hank Dietz has argued (I
believe) that the decision to keep a variable in cache or register
should be based on whether or not it is aliased. For data breakpoints,
perhaps there's an easy modification of cache hardware?
--
Preston Briggs                    looking for the great leap forward
preston@titan.rice.edu
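Preston's contrived loop can be made concrete. In the sketch below (the array contents and pointer setup are invented), q points into the very region the loop writes, so the third store clobbers *q and the hoisted version diverges from the original:

```c
#include <string.h>

/* The original loop: a[*q] is re-evaluated every iteration. */
void copy_indexed(int *p, int *pp, const int *a, const int *q)
{
    do {
        *p = a[*q];
        p++;
    } while (p < pp);
}

/* The "optimized" loop: a[*q] hoisted into r.  This is legal only if
 * *q cannot change during the loop -- i.e., only if q does not alias
 * any location in [p, pp). */
void copy_hoisted(int *p, int *pp, const int *a, const int *q)
{
    int r = a[*q];
    do {
        *p = r;
        p++;
    } while (p < pp);
}
```

With a[i] = i+1, a five-word destination buf initialized so buf[2] = 1, and q = &buf[2], copy_indexed produces {2,2,2,3,3} (the store at iteration 2 changes *q from 1 to 2), while copy_hoisted produces {2,2,2,2,2}. As Preston notes, no breakpoint on *q's home address helps here, because the stale value lives in r, not in a register shadowing *q.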
mash@mips.COM (John Mashey) (07/15/90)
In article <1990Jul06.194618.4957@esegue.segue.boston.ma.us> rfg@ncd.com (Ron Guilmette) writes:
>> Hare brained idea: allocate quantities that *might* be aliased to
>>registers anyway. Provide a register to contain the true memory
>>address of the aliased quantity, which causes a trap when the address
>>is accessed (or automagically forwards to/from the register). Not
>>only are aliasing problems avoided, but you've got a set of data
>>address breakpoint registers as well! (ie. this technique could be
>>experimentally evaluated on machines that have data address
>>breakpoints).

Some of this sounds interesting, and some may be useful in the future,
for various applications. However, one must be careful, especially in a
world of LIW, super-scalar, super-pipelined, and
super-scalar-super-pipelined multiple-issue machines (i.e., all RISCs
that expect to be competitive in the next few years), that you don't
stick something in a critical path that blows your cycle time by
50%....

Maybe this is a good time to expound a little on a related widespread
fantasy that might be called: When You Have A Zillion Transistors On A
Chip, All Of Your Problems Go Away. Most of the following is an
over-simplified discussion of EE stuff from a software guy's viewpoint;
maybe some real VLSI types will correct goofs and expound more on this
topic.

It is clear that more space on a die helps a lot, letting you do things
like:

    bigger on-chip caches
        A wonderful thing: regular, dense, and transistors rather than
        wires. This includes I-caches, D-caches, TLBs, branch-target
        buffers, pre-decoded instruction buffers, etc.
    monster-fast FP units and other arithmetic units
        For some kinds of units (like multipliers), more space ==>
        faster; reducing the latency of an operation is always a good
        thing.
    more copies of functional units, or more pipelining
        Increases the repeat rate for an operation, which may help some
        kinds of things.
    wider busses
        Increasing intra-chip bandwidth.

On the other hand, there are some nasty facts of life (for CMOS,
anyway):

1) FAN-OUT is not free. Put another way, the more loads on a bus, the
   slower it is. Bigger transistors help, up to a point, but what
   usually happens is that you must cascade the gates to keep the total
   delay minimized.
2) FAN-IN is not free either.
3) WIRES DON'T SHRINK as fast as transistors (because the resistance
   increases as they get narrower). Hence, as you do shrinks to
   increase the speed and get more on a chip, the wires can gobble up
   more of the space.

Put another way:

1) The more things listening to you, the slower you are.
2) The more things you listen to, the slower you are.
3) Don't think you can run monster busses all over the place for free.

All of this says that people STILL have to think very hard about delays
in the critical paths in a CPU. The faster you go, the more you're
likely to be doing things in parallel, but if you're not careful, these
factors can bite you badly, especially in a single-chip design that can
only dissipate so much heat.
--
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:   mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:    408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:   MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
hankd@dynamo.ecn.purdue.edu (Hank Dietz) (07/15/90)
In article <1990Jul05.155937.13214@esegue.segue.boston.ma.us> aglew@oberon.crhc.uiuc.edu (Andy Glew) writes:
>How often are quantities that *might* be allocated to registers *not*
>allocated to registers, because of the remote *possibility* of
>aliasing via some circuitous pointers?
>
> Hare brained idea: allocate quantities that *might* be aliased to
>registers anyway. Provide a register to contain the true memory
>address of the aliased quantity, which causes a trap when the address
>is accessed (or automagically forwards to/from the register).

If you look back in the IEEE proceedings from Supercomputing '88,
you'll see a paper that Chi-Hung Chi and I wrote addressing exactly
this issue (the title is something like "CRegs: A New Type of Memory
for Accessing Arrays and Pointers"). The CReg mechanism augments
registers with address fields which are used associatively to keep the
register contents coherent... we also pointed out that with proper
compiler analysis, the associativity could be limited to a rather small
set size without ill effect (e.g., a set size of 4 rather than fully
associative -- this keeps the speed of CRegs competitive with that of
ordinary registers).

>Evaluation framework: ...
>Unfortunately, I don't have any idea of what the parameters are like.

Some preliminary results on CRegs appear in the aforementioned paper.
The results were pretty good. However, some of the most significant
benefits are not the obvious ones. E.g., you don't need to do explicit
stores as often (ever?) -- you can even use a dirty bit and a lazy
store mechanism. Likewise, you don't need as many loads, and you
certainly don't need to fetch as many addresses, which reduces the
INSTRUCTION FETCH bandwidth required.

-hankd@ecn.purdue.edu

PS: Chi and I actually tried to get this stuff patented back in 1987,
and it looked like we had some very nice claims, but Purdue University
basically put it on hold until it had been over a year since public
disclosure... oh well.

PPS: We've also looked at using CRegs for instructions (vs. data).
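A rough software model of the CReg idea as described above. The field names, the table size, and the fully associative loop are my guesses for illustration, not Dietz and Chi's actual design (which limits the associativity to a small set size):

```c
#include <stddef.h>

#define NCREG 8   /* arbitrary */

/* A CReg pairs a value with the memory address it shadows, plus a
 * dirty bit so stores can be written back lazily. */
struct creg {
    int *addr;
    int  value;
    int  valid;
    int  dirty;
};

static struct creg cregs[NCREG];

/* Associative lookup: is this address already register-resident? */
static struct creg *creg_find(int *addr)
{
    int i;
    for (i = 0; i < NCREG; i++)
        if (cregs[i].valid && cregs[i].addr == addr)
            return &cregs[i];
    return NULL;
}

/* Load through a CReg: a hit avoids the memory access entirely, and
 * any alias of the address sees the same (coherent) value. */
int creg_load(int slot, int *addr)
{
    struct creg *c = creg_find(addr);
    if (c)
        return c->value;
    cregs[slot].addr = addr;
    cregs[slot].value = *addr;
    cregs[slot].valid = 1;
    cregs[slot].dirty = 0;
    return cregs[slot].value;
}

/* Store through a CReg: update in place and mark dirty -- no explicit
 * store instruction to memory is needed yet. */
void creg_store(int slot, int *addr, int val)
{
    struct creg *c = creg_find(addr);
    if (!c) {
        c = &cregs[slot];
        c->addr = addr;
        c->valid = 1;
    }
    c->value = val;
    c->dirty = 1;
}

/* Lazy write-back, e.g. on eviction. */
void creg_flush(void)
{
    int i;
    for (i = 0; i < NCREG; i++)
        if (cregs[i].valid && cregs[i].dirty) {
            *cregs[i].addr = cregs[i].value;
            cregs[i].dirty = 0;
        }
}
```

The model shows the two benefits Hank mentions: a store through any alias hits the same CReg entry (coherence without traps), and memory is only touched when dirty entries are flushed (lazy stores).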
torbenm@diku.dk (Torben Ægidius Mogensen) (07/15/90)
rfg@ncd.com (Ron Guilmette) writes:
>In article <1990Jul05.155937.13214@esegue.segue.boston.ma.us> you write:
>>How often are quantities that *might* be allocated to registers *not*
>>allocated to registers, because of the remote *possibility* of
>>aliasing via some circuitous pointers?
>>
>> Hare brained idea: allocate quantities that *might* be aliased to
>>registers anyway. Provide a register to contain the true memory
>>address of the aliased quantity, which causes a trap when the address
>>is accessed (or automagically forwards to/from the register). Not
>>only are aliasing problems avoided, but you've got a set of data
>>address breakpoint registers as well! (ie. this technique could be
>>experimentally evaluated on machines that have data address
>>breakpoints).

>Actually, this sounds like a marvelous idea to me!
>I guess that the only drawback would be the cost of the additional hardware
>to support the data breakpoints, but (as you noted) some machines already
>have such facilities, so it can't be THAT expensive.

There are two other (simpler) ways to circumvent the aliasing problem
in a processor architecture.

One is to assign fixed addresses to registers, e.g. let the first N
addresses refer to the N registers. The processor would recognize that
all the top bits in the address are 0, and use registers instead. I
believe this was done in some PDP-11s. [actually, the PDP-6/10/20 -John]

The other is to have no data registers at all, but keep everything in
memory. This is not as costly as it may sound, as cache memory can be
accessed (almost) as fast as register memory. An added benefit is fast
process switching. Register allocation becomes a non-issue, and
register saving/loading on process switches is done invisibly on a
"call-by-need" basis, as cache cells are overwritten. The only problem
is address size in instructions, but this can be solved in several
ways. One is to use stacktop-relative addressing similar to what Simon
Peyton Jones suggested recently.

Another way is the method used in the TI 99000 series: use a pointer to
a block of memory that can be addressed using short addresses. This
block was actually called the "register block". This series of
processors was actually quite interesting, and it is a shame that TI
dropped it in favor of the 8086. (Some people may remember the TI 99/4
home computer, which used the earlier TMS9900 processor. It was
notoriously slow, but this was, I have heard, mainly due to bad
programming and system design.)

Torben Mogensen (torbenm@diku.dk)
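The first scheme Torben mentions, mapping the low N addresses onto the register file, can be sketched as a toy model. This is my illustration, not the actual PDP-10 implementation; the sizes are arbitrary and there is no bounds checking:

```c
#define NREGS 16            /* arbitrary */

static int regfile[NREGS];
static int memory[1024];    /* toy word-addressed memory */

/* Every access goes through one resolver: addresses below NREGS name
 * registers.  A register and "its" memory cell can never disagree,
 * because they are the same storage -- aliasing between registers and
 * memory is impossible by construction. */
static int *resolve(unsigned addr)
{
    return addr < NREGS ? &regfile[addr] : &memory[addr];
}

int  mem_load(unsigned addr)           { return *resolve(addr); }
void mem_store(unsigned addr, int v)   { *resolve(addr) = v; }
```

A pointer that happens to hold the value 3 transparently reads and writes register 3, which is exactly why register allocation stops being an aliasing problem under this scheme.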
mike@vlsivie.at (Inst.f.Techn.Informatik) (07/16/90)
In article <1990Jul06.194618.4957@esegue.segue.boston.ma.us>, rfg@ncd.com (Ron Guilmette) writes:
> In article <1990Jul05.155937.13214@esegue.segue.boston.ma.us> you write:
> >...
> > Hare brained idea: allocate quantities that *might* be aliased to
> >registers anyway. Provide a register to contain the true memory
> >address of the aliased quantity, ...
>
> Actually, this sounds like a marvelous idea to me!

It most certainly is a marvelous idea! And it works! And it has been
implemented. Here is a short description:

As opposed to the suggestions made in these postings, there is no
overhead from extra instructions for loading the associated addresses
into breakpoint registers; it is done automagically. It also uses a
dynamic allocation scheme for the register values and can use *LOTS* of
registers (anywhere from 8 KB to 64 KB).

IT EVEN HAS A NAME: it's called ``cache memory''. Access times are
short, and if integrated on the chip it can be as fast as a register
access.

bye, mike :-)

Michael K. Gschwind                     mike@vlsivie.at
Technical University, Vienna            mike@vlsivie.uucp
Voice: (++43).1.58801 8144              e182202@awituw01.bitnet
Fax:   (++43).1.569697
aglew@dwarfs.crhc.uiuc.edu (Andy Glew) (07/16/90)
>IT EVEN HAS A NAME: it's called ``cache memory''. Access times are short
>and if integrated on the chip can be as fast as a register access.

As I am sure Mr. Gschwind knows, cache implemented on-chip is not as
fast as register access, chiefly because cache is seldom multiported,
as well as for other reasons such as loading and the number of
addresses that can be specified in an instruction.

If you will, we are talking about the space formed by the cross-product
of the following parameters:

    ADDRESSING   = by memory address and/or by register number
    MULTIPORTING = 1r/w .. 2r/1w ...
    SIZE/SPEED   = large/slow .. small/fast

The standard configurations are:

    CACHE    = addressed by memory address (associatively),
               single-ported,
               large/slow (at least wrt. registers)
    REGISTER = addressed by number,
               multiported,
               small/fast

What I was asking about, and what Hank Dietz and others have explored,
is whether there are other configurations between these two extremes
that might be useful.
--
Andy Glew, aglew@uiuc.edu
phorgan@cup.portal.com (Patrick Horgan) (07/17/90)
On Amdahl machines a data read or write to cache or register takes only
one cycle. There is no difference.

Patrick Horgan
phorgan@cup.portal.com
heggy@cs.pitt.edu (Ben Heggy) (07/17/90)
In article <1990Jul05.155937.13214@esegue.segue.boston.ma.us> aglew@oberon.crhc.uiuc.edu (Andy Glew) writes:
> Hare brained idea: allocate quantities that *might* be aliased to
>registers anyway. Provide a register to contain the true memory
>address of the aliased quantity, which causes a trap when the address
>is accessed (or automagically forwards to/from the register).

Any such technique must also take into account the possibility that
quantities allocated to registers may themselves be aliased (i.e.,
register addresses must also be monitored and forwarded). A technique
that solves this problem has been developed, and a paper describing the
technique and its use has been submitted for publication. A copy of the
paper is available by sending me e-mail at heggy@cs.pitt.edu.

The paper is:

    B. Heggy and M.L. Soffa, "Architectural Support for Register
    Allocation in the Presence of Aliasing", submitted for publication.

Ben
aglew@oberon.crhc.uiuc.edu (Andy Glew) (07/18/90)
>On Amdahl machines a data read or write to cache or register takes only
>one cycle. There is no difference.

How many cache locations can you read/write per cycle? And how many
registers can you read/write per cycle?
--
Andy Glew, aglew@uiuc.edu
rfg@uunet.UU.NET (Ron Guilmette) (07/20/90)
In article <1990Jul14.222431.13761@esegue.segue.boston.ma.us> Preston Briggs <preston@rice.edu> writes:
>Andy Glew writes:
>>> Hare brained idea: allocate quantities that *might* be aliased to
>>>registers anyway. Provide a register to contain the true memory
>>>address of the aliased quantity, which causes a trap when the address
>>>is accessed (or automagically forwards to/from the register). Not
>>>only are aliasing problems avoided, but you've got a set of data
>>>address breakpoint registers as well! ...
>And Ron Guilmette replied:
>>Actually, this sounds like a marvelous idea to me!
>...
>>Having a machine that did this stuff would effectively render alias analysis
>>(in compilers) "impotent and obsolete".
>I disagree. Alias analysis has many uses during optimization...

Who am I to disagree with somebody from Rice (especially Preston)? :-)

Preston's examples do in fact totally demolish my argument that clever
hardware could render alias analysis unnecessary. Obviously, a good
optimizing compiler will always need to do hefty analysis of the code
being compiled in order to understand which transformations can and
cannot be done (safely).

Nonetheless, I think that there might still be a place for the hardware
scheme we were discussing. Although a compiler for such a machine would
still need to be pretty smart if it wanted to produce really good code,
the hardware scheme proposed for dealing with potential aliasing could
still allow an optimizing compiler to keep some values in registers
over a larger range than it could otherwise.
--
// Ron Guilmette (rfg@ncd.com)
rfg@uunet.UU.NET (Ron Guilmette) (07/20/90)
In article <1990Jul17.124057.1688@esegue.segue.boston.ma.us> Patrick Horgan <phorgan@cup.portal.com> writes:
>On Amdahl machines a data read or write to cache or register takes only
>one cycle. There is no difference.

Gee! I guess that Amdahls *never* get cache misses! Now why didn't
anybody mention that to me when I worked there? ;-)
--
// Ron Guilmette (rfg@ncd.com)
vestal@src.honeywell.com (Steve Vestal) (07/20/90)
In article <1990Jul15.205606.19343@esegue.segue.boston.ma.us> mike@vlsivie.at (Inst.f.Techn.Informatik) writes:
> IT EVEN HAS A NAME: it's called ``cache memory''. Access times are short
> and if integrated on the chip can be as fast as a register access.

CRegs do have some appeal over caches for hard real-time applications.
Cache access times can vary; many developers of hard real-time systems
just disable caching. In contrast, my impression is that CReg access
times can be as predictable as normal register access times. This helps
when determining the worst-case execution time for a code fragment.

The problem of processor design for reliable hard real-time
applications is interesting (if often ignored). Commercial chip
developers tend to minimize average execution time instead of
minimizing worst-case execution time or minimizing execution-time
variability. Things like caches, the use of data-dependent algorithms
for arithmetic, register windowing done in a way that makes procedure
call times variable, etc. cause some problems.

Does anyone have benchmark results for the current crop of
microprocessors where all caching has been disabled?

Steve Vestal
Mail:  Honeywell S&RC MN65-2100, 3660 Technology Drive, Minneapolis MN 55418
Phone: (612) 782-7049     Internet: vestal@src.honeywell.com