spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) (11/12/90)
Has anyone ever thought about or done a registerless architecture?
Registers, after all, are just a sort of cache, another level in the
memory hierarchy, but a fixed-size, hard-wired one. Consider a machine
with a 4-level memory:

  0) the fpu and alu       0Kb
  1) on-chip cache        10Kb
  2) normal cache        100Kb
  3) main ram         10 000Kb
  4) magnetic disk   100 000Kb

It is very easy to expand the size/speed of caches, but not to add
registers. I think this is a big advantage. The way a cache works
generalizes the behavior of things like register windows.

One problem is that instructions would have to be very large (3
addresses). Using a stack-based approach would help. The 3 addresses
are then relative to the stack pointer, and can be small enough to fit
into the instruction. That's 8 or 9 bits for 32-bit machines, or twice
that for 64-bit machines. Again, it scales easily.

Context switch is fast and easy; there's nothing but CCR, PC, and FP.

Any thoughts on this? Stupid idea, or the wave of the future? :)

Consume         Scott Draves
Be Silent       spot@cs.cmu.edu
Die
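The encoding Draves proposes can be sketched concretely: three 9-bit SP-relative offsets plus a 5-bit opcode pack exactly into a 32-bit instruction word. This is a hypothetical encoding for illustration only, not any actual ISA; the field widths are the ones suggested above.

```python
# Hypothetical 32-bit instruction word for a stack-relative 3-address
# machine: 5-bit opcode + three 9-bit offsets from the stack pointer.
OPCODE_BITS, OFFSET_BITS = 5, 9          # 5 + 3*9 = 32 bits total

def encode(opcode, dst, src1, src2):
    """Pack an opcode and three SP-relative offsets into one word."""
    assert opcode < (1 << OPCODE_BITS)
    assert all(off < (1 << OFFSET_BITS) for off in (dst, src1, src2))
    word = opcode
    for off in (dst, src1, src2):
        word = (word << OFFSET_BITS) | off
    return word

def decode(word):
    """Unpack a word back into (opcode, dst, src1, src2)."""
    fields = []
    for _ in range(3):
        fields.append(word & ((1 << OFFSET_BITS) - 1))
        word >>= OFFSET_BITS
    src2, src1, dst = fields             # extracted in reverse pack order
    return word, dst, src1, src2

# e.g. a hypothetical ADD [sp+10] <- [sp+11] + [sp+12], opcode 3
insn = encode(3, 10, 11, 12)
assert decode(insn) == (3, 10, 11, 12)
```

Note the scaling claim holds: doubling the offset fields to 18 bits for a 64-bit machine still leaves room for the opcode in a 64-bit word.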
cgy@cs.brown.edu (Curtis Yarvin) (11/12/90)
In article <1990Nov12.145410.29035@cs.cmu.edu> spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:
>
>Has anyone ever thought about or done a registerless architecture?
>registers, after all, are just a sort of cache, another level in the
>memory hierarchy. but a fixed size, hard-wired one.
>
>One problem is that instructions would have to be very large (3 addresses).
>using a stack based approach would help. The 3 addresses are then
>relative to the stack pointer, and can be small enough to fit into the
>instruction. That's 8 or 9 bits for 32 bit machines, or twice that
>for 64 bit machines. again, it scales easily.

This is one of the only two reasons to use registers. The other is that
registers can still be made a bit faster; no association or anything is
necessary (this goes unless you are one of those direct-mapped cache
people). This capability isn't much used in practice, though; generally
both register and cache hits take one clock cycle.

>context switch is fast and easy, there's nothing but CCR, PC, and FP.

Ah, but no... you have to flush your cache anyway, so you don't really
gain anything here.

>Scott Draves     Be Silent
>spot@cs.cmu.edu  Die

-Curtis

"I tried living in the real world
 Instead of a shell
 But I was bored before I even began."
                              - The Smiths
tom@ssd.csd.harris.com (Tom Horsley) (11/13/90)
>>>>> Regarding registerless architecture; spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) adds:

spot> Has anyone ever thought about or done a registerless architecture?
spot> registers, after all, are just a sort of cache, another level in the
spot> memory hierarchy. but a fixed size, hard-wired one. Consider
spot> a machine with a 4 level memory

Once, a long long time ago in a universe far far away, I worked on a
compiler for a new machine that was going to be registerless because, as
the engineers said, "cache is just as fast as registers anyway".

By the time we got to the point where they were ready to cancel the
project, the engineers had taken to pleading with the compiler writers
to come up with some way to allocate variables in locations such that
frequently used variables would be in spots that didn't get cache
collisions with other frequently used variables...

There is a common technique for doing something like this in compilers.
It is called "register allocation". Unfortunately, it is orders of
magnitude more difficult to do when there are no registers...

spot> any thoughts on this? stupid idea, or the wave of the future? :)

Stupid idea (that's your phrase, not mine :-).
--
======================================================================
domain: tahorsley@csd.harris.com       USMail: Tom Horsley
  uucp: ...!uunet!hcx1!tahorsley               511 Kingbird Circle
                                               Delray Beach, FL 33444
+==== Censorship is the only form of Obscenity ======================+
|     (Wait, I forgot government tobacco subsidies...)               |
+====================================================================+
jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879) (11/13/90)
From article <1990Nov12.145410.29035@cs.cmu.edu>, by spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves):
>
> Has anyone ever thought about or done a registerless architecture?

My Ultimate RISK (Computer Architecture News, 1988) is a memory-to-memory
architecture with no registers in the instruction execution unit other
than the PC. It has no arithmetic unit in the IEU either, which is why I
call it an IEU instead of a CPU. The registers and arithmetic unit(s)
are out on the memory bus.

It was proposed as a purely pedagogical exercise, but it can be
pipelined to death, and with appropriate ALU(s) out on the bus, it can
be quite powerful. I gather a few people have built or are building
machines based on my design, but I haven't heard much from them.

Doug Jones
jones@herky.cs.uiowa.edu
my@dtg.nsc.com (Michael Yip) (11/13/90)
Someone mentioned a registerless architecture that uses a large on-chip
cache instead of registers. The reasoning was that registers limit the
machine architecture and instruction set, and expanding the cache is
easier than adding more registers.

The transputers (e.g. T400, T800) are basically "registerless" machines.
The transputer is essentially a "stack-based RISC machine" which does
not use any registers other than the 3 temporary stack registers.
Instructions operate on the stack instead of registers. The transputers
have on-chip RAM (not cache) for storage, so the context of a process,
including the contents of the stack, can be stored in the on-chip RAM.
I think that newer transputers also use caches, but I am not sure
anymore, since I only designed with the transputer a long time ago when
it first came out.

So does the transputer fit the registerless architecture model?

By the way, I think that the AT&T CRISP (????) is also a stack-based
machine, but I don't know any details about it.

About instructions and the number of registers... doesn't the register
windowing technique also solve the problem, since the instruction set
does not really depend on the total number of registers available on
the chip (only on the number of registers available at one time)?

Just my $0.02! ;)
-- Mike
my@dtg.nsc.com
mash@mips.COM (John Mashey) (11/13/90)
In article <1990Nov12.145410.29035@cs.cmu.edu> spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:
>Has anyone ever thought about or done a registerless architecture?
>registers, after all, are just a sort of cache, another level in the
>memory hierarchy. but a fixed size, hard-wired one. Consider
....
>It is very easy to expand the size/speed of caches, but not to add registers.
>I think this is a big advantage. The way a cache works generalizes
>the behavior of things like register windows.
....
>using a stack based approach would help. The 3 addresses are then
>relative to the stack pointer, and can be small enough to fit into the
>instruction. That's 8 or 9 bits for 32 bit machines, or twice that
>for 64 bit machines. again, it scales easily.

Bell Labs' CRISP chips were this way. This architecture was a fairly
elegant evolution of the register-windows path, i.e., it had a true
"stack cache", with on-chip registers laid fairly invisibly over the top
of the stack. That is, register numbers were really offsets from the
stack pointer, and if they were within range, you got the register;
otherwise you had to fetch the data. Of interest to compiler writers was
the fact that if you generated an address via other routes, and the
address was in the stack cache, you got it also, eliminating the need to
deal with address aliasing (i.e., x ... y = &x; func(y)).

So, anyway, they've been built, and serious software work done with
them, although CRISPs never did get to the commercial market, which is a
little sad. (I may disagree with some of the design choices, but it did
have some elegant ideas.)
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
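The stack-cache behavior Mashey describes can be modeled in a few lines: a small register file invisibly shadows the addresses just above the stack pointer, so both SP-relative "register numbers" and plain pointers into that range hit the fast storage, which is what makes the aliasing case (y = &x) work. This is a toy model for illustration, not the actual CRISP design; in particular, the window fill/spill that CRISP's function-call instructions perform is omitted, and the window size and addresses are made up.

```python
CACHE_SIZE = 16                          # hypothetical window size

class StackCache:
    """Toy CRISP-style stack cache: a register file shadowing the
    address range [sp, sp + CACHE_SIZE)."""
    def __init__(self, sp=1000):
        self.memory = {}                 # backing store: addr -> value
        self.regs = [0] * CACHE_SIZE     # fast window over the stack top
        self.sp = sp

    def _slot(self, addr):
        off = addr - self.sp
        return off if 0 <= off < CACHE_SIZE else None

    def read(self, addr):
        s = self._slot(addr)             # in-window address: register hit
        return self.regs[s] if s is not None else self.memory.get(addr, 0)

    def write(self, addr, value):
        s = self._slot(addr)
        if s is not None:
            self.regs[s] = value
        else:
            self.memory[addr] = value    # ordinary memory access

    def read_offset(self, off):
        # a "register number" in an instruction is just an offset from SP
        return self.read(self.sp + off)

c = StackCache()
c.write(c.sp + 3, 42)        # write a local via its SP-relative slot
p = c.sp + 3                 # a pointer aliasing that local (y = &x)
assert c.read(p) == 42       # the aliased access hits the stack cache too
```

Changing `sp` (as on an interrupt) makes subsequent offset accesses miss the window, exactly the effect described for CRISP.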
jeremy@cs.adelaide.edu.au (Jeremy Webber) (11/13/90)
In article <1990Nov12.145410.29035@cs.cmu.edu> spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:
> Has anyone ever thought about or done a registerless architecture?
Have a look at the INMOS Transputer. It has 3 general purpose registers, which
aren't addressed directly, but via stack operations. It also has a small
amount of on-chip 1-cycle RAM, mapped into the processor's address space.
Its drawbacks are that there is no memory-management support, and the
on-chip RAM isn't a cache; it is hardwired into the low memory addresses.
Still, they have a lot of virtues, particularly if you're rolling your own
hardware.
-jeremy
--
Jeremy Webber ACSnet: jeremy@chook.ua.oz
Digital Arts Film and Television, Internet: jeremy@chook.ua.oz.au
3 Milner St, Hindmarsh, SA 5007, Voicenet: +61 8 346 4534
Australia Papernet: +61 8 346 4537 (FAX)
cik@l.cc.purdue.edu (Herman Rubin) (11/13/90)
In article <1990Nov12.145410.29035@cs.cmu.edu>, spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:
> Has anyone ever thought about or done a registerless architecture?
> registers, after all, are just a sort of cache, another level in the
> memory hierarchy. but a fixed size, hard-wired one. Consider
> a machine with a 4 level memory
>
>  0) the fpu and alu       0Kb
>  1) on-chip cache        10Kb
>  2) normal cache        100Kb
>  3) main ram         10 000Kb
>  4) magnetic disk   100 000Kb
>
> It is very easy to expand the size/speed of caches, but not to add registers.
> I think this is a big advantage. The way a cache works generalizes
> the behavior of things like register windows.
>
> One problem is that instructions would have to be very large (3 addresses).
> using a stack based approach would help. The 3 addresses are then
> relative to the stack pointer, and can be small enough to fit into the
> instruction. That's 8 or 9 bits for 32 bit machines, or twice that
> for 64 bit machines. again, it scales easily.
>
> context switch is fast and easy, there's nothing but CCR, PC, and FP.
>
> any thoughts on this? stupid idea, or the wave of the future? :)

Even with registers it is sometimes necessary to change code, but it can
be made infrequent. Without registers, ugh!

Only a 9-bit field relative to a pointer? One of the stupid (in my
opinion) things about the 86-class machines is the 16-bit field relative
to a pointer, and more than one such field could be active.

Indirect addressing and addressing relative to registers are extremely
important; to replace registers with cache intelligently would require
allowing arbitrary depth of indirection, which is not a bad idea. But
there would be at least a cache access for each level. Also, the idea of
allowing instructions of arbitrary address length seems to be out of
fashion. It would allow indexing of registers, which should be allowed
anyhow.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317) 494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)  {purdue,pur-ee}!l.cc!hrubin (UUCP)
foo@titan.rice.edu (Mark Hall) (11/13/90)
)In article <1990Nov12.145410.29035@cs.cmu.edu> spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:
)
) Has anyone ever thought about or done a registerless architecture?
)
Just for a sense of history: the TI 9900 (and 99000 I believe) were
also registerless. They never made it very big in the marketplace.
(this is almost folklore to me, so correct me if I am wrong. It has
been a long time since I looked at a chip spec. Any chip spec.)
- mark
bean@putter.wpd.sgi.com (David (Bean) Anderson) (11/13/90)
In article <1990Nov12.145410.29035@cs.cmu.edu>, spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:
|> Has anyone ever thought about or done a registerless architecture?
|> registers, after all, are just a sort of cache, another level in the
|> memory hierarchy. but a fixed size, hard-wired one. Consider
|> a machine with a 4 level memory
|>
|>  0) the fpu and alu       0Kb
|>  1) on-chip cache        10Kb
|>  2) normal cache        100Kb
|>  3) main ram         10 000Kb
|>  4) magnetic disk   100 000Kb
|>
|> It is very easy to expand the size/speed of caches, but not to add registers.
|> I think this is a big advantage. The way a cache works generalizes
|> the behavior of things like register windows.
|>
|> One problem is that instructions would have to be very large (3 addresses).
|> using a stack based approach would help. The 3 addresses are then
|> relative to the stack pointer, and can be small enough to fit into the
|> instruction. That's 8 or 9 bits for 32 bit machines, or twice that
|> for 64 bit machines. again, it scales easily.
|>
|> context switch is fast and easy, there's nothing but CCR, PC, and FP.
|>
|> any thoughts on this? stupid idea, or the wave of the future? :)

1. Register files are typically multi-ported -- one can usually get two
reads and one write to the file in one clock (indeed, usually in a small
fraction of the clock) -- whereas a cache is typically single-ported,
and while it can deliver one data item per clock, it is usually in the
"next" clock. Caches will always be slower than registers because (if
for no other reason) the path length and gate count to the cache will be
higher than to a register file.

2. Why are registers considered a *problem*? Modern compilers usually do
a good job of using the registers effectively, as opposed to relying on
*stupid* cache hardware. Indeed, some interesting work on "blocking
algorithms" (coaxing the cache into behaving like a large register file)
has produced some impressive performance figures.

3. The HP3000 is a stack machine with no GPRs. The hardware (on some
models) would keep the top four stack items in a register file in order
to increase performance.

4. Register window architectures are an interesting compromise. They use
a large register file that the compiler can use as it sees fit; however,
one can address registers either by name or relative to the window base.

Who decides which data items should go in high-speed memory is the
critical issue: hardware-implemented heuristics (cache) or
compiler-handled directives (registers)? There are places for both.

Bean
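The "blocking algorithms" Bean mentions in point 2 can be illustrated with the classic example: a tiled matrix multiply, where each small tile is reused many times while it is still cache-resident. This is a generic sketch of the technique, not taken from any particular paper; the tile size `bs` is an arbitrary illustrative choice.

```python
# Blocked (tiled) matrix multiply: each bs x bs tile of A and B is
# touched repeatedly while hot in the cache, so the cache behaves
# somewhat like a large register file holding the current tiles.

def matmul_blocked(A, B, n, bs=4):
    """n x n matrix product over lists of lists, computed tile by tile."""
    C = [[0] * n for _ in range(n)]
    for ii in range(0, n, bs):
        for kk in range(0, n, bs):
            for jj in range(0, n, bs):
                # sweep one tile combination; A[i][k] and the B tile
                # are reused across the inner loops
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

The arithmetic is identical to the naive triple loop; only the traversal order changes, which is the whole point: the win comes purely from locality, with no help from the instruction set.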
ts@cup.portal.com (Tim W Smith) (11/13/90)
< any thoughts on this? stupid idea, or the wave of the future? :)

Why do you assume it can't be a "stupid idea" and "the wave of the
future"? :-)

				Tim Smith
nather@ut-emx.uucp (Ed Nather) (11/14/90)
In article <3168@ns-mx.uiowa.edu>, jones@pyrite.cs.uiowa.edu (Douglas W. Jones,201H MLH,3193350740,3193382879) writes:
> From article <1990Nov12.145410.29035@cs.cmu.edu>,
> by spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves):
> >
> > Has anyone ever thought about or done a registerless architecture?
>
> My Ultimate RISK (Computer Architecture News, 1988) is a memory-to-memory
> architecture with no registers in the instruction execution unit other

Many years ago there was this microprocessor, see, that was 16 bits (!!)
when all the others were only 8 bits, and it was going to be a real
world-beater and wipe out Intel, Motorola, etc. The thing HAD NO
REGISTERS either; it went memory-to-memory, because that's where
everything ends up anyway, so what good are registers? It was made by
that powerhouse of computing called Texas Instruments who, as you know,
wiped out all the competition and changed its name to IBM and ...

Actually, I've forgotten (or suppressed) the chip number, but it was a
real dog, much too slow compared with its competition, and it died the
Death of Dumb Chips long, long ago.

Aren't there any CS courses that teach the History of Computer
Architectures?
--
Ed Nather
Astronomy Dept, U of Texas @ Austin
ig@caliban.uucp (Iain Bason) (11/14/90)
curtis> In article <56084@brunix.UUCP> cgy@cs.brown.edu (Curtis Yarvin) writes:
scott> In article <1990Nov12.145410.29035@cs.cmu.edu> spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:
tom> In article <TOM.90Nov12122800@hcx2.ssd.csd.harris.com> tom@ssd.csd.harris.com (Tom Horsley) writes:
herman> In article <2731@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
mark> In article <1990Nov13.011231.4899@rice.edu> foo@titan.rice.edu (Mark Hall) writes:

scott>Has anyone ever thought about or done a registerless architecture?
scott>registers, after all, are just a sort of cache, another level in the
scott>memory hierarchy. but a fixed size, hard-wired one.

curtis>...registers can still be made a bit faster; no association or anything
curtis>necessary (this goes unless you are one of those direct-mapped cache
curtis>people). This capability isn't much used in practice, though - generally
curtis>both register and cache hits take one clock cycle.

I agree here. One other point is that if you're doing a stack cache, and
all your instructions use indexing off the stack pointer, you have to
add the index to the stack pointer. That is going to take > 0 time.

scott>context switch is fast and easy, there's nothing but CCR, PC, and FP.
curtis>
curtis>Ah, but no... you have to flush your cache anyway, you don't really
curtis>gain anything here.

This is not entirely true. It doesn't take much hardware to add a
(small) process-id tag to cache lines. Then cache flushing can take
place in the background, while the CPU does useful work. In some cases
(e.g., a simple interrupt handler) only a few cache lines will be
flushed before the CPU returns to the interrupted process.

tom>Once a long long time ago in a universe far far away I worked on a compiler
tom>for a new machine that was going to be registerless because, as the engineers
tom>said, "cache is just as fast as registers anyway".
tom>
tom>By the time we got to the point where they were ready to cancel the project
tom>the engineers had taken to pleading with the compiler writers to come up
tom>with some way to allocate variables in locations such that frequently used
tom>variables would be in spots that didn't get cache collisions with other
tom>frequently used variables...
tom>
tom>There is a common technique for doing something like this in compilers. It
tom>is called "register allocation". Unfortunately, it is orders of magnitude
tom>more difficult to do when there are no registers...

Most (maybe all? anyone know?) C compilers will allocate local variables
on the stack. Hardware can certainly be designed to cache a stack (all
you have to do is avoid collisions from contiguous memory; I would think
this would be the normal way to do a cache). The compiler could create
new locals and just pretend they are registers (although I'm sure there
would be smarter ways to optimize for the architecture). I expect many
languages other than C can also be made to allocate local variables on
the stack. Lisp might be tough, and Smalltalk, but then they usually are
on any architecture.

A compiler for a machine like this would obviously be different. For
instance, I imagine "register" coloring would be difficult to do when
the number of "registers" is variable. You have to take into account the
fact that other routines may have data in the cache, and only allocate
space if you think it will save this routine more time than it will cost
other routines.

tom>spot> any thoughts on this? stupid idea, or the wave of the future? :)
tom>
tom>Stupid idea (that's your phrase, not mine :-).

This is far from clear.

herman>Only a 9-bit field relative to a pointer? One of the stupid (in my opinion)
herman>things about the 86-class machines is the 16 bit field relative to a pointer,
herman>and more than one such field could be active.

I don't think Scott is proposing to limit *all* indexes to 9 bits. Look
at it this way: most CPUs limit you to 5-bit indexes into their register
files, but they let you use larger indexes into memory.

herman>Indirect addressing and addressing relative to registers is extremely
herman>important; to replace registers with cache intelligently would require
herman>allowing arbitrary depth of indirection, which is not a bad idea.

Gaaak! Please banish the thought from your mind. I believe one company
(Data General?) had a hell of a time trying to do virtual memory with
such a "feature". Apparently it was almost never used, anyway.

mark> Just for a sense of history: the TI 9900 (and 99000 I believe) were
mark> also registerless. They never made it very big in the marketplace.
mark>
mark> (this is almost folklore to me, so correct me if I am wrong. It has
mark> been a long time since I looked at a chip spec. Any chip spec.)

I believe you are correct, although I'd never even heard of the 99000.
"They never made it very big" is being charitable.

Speaking of which, does anyone remember the Fairchild F8?
--
Iain Bason
..uunet!caliban!ig
baum@Apple.COM (Allen J. Baum) (11/14/90)
I can't let all this stuff go by without my three cents...

>In article <39637@ut-emx.uucp> nather@ut-emx.uucp (Ed Nather) writes:
>In article <3168@ns-mx.uiowa.edu>, jones@pyrite.cs.uiowa.edu writes:
>> > Has anyone ever thought about or done a registerless architecture?

As many posters have pointed out, most stack architectures can be
considered registerless, in the sense that they can be built without
physical registers, and, if registers were implemented, could not
address them directly.

>Many years ago there was this microprocessor, see,... The thing HAD NO
>REGISTERS either, went memory-to-memory because that's where everything
>ends up anyway, so what good are registers? It was made by... Texas
>Instruments... it was a real dog, much too slow...

Actually, there was at least one implementation of the TI9900 that was
fast... because they put registers in, but I don't remember if it
block-loaded them when the register pointer was switched (like the Intel
960 does), or if they were a register cache.

The CRISP is in some sense the ultimate expression of the registerless
machine. It is a stack architecture, but at the same time it is a 2 1/2
address machine. It can be built with no physical registers, or can have
many. The physical registers are used as a cache. If an interrupt
occurs, the SP is changed, and accesses off the stack pointer suddenly
miss. It is not required to mess with the stack cache at that point,
although it makes mucho sense from a performance point of view. Note
that getting to interrupt code is very fast; there aren't many things
that must be saved, but more can be, if you want, for performance
reasons. The thing that makes CRISP a bit different is that the 'cache'
is not automatically loaded on a miss; special function call
instructions do that.

Note that there have been register architectures which had no physical
registers, notably the early DEC PDP-10s (and maybe PDP-6s?), where the
registers overlaid the first 16 memory locations, and there was an
option that installed real registers. I think I remember reading that no
PDP-10 was sold without the option. Hmmmm.
--
baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum
philip@pescadero.Stanford.EDU (Philip Machanick) (11/14/90)
In article <1990Nov13.035859.4777@relay.wpd.sgi.com>, bean@putter.wpd.sgi.com (David (Bean) Anderson) writes:
|> In article <1990Nov12.145410.29035@cs.cmu.edu>, spot@WOOZLE.GRAPHICS.CS.CMU.EDU (Scott Draves) writes:
|> |> Has anyone ever thought about or done a registerless architecture?
[detail deleted]
|> |> One problem is that instructions would have to be very large (3 addresses).
|> |> using a stack based approach would help. The 3 addresses are then
|> |> relative to the stack pointer, and can be small enough to fit into the
|> |> instruction. That's 8 or 9 bits for 32 bit machines, or twice that
|> |> for 64 bit machines. again, it scales easily.
[stuff deleted]
|> |> any thoughts on this? stupid idea, or the wave of the future? :)
[more deleted]
|> 2. Why are registers considered a *problem*? Modern compilers usually
|> do a good job of effectively using the registers as opposed to *stupid*
|> cache hardware. Indeed, some interesting work in "blocking algorithms"
|> (faking the cache into behaving like a large register file) have resulted
|> in some impressive performance figures.
|>
|> 3. The HP3000 is a stack machine with no GPRs. The hardware (on
|> some models) would keep the top four stack items in a register file
|> in order to increase performance.
[more deleted]

In fact, I believe the Burroughs B5500 series introduced this
registers-at-top-of-stack scheme. It was a very pure stack-based
architecture, with most instructions operating relative to the top of
stack. Because instructions had no addresses, they were packed 4 to a
48-bit word. Very efficient? See Hennessy and Patterson, "Computer
Architecture: A Quantitative Approach", Morgan Kaufmann, 1990, for why
RISC performs better. In other words, this is not a stupid idea, just
one that's been tried and hasn't delivered - a wave of the past, if you
will.
--
Philip Machanick
philip@pescadero.stanford.edu
jmaynard@.hsch.utexas.edu (Jay Maynard) (11/14/90)
In article <39637@ut-emx.uucp> nather@ut-emx.uucp (Ed Nather) writes:
>Many years ago there was this microprocessor, see, that was 16 bits (!!)
>when all the others were only 8 bits, and it was going to be a real
>world-beater and wipe out Intel, Motorola, etc. The thing HAD NO
>REGISTERS either, went memory-to-memory because that's where everything
>ends up anyway, so what good are registers? It was made by that powerhouse
>of computing called Texas Instruments who, as you know, wiped out all the
>competition and changed its name to IBM and ...
>Actually, I've forgotten (or suppressed) the chip number, but it was a
>real dog, much too slow compared with its competition, and died the
>Death of Dumb Chips long, long ago.

The chip you're thinking of is the TMS9900. I have one in a small card
cage of a development system, with some flaky bubble memory and a
cassette port, and one of the screwiest dialects of BASIC I've ever met.
It's a real {curi,monstr}osity.

The 9900's big failing, though, was its 64K addressing limit; that was a
severe competitive disadvantage compared to the newer 16-bit chips being
introduced. I think there were later versions that didn't have that
problem, but my memory for such details is getting fuzzy.

Oh, yes... there was one common application for the 9900: the TI 99/4[a]
home computer.
--
Jay Maynard, EMT-P, K5ZC, PP-ASEL | Never ascribe to malice that which can
jmaynard@thesis1.hsch.utexas.edu  | adequately be explained by stupidity.
"With design like this, who needs bugs?" - Boyd Roberts
faiman@m.cs.uiuc.edu (11/14/90)
For Ed Nather at UT-Austin....

It was the TI 9900 (RIP). See, for example, "16-Bit Microprocessor
Architecture," by Terry Dollhoff, Reston, 1979, which contains a
10-chapter case study that uses this device. Quote (without comment)
from the dust jacket: "Complete analysis of the 9900 microprocessor with
stand-alone programs, performance ratings of six competing 16-bit
machines, and more!"

Mike (used to teach a microprocessor course) Faiman, Urbana
torbenm@gere.diku.dk (Torben Ægidius Mogensen) (11/14/90)
All this discussion about registers versus no registers but cache has
centered on the costs of implementing local variables: either by keeping
them in registers and then loading/saving them across calls, or in
memory, letting the cache take care of loading/saving to main memory.
There are several issues involved in this:

1) What are the relative costs of accessing registers and cache memory?

2) What is the relative cost of loading/saving a (large) set of
registers in a burst, versus letting the cache do so at its own pace?

3) How do you effectively map a fixed number of registers to a variable
number of local variables (register allocation)?

As for 1), there is general acceptance that registers are faster than
cache, but only slightly so. The main problem is that, with present
cache architectures, you can access only one cache element per cycle,
whereas it is possible to access several registers. This can be solved
by implementing a multiported cache. Also, it might be possible to
access on-chip cache as fast as registers (to within a few percent) if
the architecture is designed for this.

As for 2), there have been arguments pointing both ways. Sequential-mode
access to memory is faster than random access, so this speaks in favor
of loading/saving in bursts. It must be noted that many modern cache
architectures do the same: if there is a cache miss, a block of cells is
saved/loaded in one go. The main difference is that with registers,
loading/saving is done in a consistent (compile-time) fashion, whereas
with cache it is done by need (run-time). Compile-time register saving
will often lead to unnecessary memory traffic, as registers might be
saved, only to be loaded again immediately afterwards, because the
procedure returns immediately (typically the leaf calls in a recursive
algorithm). The same may happen, to a lesser degree, if the cache uses
burst-mode access to main memory.

On the other hand, local variables need not be saved when leaving a
procedure, and a register-saving strategy will know that. The local
variables will still be kept in the cache, so they will be saved
unnecessarily when the cache locations are re-used. Note that this does
not happen all the time: if a return is followed by another call, the
same memory locations are re-used, so no cache miss occurs. It is
possible to make a stack cache that knows that values above the stack
top are garbage, and sets the do-not-save bit. This begins to look very
much like variable-sized register windows, the difference being that you
can address arbitrarily far down the stack in a transparent fashion.

As for 3), register-based architectures require register allocation to
map local variables to a possibly smaller number of registers. While
good algorithms exist, they are all compile-time based and must thus
take worst-case behaviour into account. This will invariably lead to
saving registers when it is (in a particular run-time situation) not
necessary. While the problem in most cases is small, it can in some
cases have a noticeable effect. This is especially true in languages
like LISP or Prolog, where basic blocks and procedures are small.

The above discussion has centered on loading and saving of local
variables, but there are other points to consider. In languages that use
a lot of heap access (LISP, Prolog, ...), a multiported cache is a huge
benefit, whereas a large set of registers is almost no help at all. Even
if the cache uses burst-mode access, it can benefit a heap. In fact, the
article "Cache Performance of Combinator Graph Reduction" by Koopman,
Lee and Siewiorek in this year's IEEE International Conference on
Computer Languages argues that a large cache block size is beneficial
for such languages.

All in all, IMHO a well-designed registerless architecture with a
multiported cache can perform just as well as a register architecture on
C-like languages and far better on languages like LISP and Prolog.

	Torben Mogensen (torbenm@diku.dk)
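The "do-not-save bit" idea above can be sketched as a tiny write-back cache model: when the stack pointer retreats on a procedure return, dirty lines that now lie above the stack top are marked dead and later evicted without any write-back. This is a hypothetical illustration of the mechanism only; the addresses, the dictionary-based "cache", and the downward-growing stack convention are all assumptions made for the sketch.

```python
class WriteBackStackCache:
    """Toy write-back cache over a downward-growing stack: addresses
    below sp are above the stack top, hence garbage."""
    def __init__(self, sp=100):
        self.memory = {}        # backing store: addr -> value
        self.lines = {}         # cached lines: addr -> value
        self.dirty = set()      # addresses with unwritten modifications
        self.sp = sp

    def write(self, addr, value):
        self.lines[addr] = value
        self.dirty.add(addr)

    def pop_frame(self, new_sp):
        # procedure return: SP moves up; lines below it are now garbage,
        # so clear their dirty bits ("do-not-save")
        self.sp = new_sp
        self.dirty -= {a for a in self.dirty if a < new_sp}

    def evict(self, addr):
        value = self.lines.pop(addr)
        if addr in self.dirty:              # live dirty line: write back
            self.memory[addr] = value
            self.dirty.discard(addr)
        # dead line: silently dropped, no memory traffic

c = WriteBackStackCache(sp=90)   # a frame occupying addresses 90..99
c.write(92, 7)                   # a local variable in that frame
c.pop_frame(100)                 # return: the frame is now dead
c.evict(92)
assert 92 not in c.memory        # dead line dropped, not written back
```

This captures why such a cache avoids the unnecessary save traffic described above while still writing back live data from enclosing frames.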
moss@cs.umass.edu (Eliot Moss) (11/14/90)
In article <1990Nov14.064225.14406@caliban.uucp> ig@caliban.uucp (Iain Bason) writes:
> I expect many languages other than C can also be made to allocate local
> variables on the stack. Lisp might be tough, and Smalltalk, but then
> they usually are on any architecture.
LISP is actually pretty easy, I believe; Smalltalk is harder because its
"stack frames" can be treated as objects (you can send messages to them!),
they can be retained, etc. I suppose that LISPs supporting continuations also
present problems, but that the stack can be used most of the time (see the
paper in the most recent SIGPLAN conference on the subject).
> A compiler for a machine like this would obviously be different. For
> instance, I imagine "register" coloring would be difficult to do when
> the number of "registers" is variable. You have to take into account
> the fact that other routines may have data in the cache, and only
> allocate space if you think it will save this routine more time than
> it will cost other routines.
On the contrary, register allocation via graph coloring would work reasonably
well. Ordinary register allocation is trying to compensate for the difference
in speed between registers and memory. If you eliminate registers and rely
solely on some kind of cache, then register allocation is easy, and equivalent
to having as many registers as you like. However, you *should* try to use as
few memory locations as possible, and graph coloring is nicely suited to doing
that. The nodes of the graph are the original collection of "variables"
(including various computed expressions, etc.); when the graph is k-colorable,
then these variables fit into k locations. The graph coloring algorithms used
in compilers attempt to minimize the number of colors (i.e., locations) used.
Hope this is reasonably clear ....
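[Editorial note: the coloring step Moss describes can be made concrete with a minimal greedy coloring in C. Names are invented: nodes are variables, an edge means two variables are live at the same time, and a color is a memory slot. Greedy coloring is not optimal, but it is the usual compiler flavor.]

```c
/* Greedy graph coloring: assign each variable the lowest "color"
   (memory slot) not used by an already-colored interfering variable.
   Returns the number of slots needed. */
#include <assert.h>
#include <stdbool.h>

#define NVARS 5

/* interference[i][j] true (for j < i) means vars i and j must get
   distinct slots; only the lower triangle is consulted here. */
int color_vars(bool interference[NVARS][NVARS], int color[NVARS])
{
    int max_color = -1;
    for (int v = 0; v < NVARS; v++) {
        bool used[NVARS] = { false };
        for (int u = 0; u < v; u++)
            if (interference[v][u])
                used[color[u]] = true;
        int c = 0;
        while (used[c]) c++;    /* lowest color free among neighbors */
        color[v] = c;
        if (c > max_color) max_color = c;
    }
    return max_color + 1;       /* number of memory locations used   */
}
```

A chain of five variables where only consecutive ones interfere is 2-colorable, i.e. five "variables" fit in two memory locations, which is the sense in which coloring still pays off without registers.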
--
J. Eliot B. Moss, Assistant Professor
Department of Computer and Information Science
Lederle Graduate Research Center
University of Massachusetts
Amherst, MA 01003
(413) 545-4206; Moss@cs.umass.edu
peter@ficc.ferranti.com (Peter da Silva) (11/15/90)
In article <56084@brunix.UUCP> cgy@cs.brown.edu (Curtis Yarvin) writes:
> >context switch is fast and easy, there's nothing but CCR, PC, and FP.
> Ah, but no... you have to flush your cache anyway, you don't really
> gain anything here.

Unless you're one of those direct-mapped cache people... :->
--
Peter da Silva.  `-_-'  +1 713 274 5180.  'U`  peter@ferranti.com
berg@cip-s04.informatik.rwth-aachen.de (AKA Solitair) (11/16/90)
Scott Draves writes:
> Has anyone ever thought about or done a registerless architecture?

Ever had a look at the FORTH engine (I think the part number was 4016)? See my other post under the subject: optimal processor.
--
Sincerely,                 berg%cip-s01.informatik.rwth-aachen.de@unido.bitnet
Stephen R. van den Berg.
"I code it in 5 min, optimize it in 90 min, because it's so well optimized:
it runs in only 5 min.  Actually, most of the time I optimize programs."
urjlew@uncecs.edu (Rostyk Lewyckyj) (11/17/90)
The original purpose of registers, or B boxes as they were called on the machine at Manchester where they were invented, was address modification for addressing arrays, rather than doing permanent code modification on the fly. Their use as temporary, volatile fast storage came later. So don't forget the original register uses.
bimandre@saturnus.cs.kuleuven.ac.be (Andre Marien) (11/20/90)
Article <1990Nov14.113748.3677@diku.dk> says:
> This will invariably lead to saving registers when this is (in a
> particular run-time situation) not necessary.  While the problem is in
> most cases small, it can in some cases have a noticeable effect.  This
> is especially true in languages like LISP or Prolog, where basic
> blocks and procedures are small.
> In languages that use a lot of heap access (LISP, Prolog, ...), a
> multiported cache is a huge benefit, whereas a large set of registers
> is of almost no help at all.
> All in all, IMHO a well-designed registerless architecture with a
> multiported cache can perform just as well as a register architecture
> on C-like languages and far better on languages like LISP and Prolog.

While it is true that architectures seem to forget languages like Prolog, the above quotes are not quite true. Let me first say that the fact that Prolog is forgotten can be justified by the small commercial interest compared to other languages. I hope this will change, of course (see signature).

In Prolog, basic blocks are small, but then Prolog is compiled very differently from C. There is nothing but recursion in Prolog, which is not translated to 'procedure calls' by any decent system I know of. Some attempts have been made to map Prolog stacks to C/Pascal procedure stacks, none really successful.

Register allocation for Prolog is also different from C/Pascal/... . There is a more complex abstract machine, with a lot of often-used registers, let's say 8. The calling conventions can use another 4 registers just for fast argument passing. The kernel algorithm in Prolog is unification, which adds another 2 registers. So some 16 registers can be used to great benefit. Anyone who has done both a 386 and a SPARC/MIPS port will know the difference the number of available registers makes.

Prolog does have burst heap/stack accesses:
  creating choicepoints:     some 10 words
  creating environments:     some 6 words
  creating structured data:  some 6 words
but I don't see a reason here for multiported access. Decent support for tag manipulation would be far more useful. This was one of the big disappointments on the SPARC.

Andre' Marien
bimandre@cs.kuleuven.ac.be
(ProLog by BIM, ex BIM_Prolog)
If opinions are found, they are not guaranteed to belong to anyone.
hankd@dynamo.ecn.purdue.edu (Hank Dietz) (11/21/90)
As to all the comments about needing only cache, I've said it before and I'll say it again.... Registers help because:

[1] They are fast
[2] Register refs don't interfere with the memory data path
[3] You never miss (i.e., you have static timing for schedules)
[4] Register names are shorter than addresses

A conventional cache gets you only benefit [1]; however, ambiguously aliased references (array elements and pointer targets) are effectively managed by a cache, whereas they require frequent flushing from registers. If you want all the benefits, you need both.... Well, almost. Actually, all you need is a mutant thing like CRegs. See the paper: H. Dietz and C-H Chi, "CRegs: A New Kind of Memory for Referencing Arrays and Pointers," from Supercomputing '88. If you don't like that, look at the paper in Supercomputing '90 by B. Heggy and M. Soffa -- it describes a somewhat more complex register value forwarding mechanism which works like fully-associative CRegs.

-hankd
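[Editorial note: the aliasing problem Dietz mentions is easy to see in C. In the hypothetical function below, the compiler cannot keep *p in a register across the store through q, because p and q might point at the same word -- exactly the case a cache handles transparently and CRegs attack in hardware.]

```c
/* A value loaded from *p cannot stay cached in a register across the
   store through q: the two pointers may alias, so *p must be reloaded. */
#include <assert.h>

int sum_twice(int *p, int *q)
{
    int a = *p;     /* load *p                                */
    *q = 7;         /* may or may not write the same location */
    int b = *p;     /* must reload: p and q might alias       */
    return a + b;
}
```

When p and q do alias, the second load sees the new value; when they don't, it sees the old one. A register copy of *p would be wrong in one of the two cases unless it is flushed, which is Dietz's point.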
mshute@cs.man.ac.uk (Malcolm Shute) (11/26/90)
In article <3168@ns-mx.uiowa.edu> jones@pyrite.cs.uiowa.edu (Douglas W. Jones) writes:
>My Ultimate RISK (Computer Architecture News, 1988) is a memory-to-memory
>architecture with no registers in the instruction execution unit other
>than the PC.  It has no arithmetic unit in the IEU either, which is why
>I call it an IEU instead of a CPU.  The registers and arithmetic unit(s)
>are out on the memory bus.  It was proposed as a purely pedagogical
>exercise, [...]

Mine went the other way (Microelectronics Journal Vol 15, No 3&5)... It had an ACC, but no PC. Instead there was an instruction in location zero of memory which, when executed, had its address field incremented, written back, and used as the address of the next instruction to be fetched.

You might have gathered that it wasn't tuned for high-speed use! Instead, the aim was to see if I could design a 16-bit processor using only 600 transistors. There were only 4 instructions in the instruction set, and getting it to do the equivalent of a PDP11 MOV memory,memory operation took about 7 instructions, in much the same contorted sort of way as Single Instruction Computers. It was a fun exercise. Probably not much use though.
--
Malcolm SHUTE.  (The AM Mollusc: v_@_ )  Disclaimer: all
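[Editorial note: Shute's PC-less fetch scheme can be sketched as a toy interpreter. The encoding and the two-instruction subset below are invented for illustration; the point is only that the "program counter" lives in the address field of the instruction at memory location zero, which is incremented in place before each fetch.]

```c
/* The "PC" is the operand field of mem[0]: bump it, fetch through it. */
#include <assert.h>

enum { HALT = 0, ADD = 1 };   /* hypothetical 2-instruction subset */

struct insn { int op; int operand; };

int run(struct insn mem[], int acc)
{
    for (;;) {
        mem[0].operand++;                     /* advance address field */
        struct insn i = mem[mem[0].operand];  /* fetch next insn       */
        if (i.op == HALT)
            return acc;
        acc += i.operand;                     /* ADD immediate to ACC  */
    }
}
```

Note that a jump in such a machine is just a store into mem[0]'s address field, which is why self-modification was fundamental to the design rather than a trick.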
daveh@cbmvax.commodore.com (Dave Haynie) (01/08/91)
In article <1990Nov21.004355.212@noose.ecn.purdue.edu> hankd@dynamo.ecn.purdue.edu (Hank Dietz) writes:
>As to all the comments about needing only cache, I've said it before
>and I'll say it again.... Registers help because:
>[1] They are fast
>[2] Register refs don't interfere with memory data path
>[3] You never miss (i.e., have static timing for schedules)
>[4] Register names are shorter than addresses
>A conventional cache gets you only benefit [1]; however, ambiguously
>aliased references (array elements and pointer targets) are
>effectively managed by a cache whereas they require frequent flushing
>from registers. If you want all the benefits, you need both....

Well, it seems to me that if you built a registerless machine right, you could pick up a few more points. A good cache is fast these days. So let's have three: one for data, one for instructions, and one to replace actual registers. So we get [1].

As for [2], registers do interfere with the memory path -- when they are swapped to main memory during a context switch. So if we have a good-sized register cache, in many cases we avoid interference not only between tasks but within a task as well. Like a Harvard machine, only with three internal data paths rather than two. I guess you have to decide how the register cache actually works during program execution -- one could treat each virtual register as on a normal fixed-register machine, but it would probably make as much sense to make it act like a register window machine. In today's silicon, you could have a 4-8K register cache with multiple-set associativity.

Number [3] is something of an issue -- with a task swap on a conventional machine, you "miss" only on task boundaries. Here, you miss the first time you access a register, but never again, at least until your task is swapped out and back in, in which case you may miss, but even that's not guaranteed.

Number [4] is solved by making all working register references relative to a real register, which points to the base of register space. The time to add in the offset from the base pointer can be hidden in the CPU pipeline if there's a dedicated adder for this purpose.

Still, with all that said, I'm not sure this puppy buys you much over the conventional approach, and it does make the design of the CPU more complex. It would definitely cut down on context swap time, and it does have the interesting property of making the number of logical registers used in a task definable by the OS, or even by the application if you split things into user and supervisor/kernel space.

> -hankd
--
Dave Haynie  Commodore-Amiga (Amiga 3000)  "The Crew That Never Rests"
{uunet|pyramid|rutgers}!cbmvax!daveh   PLINK: hazy   BIX: hazy
"Don't worry, 'bout a thing. 'Cause every little thing, gonna be alright" -Bob Marley
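[Editorial note: Haynie's fix for point [4] -- short register names interpreted as offsets from a base register into a larger register space -- can be sketched as follows. All sizes and names are invented; the base-plus-offset add is the part he suggests hiding in the pipeline.]

```c
/* Short "register numbers" address a window into a larger register
   space; moving the base re-maps every name, register-window style. */
#include <assert.h>

#define REGSPACE 1024     /* words of register-cache backing store  */

struct cpu {
    int regspace[REGSPACE];
    int regbase;          /* real register: base of current window  */
};

/* effective register address: the hidden add in the pipeline */
static int reg_addr(const struct cpu *c, int shortname)
{
    return (c->regbase + shortname) % REGSPACE;
}

int reg_read(const struct cpu *c, int shortname)
{
    return c->regspace[reg_addr(c, shortname)];
}

void reg_write(struct cpu *c, int shortname, int v)
{
    c->regspace[reg_addr(c, shortname)] = v;
}

/* a call or context switch just moves the base pointer */
void set_window(struct cpu *c, int base) { c->regbase = base; }
```

A 5-bit instruction field then names 32 registers out of 1024, and a context switch saves only the base pointer, which is what makes the swap cheap in Haynie's scheme.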
baum@Apple.COM (Allen J. Baum) (01/09/91)
[]
In article <17212@cbmvax.commodore.com> daveh@cbmvax.commodore.com (Dave Haynie) writes:
--arguments for a 'register cache' e.g.
>one could treat each virtual register as on a normal fixed-register
>machine, but it would probably make as much sense to make it act like
>a register window machine.
>make all working register references relative to a real register,
>which points to the base of register space.  The time to add in the
>offset from the base pointer can be hidden in the CPU pipeline if
>there's a dedicated adder for this purpose.

You've just described the AT&T CRISP pretty much. Look it up...
--
baum@apple.com  (408)974-3385  {decwrl,hplabs}!amdahl!apple!baum