scott@labtam.labtam.oz (Scott Colwell) (01/03/91)
rcg@lpi.liant.com (Rick Gorton) writes:
>> What features should be put into the CPU to improve performance and
>>reduce chip count?
>>
>SOME REGISTERS!!!!!

As has been pointed out, adding more registers to the ix86 would be almost
impossible (well, impractical anyway) due to the current instruction
encoding.  But this is not really the only way of addressing the problem:
current compilers fall back on stack frame variables after the scarce
registers have been allocated.  On the 486, reg-to-reg operations take one
clock, cached memory-to-reg take two, and reg-to-cached-mem take three.  By
improving this performance and making some changes to the cache allocation
scheme, the cache can and does compensate for the lack of general purpose
registers.

It becomes interesting when you consider whether the extra byte of offset
from the start of the stack frame that is required in the instruction
sequence is significant, and whether the other proposed scheme (another
escape byte to allow access to new GP registers) is better.  As
microprocessor performance continues to stretch memory bandwidth and
latency, keeping the code density reasonably high will always be a worthy
aim.  (As long as the data bandwidth isn't stupidly high due to lack of
registers :-)

Some ideas on how to make the on-chip cache work better for stack frames:

Allocate on writes for stack accesses.  The 486 allocates on reads only,
which means that an automatic variable is guaranteed to be a cache miss on
the first read.  The CPU can distinguish a stack address from data and code
since it uses a different segment register.  If this status were propagated
the way the code/data status is, the cache could alter its behaviour for
the stack.

As the on-chip cache becomes larger, it may be worth the effort to have a
special cache for the stack.  It would be acceptable for it not to snoop
_if_ it was guaranteed to only cache things on the stack.  Using virtual
tags might be an acceptable trade-off if the miss penalty after a flush was
not too high.
Filled from the on-chip data cache?  (Someone will let me know if I have
presumed too much here :-)
--
Scott Colwell                           Senior Design Engineer
Labtam Australia                        net: scott@labtam.oz.au
Melbourne, Australia                    phone: +61-3-587-1444
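Scott's allocate-on-read vs. allocate-on-write point can be made concrete
with a toy model.  The sketch below (Python, with an invented access trace
and a deliberately simplified one-word "line") just counts first-touch
misses for a fresh stack frame under each policy; it is an illustration of
the idea, not the 486's actual cache logic.

```python
# Hypothetical sketch: count cold-start misses for stack-frame accesses
# under read-allocate-only vs. allocate-on-write policies.

def misses(accesses, allocate_on_write):
    cached = set()                       # addresses currently in the cache
    n = 0
    for op, addr in accesses:            # op is 'R' (read) or 'W' (write)
        if addr not in cached:
            n += 1                       # every first touch misses
            if op == 'R' or allocate_on_write:
                cached.add(addr)         # reads always fill the line;
                                         # writes fill only if write-allocate
    return n

# Typical life of an automatic variable: written first, then read back.
frame = [('W', 0), ('R', 0), ('W', 4), ('R', 4), ('R', 0)]
print(misses(frame, allocate_on_write=False))  # 4: first reads miss too
print(misses(frame, allocate_on_write=True))   # 2: only the initial writes miss
```

With read-allocate only, the guaranteed miss on the first read of each
variable shows up exactly as Scott describes.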
kenw@skyler.calarc.ARC.AB.CA (Ken Wallewein) (01/04/91)
A short while ago I was going to post a message saying "Hey, I know!
Virtual registers!" :-).

Seriously, though, I remember reading the docs for the processor in the
Texas Instruments 99/4A.  As I recall, it had three registers: a couple
for operands, and one pointer register for a "virtual" register set in
RAM.  Changing that one register's contents gave one a wholly or partially
different set.  The docs claimed (of course) that this was the way of the
future, because processor <-> memory bandwidths were increasing (hah :-).
It seems to me that, with a good caching scheme, this could work out to be
just about as efficient as 'real' registers.  On the other hand, I
sometimes wonder whether registers aren't actually an optimizing kludge,
and whether an instruction set which made no explicit reference to
registers at all might be more efficient.

Speaking of caches, I seem to recall that single-storage-area caches are
supposed to be more efficient than caches which have separate areas for
different things.  But these would have difficulty dealing with addresses
which have low hit rates but high impact.  One would almost have to have
"cache priority".  Has this ever been done?

/kenw
kenw@noah.arc.ab.ca
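The workspace scheme Ken recalls (the TI 990/9900 family keeps its
"registers" in RAM behind a workspace pointer) fits in a few lines.  A
minimal sketch, with invented function names rather than TI's mnemonics,
and a tiny word-indexed memory:

```python
# Toy model of the TI 990/9900 "workspace" idea: the register file is
# 16 consecutive words of RAM, and one pointer (WP) says where it starts.

memory = [0] * 256
wp = 0                      # workspace pointer

def reg_read(n):
    return memory[wp + n]   # register n is simply memory[WP + n]

def reg_write(n, value):
    memory[wp + n] = value

reg_write(3, 42)            # R3 in the workspace at 0
wp = 16                     # "context switch": one pointer store, no saving
reg_write(3, 7)             # same register number, different workspace
wp = 0                      # switch back
print(reg_read(3))          # the old context's R3 is intact: 42
```

The single-store context switch is what makes the sub-10us figure quoted
later in the thread plausible: no register file ever has to be copied.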
baum@Apple.COM (Allen J. Baum) (01/04/91)
[]
>In article <5827@labtam.labtam.oz> scott@labtam.labtam.oz (Scott Colwell) writes:
>On the 486, reg to reg operations take one clock, cached memory to reg take
>two and reg to cached mem take three. By improving this performance and making
>some changes to the cache allocation scheme, cache can and does compensate
>for the lack of general purpose registers.

I'm afraid that I can't let this statement of fact go unchallenged.  I'm
assuming it is fact - I have no reason to believe otherwise.  But the plain
facts don't tell all the details, and those details are significant.

For example, it is easy to make a memory (or cache) access a single cycle
if you make your cycle slow enough.  I will make an assertion here: the
critical path for most processors with cache is the cache access path.

Secondly, you can make horribly complicated instructions take a single
cycle (even for fast cycles) if you make your pipeline deep enough.
Unfortunately, deep pipes have a problem with stalling because of branches
and dependencies when the pipeline gets deep.  So, the book says that some
instruction will take only one cycle.  What it doesn't say is that for an
average instruction mix the pipeline stalls so much that even with 100%
cache hits you'll get 2.5 cycles per instruction.

Be very careful when you think that cache accesses are fast & free; there
isn't a free lunch, and the cost is tricky to measure.  Second order
effects (like dependency stalls) are not necessarily down in the noise.

One further point: my sense of compiler technology is that 80x86 compilers
are very good at what they do.  However, because of the lack of registers,
there are all sorts of techniques they don't even attempt.  This
opportunity cost is not easily measured either.
--
baum@apple.com          (408)974-3385  {decwrl,hplabs}!amdahl!apple!baum
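The "one-cycle instructions, yet 2.5 cycles per instruction" point is just
arithmetic once stall penalties are folded in.  The numbers below are
invented for illustration (the post gives no mix); only the shape of the
calculation matters.

```python
# Illustrative effective-CPI calculation: a nominally single-cycle pipeline
# plus assumed stall frequencies and penalties.  All figures are made up.
base_cpi = 1.0
branch_freq, branch_stall = 0.20, 3    # 20% branches, 3-cycle bubble each
load_freq, load_use_stall = 0.25, 2    # 25% loads, 2-cycle load-use stall

cpi = base_cpi + branch_freq * branch_stall + load_freq * load_use_stall
print(cpi)   # about 2.1, in the neighbourhood of the 2.5 quoted above
```

Even with 100% cache hits, the stalls more than double the nominal CPI,
which is exactly the gap between the data book and the delivered speed.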
firth@sei.cmu.edu (Robert Firth) (01/04/91)
In article <KENW.91Jan3123808@skyler.calarc.ARC.AB.CA> kenw@skyler.calarc.ARC.AB.CA (Ken Wallewein) writes:

> Seriously, though, I remember reading the docs for processor in the
>Texas Instruments 99/4A. As I recall, it had three registers: a
>couple for operands, and one pointer register for a "virtual" register
>set in RAM. Changing that one register's contents gave one a whole or
>partially different set. The docs claimed (of course) that this was the
>way of the future, because processor <-> memory bandwidths were
>increasing (hah :-).

From my (rather large) library of computer manuals:

  "2.1.4 Workspace Concept
   The 990 register file is located in memory.  This imposes very little
   speed penalty because the 990 computers use fast semiconductor memory.
   The classifications "hardware register" and "memory" are arbitrary
   when both are semiconductor devices."

	[Texas Instruments: 990 Computer Family Systems Handbook, 1976]

Ah, but how else could you build a machine with a 250ns cycle time that
did a full context switch in less than 10us?  False premise, maybe, but
great engineering.
qzhe1@cs.aukuni.ac.nz (Qun Zheng ) (01/05/91)
Is there a modern processor without registers?  If there is, how does it
keep up the speed, and what sort of tricks does it play with its caches?

Chuck
-------
Dept CompSci
Auckland University
NEW ZEALAND
henry@zoo.toronto.edu (Henry Spencer) (01/06/91)
In article <KENW.91Jan3123808@skyler.calarc.ARC.AB.CA> kenw@skyler.calarc.ARC.AB.CA (Ken Wallewein) writes:
>... I sometimes wonder whether registers aren't actually an
>optimizing kludge, and that an instruction set which made no explicit
>reference to registers at all might be more efficient.

"More efficient" in what sense?  The fact is, if you have compilers that
can do intelligent and useful things like common-subexpression
optimization, there is a very high payoff for having a modest number of
random-access variables that are implemented in very fast storage and can
be addressed with very few instruction bits.  These need not be
"registers" in the traditional sense, as witness CRISP's use of special
addressing modes and a special cache for the very top of the current stack
frame, but they have to look a lot like them.

The major alternatives to explicit register references in instructions are
memory-to-memory machines and stack machines ("stack" here in the sense of
expression-evaluation stack, not call stack).  Memory-to-memory machines
lose big on the number of bulky memory addresses needed when expressions
start needing temporaries.  Stack machines lose big on the difficulty of
keeping useful values around for re-use, since the stack really wants to
throw a value away after using it.  Both lose big on the cost of memory
references unless the cache is *very* good.  Given good compilers,
register machines perform better at equivalent complexity: that
easily-accessed fast temporary storage is very useful.

> Speaking of caches, I seem to recall that single-storage-area caches
>are supposed to be more efficient than caches which have separate
>areas for different things...

Again, "more efficient" in what sense?  Current experience strongly points
to the reverse, at least for the obvious case of instructions vs. data.
Splitting a cache loses the ability to use any free space to fulfill a new
request, but gains the ability to specialize the hardware and parallelize
data paths.
--
"The average pointer, statistically,    |Henry Spencer at U of Toronto Zoology
points somewhere in X." -Hugh Redelmeier| henry@zoo.toronto.edu utzoo!henry
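Henry's three-way comparison can be seen in hand-compiled code for one
small expression.  The sequences below are illustrative pseudo-assembly
(the mnemonics and temporaries t1/t2 are invented), written out as Python
string lists so the instruction counts can be compared.

```python
# Hand-compiled sketches of "d = a*b + a*c" in three machine styles.
# The stack machine must fetch 'a' twice because the stack consumes it;
# the register machine keeps it live in r1.

register_machine = [
    "load  r1, a", "load  r2, b", "mul   r2, r1",   # r2 = a*b
    "load  r3, c", "mul   r3, r1",                  # r3 = a*c, reusing r1
    "add   r2, r3", "store r2, d",
]
stack_machine = [
    "push a", "push b", "mul",      # stack top: a*b
    "push a",                       # 'a' was consumed: fetch it again
    "push c", "mul", "add", "pop d",
]
mem_mem_machine = [                 # every operand is a full memory address
    "mul t1, a, b", "mul t2, a, c", "add d, t1, t2",
]
print(len(register_machine), len(stack_machine), len(mem_mem_machine))
```

Raw instruction counts flatter the memory-to-memory style: its three
instructions carry nine full memory addresses (Henry's "bulky addresses"
point), while the stack machine's extra `push a` is the cost of not being
able to keep a value around for re-use.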
baum@Apple.COM (Allen J. Baum) (01/06/91)
[]
>In article <1991Jan6.014925.10935@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
--a nice summary of why registers win--

The reason that caches are used is that memory management is
 - hard to do
 - a pain in the ass
 - best done with run-time info

When these are not an issue, it can be done extremely efficiently by the
programmer.  Lo and behold, for very small amounts of memory (like <32
variables), the programmer still has problems, but some other program (the
compiler) can manage it.  Hence, registers are extremely effective.  In
general, memory addresses are much bigger than register references, but
they don't have to be, e.g. CRISP, which uses a small offset from some
pointer.

As has been pointed out in the past, registers are the top of the memory
hierarchy.  The characteristics that make this level different from the
others: multiple ports, sub-cycle access, and extremely small size.
(Generally speaking, of course.)

>> Speaking of caches, I seem to recall that single-storage-area caches
>>are supposed to be more efficient than caches which have separate
>>areas for different things...
>
>Again, "more efficient" in what sense? Current experience strongly points
>to the reverse, at least for the obvious case of instructions vs. data.
>Splitting a cache loses the ability to use any free space to fulfill a new
>request, but gains the ability to specialize the hardware and parallelize
>data paths.

In the sense that one unified cache will have a better hit ratio than
separate I/D caches of half the size each, yes, it is more efficient.
However, hit ratio is not everything.  The performance gained by having
split caches be accessed in parallel more than makes up for the fairly
small loss due to extra misses.
--
baum@apple.com          (408)974-3385  {decwrl,hplabs}!amdahl!apple!baum
baum@Apple.COM (Allen J. Baum) (01/06/91)
[]
>In article <qzhe1.663050569@aucs7.cs.aukuni.ac.nz> qzhe1@cs.aukuni.ac.nz (Qun Zheng ) writes:
>Is there a modern processor without registers? If there is, how do they catch
>up the speed, playing which sort tricks on caches?

The only one I'm aware of is the AT&T CRISP, which is a mem-mem
architecture.  The only addressing modes are: immediate, SP-relative,
indirect SP-relative, and absolute (? could be wrong here... PC-relative
also?).  By caching the top of stack (i.e., the first 32 locations
relative to the SP) in an extremely small, dual-ported cache, these
references look more like register references, needing only five bits in
the instruction to reference them.  Unlike the usual cache tag mechanism,
there is a bounds check, since the top of stack must be contiguous.

In addition, they don't allocate on a miss; special instructions allocate
and deallocate (on procedure entry/exit).  In that sense, it looks like
register windows with variable-size windows.  So, it's sorta like
registers, except that it goes to memory if you access something else.
Needless to say, the absolute and indirect modes will generally miss.

ALSO, it will go to the registers if you access them through a pointer!
That means you don't have to worry about making sure a variable is in
memory if you might have a pointer to it.  For example, an interrupt
merely changes the SP.  All references then miss, and it starts running
slow, unless the SP gets restored long enough to save the "registers";
but you don't have to save anything if you don't want to.
--
baum@apple.com          (408)974-3385  {decwrl,hplabs}!amdahl!apple!baum
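The key mechanical difference from a normal cache is that a CRISP-style
stack-cache "hit" is a bounds check rather than a tag compare.  The sketch
below is my reading of the description above, not the actual AT&T design;
the class and its field names are invented.

```python
# Rough sketch of a stack cache: the cached region must be the contiguous
# top of stack, so a hit test is sp <= addr < sp + CACHE_WORDS.

CACHE_WORDS = 32            # "the first 32 locations relative to the SP"

class StackCache:
    def __init__(self):
        self.sp = 0
        self.lines = [0] * CACHE_WORDS  # the cached top-of-stack words
        self.memory = {}                # everything else

    def hit(self, addr):
        return self.sp <= addr < self.sp + CACHE_WORDS  # bounds, not tags

    def read(self, addr):
        if self.hit(addr):
            return self.lines[addr - self.sp]   # "register-like" access
        return self.memory.get(addr, 0)         # miss: plain memory access

sc = StackCache()
sc.lines[5] = 99
print(sc.read(5))     # inside the window: hit, 99
print(sc.read(100))   # outside the window: falls through to memory
```

An interrupt changing `sp` makes every old-frame address fail the bounds
check, which is exactly the "all references then miss" behaviour described.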
scott@labtam.labtam.oz (Scott Colwell) (01/07/91)
baum@Apple.COM (Allen J. Baum) writes:
>>In article <5827@labtam.labtam.oz> scott@labtam.labtam.oz (Scott Colwell) writes:
>>On the 486, reg to reg operations take one clock, cached memory to reg take
>>two and reg to cached mem take three. By improving this performance and making
>>some changes to the cache allocation scheme, cache can and does compensate
>>for the lack of general purpose registers.
>I'm afraid that I can't let this statement of fact go unchallenged.
>I'm assuming it is fact- I have no reason to believe otherwise.

Thankyou ;-)

>For example, it is easy to make a memory (or cache) access a single cycle
>if you make your cycle slow enough. ..
>Secondly, you can make horribly complicated instructions take a single
>cycle (even for fast cycles) if you make your pipeline deep enough.
>Unfortunately, deep pipes have a problem with stalling because of branches and
>dependencies when the pipeline gets deep. ..

I accept that all of these are valid comments, but again using the example
of the 486, Intel have come up with an implementation that goes quickly.
It appears to have integer performance comparable to many of the RISC
processors (although not up with the faster ones).  If we accept that the
486 at 33MHz has around 15 vax mips (Intel's quoted integer SPEC is 17.6,
Nov/Dec 1990 issue of Solutions), then we get somewhere around 2.2 clocks
per 'vax instruction'.

I hate to appear to be supporting a less than perfect architecture, but
for Intel the name of the game is not to make a fast microprocessor but to
make the fastest x86-compatible processor.  In this context, I think that
Intel have made good design tradeoffs.  Imagine what might be produced if
the silicon technology and the effort that are put into the x86 and 860
groups were put into just one RISC family rather than the two RISC and one
CISC family that Intel have today (i.e. x86, 860, 960).
--
Scott Colwell                           Senior Design Engineer
Labtam Australia                        net: scott@labtam.oz.au
Melbourne, Australia                    phone: +61-3-587-1444
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (01/07/91)
In article <5833@labtam.labtam.oz> scott@labtam.labtam.oz (Scott Colwell) writes:
| Imagine what might be produced if
| the silicon technology and the effort that are put into the x86 and 860
| groups was put into just one RISC family rather than the two RISC and one
| CISC family that Intel have today. (i.e. x86, 860, 960)

I really don't see that any of the lines you mention are suffering from a
lack of resources (except maybe the 486), and no one CPU would be likely
to fit all the markets and be priced for mass marketing.  And you don't
have to read this group long to realize that some people will hate
everything Intel does, because Intel does it.  Not that this is unique;
some people feel the same about IBM or Apple, CISC, etc.
--
bill davidsen  (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game. If you win you can use unix.
baum@Apple.COM (Allen J. Baum) (01/08/91)
[]
>In article <5833@labtam.labtam.oz> scott@labtam.labtam.oz (Scott Colwell) writes:
--my comments on using cache as a substitute for more regs--
>
>I hate to appear to be supporting a less than perfect architecture but
>for Intel, the name of the game is not to make a fast microprocessor
>but to make the fastest x86 compatible processor. In this context, I think
>that Intel have made good design tradeoffs.

Well, we may be on the same side.  Given the 80x86 ISA, what Intel did
might be the best solution.  What I was arguing was that cache is not a
substitute for more registers, although marketing benchmarks and numbers
might make it appear so.
--
baum@apple.com          (408)974-3385  {decwrl,hplabs}!amdahl!apple!baum
pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/09/91)
On 6 Jan 91 01:49:25 GMT, henry@zoo.toronto.edu (Henry Spencer) said:

henry> The major alternatives to explicit register references in
henry> instructions are memory-to-memory machines and stack machines
henry> ("stack" here in the sense of expression-evaluation stack, not
henry> call stack). Memory-to-memory machines lose big on the number of
henry> bulky memory addresses needed when expressions start needing
henry> temporaries. Stack machines lose big on the difficulty of
henry> keeping useful values around for re-use, since the stack really
henry> wants to throw a value away after using it.

I'd like to repeat my usual fundamental truth on this: the latter
statement need not be true, if one has multiple stacks.  A little
analysis will show why.

Statistics exist that show that almost any expression and a majority of
code sequences can, on a single-threaded CPU (one with one ALU), be
compiled without spills with FOUR spare registers[1].  To add abundant
inter-expression caching we need another four.  This amounts to saying
that usually not more than four things try to happen independently in an
expression.  Statistics exist that show that the average expression is
exceedingly simple[2], and frequently reused values are quite few as well.

Now, a single-stack architecture loses out badly when two conditions are
true:

1) The lifetimes of the spills do not nest gracefully.

2) The CPU is not single threaded and has multiple ALUs.

A single stack loses badly because in both[3] cases operations can only
happen at the top of stack, and the separate logical (case 1) or physical
(case 2) computation streams interfere and thrash around it.  The effect
of case 1) is that you see a lot of seemingly unnecessary POP, PUSH and
DUP operations used just to rearrange the elements around the top of
stack, to bring each computation thread 'in focus' on it.  This happens
especially often with Fortran-style codes, some of which do have
complicated expressions and address/value stream interference.

The obvious solution is to have multiple tops of stack.  The CRISP
architecture does this by allowing a short form of address for the first
32 slots on the stack; this seems to work quite well.  My favourite
solution would be to have multiple stacks.  How many?  FOUR is the
answer, because it is exceedingly rare that an expression contains as
many independent threads of computation.  A superscalar machine may
attach independent ALUs to each stack, or even specialize them[4].

In the old problem of providing accumulators and on-chip cache, the
general purpose register file solution is to hide the accumulators and
make only the on-chip cache visible (thus register renaming); the stack
solution is to have a single accumulator and a stack cache behind it, per
stack.  Note that the stack solution has the benefit of better
information encoding for the most common case, as demonstrated by the
impressive code densities of stack-based architectures[5].

----------------

[1] Examples:

a+b+c+d     this, on a single-threaded CPU, only needs one register
            to be compiled without spills.

a/b + c*d   this instead needs two registers for compiling without
            spilling, because two things are going on at the same
            time (logically).

[2] Most expressions, statically, are of the form 'a = b OP c' or
simpler; dynamically things can be slightly more complicated.  Note that
the operands may involve address computations.

[3] Case 1) seems to be relatively rare but important; case 2) is more
common than you think, because without being superscalar, many simple
implementations have different ALUs for computing values and addresses.

[4] Indeed, in practice superscalar implementations find it exceedingly
hard to find degrees of microparallelism greater than two or three,
unless one is talking codes that are best suited to vector, SIMD or MIMD
architectures.  Special purpose architectures for special purpose codes!

[5] The reason is that most expressions are indeed stack based, i.e. the
lifetimes of spills nest nicely, while register machines assume the worst
case, i.e. all spills have completely unrelated lifetimes and have to be
managed explicitly.  A multiple stack machine is a nice compromise.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
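The register counts in footnote [1] can be reproduced mechanically with a
Sethi-Ullman-style calculation.  The sketch below assumes a machine whose
ops allow one memory operand (so a leaf on the right needs no register)
and ignores operand commuting; the tuple encoding of expressions is
invented for illustration.

```python
# Minimum registers to evaluate an expression tree, assuming reg-op-memory
# instructions (one memory operand per op) and no spilling.

def need(tree):
    if isinstance(tree, str):
        return 1                  # a bare variable must be loaded
    _, left, right = tree
    if isinstance(right, str):
        return need(left)         # right operand comes straight from memory
    l, r = need(left), need(right)
    # Evaluate the bigger side first, then hold its result in one register
    # while the other side is evaluated.
    return max(l, r) if l != r else l + 1

chain = ('+', ('+', ('+', 'a', 'b'), 'c'), 'd')        # a+b+c+d
two_threads = ('+', ('/', 'a', 'b'), ('*', 'c', 'd'))  # a/b + c*d

print(need(chain))        # 1: the chain nests, one accumulator suffices
print(need(two_threads))  # 2: two independent sub-computations live at once
```

The second case is exactly where a single evaluation stack starts shuffling
values around its top, and where a second stack (or register) pays off.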
anton@bkj386.uucp (Anton Aylward) (01/09/91)
CPU's without registers?  If you mean computational registers, as apart
from internal (non-programmable) ones, how about the Texas SBP 9900.

/anton
--
Anton J Aylward
3355 Don Mills Rd,
preston@ariel.rice.edu (Preston Briggs) (01/09/91)
pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>I'd like to repeat my usual fundamental truth on this: ...
>[1] Examples:
>
>a+b+c+d this, on a single threaded CPU, only needs one register
> to be compiled without spills.
>
>a/b + c*d this instead needs two registers for compiling without
> spilling, because two things are going on at the same
> time (logically).
>
>[2] Most expressions, statically, are of the form 'a = b OP c' or
>simpler; dynamically things can be slightly more complicated. Note that
>the operands may involve address computations.

These examples gloss over the number of registers required for addressing
and the amount of reuse available due to addressing.  We all know about
array addressing expressions.  Consider also pointer chasing in C or
Pascal, and the addressing of non-local (and non-global) variables in
Algol and its descendants.  A large portion of the work done by various
forms of strength reduction, common subexpression elimination (local and
global), and loop invariant code motion involves all these implied
calculations and memory accesses.  In the case of Fortran, very little of
the user's explicit code is ever touched by the optimizer (mostly because
it _is_ very simple, as mentioned above and in Knuth's "An empirical
study of Fortran..." paper).

>[5] The reason is that most expressions are indeed stack based, i.e. the
>lifetimes of spills nest nicely, while register machines assume the
>worst case, i.e. all spills have completely unrelated lifetimes and have
>to be managed explicitly. A multiple stack machine is a nice compromise.

Are there any papers describing how to generate code for such a machine?
It seems as though the many stacks must still be managed explicitly.
Additionally, we must solve the problem of "which items on which stack"
and "when do we pop what values from the stack".

Preston Briggs