brownell@harvard.UUCP (Dave Brownell) (03/14/84)
That comment about context switching speed on stack machines, versus the speed of access to registers (especially on the TI 9900 microprocessor), brought up something I've been wondering about for some time. Has anybody produced "stack" machines that keep some large portion of the top frame in the processor, instead of in memory? So to speak -- in "registers"? The number of frame elements in the processor could be fixed, or they could live in a small, dedicated, high-speed cache. High-speed register flushing would be a must; maybe it could be done in parallel with setting up call frames.

It seems to me that this approach could get many of the benefits of both stack machines (simpler compiler technology) and register machines (fast access for frequently used variables). I know that some compiler manipulations that I don't see done very often (does UNIX mislead me?) simulate this, and that a normal cache is still slower than registers. Am I missing something? Is there a way less radical than RISC to use simple compiler and hardware technology and still get high performance?

	Dave Brownell
	...{decvax,linus,sdcsvax}!genrad!wjh12!harvard!brownell
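[The scheme in the question can be sketched in a few lines of Python. This is a hypothetical illustration of the idea only, not a model of any real machine; all class and method names are invented. -ed.]

```python
# Hypothetical sketch of the idea in the question: the current frame's
# locals live in a few fast "register" slots, which are flushed to
# memory on a call so the callee's frame can reuse them. All names
# here are invented for illustration.

class FrameRegisters:
    def __init__(self, n_regs=4):
        self.n_regs = n_regs
        self.regs = [0] * n_regs   # fast slots for the current frame
        self.saved = []            # flushed caller frames, in memory

    def set_local(self, i, value):
        self.regs[i] = value       # register-speed access, no memory reference

    def get_local(self, i):
        return self.regs[i]

    def call(self):
        # Flush the caller's fast slots to memory; the question notes
        # this could be overlapped with setting up the callee's frame.
        self.saved.append(self.regs[:])
        self.regs = [0] * self.n_regs

    def ret(self):
        # Restore the caller's frame into the fast slots.
        self.regs = self.saved.pop()
```

A call/return pair costs one bulk flush and one bulk reload, while every local access in between runs at register speed.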
mark@umcp-cs.UUCP (03/15/84)
Cache is slower than registers because register instructions are shorter than cache instructions (which must reference a real memory address). Now let me see: if the instructions are also cached, this is less important, but at every branch it matters again for a little while.

--
Spoken: Mark Weiser
ARPA: mark@maryland	CSNet: mark@umcp-cs
UUCP: {seismo,allegra,brl-bmd}!umcp-cs!mark
tjt@kobold.UUCP (03/16/84)
Mark Weiser (umcp-cs!mark) claims that "cache is slower than registers because register instructions are shorter than cache instructions." This is true for register-machine architectures (pdp-11, vax, 68000, ...), but would not be true of a pure stack machine. The only operators with explicit addresses would be PUSH and POP; all others manipulate the values on the top of the stack.

--
Tom Teixeira, Massachusetts Computer Corporation. Westford MA
...!{ihnp4,harpo,decvax}!masscomp!tjt   (617) 692-6200 x275
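[Tom's observation, that on a pure stack machine only PUSH and POP carry addresses, can be illustrated with a toy interpreter. The encoding is hypothetical, not any real machine's instruction set. -ed.]

```python
# Toy zero-address ("pure stack") instruction set: only PUSH and POP
# name a memory address; ADD and MUL operate implicitly on the top of
# the stack, which is why they can be encoded in a single op-code byte.

def run(program, memory):
    stack = []
    for op, *operand in program:
        if op == "PUSH":                 # push memory[addr] onto the stack
            stack.append(memory[operand[0]])
        elif op == "POP":                # pop the top into memory[addr]
            memory[operand[0]] = stack.pop()
        elif op == "ADD":                # no address: pop two, push sum
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":                # no address: pop two, push product
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return memory

# d = (a + b) * c, with a, b, c, d at addresses 0..3:
mem = run([("PUSH", 0), ("PUSH", 1), ("ADD",),
           ("PUSH", 2), ("MUL",), ("POP", 3)],
          [2, 3, 4, 0])                  # leaves (2 + 3) * 4 = 20 in mem[3]
```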
tjt@kobold.UUCP (03/16/84)
While we're at it, I thought there were some real machines out there where cached memory addresses are effectively faster than registers. The hardware/microcode partially decodes the instruction and starts fetching the memory operand well in advance of when it's needed, but doesn't do any such prefetch for registers. I also thought the explanation for why PUSH instructions on the VAX-11/780 are slower than the corresponding MOV instructions was similar: there is lots of special microcode for MOVs that starts decoding the operand addresses as early as possible, but PUSH instructions are not processed that early.

--
Tom Teixeira, Massachusetts Computer Corporation. Westford MA
...!{ihnp4,harpo,decvax}!masscomp!tjt   (617) 692-6200 x275
guy@rlgvax.UUCP (Guy Harris) (03/17/84)
> Has anybody produced "stack" machines that keep some large portion of
> the top frame in the processor, instead of memory? So to speak -- in
> "registers"? The number of frame elements in the processor could be fixed,
> or they could be in a small, dedicated, high speed cache. Highspeed
> register flushing would be a must; maybe it could be done in parallel
> with setting up call frames.
>
> It seems to me that this approach could get many of the benefits of both
> stack machines (simpler compiler technology) and register machines (fast
> access for frequently used variables). I know that some compiler manipulations
> that I don't see done very often (does UNIX mislead me?) simulate this, and
> that a normal cache is still slower than registers.

The term "stack machine" usually refers to a machine whose instruction set implements Reverse Polish Notation: there is a "push" or "load" instruction which pushes something onto an evaluation stack, and the arithmetic instructions take the top N elements of the evaluation stack, pop them off, perform the arithmetic operation, and push the result back onto the stack.

Most such machines implement the top few elements of the stack in fast "registers" (although those "registers" are not usually directly accessible). The elements of the stack kept in registers take the place of the registers used for temporary results on general-register machines.

One could also give such a machine general registers; data could be pushed from them or popped into them just as it can be pushed from or popped into memory locations. These registers would work exactly like the registers used for variables on general-register machines. Presumably, these general registers would still have very short names and could be accessed more quickly than memory locations.
(The Xerox Mesa machine, which is a stack machine, has special versions of the "push" instruction that push things stored at a very small offset from the beginning of the current activation record, so those variables get especially short names. For instance, the general "push" instructions are 2 bytes long with an 8-bit offset, or 3 bytes with a 16-bit offset, while these special "push" instructions fit into a single byte, with the offset encoded into the op code itself.)

	Guy Harris
	{seismo,ihnp4,allegra}!rlgvax!guy
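[The length trade-off Guy describes can be sketched as an encoder. The op-code values and the range of offsets that fit in one byte are invented here, since the post doesn't give them; only the 1/2/3-byte length split comes from the post. -ed.]

```python
# Hypothetical encodings in the spirit of the Mesa scheme described
# above. The op-code values and the 16-slot single-byte range are
# invented; only the 1/2/3-byte length split comes from the post.

def encode_push(offset):
    if offset < 16:
        # special form: offset folded into the op code itself
        return bytes([0x40 | offset])                     # 1 byte
    elif offset < 0x100:
        return bytes([0x20, offset])                      # 2 bytes, 8-bit offset
    else:
        return bytes([0x30, offset >> 8, offset & 0xFF])  # 3 bytes, 16-bit offset
```

Variables near the start of the activation record get one-byte pushes, so a compiler that allocates the hottest locals first shrinks its code.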
johnl@haddock.UUCP (03/24/84)
#R:harvard:-17700:haddock:9500009:000:634
haddock!johnl    Mar 23 15:28:00 1984

The HP 3000 Series III, a fairly pure stack architecture, tries to keep the top four entries of the stack in registers. It physically has four fast registers, and dynamically relabels them as the stack expands and shrinks. Most instructions either strictly push and pop data, or else let you reference data a few words down into the stack.

The manual I read (quite a while ago) suggested that this was a big performance booster, but I've never seen a real evaluation of it. The 3000 was hobbled by some unfortunate 64K addressing limits, so stack speed may not have been what real users worried about most.

John Levine, ima!johnl
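[John's "dynamically relabels" can be sketched like this: indexing modulo 4 stands in for the relabeling hardware. Everything beyond the four-register count is assumed. -ed.]

```python
# Sketch of HP 3000-style top-of-stack registers: four physical fast
# registers hold the top four stack entries, relabeled (here: indexed
# modulo 4) as the stack grows and shrinks, so pushes and pops never
# copy data between registers. Details beyond the post are assumed.

class RelabeledTOS:
    N = 4

    def __init__(self):
        self.regs = [0] * self.N   # physical fast registers
        self.memory = []           # spilled deeper stack entries
        self.depth = 0             # logical stack depth

    def push(self, value):
        if self.depth >= self.N:
            # The slot being reused still holds the entry N below
            # the new top; spill it to memory first.
            self.memory.append(self.regs[self.depth % self.N])
        self.regs[self.depth % self.N] = value
        self.depth += 1

    def pop(self):
        self.depth -= 1
        value = self.regs[self.depth % self.N]
        if self.memory:
            # The vacated slot is exactly where (mod N) the next
            # deeper spilled entry belongs; refill it.
            self.regs[self.depth % self.N] = self.memory.pop()
        return value
```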
rentsch@unc.UUCP (Tim Rentsch) (03/27/84)
The Burroughs B7700 did cache the top of the stack (128 words, as I recall), and refilled the cache when it became 3/4 empty, or spilled it back to memory when it became 3/4 full. The hardware reference manual from Burroughs has the details.

Tim
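[The 3/4-empty / 3/4-full policy Tim describes is a watermark scheme. A sketch: the 128-word capacity and the thresholds come from the post; the half-full resting point and everything else are assumptions. -ed.]

```python
# Watermark policy in the spirit of the B7700 description above.
# Capacity and the 3/4 thresholds come from the post; the half-full
# resting point after a rebalance is an assumption.

CAPACITY = 128
LOW = CAPACITY // 4        # buffer 3/4 empty: bulk-refill from memory
HIGH = 3 * CAPACITY // 4   # buffer 3/4 full:  bulk-spill to memory
TARGET = CAPACITY // 2     # assumed occupancy after a rebalance

def rebalance(buffer, memory):
    """buffer[0] is the deepest cached entry; memory holds the rest."""
    if len(buffer) <= LOW and memory:
        while len(buffer) < TARGET and memory:
            buffer.insert(0, memory.pop())   # bring deeper entries back
    elif len(buffer) >= HIGH:
        while len(buffer) > TARGET:
            memory.append(buffer.pop(0))     # spill the deepest entries
    return buffer, memory
```

Because the buffer only rebalances at the watermarks, a long run of pushes and pops hovering near one depth never touches memory at all.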
darryl@ism780.UUCP (03/27/84)
#R:harvard:-17700:ism780:11500001:000:407
ism780!darryl    Mar 25 20:33:00 1984

I believe that Tandem's old T16 machines had the top 8 elements of the stack as registers. The registers were also directly addressable; in particular, the second 4 were index registers. I'm sure that the architecture made their compiler (TAL) easier to implement, but as for speed, well, just about anything I've worked on crawled faster than their machine did.

Darryl Richman
ima!ism780!darryl
emjej@uokvax.UUCP (03/29/84)
#R:harvard:-17700:uokvax:9900009:000:980
uokvax!emjej    Mar 20 09:24:00 1984

/***** uokvax:net.arch / harvard!brownell / 1:35 am Mar 15, 1984 */
Has anybody produced "stack" machines that keep some large portion of
the top frame in the processor, instead of memory? So to speak -- in
"registers"? The number of frame elements in the processor could be fixed,
or they could be in a small, dedicated, high speed cache. Highspeed
register flushing would be a must; maybe it could be done in parallel
with setting up call frames.
/* ---------- */

This is something that Elliott Organick mentions in his book on the B5500 as a possible improvement (I think the top n words, for some small n, are in faster memory; he proposed increasing n). Can someone from Burroughs tell us whether it is done in some of the B6700/6800 machines?

It's also, if I understand correctly, what Steve Johnson has been working on for a while; there was a joint SIGARCH/SIG(PLAN?) conference last year, I believe, containing a paper of his with some results.

James Jones
sam@nsc.UUCP (Sam Cramer) (03/29/84)
I believe that Wirth's "Lilith" machine (especially designed to run M-code, which his Modula-2 compiler produces) caches entries at the top of the stack.