[net.arch] Hybrid stack/register machines

brownell@harvard.UUCP (Dave Brownell) (03/14/84)

That comment about context-switching speed on stack machines, versus
the speed of access to registers (especially on the TI 9900 microprocessor),
brought up something I've been wondering about for some time.

Has anybody produced "stack" machines that keep some large portion of
the top frame in the processor, instead of memory?  So to speak -- in
"registers"?  The number of frame elements in the processor could be fixed,
or they could be in a small, dedicated, high-speed cache.  High-speed
register flushing would be a must; maybe it could be done in parallel
with setting up call frames.

It seems to me that this approach could get many of the benefits of both
stack machines (simpler compiler technology) and register machines (fast
access for frequently used variables).  I know that some compiler manipulations
that I don't see done very often (does UNIX mislead me?) simulate this, and
that a normal cache is still slower than registers.  Am I missing something?
Is there a way less radical than RISC to use simple compiler and hardware
technology and still get high performance?
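
To make the question concrete, here is roughly what I have in mind, modeled
in C.  Everything in it -- the eight-slot size, the names, the flush-on-call
policy -- is invented for illustration; on real hardware the spill would
presumably overlap the rest of the call sequence rather than run before it.

/*
 * Sketch (not any real machine): keep the top call frame's slots in a
 * small on-chip register file, and flush them to the memory stack only
 * when a new frame is set up.
 */
#include <stdio.h>

#define FRAME_REGS 8                /* slots of the top frame kept on chip */

static int memstack[1024];          /* the in-memory stack                 */
static int sp = 0;                  /* next free word in memstack          */
static int frame_reg[FRAME_REGS];   /* on-chip copy of the current frame   */
static int frame_len = 0;           /* how many slots are live             */

/* CALL: flush the caller's frame to memory, then start a fresh one.
 * On real hardware this store burst could overlap the rest of the call
 * sequence (computing the return address, branching, and so on).      */
static void call(int callee_locals)
{
    for (int i = 0; i < frame_len; i++)
        memstack[sp++] = frame_reg[i];      /* spill the caller's frame */
    frame_len = callee_locals;              /* callee's locals now live
                                               in the fast registers    */
}

/* RETURN: reload the caller's frame from memory. */
static void ret(int caller_locals)
{
    frame_len = caller_locals;
    for (int i = frame_len - 1; i >= 0; i--)
        frame_reg[i] = memstack[--sp];      /* refill the caller's frame */
}

int main(void)
{
    frame_len = 3;                  /* pretend the caller has 3 locals */
    frame_reg[0] = 42;
    call(2);                        /* callee with 2 locals            */
    frame_reg[0] = 7;               /* callee uses the fast slots      */
    ret(3);                         /* caller's locals come back       */
    printf("%d\n", frame_reg[0]);   /* prints 42                       */
    return 0;
}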



	Dave Brownell
	...{decvax,linus,sdcsvax}!genrad!wjh12!harvard!brownell

mark@umcp-cs.UUCP (03/15/84)

Cache is slower than registers because register instructions are shorter
than cache instructions (which must reference a real memory address).
Now let me see: if the instructions are also cached, this matters less,
but at every branch it matters again for a little while.
-- 
Spoken: Mark Weiser 	ARPA:	mark@maryland
CSNet:	mark@umcp-cs 	UUCP:	{seismo,allegra,brl-bmd}!umcp-cs!mark

tjt@kobold.UUCP (03/16/84)

Mark Weiser (umcp-cs!mark) claims that "cache is slower than registers
because register instructions are shorter than cache instructions."
This is true for register-machine architectures (pdp-11, vax, 68000,
...), but would not be true of a pure stack machine.  The only
operators with explicit addresses would be PUSH and POP -- all others
manipulate the values on the top of the stack.
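
To make that concrete, here is a toy zero-address machine in C.  The opcodes
and everything else are invented -- it resembles no shipping hardware -- but
it shows the point: only PUSH and POP carry an address field, while ADD and
MUL name nothing at all.

/* Toy zero-address stack machine, purely for illustration.  Only PUSH
 * and POP carry an operand (a memory address); ADD and MUL work
 * implicitly on the top of the evaluation stack, so they need no
 * address bits at all.                                               */
#include <stdio.h>

enum op { PUSH, POP, ADD, MUL, HALT };

struct insn { enum op op; int addr; };   /* addr used only by PUSH/POP */

static int mem[16];        /* "main memory"     */
static int stk[16], tos;   /* evaluation stack  */

static void run(const struct insn *p)
{
    for (;; p++) {
        switch (p->op) {
        case PUSH: stk[tos++] = mem[p->addr]; break;
        case POP:  mem[p->addr] = stk[--tos]; break;
        case ADD:  tos--; stk[tos-1] += stk[tos]; break;
        case MUL:  tos--; stk[tos-1] *= stk[tos]; break;
        case HALT: return;
        }
    }
}

int main(void)
{
    /* d = a + b*c, with a, b, c, d at addresses 0..3 */
    static const struct insn prog[] = {
        {PUSH,0}, {PUSH,1}, {PUSH,2}, {MUL,0}, {ADD,0}, {POP,3}, {HALT,0}
    };
    mem[0] = 1; mem[1] = 2; mem[2] = 3;
    run(prog);
    printf("%d\n", mem[3]);          /* prints 7 */
    return 0;
}

Note that ADD and MUL could each be encoded in a byte or less, while PUSH and
POP must leave room for an address.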
-- 
	Tom Teixeira,  Massachusetts Computer Corporation.  Westford MA
	...!{ihnp4,harpo,decvax}!masscomp!tjt   (617) 692-6200 x275

tjt@kobold.UUCP (03/16/84)

While we're at it, I thought there were some real machines out there
where cached memory operands are effectively faster than registers.
The hardware/microcode partially decodes the instructions and starts
fetching the memory operand well in advance of when it's needed, but
does no such prefetch for register operands.

I also thought that the explanation for why PUSH instructions on the
VAX-11/780 are slower than the corresponding MOV instructions was
similar: there is lots of special microcode for MOVs that starts
decoding the operand addresses as early as possible, but PUSH
instructions get no such early treatment.
-- 
	Tom Teixeira,  Massachusetts Computer Corporation.  Westford MA
	...!{ihnp4,harpo,decvax}!masscomp!tjt   (617) 692-6200 x275

guy@rlgvax.UUCP (Guy Harris) (03/17/84)

> Has anybody produced "stack" machines that keep some large portion of
> the top frame in the processor, instead of memory?  So to speak -- in
> "registers"?  The number of frame elements in the processor could be fixed,
> or they could be in a small, dedicated, high-speed cache.  High-speed
> register flushing would be a must; maybe it could be done in parallel
> with setting up call frames.

> It seems to me that this approach could get many of the benefits of both
> stack machines (simpler compiler technology) and register machines (fast
> access for frequently used variables).  I know that some compiler manipulations
> that I don't see done very often (does UNIX mislead me?) simulate this, and
> that a normal cache is still slower than registers.

When you say "stack machines", that usually refers to machines whose instruction
set implements Reverse Polish Notation: there is a "push" or "load" instruction
which pushes something onto an evaluation stack, and the arithmetic instructions
take the top N elements of the evaluation stack, pop them off, perform the
arithmetic operation, and push the result back onto the stack.  Most such
machines implement the top few elements of the stack in fast "registers"
(although those "registers" are not usually accessible directly).  The elements
of the stack kept in registers take the place of the registers used for temporary
results in general-register machines.

One could also give such a machine general registers; data could be pushed from
them or popped into them just as it can be pushed from or popped into
memory locations.  These registers would work exactly like the registers used
for variables in general-register machines.  Presumably, these general
registers would still have very short names and could be accessed more quickly
than memory locations.  (The Xerox Mesa machine, which is a stack machine,
has special versions of the "push" instruction that push things at a very
small offset from the beginning of the current activation record; this means
that those variables get very short names.  The ordinary "push" instructions
are 2 bytes with an 8-bit offset or 3 bytes with a 16-bit offset; the special
"push" instructions, however, fit into only one byte, with the offset encoded
into the op code.)
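
As a sketch of how such an encoding might be decoded -- the opcode values and
names below are invented, not Mesa's actual byte codes; only the 1-, 2- and
3-byte shapes matter:

/* Sketch of decoding Mesa-style "push local" forms.  The opcode values
 * and names here are made up; only the shape matters:
 *   1 byte : offset 0..7 folded into the opcode itself
 *   2 bytes: 8-bit offset follows the opcode
 *   3 bytes: 16-bit offset follows the opcode                        */
#include <stdio.h>

enum { PUSHL0 = 0x10,            /* PUSHL0..PUSHL7: 1-byte forms      */
       PUSHLB = 0x20,            /* opcode + 8-bit offset             */
       PUSHLW = 0x21 };          /* opcode + 16-bit offset            */

/* Decode one instruction at code[i]; print the local offset it pushes
 * and return the instruction length in bytes.                        */
static int decode(const unsigned char *code, int i)
{
    int op = code[i];

    if (op >= PUSHL0 && op < PUSHL0 + 8) {
        printf("push local %d (1-byte form)\n", op - PUSHL0);
        return 1;
    }
    if (op == PUSHLB) {
        printf("push local %d (2-byte form)\n", code[i+1]);
        return 2;
    }
    if (op == PUSHLW) {
        printf("push local %d (3-byte form)\n",
               code[i+1] | (code[i+2] << 8));
        return 3;
    }
    return 1;                    /* everything else: pretend 1 byte   */
}

int main(void)
{
    static const unsigned char code[] = {
        PUSHL0 + 3,              /* the common case costs one byte    */
        PUSHLB, 200,             /* bigger offset: two bytes          */
        PUSHLW, 0x34, 0x12       /* rare case: three bytes            */
    };
    for (int i = 0; i < (int)sizeof code; )
        i += decode(code, i);
    return 0;
}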

	Guy Harris
	{seismo,ihnp4,allegra}!rlgvax!guy

johnl@haddock.UUCP (03/24/84)

haddock!johnl    Mar 23 15:28:00 1984

The HP3000 Series III, a fairly pure stack architecture, tries to keep
the top four entries of the stack in registers.  It physically has four
fast registers, and dynamically relabels them as the stack expands and
shrinks.  Most instructions either strictly push and pop data, or else
allow you to reference data a few words into the stack.
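
Here is a rough model of the relabeling idea in C.  It is not HP's
implementation -- the spill-on-demand policy and all the names are made up --
but it shows how four fast registers can be treated as a circular window on
the top of the stack, with the oldest entry spilled only when a push needs
its slot and one entry pulled back in when a pop runs the registers dry.

/* Model of "four fast registers, dynamically relabeled": the top few
 * stack entries live in a small circular register file; no data moves
 * between the registers on a spill, only the labeling changes.        */
#include <stdio.h>

#define TOS_REGS 4

static int mem[1024];            /* stack kept in memory                */
static int sp;                   /* words of the stack now in memory    */
static int tosreg[TOS_REGS];     /* the four fast registers             */
static int base;                 /* which register holds the oldest entry */
static int ntos;                 /* how many registers are live         */

static void push(int v)
{
    if (ntos == TOS_REGS) {                  /* all four in use:        */
        mem[sp++] = tosreg[base];            /* spill the oldest...     */
        base = (base + 1) % TOS_REGS;        /* ...and relabel          */
        ntos--;
    }
    tosreg[(base + ntos++) % TOS_REGS] = v;
}

static int pop(void)
{
    int v = tosreg[(base + --ntos) % TOS_REGS];
    if (ntos == 0 && sp > 0) {               /* registers empty:        */
        base = 0;
        tosreg[0] = mem[--sp];               /* refill one on demand    */
        ntos = 1;
    }
    return v;
}

int main(void)
{
    int sum = 0;
    for (int i = 1; i <= 6; i++)             /* deeper than 4 entries,  */
        push(i);                             /* so two spills happen    */
    while (ntos > 0)
        sum += pop();
    printf("%d\n", sum);                     /* prints 21               */
    return 0;
}

The real hardware presumably relabels with a counter rather than computing a
modulus, but the effect is the same: a spill moves one word to memory and no
data shuffles between the four registers.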

The manual I read (quite a while ago) suggested that this was a big
performance booster, but I've never seen a real evaluation of it.  The
3000 was hobbled by some unfortunate 64K addressing limits, so stack
speed may not have been what real users worried about most.

John Levine, ima!johnl

rentsch@unc.UUCP (Tim Rentsch) (03/27/84)

The Burroughs B7700 did cache the top of the stack (128 words, as I recall),
and it refilled (or emptied) that cache when it became 3/4 empty (or 3/4 full).
The hardware reference manual from Burroughs has the details.
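
A toy model of that watermark policy in C -- the 128-word size and the 3/4
thresholds are as recalled above, while the block size and everything else is
invented.  The point is only the hysteresis: spills and fills happen in bursts
(which the hardware can overlap with execution), not on every push and pop.

/* Watermark (hysteresis) policy for a top-of-stack cache.  This model
 * tracks only word counts, not values: when the cache drifts past 3/4
 * full, a block of its oldest words goes to memory; when it drifts past
 * 3/4 empty, a block comes back in.                                    */
#include <stdio.h>

#define CACHE   128                    /* words of top-of-stack cache   */
#define HIGH    (3 * CACHE / 4)        /* spill above this              */
#define LOW     (CACHE / 4)            /* fill below this (3/4 empty)   */
#define BLOCK   32                     /* words moved at a time         */

static int cache_used = 64;            /* words currently cached        */
static int mem_words  = 0;             /* words spilled to memory       */

/* Run between instructions (in real hardware, concurrently with them). */
static void maintain(void)
{
    if (cache_used > HIGH) {
        cache_used -= BLOCK;           /* spill the oldest block        */
        mem_words  += BLOCK;
    } else if (cache_used < LOW && mem_words >= BLOCK) {
        cache_used += BLOCK;           /* refill a block                */
        mem_words  -= BLOCK;
    }
}

int main(void)
{
    for (int i = 0; i < 200; i++) {    /* 200 pushes...                 */
        cache_used++;
        maintain();
    }
    for (int i = 0; i < 200; i++) {    /* ...then 200 pops              */
        cache_used--;
        maintain();
    }
    /* 64 words remain in total, as at the start, split between the two */
    printf("cache=%d memory=%d\n", cache_used, mem_words);
    return 0;
}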

Tim

darryl@ism780.UUCP (03/27/84)

ism780!darryl    Mar 25 20:33:00 1984

I believe that Tandem's old T16 machines had the top 8 elements of the
stack as registers.  The registers were also directly addressable;
in particular, the second 4 were index registers.  I'm sure that the
architecture made their compiler (TAL) easier to implement, but as for
speed, well, just about anything I've worked on crawled faster than
their machine did.

    Darryl Richman
    ima!ism780!darryl

emjej@uokvax.UUCP (03/29/84)

uokvax!emjej    Mar 20 09:24:00 1984

/***** uokvax:net.arch / harvard!brownell /  1:35 am  Mar 15, 1984 */
Has anybody produced "stack" machines that keep some large portion of
the top frame in the processor, instead of memory?  So to speak -- in
"registers"?  The number of frame elements in the processor could be fixed,
or they could be in a small, dedicated, high-speed cache.  High-speed
register flushing would be a must; maybe it could be done in parallel
with setting up call frames.
/* ---------- */

This is something that Elliot Organick mentions in his book on the
B5500 as a possible improvement (I think the top n words, for some
small n, are in faster memory; he proposed increasing n). Can someone
from Burroughs tell us whether it is done in some of the B6700/6800
machines?

It's also, if I understand correctly, what Steve Johnson has been
working with for a while; there was a joint SIGARCH/SIG(PLAN?)
conference last year, I believe, containing a paper of his with some
results.

					James Jones

sam@nsc.UUCP (Sam Cramer) (03/29/84)

I believe that Wirth's "Lilith" machine (especially designed to run
M-code, which his Modula-2 compiler produces) caches entries at the
top of the stack.