[net.arch] Stack architectures - why not?

cg@myriasa.UUCP (Chris Gray) (09/05/85)

First, keep in mind that I'm a compiler writer by trade, so I like the
idea of a pure stack architecture because it's so easy to generate code
for (I did it for a machine I designed and emulated).

I've been told by a couple of people who are normally well informed that
a pure stack architecture just isn't practical. They have NOT been able
to convince me of this. Anybody out there want to try?

You say that a stack machine spends far too much time accessing the stack
in memory. Easy - have the top, say, 100 values of the stack in on-board
registers.

You say the large register set requires too much flushing on a task
switch (or redundant stores if you do it in the background). Simple -
don't flush it so much; have about 4 of them, with control over the
flushing policies that are to be used. One can be reserved for the
supervisor, and is only flushed if there is no bus activity at all. Also,
when the on-board top-of-stack is close to being full (90%?), the far top
end of it can be written out (NOT flushed) during any idle bus cycles.
The remaining top-of-stacks are used for user tasks (it may be better
to have more of them, but smaller - depends on how often you task switch
and on how many tasks you plan on having). They still need not be write-
through, but would want the same "write on nearly full" policy. The stack
sets for non-active user tasks can be written out whenever convenient,
probably in LRU order (the goal would be to have a clean one (all modified
values written out) when one is needed for another task).
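For concreteness, here is the "write on nearly full" policy sketched in C. Everything in it (the 100-slot size, the 90% high-water mark, the structure and field names) is invented for illustration, not taken from any real design:

```c
#include <assert.h>

/* Hypothetical model of an on-chip top-of-stack cache: a bank of
   registers backed by memory.  When occupancy passes a high-water
   mark (90% here), the deepest dirty entries are written back
   ("cleaned") but kept resident, so a later task switch finds them
   already flushed.  All names and sizes are illustrative. */

#define CACHE_SLOTS 100
#define HIGH_WATER  90          /* clean when 90% full */

typedef struct {
    int  value[CACHE_SLOTS];
    int  dirty[CACHE_SLOTS];    /* 1 = modified since last write-back */
    int  depth;                 /* live entries in the cache */
    int  *backing;              /* memory image of the stack */
    int  cleaned;               /* count of background write-backs */
} StackCache;

static void push(StackCache *c, int v)
{
    assert(c->depth < CACHE_SLOTS);
    c->value[c->depth] = v;
    c->dirty[c->depth] = 1;
    c->depth++;

    /* "Write on nearly full": during idle bus cycles, write back the
       deepest dirty entries without evicting them. */
    if (c->depth >= HIGH_WATER) {
        for (int i = 0; i < c->depth / 2; i++) {
            if (c->dirty[i]) {
                c->backing[i] = c->value[i];
                c->dirty[i] = 0;
                c->cleaned++;
            }
        }
    }
}

static int pop(StackCache *c)
{
    assert(c->depth > 0);
    return c->value[--c->depth];
}
```

The point of the sketch: once the cache is nearly full, the deep entries are already clean in memory, so handing the set to another task costs few (ideally zero) foreground writes.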

All those registers won't fit on available chip real-estate. Well, given
that all of your normal registers and most of your addressing modes go
away, there should be lots of space. (This is very similar to the RISC
argument that gives you lots of room for registers when you simplify
the instruction set.)

You can't take the address of one of the stack elements (at least not
without a lot of support hardware). Fine. I'm quite willing to have the
compiler allocate some variables in non-stack storage. I'm even willing
to have to explicitly tell the compiler which ones (sort of the inverse of
the way 'register' is treated in C). This may mean having two stacks - one
for "small" variables and one for those whose address is needed. Again,
I say fine. The same holds for variables shared between tasks ('volatile'
in the new C).

There are lots more considerations (e.g. interaction with VM), but this
should give you lots to shoot at. What obvious things am I missing?

		Chris Gray    {...,ihnp4}!alberta!myrias!cg

smb@ulysses.UUCP (Steven Bellovin) (09/11/85)

> First, keep in mind that I'm a compiler writer by trade, so I like the
> idea of a pure stack architecture because it's so easy to generate code
> for (I did it for a machine I designed and emulated).
> 
> I've been told by a couple of people who are normally well informed that
> a pure stack architecture just isn't practical. They have NOT been able
> to convince me of this. Anybody out there want to try?

Well, I'll give it a shot, though it's been a few years since I
seriously studied this.  Note that some of the traditional objections
-- listed in the original article -- no longer apply, because of the
decreasing cost of hardware.

The primary advantage of a stack machine from an architectural point of
view is that it provides good address abbreviation.  (Registers serve a
similar function, even when caches and the like make their speed
advantage minimal.) Given that the memory to CPU bandwidth is a
critical limiting factor on performance, such abbreviation is useful
indeed.  So far, so good.  Unfortunately, with an optimizing compiler
stack machines lose their advantage.  Common subexpressions and the
like require a sufficiently large number of DUPLICATE-TOP, SWAP-TOP-2,
and COPY-ELEMENT-TO-TOP operations that the advantage in terms of total
memory bandwidth is lost.  We thus have a situation where simple
compilers are easier to write, and produce better code;
production-quality compilers are not easier and produce no better
code.
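To make the DUPLICATE-TOP cost concrete, here is a toy stack evaluator in C (opcode names invented). Reusing the common subexpression (a+b) in (a+b)*(a+b) takes an explicit DUP, an extra stack touch that a three-address machine avoids by simply naming the register holding the earlier result:

```c
#include <assert.h>

/* A toy stack evaluator, just enough to show the point: reusing a
   common subexpression on a stack machine costs an explicit
   DUPLICATE-TOP, where a register machine names a register instead.
   Opcode names are invented for illustration. */

enum Op { PUSH, ADD, MUL, DUP, HALT };

typedef struct { enum Op op; int arg; } Insn;

static int run(const Insn *prog)
{
    int stack[16], sp = 0;
    for (;; prog++) {
        switch (prog->op) {
        case PUSH: stack[sp++] = prog->arg; break;
        case ADD:  sp--; stack[sp-1] += stack[sp]; break;
        case MUL:  sp--; stack[sp-1] *= stack[sp]; break;
        case DUP:  stack[sp] = stack[sp-1]; sp++; break;  /* the extra traffic */
        case HALT: return stack[sp-1];
        }
    }
}
```

For a=3, b=4 the stack program is PUSH 3; PUSH 4; ADD; DUP; MUL -- five operations on the stack, against something like `add r3,r1,r2; mul r4,r3,r3` on a register machine.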

There is also a conflict between the subroutine stack and the expression
stack.  Unless you add some truly oddball instructions, you'll need two
separate stacks, which can complicate your operating system storage management.
(Admittedly, this is less of a problem today with 32-bit name spaces
becoming routine; as we start using them up, we'll have to think about the
issue again....)

Finally, although stack machines are useful for arithmetic processing,
they're far less natural for non-numeric work -- string processing,
record-oriented commercial goo, LISP, etc.  Look at, say, the UNIX kernel --
very little of it would benefit from a stack architecture, and much of it
would suffer.

My conclusion:  the right answer, at least for now, is a machine with a good
subroutine stack.  Other issues, notably the complexity of the instruction
set, are open.

		--Steve Bellovin

P.S.  Many of these points, especially those concerning memory bandwidth, are
from Fred Brooks' Computer Architecture course.  I believe he attributed
that particular argument to Gene Amdahl.  Brooks wanted to make the 360 a
stack machine (yes, big bad IBM seriously considered that), but Amdahl
showed that it wouldn't fly.  Some of his arguments turned on the cost of
the necessary hardware at the time; these, of course, are much less
interesting today.

mash@mips.UUCP (John Mashey) (09/13/85)

Steve Bellovin writes, in reply to Chris Gray:
> > I've been told by a couple of people who are normally well informed that
> > a pure stack architecture just isn't practical. They have NOT been able
> > to convince me of this. Anybody out there want to try?
>  
>  (bunch of comments, which seem pretty good ones)
> My conclusion:  the right answer, at least for now, is a machine with a good
> subroutine stack.  Other issues, notably the complexity of the instruction
> set, are open.
> 

In support of Steve's position are the following additional points.
As always, it is VERY hard to analyze architectural features in isolation,
i.e., all generalizations are false; nevertheless:

1) COSTS IN FUNDAMENTAL DATA ACCESS TIMES.  For a given level of technology,
it always seems faster to do 
	add	reg1,reg2,reg3
where this means a) select values of reg1 and (if dual-ported reg file,
at same time), reg2, b) add them c) gate result back into reg3.
Rather than [assuming A is TOS, B is TOS+1]
	add
where this means (as in B5500, for example):
a) make sure both A & B are valid; if not, make 1-2 memory fetches,
and put them into A&B
b) add them, putting result back in B
c) mark A invalid.
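The A/B validity protocol above can be modeled in a few lines of C. This is a deliberately simplified sketch with invented field names, just to show where the extra memory fetches come from:

```c
#include <assert.h>

/* A toy model of the B5500-style ADD described above: the two
   top-of-stack registers A and B carry validity bits, and ADD must
   first refill them from the in-memory stack if they are empty,
   then mark A free.  Details are simplified for illustration. */

typedef struct {
    int  a, b;                  /* top-of-stack registers */
    int  a_valid, b_valid;
    int  *mem;                  /* in-memory stack */
    int  sp;                    /* index one past the top memory word */
    int  fetches;               /* memory traffic caused by refills */
} Machine;

static void stack_add(Machine *m)
{
    /* step a: ensure A and B are valid, refilling from memory if not */
    if (!m->a_valid) { m->a = m->mem[--m->sp]; m->a_valid = 1; m->fetches++; }
    if (!m->b_valid) { m->b = m->mem[--m->sp]; m->b_valid = 1; m->fetches++; }
    /* step b: add, putting the result back in B */
    m->b = m->a + m->b;
    /* step c: mark A invalid */
    m->a_valid = 0;
}
```

Even in this toy, the validity checks sit in front of every arithmetic step, which is exactly the serialization the three-address form avoids.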

Fewer registers = more memory traffic = faster access to registers at
the lowest hardware level.
More registers = less memory traffic = slower access to the registers,
because either a) the registers not only have to act like registers,
but must also act like a giant shift register to keep the TOS at a
definite place, which (at chip level, anyway) gobbles real estate or
b) one needs an index register (related to the stack pointer) which
points to the TOS location within the register array.  This turns out
to be painful for the basic machine cycle, because it requires some
extra decoding time to find the TOS and the TOS+1 - unless there's
great trickery somewhere, I suspect there's an extra adder step required
somewhere, which is real ungood in the basic machine cycle.
You may note that most machines that have multiple register sets allocate
them in sets of powers of 2, so that register selection can occur by
concatenating the register number requested with high-order bits that
indicate which register set is used. Allowing variable-size
register windows is possible, but much harder.
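The power-of-2 selection trick can be shown directly. In this sketch (sizes illustrative) the fixed-window case is pure bit concatenation, with no adder in the critical path, while a variable-base window needs an add before every register access:

```c
#include <assert.h>

/* Power-of-2 register sets: selecting a register within a window is
   bit concatenation -- high bits name the window, low bits the
   register.  A variable-sized or shifting window instead needs
   base + offset, an extra adder step in the basic machine cycle.
   Sizes here are illustrative. */

#define REGS_PER_WINDOW 16              /* must be a power of 2 */
#define WINDOW_SHIFT    4               /* log2(REGS_PER_WINDOW) */

static unsigned select_reg(unsigned window, unsigned reg)
{
    /* no addition: just concatenate the bit fields */
    return (window << WINDOW_SHIFT) | (reg & (REGS_PER_WINDOW - 1));
}

static unsigned select_reg_variable(unsigned base, unsigned reg)
{
    /* variable windows pay for an add on every access */
    return base + reg;
}
```

Both compute the same array index; the difference is that the first is wiring, the second is an ALU pass.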

2) PIPELINING PROBLEMS [this piece I'm less sure of]
At a given level of technology, one way to make things go faster is 
pipelining, or overlapping instruction execution.  Faster machines
tend to use more pipeline stages (not just IFETCH & EXECUTE, for example).
Among other things, this requires complex "bypassing", whereby the
results of one operation may dynamically feed into the next, because
the next has already started well before the first finishes. In
general, this is easiest to do for very simple architectures, i.e.,
like CDC or CRAY machines, which are load/store architectures with little
or no complex side-effects and exciting address modes.  The more complex
the architecture, the more complex becomes the detection and handling of
pipeline hazards; the more complex, the slower.  Recall the number of
oddities that have popped up over the years on machines with heavy use
of side-effects (like auto-increment addressing), especially in the
presence of memory protection errors; stack machines are like those,
but with auto-increment/decrement on almost every instruction!

Although some of the original technology arguments have disappeared,
it is worth noting that:
a) Although many people have been able to cost-reduce architectures over
the years, it seems that the Burroughs architectures have been difficult
to move unchanged to lower price levels, and at upper performance
levels, they've tended to go to multi-processors.  There may, of course,
be other reasons for the latter, and Burroughs has been in MP for a long time,
but it is often the case that you do that when the technology is hard to
make go much faster in uni-processors. Current dyadic IBM CPUs are a similar
example.  [Not a criticism, just an observation;
I always admired the B5500 and its friends for the vision shown therein.]
b) One may conjecture why HP is replacing the (stack machine) HP-3000
with Spectrum (by all accounts, a RISC architecture of the load/store variety)
for more performance.

BOTTOM LINE: stack machines are elegant in some ways, but very hard
to make either really cheap or really fast.  MAYBE current VLSI technology
can overcome some of this, but it's not at all clear.
-- 
-john mashey
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	415-960-1200
USPS: 	MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043

alan@drivax.UUCP (Alan Fargusson) (09/15/85)

Ever hear of Hewlett-Packard? They have been making machines
with pure stack architectures for years. I used an HP3000 when
I was in school and became very knowledgeable about the machine.
The only register that the user had real access to was an index
register. Data addressing was relative to the base of data (DB),
or the stack pointer (S), or the frame pointer (Q). The machine
cost about the same as a PDP-11/70, and performed about the same
as far as I can tell. It had one advantage over the PDP-11 in
that it was a segmented machine that could have more than one
code segment, so you didn't need overlays. It had shared run time
libraries that HP called Segmented Libraries. In fact most of the
operating system was just a shared run time library. Code segments
had two attributes, privileged/non-privileged and callable/non-callable.
The protection of the system was implemented using these features.

I guess I will stop rambling now. I really liked this machine.
Now if HP had just ported UNIX to it.

I think that the HP9000 systems have a similar architecture.
-- 

Alan Fargusson.

{ ihnp4, amdahl, mot }!drivax!alan

richardt@orstcs.UUCP (richardt) (09/16/85)

Stack architectures are beautiful -- to a point.  My major gripes with them
are the following:

1) A very heavy reliance on RPN.  Although routines can be written to get 
around this, and some people actually like RPN, this is still a dubious 
feature at best.  Algebraic and other forms of higher math don't work well
in RPN, and (although I don't know about you) RPN makes logic notation far
more confusing than it needs to be.  Since I use logic functions in a lot of
situations, this is a hassle.  Even the If...Then structure is made more
complex than necessary because of forcing it into RPN.

2) Stack architectures and variables don't mix well.  Although it's not 
impossible, dealing with stack frames to find variables and work with them,
and using stack references for absolute variables is yet another hassle.

3) Although this is somewhat Forth-specific, relying on the Jump rather than
the Goto Subroutine.  Because the stack is being used for parameters, you
can either (1) deal with unpleasant stack frames; (2) never use subroutines,
with the inherent increase in execution time while addresses are being looked
up and the overhead for faking a subroutine call; (3) use subroutines only
for the very low level routines, and then you get stuck with all the 
disadvantages of (1) and (2).

4) It is very difficult to use multiple dynamic data spaces on a stack machine.
Again, it's not impossible, but stack machines are designed against it.

Note:  These opinions are based on writing a Forth for the 68000, where you can
have all the stacks/heaps that you want; writing a Forth for the 6502, which
has an unpleasant and unreliable indirect jump, thereby forcing self-modifying
code (which has its place, but still); writing a Forth for the 6809, which
has two stacks (eliminating (3) above); and 6 years of programming 
experience on most of the microprocessors running around and several larger
machines.  I'm sorry, I'd far prefer a register architecture which works to
either a memory architecture or a stack machine.

				orstcs!richardt
"What shall we do with a stack machine,
	Stack Machine, Stack Machine,
Er-ly in the morning?"   <-- I consider Stack machines to be on about the level
					of a drunken sailor.

peter@graffiti.UUCP (Peter da Silva) (09/22/85)

> 1) A very heavy reliance on RPN.  Although routines can be written to get 
> around this, and some people actually like RPN, this is still a dubious 
> feature at best.  Algebraic and other forms of higher math don't work well
> in RPN, and (although I don't know about you) RPN makes logic notation far
> more confusing than it needs to be.  Since I use logic functions in a lot of
> situations, this is a hassle.  Even the If...Then structure is made more
> complex than necessary because of forcing it into RPN.

You seem to be confusing the language you used with the machine architecture.
That's a job for the compiler writer. Have you ever written one? I have done
a couple and I'd like to say that for sheer ease of code generation from
algebraic expressions nothing beats a stack machine. You can compile to
FORTH in nothing flat. A great way of getting the language design working
before dealing with grubby registers for the real thing. Unfortunately the
project was killed before we could do the real thing, but that was in spite of
being able to show managers the thing working in FORTH.

alan@drivax.UUCP (Alan Fargusson) (09/23/85)

> Stack architectures are beautiful -- to a point.  My major gripes with them
> are the following:
> 
> 1) A very heavy reliance on RPN.  Although routines can be written to get 
> around this, and some people actually like RPN, this is still a dubious 
> feature at best.  Algebraic and other forms of higher math don't work well
> in RPN, and (although I don't know about you) RPN makes logic notation far
> more confusing than it needs to be.  Since I use logic functions in a lot of
> situations, this is a hassle.  Even the If...Then structure is made more
> complex than necessary because of forcing it into RPN.
> 
> 2) Stack architectures and variables don't mix well.  Although it's not 
> impossible, dealing with stack frames to find variables and work with them,
> and using stack references for absolute variables is yet another hassle.
> 
> 3) Although this is somewhat Forth-specific, relying on the Jump rather than
> the Goto Subroutine.  Because the stack is being used for parameters, you
> can either (1) deal with unpleasant stack frames; (2) never use subroutines,
> with the inherent increase in execution time while addresses are being looked
> up and the overhead for faking a subroutine call; (3) use subroutines only
> for the very low level routines, and then you get stuck with all the 
> disadvantages of (1) and (2).

I worked with an HP3000 for several years and didn't find any of these things
to be a problem. Of course I mostly used high level languages, but I did write
some assembler also. Most compilers convert the algebraic notation to something
else, like RPN or FPN, anyway.

> 4) It is very difficult to use multiple dynamic data spaces on a stack machine.
> Again, it's not impossible, but stack machines are designed against it.
> 
> Note:  These opinions are based on writing a Forth for the 68000, where you can
> have all the stacks/heaps that you want; writing a Forth for the 6502, whcih
> has an unpleasant and unreliable indirect jump, thereby forcing self-modifiying
> code (which has it's place, but still); writing a Forth for the 6809, which
> has a two stacks (eliminating (3) above); and 6 years of programming 
> experience on most of the microprocessors running around and several larger
> machines.  I'm sorry, I'd far prefer a register architecture which works to
> either a memory architecture or a stack machine.
> 
> 				orstcs!richardt
> "What shall we do with a stack machine,
> 	Stack Machine, Stack Machine,
> Er-ly in the morning?"   <-- I consider Stack machines to be on about the level
> 					of a drunken sailor.

I don't quite follow your argument. You seem to be saying that since it was
hard to implement forth on a 6502 it must be hard to implement it on a stack
machine. I don't agree with that.
-- 

Alan Fargusson.

{ ihnp4, amdahl, mot }!drivax!alan

blaine@nmtvax.UUCP (10/08/85)

In article <1260> cg@myriasa.UUCP (Chris Gray) writes:
> I've been told by a couple of people who are normally well informed that
> a pure stack architecture just isn't practical. They have NOT been able
> to convince me of this. Anybody out there want to try?

I will not respond directly to this question,  but rather provide some 
information from someone who has worked with both types of systems. 
Burroughs has had a stack oriented product line since around
1964.  These machines fall into the Burroughs designation B5000-B7900 and
A-Series.   The performance of these machines spans the range from a VAX-750
to the largest 308x IBMs.  Contrary to some statements, it has not been very 
difficult to develop high end machines.  The low end machines are not a problem
now (due to small winchesters which can hold the MCP).  The Burroughs machines
have a segmented memory with tagged data.

Smaller machines execute the RPN directly,  while larger machines can 
run into trouble with two problems:
   1.  Having to execute too many instructions/second (it takes more
       operators to do something in pure RPN).
   2.  Having too high a volume of memory reads and writes from the top
       of stack.

These problems are not difficult to get around.  First, operator concatenation
is used on high end machines to fold RPN onto a register machine dynamically!
This is rather like a small peephole optimizer in hardware.  The RPN notation
is very useful in that it does not tie you down to a particular number of 
registers (0 to N are fine).  In theory a tightly coupled configuration of
a small RPN executing machine(such as for handling datacom) and a larger 
concatenating machine could share the same object image, one mapping onto
an x,y top of stack register pair while the other machine goes to GPRs.

The concatenated opcodes are as easy to optimize as any GPR machine.

Top of stack memory accesses can be reduced by a special cache.

The essential thing to remember is that a stack instruction set does not
need to imply a stack processor design.
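As a software analogy of the operator-concatenation idea (the real mechanism is in hardware; the opcode names and structures below are invented), a peephole pass can recognize an RPN sequence and fold it into a single three-address register operation:

```c
#include <assert.h>
#include <string.h>

/* Software sketch of "operator concatenation": a peephole pass
   recognizes RPN sequences like VALC a; VALC b; ADD and folds them
   into one three-address register op, so a stack instruction set
   need not imply a stack execution engine.  Names are invented. */

enum SOp { VALC, SADD, SEND };           /* value call, add, end marker */

typedef struct { enum SOp op; int addr; } SInsn;
typedef struct { char op[8]; int src1, src2, dst; } RInsn;  /* register op */

/* Fold VALC x; VALC y; ADD  into  add r_dst, mem[x], mem[y].
   Returns the number of register instructions emitted. */
static int concatenate(const SInsn *in, RInsn *out)
{
    int n = 0, next_reg = 0;
    while (in->op != SEND) {
        if (in[0].op == VALC && in[1].op == VALC && in[2].op == SADD) {
            out[n] = (RInsn){"add", in[0].addr, in[1].addr, next_reg++};
            n++;
            in += 3;
        } else {
            in++;            /* sketch: other patterns not handled */
        }
    }
    return n;
}
```

A machine that lacks the concatenation hardware can still execute the same RPN stream one operator at a time, which is the object-code compatibility point made above.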

P.S.  Tags are not a performance problem.  They are a great help for
debugging.  Thunk heaven(:->)

Blaine Gaither
-- 
Blaine Gaither                    ucbvax!unmvax!nmtvax!blaine
Computer Science Department       blaine@nmt

peterb@pbear.UUCP (10/10/85)

/* Written  2:55 pm  Oct  8, 1985 by nmtvax!blaine in pbear:net.arch */
/* ---------- "Re: Stack architectures - why not?" ---------- */
In article <1260> cg@myriasa.UUCP (Chris Gray) writes:
>> I've been told by a couple of people who are normally well informed that
>> a pure stack architecture just isn't practical. They have NOT been able
>> to convince me of this. Anybody out there want to try?
>
>
>The essential thing to remember is that a stack instruction set does not
>need to imply a stack processor design.

	True, I wrote a compiler for a simple stack machine based around a
Modula-2 syntax, for the 8088, a chip that has no "general" registers, and I
found it rather easy to implement code that runs just about as fast as its
comparable C code. (Both running under PC/IX.) All I did was to keep the top
two stack operands in registers (TOS in AX, next TOS in CX where possible).
Then map operands into registers backwards using an optimizer to cut down on
redundant MOV instructions.
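A minimal sketch of such a peephole pass in C, assuming invented 8088 instruction spellings: it deletes a push immediately undone by a pop of the same register, and turns a push/pop pair between AX and CX into a register move.

```c
#include <assert.h>
#include <string.h>

/* Sketch of the scheme described above: naive stack-machine output
   for the 8088 goes through a peephole that keeps the top two stack
   slots in AX and CX, deleting push/pop traffic it can prove
   redundant.  Instruction spellings are illustrative. */

/* Rewrite "push ax" followed by "pop cx" as "mov cx,ax", and delete
   "push ax" followed by "pop ax" outright.  Returns new length. */
static int peephole(const char *in[], int n, const char *out[])
{
    int m = 0;
    for (int i = 0; i < n; i++) {
        if (i + 1 < n && strcmp(in[i], "push ax") == 0) {
            if (strcmp(in[i+1], "pop ax") == 0) { i++; continue; }
            if (strcmp(in[i+1], "pop cx") == 0) {
                out[m++] = "mov cx,ax"; i++; continue;
            }
        }
        out[m++] = in[i];
    }
    return m;
}
```

A real pass would re-run until no pattern fires, as the posting describes, since one rewrite can expose another.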

	With this philosophy, much of the drudgery of the 8088 architecture
is removed from the user standpoint, and even the compiler need not know of
it, just the code selection/generation/optimization phases.

	My simple optimizer was a peephole optimizer for register
utilization that ran until it could not optimize further, (or by an option,
stop at a percentage of optimization).

	My next step is to modify the compiler to produce parse trees
that make logical expressions easier to use by implementing the operations
as tree operations rather than stack operations, as has often been done in the past.
Also a register mask set will be propagated throughout the tree making the
allocation of appropriate registers quite straightforward. The pass for
register allocation/utilization can follow most if not all of the code
optimizations based upon code reordering and subexpression collection.
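The register-mask propagation can be sketched as a simple tree walk. The representation is invented, and the intersect-on-the-way-down rule is a deliberate simplification of what a real allocator would do:

```c
#include <assert.h>

/* Sketch of register-mask propagation through an expression tree:
   each node carries a bitmask of registers acceptable for its
   result, and masks are intersected on the way down so that final
   allocation becomes a local choice at each node.  The structure
   and the crude propagation rule are invented for illustration. */

typedef struct Node {
    unsigned mask;              /* bit i set = register i allowed */
    struct Node *left, *right;
} Node;

/* Constrain children to registers their parent may use. */
static void propagate(Node *n, unsigned inherited)
{
    if (!n) return;
    n->mask &= inherited;
    propagate(n->left, n->mask);
    propagate(n->right, n->mask);
}

/* Pick the lowest-numbered register the mask permits. */
static int alloc_reg(unsigned mask)
{
    for (int i = 0; i < 32; i++)
        if (mask & (1u << i)) return i;
    return -1;                  /* empty mask: a spill is needed */
}
```

After the walk, every node's mask already reflects all the constraints above it, which is what makes the per-node choice "quite straightforward."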

	I've been ramblin, but I would like to see if there is interest in a
newsgroup for compiler subjects(probably called net.compiler) that would
cover the more "black magic" topics of register allocation, LR parser error
recovery, Stack machine to non-stack machine mapping, etc...

	Mail me some ideas, and I'll rehash to the net.

Peter Barada
{ihnp4!inmet|{harvard|cca}!ima}!pbear!peterb