cg@myriasa.UUCP (Chris Gray) (09/05/85)
First, keep in mind that I'm a compiler writer by trade, so I like the idea of a pure stack architecture because it's so easy to generate code for (I did it for a machine I designed and emulated). I've been told by a couple of people who are normally well informed that a pure stack architecture just isn't practical. They have NOT been able to convince me of this. Anybody out there want to try?

You say that a stack machine spends far too much time accessing the stack in memory. Easy - have the top, say, 100 values of the stack in on-board registers.

You say the large register set requires too much flushing on a task switch (or redundant stores if you do it in the background). Simple - don't flush it so much; have about 4 of them, with control over the flushing policies that are to be used. One can be reserved for the supervisor, and is only flushed if there is no bus activity at all. Also, when the on-board top-of-stack is close to being full (90%?), the far top end of it can be written out (NOT flushed) during any idle bus cycles. The remaining top-of-stacks are used for user tasks (it may be better to have more of them, but smaller - depends on how often you task switch and on how many tasks you plan on having). They still need not be write-through, but would want the same "write on nearly full" policy. The stack sets for non-active user tasks can be written out whenever convenient, probably in LRU order (the goal would be to have a clean one (all modified values written out) when one is needed for another task).

All those registers won't fit on available chip real estate? Well, given that all of your normal registers and most of your addressing modes go away, there should be lots of space. (This is very similar to the RISC argument that gives you lots of room for registers when you simplify the instruction set.)

You can't take the address of one of the stack elements (at least not without a lot of support hardware). Fine.
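The "write on nearly full" policy above can be sketched in a few (obviously anachronistic) lines of Python. The 100-entry window and the 90% high-water mark come from the article; the class and method names are invented for illustration, and real hardware would do the spilling during idle bus cycles rather than synchronously:

```python
class TOSWindow:
    """On-chip top-of-stack window with a 'write on nearly full' policy:
    when the window passes its high-water mark, the deep end is written
    out to memory instead of flushing the whole set."""

    def __init__(self, capacity=100, high_water=0.9):
        self.threshold = int(capacity * high_water)  # spill above this depth
        self.window = []   # on-chip registers; top of stack at the end
        self.memory = []   # backing store for spilled (deep) values

    def push(self, value):
        self.window.append(value)
        if len(self.window) > self.threshold:
            # spill the far (deep) end, keeping the hot top on chip
            excess = len(self.window) - self.threshold
            self.memory.extend(self.window[:excess])
            del self.window[:excess]

    def pop(self):
        if not self.window:
            # window ran dry: pull the most recently spilled value back
            self.window.append(self.memory.pop())
        return self.window.pop()
```

Pushing 200 values and popping them all back still yields pure LIFO order, while the window itself never holds more than 90 values at once.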
I'm quite willing to have the compiler allocate some variables in non-stack storage. I'm even willing to have to explicitly tell the compiler which ones (sort of the inverse of the way 'register' is treated in C). This may mean having two stacks - one for "small" variables and one for those whose address is needed. Again, I say fine. The same holds for variables shared between tasks ('volatile' in the new C).

There are lots more considerations (e.g. interaction with VM), but this should give you lots to shoot at. What obvious things am I missing?

Chris Gray	{...,ihnp4}!alberta!myrias!cg
smb@ulysses.UUCP (Steven Bellovin) (09/11/85)
> First, keep in mind that I'm a compiler writer by trade, so I like the
> idea of a pure stack architecture because it's so easy to generate code
> for (I did it for a machine I designed and emulated).
>
> I've been told by a couple of people who are normally well informed that
> a pure stack architecture just isn't practical. They have NOT been able
> to convince me of this. Anybody out there want to try?

Well, I'll give it a shot, though it's been a few years since I seriously studied this. Note that some of the traditional objections -- listed in the original article -- no longer apply, because of the decreasing cost of hardware.

The primary advantage of a stack machine from an architectural point of view is that it provides good address abbreviation. (Registers serve a similar function, even when caches and the like make their speed advantage minimal.) Given that memory-to-CPU bandwidth is a critical limiting factor on performance, such abbreviation is useful indeed. So far, so good.

Unfortunately, with an optimizing compiler stack machines lose their advantage. Common subexpressions and the like require a sufficiently large number of DUPLICATE-TOP, SWAP-TOP-2, and COPY-ELEMENT-TO-TOP operations that the advantage in terms of total memory bandwidth is lost. We thus have a situation where simple compilers are easier to write, and produce better code; production-quality compilers are not easier and produce no better code.

There is also a conflict between the subroutine stack and the expression stack. Unless you add some truly oddball instructions, you'll need two separate stacks, which can complicate your operating system storage management. (Admittedly, this is less of a problem today with 32-bit name spaces becoming routine; as we start using them up, we'll have to think about the issue again....)
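A toy interpreter makes the bandwidth argument concrete. This is a sketch, not any real machine; the opcode names echo the DUPLICATE-TOP style operations mentioned above. For (a+b)*(a+b), eliminating the common subexpression on a stack machine costs a DUP; it still beats naive re-evaluation, but the win is smaller than on a three-address machine, where the temporary is simply named twice at no instruction cost:

```python
def run(program, env):
    """Zero-address stack interpreter: instructions are ('PUSH', var),
    ('DUP',), ('ADD',), or ('MUL',)."""
    stack = []
    for op, *args in program:
        if op == "PUSH":
            stack.append(env[args[0]])
        elif op == "DUP":
            stack.append(stack[-1])      # the operation CSE must pay for
        elif op == "ADD":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "MUL":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()

# (a+b)*(a+b), naively re-evaluated: 7 instructions, 4 variable loads
naive = [("PUSH", "a"), ("PUSH", "b"), ("ADD",),
         ("PUSH", "a"), ("PUSH", "b"), ("ADD",), ("MUL",)]

# the same with the common subexpression kept on the stack via DUP:
# 5 instructions, 2 variable loads
cse = [("PUSH", "a"), ("PUSH", "b"), ("ADD",), ("DUP",), ("MUL",)]
```

Both programs compute the same value; the CSE version saves two loads and an add but spends a DUP doing it, whereas a register machine reuses the temporary for free.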
Finally, although stack machines are useful for arithmetic processing, they're far less natural for non-numeric work -- string processing, record-oriented commercial goo, LISP, etc. Look at, say, the UNIX kernel -- very little of it would benefit from a stack architecture, and much of it would suffer.

My conclusion: the right answer, at least for now, is a machine with a good subroutine stack. Other issues, notably the complexity of the instruction set, are open.

		--Steve Bellovin

P.S. Many of these points, especially those concerning memory bandwidth, are from Fred Brooks' Computer Architecture course. I believe he attributed that particular argument to Gene Amdahl. Brooks wanted to make the 360 a stack machine (yes, big bad IBM seriously considered that), but Amdahl showed that it wouldn't fly. Some of his arguments turned on the cost of the necessary hardware at the time; these, of course, are much less interesting today.
mash@mips.UUCP (John Mashey) (09/13/85)
Steve Bellovin writes, in reply to Chris Gray:
> > I've been told by a couple of people who are normally well informed that
> > a pure stack architecture just isn't practical. They have NOT been able
> > to convince me of this. Anybody out there want to try?
>
> (bunch of comments, which seem pretty good ones)
>
> My conclusion: the right answer, at least for now, is a machine with a good
> subroutine stack. Other issues, notably the complexity of the instruction
> set, are open.

In support of Steve's position are the following additional points. As always, it is VERY hard to analyze architectural features in isolation, i.e., all generalizations are false; nevertheless:

1) COSTS IN FUNDAMENTAL DATA ACCESS TIMES. For a given level of technology, it always seems faster to do
	add	reg1,reg2,reg3
where this means:
  a) select values of reg1 and (if dual-ported reg file, at the same time) reg2,
  b) add them,
  c) gate the result back into reg3,
rather than [assuming A is TOS, B is TOS+1]
	add
where this means (as in the B5500, for example):
  a) make sure both A & B are valid; if not, make 1-2 memory fetches, and put them into A & B,
  b) add them, putting the result back in B,
  c) mark A invalid.

Fewer registers = more memory traffic = faster access to registers at the lowest hardware level. More registers = less memory traffic = slower access to the registers, because either
  a) the registers not only have to act like registers, but must also act like a giant shift register to keep the TOS at a definite place, which (at chip level, anyway) gobbles real estate, or
  b) one needs an index register (related to the stack pointer) which points to the TOS location within the register array. This turns out to be painful for the basic machine cycle, because it requires some extra decoding time to find the TOS and the TOS+1 - unless there's great trickery somewhere, I suspect there's an extra adder step required somewhere, which is real ungood in the basic machine cycle.
You may note that most machines that have multiple register sets allocate them in sets of powers of 2, so that register selection can occur by concatenating the register number requested with high-order bits that indicate which register set is used. Allowing variable-size register windows is possible, but much harder.

2) PIPELINING PROBLEMS [this piece I'm less sure of]. At a given level of technology, one way to make things go faster is pipelining, or overlapping instruction execution. Faster machines tend to use more pipeline stages (not just IFETCH & EXECUTE, for example). Among other things, this requires complex "bypassing", whereby the results of one operation may dynamically feed into the next, because the next has already started well before the first finishes. In general, this is easiest to do for very simple architectures, i.e., like CDC or CRAY machines, which are load/store architectures with little or no complex side-effects and exciting address modes. The more complex the architecture, the more complex becomes the detection and handling of pipeline hazards; the more complex, the slower. Recall the number of oddities that have popped up over the years on machines with heavy use of side-effects (like auto-increment addressing), especially in the presence of memory protection errors; stack machines are like those, but with auto-increment/decrement on almost every instruction!

Although some of the original technology arguments have disappeared, it is worth noting that:

a) Although many people have been able to cost-reduce architectures over the years, it seems that the Burroughs architectures have been difficult to move unchanged to lower price levels, and at upper performance levels they've tended to go to multi-processors. There may, of course, be other reasons for the latter, and Burroughs has been in MP for a long time, but it is often the case that you do that when the technology is hard to make go much faster in uni-processors.
Current dyadic IBM CPUs are a similar example. [Not a criticism, just an observation; I always admired the B5500 and its friends for the vision shown therein.]

b) One may conjecture why HP is replacing the (stack machine) HP-3000 with Spectrum (by all accounts, a RISC architecture of the load/store variety) for more performance.

BOTTOM LINE: stack machines are elegant in some ways, but very hard to make either really cheap or really fast. MAYBE current VLSI technology can overcome some of this, but it's not at all clear.
-- 
-john mashey	UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD: 415-960-1200
USPS: MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043
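Mashey's contrast between power-of-two register windows and a circulating top-of-stack pointer comes down to a little arithmetic. This is an illustrative sketch with made-up sizes, not a description of any real machine: the window case is pure bit concatenation (just wires), while the TOS-relative case needs a subtraction -- the "extra adder step" -- before the register file can be indexed:

```python
LOG2_WINDOW = 3   # 8 registers per window (an invented size)

def physical_reg(window, regnum):
    """Power-of-two windows: the physical register number is the window
    id concatenated with the register field -- no arithmetic required."""
    assert 0 <= regnum < (1 << LOG2_WINDOW)
    return (window << LOG2_WINDOW) | regnum

def tos_relative_reg(tos_index, depth, array_size=64):
    """Circulating TOS pointer: selecting TOS-depth needs a subtraction
    (modulo the array size) in the register-select path."""
    return (tos_index - depth) % array_size
</imports>
    ```

Concatenation is free in decode time; the modular subtraction is the step Mashey suspects costs an adder somewhere in the basic cycle.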
alan@drivax.UUCP (Alan Fargusson) (09/15/85)
Ever hear of Hewlett-Packard? They have been making machines with pure stack architectures for years. I used an HP3000 when I was in school and became very knowledgeable about the machine. The only register that the user had real access to was an index register. Data addressing was relative to the base of data (DB), the stack pointer (S), or the frame pointer (Q).

The machine cost about the same as a PDP-11/70, and performed about the same as far as I can tell. It had one advantage over the PDP-11 in that it was a segmented machine that could have more than one code segment, so you didn't need overlays. It had shared run time libraries that HP called Segmented Libraries. In fact most of the operating system was just a shared run time library. Code segments had two attributes, privileged/non-privileged and callable/non-callable. The protection of the system was implemented using these features.

I guess I will stop rambling now. I really liked this machine. Now if HP had just ported UNIX to it. I think that the HP9000 systems have a similar architecture.
-- 
Alan Fargusson.

{ ihnp4, amdahl, mot }!drivax!alan
richardt@orstcs.UUCP (richardt) (09/16/85)
Stack architectures are beautiful -- to a point. My major gripes with them are the following:

1) A very heavy reliance on RPN. Although routines can be written to get around this, and some people actually like RPN, this is still a dubious feature at best. Algebraic and other forms of higher math don't work well in RPN, and (although I don't know about you) RPN makes logic notation far more confusing than it needs to be. Since I use logic functions in a lot of situations, this is a hassle. Even the If...Then structure is made more complex than necessary by forcing it into RPN.

2) Stack architectures and variables don't mix well. Although it's not impossible, dealing with stack frames to find variables and work with them, and using stack references for absolute variables, is yet another hassle.

3) Although this is somewhat Forth-specific, relying on the Jump rather than the Goto Subroutine. Because the stack is being used for parameters, you can either (1) deal with unpleasant stack frames; (2) never use subroutines, with the inherent increase in execution time while addresses are being looked up and the overhead for faking a subroutine call; or (3) use subroutines only for the very low level routines, and then you get stuck with all the disadvantages of (1) and (2).

4) It is very difficult to use multiple dynamic data spaces on a stack machine. Again, it's not impossible, but stack machines are designed against it.

Note: These opinions are based on writing a Forth for the 68000, where you can have all the stacks/heaps that you want; writing a Forth for the 6502, which has an unpleasant and unreliable indirect jump, thereby forcing self-modifying code (which has its place, but still); writing a Forth for the 6809, which has two stacks (eliminating (3) above); and 6 years of programming experience on most of the microprocessors running around and several larger machines.
I'm sorry, I'd far prefer a register architecture which works, to either a memory architecture or a stack machine.

orstcs!richardt
	"What shall we do with a stack machine,
	 Stack Machine, Stack Machine,
	 Er-ly in the morning?" <-- I consider stack machines to be on about the level of a drunken sailor.
peter@graffiti.UUCP (Peter da Silva) (09/22/85)
> 1) A very heavy reliance on RPN. Although routines can be written to get
> around this, and some people actually like RPN, this is still a dubious
> feature at best. Algebraic and other forms of higher math don't work well
> in RPN, and (although I don't know about you) RPN makes logic notation far
> more confusing than it needs to be. Since I use logic functions in a lot of
> situations, this is a hassle. Even the If...Then structure is made more
> complex than necessary because of forcing it into RPN.

You seem to be confusing the language you used and the machine architecture. That's a job for the compiler writer. Have you ever written one? I have done a couple, and I'd like to say that for sheer ease of code generation from algebraic expressions nothing beats a stack machine. You can compile to FORTH in nothing flat. A great way of getting the language design working before dealing with grubby registers for the real thing. Unfortunately the project was killed before we could do the real thing, but that was in spite of being able to show managers the thing working in FORTH.
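The "compile to FORTH in nothing flat" claim is easy to believe once you see that code generation for a stack machine is a bare postorder walk over the expression tree. A minimal sketch (the tuple encoding and names are invented, not taken from any of the compilers mentioned):

```python
def compile_to_stack(node):
    """Flatten an expression tree to stack code by a postorder walk:
    emit the left operand, emit the right operand, emit the operator.
    A tuple (op, left, right) is an operator node; anything else is a
    variable leaf."""
    if isinstance(node, tuple):
        op, left, right = node
        return compile_to_stack(left) + compile_to_stack(right) + [op]
    return [("PUSH", node)]

# (a + b) * c
tree = ("MUL", ("ADD", "a", "b"), "c")
```

No register allocation, no temporaries, no spilling: the recursion order alone produces correct RPN, which is exactly why simple compilers for stack machines are so easy to write.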
alan@drivax.UUCP (Alan Fargusson) (09/23/85)
> Stack architectures are beautiful -- to a point. My major gripes with them
> are the following:
>
> 1) A very heavy reliance on RPN. Although routines can be written to get
> around this, and some people actually like RPN, this is still a dubious
> feature at best. Algebraic and other forms of higher math don't work well
> in RPN, and (although I don't know about you) RPN makes logic notation far
> more confusing than it needs to be. Since I use logic functions in a lot of
> situations, this is a hassle. Even the If...Then structure is made more
> complex than necessary because of forcing it into RPN.
>
> 2) Stack architectures and variables don't mix well. Although it's not
> impossible, dealing with stack frames to find variables and work with them,
> and using stack references for absolute variables is yet another hassle.
>
> 3) Although this is somewhat Forth-specific, relying on the Jump rather than
> the Goto Subroutine. Because the stack is being used for parameters, you
> can either (1) deal with unpleasant stack frames; (2) never use subroutines,
> with the inherent increase in execution time while addresses are being looked
> up and the overhead for faking a subroutine call; (3) use subroutines only
> for the very low level routines, and then you get stuck with all the
> disadvantages of (1) and (2).

I worked with an HP3000 for several years and didn't find any of these things to be a problem. Of course I mostly used high level languages, but I did write some assembler also. Most compilers convert the algebraic notation to something else, like RPN or FPN, anyway.

> 4) It is very difficult to use multiple dynamic data spaces on a stack machine.
> Again, it's not impossible, but stack machines are designed against it.
>
> Note: These opinions are based on writing a Forth for the 68000, where you
> can have all the stacks/heaps that you want; writing a Forth for the 6502,
> which has an unpleasant and unreliable indirect jump, thereby forcing
> self-modifying code (which has its place, but still); writing a Forth for
> the 6809, which has two stacks (eliminating (3) above); and 6 years of
> programming experience on most of the microprocessors running around and
> several larger machines. I'm sorry, I'd far prefer a register architecture
> which works to either a memory architecture or a stack machine.
>
> orstcs!richardt
> "What shall we do with a stack machine,
> Stack Machine, Stack Machine,
> Er-ly in the morning?" <-- I consider stack machines to be on about the
> level of a drunken sailor.

I don't quite follow your argument. You seem to be saying that since it was hard to implement Forth on a 6502 it must be hard to implement it on a stack machine. I don't agree with that.
-- 
Alan Fargusson.

{ ihnp4, amdahl, mot }!drivax!alan
blaine@nmtvax.UUCP (10/08/85)
In article <1260> cg@myriasa.UUCP (Chris Gray) writes:
> I've been told by a couple of people who are normally well informed that
> a pure stack architecture just isn't practical. They have NOT been able
> to convince me of this. Anybody out there want to try?

I will not respond directly to this question, but rather provide some information from someone who has worked with both types of systems. Burroughs has had a stack oriented product line since around 1964. These machines fall into the Burroughs designations B5000-B7900 and A-Series. The performance of these machines spans the range from a VAX-750 to the largest 308x IBMs. Contrary to some statements, it has not been very difficult to develop high end machines. The low end machines are not a problem now (due to small winchesters which can hold the MCP).

The Burroughs machines have a segmented memory with tagged data. Smaller machines execute the RPN directly, while larger machines can run into trouble with two problems:

1. Having to execute too many instructions/second (it takes more operators to do something in pure RPN).
2. Having too high a volume of memory reads and writes from the top of stack.

These problems are not difficult to get around. First, operator concatenation is used on high end machines to fold RPN onto a register machine dynamically! This is rather like a small peephole optimizer in hardware. The RPN notation is very useful in that it does not tie you down to a particular number of registers (0 to N are fine). In theory a tightly coupled configuration of a small RPN-executing machine (such as for handling datacom) and a larger concatenating machine could share the same object image, one mapping onto an x,y top-of-stack register pair while the other machine goes to GPRs. The concatenated opcodes are as easy to optimize as any GPR machine's. Top-of-stack memory accesses can be reduced by a special cache.
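The operator-concatenation idea -- a "small peephole optimizer in hardware" -- can be caricatured in software. This sketch folds a LOAD, LOAD, operator triple of zero-address code into one three-address register operation; the opcode names are invented, and the real Burroughs mechanism is of course hardware, but the folding pattern is the one described above:

```python
def concatenate(rpn):
    """Fold a (LOAD x, LOAD y, op) triple of zero-address code into a
    single three-address register operation. Anything that doesn't match
    passes through unchanged and would execute in plain stack fashion."""
    out, i = [], 0
    while i < len(rpn):
        if (i + 2 < len(rpn)
                and rpn[i][0] == "LOAD"
                and rpn[i + 1][0] == "LOAD"
                and rpn[i + 2][0] in ("ADD", "MUL")):
            # e.g. LOAD a; LOAD b; ADD  ->  ADD_R a,b
            out.append((rpn[i + 2][0] + "_R", rpn[i][1], rpn[i + 1][1]))
            i += 3
        else:
            out.append(rpn[i])
            i += 1
    return out
```

The same object image runs either way: a small machine interprets the RPN one operator at a time, while a big machine folds it into GPR-style operations on the fly.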
The essential thing to remember is that a stack instruction set does not need to imply a stack processor design.

P.S. Tags are not a performance problem. They are a great help for debugging. Thunk heaven (:->)

Blaine Gaither
-- 
Blaine Gaither			ucbvax!unmvax!nmtvax!blaine
Computer Science Department	blaine@nmt
peterb@pbear.UUCP (10/10/85)
/* Written 2:55 pm Oct 8, 1985 by nmtvax!blaine in pbear:net.arch */
/* ---------- "Re: Stack architectures - why not?" ---------- */
In article <1260> cg@myriasa.UUCP (Chris Gray) writes:
>> I've been told by a couple of people who are normally well informed that
>> a pure stack architecture just isn't practical. They have NOT been able
>> to convince me of this. Anybody out there want to try?

> The essential thing to remember is that a stack instruction set does not
> need to imply a stack processor design.

True. I wrote a compiler for a simple stack machine based around a Modula-2 syntax, for the 8088 chip that has no "general" registers, and I found it rather easy to implement code that runs just about as fast as its comparable C code. (Both running under PC/IX.) All I did was to keep the top two stack operands in registers (TOS in AX, next TOS in CX where possible), then map operands into registers backwards, using an optimizer to cut down on redundant MOV instructions. With this philosophy, much of the drudgery of the 8088 architecture is removed from the user standpoint, and even the compiler need not know of it, just the code selection/generation/optimization phases. My simple optimizer was a peephole optimizer for register utilization that ran until it could not optimize further (or, by an option, stop at a percentage of optimization).

My next step is to modify the compiler to produce parse trees that make logical expressions easier to use by implementing the operations as tree operations rather than stack operations as used by much of the past. Also a register mask set will be propagated throughout the tree, making the allocation of appropriate registers quite straightforward. The pass for register allocation/utilization can follow most if not all of the code optimizations based upon code reordering and subexpression collection.
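The redundant-MOV elimination described above is classic peephole territory. A minimal sketch of two such rules (the tuple form and mnemonics are illustrative, not real output from the compiler in question):

```python
def peephole_movs(asm):
    """Two register-utilization rules: drop MOV X,X outright, and drop
    MOV A,B when the previous instruction was MOV B,A (A already holds
    B's value). Instructions are (mnemonic, dst, src) tuples; this is a
    caricature that ignores flags and intervening side effects."""
    out = []
    for ins in asm:
        if ins[0] == "MOV":
            if ins[1] == ins[2]:
                continue                       # MOV X,X: no effect
            if out and out[-1] == ("MOV", ins[2], ins[1]):
                continue                       # value just copied the other way
        out.append(ins)
    return out
```

A real pass of this kind is typically run to a fixed point -- exactly the "ran until it could not optimize further" behavior described above.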
I've been ramblin', but I would like to see if there is interest in a newsgroup for compiler subjects (probably called net.compiler) that would cover the more "black magic" topics of register allocation, LR parser error recovery, stack machine to non-stack machine mapping, etc... Mail me some ideas, and I'll rehash to the net.

Peter Barada
{ihnp4!inmet|{harvard|cca}!ima}!pbear!peterb