simonpj@cs.glasgow.ac.uk (Jones) (06/08/90)
Early implementations of lazy functional languages were often based on various forms of graph reduction, often using combinators. Reduction seemed such a completely different way of executing a program that lots of proposals emerged for special non-von-Neumann computer architectures, with graph reduction somehow implemented in the hardware. The Burroughs NORMA machine was one example. As time went on, we all realised that all such machines were doing was implementing an interpreter in hardware, and that a much better job could be done by applying better compiler technology, though still based on graph reduction. The Chalmers LML compiler was an early example of this sort of compiler.

Good old von Neumann architectures are good targets for this sort of compilation; radical new architectures (with the important exception of dataflow, which has been with us a long time, but which is in a resurgent phase at the moment) are out of fashion. I thought it might be interesting to think about the question "if one were designing a broadly conventional processor to execute lazy functional languages, would there be any features one would put in to support them?". I concluded that there are one or two, and that they might be useful in other contexts too.

This note is an attempt to set these ideas down, to see if they spark off reactions from anyone else who has been thinking on similar lines. It's a little longer than most postings, so I've LaTeX'd it; you can snip it out and print it if you want.

Cheers
Simon

\documentstyle{article}
\begin{document}
\title{A note on processor architecture \\ for lazy graph reduction}
\author{Simon L Peyton Jones \\ Dept of Computing Science, Glasgow University}
\maketitle

\section{Introduction}

This note puts forward a few ideas for ways in which a von Neumann processor could be optimised for lazy graph reduction, an important implementation technique for non-strict functional languages. Most of the ideas would be useful in other contexts.

\section{Lazy graph reduction}

In compiled lazy graph reduction, every value is represented by a {\em closure}. A closure is a heap-allocated object which may or may not be evaluated. If it is unevaluated, it will contain a code pointer and free variables which together give a ``recipe'' for its value. The closure can be evaluated by putting a pointer to it into a standard register {\tt Node}, and jumping to its code pointer. The code can then access the free variables indirectly via {\tt Node}.

In lazy graph reduction, it is very common to have to test whether a closure has been evaluated. If so, execution can proceed normally; if not, the current context (such as registers) will have to be saved, and the closure evaluated. The usual callee-saves technique for this context-saving, whereby the callee saves any registers he disturbs, makes small leaf procedures very cheap, but it does not work here. The closure may tail-call a function, which tail-calls another function, and so on\footnote{A tail call is a call which does not push a return address, because the caller has no further work to do when the call returns. The called function therefore returns directly to the caller's caller. This optimisation is crucial to the efficient implementation of iteration in functional languages.}. Under a callee-saves convention, the registers would be restored before each tail call, and then saved again by the called function. So the caller has to save.

Worse still, there may be more to test than simple evaluated/not-evaluated. For example, in a parallel machine there may be local/non-local information encoded in the closure too. The simplest way of finessing all this is to {\em replace the test with an unconditional jump to the closure's code pointer}. If the closure is already evaluated, its code will return immediately. If it is non-local, its code will fetch it. If it is unevaluated, its code will evaluate it. The trouble is that this means the registers must be saved regardless of whether the closure is evaluated and local (which is typically the case) or not. In practice this means that it isn't possible to make much use of registers, which is a royal pain.
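To make this concrete, here is a sketch in C of the closure layout and the ``always enter'' convention just described. The names ({\tt Closure}, {\tt Node}, {\tt enter}) are invented for illustration, and a real compiler would emit jumps where C forces calls:

\begin{verbatim}
typedef struct Closure Closure;
typedef void (*Code)(void);       /* closure entry code */

struct Closure {
    Code     code;                /* the code pointer            */
    Closure *free_vars[];         /* the free variables (recipe) */
};

Closure *Node;                    /* the standard Node register  */

/* Evaluating a closure is one unconditional indirect jump:
   no evaluated/unevaluated test is made. */
void enter(Closure *c)
{
    Node = c;
    c->code();                    /* a tail call in real code */
}

/* The code of an evaluated closure returns immediately. */
void evaluated_code(void) { }

/* The code of an unevaluated closure runs its recipe, then
   overwrites the closure in place so that later entries
   return immediately. */
void thunk_code(void)
{
    /* ... compute the value from Node->free_vars ... */
    Node->code = evaluated_code;  /* now marked evaluated */
}
\end{verbatim}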
\section{Fast register saves}

Lazy graph reduction thus amplifies the usual dilemma of register saving. Here's the idea:
\begin{itemize}
\item The ``registers'' are just the top few locations on the stack.
\item The top of stack is cached rather carefully in a multiport cache.
\end{itemize}
What's the point of registers? Answers: (a) short addresses, (b) fast access. Relative addressing from the top of stack gives short addresses, whilst the cache provides fast access. Notice that I am not talking about a ``stack machine'' in the usual sense of a zero-address machine. I'm suggesting all the usual 2-address (or even 3-address; take your pick) instructions, with a load-store architecture, but addressing the stack via the cache. The cache can take advantage of
\begin{itemize}
\item not fetching data which is in newly-allocated stack locations (since they contain junk), and
\item not writing back data which is now beyond the stack top.
\end{itemize}
A toy model of this policy is sketched after this section.

Now, calling a procedure which returns immediately doesn't disturb the cache at all. For reasons I'll get to, calling should load a return address into a register, not push it on the stack, so the stack pointer isn't even moved. Even if the procedure does do something, and needs a couple of registers, it can make the space with a single adjust-stack-ptr instruction, then do its stuff and return, having poisoned only a few of the stack cache locations.

Notice what context-switching costs between separate parallel processes have turned into. The context switch itself happens very fast, because no register-saving occurs. But as the new process starts up, it will start to use its stack, and that will cause the old process's cached data to get flushed out onto its external concrete stack. The context-switching cost is still there, but it is smeared out more, giving more chance for the processor to do something useful meanwhile. The memory cycles due to the context switch can occur in idle slots of the new process's execution (if there are any!).

Another nice feature of this scheme is that there is a uniform escape from the register-spill problem. The not-enough-registers problem is no longer a physical limitation on register cells; rather it is to do with how many bits in the instruction word are dedicated to holding register addresses\footnote{Notice that the number of registers addressable by the shortest instruction format need not be the same as the number of slots in the cache! It can be either more or less.}. If you need more bits than this, a prefix instruction can make up the shortfall. This would be a great boon to compiler-writers, who spend an inordinate amount of effort (and compilation time) worrying about register spills!

Of course there should be a set of global registers too, not mapped into the stack, for heap pointers and the like. This is absolutely essential.
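Here is the promised toy software model of the cache policy, in C. It is only a sketch -- the real thing would be a multiport hardware structure, and every name below is invented -- but it shows both optimisations: writes claim a slot without fetching (newly-allocated locations contain junk), and eviction writes back only slots which still lie below the stack top:

\begin{verbatim}
#include <stdint.h>

#define SLOTS 16               /* cache slots, one stack word each */

typedef struct {
    uint32_t addr;             /* stack address held in this slot  */
    uint32_t data;
    int      valid, dirty;
} Slot;

static Slot     cache[SLOTS];
static uint32_t mem[1 << 20];  /* the external concrete stack      */
static uint32_t sp;            /* stack grows upward in this model */

/* Write back only live, dirty data: anything at or beyond
   the stack top is junk and is simply dropped. */
static void evict(Slot *s)
{
    if (s->valid && s->dirty && s->addr < sp)
        mem[s->addr] = s->data;
    s->valid = 0;
}

/* The adjust-stack-ptr instruction: popping (delta < 0)
   silently kills the slots above the new top. */
void stk_adjust(int delta) { sp += delta; }

uint32_t stk_read(uint32_t addr)
{
    Slot *s = &cache[addr % SLOTS];
    if (!s->valid || s->addr != addr) {
        evict(s);
        s->addr  = addr;
        s->data  = mem[addr];  /* a genuine fetch: live data */
        s->valid = 1;
        s->dirty = 0;
    }
    return s->data;
}

void stk_write(uint32_t addr, uint32_t v)
{
    Slot *s = &cache[addr % SLOTS];
    if (!s->valid || s->addr != addr) {
        evict(s);
        s->addr  = addr;       /* claim without fetching: the */
        s->valid = 1;          /* old contents were junk      */
    }
    s->data  = v;
    s->dirty = 1;
}
\end{verbatim}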
\subsection{This is not register windows}

At first this idea looks a bit like register windows, but it isn't at all. Register windows are profligate with memory bandwidth when the stack wraps round: whether or not the registers are live, they have to be uselessly saved and then restored later. No benefit accrues for procedures which use just a few registers (which is often the case in functional programs). And no convenient mechanism is available for the case when there aren't enough registers... there isn't even a stack to spill to!

\subsection{Implementation}

The main question in all of this is whether a suitable multiport cache with fast enough access can be built. The adjust-stack-ptr instruction is potentially tricky, because it changes the names of all the things in the cache. There must be quick ways of doing this, even one as crude as giving them all unary names which are all rotated at once.

\section{Instruction fetching}

Lazy graph reduction in the style I have described above has an amazingly short basic-block size. There are frequent jumps to destinations given indirectly, but {\em very many of them are unconditional}. This suggests an opportunity: the instruction fetch unit can ``execute'' these unconditional jumps without ever bothering the execution unit with them. The remaining conditional jumps are the unavoidable ones written in the original program, rather than artefacts of lazy graph reduction.

For example, it is typical to jump to {\tt [Node]}, meaning a jump to the address held in the memory location whose address is held in register {\tt Node} (this is what is done to evaluate a closure, for example). Provided that {\tt Node} is loaded sufficiently far upstream (normally not a problem), the instruction fetch unit could make the fetch to get the new PC and start fetching instructions from there. Then there wouldn't be any pipeline bubbles at the jump. Similarly, the instruction thus jumped to is often a return instruction (the jump having stored a return address in a suitable return-address register). The instruction fetcher can just resume fetching at the designated return address.

\section{Limit checking}

Another very common operation is to compare two registers and trap if one exceeds the other. This is used to detect heap and stack overflow. It is not acceptable to take the trap at the exact moment the memory reference which causes the overflow occurs, because by then the machine is in a state which makes garbage collection impossible. Rather, an explicit (but fast) overflow-check instruction is inserted at a point where everything is in a tidy state.

Alternatively, and perhaps better, at the point when everything is in a tidy state, an instruction can add an offset to the stack pointer which makes space for all the stuff which is about to be allocated. (The stack locations will then be filled in using indexed accesses rather than push instructions.) This add-to-stack-ptr instruction is the one which should trap if it makes the stack pointer bigger than some limit. Thus the limit checking is implicit, and should be just about free.

\section{Acknowledgements}

Dick Kieburtz has been developing similar ideas at the Oregon Graduate Centre; thanks to him for some discussions we have had on the subject.

\end{document}
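PS: to make the implicit limit check concrete, here is roughly what the compiled code does, written in C rather than assembler. The names (Sp, SpLim, stack_overflow) are invented for illustration; on the proposed hardware the comparison is folded into the add-to-stack-ptr instruction, so it costs nothing.

typedef unsigned long word;

static word  stack[1 << 16];
static word *Sp    = stack;              /* grows upward       */
static word *SpLim = stack + (1 << 16);  /* the overflow limit */

static void stack_overflow(void)
{
    /* trap taken in a tidy state: garbage collect, grow, ... */
}

/* Make space for n words in one go.  The single check here
   replaces a check at every push. */
void grow_stack(int n)
{
    Sp += n;                 /* one add-to-stack-ptr instruction */
    if (Sp > SpLim)          /* implicit in that instruction on  */
        stack_overflow();    /* the proposed hardware            */

    /* The new locations are then filled in with indexed
       stores, e.g.  Sp[-1] = ...;  rather than pushes. */
}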
carlton@mitre.org (David Carlton) (06/16/90)
What about having hardware/OS support for multiple per-thread stacks? It seems to me that it would be easier and faster (or at least not slower) to let somebody else take care of making sure that your stack doesn't run out of space unless you actually run out of memory. It would mean that you wouldn't have to write code which has to deal with the fact that the stack might move at a moment's notice (though, if the garbage collector were written properly, I suppose that there would be no way for the program to tell that the stack had moved), it might reduce fragmentation, and so forth.

Also, what work has been done on putting garbage collection in hardware? I was just reading an (old) MIT AI lab report by Guy L. Steele, Jr. proposing that you get rid of (some of the) math operations in the microprocessor and replace them with garbage-collection operations. And I could well believe that in many circumstances it would be more desirable to have garbage collection done in hardware rather than, e.g., floating point. Of course, this too would cause problems - the cpu would have to be able to identify pointers, for example, and so it would probably only be feasible on architectures with tagged data. But for them it might be a good idea.

What do people think?

david carlton
carlton@linus.mitre.org
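For what it's worth, here is a minimal sketch in C of low-bit tagging, one common way of letting hardware (or a collector) distinguish pointers from immediate data. The scheme and names are purely illustrative - I'm not claiming this is what Steele's report proposes:

#include <stdint.h>

typedef uintptr_t Word;     /* one machine word */

/* Heap objects are word-aligned, so the bottom bit of a
   genuine pointer is always 0; odd words can therefore
   carry small integers ("fixnums"). */
#define TAG_INT 1

Word     box_int(intptr_t n) { return ((Word)n << 1) | TAG_INT; }
intptr_t unbox_int(Word w)   { return (intptr_t)w >> 1; }
int      is_pointer(Word w)  { return (w & 1) == 0; }

A collector (in hardware or software) can then scan memory and chase exactly those words with a clear bottom bit.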
aglew@basagran.csg.uiuc.edu (Andy Glew) (06/16/90)
>It seems to me that it would be easier and faster (or at least not
>slower) to let somebody else take care of making sure that your stack
>doesn't run out of space unless you actually run out of memory. It
>would mean that you wouldn't have to write code which has to deal with
>the fact that the stack might move at a moment's notice (though, if
>the garbage collector were written properly, I suppose that there
>would be no way for the program to tell that the stack had moved),
>might reduce fragmentation, and so forth.
>
>...
>
>david carlton
>carlton@linus.mitre.org

Single-threaded programs don't have to worry about stack overflow. They put the heap at one end of virtual memory and the stack at the other, let them grow together, and take a page-fault trap when they grow past the currently allocated bounds. And you usually run out of physical memory and swap space before you run out of virtual address space.

Why can't we use the same technique for multi-thread stacks? Apart from stupidities like OSes that only let you have three sets of contiguously mapped pages, you can. The growing-toward-each-other trick wouldn't help much, but you could conceivably give each thread's stack a virtual address range larger than the stack will ever require, with only one page mapped in at the beginning, and then let them grow. Page fault on the stack growing past the limit of already-allocated space, and page fault (interpreted differently) if a stack's top address ever gets close to the base of the next stack in memory.

Why isn't this done? Well, let's see - say I have 128 threads. I'm on a 32 bit machine, but say that only 1G of addresses are accessible to the user. And say maybe that 512M of address space is needed for non-stack related applications, like text, bss, the heap, and mapped files (can't have too many mapped files :-( ).

So, we have 512M for 128 stacks. 19-7=12 => 4K per stack. Not very much, is it? (In fact, it's smaller than the page size for high performance machines).

And there you have it. Existing virtual memory mechanisms are sufficient hardware support for multiple stacks, except that (1) there isn't enough address space, and (2) large page sizes imply large overhead. E.g. on a system with 16K pages, do you really want to allocate a page to each thread's stack if most of the memory won't be used (but has to be there for the worst case)?
--
Andy Glew, aglew@uiuc.edu
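To illustrate the scheme on a present-day Unix (a sketch only - the mmap flags are an assumption and vary between systems): reserve a large region per stack with no access rights, commit one page, and let page faults either grow the stack or, when they reach the next stack's territory, signal overflow.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#define STACK_RESERVE (4ul << 20)  /* 4M of address space per stack */

/* Returns the top of a new, downward-growing thread stack. */
void *make_thread_stack(void)
{
    long pagesz = sysconf(_SC_PAGESIZE);

    /* Reserve address space only; nothing is committed, and
       any touch of the region faults. */
    char *base = mmap(NULL, STACK_RESERVE, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED) { perror("mmap"); exit(1); }

    /* Commit just the topmost page.  A fault below it means
       "grow the stack" (a fault handler, not shown, maps in
       more pages) -- unless it lands near the region's base,
       next to the following stack: that fault means overflow. */
    if (mprotect(base + STACK_RESERVE - pagesz, pagesz,
                 PROT_READ | PROT_WRITE) != 0) {
        perror("mprotect"); exit(1);
    }
    return base + STACK_RESERVE;
}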
jonah@db.toronto.edu (Jeff Lee) (06/16/90)
aglew@basagran.csg.uiuc.edu (Andy Glew) writes:
>Why isn't this done? Well, let's see - say I have 128 threads. I'm
>on a 32 bit machine, but say that only 1G of addresses are accessible
>to the user. And say maybe that 512M of address space is needed for
>non-stack related applications, like text, bss, the heap, and mapped
>files (can't have too many mapped files :-( ).
>
> So, we have 512M for 128 stacks. 19-7=12 => 4K per stack. Not
>very much, is it? (In fact, it's smaller than the page size for high
>performance machines).

Sorry Andy,

1) 512M == 2**29, hence you could have 128 * 4MB stacks, not 4KB stacks -- or 128K * 4KB stacks.

2) 4KB is a fairly reasonable page size. There was a time when 4KB was considered a *large* page size; of course opinions have changed. However, 4KB is the smallest page size that completely maps 4GB (32-bit addresses) in two levels of page map tables with 32-bit PMT entries. I believe that Sun adopted 8KB pages so they could cover 16MB (the physical address range of the 68000) in a single level PMT. Then of course they were stuck with it. (Now, rumour has it that the Sparcstation hits the wall when given more than 16MB of memory. ;-)

j.
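Spelling the arithmetic out (a throwaway C fragment, just to check both points):

#include <stdio.h>

int main(void)
{
    /* 512M of stack space split among 128 threads:
       2**29 / 2**7 = 2**22 = 4MB per stack.        */
    unsigned long per_stack = (1ul << 29) / 128;
    printf("%lu MB per stack\n", per_stack >> 20);      /* prints 4 */

    /* Two-level mapping of a 32-bit space with 4KB pages:
       12 offset bits leave 20, split 10 + 10, and each
       table is 2**10 entries * 4 bytes = 4KB -- exactly
       one page at each level.                           */
    printf("%lu bytes per table\n", (1ul << 10) * 4ul);  /* 4096 */
    return 0;
}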
david@eng.sun.com (2. Is it hard on you when you fail?) (06/16/90)
In article <90Jun15.211837edt.2773@ois.db.toronto.edu> jonah@db.toronto.edu (Jeff Lee) writes:
>2) 4KB is a fairly reasonable page size. There was a time when 4KB was
>considered a *large* page size; of course opinions have changed.
>However, 4KB is the smallest page size that completely maps 4GB (32-bit
>addresses) in two levels of page map tables with 32-bit PMT entries. I
>believe that Sun adopted 8KB pages so they could cover 16MB (the
>physical address range of the 68000) in a single level PMT. Then of
>course they were stuck with it.

The Sun 4/20, 4/60, and 4/65 (aka SLC, SS1, SS1+) have a 4KB page size. The SPARC Reference MMU spec also calls for 4KB pages.
--
David DiGiacomo, Sun Microsystems, Mt. View, CA  david@eng.sun.com