[comp.arch] Processor architecture to support functional languages

simonpj@cs.glasgow.ac.uk (Jones) (06/08/90)

Early implementations of lazy functional languages were often based on
various forms of graph reduction, often using combinators.  Reduction
seemed such a completely different way of executing a program that lots
of proposals emerged for special non-von-Neumann computer architectures,
with graph reduction somehow implemented in the hardware.  The Burroughs
NORMA machine was one example.

As time went on, we all realised that all such machines were doing was
implementing an interpreter in hardware, and that a much better job could
be done by applying better compiler technology, though still based on graph
reduction.  The Chalmers LML compiler was an early example of this sort of 
compiler.  Good old von Neumann architectures are good targets for this
sort of compilation; radical new architectures (with the important exception
of dataflow, which has been with us a long time, but which is in a 
resurgent phase at the moment) are out of fashion.

I thought it might be interesting to think about the question "if one
were designing a broadly conventional processor to execute lazy
functional languages, would there be any features one would put in to
support them?".  I concluded that there are one or two, and that they
might be useful in other contexts too.  This note is an attempt to set
these ideas down, to see if they spark off reactions from anyone else
who has been thinking on similar lines.

It's a little longer than most postings, so I've Latex'd it so you can
snip it out and print it if you want.

Cheers

Simon

\documentstyle{article}
\begin{document}

\title{A note on processor architecture \\
for lazy graph reduction}

\author{ Simon L Peyton Jones \\
Dept of Computing Science, Glasgow University }

\maketitle

\section{Introduction}

This note puts forward a few ideas for ways in which
a von Neumann processor could be optimised for 
lazy graph reduction, an important implementation technique
for non-strict functional languages.

Most of the ideas would be useful in other contexts.

\section{Lazy graph reduction}

In compiled lazy graph reduction, every value is represented by a {\em
closure}.  A closure is a heap-allocated object which may or may not
be evaluated.  If it is unevaluated, it will contain a code pointer
and free variables which together give a ``recipe'' for its value.
The closure can be evaluated by putting a pointer to it into a
standard register {\tt Node}, and jumping to its code pointer.  The
code can now access the free variables indirectly via {\tt Node}.

In lazy graph reduction, it is very common to have to test to see
whether a closure has been evaluated.  If so, execution can proceed
normally; if not, the current context (such as registers) will have to
be saved, and the closure evaluated.

The usual callee-saves technique for this context-saving, whereby the
callee saves any registers he disturbs, makes small leaf procedures
very cheap, but it does not work here.  The closure may tail-call a
function, which tail-calls another function and so on
\footnote{A tail call is a call which does not push a return address,
because no further work remains to be done after the return.  The
called function will return direct to the caller's caller.  This
optimisation is crucial to the efficient implementation of iteration
in functional languages.}.  Under a callee-saves convention, the
registers would be restored before each tail call, and then saved
again by the called function.  So the caller has to save.

Worse still, there may be more to test than simple
evaluated/not-evaluated.  For example, in a parallel machine there may
be local/non-local information encoded in the closure too.  The
simplest way of finessing all this is to {\em replace the test with an
unconditional jump to the closure's code pointer}.  If the closure is
already evaluated, its code will return immediately.  If it is
non-local, its code will fetch it.  If it is unevaluated, its code
will evaluate it.

The trouble is that this means the registers must be saved regardless
of whether the closure is evaluated and local (which is typically the
case) or not.  In practice this means that it isn't possible to make
much use of registers, which is a royal pain.

\section{Fast register saves}

Lazy graph reduction thus amplifies the usual dilemma of register saving.
So here's the idea:
\begin{itemize}
\item 
The ``registers'' are just the top few locations on the stack.
\item
The top of stack is cached rather carefully in a multiport cache.
\end{itemize}

What's the point of registers?  Answers: (a) short addresses (b) fast
access.  Relative addressing from top of stack gives short addresses,
whilst the cache provides fast access.

Notice that I am not talking about a ``stack machine'' in the usual
sense of a zero-address machine.  I'm suggesting all the usual
2-address (or even 3-address; take your pick) instructions, with a
load-store architecture, but addressing the stack via the cache.

The cache can take advantage of 
\begin{itemize}
\item
not fetching data for newly-allocated stack locations (since they
contain junk),
\item
and not writing
back data which is now beyond the stack top.
\end{itemize}

Now, calling a procedure which returns immediately doesn't disturb the
cache at all.  For reasons I'll get to, calling should load a return
address into a register, not push it on the stack, so the stack
pointer isn't even moved.  Even if the procedure does do something,
and needs a couple of registers, it can make the space with a single
adjust-stack-ptr instruction, then do its stuff and return, only
having poisoned a few of the stack cache locations.

Notice what context-switching costs between separate parallel processes
have turned into.  The context-switch itself happens very fast, because
no register-saving occurs.  But as the new process starts up, it will
start to use its stack, and that will cause the old process's cached
data to get flushed out onto its external concrete stack.  The 
context switching cost is still there, but it is smeared out more, giving
more chance for the processor to do something useful meanwhile.  The
memory cycles due to the context switch can occur in idle slots of the
new process's execution (if there are any!).

Another nice feature of this scheme is that there is a uniform escape
from the register-spill problem.  The not-enough-registers problem is
no longer a physical limitation of register cells; rather it is to do
with how many bits in the instruction word are dedicated to holding
register addresses
\footnote{Notice that the number of registers addressable by the
shortest instruction format need not be the same as the number of
slots in the cache!  It can be either more or less.}.  If you need
more bits than this, a prefix instruction can make up the shortfall.
This would be a great boon to compiler-writers who spend an inordinate
amount of effort (and compilation time) worrying about register
spills!

Of course there should be a set of global registers too, not mapped into
the stack, for heap pointers and the like.
This is absolutely essential.

\subsection{This is not register windows}

At first this idea looks a bit like register windows, but it isn't at
all.  Register windows are profligate on memory bandwidth when the
stack wraps round.  Whether or not the registers are live, they have to
be uselessly saved and then restored later.  No benefits accrue for
procedures which just use a few registers (which is often the case in
functional programs).

No convenient mechanism is available for the case when there aren't
enough registers... there isn't even a stack to spill to!

\subsection{Implementation}

The main question in all of this is whether a suitable multi-port
cache with fast enough access can be built.  The adjust-stack-ptr
instruction is potentially tricky, because it changes the names of all
the things in the cache.  There must be quick ways of doing this, even
as crude as giving them all unary names which are all rotated at once.

\section{Instruction fetching}

Lazy graph reduction in the style I have described above has an
amazingly short basic-block size.  There are frequent jumps to
destinations given indirectly, but {\em very many of them are unconditional}.
This suggests an opportunity: the instruction fetch unit can
``execute'' these unconditional jumps without ever bothering the
execution unit with them.  The remaining conditional jumps are the 
unavoidable ones written in the original program, rather than artefacts of
lazy graph reduction.

For example, it is typical to jump to {\tt [Node]}, meaning to jump to the 
address held in the memory location whose address is held in register
{\tt Node} (this is what is done to evaluate a closure, for example).
Provided that {\tt Node} is loaded sufficiently far upstream (normally not
a problem), the instruction fetch unit could make the fetch to get the 
new PC and start fetching instructions from there.
Then there wouldn't be any pipeline bubbles at the jump.

Similarly, the instruction thus jumped to is often a return instruction
(the jump having stored a return address in a suitable return-address
register).    The instruction fetcher can just resume fetching
at the designated return address.

\section{Limit checking}

Another very common operation is to compare two registers and trap if
one exceeds the other.  This is used to detect heap and stack
overflow.  It is not acceptable to trap at the exact moment the
memory reference which causes the overflow occurs, because by then the
machine is in a state which makes garbage collection impossible.
Rather, an explicit (but fast) overflow check instruction is inserted
at a point where everything is in a tidy state.

Alternatively, and perhaps better, at the point when everything is
in a tidy state, an instruction can add an offset to the stack pointer
which makes space for all the stuff which is to be allocated.  (The stack
locations will then be filled in using indexed accesses rather than push
instructions.)  This add-to-stack-ptr instruction is the one which should
trap if it makes the stack pointer bigger than some limit.  Thus the
limit checking is implicit, and should be just about free.

\section{Acknowledgements}

Dick Kieburtz has been developing similar ideas at the Oregon Graduate
Centre; thanks to him for some discussions we have had on the subject.

\end{document}

carlton@mitre.org (David Carlton) (06/16/90)

What about having hardware/OS support for multiple stacks per thread?
It seems to me that it would be easier and faster (or at least not
slower) to let somebody else take care of making sure that your stack
doesn't run out of space unless you actually run out of memory.  It
would mean that you wouldn't have to write code which has to deal with
the fact that the stack might move at a moment's notice (though, if
the garbage collector were written properly, I suppose that there
would be no way for the program to tell that the stack had moved),
might reduce fragmentation, and so forth.

  Also, what work has been done on putting garbage collection in
hardware?  I was just reading an (old) MIT AI lab report by Guy L.
Steele, Jr. proposing that you get rid of (some of the) math
operations in the microprocessor and replace them by garbage-collection
operations.  And I could well believe that in many circumstances it
would be more desirable to have garbage collection done in hardware
rather than, e.g., floating point.  Of course, this too would cause
problems - the cpu would have to be able to identify pointers, for
example, and so it would probably only be feasible on architectures
with tagged data.  But for them it might be a good idea.  What do
people think?

david carlton
carlton@linus.mitre.org

aglew@basagran.csg.uiuc.edu (Andy Glew) (06/16/90)

>It seems to me that it would be easier and faster (or at least not
>slower) to let somebody else take care of making sure that your stack
>doesn't run out of space unless you actually run out of memory.  It
>would mean that you wouldn't have to write code which has to deal with
>the fact that the stack might move at a moment's notice (though, if
>the garbage collector were written properly, I suppose that there
>would be no way for the program to tell that the stack had moved),
>might reduce fragmentation, and so forth.
>
>...
>
>david carlton
>carlton@linus.mitre.org

Single threaded programs don't have to worry about stack overflow.
They put the heap at one end of virtual memory, and the stack at the
other, let them grow together, and take a page fault trap when they
grow past the currently allocated bounds.  And you usually run out of
physical memory and swap space before you run out of virtual address
space.

Why can't we use the same technique for multi-thread stacks?  Apart
from stupidities like OSes that only let you have three sets of
contiguously mapped pages, you can.  The growing-toward-each-other
trick wouldn't help much, but you could conceivably give each thread's
stack a virtual address range larger than the stack will ever require,
with only one page mapped in at the beginning, and then let them grow.
Page fault on the stack growing past the limit of already allocated
space, and page fault (interpreted differently) if a stack's top
address ever gets close to the base of the next stack in memory.

Why isn't this done?  Well, let's see - say I have 128 threads.  I'm
on a 32 bit machine, but say that only 1G of addresses are accessible
to the user. And say maybe that 512M of address space is needed for
non-stack related applications, like text, bss, the heap, and mapped
files (can't have too many mapped files :-( ).
    So, we have 512M for 128 stacks. 19-7=12 => 4K per stack.  Not
very much, is it? (In fact, it's smaller than the page size for high
performance machines).

And there you have it.  Existing virtual memory mechanisms are sufficient
hardware support for multiple stacks, except (1) there isn't enough
address space, and (2) large page sizes imply large overhead.  E.g. on a system
with 16K pages, do you really want to allocate a page to each thread's stack
if most of the memory won't be used (but has to be there for the worst case)?
--
Andy Glew, aglew@uiuc.edu

jonah@db.toronto.edu (Jeff Lee) (06/16/90)

aglew@basagran.csg.uiuc.edu (Andy Glew) writes:
>Why isn't this done?  Well, let's see - say I have 128 threads.  I'm
>on a 32 bit machine, but say that only 1G of addresses are accessible
>to the user. And say maybe that 512M of address space is needed for
>non-stack related applications, like text, bss, the heap, and mapped
>files (can't have too many mapped files :-( ).
>    So, we have 512M for 128 stacks. 19-7=12 => 4K per stack.  Not
>very much, is it? (In fact, it's smaller than the page size for high
>performance machines).

Sorry Andy,

1) 512M==2**29, hence you could have 128*4MB stacks, not 4KB stacks --
or 128K*4KB stacks.

2) 4KB is a fairly reasonable page size.  There was a time when 4KB was
considered a *large* page size; of course opinions have changed.
However, 4KB is the smallest page size that completely maps 4GB (32-bit
addresses) in two levels of page map tables with 32-bit PMT entries.  I
believe that Sun adopted 8KB pages so they could cover 16MB (the
physical address range of the 68000) in a single level PMT.  Then of
course they were stuck with it.  (Now, rumour has it that the
Sparcstation hits the wall when given more than 16MB of memory.  ;-)

j.

david@eng.sun.com (2. Is it hard on you when you fail?) (06/16/90)

In article <90Jun15.211837edt.2773@ois.db.toronto.edu>
jonah@db.toronto.edu (Jeff Lee) writes:
>2) 4KB is a fairly reasonable page size.  There was a time when 4KB was
>considered a *large* page size; of course opinions have changed.
>However, 4KB is the smallest page size that completely maps 4GB (32-bit
>addresses) in two levels of page map tables with 32-bit PMT entries.  I
>believe that Sun adopted 8KB pages so they could cover 16MB (the
>physical address range of the 68000) in a single level PMT.  Then of
>course they were stuck with it.

The Sun 4/20, 4/60, and 4/65 (aka SLC, SS1, SS1+) have a 4KB page size.
The SPARC Reference MMU spec also calls for 4KB pages.

-- 
David DiGiacomo, Sun Microsystems, Mt. View, CA  david@eng.sun.com