[comp.lang.forth] Forth on RISC processors

celarier@reed.bitnet (Stuart Celarier) (10/27/90)

	I would like to know if anyone has any experience, references or
comments regarding implementing Forth on a RISC processor.

Some of the relevant issues that I need to address (to my boss) are:

	1.  The computational model of Forth is a stack machine; RISC
processors, however, are not stack machines -- they use a hefty number
of registers instead.  There is no stack support: no push and pop
instructions, no stack pointer, and the registers do not form any sort
of stack themselves (as, e.g., the registers on the Intel 80x87 do).
	On such a processor, it would be an unacceptable waste of
resources to implement a stack in memory from scratch -- er, not to
mention slow.  Similarly, implementing a stack on a subset of the
registers would involve an unacceptable amount of register-to-register
data movement, since the registers participating in an operation
are hardcoded into that operation.

	2.  If the stack machine model is abandoned for Forth on a RISC
machine, replaced with _something_ else, then the Forth word interface
-- you know, those wonderful stack transition diagrams that show you what
a definition does to the data -- must change as well.  Would this then
be recognizable as Forth to anyone else?  Would it be Forth?

	3.  Assuming that Forth can be implemented on a RISC processor
without using a stack, can existing Forth code be ported to such a system?
See rhetorical questions above.

	4.  Forth gets its speed and to some extent its modularity from
working at a very low level, with very small pieces.  This means a rather
high frequency of branches (jumps, calls, returns).  On a RISC processor,
there is lots of hardware devoted to pipelining the execution of
instructions, as well as detecting branches far enough in advance of
their execution so that the instructions are already cached -- giving the
"zero wait branch".  Will coding little tiny pieces of code mean that
typically the processor will be performing much less than optimal, due to
constantly depleting the pipeline, branching more frequently than the
instruction processor can keep up with, not being able to simultaneously
execute fixed and floating point instructions, etc.?
	I guess this can be summarized with:  are RISC processors designed
for larger granularity code than Forth is?  (Note the dangling participle.)

	5.  Any other comments, such as pitfalls, success stories, technical
journals, abandoned projects, contacts to talk with, etc. are GREATLY
appreciated.

Stuart Celarier			Voice:  503/236-9490
celarier@reed.edu		  Fax:  503/236-9491

koopman@a.gp.cs.cmu.edu (Philip Koopman) (10/28/90)

In article <15603@reed.UUCP>, celarier@reed.bitnet (Stuart Celarier) writes:
>... comments regarding implementing Forth on a RISC processor.
>         1.  The computational model of Forth is a stack machine; however
> RISC processors are not stack machines, ...
 
As with anything on a RISC, your performance will be determined by
the quality of your compiler.  You will need at the very least 
a peephole optimizer that can do things like change "7 +"
from "push(7) +" to "add r1,7".  In general the idea is to 
change a sequence of in-line Forth primitives into a series 
of register-based operations, thereby eliminating much of the
stack manipulation overhead.  Such compilers do exist, but they
are currently the exception rather than the rule (perhaps someone
out there can name one -- JForth??).
 
> ... since the registers participating in an operation
> are hardcoded in that operation.
 
This is one of the reasons it is so attractive to do subroutine
inlining on RISCs -- for better register allocation.  Such inlining
is not needed on stack hardware.  So, you will find high-quality
RISC Forth compilers doing in-lining of short operations.  You will
probably also find they can't use as many registers for Forth as
for C.
 
> ..        I guess this can be summarized with:  are RISC processors designed
> for larger granularity code than Forth is?  (Note the dangling participle.)
 
Forth/stack-based processors tend to be more "agile" than RISCs.
Specifically, they don't have branch delay slots and they do
support subroutine calls as a quick primitive operation.
A high-quality RISC Forth compiler will try to arrange to fill
branch delay slots for you.
 
>         5.  Any other comments, such as pitfalls, success stories, technical
> journals, abandoned projects, contacts to talk with, etc. are GREATLY
> appreciated.
 
Mitch Bradley at Sun does SPARC-based Forth, but I believe he
works at a higher level (Mitch -- how does the optimizing C
compiler do at translating stack->registers?)
Tim Olson at AMD posted here that he had done some low-level work
with Forth on an AMD 29000.  (AMD 29000 has a register stack).
 
On the other hand, you could just use a stack-based chip (such
as the Harris RTX 2000, etc) if you have an embedded application.
 
  Phil Koopman                koopman@greyhound.ece.cmu.edu   Arpanet
  2525A Wexford Run Rd.
  Wexford, PA  15090
Senior scientist at Harris Semiconductor, and adjunct professor at CMU.
I don't speak for them, and they don't speak for me.

DAVID@PENNDRLS.BITNET (10/29/90)

You might want to check out a book previously mentioned by someone
else (I forget who) on this conference:

  Interpretation and Instruction Path Coprocessing
  Debaere & van Campenhout, MIT Press 1990.
  ISBN 0-262-04107-3

The first half of this book gives an in-depth analysis of the
interpretation process, with special attention to speed/size tradeoffs
and optimization techniques.  RISC processors are discussed, with some
specific notes on the principles an optimizing compiler would need to
follow to produce fast code.  Most of the issues mentioned by Mr.
Koopman in his posting are covered by this book.

-- R. David Murray    (DAVID@PENNDRLS.BITNET, DAVID@PENNDRLS.UPENN.EDU)

wmb@MITCH.ENG.SUN.COM (10/31/90)

I have been using Forth on SPARC as long as SPARC has existed; my Forth
implementation was the first program to run on real SPARC hardware (not
counting a few "do nothing" diagnostic loops).  More than half of all the
SPARC machines that Sun has shipped have a live Forth interpreter in the
boot PROM, and all new Sun machines will have Forth in the boot PROM.

In the following message, I will compare the implementation of traditional
threaded Forth interpreters on RISC microprocessors and on conventional
microprocessors.  I am explicitly not addressing the issue of "Forth machines",
nor am I addressing the compilation of Forth into native machine code.

> RISC processors are not stack machines, instead they use a hefty number
> of registers.  There is no stack support, such as push and pop
> instructions, no stack pointer, the registers do not form any sort of
> stack themselves (such as, e.g., the registers on the Intel 80x87 do).
>       On such a processor, it would be an unacceptable waste of
> resources to implement a stack in memory from scratch -- er, not to
> mention slow.

The lack of "stack support" instructions turns out not to be a big issue.
It takes 2 SPARC instructions to do a "pop", but those 2 instructions
execute in 1 cycle each, for a total of 2 cycles.  You can do a pop on
a 68020 with 1 instruction, but that 1 instruction probably takes 4-6
cycles.

The 2-instruction SPARC "pop" sequence occupies 8 bytes of memory, compared
to 2 bytes on the 680x0.  Code words are not all "pushes" and "pops", and
the net expansion of code word size is a bit less than 2:1.

In a traditional threaded code Forth implementation, only the code words
would experience this code expansion. Colon definitions, which make up the
bulk of most systems, do not get much bigger (they get a little bigger
on average, due to alignment restrictions, but not enough to worry about).

My "loaded" 680x0 and SPARC Forth implementations compare thusly:

        68020:  95K
        SPARC: 100K  2.5x faster than 68020

(The speed comparison will of course depend on the implementation.  The
2.5x number was based on the first-ever SPARC machine (Sun 4/260), compared
to Sun's then top-of-the-line 68020 machine.  The 2 machines had identical
MMU's, identical second-level memory systems, and very similar caches.)

This is very much an "apples to apples" comparison.  This is basically
the exact same Forth system running on 2 different machines.  Nearly
all the high-level code is 100% identical.  The kernel implementation
tradeoffs were the same for both machines: full 32-bit addressing, maximize
the speed within the constraint of using threaded code.  (For me, the
debugging convenience of threaded code outweighs the speed penalty.)


>       4.  Forth gets its speed and to some extent its modularity from
> working at a very low level, with very small pieces.  This means a rather
> high frequency of branches (jumps, calls, returns).  On a RISC processor,
> there is lots of hardware devoted to pipelining the execution of
> instructions, as well as detecting branches far enough in advance of
> their execution so that the instructions are already cached -- giving the
> "zero wait branch".  Will coding little tiny pieces of code mean that
> typically the processor will be performing much less than optimal, due to
> constantly depleting the pipeline, branching more frequently than the
> instruction processor can keep up with, not being able to simultaneously
> execute fixed and floating point instructions, etc.?

Many of the "branch delay slots" in my SPARC Forth kernel are filled, so
the pipeline doesn't spend too much of its time stalled on account of
branches.  On most SPARC chip implementations, memory loads and stores
take 2 and sometimes 3 cycles, but the extra cycles can overlap with the
following instructions if there are no data dependencies.  In a lot of
cases, I managed to arrange the code to eliminate load dependencies, thus
minimizing pipeline stalls on account of memory accesses.


>       I guess this can be summarized with:  are RISC processors designed
> for larger granularity code than Forth is?

Yes, but branching isn't the real issue.  The real issue is that the
register-transfer architecture tends to favor compiler register-usage
optimizations that Forth's stack model has a difficult time exploiting.

The important thing is that RISC architecture is NOT worse than conventional
architectures for implementing threaded Forth, nor is it much better.
Overall, Forth on RISC tends to run faster, not because the architecture
is better suited to Forth, but because the RISC instruction cycles are faster.

SPARC's hardware register window support, designed to support procedural
languages like C, neither helps Forth nor harms it.


>       5.  Any other comments, such as pitfalls, success stories, technical
> journals, abandoned projects, contacts to talk with, etc. are GREATLY
> appreciated.

Here is an amusing effect:  Under certain (baroque but real) constraints,
Forth runs FASTER than C on a SPARC.  The constraints: the code is located
in a relatively slow 8-bit-wide PROM (9 wait states/byte x 4 bytes/instruction).
RAM is at a premium, so it is not desirable to copy the PROM into RAM and
execute from there.  I copy the Forth kernel code into 2 pages of RAM and
turn on the cache for those pages.  The threaded code is stored as shifted
16-bit offsets from a base register, so you only have to fetch 2 PROM bytes
to follow a thread.  Under these conditions, it takes fewer cycles to follow
a Forth thread (i.e. execute NEXT) than to execute a single machine instruction
from PROM.

Note: the preceding lengthy tome on my SPARC Forth implementation does not
apply to the Forth in the Sun SPARC boot PROMs; I tuned that implementation
for small size instead of speed.

Mitch Bradley, wmb@Eng.Sun.COM