[comp.lang.forth] FORTH, RISC, and other new architectures

ken@key.COM (Ken Kofman) (08/02/89)

I have been a FORTH fan ever since I learned the language about
a year and a half ago.  I have also been following some of the
new cpu architectures, such as mipsco, AMD29000, transputer, etc,
and have been wondering about the extreme performance hit that
FORTH must take on these cpus compared to languages like C.

It would be extremely difficult for FORTH to take full advantage of
the large register files the RISC chips provide, unless a FORTH
compiler is provided, or the programmer resorts to assembler :(.
Even on the transputer, which, at first glance, seems particularly
well suited to FORTH, there must be a good deal of time wasted 
maintaining the stack- -the transputer's stack is only three deep.

Intrinsic to the current RISC philosophy is the idea that memory
accesses are BAD, and should be avoided where possible.  RISC chips
therefore provide a large register set, and usually a memory cache.
A FORTH program will be making many memory accesses.  The cache will
sometimes help, but will often prove ineffectual.  One would want the
top of the stacks to be cached, and usually they will be.  However, 
these caches are usually direct mapped, sometimes are 2 way.  There is
a reasonable chance that other data will blow the stack tops away at any
time during program execution.

I think I'm beginning to ramble.  Basically, it seems that the current
crop of cpu architectures, with multiple functional units with varying
data latencies, and large register sets do not work well with FORTH.
register sets do not work well with FORTH.  Am I missing something?
Has anyone out there used FORTH on new and/or interesting architectures?

Very curious,

ken



-- 
-------------------------------------------------------------------
Disclaimer:  Don't blame me, Amdahl MADE me say it.
Ken Kofman
ken@samaria.key.com				...!pacbell!key!ken

tim@cayman.amd.com (Tim Olson) (08/02/89)

In article <975@key.COM> ken@samaria.key.COM (Ken Kofman) writes:
| It would be extremely difficult for FORTH to take full advantage of
| the large register files the RISC chips provide, unless a FORTH
| compiler is provided, or the programmer resorts to assembler :(.
| Even on the transputer, which, at first glance, seems particularly
| well suited to FORTH, there must be a good deal of time wasted 
| maintaining the stack- -the transputer's stack is only three deep.

Yes, large standard register files are not very effective in speeding up
FORTH.  However, the Am29000 register file is different.  There are 128
local registers, which are offset from the register stack pointer
(global register 1).  Thus, the local register file can act like a stack
cache, holding the top (or all!) of the operand stack.

FORTH primatives (compiled in-line) then look like:

	+			DUP			SWAP
add lr1, lr1, lr0	add lr127, lr0, 0	add tmp, lr0, 0
add gr1, gr1, 4		sub gr1, gr1, 4		add lr0, lr1, 0
						add lr1, tmp, 0

with lr0 pointing to the top of stack, lr1 the next of stack, etc.  A
"smart" FORTH compiler would be able to remove the extraneous
stack-adjustment instructions and change the local variable references
accordingly.

The FORTH notion of separate operand and control stacks can also be
implemented by splitting the local register file into two 64-register
chunks, and keeping two shadow stack pointers, swapping the correct
pointer into the real stack pointer when required.

This raises a couple of questions:

First, are 64 words of control and operand stack enough?  How deeply
nested to FORTH words typically get?

Second, does a typical FORTH program (coded with in-line primatives and
call-threaded) spend most of its time executing FORTH primatives or
executing the thread of control for user-written words? 

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

jax@well.UUCP (Jack J. Woehr) (08/03/89)

In article <26573@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
	... original posting deleted, but yes the transputer is not
a very effective Forth engine ...
>
>Yes, large standard register files are not very effective in speeding up
>FORTH.  However, the Am29000 register file is different.  There are 128
>local registers, which are offset from the register stack pointer
>(global register 1).  Thus, the local register file can act like a stack
>cache, holding the top (or all!) of the operand stack.
>
>FORTH primatives (compiled in-line) then look like:
>
>	+			DUP			SWAP
>add lr1, lr1, lr0	add lr127, lr0, 0	add tmp, lr0, 0
>add gr1, gr1, 4		sub gr1, gr1, 4		add lr0, lr1, 0
>						add lr1, tmp, 0
>
>with lr0 pointing to the top of stack, lr1 the next of stack, etc.  A
>"smart" FORTH compiler would be able to remove the extraneous
>stack-adjustment instructions and change the local variable references
>accordingly.
>
	VERY Interesting!!! Just got hold of the 29000 User's Manual
today, thanx to AMD's eager and friendly marketing division. Is this
to be taken to mean that you have a Forth up and running on some 29000-based
boards or machines? More info, please!

>The FORTH notion of separate operand and control stacks can also be
>implemented by splitting the local register file into two 64-register
>chunks, and keeping two shadow stack pointers, swapping the correct
>pointer into the real stack pointer when required.
>
>This raises a couple of questions:
>
>First, are 64 words of control and operand stack enough?  How deeply
>nested to FORTH words typically get?

	Well, one opinion on that is expressed in the Forth 83-Standard,
q.v. (downloadable from RCFB, phonenum below). But answer is qualified
yes.

>Second, does a typical FORTH program (coded with in-line primatives and
>call-threaded) spend most of its time executing FORTH primatives or
>executing the thread of control for user-written words? 
>

	Depends on the programmer, the system, etc., but the answer would
tend towards the latter.

	Not all Forths are threaded the same. The new Forth Stack/RISC
chips like the SC32 are VERY efficient in this respect.

	At the opposite extreme, an indirect-threaded Forth can spend
1/3 of its life executing the threading routine NEXT.

	There is all sorts of middle ground in this. Tell us more about
what you are doing with the 29000!

{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}
{}                                                                        {}
{} jax@well     ." Sysop, Realtime Control and Forth Board"      FIG      {}
{} jax@chariot  ." (303) 278-0364 3/12/2400 8-n-1 24 hrs."     Chapter    {}
{} JAX on GEnie       ." Tell them JAX sent you!"             Coordinator {}
{}                                                                        {}
{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}

tim@cayman.amd.com (Tim Olson) (08/04/89)

In article <12989@well.UUCP> jax@well.UUCP (Jack J. Woehr) writes:
| 	VERY Interesting!!! Just got hold of the 29000 User's Manual
| today, thanx to AMD's eager and friendly marketing division. Is this
| to be taken to mean that you have a Forth up and running on some 29000-based
| boards or machines? More info, please!

No, just in the "thinking about the problem" stage, right now.

| >Second, does a typical FORTH program (coded with in-line primatives and
| >call-threaded) spend most of its time executing FORTH primatives or
| >executing the thread of control for user-written words? 
| >
| 
| 	Depends on the programmer, the system, etc., but the answer would
| tend towards the latter.

That is what I suspected.  It would then seem to be adventageous to
optimize for fast function call and return (threading) at the expense of
primative speed.


	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

wmb@SUN.COM (Mitch Bradley) (08/04/89)

Speed isn't everything.  It's fun to play the "my system is faster
than your system" game, but to a large extent, for the stuff I do
most often, even a garden variety 68000 is plenty fast.  (On the other
hand, a garden variety 8088 is not fast enough for anything I
can think of :-)

On a 10 or 20 MIPs CPU chip, I will gladly spend a lot of those
blazing CPU cycles in order to make my life as a programmer easier.
In particular, I like threaded code a lot, because it is easy to
decompile and easy to debug.  In so many applications, it's plenty
fast, and when it's not, writing a very few code words gets you
within epsilon of the best speed you could hope to achieve anyway.

Before you get out your flame throwers, I realize that there are
some applications that demand the ultimate in speed.  There are also
a lot of applications that do not.

Now, back to something closer to the original question of how Forth
can or cannot use large register files on RISC chips.  SPARC chips
generally have 7 or 8 windows of 16 32-bit registers each.  These
registers are quite effective as a "cache" for C stack frames, but
they are pretty much useless for Forth.  My SPARC Forth implementation
just sits in one of the registers and uses the registers in that
window as general registers.  There is an interpreter pointer, a data
stack pointer, a return stack pointer, a user area pointer,
a code base pointer, and a top of stack register.  The rest of the
registers are just scratch registers.  Pretty basic, obvious, boring
design.  It could be faster, but it still spend most of its
cycles waiting on the disk, or the Ethernet, or the UART, or me.

One interesting side effect of this non-use of the window register
mechanism is that Forth becomes an excellent debugger for C programs
like the Unix kernel.  Since Forth never triggers overlow/underflow
of this stack frame "cache", it is innocuous with respect to the
code that it is trying to debug.

If I had it to do over, I wouldn't have cached the top of stack in
a register.  The 5 or 10% speedup that results in not worth the
trouble it causes in trying to explain to people how to write code words.

By the way, the SPARC Forth kernel was the very first high level language
program ever to run on real SPARC hardware.  When the first Sun 4/260
prototypes came in, a bug in the memory logic prevented C programs
from being able to work.  Forth ran just fine, and was used to debug
the machine to the point of being able to run C.  (The SPARC Forth kernel
had been previously debugged using an instruction set simulator).

Mitch

koopman@a.gp.cs.cmu.edu (Philip Koopman) (08/05/89)

In article <26573@amdcad.AMD.COM>, tim@cayman.amd.com (Tim Olson) writes:
> First, are 64 words of control and operand stack enough?  How deeply
> nested to FORTH words typically get?

My data showed that 32 elements are typically enough.  Many Forth
programs don't really need more than 16 elements.  Having hardware
spill detection logic is very handy, but it's enough just to generate
an interrupt when a spill is about to happen (hardware spill handling
logic is nice to have, but not essential).

> Second, does a typical FORTH program (coded with in-line primatives and
> call-threaded) spend most of its time executing FORTH primatives or
> executing the thread of control for user-written words? 

My measurements for dynamic instruction mix are:
  CALL: 12.2%
  RETURN: 11.7%   (There are obscure reasons why this isn't 12.2%)
in terms of frequency of execution compared to other Forth primitives.
The definition of a primitive for this study was "anything in assembly
language in a highly optimized Forth compiler, including 32-bit math
functions".  So, I think these represent a lower bound for less optimized 
Forth systems.

These numbers provide an obvious source of motiviation for designing
Forth chips that often get subroutine returns for free by combining
them with other operations, and can even get subroutine calls for
free, or at least perform calls in a single clock cycle.

  Phil Koopman                koopman@greyhound.ece.cmu.edu   Arpanet
  2525A Wexford Run Rd.
  Wexford, PA  15090
Senior Scientist at Harris Semiconductor.
I don't speak for them, and they don't speak for me.