gjphw@iham1.UUCP (10/11/84)
An article appeared in a recent issue of BYTE presenting a case for changing the way FORTH handles its threading.  Ordinarily, FORTH's resident words are written in assembler and the user-defined words are threaded using these resident words and other user-defined words.  At the end of any particular word, a return pointer to the preceding word is stored.  The article said that this technique was used to keep FORTH small.  A faster way to handle returns from any particular word would be to use the return procedure built into the microprocessor.  Rather than going through the bother of storing the pointers, the author argued that treating every word as a subroutine would provide faster execution.  He then gave some comparisons between the usual pointer returns and the subroutine/macro routines that he advocated.

I have the opportunity to obtain the assembly source code for the FORTH that has been implemented by Texas Instruments on their TI 99/4A.  If there is indeed a significant performance improvement to be realized, a nice project for me might involve modifying the returns according to this author's proposal.

My question concerns the value of the article's argument.  Is there a significant performance improvement to be had by treating FORTH words as subroutines?  Has anyone in netland considered this possibility?  Are there presently some implementations of FORTH that already treat their words as subroutines (and use the subroutine return native to the micro)?

Replies will be read if either posted to this newsgroup or sent to me directly.  Thank you.

--
Patrick Wyant
AT&T Bell Laboratories (Naperville, IL)
*!iham1!gjphw
wmb@sun.uucp (Mitch Bradley) (10/12/84)
> ...
> A faster way to handle returns from any particular word would be to use the
> return procedure built into the microprocessor.  Rather than going through
> the bother of storing the pointers, the author argued that treating every
> word as a subroutine would provide faster execution.  He then gave some
> comparisons between the usual pointer returns and the subroutine/macro
> routines that he advocated.
> ...
> Is there a significant performance improvement to be had by treating FORTH
> words as subroutines?  Has anyone in netland considered this possibility?
> Are there presently some implementations of FORTH that already treat their
> words as subroutines (and use the subroutine return native to the micro)?

This opens up a whole can of worms.  There is no simple answer, so here's the complicated one.

There are at least 5 different ways of "threading" that I know of.  Every one of these schemes has been used before.

1) Indirect Threaded Code (ITC) - compile a pointer to a pointer to the code to execute.  This is the common FIG-Forth way.  It is compact and makes it easy to implement DOES> words.

2) Direct Threaded Code (DTC) - compile a pointer to the code to execute.  This is frequently faster than ITC, but DOES> words are not quite as easy as with ITC.

3) Token Threaded Code (TTC) - compile an index into a jump table of pointers to the code to execute.  This has the potential for being very compact, and is also very easy to dynamically relocate.  Usually a little slower than either ITC or DTC, but not by much.

4) JSR Threaded Code (JTC) - compile a JSR instruction to the code to execute.  May be somewhat faster than 1, 2, or 3, but see the discussion below.

5) Native Code (NC) - attempt to compile machine code sequences for Forth words.  A good compromise is to compile the sequences for short CODE words in-line, and to use JSR threading for references to longer words.  This is the way the Princeton VAX Forth works.  It is the fastest scheme of all, but depending on the processor, it may take up a lot of space.  In the case of the VAX, there is hardly any space penalty at all!  The 68000 has only a moderate space penalty; for most 8-bit micros (with the possible exception of the 6809), the space penalty is severe.

The only way to decide which scheme is faster for a particular processor is to write down the code to implement the inner interpreter for each scheme, then count cycles.  You MUST include the DOCOLON and EXIT routines as part of the inner interpreter in order to make a fair comparison.  Ditto for any JMP instructions used to get to NEXT, if applicable.  It is usually okay to neglect to time the run-time code for VARIABLE, CONSTANT, and DOES>, unless that code is horribly long.

Here are dynamic word frequencies I have measured for an ITC Forth (the program was the outer interpreter compiling Forth code):

    Relative frequencies of execution:
        CODE words:                          90%
        COLON definitions:                    8%
        Other (VARIABLE, CONSTANT, DOES>):    2%

What this means is that saving one cycle on the execution time of CODE words is worth about 10 cycles saved on COLON definitions.

It turns out that for the 68000, DTC is faster than JTC.  The reason is that DTC NEXT for the 68000 does

        MOVE  (IP)+,A0
        JMP   (A0)

exactly once per CODE word, whereas JTC does

        JSR   <whatever>
        ...
        RTS

which involves pushing and popping a long word on the stack.

The bottom line is that you have to do a lot of work to get the right answer for a particular processor.
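To make the DTC-versus-JTC comparison above concrete, here is a rough model written in C rather than 68000 assembler.  Everything in it (the names hello, greet_body, dtc_run, and the explicit software return stack) is invented purely for illustration and is not taken from any actual Forth kernel; a real implementation keeps IP and the return stack in registers, and the word bodies are machine code.  The point is only the shape of the two dispatch mechanisms: the threaded version pays a two-step NEXT per word, while the subroutine version pays a hardware push and pop of the return address per call.

    /* dtc_vs_jtc.c - toy model of the two dispatch styles (illustrative only) */

    #include <stdio.h>

    typedef void (*code_t)(void);

    /* Direct threaded model: a colon word's body is a list of code pointers.
     * NEXT fetches the next cell through IP and calls it; EXIT pops the saved
     * IP from a software return stack.
     */
    static code_t *ip;              /* interpreter pointer (IP)                */
    static code_t *rstack[32];      /* software return stack                   */
    static int     rsp = 0;

    static void hello(void) { printf("hello "); }  /* stand-ins for CODE words */
    static void world(void) { printf("world\n"); }

    static void dtc_exit(void) { ip = rstack[--rsp]; }   /* Forth's EXIT       */

    static code_t greet_body[] = { hello, world, dtc_exit };

    static void dtc_run(code_t *body)
    {
        static code_t done[] = { NULL };
        rstack[rsp++] = done;       /* sentinel: stop when EXIT pops this      */
        ip = body;
        while (*ip != NULL) {
            code_t w = *ip++;       /* these two lines are NEXT:               */
            w();                    /* fetch through IP, then jump indirect    */
        }
    }

    /* JSR threaded model: a colon word is an ordinary subroutine, and the
     * hardware call/return mechanism replaces NEXT and EXIT entirely.
     */
    static void greet_jtc(void)
    {
        hello();                    /* compiled as JSR hello                   */
        world();                    /* compiled as JSR world                   */
    }                               /* compiled as RTS                         */

    int main(void)
    {
        dtc_run(greet_body);        /* "hello world" via the threaded NEXT     */
        greet_jtc();                /* "hello world" via plain calls           */
        return 0;
    }

In this model, the NEXT pair corresponds to the MOVE (IP)+,A0 / JMP (A0) sequence quoted earlier, while each JSR/RTS pair pushes and pops a 32-bit return address on the 68000; that difference, weighted by the roughly 90% share of CODE-word executions, is what tips the balance toward DTC there.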
Another thing to consider is the ease of implementing desirable utilities like a decompiler.  Some threading schemes are easier to decompile than others.  I personally find a decompiler to be a very valuable tool.  Other people get along quite well by having convenient access to the source code.  Since I frequently use Forth for debugging peripheral hardware, in which case I don't have any mass storage on line, I really use the decompiler a lot.

Good Luck,

Mitch Bradley
carr@utah-cs.UUCP (Harold Carr) (10/16/84)
Is there an implementation of FORTH for 8080/Z80s or the VAX that uses the Token Threaded Code mentioned by Mitch Bradley?  From what I read, this sounds equivalent to Lisp's function cell - is that correct?  In Lisp, the ability to dynamically redefine functions gives a lot of power, especially for debugging.

Harold Carr
Arpa: CARR@UTAH-20
uucp: harpo!utah-cs!carr
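In case a concrete picture helps, here is a rough sketch in C of why token threading does resemble Lisp's function cell.  All of the names below (table, ruler, star_v1, and so on) are invented for illustration and do not come from any real Forth.  Compiled definitions hold only small indices, and every execution goes through a single table of code pointers, so storing a new routine into one table slot redefines that word for every caller that was compiled earlier, much as storing a new function into a Lisp symbol's function cell does.

    /* token_cell.c - illustrative token-threaded dispatch (invented names)  */

    #include <stdio.h>

    typedef void (*code_t)(void);

    static void star_v1(void) { printf("*"); }
    static void star_v2(void) { printf("#"); }   /* a "redefined" STAR        */
    static void cr_word(void) { printf("\n"); }

    /* The jump table: a token is just an index into this array.             */
    enum { TOK_STAR, TOK_CR, TOK_EXIT, TOK_COUNT };
    static code_t table[TOK_COUNT] = { star_v1, cr_word, NULL };

    /* A colon definition compiles to a list of tokens, not addresses.       */
    static const int ruler[] = { TOK_STAR, TOK_STAR, TOK_STAR, TOK_CR, TOK_EXIT };

    static void run(const int *ip)
    {
        while (table[*ip] != NULL)    /* token-threaded NEXT: take the token, */
            table[*ip++]();           /* index the table, call what it holds  */
    }

    int main(void)
    {
        run(ruler);                   /* prints "***"                         */
        table[TOK_STAR] = star_v2;    /* patch the word's "function cell"     */
        run(ruler);                   /* the same compiled tokens print "###" */
        return 0;
    }

Address-threaded schemes (ITC and DTC) compile the old address directly into existing definitions, so a redefinition there only affects words compiled afterwards; the extra indirection through the token table is what buys the Lisp-like behavior, at the cost of the slightly slower NEXT that Mitch mentioned.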