gjphw@iham1.UUCP (10/11/84)
An article appeared in a recent issue of BYTE presenting a case for changing the way FORTH handles its threading.  Ordinarily, FORTH's resident words are written in assembler and the user-defined words are threaded using these resident words and other user-defined words.  At the end of any particular word, a return pointer to the preceding word is stored.  The article said that this technique was used to keep FORTH small.  A faster way to handle returns from any particular word would be to use the return procedure built into the microprocessor.  Rather than going through the bother of storing the pointers, the author argued that treating every word as a subroutine would provide faster execution.  He then gave some comparisons between the usual pointer returns and the subroutine/macro routines that he advocated.

I have the opportunity to obtain the assembly source code for the FORTH that has been implemented by Texas Instruments on their TI 99/4A.  If there is indeed a significant performance improvement to be realized, a nice project for me might involve modifying the returns according to this author's proposal.

My question concerns the value of the article's argument.  Is there a significant performance improvement to be had by treating FORTH words as subroutines?  Has anyone in netland considered this possibility?  Are there presently some implementations of FORTH that already treat their words as subroutines (and use the subroutine return native to the micro)?

Replies will be read if either posted to this newsgroup or sent to me directly.  Thank you.

--
Patrick Wyant
AT&T Bell Laboratories (Naperville, IL)
*!iham1!gjphw
wmb@sun.uucp (Mitch Bradley) (10/12/84)
> ...
> A faster way to handle returns from any particular word would be to use the
> return procedure built into the microprocessor.  Rather than going through
> the bother of storing the pointers, the author argued that treating every
> word as a subroutine would provide faster execution.  He then gave some
> comparisons between the usual pointer returns and the subroutine/macro
> routines that he advocated.
> ...
> Is there a significant performance improvement to be had by treating FORTH
> words as subroutines?  Has anyone in netland considered this possibility?
> Are there presently some implementations of FORTH that already treat their
> words as subroutines (and use the subroutine return native to the micro)?

This opens up a whole can of worms.  There is no simple answer, so here's the complicated one.

There are at least 5 different ways of "threading" that I know of.  Every one of these schemes has been used before.

1) Indirect Threaded Code (ITC) - compile a pointer to a pointer to the code to execute.  This is the common FIG-Forth way.  It is compact and makes it easy to implement DOES> words.

2) Direct Threaded Code (DTC) - compile a pointer to the code to execute.  This is frequently faster than ITC, but DOES> words are not quite as easy as with ITC.

3) Token Threaded Code (TTC) - compile an index into a jump table of pointers to the code to execute.  This has the potential for being very compact, and is also very easy to dynamically relocate.  Usually a little slower than either ITC or DTC, but not by much.

4) JSR Threaded Code (JTC) - compile a JSR instruction to the code to execute.  May be somewhat faster than 1, 2, or 3, but see the discussion below.

5) Native Code (NC) - attempt to compile machine code sequences for Forth words.  A good compromise is to compile the sequences for short CODE words in-line, and to use JSR threading for references to longer words.  This is the way the Princeton VAX Forth works.  It is the fastest scheme of all, but depending on the processor, it may take up a lot of space.  In the case of the VAX, there is hardly any space penalty at all!  The 68000 has only a moderate space penalty; for most 8-bit micros (with the possible exception of the 6809), the space penalty is severe.

The only way to decide which scheme is faster for a particular processor is to write down the code to implement the inner interpreter for each scheme, then count cycles.  You MUST include the DOCOLON and EXIT routines as part of the inner interpreter in order to make a fair comparison.  Ditto for any JMP instructions used to get to NEXT, if applicable.  It is usually okay to neglect to time the run-time code for VARIABLE, CONSTANT, and DOES>, unless that code is horribly long.

Here are dynamic word frequencies I have measured for an ITC Forth (the program was the outer interpreter compiling Forth code):

    Relative frequencies of execution:
        CODE words:                          90%
        COLON definitions:                    8%
        Other (VARIABLE, CONSTANT, DOES>):    2%

What this means is that saving one cycle on the execution time of CODE words is worth about 10 cycles saved on COLON definitions.

It turns out that for the 68000, DTC is faster than JTC.  The reason is that DTC NEXT for the 68000 does

        MOVE  (IP)+,A0
        JMP   (A0)

exactly once per CODE word, whereas JTC does

        JSR   <whatever>
        ...
        RTS

which involves pushing and popping a long word on the stack.

The bottom line is that you have to do a lot of work to get the right answer for a particular processor.
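To make the DTC-versus-JTC comparison above concrete, here is a rough model written in C rather than 68000 assembler.  Everything in it (the names hello, greet_body, dtc_run, and the explicit software return stack) is invented purely for illustration and is not taken from any actual Forth kernel; a real implementation keeps IP and the return stack in registers, and the word bodies are machine code.  The point is only the shape of the two dispatch mechanisms: the threaded version pays a two-step NEXT per word, while the subroutine version pays a hardware push and pop of the return address per call.

    /* dtc_vs_jtc.c - toy model of the two dispatch styles (illustrative only) */

    #include <stdio.h>

    typedef void (*code_t)(void);

    /* Direct threaded model: a colon word's body is a list of code pointers.
     * NEXT fetches the next cell through IP and calls it; EXIT pops the saved
     * IP from a software return stack.
     */
    static code_t *ip;              /* interpreter pointer (IP)                */
    static code_t *rstack[32];      /* software return stack                   */
    static int     rsp = 0;

    static void hello(void) { printf("hello "); }  /* stand-ins for CODE words */
    static void world(void) { printf("world\n"); }

    static void dtc_exit(void) { ip = rstack[--rsp]; }   /* Forth's EXIT       */

    static code_t greet_body[] = { hello, world, dtc_exit };

    static void dtc_run(code_t *body)
    {
        static code_t done[] = { NULL };
        rstack[rsp++] = done;       /* sentinel: stop when EXIT pops this      */
        ip = body;
        while (*ip != NULL) {
            code_t w = *ip++;       /* these two lines are NEXT:               */
            w();                    /* fetch through IP, then jump indirect    */
        }
    }

    /* JSR threaded model: a colon word is an ordinary subroutine, and the
     * hardware call/return mechanism replaces NEXT and EXIT entirely.
     */
    static void greet_jtc(void)
    {
        hello();                    /* compiled as JSR hello                   */
        world();                    /* compiled as JSR world                   */
    }                               /* compiled as RTS                         */

    int main(void)
    {
        dtc_run(greet_body);        /* "hello world" via the threaded NEXT     */
        greet_jtc();                /* "hello world" via plain calls           */
        return 0;
    }

In this model, the NEXT pair corresponds to the MOVE (IP)+,A0 / JMP (A0) sequence quoted earlier, while each JSR/RTS pair pushes and pops a 32-bit return address on the 68000; that difference, weighted by the roughly 90% share of CODE-word executions, is what tips the balance toward DTC there.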
Another thing to consider is the ease of implementing desirable utilities like a decompiler.  Some threading schemes are easier to decompile than others.  I personally find a decompiler to be a very valuable tool.  Other people get along quite well by having convenient access to the source code.  Since I frequently use Forth for debugging peripheral hardware, in which case I don't have any mass storage on line, I really use the decompiler a lot.

Good Luck,

Mitch Bradley
carr@utah-cs.UUCP (Harold Carr) (10/16/84)
Is there an implementation of FORTH for 8080/Z80s or the VAX that uses the Token Threaded Code mentioned by Mitch Bradley?  From what I read, this sounds equivalent to Lisp's function cell - is that correct?  In Lisp, the ability to dynamically redefine functions gives a lot of power, especially for debugging.

Harold Carr
Arpa: CARR@UTAH-20
uucp: harpo!utah-cs!carr
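In case a concrete picture helps, here is a rough sketch in C of why token threading does resemble Lisp's function cell.  All of the names below (table, ruler, star_v1, and so on) are invented for illustration and do not come from any real Forth.  Compiled definitions hold only small indices, and every execution goes through a single table of code pointers, so storing a new routine into one table slot redefines that word for every caller that was compiled earlier, much as storing a new function into a Lisp symbol's function cell does.

    /* token_cell.c - illustrative token-threaded dispatch (invented names)  */

    #include <stdio.h>

    typedef void (*code_t)(void);

    static void star_v1(void) { printf("*"); }
    static void star_v2(void) { printf("#"); }   /* a "redefined" STAR        */
    static void cr_word(void) { printf("\n"); }

    /* The jump table: a token is just an index into this array.             */
    enum { TOK_STAR, TOK_CR, TOK_EXIT, TOK_COUNT };
    static code_t table[TOK_COUNT] = { star_v1, cr_word, NULL };

    /* A colon definition compiles to a list of tokens, not addresses.       */
    static const int ruler[] = { TOK_STAR, TOK_STAR, TOK_STAR, TOK_CR, TOK_EXIT };

    static void run(const int *ip)
    {
        while (table[*ip] != NULL)    /* token-threaded NEXT: take the token, */
            table[*ip++]();           /* index the table, call what it holds  */
    }

    int main(void)
    {
        run(ruler);                   /* prints "***"                         */
        table[TOK_STAR] = star_v2;    /* patch the word's "function cell"     */
        run(ruler);                   /* the same compiled tokens print "###" */
        return 0;
    }

Address-threaded schemes (ITC and DTC) compile the old address directly into existing definitions, so a redefinition there only affects words compiled afterwards; the extra indirection through the token table is what buys the Lisp-like behavior, at the cost of the slightly slower NEXT that Mitch mentioned.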