bpendlet@esunix.UUCP (Bob Pendleton) (11/07/86)
<> A few weeks ago there was some discussion of the execution speed of FORTH on a 68000. Out going news was hung up at our site so I wasn't able to comment at the time the discussion was going on. First a little background. FORTH is usually "compiled" to what is called threaded code. Threaded code is a sequence of pointers to the objects referenced in the original source code. A FORTH compiler spends most of its time looking up words in the simple table an writing their addresses into memory. A very small number of words are compiled by executing them and letting them do anything they want. This may seem strange to people who are used to complex compilers, but it works, its fast, and its very flexible. Threaded code is interpreted by a small loop that gets the next pointer from memory and then calls the code at that address. The code that is called my actually do something, an add or a fetch, or it may just call the interpreter with the address of another sequence of pointers. The only real problem with threaded code is that it is interpreted. And because it is interpreted, it is slow. I know that statement will get me some heat from FORTH fans the world over, but interpreted threaded code is slow compared to machine code. You're just stuck with the fact that the interpreter must execute 2 or 3 instructions for each pointer it follows. The obvious solution to this is to compile to "call" or "JSR" threaded code. Don't just put the address of object in memory, put a hardcoded subroutine call to the object in memory. This approach gets rid of the interpretive overhead and can make things run a lot faster. This works well on multiple stack machines like the 68000, but doesn't work as well as you might expect for single stack machines like the z80. You can also get into trouble on systems that require code to be relocatable. One of the reasons for using threaded code is that it is supposed to be more compact than equivalent machine code. But, on a machine like the 68000 where a pointer is 4 bytes and a JSR with a long word address is 6 bytes this argument breaks down. Assuming register A6 is the data stack pointer a JSR threaded FORTH system the word + ( add top to items on the stack ) is ( most likely to be ) compiled as JSR add$word Where add$word is something like add$word: MOVE.L (A6)+,D0 ADD.L (A6)+,D0 MOVE.L D0,-(A6) RTS Except for the RTS instruction the code used to implement + is the same length as the JSR instruction that invokes it. The RTS is only needed because of the JSR. So JSR threaded code gives no savings in space over compiling + in line, and the time required to execute the JSR, RTS pair ( assuming I'm reading the tables right ) at 36 clocks, almost doubles the 40 clocks needed to execute the two MOVEs and an ADD. And this is for 32 bit operands. Most primitive operations in FORTH can be expanded in line with little cost in space when compared to JSR threading. There are several other optimizations that can be used once you start compiling to in line code. A compiler that worked this way would generate many redundant push/pop pairs. A single flag can be used to eliminate these pairs of instructions. The result is that in the worst case the generated code would be the same size or perhaps slightly larger than JSR threaded code, but would usually be 1/3 to 1/2 the size and run 2 to 4 times faster. I wrote a FORTH like system for the z80 that compiled code this way. It generated code that was about the same size as CALL threaded code on the z80 ( remember 16 bit addresses, not 32 bit addresses like the 68000 ). The code ran about 3 times faster than the original CALL threaded version of the same system. So if somebody says that a JSR threaded FORTH on the Amiga runs 10 iterations of the sieve in 10 seconds and somebody else says that they have seen a FORTH on the Amiga that will do 10 iterations in 2 to 3 seconds, I would be inclined to believe them. Bob Pendleton. P.S. Sorry for being so long winded.