celarier@reed.bitnet (Stuart Celarier,<None>,2369490,2369386) (10/27/90)
I would like to know if anyone has any experience, references or comments regarding implementing Forth on a RISC processor.  Some of the relevant issues that I need to address (to my boss) are:

1. The computational model of Forth is a stack machine; however, RISC processors are not stack machines -- instead they use a hefty number of registers.  There is no stack support: no push and pop instructions, no stack pointer, and the registers do not form any sort of stack themselves (such as, e.g., the registers on the Intel 80x87 do).  On such a processor, it would be an unacceptable waste of resources to implement a stack in memory from scratch -- er, not to mention slow.  Similarly, implementing a stack on a subset of the registers would involve an unacceptable amount of register-to-register data manipulation, since the registers participating in an operation are hardcoded in that operation.

2. If the stack machine model is abandoned for Forth on a RISC machine, replaced with _something_ else, then the Forth word interface -- you know, those wonderful stack transition diagrams that show you what a definition does to the data -- must change as well.  Would this then be recognizable as Forth to anyone else?  Would it be Forth?

3. Assuming that Forth can be implemented on a RISC processor, and that it doesn't use a stack, then can Forth code be ported to this system?  See rhetorical questions above.

4. Forth gets its speed and to some extent its modularity from working at a very low level, with very small pieces.  This means a rather high frequency of branches (jumps, calls, returns).  On a RISC processor, there is lots of hardware devoted to pipelining the execution of instructions, as well as to detecting branches far enough in advance of their execution so that the instructions are already cached -- giving the "zero wait branch".
Will coding little tiny pieces of code mean that typically the processor will be performing much less than optimally, due to constantly depleting the pipeline, branching more frequently than the instruction processor can keep up with, not being able to simultaneously execute fixed- and floating-point instructions, etc.?  I guess this can be summarized with: are RISC processors designed for larger granularity code than Forth is?  (Note the dangling participle.)

5. Any other comments, such as pitfalls, success stories, technical journals, abandoned projects, contacts to talk with, etc. are GREATLY appreciated.

Stuart Celarier          Voice: 503/236-9490
celarier@reed.edu        Fax:   503/236-9491
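Point 1 above -- keeping the Forth data stack in plain memory because the processor has no push/pop support -- can be sketched in a few lines.  This is a toy illustration in Python, not anything from an actual Forth kernel: the list stands in for a region of RAM and the variable `sp` for a register dedicated to the stack pointer, and every push or pop costs an explicit load or store plus a pointer adjustment, which is exactly the overhead being worried about here.

```python
# Toy sketch of a Forth data stack kept in plain memory, the way a RISC
# port without push/pop instructions would have to do it.  "mem" stands
# in for a RAM region and "sp" for a stack-pointer register; every push
# or pop is an explicit memory access plus a pointer adjustment.
mem = [0] * 64    # stack area (size is arbitrary for the example)
sp = len(mem)     # empty descending stack: sp points just past the top

def push(x):
    global sp
    sp -= 1          # adjust the stack-pointer "register"
    mem[sp] = x      # explicit store to memory

def pop():
    global sp
    x = mem[sp]      # explicit load from memory
    sp += 1
    return x

# The Forth phrase "3 4 +" on this stack:
push(3)
push(4)
push(pop() + pop())
assert pop() == 7
```

Counting the operations makes the cost visible: the phrase above performs six memory accesses and six pointer adjustments where a register machine could have used a single add.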
koopman@a.gp.cs.cmu.edu (Philip Koopman) (10/28/90)
In article <15603@reed.UUCP>, celarier@reed.bitnet (Stuart Celarier,<None>,2369490,2369386) writes:
> ... comments regarding implementing Forth on a RISC processor.
> 1. The computational model of Forth is a stack machine; however
> RISC processors are not stack machines, ...

As with anything on a RISC, your performance will be determined by the quality of your compiler.  You will need at the very least a peephole optimizer that can do things like change "7 +" from "push(7) +" to "add r1,7".  In general the idea is to change a sequence of in-line Forth primitives into a series of register-based operations, thereby eliminating much of the stack manipulation overhead.  Such compilers do exist, but they are currently the exception rather than the rule (perhaps someone out there can name one -- JForth??).

> ... since the registers participating in an operation
> are hardcoded in that operation.

This is one of the reasons it is so attractive to do subroutine inlining on RISCs -- for better register allocation.  Such inlining is not needed on stack hardware.  So, you will find high-quality RISC Forth compilers doing in-lining of short operations.  You will probably also find they can't use as many registers for Forth as for C.

> ... I guess this can be summarized with: are RISC processors designed
> for larger granularity code than Forth is?  (Note the dangling participle.)

Forth/stack-based processors tend to be more "agile" than RISCs.  Specifically, they don't have branch delay slots, and they do support subroutine calls as a quick primitive operation.  A high-quality RISC Forth compiler will try to arrange to fill branch delay slots for you.

> 5. Any other comments, such as pitfalls, success stories, technical
> journals, abandoned projects, contacts to talk with, etc. are GREATLY
> appreciated.

Mitch Bradley at Sun does SPARC-based Forth, but I believe he works at a higher level (Mitch -- how does the optimizing C compiler do at translating stack->registers?)
Tim Olson at AMD posted here that he had done some low-level work with Forth on an AMD 29000.  (The AMD 29000 has a register stack.)  On the other hand, you could just use a stack-based chip (such as the Harris RTX 2000, etc.) if you have an embedded application.

Phil Koopman                koopman@greyhound.ece.cmu.edu   Arpanet
2525A Wexford Run Rd.
Wexford, PA  15090
Senior scientist at Harris Semiconductor, and adjunct professor at CMU.
I don't speak for them, and they don't speak for me.
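The peephole optimization Koopman describes -- spotting the naive "push literal, then add" sequence that a simple Forth compiler emits for "7 +" and replacing it with a single register add -- can be sketched as a pattern rewrite over a window of generated instructions.  The following toy Python version uses an invented tuple encoding for instructions; it is an illustration of the technique, not any real compiler's IR or output.

```python
# Toy peephole pass: rewrite "push literal; add" (naive code for a Forth
# phrase like "7 +") into a single add-immediate on the register holding
# the top of stack.  The instruction tuples are an invented IR.
def peephole(code):
    out = []
    i = 0
    while i < len(code):
        # Two-instruction window: ("push", n) immediately followed by ("add",)
        if (i + 1 < len(code)
                and code[i][0] == "push"
                and code[i + 1] == ("add",)):
            out.append(("addi", "r1", code[i][1]))   # i.e. "add r1,<n>"
            i += 2                                   # consume both instructions
        else:
            out.append(code[i])                      # no match: pass through
            i += 1
    return out

naive = [("push", 7), ("add",)]              # naive code for "7 +"
assert peephole(naive) == [("addi", "r1", 7)]
```

A real optimizer of this kind would carry many such patterns (literal fetch plus store, dup plus operation, etc.) and run them over each basic block until no more rewrites fire; the stack-traffic savings compound as longer runs of in-line primitives collapse into register operations.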
DAVID@PENNDRLS.BITNET (10/29/90)
You might want to check out a book previously mentioned by someone else (I forget who) on this conference:

    Interpretation and Instruction Path Coprocessing
    Debaere & van Campenhout, MIT Press 1990.  ISBN 0-262-04107-3

The first half of this book gives an in-depth analysis of the interpretation process, with special attention to speed/size tradeoffs and optimization techniques.  RISC processors are discussed, with some specific notes on the principles an optimizing compiler would need to follow to produce fast code.  Most of the issues mentioned by Mr. Koopman in his posting are covered by this book.

--
R. David Murray  (DAVID@PENNDRLS.BITNET, DAVID@PENNDRLS.UPENN.EDU)
wmb@MITCH.ENG.SUN.COM (10/31/90)
I have been using Forth on SPARC as long as SPARC has existed; my Forth implementation was the first program to run on real SPARC hardware (not counting a few "do nothing" diagnostic loops).  More than half of all the SPARC machines that Sun has shipped have a live Forth interpreter in the boot PROM, and all new Sun machines will have Forth in the boot PROM.

In the following message, I will compare the implementation of traditional threaded Forth interpreters on RISC microprocessors and on conventional microprocessors.  I am explicitly not addressing the issue of "Forth machines", nor am I addressing the compilation of Forth into native machine code.

> RISC processors are not stack machines, instead they use a hefty number
> of registers.  There is no stack support, such as push and pop
> instructions, no stack pointer, the registers do not form any sort of
> stack themselves (such as, e.g., the registers on the Intel 80x87 do).
> On such a processor, it would be an unacceptable waste of
> resources to implement a stack in memory from scratch -- er, not to
> mention slow.

The lack of "stack support" instructions turns out not to be a big issue.  It takes 2 SPARC instructions to do a "pop", but those 2 instructions execute in 1 cycle each, for a total of 2 cycles.  You can do a pop on a 68020 with 1 instruction, but that 1 instruction probably takes 4-6 cycles.  The 2-instruction SPARC "pop" sequence occupies 8 bytes of memory, compared to 2 bytes on the 680x0.  Code words are not all "pushes" and "pops", and the net expansion of code word size is a bit less than 2:1.  In a traditional threaded code Forth implementation, only the code words would experience this code expansion.  Colon definitions, which make up the bulk of most systems, do not get much bigger (they get a little bigger on average, due to alignment restrictions, but not enough to worry about).
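The pop arithmetic above can be tallied explicitly.  A plausible two-instruction SPARC pop sequence is shown in the comments (the exact register names are illustrative, not taken from Bradley's kernel), alongside the single-instruction 68020 equivalent, with the cycle and size figures as quoted in the post:

```python
# Back-of-the-envelope tally of the "pop" comparison from the post.
# A plausible 2-instruction SPARC pop (register names illustrative):
#     ld   [%sp], %o0        ! load top of stack into a register
#     add  %sp, 4, %sp       ! adjust the stack pointer
# versus the single 68020 instruction:  move.l (sp)+,d0
sparc_instrs, sparc_cycles_each, sparc_bytes_each = 2, 1, 4
m68k_instrs, m68k_cycles, m68k_bytes = 1, 4, 2   # "4-6 cycles"; best case

sparc_cycles = sparc_instrs * sparc_cycles_each
sparc_bytes = sparc_instrs * sparc_bytes_each

assert sparc_cycles == 2             # 2 cycles total, as stated
assert sparc_bytes == 8              # 8 bytes of code, vs. 2 on the 680x0
assert sparc_cycles <= m68k_cycles   # fewer cycles despite twice the instructions
```

The point of the tally: the SPARC sequence wins on time even though it loses 4:1 on code size, which is why the net code-word expansion Bradley quotes (a bit under 2:1) is an acceptable trade.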
My "loaded" 680x0 and SPARC Forth implementations compare thusly:

    68020:   95K
    SPARC:  100K    2.5x faster than 68020

(The speed comparison will of course depend on the implementation.  The 2.5x number was based on the first-ever SPARC machine (Sun 4/260), compared to Sun's then top-of-the-line 68020 machine.  The 2 machines had identical MMUs, identical second-level memory systems, and very similar caches.)  This is very much an "apples to apples" comparison.  It is basically the exact same Forth system running on 2 different machines.  Nearly all the high-level code is 100% identical.  The kernel implementation tradeoffs were the same for both machines: full 32-bit addressing, maximizing speed within the constraint of using threaded code.  (For me, the debugging convenience of threaded code outweighs the speed penalty.)

> 4. Forth gets its speed and to some extent its modularity from
> working at a very low level, with very small pieces.  This means a rather
> high frequency of branches (jumps, calls, returns).  On a RISC processor,
> there is lots of hardware devoted to pipelining the execution of
> instructions, as well as detecting branches far enough in advance of
> their execution so that the instructions are already cached -- giving the
> "zero wait branch".  Will coding little tiny pieces of code mean that
> typically the processor will be performing much less than optimal, due to
> constantly depleting the pipeline, branching more frequently than the
> instruction processor can keep up with, not being able to simultaneously
> execute fixed and floating point instructions, etc.?

Many of the "branch delay slots" in my SPARC Forth kernel are filled, so the pipeline doesn't spend too much of its time stalled on account of branches.  On most SPARC chip implementations, memory loads and stores take 2 and sometimes 3 cycles, but the extra cycles can overlap with the following instructions if there are no data dependencies.
In a lot of cases, I managed to arrange the code to eliminate load dependencies, thus minimizing pipeline stalls on account of memory accesses.

> I guess this can be summarized with: are RISC processors designed
> for larger granularity code than Forth is?

Yes, but branching isn't the real issue.  The real issue is that the register transfer architecture tends to favor compiler register usage optimizations that Forth has a difficult time using.  The important thing is that RISC architecture is NOT worse than conventional architectures for implementing threaded Forth, nor is it much better.  Overall, Forth on RISC tends to run faster, not because the architecture is better suited to Forth, but because the RISC instruction cycles are faster.  SPARC's hardware register window support, designed to support procedural languages like C, neither helps Forth nor harms it.

> 5. Any other comments, such as pitfalls, success stories, technical
> journals, abandoned projects, contacts to talk with, etc. are GREATLY
> appreciated.

Here is an amusing effect: under certain (baroque but real) constraints, Forth runs FASTER than C on a SPARC.  The constraints: the code is located in a relatively slow 8-bit-wide PROM (9 wait states/byte x 4 bytes/instruction).  RAM is at a premium, so it is not desirable to copy the PROM into RAM and execute from there.  I copy the Forth kernel code into 2 pages of RAM and turn on the cache for those pages.  The threaded code is stored as shifted 16-bit offsets from a base register, so you only have to fetch 2 PROM bytes to follow a thread.  Under these conditions, it takes fewer cycles to follow a Forth thread (i.e. execute NEXT) than to execute a single machine instruction from PROM.

Note: the preceding lengthy tome on my SPARC Forth implementation does not apply to the Forth in the Sun SPARC boot PROMs; I tuned that implementation for small size instead of speed.

Mitch Bradley, wmb@Eng.Sun.COM
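The inner interpreter ("NEXT") that Bradley's timing argument turns on can be sketched structurally: a threaded Forth definition is just a sequence of execution tokens, and NEXT fetches the next token, advances the interpreter pointer, and dispatches through it.  The following toy Python model (functions stand in for code-field addresses; nothing here models SPARC timing, PROM wait states, or the 16-bit-offset encoding) shows the shape of that fetch-and-dispatch loop:

```python
# Toy model of a threaded-code inner interpreter.  A "thread" is a list
# of execution tokens (Python functions standing in for code-field
# addresses); the while loop at the bottom is NEXT, the fetch-and-
# dispatch step that runs between every pair of primitives.
stack = []
ip = 0           # interpreter pointer: index into the current thread
running = True

def lit7():                      # primitive: push the literal 7
    stack.append(7)

def lit35():                     # primitive: push the literal 35
    stack.append(35)

def plus():                      # primitive: +
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)

def bye():                       # primitive: stop the interpreter
    global running
    running = False

# A compiled thread, roughly ": test  7 35 +  ;" followed by a stop token.
thread = [lit7, lit35, plus, bye]

while running:                   # NEXT: fetch token, advance ip, execute
    word = thread[ip]
    ip += 1
    word()

assert stack == [42]
```

Bradley's "baroque but real" speed inversion falls out of this structure: NEXT itself is a couple of cached-RAM instructions, and following a thread costs only a 2-byte fetch from the slow PROM, while executing native code there costs a full 4-byte instruction fetch at 9 wait states per byte.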