ken@key.COM (Ken Kofman) (08/02/89)
I have been a FORTH fan ever since I learned the language about a year and a half ago. I have also been following some of the new cpu architectures, such as mipsco, AMD29000, transputer, etc, and have been wondering about the extreme performance hit that FORTH must take on these cpus compared to languages like C. It would be extremely difficult for FORTH to take full advantage of the large register files the RISC chips provide, unless a FORTH compiler is provided, or the programmer resorts to assembler :(. Even on the transputer, which, at first glance, seems particularly well suited to FORTH, there must be a good deal of time wasted maintaining the stack- -the transputer's stack is only three deep. Intrinsic to the current RISC philosophy is the idea that memory accesses are BAD, and should be avoided where possible. RISC chips therefore provide a large register set, and usually a memory cache. A FORTH program will be making many memory accesses. The cache will sometimes help, but will often prove ineffectual. One would want the top of the stacks to be cached, and usually they will be. However, these caches are usually direct mapped, sometimes are 2 way. There is a reasonable chance that other data will blow the stack tops away at any time during program execution. I think I'm beginning to ramble. Basically, it seems that the current crop of cpu architectures, with multiple functional units with varying data latencies, and large register sets do not work well with FORTH. register sets do not work well with FORTH. Am I missing something? Has anyone out there used FORTH on new and/or interesting architectures? Very curious, ken -- ------------------------------------------------------------------- Disclaimer: Don't blame me, Amdahl MADE me say it. Ken Kofman ken@samaria.key.com ...!pacbell!key!ken
tim@cayman.amd.com (Tim Olson) (08/02/89)
In article <975@key.COM> ken@samaria.key.COM (Ken Kofman) writes: | It would be extremely difficult for FORTH to take full advantage of | the large register files the RISC chips provide, unless a FORTH | compiler is provided, or the programmer resorts to assembler :(. | Even on the transputer, which, at first glance, seems particularly | well suited to FORTH, there must be a good deal of time wasted | maintaining the stack- -the transputer's stack is only three deep. Yes, large standard register files are not very effective in speeding up FORTH. However, the Am29000 register file is different. There are 128 local registers, which are offset from the register stack pointer (global register 1). Thus, the local register file can act like a stack cache, holding the top (or all!) of the operand stack. FORTH primatives (compiled in-line) then look like: + DUP SWAP add lr1, lr1, lr0 add lr127, lr0, 0 add tmp, lr0, 0 add gr1, gr1, 4 sub gr1, gr1, 4 add lr0, lr1, 0 add lr1, tmp, 0 with lr0 pointing to the top of stack, lr1 the next of stack, etc. A "smart" FORTH compiler would be able to remove the extraneous stack-adjustment instructions and change the local variable references accordingly. The FORTH notion of separate operand and control stacks can also be implemented by splitting the local register file into two 64-register chunks, and keeping two shadow stack pointers, swapping the correct pointer into the real stack pointer when required. This raises a couple of questions: First, are 64 words of control and operand stack enough? How deeply nested to FORTH words typically get? Second, does a typical FORTH program (coded with in-line primatives and call-threaded) spend most of its time executing FORTH primatives or executing the thread of control for user-written words? -- Tim Olson Advanced Micro Devices (tim@amd.com)
jax@well.UUCP (Jack J. Woehr) (08/03/89)
In article <26573@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes: ... original posting deleted, but yes the transputer is not a very effective Forth engine ... > >Yes, large standard register files are not very effective in speeding up >FORTH. However, the Am29000 register file is different. There are 128 >local registers, which are offset from the register stack pointer >(global register 1). Thus, the local register file can act like a stack >cache, holding the top (or all!) of the operand stack. > >FORTH primatives (compiled in-line) then look like: > > + DUP SWAP >add lr1, lr1, lr0 add lr127, lr0, 0 add tmp, lr0, 0 >add gr1, gr1, 4 sub gr1, gr1, 4 add lr0, lr1, 0 > add lr1, tmp, 0 > >with lr0 pointing to the top of stack, lr1 the next of stack, etc. A >"smart" FORTH compiler would be able to remove the extraneous >stack-adjustment instructions and change the local variable references >accordingly. > VERY Interesting!!! Just got hold of the 29000 User's Manual today, thanx to AMD's eager and friendly marketing division. Is this to be taken to mean that you have a Forth up and running on some 29000-based boards or machines? More info, please! >The FORTH notion of separate operand and control stacks can also be >implemented by splitting the local register file into two 64-register >chunks, and keeping two shadow stack pointers, swapping the correct >pointer into the real stack pointer when required. > >This raises a couple of questions: > >First, are 64 words of control and operand stack enough? How deeply >nested to FORTH words typically get? Well, one opinion on that is expressed in the Forth 83-Standard, q.v. (downloadable from RCFB, phonenum below). But answer is qualified yes. >Second, does a typical FORTH program (coded with in-line primatives and >call-threaded) spend most of its time executing FORTH primatives or >executing the thread of control for user-written words? > Depends on the programmer, the system, etc., but the answer would tend towards the latter. Not all Forths are threaded the same. The new Forth Stack/RISC chips like the SC32 are VERY efficient in this respect. At the opposite extreme, an indirect-threaded Forth can spend 1/3 of its life executing the threading routine NEXT. There is all sorts of middle ground in this. Tell us more about what you are doing with the 29000! {}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{} {} {} {} jax@well ." Sysop, Realtime Control and Forth Board" FIG {} {} jax@chariot ." (303) 278-0364 3/12/2400 8-n-1 24 hrs." Chapter {} {} JAX on GEnie ." Tell them JAX sent you!" Coordinator {} {} {} {}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}{}
tim@cayman.amd.com (Tim Olson) (08/04/89)
In article <12989@well.UUCP> jax@well.UUCP (Jack J. Woehr) writes: | VERY Interesting!!! Just got hold of the 29000 User's Manual | today, thanx to AMD's eager and friendly marketing division. Is this | to be taken to mean that you have a Forth up and running on some 29000-based | boards or machines? More info, please! No, just in the "thinking about the problem" stage, right now. | >Second, does a typical FORTH program (coded with in-line primatives and | >call-threaded) spend most of its time executing FORTH primatives or | >executing the thread of control for user-written words? | > | | Depends on the programmer, the system, etc., but the answer would | tend towards the latter. That is what I suspected. It would then seem to be adventageous to optimize for fast function call and return (threading) at the expense of primative speed. -- Tim Olson Advanced Micro Devices (tim@amd.com)
wmb@SUN.COM (Mitch Bradley) (08/04/89)
Speed isn't everything. It's fun to play the "my system is faster than your system" game, but to a large extent, for the stuff I do most often, even a garden variety 68000 is plenty fast. (On the other hand, a garden variety 8088 is not fast enough for anything I can think of :-) On a 10 or 20 MIPs CPU chip, I will gladly spend a lot of those blazing CPU cycles in order to make my life as a programmer easier. In particular, I like threaded code a lot, because it is easy to decompile and easy to debug. In so many applications, it's plenty fast, and when it's not, writing a very few code words gets you within epsilon of the best speed you could hope to achieve anyway. Before you get out your flame throwers, I realize that there are some applications that demand the ultimate in speed. There are also a lot of applications that do not. Now, back to something closer to the original question of how Forth can or cannot use large register files on RISC chips. SPARC chips generally have 7 or 8 windows of 16 32-bit registers each. These registers are quite effective as a "cache" for C stack frames, but they are pretty much useless for Forth. My SPARC Forth implementation just sits in one of the registers and uses the registers in that window as general registers. There is an interpreter pointer, a data stack pointer, a return stack pointer, a user area pointer, a code base pointer, and a top of stack register. The rest of the registers are just scratch registers. Pretty basic, obvious, boring design. It could be faster, but it still spend most of its cycles waiting on the disk, or the Ethernet, or the UART, or me. One interesting side effect of this non-use of the window register mechanism is that Forth becomes an excellent debugger for C programs like the Unix kernel. Since Forth never triggers overlow/underflow of this stack frame "cache", it is innocuous with respect to the code that it is trying to debug. If I had it to do over, I wouldn't have cached the top of stack in a register. The 5 or 10% speedup that results in not worth the trouble it causes in trying to explain to people how to write code words. By the way, the SPARC Forth kernel was the very first high level language program ever to run on real SPARC hardware. When the first Sun 4/260 prototypes came in, a bug in the memory logic prevented C programs from being able to work. Forth ran just fine, and was used to debug the machine to the point of being able to run C. (The SPARC Forth kernel had been previously debugged using an instruction set simulator). Mitch
koopman@a.gp.cs.cmu.edu (Philip Koopman) (08/05/89)
In article <26573@amdcad.AMD.COM>, tim@cayman.amd.com (Tim Olson) writes: > First, are 64 words of control and operand stack enough? How deeply > nested to FORTH words typically get? My data showed that 32 elements are typically enough. Many Forth programs don't really need more than 16 elements. Having hardware spill detection logic is very handy, but it's enough just to generate an interrupt when a spill is about to happen (hardware spill handling logic is nice to have, but not essential). > Second, does a typical FORTH program (coded with in-line primatives and > call-threaded) spend most of its time executing FORTH primatives or > executing the thread of control for user-written words? My measurements for dynamic instruction mix are: CALL: 12.2% RETURN: 11.7% (There are obscure reasons why this isn't 12.2%) in terms of frequency of execution compared to other Forth primitives. The definition of a primitive for this study was "anything in assembly language in a highly optimized Forth compiler, including 32-bit math functions". So, I think these represent a lower bound for less optimized Forth systems. These numbers provide an obvious source of motiviation for designing Forth chips that often get subroutine returns for free by combining them with other operations, and can even get subroutine calls for free, or at least perform calls in a single clock cycle. Phil Koopman koopman@greyhound.ece.cmu.edu Arpanet 2525A Wexford Run Rd. Wexford, PA 15090 Senior Scientist at Harris Semiconductor. I don't speak for them, and they don't speak for me.