hgw@rht32.pcs.com (h.-g. willers) (02/05/91)
Can anyone in the Forth community comment on the following issue?

Given an indirect-threaded FORTH for a RISC processor (R3000 or i860), what is the best implementation (concerning speed) for the top of stack, i.e.

    TOS not in a CPU register
    TOS in a CPU register
    TOS and NOS in CPU registers
    TOS and NOS and NOS+1 ....
    ......

Having too many stack items in CPU registers generates much shuffling of data for some stack operations. Which implementation should be chosen?

H.-G.
--
H.-G. Willers  PCS-Mail: hgw  internal phone ( -271 )
DOMAIN: hgw@rht32.pcs.de (EUR) or hgw@rht32.pcs.com (US)
BANG: ..unido!pcsbst!hgw (EUR) or ..pyramid!pcsbst!hgw (US)
koopman@a.gp.cs.cmu.edu (Philip Koopman) (02/06/91)
In article <1134@pcsbst.pcs.com>, hgw@rht32.pcs.com (h.-g. willers) writes:
> Given an indirect threaded FORTH for a RISC-processor (R3000 or i860).
> What is the best implementation (concerning speed) for Top-of-Stack,
> ...
> Having too many stack items in CPU registers generates much shuffling
> of data for some stack operations. Which implementation should be
> chosen?

I compared actual implementations on an 80286 and found that TOS in a register was 10% to 15% faster than TOS not in a register. I expect this to be broadly true for most other register-based CPUs (i.e., not 1%, and not 30%, but probably something in between).

Having more than one stack element in registers led to too much shuffling to be worthwhile.

Phil Koopman  koopman@greyhound.ece.cmu.edu  Arpanet
2525A Wexford Run Rd.
Wexford, PA 15090
*** this space for rent ***
jwoehr@isis.cs.du.edu (Jack J. Woehr) (02/09/91)
In article <1134@pcsbst.pcs.com> hgw@rht32.pcs.com (h.-g. willers) writes:
>Can anyone in the Forth community comment on the following issue:
>
>Given an indirect threaded FORTH for a RISC-processor (R3000 or i860).
>What is the best implementation (concerning speed) for Top-of-Stack,
>i.e.
>  TOS not in a CPU register
>  TOS in a CPU register
>  TOS and NOS in a CPU register
>  TOS and NOS and NOS+1 ....

Depends on the chip architecture. On the FRISC-32 (marketed commercially by Silicon Composers as the SC32), the top four stack items are registers, so la-dee-dah! It's all in the silicon.

On the other hand, TOS in a register is about all that most conventional small CISC chips will manage efficiently. I cache TOS in Vesta's Forth-83i96 for the SBC196 (Intel 80196-based single board). In that case, NOS in a register, even though the 80196 has PLENTY of regs, would be a license to thrash.

The advantage on the SBC196 is that the way address modes work on the 80C196 typically dictates that one operand be in register direct mode and the other be in any of the oblique modes ... so were TOS *not* cached, "+" would be:

    POP R0
    ADD R0,[SP]
    ST R0,[SP]
    RETURN

whereas cached it's

    ADD TOS,[SP]+
    RETURN

for over a 50% advantage in this particular case.

You end up giving some of the speed back, since every time you push a literal (or whatever) to the stack it's two operations

    PUSH TOS
    LD TOS,#FOOBAR

but my guesstimate is that in Vesta Forth-83i96 we are saving over ten percent execution overhead by caching TOS. I say "guesstimate" since the advantage was so obvious prima facie that we never coded the system any other way.

Your question is intriguing ... I have the i860 manuals on my shelf but have never played with this 64-bit graphics engine ... what's up your sleeve? Would love to see your work after you get her up and running!
Keep us all posted, and let us know what you conclude after you have counted all the cycles like good engineers :-)

--
# ..!apple!dunike!nyx!koscej!jax          # "Therefore, the L-RD G-d   #
# ..!hplabs!hp-lsd!oldcolo!jax            #  sent him FORTH ..."       #
# {apple,hplabs,pacbell,ucb}!well!jax     #  - Genesis 3:23            #
# JAX on GEnie SYSOP RCFB 303-278-0364    # Member ANS Forth X3J14 TC  #
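
The instruction sequences quoted in the post above can be tallied with a small model. The Python sketch below is not from the thread. It mirrors the two 80C196 sequences for "+" and the literal push, and it counts instructions only. The class names, the single-instruction uncached literal push (PUSH #n), and the treatment of RETURN as one instruction are assumptions made for illustration; real cycle counts will differ.

```python
RETURN = 1  # assumption: the return to the inner interpreter counts as one instruction


class UncachedStack:
    """Every stack cell lives in memory; R0 is only a scratch register."""

    def __init__(self):
        self.mem = []   # data stack in memory, top at the end
        self.ops = 0    # instructions executed so far

    def push_literal(self, n):
        self.mem.append(n)            # PUSH #n   (assumed single instruction)
        self.ops += 1 + RETURN

    def plus(self):
        r0 = self.mem.pop()           # POP R0
        r0 += self.mem[-1]            # ADD R0,[SP]
        self.mem[-1] = r0             # ST  R0,[SP]
        self.ops += 3 + RETURN

    def top(self):
        return self.mem[-1]


class CachedStack:
    """TOS is held in a register; the rest of the stack lives in memory."""

    def __init__(self):
        self.mem = []
        self.tos = None  # junk until the first push, as on a real machine
        self.ops = 0

    def push_literal(self, n):
        self.mem.append(self.tos)     # PUSH TOS
        self.tos = n                  # LD TOS,#n
        self.ops += 2 + RETURN

    def plus(self):
        self.tos += self.mem.pop()    # ADD TOS,[SP]+
        self.ops += 1 + RETURN

    def top(self):
        return self.tos


def run(stack, program):
    """Execute a list of literals and "+" tokens; return (result, instruction count)."""
    for token in program:
        if token == "+":
            stack.plus()
        else:
            stack.push_literal(token)
    return stack.top(), stack.ops


if __name__ == "__main__":
    program = [1, 2, 3, 4, "+", "+", "+"]
    print("uncached:", run(UncachedStack(), program))
    print("cached:  ", run(CachedStack(), program))
```

Under these assumptions "+" costs 4 instructions uncached and 2 cached, while a literal push costs 2 uncached and 3 cached, so the sequence 1 2 3 4 + + + comes to 20 instructions against 18. How much of the gain survives in practice depends on the mix of words in real code.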
koopman@a.gp.cs.cmu.edu (Philip Koopman) (02/09/91)
*** This is posted as a favor to Igor Agamirzian ***

Organization: Leningrad Institute for Informatics AS USSR
From: Igor Agamirzian <igor@iias.spb.su>

We have experience with different implementations of the top of the stack in the AstroFORTH system for the IBM PC. In the standard system we use the hardware stack of the i8086/i80286 without the top of stack in a register. Using the AstroFORTH target compiler, we implemented two other variants, one with TOS in a register and one with TOS and NOS in registers, and measured their speed on the standard benchmark tests (BYTE, 1984, v.9, No 12 and FORTH Dimensions III/1).

Our result: taking the standard implementation as 100% of execution speed, we got 108% with the one-register implementation and 95% with two registers. I think this shows that handling a two-register top takes more time than is saved by having the arguments of binary operations ready in registers. Of course, these results may differ on other types of processors, though in any case there must be a threshold of effectiveness for the number of registers used to represent the top of the stack.

--
Igor Agamirzian
Office: +7(812)350-2523  Home: +7(812)314-6055  Fax: +7(812)217-5105
Address: 37 Rackova Str. # 4, Leningrad 191011 U.S.S.R.
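
The shuffling cost described above can be illustrated with a toy model. The Python sketch below is not taken from AstroFORTH or any system in this thread. It assumes an "always full" policy in which the top k stack cells are kept in registers at all times, and it counts memory accesses and register-to-register moves for a handful of words. The class, the word set, and the sample program are all hypothetical, and the counts are properties of the model, not measurements.

```python
class StackCache:
    """Data stack whose top k cells always sit in registers (toy model)."""

    def __init__(self, k):
        self.k = k
        self.regs = [0] * k     # regs[0] is TOS, regs[1] is NOS, ...; junk at start
        self.mem = []           # remaining cells, top at the end
        self.mem_accesses = 0
        self.reg_moves = 0

    def push(self, x):
        if self.k == 0:
            self.mem.append(x)
            self.mem_accesses += 1
            return
        self.mem.append(self.regs[-1])          # spill the deepest register
        self.mem_accesses += 1
        for i in range(self.k - 1, 0, -1):      # shuffle the others down
            self.regs[i] = self.regs[i - 1]
            self.reg_moves += 1
        self.regs[0] = x

    def pop(self):
        if self.k == 0:
            self.mem_accesses += 1
            return self.mem.pop()
        x = self.regs[0]
        for i in range(self.k - 1):             # shuffle the others up
            self.regs[i] = self.regs[i + 1]
            self.reg_moves += 1
        self.regs[-1] = self.mem.pop()          # refill the deepest register
        self.mem_accesses += 1
        return x

    def plus(self):
        if self.k == 0:
            a = self.mem.pop()                  # pop, add, store back
            self.mem[-1] += a
            self.mem_accesses += 3
        elif self.k == 1:
            self.regs[0] += self.mem.pop()      # add with post-increment
            self.mem_accesses += 1
        else:
            self.regs[0] += self.regs[1]        # register-register add
            for i in range(1, self.k - 1):      # close the gap left by NOS
                self.regs[i] = self.regs[i + 1]
                self.reg_moves += 1
            self.regs[-1] = self.mem.pop()      # refill the deepest register
            self.mem_accesses += 1

    def dup(self):
        if self.k == 0:
            self.mem_accesses += 1              # read the top cell from memory
            self.push(self.mem[-1])
        else:
            self.push(self.regs[0])

    def top(self):
        return self.regs[0] if self.k else self.mem[-1]


PROGRAM = [2, 3, "+", "dup", "+", 4, "drop"]    # hypothetical sample program


def run(k, program=PROGRAM):
    """Return (result, memory accesses, register moves) for cache depth k."""
    s = StackCache(k)
    for token in program:
        if token == "+":
            s.plus()
        elif token == "dup":
            s.dup()
        elif token == "drop":
            s.pop()
        else:
            s.push(token)
    return s.top(), s.mem_accesses, s.reg_moves


if __name__ == "__main__":
    for k in (0, 1, 2):
        print(k, run(k))
```

For this sample program the model gives 12 memory accesses with nothing cached, 7 with TOS cached, and 7 memory accesses plus 5 register moves with TOS and NOS cached. The second register removes no memory traffic here and only adds shuffling, which is the kind of threshold the post describes. The model leaves out words such as SWAP, where a second register can help, so it says nothing about where the threshold falls on a given processor.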