[comp.lang.forth] TOS in a CPU register

hgw@rht32.pcs.com (h.-g. willers) (02/05/91)

Can anyone in the Forth community comment on the following issue:

Given an indirect threaded FORTH for a RISC processor (R3000 or i860).
What is the best implementation (concerning speed) for Top-of-Stack,
i.e.
	TOS not in a CPU register
	TOS in     a CPU register
	TOS and NOS in a CPU register
	TOS and NOS and NOS+1 ....
	......

Having too many stack items in CPU registers generates much shuffling
of data for some stack operations. Which implementation should be
chosen?

    H.-G.

--
H.-G. Willers       PCS-Mail: hgw       internal phone ( -271 )
DOMAIN:  hgw@rht32.pcs.de   (EUR) or  hgw@rht32.pcs.com    (US)
BANG:    ..unido!pcsbst!hgw (EUR) or  ..pyramid!pcsbst!hgw (US)

koopman@a.gp.cs.cmu.edu (Philip Koopman) (02/06/91)

In article <1134@pcsbst.pcs.com>, hgw@rht32.pcs.com (h.-g. willers) writes:
> Given an indirect threaded FORTH for a RISC processor (R3000 or i860).
> What is the best implementation (concerning speed) for Top-of-Stack,
> ...
> Having too many stack items in CPU registers generates much shuffling
> of data for some stack operations. Which implementation should be
> chosen?

I compared actual implementations on an 80286, and found that TOS in
register was 10% to 15% faster than TOS not in a register.
I expect this will be broadly true for most other register-based CPUs
(i.e., not 1%, and not 30%, but probably something in between).
Having more than 1 stack element in registers led to too much shuffling
to be worthwhile.

  Phil Koopman                koopman@greyhound.ece.cmu.edu   Arpanet
  2525A Wexford Run Rd.
  Wexford, PA  15090
*** this space for rent ***

jwoehr@isis.cs.du.edu (Jack J. Woehr) (02/09/91)

In article <1134@pcsbst.pcs.com> hgw@rht32.pcs.com (h.-g. willers) writes:
>Can anyone in the Forth community comment on the following issue:
>
>Given an indirect threaded FORTH for a RISC processor (R3000 or i860).
>What is the best implementation (concerning speed) for Top-of-Stack,
>i.e.
>	TOS not in a CPU register
>	TOS in     a CPU register
>	TOS and NOS in a CPU register
>	TOS and NOS and NOS+1 ....

	Depends on the chip architecture. On the FRISC-32 (marketed
commercially by Silicon Composers as the SC32), the top four stack
items are registers, so la-dee-dah! It's all in the silicon.

	On the other hand, TOS in a register is about all that most
conventional small CISC chips will manage efficiently. I cache TOS
in Vesta's Forth-83i96 for the SBC196 (Intel 80196-based single board).
In that case, NOS in a register, even though the 80196 has PLENTY regs,
would be a license to thrash.

	The advantage on the SBC196 is that the way address modes work on
the 80C196 typically dictates that one operand be register direct mode and
the other be any of the oblique modes ... so were TOS *not* cached, "+"
would be:

	POP R0
	ADD R0,[SP]
	ST  R0,[SP]
	RETURN

	whereas cached it's

	ADD TOS,[SP]+
	RETURN

	for over a 50% advantage in this particular case.
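
	A minimal C sketch of the two "+" listings above, counting data-stack
memory accesses instead of 80C196 cycles. The Vm struct, the load/store
helpers and the function names are hypothetical and exist only for this
illustration; they are not taken from Vesta Forth-83i96.

```c
#include <assert.h>

/* Hypothetical model: a small in-memory data stack plus a counter of
   memory accesses, so the two layouts can be compared. */
enum { STACK_CELLS = 16 };

typedef struct {
    int cells[STACK_CELLS]; /* in-memory part of the data stack */
    int sp;                 /* number of cells currently in memory */
    int tos;                /* cached top of stack (cached layout only) */
    int mem_ops;            /* loads + stores to cells[] */
} Vm;

static int load(Vm *vm, int i) { vm->mem_ops++; return vm->cells[i]; }
static void store(Vm *vm, int i, int v) { vm->mem_ops++; vm->cells[i] = v; }

/* "+" with TOS in memory: pop one operand, add the new top, store back. */
static void plus_uncached(Vm *vm) {
    assert(vm->sp >= 2);
    int r0 = load(vm, --vm->sp);   /* POP R0      */
    r0 += load(vm, vm->sp - 1);    /* ADD R0,[SP] */
    store(vm, vm->sp - 1, r0);     /* ST  R0,[SP] */
}

/* "+" with TOS cached: a single load pops NOS straight into the add. */
static void plus_cached(Vm *vm) {
    assert(vm->sp >= 1);
    vm->tos += load(vm, --vm->sp); /* ADD TOS,[SP]+ */
}
```

	With 3 and 4 on the stack, plus_uncached touches memory three times
and plus_cached once; both leave 7 on top. Real cycle counts depend on the
chip, so this only shows where the saving comes from.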

	You end up giving some of the speed back since every time you
push a literal (or whatever) to the stack it's two operations

	PUSH TOS
	LD   TOS,#FOOBAR

	but my guesstimate is that in Vesta Forth-83i96 we are saving over
ten percent execution overhead by caching TOS.

	I say "guesstimate" since the advantage was so obvious prima
facie that we never coded the system any other way.
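
	A rough way to see how the literal-push penalty trades off against
the "+" saving is to tally instructions over a short word sequence. The
sketch below is hypothetical: the cached costs and the uncached "+" cost
come from the listings above (RETURN excluded), while the uncached literal
cost of one instruction is an assumption, and the names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_LIT, OP_PLUS, OP_COUNT } Op;

/* Instructions per primitive, excluding the shared RETURN.
   The uncached literal cost (1) is an assumption. */
static const int cost_uncached[OP_COUNT] = { [OP_LIT] = 1, [OP_PLUS] = 3 };
static const int cost_cached[OP_COUNT]   = { [OP_LIT] = 2, [OP_PLUS] = 1 };

/* Sum the per-primitive cost over a sequence of primitives. */
static int total_cost(const int *cost, const Op *seq, size_t n) {
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        assert(seq[i] < OP_COUNT);
        sum += cost[seq[i]];
    }
    return sum;
}
```

	Under these assumptions the sequence LIT LIT + costs 5 instructions
either way, while LIT + + costs 7 uncached against 4 cached, so the net
gain depends on how the mix of pushes and operators falls out in real code.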

	Your question is intriguing ... I have the i860 manuals on my shelf
but have never played with this 64-bit graphics engine ... what's up
your sleeve? Would love to see your work after you get her up and running!

	Keep us all posted, and let us know what you conclude after you
have counted all the cycles like good engineers :-)
-- 
# ..!apple!dunike!nyx!koscej!jax       # "Therefore, the L-RD G-d  #
# ..!hplabs!hp-lsd!oldcolo!jax         #   sent him FORTH ..."     #
# {apple,hplabs,pacbell,ucb}!well!jax  #  - Genesis 3:23           #
# JAX on GEnie SYSOP RCFB 303-278-0364 # Member ANS Forth X3J14 TC #

koopman@a.gp.cs.cmu.edu (Philip Koopman) (02/09/91)

*** This is posted as a favor to Igor Agamirzian ***

Organization: Leningrad Institute for Informatics AS USSR
From: Igor Agamirzian <igor@iias.spb.su>

We have experience with different implementations of the top of the stack in
the AstroFORTH system for the IBM PC. In the standard system we use the hardware
stack of the i8086/i80286 without the top in a register. Using the AstroFORTH
target compiler, we implemented two other stack-top layouts: TOS in a register,
and TOS and NOS both in registers. We then measured speed on the standard
benchmark tests (BYTE, 1984, v.9, No 12 and FORTH Dimensions III/1).
Our result: taking the standard implementation as 100% of execution speed, we
got 108% with the one-register implementation and 95% with two registers. I
think the result shows that handling a two-register top costs more time than is
saved by having the arguments of binary operations already in registers. Of
course, these results may differ on other types of processors, though in
any case there must be a threshold of effectiveness for the number of registers
used to represent the top of the stack.
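
The extra handling for a two-register top can be sketched in C. This is a
hypothetical model, not AstroFORTH code: it assumes both cache registers are
always kept full, and the Vm struct and function names are illustrative only.
It counts memory accesses and register-to-register copies for DUP and DROP.

```c
#include <assert.h>

enum { CELLS = 16 };

typedef struct {
    int mem[CELLS]; /* stack items below the cached ones */
    int sp;         /* number of items currently in mem[] */
    int tos, nos;   /* cached items; nos is unused with one register */
    int mem_ops;    /* loads and stores to mem[] */
    int moves;      /* register-to-register copies */
} Vm;

/* One register cached: DUP spills TOS, DROP refills it. */
static void dup1(Vm *vm) {
    assert(vm->sp < CELLS);
    vm->mem[vm->sp++] = vm->tos; vm->mem_ops++;
}
static void drop1(Vm *vm) {
    assert(vm->sp > 0);
    vm->tos = vm->mem[--vm->sp]; vm->mem_ops++;
}

/* Two registers cached: the same memory traffic plus one copy each. */
static void dup2(Vm *vm) {
    assert(vm->sp < CELLS);
    vm->mem[vm->sp++] = vm->nos; vm->mem_ops++; /* spill old NOS */
    vm->nos = vm->tos; vm->moves++;             /* copy of TOS becomes NOS */
}
static void drop2(Vm *vm) {
    assert(vm->sp > 0);
    vm->tos = vm->nos; vm->moves++;             /* NOS slides up */
    vm->nos = vm->mem[--vm->sp]; vm->mem_ops++; /* refill NOS */
}
```

In this model DUP followed by DROP costs two memory accesses in either layout,
and the two-register layout adds two register copies on top. A binary operator
does find both arguments in registers, but it still has to refill NOS from
memory afterwards.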
-- 
-- Igor Agamirzian

Office: +7(812)350-2523    Home: +7(812)314-6055    Fax: +7(812)217-5105
Address: 37 Rackova Str. # 4, Leningrad 191011 U.S.S.R.