[comp.compilers] Threaded code

ian@uunet.UU.NET (Ian Lance Taylor) (04/04/91)

The book ``Threaded Interpretive Languages'' has a quite complete
implementation of a threaded version of Forth in Z80 assembler.  It's
a very complete description of why threaded implementations exist and
how to implement them easily.  It's by R. G. Loeliger and was
published by Byte Books (ISBN 0-07-038360-X).  It was published in
1981, though, and I'm sure it's long out of print.

Ian Taylor              airs!ian@uunet.uu.net               uunet!airs!ian
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

eliot@cs.qmw.ac.uk (Eliot Miranda) (04/05/91)

I recently posted articles to comp.compilers & alt.sources on how
to write threaded code machines in C.  I received the following questions
from Simon Peyton Jones at Glasgow.  They are specific to GCC.
Since they have non-obvious answers & since the answers suggest
augmentation of the GCC compiler I'm posting my response to Simon.

>From: Simon L Peyton Jones <simonpj@cs.gla.ac.uk>
>
>I saw your article about threaded code.  Like you and others, we are
>using C as an assembler, only for a pure functional language, Haskell.
>I have some brief questions.
>
>1. By telling GCC not to use a frame pointer, one can eliminate
>the prolog entirely, can't one?  So why edit it out?

No, registers local to the procedure will still be saved & stack space
allocated for automatic variables. This IS a problem since the
threaded-code jump at the end of the procedure will miss the register
restores before the epilogue.  Consequently the stack will grow unless
these register saves & stack-space allocations are removed.  Also
GCC can not always eliminate the frame pointer.

>I guess the answer is going to be local variables, allocated once for
>all by the StartTCMachine routine.  Still, it seems quite a pain.  I guess
>one could sacrifice some (perhaps slight) speed by using a bunch of
>globals instead.
For certain routines, not using register variables will be expensive
(e.g. a simple integer arithmetic primitive).

>2. You edit out the epilogue for tidiness only, I take it.  It doesn't
>cause any problems if you leave it in, does it?
No, but given that the prologue has to be removed & removing the epilogue
is as easy (& given finite memory :-) one might as well remove it.
>
>3. Why do you use no-defer-pop (with the associated bug)?
OK. This is again to avoid stack growth.  On conventional stack architectures
gcc will try & combine the argument popping code of a sequence of procedure calls.
e.g.
extern long a, b, c;
void doit() {
	foo(a); bar(b); baz(c);
}

with -O -no-defer-pop one might expect gcc to generate

	link %a6,#0
	movel a,%sp@-
	jbsr foo
	addqw #4,%sp
	movel b,%sp@-
	jbsr bar
	addqw #4,%sp
	movel c,%sp@-
	jbsr baz
	addqw #4,%sp
	unlk %a6
	rts

but because gcc knows that the unlk instruction will roll back the stack
in fact gcc generates:

	link %a6,#0
	movel a,%sp@-
	jbsr foo
	addqw #4,%sp
	movel b,%sp@-
	jbsr bar
	addqw #4,%sp
	movel c,%sp@-
	jbsr baz
	unlk %a6
	rts

With -O -fdefer-pop gcc optimizes out the pops completely & generates:

	link %a6,#0
	movel a,%sp@-
	jbsr foo
	movel b,%sp@-
	jbsr bar
	movel c,%sp@-
	jbsr baz
	unlk %a6
	rts

with -O -fomit-frame-pointer -fdefer-pop gcc generates:

	movel a,%sp@-
	jbsr foo
	movel b,%sp@-
	jbsr bar
	movel c,%sp@-
	jbsr baz
	addw #12,%sp
	rts

& with -O -fomit-frame-pointer -fno-defer-pop gcc generates:

	movel a,%sp@-
	jbsr foo
	addqw #4,%sp
	movel b,%sp@-
	jbsr bar
	addqw #4,%sp
	movel c,%sp@-
	jbsr baz
	addqw #4,%sp
	rts

All the above cases are as one would wish.  The elimination of all
defered pops in the unlk instruction is especially clever.

However, in the presence of the threaded-code jump the waters darken!
Consider what gcc generates for:

	register void (**tcip)() asm("%a5");

	#define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");return;}while(0)

	extern long a, b;
	void doit() {
		foo(a); bar(b); JUMPNEXT;
	}
with -O -fdefer-pop gcc generates

doit:
	link %a6,#0
	movel a,%sp@-
	jbsr foo
	movel b,%sp@-
	jbsr bar
#APP
	movl %a5@+,%a0; jmp %a0@
#NO_APP
	unlk %a6
	rts

This is clearly no good because the arguments to both foo & bar
will never be popped. Every time doit() is executed the stack will grow
by 8 bytes.  Soon your program will dump core with a very large stack
segment!

with -O -fno-defer-pop gcc generates:

	link %a6,#0
	movel a,%sp@-
	jbsr foo
	addqw #4,%sp
	movel b,%sp@-
	jbsr bar
#APP
	movl %a5@+,%a0; jmp %a0@
#NO_APP
	unlk %a6
	rts

Again useless because bar's pop has been folded into the unlk
which won't be executed.

with -O -fdefer-pop -fomit-frame-pointer gcc generates

	movel a,%sp@-
	jbsr foo
	movel b,%sp@-
	jbsr bar
	addqw #8,%sp
#APP
	movl %a5@+,%a0; jmp %a0@
#NO_APP
	rts

This is great. However, not all functions are susceptible to
the omit-frame-pointer optimization (i.e. functions
with local variables).  E.g. the code generated for:

	register void (**tcip)() asm("%a5");

	#define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");return;}while(0)

	extern long a, b;
	void doit() {
		char	large[1024];
		foo(a,large); bar(b); JUMPNEXT;
	}

with -O -fomit-frame-pointer -fdefer-pop is:

	link %a6,#-1024
	pea %a6@(-1024)
	movel a,%sp@-
	jbsr foo
	movel b,%sp@-
	jbsr bar
#APP
	movl %a5@+,%a0; jmp %a0@
#NO_APP
	unlk %a6
	rts

so in general one can't rely on -fomit-frame-pointer.
For the above example both
	-O -fomit-frame-pointer -fno-defer-pop
and
	-O -fno-defer-pop
generate:

doit:
	link %a6,#-1024
	pea %a6@(-1024)
	movel a,%sp@-
	jbsr foo
	addqw #8,%sp
	movel b,%sp@-
	jbsr bar
#APP
	movl %a5@+,%a0; jmp %a0@
#NO_APP
	unlk %a6
	rts

This is also useless because bar's argument pop has been folded away.  The
problem is that gcc will always fold away the last call's argument pop if
the function has a frame pointer, and -fomit-frame-pointer can't allways
get rid of the frame pointer.  In fact, in the presence of variable sized
automatic variables or calls on alloca it would be very hard (impossible
for recursive functions?) to do.

The neatest solution I've come up with is to use -fno-defer-pop and a
dummy function call between the threaded-code jump and the return:

	register void (**tcip)() asm("%a5");

	#define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");SBD();return;}while(0)

	extern long a, b;
	void doit() {
		foo(a); bar(b); JUMPNEXT;
	}
with -O -fno-defer-pop gcc generates:

	link %a6,#0
	movel a,%sp@-
	jbsr foo
	addqw #4,%sp
	movel b,%sp@-
	jbsr bar
	addqw #4,%sp
#APP
	movl %a5@+,%a0; jmp %a0@
#NO_APP
	jbsr SBD
	unlk %a6
	rts

Now bar's argument pop is not folded because its no longer the last
call in the routine, SBD is.
So the call to SBD
	a) prevents gcc's 'last call argument pop fold into unlk' optimization
	   which prevents uncontrolled stack growth.
	b) doesn't get executed because of the jump
	c) is trivial to remove from the assembler with a sed-script


One an try to use -fcaller-saves, but this surrounds calls with unnecessary
register saves & restores that for the code to be optimal have to be edited out.

>4. Why does JUMPNEXT have a loop?  Surely the jump leaves the loop right
>away.  Presumably you are tricking the compiler somehow.

This is old C lore.  The problem is
	'How do you write a macro that is a sequence of statements
	 that can be used wherever a single statement can?'

take the following definition of JUMPNEXT:
#define JUMPNEXT asm("movl %a5@+,%a0; jmp %a0@");return;

Now invoke it here:
	if (its_time_to_jump)
		JUMPNEXT;
	do_something_else();

This expands to:
	if (its_time_to_jump)
		asm("movl %a5@+,%a0; jmp %a0@");
	return;
	do_something_else();

Not at all whats intended!

There are two tricks I know of (the first I saw in Berkely Smalltalk,
the second in Richard Stallman's gcc manual. I expect they're both
quite old).
The first is to surround your statements with
if (TRUE){statements}else
i.e.
#define JUMPNEXT if(1){asm("movl %a5@+,%a0; jmp %a0@");return;}else
So now we get:
	if (its_time_to_jump)
		if (1){
			asm("movl %a5@+,%a0; jmp %a0@");
			return;
		else;
	do_something_else();

which works because C binds elses innermost first. However, some
compilers will whine about dangling elses.  The second scheme is
more elegant (-:

Surround your statements with
do{statements}while(FALSE);
which will execute statements precisely once (its NOT a loop).
i.e.
#define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");SBD();return;}while(0)
expands to

	if (its_time_to_jump)
		do {
			asm("movl %a5@+,%a0; jmp %a0@");
			return;
		while(0);
	do_something_else();

which does what's wanted and doesn't incur compiler whines.


>Thanks
>
>Simon L Peyton Jones, Glasgow University

More and more people are taking the 'use C as an assembler' route, and
more and more people are using GCC to do it (because its code quality is
good, it had global register variables, and it has an excellent asm
facility).  The threaded-code in C idea is also becomming more popular.
But as the code above demonstrates, one does have to side-step
optimizations and develop system-specific assembler editing scripts.

I'd like to ask Richard Stallman & the GCC development team for
	-fno-prolog -fno-epilog
flags that instruct gcc to generate
	a) no register saves or restores
	b) no automatic variable allocation
	c) no procedure linkage/frame creation

Then the optimal 'Threaded-Code Machine in GCC C' can be compiled without
any assembler editing scripts at all.

Also nice would be a way of telling GCC that an asm statement
changed the flow of control so GCC could
	a) warn about not-reached code
	b) eliminate unnecessary code (do more code folding)
-- 
Eliot Miranda			email:	eliot@cs.qmw.ac.uk
Dept of Computer Science	Tel:	071 975 5229 (+44 71 975 5229)
Queen Mary Westfield College	ARPA:	eliot%cs.qmw.ac.uk@nsf.ac.uk	
Mile End Road			UUCP:	eliot@qmw-cs.uucp
LONDON E1 4NS
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

brennan@bcsaic.boeing.com (Michael D Brennan) (05/03/91)

Another method for obtaining threaded byte code for an interpreter
is to edit the assembler output of a big switch
rather than editing the prologue and epilogue off functions calls.

You don't need gcc, global vars in registers, works with smart and
dumb compilers, and all optimization can be turned on.

For example:

This C routine executes (unthreaded) byte code for an interpreter 
that can add, subtract and print.

#define  HALT     0
#define  PUSH     1
#define  ADD      2
#define  SUB      3
#define  PRINT    4

static int stack[32] ;

void  execute(code_ptr)
  register int *code_ptr ;
{
  register int *stack_ptr = stack - 1 ;


  while ( 1 )
  {
    switch( *code_ptr++ )
    {
      case HALT   :  return ;
      case PUSH   :
	    * ++ stack_ptr = *code_ptr++ ;
	    break ;

      case  ADD   :
	    stack_ptr-- ;
	    *stack_ptr += stack_ptr[1] ;
	    break ;

      case  SUB   :
	    stack_ptr-- ;
	    *stack_ptr -= stack_ptr[1] ;
	    break ;

      case   PRINT  :
             printf("%d\n", *stack_ptr--);
	     break ;
     }
  }
}

-------------------------------------------------------

to interpret 2 + (3 - 4)

the front end "compiles" in  int code[]

PUSH, 2, PUSH, 3, PUSH, 4, SUB, ADD, PRINT, HALT

and calls  execute(code).

------------------------------------------------------

The difference between this and the threaded code discussed over
the last few weeks is the switch gets compiled as

      jmp	TABLE[ *code_ptr++ ]

where TABLE is the jump table generated by the compiler which holds
the addresses of the case labels.

With threading, the transitions between functions become

      jmp	*code_ptr++


but this is easy to get by editing the assembler output to
export the case label and recode the switch.

--------------------------------------------------

For example on a SPARC:

code_ptr is %o0
stack_ptr is %i5


	.....
			      ! case PUSH
L77004:
	ld	[%i0],%o1
	inc	4,%i5
	inc	4,%i0
	b	L77008
	st	%o1,[%i5]

	.....

			     !  the switch, doesn't change structure
			     !  as you add new op codes

L77008:
	mov	%i0,%i4
	ld	[%i4],%i4
	inc	4,%i0
	cmp	%i4,4
	bgu	L77008
	sll	%i4,2,%i4
	sethi	%hi(L2000000),%o1
	or	%o1,%lo(L2000000),%o1	! [internal]
	ld	[%i4+%o1],%o0
	jmp	%o0
	nop
L2000000:  		  !  the jump TABLE
	.word	L77003    !  HALT  etc
	.word	L77004
	.word	L77005
	.word	L77006
	.word	L77007


-------------------------------------------
modify by adding global labels and edit the switch
	


	.....
			      ! case PUSH
_push :
L77004:
	ld	[%i0],%o1
	inc	4,%i5
	inc	4,%i0
	b	L77008
	st	%o1,[%i5]

	.....

			  !  the edited switch
L77008:
	mov	%i0,%i4
	ld	[%i4],%i4
	inc	4,%i0
	jmp	%i4
	nop
          		  !  remove TABLE

-------------------------------------------

For another example on an Intel 8088 

stack_ptr is si
code_ptr  is di

   ;	  while ( 1 )
   ;	  {
   ;	    switch( *code_ptr++ )
   ;	
@1@50:
	mov	bx,di
	inc	di
	inc	di
	mov	bx,word ptr [bx]
	cmp	bx,3
	ja	short @1@50
	shl	bx,1
	jmp	word ptr cs:@1@C738[bx]


@1@122:
   ;	
   ;	      case PUSH   :
   ;		    * ++ stack_ptr = *code_ptr++ ;
   ;	
	inc	si
	inc	si
	mov	ax,word ptr [di]
	mov	word ptr [si],ax
	inc	di
	inc	di
   ;	
   ;		    break ;
   ;	
	jmp	short @1@50   
   ;	

	....

@1@C738	label	word       ;  jump TABLE
	dw	@1@194     ;  HALT
	dw	@1@122     ;  PUSH   etc
	dw	@1@146

	....

------------------------------------------------

edited the jump can be computed inline

   ;	  while ( 1 )
   ;	  {
   ;	    switch( *code_ptr++ )
   ;	
@1@50:      ; switch code is replaced by code only executed once
   
	inc	di
	inc	di
	jmp     [di-2]

	.....

_push :
@1@122:
   ;	
   ;	      case PUSH   :
   ;		    * ++ stack_ptr = *code_ptr++ ;
   ;	
	inc	si
	inc	si
	mov	ax,word ptr [di]
	mov	word ptr [si],ax
	inc	di
	inc	di
   ;	
   ;		    break ;
   ;	     
	inc	di         ;  jmp  to *code_ptr++  inline
	inc	di
	jmp	[di-2]
   ;	
	....

----------------------------------------------

the "front end" has defines

typedef  void (*TCODE)() ;

extern  void  halt(), push(), add(), sub(), print() ;

TCODE  code[CODESIZE] ;

in the array code[], the front end compiles


push, 2, push, 3, push, 4, sub, add, print, halt

and calls execute(code).


--
Mike Brennan
brennan@bcsaic.boeing.com
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.