eliot@cs.qmw.ac.uk (Eliot Miranda) (04/05/91)
I recently posted articles to comp.compilers & alt.sources on how to write threaded code machines in C. I received the following questions from Simon Peyton Jones at Glasgow. They are specific to GCC. Since they have non-obvious answers & since the answers suggest augmentation of the GCC compiler I'm posting my response to Simon. >From: Simon L Peyton Jones <simonpj@cs.gla.ac.uk> > >I saw your article about threaded code. Like you and others, we are >using C as an assembler, only for a pure functional language, Haskell. >I have some brief questions. > >1. By telling GCC not to use a frame pointer, one can eliminate >the prolog entirely, can't one? So why edit it out? No, registers local to the procedure will still be saved & stack space allocated for automatic variables. This IS a problem since the threaded-code jump at the end of the procedure will miss the register restores before the epilogue. Consequently the stack will grow unless these register saves & stack-space allocations are removed. Also GCC can not always eliminate the frame pointer. >I guess the answer is going to be local variables, allocated once for >all by the StartTCMachine routine. Still, it seems quite a pain. I guess >one could sacrifice some (perhaps slight) speed by using a bunch of >globals instead. For certain routines, not using register variables will be expensive (e.g. a simple integer arithmetic primitive). >2. You edit out the epilogue for tidiness only, I take it. It doesn't >cause any problems if you leave it in, does it? No, but given that the prologue has to be removed & removing the epilogue is as easy (& given finite memory :-) one might as well remove it. > >3. Why do you use no-defer-pop (with the associated bug)? OK. This is again to avoid stack growth. On conventional stack architectures gcc will try & combine the argument popping code of a sequence of procedure calls. e.g. extern long a, b, c; void doit() { foo(a); bar(b); baz(c); } with -O -no-defer-pop one might expect gcc to generate link %a6,#0 movel a,%sp@- jbsr foo addqw #4,%sp movel b,%sp@- jbsr bar addqw #4,%sp movel c,%sp@- jbsr baz addqw #4,%sp unlk %a6 rts but because gcc knows that the unlk instruction will roll back the stack in fact gcc generates: link %a6,#0 movel a,%sp@- jbsr foo addqw #4,%sp movel b,%sp@- jbsr bar addqw #4,%sp movel c,%sp@- jbsr baz unlk %a6 rts With -O -fdefer-pop gcc optimizes out the pops completely & generates: link %a6,#0 movel a,%sp@- jbsr foo movel b,%sp@- jbsr bar movel c,%sp@- jbsr baz unlk %a6 rts with -O -fomit-frame-pointer -fdefer-pop gcc generates: movel a,%sp@- jbsr foo movel b,%sp@- jbsr bar movel c,%sp@- jbsr baz addw #12,%sp rts & with -O -fomit-frame-pointer -fno-defer-pop gcc generates: movel a,%sp@- jbsr foo addqw #4,%sp movel b,%sp@- jbsr bar addqw #4,%sp movel c,%sp@- jbsr baz addqw #4,%sp rts All the above cases are as one would wish. The elimination of all defered pops in the unlk instruction is especially clever. However, in the presence of the threaded-code jump the waters darken! Consider what gcc generates for: register void (**tcip)() asm("%a5"); #define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");return;}while(0) extern long a, b; void doit() { foo(a); bar(b); JUMPNEXT; } with -O -fdefer-pop gcc generates doit: link %a6,#0 movel a,%sp@- jbsr foo movel b,%sp@- jbsr bar #APP movl %a5@+,%a0; jmp %a0@ #NO_APP unlk %a6 rts This is clearly no good because the arguments to both foo & bar will never be popped. Every time doit() is executed the stack will grow by 8 bytes. Soon your program will dump core with a very large stack segment! with -O -fno-defer-pop gcc generates: link %a6,#0 movel a,%sp@- jbsr foo addqw #4,%sp movel b,%sp@- jbsr bar #APP movl %a5@+,%a0; jmp %a0@ #NO_APP unlk %a6 rts Again useless because bar's pop has been folded into the unlk which won't be executed. with -O -fdefer-pop -fomit-frame-pointer gcc generates movel a,%sp@- jbsr foo movel b,%sp@- jbsr bar addqw #8,%sp #APP movl %a5@+,%a0; jmp %a0@ #NO_APP rts This is great. However, not all functions are susceptible to the omit-frame-pointer optimization (i.e. functions with local variables). E.g. the code generated for: register void (**tcip)() asm("%a5"); #define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");return;}while(0) extern long a, b; void doit() { char large[1024]; foo(a,large); bar(b); JUMPNEXT; } with -O -fomit-frame-pointer -fdefer-pop is: link %a6,#-1024 pea %a6@(-1024) movel a,%sp@- jbsr foo movel b,%sp@- jbsr bar #APP movl %a5@+,%a0; jmp %a0@ #NO_APP unlk %a6 rts so in general one can't rely on -fomit-frame-pointer. For the above example both -O -fomit-frame-pointer -fno-defer-pop and -O -fno-defer-pop generate: doit: link %a6,#-1024 pea %a6@(-1024) movel a,%sp@- jbsr foo addqw #8,%sp movel b,%sp@- jbsr bar #APP movl %a5@+,%a0; jmp %a0@ #NO_APP unlk %a6 rts This is also useless because bar's argument pop has been folded away. The problem is that gcc will always fold away the last call's argument pop if the function has a frame pointer, and -fomit-frame-pointer can't allways get rid of the frame pointer. In fact, in the presence of variable sized automatic variables or calls on alloca it would be very hard (impossible for recursive functions?) to do. The neatest solution I've come up with is to use -fno-defer-pop and a dummy function call between the threaded-code jump and the return: register void (**tcip)() asm("%a5"); #define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");SBD();return;}while(0) extern long a, b; void doit() { foo(a); bar(b); JUMPNEXT; } with -O -fno-defer-pop gcc generates: link %a6,#0 movel a,%sp@- jbsr foo addqw #4,%sp movel b,%sp@- jbsr bar addqw #4,%sp #APP movl %a5@+,%a0; jmp %a0@ #NO_APP jbsr SBD unlk %a6 rts Now bar's argument pop is not folded because its no longer the last call in the routine, SBD is. So the call to SBD a) prevents gcc's 'last call argument pop fold into unlk' optimization which prevents uncontrolled stack growth. b) doesn't get executed because of the jump c) is trivial to remove from the assembler with a sed-script One an try to use -fcaller-saves, but this surrounds calls with unnecessary register saves & restores that for the code to be optimal have to be edited out. >4. Why does JUMPNEXT have a loop? Surely the jump leaves the loop right >away. Presumably you are tricking the compiler somehow. This is old C lore. The problem is 'How do you write a macro that is a sequence of statements that can be used wherever a single statement can?' take the following definition of JUMPNEXT: #define JUMPNEXT asm("movl %a5@+,%a0; jmp %a0@");return; Now invoke it here: if (its_time_to_jump) JUMPNEXT; do_something_else(); This expands to: if (its_time_to_jump) asm("movl %a5@+,%a0; jmp %a0@"); return; do_something_else(); Not at all whats intended! There are two tricks I know of (the first I saw in Berkely Smalltalk, the second in Richard Stallman's gcc manual. I expect they're both quite old). The first is to surround your statements with if (TRUE){statements}else i.e. #define JUMPNEXT if(1){asm("movl %a5@+,%a0; jmp %a0@");return;}else So now we get: if (its_time_to_jump) if (1){ asm("movl %a5@+,%a0; jmp %a0@"); return; else; do_something_else(); which works because C binds elses innermost first. However, some compilers will whine about dangling elses. The second scheme is more elegant (-: Surround your statements with do{statements}while(FALSE); which will execute statements precisely once (its NOT a loop). i.e. #define JUMPNEXT do{asm("movl %a5@+,%a0; jmp %a0@");SBD();return;}while(0) expands to if (its_time_to_jump) do { asm("movl %a5@+,%a0; jmp %a0@"); return; while(0); do_something_else(); which does what's wanted and doesn't incur compiler whines. >Thanks > >Simon L Peyton Jones, Glasgow University More and more people are taking the 'use C as an assembler' route, and more and more people are using GCC to do it (because its code quality is good, it had global register variables, and it has an excellent asm facility). The threaded-code in C idea is also becomming more popular. But as the code above demonstrates, one does have to side-step optimizations and develop system-specific assembler editing scripts. I'd like to ask Richard Stallman & the GCC development team for -fno-prolog -fno-epilog flags that instruct gcc to generate a) no register saves or restores b) no automatic variable allocation c) no procedure linkage/frame creation Then the optimal 'Threaded-Code Machine in GCC C' can be compiled without any assembler editing scripts at all. Also nice would be a way of telling GCC that an asm statement changed the flow of control so GCC could a) warn about not-reached code b) eliminate unnecessary code (do more code folding) -- Eliot Miranda email: eliot@cs.qmw.ac.uk Dept of Computer Science Tel: 071 975 5229 (+44 71 975 5229) Queen Mary Westfield College ARPA: eliot%cs.qmw.ac.uk@nsf.ac.uk Mile End Road UUCP: eliot@qmw-cs.uucp LONDON E1 4NS -- Send compilers articles to compilers@iecc.cambridge.ma.us or {ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.