eliot@.cs.qmw.ac.uk (Eliot Miranda) (03/28/91)
Various people have asked me for details on how I do threaded code in my
dynamic translation Smalltalk-80 VM. So here's the gory details as well as
the first published benchmarks for the system.

How to do "Machine-Independent" Fast Direct Threaded Code:

First off, use C (although other flexible machine-oriented imperative
languages would probably work too).

Global registers:
If you can use GCC >v1.33 you can use global register variables to hold the
threaded code machine's registers. If you have various forms of stupid C
compiler then you can get global register variables by declaring your globals
as register variables in every function, and later editing the assembler
generated by the C compiler to remove global register saves & restores
(details in [Miranda]).

Threaded Code:
Threaded code instructions (TCIs) are written as C procedures. They are
compiled to assembler by the C compiler. Subsequently a simple editor script
is run on the assembler to remove the prologues and epilogues from the
threaded code procedures, and possibly to insert direct threaded code jumps.

How to Identify Threaded Code Instructions:
The method I prefer is to segregate TCIs from other procedures & functions in
the machine by placing related groups of TCIs in separate source files. I
give my threaded code source files the .tc extension, and they have a rule in
the makefile that will run the editor script on the assembler. An alternative
is to identify each threaded code procedure with a special prefix which is
spotted by the editor script. This is probably more error prone & doesn't fit
well with leaf-routine optimization (see below).

How to Write Threaded Code Instructions:
Each TCI is written as a void function of no arguments. It is wise to start
and end each TCI with two special macros to replace '{' and '}'. So, using
GCC on the SPARC, given some declarations:

    typedef void (*TCODE)();    /* a TCODE is a pointer to a function */
    typedef ???? ObjectPointer;
    . . .
    register TCODE *tcip asm("%g7");    /* threaded code instruction pointer */
    register ObjectPointer *stackPointer asm("%g5");

e.g. popStack would be written:

    void popStack()
    TBEGIN
        stackPointer--;
    TEND

With GCC, TBEGIN is

    #define TBEGIN {

With stupid C compilers it can be defined to be the list of global register
variables. Further, if you want to build a debugger for your threaded code
machine you could compile the system with

    #define TBEGIN { int frig = checkForBreakPoint();

and ignore lots of warnings about variable frig being unused :-).

TEND has to do a direct threaded code jump. In my system I want an indirect
post-increment jump on tcip, i.e. jump to *tcip++. On the SPARC with tcip in
%g7 the jump is

    ld   [%g7],%o0    ! get *tcip
    jmpl %o0,%g0      ! jump to it
    add  %g7,4,%g7    ! increment tcip in the jump's delay slot

On the 68k with tcip in a5 the jump is

    movl a5@+,a0
    jmp  a0@

With GCC this is implemented by the JUMPNEXT macro. On the SPARC:

    #define JUMPNEXT do{ \
        asm("ld [%g7],%o0; jmpl %o0,%g0; add %g7,4,%g7");\
        return;}while(0)

Note the return; it tells the compiler that control does not pass this point.
On the 68k:

    /* SBD = Silent But Deadly = Stack Bug Dummy. gcc has a bug with
       no-defer-pop: it always defers the pop of the last function call in a
       routine. SBD is a dummy call to ensure no other previous call gets its
       pop deferred. */
    extern void SBD P((void));

    #define JUMPNEXT do{ \
        asm("movl a5@+,a0; jmp a0@");\
        SBD();return;}while(0)

SBD is then removed by the editor script.

So TEND is defined to be

    #define TEND JUMPNEXT; }

On the SPARC popStack is expanded to

    void popStack()
    {
        stackPointer--;
        do{asm("ld [%g7],%o0; jmpl %o0,%g0; add %g7,4,%g7");return;}while(0);
    }

It's compiled to:

    _popStack:
        !#PROLOGUE# 0
        save %sp,-80,%sp
        !#PROLOGUE# 1
        add %g5,-4,%g5
        ld [%g7],%o0; jmpl %o0,%g0; add %g7,4,%g7
        ret
        restore

The editor script then reduces this to:

    _popStack:
        ! [gotcher]
        add %g5,-4,%g5
        ld [%g7],%o0; jmpl %o0,%g0; add %g7,4,%g7

On the 68k you end up with:

    .globl _popStack
    _popStack:
        subqw #4,a3
        movl a5@+,a0; jmp a0@
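As a concrete (if slow) illustration of the scheme so far, here is a minimal
portable sketch that skips the assembler editing and uses the plain-return
fallback described further down (JUMPNEXT as an ordinary return to a dispatch
loop). TCODE, TBEGIN/TEND/JUMPNEXT and popStack follow the article; pushOne,
haltTC, method and run are invented for the example and are not from the
BrouHaHa sources:

    /* Portable sketch: TCIs as ordinary C functions, dispatched from a loop
       rather than by edited direct-threaded jumps. */
    typedef void (*TCODE)();            /* a TCODE is a pointer to a function */
    typedef long ObjectPointer;         /* stand-in for a real object pointer */

    TCODE *tcip;                        /* threaded code instruction pointer */
    ObjectPointer stack[32];
    ObjectPointer *stackPointer = stack;

    #define JUMPNEXT return             /* slow fallback: back to the loop */
    #define TBEGIN {
    #define TEND JUMPNEXT; }

    void pushOne() TBEGIN *++stackPointer = 1; TEND   /* illustrative TCI */
    void popStack() TBEGIN stackPointer--; TEND
    void haltTC() TBEGIN tcip = 0; TEND               /* stops the loop below */

    TCODE method[] = { pushOne, popStack, haltTC };   /* a tiny "method" */

    void run()
    {
        tcip = method;
        while (tcip != 0)       /* the fast version never comes back here */
            (*tcip++)();
    }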
Global Register Variables and Stupid C Compilers:
Some C compilers are stupid enough to provide straightforward global
registers. A C compiler on a nat-semi based system I used just allocated
registers in the order they were declared. The assembler syntax was very
simple to edit. Global register variables could thus be obtained easily.

Some C compilers are stupid but think they're clever. Sun's SUN3 compiler is
a case in point. It also allocates registers in the order declared. However,
it tries to be clever by removing 'dead assignments' (assignments to
subsequently unused register variables). These compilers are easy to fool.
Simply arrange that the register variables are always used before leaving a
function. I do this by always writing RETURN or RETURNV(e) instead of return
and return e. On systems with such stupid C compilers RETURNV(e) is defined
thus:

    #define RETURNV(e) do{DummyUseRegs(GR1,GR2,GR3); return e;}while(0)

The call on DummyUseRegs fools the compiler into thinking the registers are
live & hence saves assignments to them. The editor scripts can then remove
calls on DummyUseRegs.

Of course on systems with marginally clever C compilers (SUN4, HP-UX etc)
you're stuffed. However, in clever C compilers like GCC and Acorn's C
compiler you can declare global registers & everything is clean & wholesome
:-).

Conditional TCODE Jumps:
Say we wanted a conditional tcode jump. This might be written:

    void skipIfTrue()
    TBEGIN
        if (*stackPointer-- == TrueObject) {
            tcip += 1;
            JUMPNEXT;
        }
    TEND

How this All Works:
With the above scheme, each threaded code procedure runs in the same C stack
frame, and jumps directly to the next procedure, eliminating an unnecessary
<epilogue, return>, <call, prologue> pair. Once we establish a stack frame
and call the first function, away we go. Assuming that you've produced your
first threaded code method (a sequence of pointers to threaded code
procedures), and that tcip points at the start, then StartTCMachine might be
defined as follows:

    volatile void StartTCMachine()
    {
        char *enoughSpaceForAllTCIStackFrames;

        enoughSpaceForAllTCIStackFrames = alloca(1024);

        while(1) { (*tcip++)(); }
    }

The alloca allocates space on the stack. After the first (*tcip++)() control
goes off into threaded code land and never returns.

Leaf Routine Optimization:
The threaded code machine will make calls on support routines, e.g. graphics,
garbage collector etc. Any group of routines that don't access the global
registers and don't directly or indirectly call routines that need to access
the global registers can be optimized. These routines should be compiled
without declaring the global registers. These routines will then use as many
registers as the compiler will give them and will save & restore any
registers they use, preserving the values of the global register variables.

Details of my Smalltalk Threaded Code Machine:
I use a pair of words for each TCI, a pointer to the procedure followed by an
optional operand. This avoids going out of line to access arguments. e.g.
pushLiteral is:

    void pushLit()
    TBEGIN
        *++stackPointer = (OOP)*tcip++;
    TEND

where OOP is an ordinary object pointer. So on entry to pushLit we have:

            <pointer to pushLit>
    tcip->  <object pointer>
            <pointer to next TCI>
            <next TCI's operand>

and popStack must therefore be written

    void popStack()
    TBEGIN
        stackPointer--;
        tcip++;
    TEND
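To make the word-pair layout concrete, here is a hedged sketch in the same
slow portable style as the earlier fallback example. pushLit and popStack are
as just defined; the literal 42, the method array and runMethod are invented
for the example, and storing an operand word in a TCODE slot takes the usual
cast liberties such a VM relies on:

    typedef void (*TCODE)();
    typedef long OOP;                   /* ordinary object pointer */

    TCODE *tcip;
    OOP stack[32], *stackPointer = stack;

    #define TBEGIN {
    #define TEND return; }              /* slow fallback again */

    void pushLit() TBEGIN *++stackPointer = (OOP)*tcip++; TEND
    void popStack() TBEGIN stackPointer--; tcip++; TEND
    void haltTC() TBEGIN tcip = 0; TEND /* never reads its operand word */

    TCODE method[6];

    void runMethod()
    {
        method[0] = pushLit;  method[1] = (TCODE)(OOP)42;  /* <TCI, operand> pair */
        method[2] = popStack; method[3] = 0;               /* operand unused */
        method[4] = haltTC;   method[5] = 0;

        tcip = method;
        while (tcip != 0)
            (*tcip++)();
    }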
I dynamically compile Smalltalk-80 bytecodes to threaded code. I use 128k
bytes of memory to hold all threaded code. This 'tspace' is periodically
scavenged to reclaim space. The architecture is similar to
[DeutschSchiffman]. Using an eighth of the space used by the
Deutsch-Schiffman machine I get around 75% of the performance on the
non-graphics benchmarks.

Here are the Smalltalk macro benchmarks for BrouHaHa Smalltalk-80 v2.3.2t
running on a monochrome SUN3/60 (20MHz 68020):

    BitBLT                   76.7308
    TextScanning            222.857
    ClassOrganizer           80.6667
    PrintDefinition          59.0278
    PrintHierachy           142.857
    AllCallsOn              112.5
    AllImplementors         130.0
    Inspect                 116.667
    Compiler                 86.4341
    Decompiler              101.639
    KeyboardLookAhead       212.5
    KeyboardSingle          302.778
    TextDisplay             148.837
    TextFormatting          273.81
    TextEditing             180.342

    Performance Rating      134.198

and on the same machine under the same conditions are the timings for
ParcPlace Smalltalk-80 V2.3:

    BitBLT                   99.75
    TextScanning            390.0
    ClassOrganizer          155.128
    PrintDefinition         137.097
    PrintHierachy           192.308
    AllCallsOn              120.0
    AllImplementors         108.333
    Inspect                 146.774
    Compiler                118.617
    Decompiler              129.167
    KeyboardLookAhead       303.571
    KeyboardSingle          473.913
    TextDisplay             172.973
    TextFormatting          442.308
    TextEditing             285.135

    Performance Rating      189.504

    134.198/189.504 = 0.708154

WARNING!! These systems ARE different; these benchmarks are included only to
give a feeling for ball-park performance. Example differences:

    BrouHaHa                            ParcPlace
    closures                            blue-book BlockContexts
    immediates: characters,             immediate smallintegers
      smallints, fixedpoints
    5844 compiled methods               5136 compiled methods
      (5026 ordinary methods)             (4798 ordinary methods)
      (818 quick methods)                 (338 quick methods)

More Portable File Organization:
To keep the code as clean looking as possible all machine dependencies are
isolated in separate files. e.g. tcode.h gives machine-independent
definitions for TCODE. It includes machine dependencies from another file:

    /* for debugging purposes; single step breakpoint at start of each tcode */
    #define DEBUG_FETCH_BREAK int frig = fetchBrk();

    #ifdef FAST
    #   include "fasttcode.h"
    #else

    TCODE *tcip;            /* the tcode ip points at TCODEs */

    #   define JUMPNEXT return

    #   ifdef BIDebug
    #       define TBEGIN { DEBUG_FETCH_BREAK
    #   else
    #       define TBEGIN {
    #   endif
    #   define TEND }
    #endif

GCC/SPARC/fasttcode.h:

    /* tcodeip in g7 */
    register TCODE *tcip asm("%g7");

    #define JUMPNEXT do{asm("ld [%g7],%o0; jmpl %o0,%g0; add %g7,4,%g7");return;}while(0)

    #ifdef BIDebug
    #   define TBEGIN { DEBUG_FETCH_BREAK
    #else
    #   define TBEGIN {
    #endif
    #define TEND JUMPNEXT; }

I also don't want to include the necessary definitions for the global
registers in every file. So for those non-leaf routines that must avoid using
the global registers there's a fastglobal.h file that gives dummy definitions
for these registers. e.g. GCC/SPARC/fastglobal.h:
    /* machine specific FAST defines.
       Gnu C 1.33 systems can use nice compiler provided global registers. */

    #define BEGIN {
    #define END }
    #define RETURN(e) return e
    #define RETURNV return

    register char *GlobRegDummy1 asm("a5");
    register char *GlobRegDummy2 asm("a4");
    register char *GlobRegDummy3 asm("a3");
    register char *GlobRegDummy4 asm("d6");

    #ifdef MASKREGISTER
    register char *GlobRegDummy5 asm("d7");
    #endif

I use symbolic links to set up the machine dependent include files. This has
the advantage that if you add a new machine you don't have to remake all the
others.

The Tedious Bit:
The only tedious bit is writing the sed scripts. For the SPARC this took 1
day. Here are the sed scripts I use for SUN3, MAC2AUX (using GAS) and SUN4,
all using GCC (v1.33 upwards). There's a problem on the SPARC in that the ABI
does not seem to define the status of the global registers. Some math and
library routines stomp on the global registers (beware getwd!), so I've
included GCC/SUN4/sed.globreg.bugfix as an example of how to spot the
offending math routines.

SUN3/GCC/lib/sed.tcode.opt:

    # script to strip prolog & epilog from threaded code under gcc.
    # WARNING the script may strip a push of a register argument if a call is the
    # first statement of a function!!
    #
    /^_.*:$/{n
    N
    N
    s/ link a6,#[^\n]*\n//
    / fmovem #[^\n]*,sp@-/{
    N
    s/ fmovem #[^\n]*,sp@-\n//
    }
    s/ moveml .*,sp@-\n//
    s/ movel [ad][0-7],sp@-\n//
    }
    / moveml a6@(-[1-9][0-9]*),#/{N
    s/ moveml a6@(-[1-9][0-9]*),#[^\n]*\n unlk a6//
    }
    / movel a6@(-[1-9][0-9]*),[ad][0-7]/{N
    s/ movel a6@(-[1-9][0-9]*),[ad][0-7]\n unlk a6//
    }
    / movel sp@+,/d
    / moveml sp@+,#/d
    / unlk a6/d
    / rts/d
    / jbsr _SBD/d

MAC2AUX/GCC/lib.gas/sed.tcode.opt:

    /COMMENT/{
    i\
    script to strip prolog & epilog from threaded code under gcc. WARNING \
    the script may strip a push of a register argument if a call is the\
    first statement of a function!!
    }
    /^gcc_compiled:/d
    /^.[^%].*:$/{
    n
    / link %a6/{
    N
    N
    s/ link %a6,#[x0-9-]*\n//
    / fmovem #[^\n]*,%sp@-/{
    N
    s/ fmovem #[^\n]*,%sp@-\n//
    }
    s/ moveml #[x0-9a-f]*,%sp@-\n//
    s/ movel %[ad][0-7],%sp@-\n//
    n
    }
    }
    / moveml -[1-9][0-9]*%a6@,#/{
    N
    s/ moveml -[1-9][0-9]*%a6@,#[x0-9a-f-]*\n unlk %a6//
    }
    / movel -[1-9][0-9]*%a6@,%[ad][0-7]/{
    N
    s/ movel -[1-9][0-9]*%a6@,%[ad][0-7]\n unlk %a6//
    }
    / movel %sp@+,%/d
    / moveml %sp@+,#/d
    / movel %d0,%a0/{
    N
    s/ movel %d0,%a0\n unlk %a6//
    / movem*l %a6/{
    N
    s/ movel %d0,%a0\n movem*l %a6.*\n unlk %a6//
    / fmovem %a6/{
    N
    s/ movel %d0,%a0\n movem*l %a6.*\n fmovem %a6.*\n unlk %a6//
    }
    }
    }
    / unlk %a6/d
    / rts/d
    / jbsr SBD/d

SUN4/GCC/lib/sed.tcode.opt:

    # script to strip prolog & epilog from threaded code under gcc.
    #
    /^_.*:$/{n
    N
    N
    s/ !#PROLOGUE# 0\n save %sp,[-0-9]*,%sp\n !#PROLOGUE# 1/ ! [gotcher]/
    }
    / ret/d
    / restore/d

SUN4/GCC/lib/sed.globreg.bugfix:

    # Some of the libc builtin routines (.rem .urem .div & .udiv so far known)
    # stamp on %g3 which is the maskReg (it contains 0x7FFFFF).
    # This script reassigns the value of maskReg after each of these routines
    # has been called.
    /call[ ]\.div,[0-9]/{n
    n
    i\
    sethi %hi(0x7FFFFF),%g3 ![globregfix]\
    or %lo(0x7FFFFF),%g3,%g3
    }
    /call[ ]\.udiv,[0-9]/{n
    n
    i\
    sethi %hi(0x7FFFFF),%g3 ![globregfix]\
    or %lo(0x7FFFFF),%g3,%g3
    }
    /call[ ]\.rem,[0-9]/{n
    n
    i\
    sethi %hi(0x7FFFFF),%g3 ![globregfix]\
    or %lo(0x7FFFFF),%g3,%g3
    }
    /call[ ]\.urem,[0-9]/{n
    n
    i\
    sethi %hi(0x7FFFFF),%g3 ![globregfix]\
    or %lo(0x7FFFFF),%g3,%g3
    }

You can now see why I put "Machine-Independent" in quotes.
Here's the count of machine dependent code for the SPARC:

      25      99     786   fastcache.h
      68     262    1882   fastglobal.h
      31     112     906   fasttcode.h
      28      80     595   ../tcsrc/SUN4/GCC/lib/sed.globreg.bugfix
       5      34     222   ../tcsrc/SUN4/GCC/lib/sed.peep.opt
       9      30     173   ../tcsrc/SUN4/GCC/lib/sed.tcode.opt
     166     617    4564   total

Of these 166 lines, 51 lines are banner headers. 100-odd lines are machine
dependent. A whole VM is around the 20k-line mark, so machine dependencies
are down in the 0.5% range.

Use this stuff as part of whatever you like. If you try & assert ownership
I'll fight & batter you over the head with the GPL ('bout time we had some
serious steel in that thar GPL). Share And Enjoy!

P.S. The BrouHaHa machine is available to educational institutions with a
valid ParcPlace Smalltalk-80 licence, subject to a strict non-disclosure
agreement. Email me if you want it. I am slow to answer requests!
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
crowl@cs.rochester.edu (Lawrence Crowl) (04/01/91)
In article <3035@redstar.cs.qmw.ac.uk> Eliot Miranda <eliot@.cs.qmw.ac.uk> writes:
>The threaded code machine will make calls on support routines, e.g. graphics,
>garbage collector etc. Any group of routines that don't access the global
>registers and don't directly or indirectly call routines that need to access
>the global registers can be optimized. These routines should be compiled
>without declaring the global registers. These routines will then use as many
>registers as the compiler will give them and will save & restore any
>registers they use, preserving the values of the global register variables.

This scheme assumes that procedure calls use a "callee saves" register
convention, and does not work if you allocate the global register variables
out of the "caller saves" set of registers. The problem is that the caller
does not save the register (because it is global) and the callee writes over
the register (because it assumes the caller has saved anything it cares
about). In this situation, the programmer must insert explicit saves and
restores of the global register variables.

The obvious solution to this problem is to allocate all global register
variables out of the "callee saves" set of registers. However, the Alliant
has _no_ callee saves registers. Library routines on the Alliant will trash
every register you have. In addition, implicit library calls to routines like
bcopy will also trash your registers. (I learned this the hard way.)

The lesson is that calling library routines with global register variables in
caller saves registers requires special handling. It is not automagic.
--
Lawrence Crowl                 716-275-9499    University of Rochester
crowl@cs.rochester.edu                         Computer Science Department
...!rutgers!rochester!crowl                    Rochester, New York, 14627-0226
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
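For concreteness, the special handling Crowl describes might look something
like this sketch, which reuses the GCC global register syntax and the %g3
maskReg from Miranda's article; memcpy merely stands in for whatever library
routine does the clobbering on a given system:

    #include <string.h>

    /* A global register variable living in a register that the ABI (or a
       careless library routine) treats as fair game must be saved and
       restored by hand around library calls; nothing does it for you. */
    register long maskReg asm("%g3");   /* machine & compiler specific */

    void copyPreservingMaskReg(char *dst, char *src, int n)
    {
        long saved = maskReg;           /* explicit save */
        memcpy(dst, src, n);            /* the call may stomp on %g3 */
        maskReg = saved;                /* explicit restore */
    }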
Tom.Lane@G.GP.CS.CMU.EDU (04/01/91)
Lawrence Crowl points out one important problem with Eliot Miranda's scheme
for global register use in C. There's an even more fundamental problem,
though: there is *no guarantee whatever* that the compiler will assign the
same registers to the global variables in every routine. Compilers that do
intelligent allocation of variables to registers may refuse to honor the
"register" declarations at all if the global variables are not heavily used
in a given routine, and in any case the globals need not be assigned to the
same registers every time.

Miranda's scheme thus relies heavily on the assumption of a dumb register
allocator (or else a compiler that supports global register variable
declarations). This scheme may be "fast" direct threaded code, but it's
hardly "portable".
--
                tom lane
Internet: tgl@cs.cmu.edu    BITNET: tgl%cs.cmu.edu@cmuccvma
[GCC lets you specify what register to use for global register variables, but
that is of course both machine and compiler specific. -John]
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
metzger@watson.ibm.com (Perry E. Metzger) (04/02/91)
In article <3035@redstar.cs.qmw.ac.uk> Eliot Miranda <eliot@.cs.qmw.ac.uk> writes:
>Various people have asked me for details on how I do threaded code
>in my dynamic translation Smalltalk-80 VM. So here's the gory details
>as well as the first published benchmarks for the system.

Thanks for a really neat article. This revives a long standing curiosity I
have had about threaded code interpreters. I've never really seen a good
reference paper on the subject; most of what I have learned has been through
"back door" sources. Are there any good papers on threaded code systems? Do
any books have good information on them?

Thanks much,
Perry Metzger
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
pardo@cs.washington.edu (David Keppel) (04/03/91)
metzger@watson.ibm.com (Perry E. Metzger) writes:
>[I'd like a reference on threaded code interpreters.]

3 citations follow:

%A James R. Bell
%T Threaded Code
%J Communications of the ACM (CACM)
%V 16
%N 6
%D June 1973
%P 370-372
%X Describes the basic idea of threaded code.
Compares to hard code (subroutine calls) and interpreters.

%A Richard H. Eckhouse Jr.
%A L. Robert Morris
%T Minicomputer Systems Organization, Programming, and Applications (PDP-11). 2nd Ed.
%I Prentice-Hall, Inc.
%P 290-298
%X Describes threaded code and ``knotted code''. I (pardo) think that this
is a very clear introduction to threaded code.

%A Peter M. Kogge
%T An Architectural Trail to Threaded Code Systems
%J IEEE Computer
%P 22-33
%D March 1982
%W rrh (original)
%W Pardo (copy)
%X Describes how to build a threaded code interpreter/compiler from scratch.
 * Direct threaded/indirect threaded.
 * Less than 2:1 performance hit of threading compared to full compilation.
 * Note about bad compilers contributing to relative OK-ness of threaded code.
 * Ease of rewriting stuff.
 * Ease of tuning.

My favorite of the three is Eckhouse & Morris; however I don't know where to
get it. The pages that I have are photocopies graciously sent to me by a
friend. As the title implies, this book is several years old and undoubtedly
out-of-print.

        ;-D on  ( Following this thread of the discussion... )  Pardo
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
vestal@SRC.Honeywell.COM (Steve Vestal) (04/04/91)
In article <1991Apr2.192125.7464@beaver.cs.washington.edu> pardo@cs.washington.edu (David Keppel) writes:
[references about threaded code, much stuff deleted]
David> %X Describes how to build a threaded code interpreter/compiler from
David> scratch.
David> * Less than 2:1 performance hit of threading compared to full
David> compilation.
I have a question about this. Numbers like this are often cited for
threaded-type code, but in Bell's paper this was for the PDP-11 (whose
addressing modes made it a natural for threaded code). Paul Klint's
"Interpretation Techniques" paper (Software P&E, v11, 1981) cites a
significant difference for interpreter fetch/decode times on different
architectures. He cited numbers around 2:1 for the PDP-11, but something
more like 9:1 for a Cyber. I did a Q&D evaluation of this for a RISC, and
the ratio I guesstimated was closer to that Klint gave for the Cyber than
for the PDP-11 (not unexpectedly).
How architecturally dependent is the performance of these techniques
(relative to compiling to native code)?
Steve Vestal
Mail: Honeywell S&RC MN65-2100, 3660 Technology Drive, Minneapolis MN 55418
Phone: (612) 782-7049 Internet: vestal@src.honeywell.com
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
firth@sei.cmu.edu (Robert Firth) (04/04/91)
In article <1991Apr3.182334.16164@src.honeywell.com> vestal@SRC.Honeywell.COM (Steve Vestal) writes:
>How architecturally dependent is the performance of these techniques
>(relative to compiling to native code)?
[cost of threaded code on PDP-11, RISC &c]

We might have a misunderstanding here, because what I think of as threaded
code doesn't have a decoding and interpretation step. But I'll talk of what I
know.

A program in threaded code is just an array of addresses, possibly
interspersed with operands. So the fragment

    c := a + b

becomes something like

    address of 'load'
    address of 'a'
    address of 'load'
    address of 'b'
    address of '+'
    address of 'store'
    address of 'c'

This implies a very simple virtual stack machine - you can get more clever by
implementing a virtual register machine.

The basic execution thread then does this. We point a global register at the
table of addresses, and each primitive has the form

    treg := treg + address'size
    ...
    jump (treg)

As you can see, this is great on the PDP-11, since that reduces to one
instruction

    MOV (treg)+,PC    ; NOTE TO MAINTAINER: FASTER THAN JMP - DON'T TOUCH!

On a typical RISC machine, it's two cycles, since you can put almost anything
in the delay slot(s) after the jump.

Now, the load instruction, for instance, would say

    load:  treg := treg + address'size
           load (treg) into tempreg
           treg := treg + address'size
           push (tempreg) onto simulated stack
           jump (treg)

On the PDP-11, that's

    MOV  @(treg)+, -(SP)
    MOV  (treg)+, PC

On a RISC, it's much more like

    L    R0, 4(treg)    ; get operand address
    L    R0, 0(R0)      ; dereference to get operand
    SUBI SP, #4         ; decrement simulated SP
    S    R0, 0(SP)      ; push operand on stack
    ADDI treg, #8       ; step over two addresses (mine & operand's)
    JR   (treg)         ; over to you, Moriarty!

    [to fill delay slots, shuffle the above to 132564]

Well, if you have one load delay slot and one branch delay slot, you can fill
all three of them, so that's 6 cycles. Given that a typical load is only 1.1
cycles in direct code (90% of the delays filled), this is certainly a lot
more than a 2:1 overhead! When you add the cost of a simulated stack (lots of
needless loads and stores), I can well believe 10:1 time expansion for simple
code.

In fact, it was more than that on the PDP-11, if you compared threaded code
with direct code from a decent compiler. The big win in the Fortran compiler
came from (a) very compact threaded code, and (b) the floating point
operations were implemented in software, so the overhead of threaded code was
swamped by the cost of floating addition, subtraction &c.

Here's the full code of the above example, so you can see for yourself:

    Direct:
        MOV  a, R0
        ADD  b, R0
        MOV  R0, c

    Threaded:
        MOV  @(treg)+, -(SP)
        MOV  (treg)+, PC
     *  MOV  @(treg)+, -(SP)
     *  MOV  (treg)+, PC
     *  ADD  (SP)+,(SP)
        MOV  (treg)+, PC
        MOV  (SP)+, @(treg)+
        MOV  (treg)+, PC

Note that, if you implement a one-address add, you save two instructions,
since the starred lines reduce to

        ADD  @(treg)+, (SP)

But even then, it's coming out at over 4:1.

What architectural features make threaded code more efficient? The
fundamental one is main memory that is fast (or not too slow) relative to
registers, since you're doing a lot more fetching. Another is a set of
address modes with double indirection, since you're accessing most operands
one level of indirection further back. And good old autoincrement helps a
little, too. Alas, none of that says 'risc', and much of it says '1960s'.
Incidentally, if I were to do this again today, I'd definitely simulate a
general-register machine and use a subset of the real machine's registers. If
you take 8 of them, you then have 8 loads and stores, one for each register,
but if you make an absolute rule that nobody even thinks about touching one
of those 8 that doesn't belong to him, then all the good tricks about
register allocation, slaving &c will still work. If you then implement the
operations as one-address general-register, you have again 8 versions (add
into R0, add into R1, ...) and lo! you're programming a very familiar old
friend.

"But that was in another country, and besides, the machine is dead..."

Robert Firth
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
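Firth's closing suggestion might look roughly like this in C, in the same
slow portable style as the fallback sketches earlier in the thread; the
register count, the primitive names and the code array are invented here, and
a real system would use edited direct-threaded jumps rather than a dispatch
loop. Each simulated register gets its own copy of each one-address
primitive, so the register number is encoded in the choice of primitive
rather than fetched as an operand:

    typedef void (*TCODE)();

    long a = 1, b = 2, c;           /* the variables of  c := a + b */
    long simR[8];                   /* the 8 simulated registers */
    void **tcip;                    /* slots hold TCIs or operand addresses */

    void loadR0()  { simR[0]  = *(long *)*tcip++; }  /* R0 := *operand */
    void addR0()   { simR[0] += *(long *)*tcip++; }  /* R0 := R0 + *operand */
    void storeR0() { *(long *)*tcip++ = simR[0];  }  /* *operand := R0 */
    /* ... loadR1, addR1, storeR1 and so on, one family per register */
    void haltRM()  { tcip = 0; }

    void *code[] = {                /* c := a + b, one-address style */
        (void *)loadR0,  &a,
        (void *)addR0,   &b,
        (void *)storeR0, &c,
        (void *)haltRM
    };

    void runRegisterMachine()
    {
        tcip = code;
        while (tcip != 0)
            (*(TCODE)*tcip++)();    /* slow dispatch loop */
    }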
pardo@cs.washington.edu (David Keppel) (04/05/91)
>>>[Threaded code vs. compilation]
>pardo@cs.washington.edu (David Keppel) writes:
>>[Less than 2:1 performance hit of threading vs. full compilation.]

Note also that the reference that claimed 2:1 (Peter M. Kogge, IEEE Computer
pp 22-33, March 1982) also attributed part of that ratio to the poor state of
compiler optimization.

vestal@SRC.Honeywell.COM (Steve Vestal) writes:
>[Klint says 2:1 for PDP-11 v. 9:1 for Cyber.
> How architecturally dependent are these techniques?]

Suppose that the statically-compiled code fragments that are threaded
together are called `primitives'.

When the execution time of a primitive is large, then the overhead for the
interpreter can be large and still have a small effect on performance. The
performance of the interpreted code is dominated by the time in a primitive
vs. the overhead of moving between primitives. When the execution time of the
primitives is small, then the overhead for moving between primitives can be a
large fraction of the total execution time.

Overhead comes from at least two sources:

 * Control flow: the address of the next primitive is loaded from data
   memory and the processor executes an indirect jump.

 * Register allocation: a primitive is essentially a function with a fast
   calling convention (no stack adjustment). Thus, all the traditional
   problems with interprocedural register allocation.

Examples of `large primitives' are ``draw circle'' and ``interpolate
spline''. Examples of small primitives are ``push'', ``add'', etc.

* Architectural dependency of control flow

Examples:

    Direct jumps in full compilation:

        op1
        op2
        br   next         // direct jump

    Indirect jumps for threading for a CISC:

        op1
        op2
        br   *(r0)+       // load, increment, jump

    Indirect jumps for threading for a RISC:

        ld   *r0, r1      // scheduled load
        op1
        op2
        br   *r1          // jump
        add  r1, #4, r1   // delay slot increment

Direct jumps in full compilation can frequently use one instruction (a
``near branch'') both to find the address of the next code fragment and
perform the control transfer. On a CISC, branches are typically two or three
bytes. On a RISC, branches are typically four bytes. The threaded indirect
(load, increment, jump) is typically three bytes on a CISC and twelve bytes
(three instructions) on a RISC.

Direct jumps in full compilation take typically one instruction time.
Indirect jumps take at least the following operations: load, increment, jump
indirect. On a CISC, the three operations can typically be `folded' into one
instruction. There may be a load penalty of one instruction time but the
increment is overlapped, so the total time is three machine units (one `unit'
is about one register->register operation). On a RISC, the total penalty is
three machine units.

Direct jumps take one (I-cache) cycle to fetch both the branch instruction
and the address of the branch target. Indirect jumps take a D-cache cycle to
fetch the address of the branch target and an I-cache cycle to fetch the
branch instruction.

Direct jumps can take advantage of instruction prefetching since the address
of the next instruction is known at the time that the instruction prefetch
reads the direct jump. Threaded indirects require an indirect branch off of a
register. Current RISC and CISC machines are about equivalent in that they do
little prefetching. Some machines being designed do more prefetching, so the
threading overhead for them will be greater.
* Architectural dependency of register allocation

In a machine with a small number of registers, many of the registers are in
use in each primitive and the best possible register allocation will contain
many loads and stores. In a machine with a large number of registers, the
full-compilation implementation can make much better use of registers than
the threaded primitives implementation (again, for small primitives). The
savings from full compilation are exactly analogous to the improvements in
register allocation from doing inlining of small procedures.

* Other points to ponder

In some cases the threaded code implementation is substantially smaller than
the full-compilation implementation. For large functions or a machine with
small caches, the loss of performance from threading might be overwhelmed by
the gain in cache performance.

On RISC machines, procedure call/return is about twice the cost of other
control flow, except for the overhead of register management. Thus,
call-dense RISC code from full compilation may behave about the same as
threaded code.
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
bpendlet@bambam.es.com (Bob Pendleton) (04/09/91)
In article <23613@as0c.sei.cmu.edu> you write:
> A program in threaded code is just an array of addresses, possibly
> interspersed with operands. So the fragment
>
>     c := a + b
>
> becomes something like
>
>     address of 'load'
>     address of 'a'
>     address of 'load'
>     address of 'b'
>     address of '+'
>     address of 'store'
>     address of 'c'
>
> This implies a very simple virtual stack machine - you can get more clever
> by implementing a virtual register machine.

About 10 years ago I was working on a lisp compiler that compiled to threaded
code. I was trying to get small code and still have some performance. (Since
I wanted to run the code on a Z80 or 8080, small was important. My how things
change :-)

I found that the 3 most common operations in threaded code were load, store,
and execute. So I put those operations with the operands. This made the
operands look rather like classes with load, store, and execute as virtual
functions. If you let the evaluate operation subsume the load and execute
operations, the threaded code for

    c := a + b;

becomes

    address of 'a.evaluate()'
    address of 'b.evaluate()'
    address of '+'
    address of 'c.store()'

and

    g := F(x, y);

becomes

    address of 'x.evaluate()'
    address of 'y.evaluate()'
    address of 'F.evaluate()'
    address of 'g.store()'

which is much smaller than the original version of threaded code.

Later, while working on a homebrew version of FORTH, I gave up on threaded
code completely. I found, like most who have explored it, that symbolic
execution of RPN code is a nice way to generate machine code - machine code
that runs much faster than threaded code and that, even on an 8080, was only
about 25% bigger than the threaded code.
--
Bob Pendleton
bpendlet@dsd.es.com or decwrl!esunix!bpendlet or utah-cs!esunix!bpendlet
[The DEC PDP-11 Fortran compiler did something similar, writing load routines
for commonly used variables. -John]
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
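Pendleton's operand-centered representation might be sketched in C along
these lines; all the names are invented for the example, and a plain dispatch
loop stands in for real threaded jumps. Each operand record carries its own
evaluate-or-store routine, so the code stream needs only one word per
operand:

    /* Operands as little "objects": the first word is the routine that
       evaluates the operand (pushes its value) or stores into it; pure
       operators like '+' are records with no data. */
    typedef struct TC {
        void (*run)(struct TC *self);
        long value;
    } TC;

    long stack[32], *sp = stack;

    void evaluate(TC *self) { *++sp = self->value; }    /* a.evaluate() */
    void store(TC *self)    { self->value = *sp--; }    /* c.store() */
    void plus(TC *self)     { sp--; *sp += sp[1]; }     /* '+' */

    TC a = { evaluate, 1 }, b = { evaluate, 2 };
    TC c = { store, 0 };
    TC add = { plus, 0 };

    TC *code[] = { &a, &b, &add, &c, 0 };               /* c := a + b */

    void run()
    {
        TC **ip;
        for (ip = code; *ip != 0; ip++)
            (*(*ip)->run)(*ip);
    }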