ssinghani@viewlogic.com (Sunder Singhani) (04/18/91)
Recently there have been a lot of postings comparing the performance of threaded code with that of full compilation. We have a system that uses a threaded-code implementation to interpret a particular higher-level language. Its performance is below our requirements, so I am looking at different kinds of IRs (SPC, RTL, tuples, etc.) and virtual machines that would give good interpreted speed.

So I have the following questions: Is threaded code the best representation of interpreted code for performance? If not, what are the alternatives, and how do they compare with full compilation? Any references to general comparisons between the performance of interpreted and compiled code systems are welcome.

---------------
Sunder Singhani
ssinghani@viewlogic.com
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.
pardo@june.cs.washington.edu (David Keppel) (04/23/91)
ssinghani@viewlogic.com (Sunder Singhani) writes:
>[Our threaded code isn't fast enough. What's faster?]

As far as I know, threaded code gives the fastest primitives/second dispatch rate on a variety of architectures. The general techniques for making things faster (that I know of!) all change things so that the dispatch rate can go down without changing the work that gets done (or use hardware, but we'll ignore that for the moment). A minimal sketch of a threaded-code dispatch loop appears below, before the citations, for reference.

* Use a different v-machine instruction set

The overhead of interpreting is almost nothing in generic PostScript imaging code, because all the time is spent in non-interpreted primitives. If you can characterize your key operations (perhaps the information in [Davidson & Fraser ??; Fraser, Myers & Wendt 84] can help you analyze for common operations, instead of the more usual time spent in routines), then you can re-code your virtual instruction set to have as primitives those operations that are performed frequently (second sketch below).

* Dynamic compilation to native machine code

This is what is done in ParcPlace Systems' Smalltalk-80 implementation [Deutsch & Schiffman 84] and in Insignia Solutions' 8086 interpreter (third sketch below). Dynamic compilation suffers from the need to compile at runtime: a compiler that produces better code will take longer to run, and the compile time contributes to the overall program runtime. Also, program text isn't shared, even if multiple instances are running simultaneously.

* Native-coding key routines

If you believe that your program spends 80% of its time in a few key routines, then compiling just those routines -- statically adding them to the primitive set, statically adding them as library routines, or compiling them dynamically -- can improve performance substantially [Pittman 87] (last sketch below).
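First, the dispatch loop itself. This is a minimal sketch of direct-threaded dispatch, assuming GCC's computed-goto extension (&&label); the instruction set and the little program are invented for illustration, not taken from any real system. The program is an array of label addresses with operands inline, so each primitive ends by jumping straight to the next one -- there is no central fetch-decode loop to return to.

    /* Direct-threaded dispatch sketch; assumes GCC's &&label extension. */
    #include <stdio.h>

    int main(void)
    {
        long stack[16], *sp = stack;       /* small operand stack */

        static void *prog[] = {            /* threaded "code" for:  */
            &&push, (void *)2,             /*   push 2              */
            &&push, (void *)3,             /*   push 3              */
            &&add,                         /*   add                 */
            &&print,                       /*   print top of stack  */
            &&halt
        };
        void **ip = prog;                  /* virtual instruction pointer */

        goto **ip++;                       /* start the thread */

    push:  *sp++ = (long)*ip++;          goto **ip++;
    add:   --sp; sp[-1] += sp[0];        goto **ip++;
    print: printf("%ld\n", sp[-1]);      goto **ip++;
    halt:  return 0;
    }

Note that the per-instruction overhead is just the "goto **ip++" at the end of each primitive, which is why it is hard to beat the dispatch rate without reducing the number of dispatches.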
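Second, re-coding the virtual instruction set. The idea is that if profiling shows a sequence such as "push local a; push local b; add" to be very common, it can become a single primitive, so three dispatches collapse into one. The opcode names and encoding here are invented, and a switch-based interpreter is used just to keep the sketch short:

    enum op { PUSHLOCAL, ADD, ADDLOCALS, HALT };

    long run(const long *code, const long *locals)
    {
        long stack[32], *sp = stack;
        for (;;) {
            switch ((enum op)*code++) {
            case PUSHLOCAL:  *sp++ = locals[*code++];  break;
            case ADD:        --sp; sp[-1] += sp[0];    break;
            case ADDLOCALS:  /* = PUSHLOCAL a; PUSHLOCAL b; ADD */
                *sp++ = locals[code[0]] + locals[code[1]];
                code += 2;
                break;
            case HALT:       return sp[-1];
            }
        }
    }

A peephole pass over the v-code (or the v-code generator itself) emits ADDLOCALS wherever the three-instruction pattern occurs; the work done is unchanged, only the dispatch count drops.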
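Third, the shape of a dynamic-compilation system along the lines of [Deutsch & Schiffman 84]: translate a p-code routine to native code the first time it is called, cache the result, and run the native version thereafter. The code generator itself is machine-specific, so translate() is only a stub declaration here; all names are invented and this is a sketch of the caching structure, not anyone's actual implementation.

    #include <stddef.h>

    typedef void (*ncode_fn)(void);

    struct entry {
        const unsigned char *pcode;    /* key: p-code routine address   */
        ncode_fn             ncode;    /* value: translated native code */
    };

    #define NCACHE 1024
    static struct entry cache[NCACHE];

    extern ncode_fn translate(const unsigned char *pcode);   /* stub */

    void call_routine(const unsigned char *pcode)
    {
        struct entry *e = &cache[((size_t)pcode >> 2) % NCACHE];
        if (e->pcode != pcode) {       /* miss: pay compile time now */
            e->pcode = pcode;
            e->ncode = translate(pcode);
        }
        e->ncode();                    /* hit: run at native speed */
    }

The trade-offs mentioned above show up directly: translate() is charged to runtime, and the cache is private to the process, so translated text isn't shared between instances.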
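Last, a sketch of a two-level hybrid in the spirit of [Pittman 87]: most routines stay as interpreted p-code, but routines that profiling shows to be hot are supplied as compiled functions (C here for portability of the sketch; the paper discusses assembly coding) and are called through a function pointer, with no per-instruction dispatch at all. All names are invented.

    struct routine {
        int is_native;
        union {
            const long *pcode;             /* interpreted body */
            void      (*native)(long *sp); /* compiled body    */
        } u;
    };

    extern void interpret(const long *pcode, long *sp);

    void call(const struct routine *r, long *sp)
    {
        if (r->is_native)
            r->u.native(sp);       /* the hot 80%: straight native code */
        else
            interpret(r->u.pcode, sp);
    }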
Five citations follow:

%A Robert Bedichek
%T Some Efficient Architecture Simulation Techniques
%J Winter '90 USENIX Conference
%D 26 October, 1989
%W Robert Bedichek.
%W Pardo has a copy.
%X Describes a simulator that uses threaded-code techniques to
emulate a Motorola 88000. Each 88k instruction is executed in about
20 host (68020) instructions. Discusses the techniques used to get
the simulation down from the several thousand host instructions of
many other simulators.

%A Jack W. Davidson
%A Chris W. Fraser
%T Eliminating Redundant Object Code
%J POPL9
%P 128-132

%A Peter Deutsch
%A Alan M. Schiffman
%T Efficient Implementation of the Smalltalk-80 System
%J 11th Annual Symposium on Principles of Programming Languages (POPL 11)
%D January 1984
%P 297-302
%X Dynamic translation of p-code to n-code (native code). Reasons for
not using straight p-code or straight n-code:
* p-code is smaller than n-code (<= 5X).
* The debugger can debug p-code, improving portability.
* Native code is faster (<= 2X). Reasons include special
fetch/decode/dispatch hardware; the p-machine and n-machine may be
very different, e.g., a stack machine vs. a register-oriented machine.
* Threaded code does reduce the cost of p-code fetch/decode, but it
does not help with operand decoding and does not allow optimizations
to span more than one instruction. [pardo: that's not technically
true, but each optimized instruction must exist in addition to the
unoptimized version, which leads to exponential blowup of the p-code.
Example: the delayed-branch and non-delayed-branch versions of
instructions in Robert Bedichek's 88k simulator.]
The system characteristics:
* The time to translate to n-code via macro expansion is about the
same as the execution time to interpret the p-code.
* (pg 300:) Self-modifying code (SMC) is deprecated but used in a
``well-confined'' way. Could indirect at more cost; use SMC on the
machines where it works, indirection where SMC doesn't.
* Performance is compared to a ``straightforward'' interpreter.
What's that?

%A Christopher W. Fraser
%A Eugene W. Myers
%A Alan L. Wendt
%T Analyzing and Compressing Assembly Code
%J CCC84
%P 117-121

%A Thomas Pittman
%T Two-Level Hybrid Interpreter/Native Code Execution for Combined
Space-Time Program Efficiency
%D 1987
%J ACM SIGPLAN
%P 150-152
%X Talks about native-code execution vs. various kinds of
interpreting, and coding of key routines in assembly.

Hope this helps!

	;-D on  ( This is all open to interpretation )  Pardo
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.