ssinghani@viewlogic.com (Sunder Singhani) (04/18/91)
Recently there have been a lot of postings comparing the performance of threaded code with that of full compilation. We have a system that uses a threaded-code implementation to interpret a particular higher-level language. Its performance is below our requirements, so I am looking at different kinds of IRs (SPC, RTL, tuples, etc.) and virtual machines that would give good interpreted speed.

So I have the following questions: Is threaded code the best representation of interpreted code for performance? If not, what are the alternatives, and how do they compare with full compilation? Any references to general comparisons between the performance of interpreted and compiled code systems are welcome.

---------------
Sunder Singhani
ssinghani@viewlogic.com
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.
pardo@june.cs.washington.edu (David Keppel) (04/23/91)
ssinghani@viewlogic.com (Sunder Singhani) writes:
>[Our threaded code isn't fast enough. What's faster?]

As far as I know, threaded code gives the fastest primitives/second dispatch rate on a variety of architectures. The general techniques for making things faster (that I know of!) all change things so that the dispatch rate can go down without changing the work that gets done (or use hardware, but we'll ignore that for the moment). A minimal sketch of a threaded-code dispatch loop appears below, before the citations, for reference.

* Use a different v-machine instruction set

The overhead of interpreting is almost nothing in generic PostScript imaging code, because all the time is spent in non-interpreted primitives. If you can characterize your key operations (perhaps the information in [Davidson & Fraser ??; Fraser, Myers & Wendt 84] can help you analyze for common operations, instead of the more usual time spent in routines), then you can re-code your virtual instruction set to have as primitives those operations that are performed frequently (second sketch below).

* Dynamic compilation to native machine code

This is what is done in ParcPlace Systems' Smalltalk-80 implementation [Deutsch & Schiffman 84] and in Insignia Solutions' 8086 interpreter (third sketch below). Dynamic compilation suffers from the need to compile at runtime: a compiler that produces better code will take longer to run, and the compile time contributes to the overall program runtime. Also, program text isn't shared, even if multiple instances are running simultaneously.

* Native-coding key routines

If you believe that your program spends 80% of its time in a few key routines, then compiling just those routines -- statically adding them to the primitive set, statically adding them as library routines, or compiling them dynamically -- can improve performance substantially [Pittman 87] (last sketch below).
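First, the dispatch loop itself. This is a minimal sketch of direct-threaded dispatch, assuming GCC's computed-goto extension (&&label); the instruction set and the little program are invented for illustration, not taken from any real system. The program is an array of label addresses with operands inline, so each primitive ends by jumping straight to the next one -- there is no central fetch-decode loop to return to.

    /* Direct-threaded dispatch sketch; assumes GCC's &&label extension. */
    #include <stdio.h>

    int main(void)
    {
        long stack[16], *sp = stack;       /* small operand stack */

        static void *prog[] = {            /* threaded "code" for:  */
            &&push, (void *)2,             /*   push 2              */
            &&push, (void *)3,             /*   push 3              */
            &&add,                         /*   add                 */
            &&print,                       /*   print top of stack  */
            &&halt
        };
        void **ip = prog;                  /* virtual instruction pointer */

        goto **ip++;                       /* start the thread */

    push:  *sp++ = (long)*ip++;          goto **ip++;
    add:   --sp; sp[-1] += sp[0];        goto **ip++;
    print: printf("%ld\n", sp[-1]);      goto **ip++;
    halt:  return 0;
    }

Note that the per-instruction overhead is just the "goto **ip++" at the end of each primitive, which is why it is hard to beat the dispatch rate without reducing the number of dispatches.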
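Second, re-coding the virtual instruction set. The idea is that if profiling shows a sequence such as "push local a; push local b; add" to be very common, it can become a single primitive, so three dispatches collapse into one. The opcode names and encoding here are invented, and a switch-based interpreter is used just to keep the sketch short:

    enum op { PUSHLOCAL, ADD, ADDLOCALS, HALT };

    long run(const long *code, const long *locals)
    {
        long stack[32], *sp = stack;
        for (;;) {
            switch ((enum op)*code++) {
            case PUSHLOCAL:  *sp++ = locals[*code++];  break;
            case ADD:        --sp; sp[-1] += sp[0];    break;
            case ADDLOCALS:  /* = PUSHLOCAL a; PUSHLOCAL b; ADD */
                *sp++ = locals[code[0]] + locals[code[1]];
                code += 2;
                break;
            case HALT:       return sp[-1];
            }
        }
    }

A peephole pass over the v-code (or the v-code generator itself) emits ADDLOCALS wherever the three-instruction pattern occurs; the work done is unchanged, only the dispatch count drops.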
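Third, the shape of a dynamic-compilation system along the lines of [Deutsch & Schiffman 84]: translate a p-code routine to native code the first time it is called, cache the result, and run the native version thereafter. The code generator itself is machine-specific, so translate() is only a stub declaration here; all names are invented and this is a sketch of the caching structure, not anyone's actual implementation.

    #include <stddef.h>

    typedef void (*ncode_fn)(void);

    struct entry {
        const unsigned char *pcode;    /* key: p-code routine address   */
        ncode_fn             ncode;    /* value: translated native code */
    };

    #define NCACHE 1024
    static struct entry cache[NCACHE];

    extern ncode_fn translate(const unsigned char *pcode);   /* stub */

    void call_routine(const unsigned char *pcode)
    {
        struct entry *e = &cache[((size_t)pcode >> 2) % NCACHE];
        if (e->pcode != pcode) {       /* miss: pay compile time now */
            e->pcode = pcode;
            e->ncode = translate(pcode);
        }
        e->ncode();                    /* hit: run at native speed */
    }

The trade-offs mentioned above show up directly: translate() is charged to runtime, and the cache is private to the process, so translated text isn't shared between instances.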
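Last, a sketch of a two-level hybrid in the spirit of [Pittman 87]: most routines stay as interpreted p-code, but routines that profiling shows to be hot are supplied as compiled functions (C here for portability of the sketch; the paper discusses assembly coding) and are called through a function pointer, with no per-instruction dispatch at all. All names are invented.

    struct routine {
        int is_native;
        union {
            const long *pcode;             /* interpreted body */
            void      (*native)(long *sp); /* compiled body    */
        } u;
    };

    extern void interpret(const long *pcode, long *sp);

    void call(const struct routine *r, long *sp)
    {
        if (r->is_native)
            r->u.native(sp);       /* the hot 80%: straight native code */
        else
            interpret(r->u.pcode, sp);
    }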
Five citations follow:

%A Robert Bedichek
%T Some Efficient Architecture Simulation Techniques
%J Winter '90 USENIX Conference
%D 26 October, 1989
%W Robert Bedichek.
%W Pardo has a copy.
%X Describes a simulator that uses threaded-code techniques to
emulate a Motorola 88000. Each 88k instruction is executed in about
20 host (68020) instructions. Discusses the techniques used to get
the simulation down from the several thousand host instructions of
many other simulators.

%A Jack W. Davidson
%A Chris W. Fraser
%T Eliminating Redundant Object Code
%J POPL9
%P 128-132

%A Peter Deutsch
%A Alan M. Schiffman
%T Efficient Implementation of the Smalltalk-80 System
%J 11th Annual Symposium on Principles of Programming Languages (POPL 11)
%D January 1984
%P 297-302
%X Dynamic translation of p-code to n-code (native code). Reasons for
not using straight p-code or straight n-code:
* p-code is smaller than n-code (<= 5X).
* The debugger can debug p-code, improving portability.
* Native code is faster (<= 2X). Reasons include special
fetch/decode/dispatch hardware; the p-machine and n-machine may be
very different, e.g., a stack machine vs. a register-oriented machine.
* Threaded code does reduce the cost of p-code fetch/decode, but it
does not help with operand decoding and does not allow optimizations
to span more than one instruction. [pardo: that's not technically
true, but each optimized instruction must exist in addition to the
unoptimized version, which leads to exponential blowup of the p-code.
Example: the delayed-branch and non-delayed-branch versions of
instructions in Robert Bedichek's 88k simulator.]
The system characteristics:
* The time to translate to n-code via macro expansion is about the
same as the execution time to interpret the p-code.
* (pg 300:) Self-modifying code (SMC) is deprecated but used in a
``well-confined'' way. Could indirect at more cost; use SMC on the
machines where it works, indirection where SMC doesn't.
* Performance is compared to a ``straightforward'' interpreter.
What's that?

%A Christopher W. Fraser
%A Eugene W. Myers
%A Alan L. Wendt
%T Analyzing and Compressing Assembly Code
%J CCC84
%P 117-121

%A Thomas Pittman
%T Two-Level Hybrid Interpreter/Native Code Execution for Combined
Space-Time Program Efficiency
%D 1987
%J ACM SIGPLAN
%P 150-152
%X Talks about native-code execution vs. various kinds of
interpreting, and coding of key routines in assembly.

Hope this helps!

	;-D on  ( This is all open to interpretation )  Pardo
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.