[comp.lang.prolog] Inline expansion versus threaded code

bimandre@kulcs.uucp (Andre Marien) (04/10/89)

We have two questions to which we have found only partial answers: can you
help us, either with your personal experience or with pointers to literature?


1. Inline expansion versus threaded code

Inline expansion produces larger executables than threaded code, so one could
think that while executing a threaded code program there is more locality and
thus less paging. This could lead - especially for LARGE programs - to
performance degrading sooner for inline-expanded code than for threaded code.
Our experience is quite the contrary: threaded code degrades more suddenly
and faster than inline code. Of course, threaded code has a fixed overhead in
executing the thread - jmp a0@+ - but perhaps more important is that threaded
code has to fetch the operands for each instruction, which destroys locality.
Another point is that the fixed code in a threaded code implementation can be
quite small - 64Kb for instance - so that this code is probably always in
real memory; and by tuning the placement of the pieces of code, one can
achieve that certain benchmarks run completely in the instruction cache,
which explains why threaded code performs quite well for small programs.
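
To make the mechanism concrete for readers who have not built one: below is a
rough sketch in C of a threaded-code dispatch loop, using function pointers
where a real system would use a single indirect jump like jmp a0@+ (the
instruction set and all names are invented for illustration; no real system
works exactly like this). Note how every instruction pays one indirect
call/jump for dispatch plus extra memory references to fetch its operands,
costs which inline-expanded code folds into the native instruction stream.

    /* Toy threaded-code interpreter.  Hypothetical instruction set. */
    #include <stdio.h>

    typedef union cell { void (*op)(void); long arg; } cell;

    static cell *ip;            /* virtual instruction pointer        */
    static long  acc;           /* one accumulator, for illustration  */
    static int   running = 1;

    static void add_const(void) { acc += (ip++)->arg; } /* operand fetch */
    static void halt_op(void)   { running = 0; }

    int main(void)
    {
        /* The "compiled" program: alternating opcodes and operands.  */
        cell program[] = {
            { .op = add_const }, { .arg = 40 },
            { .op = add_const }, { .arg = 2 },
            { .op = halt_op }
        };
        ip = program;
        while (running)
            (*(ip++)->op)();    /* the thread: one indirect call each */
        printf("%ld\n", acc);   /* prints 42 */
        return 0;
    }
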
We know that certain compilers for PCs use threaded code, because space is
still a problem if you have just a few megabytes and no virtual memory.
We wonder whether production-quality compilers using threaded code exist
for larger computers. It may be true that threaded code narrows the speed
gap on low-end SUNs, but for the larger systems this would surprise us.

2. Mixing interpreted and compiled code

In a lot of Prolog systems, Prolog code is divided into static code and dynamic
code: static code cannot be changed at runtime, dynamic code can. Very often,
dynamic code is interpreted - by different techniques - and static code is
compiled. There are several ways to implement how compiled code calls
interpreted code, or vice versa. Obviously, most Prolog implementors have
solved this, perhaps to their satisfaction. We are curious to know how others
did this, and we would also like pointers to literature on it - in the
context of Prolog, Lisp or any other language.
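
To fix ideas, one scheme we can imagine (a sketch under our own assumptions,
not a description of any existing system) gives every predicate a uniform
native entry point: for compiled predicates it is the compiled code itself,
for interpreted ones it is a stub that enters the emulator, so a call site
never needs to know which kind of predicate it is calling.

    /* Uniform entry points for compiled and interpreted predicates.
     * All names are hypothetical. */
    #include <stdio.h>

    struct pred;
    typedef void (*entry_fn)(struct pred *);

    struct pred {
        const char *name;
        entry_fn    enter;      /* native entry point                   */
        const char *bytecode;   /* used only by interpreted predicates  */
    };

    static void emulate(struct pred *p)     /* stub into the emulator   */
    {
        printf("emulating %s from its bytecode\n", p->name);
    }

    static void foo_native(struct pred *p)  /* natively compiled pred   */
    {
        printf("native code for %s\n", p->name);
    }

    int main(void)
    {
        struct pred foo = { "foo/1", foo_native, NULL };
        struct pred bar = { "bar/2", emulate, "\x01\x02" };

        foo.enter(&foo);        /* call sites look identical ...        */
        bar.enter(&bar);        /* ... whether compiled or interpreted  */
        return 0;
    }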

If you answer us directly, we'll post a summary to the net later.

Thanks for your help

bimbart@cs.kuleuven.ac.be       (mcvax!prlb2!kulcs!bimbart)
andre@sunbim                    (mcvax!prlb2!sunbim!andre for europe)
                                (sun!sunuk!sunbim!andre for US)

pereira@russell.STANFORD.EDU (Fernando Pereira) (04/12/89)

Our experience with large Prolog and C programs for natural-language
analysis, grammar analysis, simulation and speech understanding is
that soon after your program starts paging (even on an empty 32MB
Sun-4/280 with local SMD disk), you may as well forget about it. The
problem is that all of those applications create or use very large
relations over complex terms leading to essentially random access to
most of program memory for Prolog or data area for C. For such
programs, compact encoding is essential in that it allows us to run an
n times larger problem without paging (n varying from 4 to 10
depending on the size ratio between native compiled code and threaded
code). It doesn't matter much to us that native code might be
between 1.5 and 3 times faster than threaded code on small or
well-behaved benchmarks, since native code would be too bulky to run
at all! (not to mention the cost of the disk space for enormous swap
partitions). 

Incidentally, one of the complaints most often heard about Lisp on
general-purpose machines is the enormous size of compiled code and
what that does to paging. This is particularly bad if the compiled
code is optimized for speed; in our experience, Sun/Lucid Common Lisp
cannot be used effectively on a Sun with less than 12MB of real
memory. In contrast, a Symbolics machine with 1M words (approx 4.5MB)
has adequate performance, and I believe this is in good part due to
the fact that the machine executes directly high-level Lisp
instructions, leading to much less bulky code.

One obvious reason why the conventional wisdom on threaded versus
native code may not apply to Prolog or Lisp is that most work on
language compilation and interpretation has been done for relatively
low-level languages like C, in which the average number of machine
instructions per basic language construct is substantially lower than
for Prolog or Lisp. Actually, this tradeoff is much less clear for
more sophisticated imperative languages like Algol 68 or Simula 67, in
which native-compiled code uses calls to out-of-line routines to do a
lot of the work (e.g. access to nonlocal variables via displays). It
was a very instructive experience to look at compiled code for the
DEC-10 Simula 67 compiler...
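
For concreteness, this is roughly what display-based access to a nonlocal
variable looks like when reduced to C (the layout is hypothetical; real
Algol-family compilers differ in detail): a display is an array indexed by
lexical nesting depth, slot d holding the frame base of the innermost active
procedure at depth d, so each access costs two indexed loads.

    #include <stdio.h>

    #define MAXDEPTH 16
    static long *display[MAXDEPTH];  /* display[d] = frame base, depth d */

    static long load_nonlocal(int depth, int offset)
    {
        return display[depth][offset];  /* done inline or out-of-line */
    }

    int main(void)
    {
        long outer_frame[4] = { 0, 0, 7, 0 };  /* variable x at offset 2 */
        display[0] = outer_frame;
        /* inside a procedure nested below outer: */
        printf("x = %ld\n", load_nonlocal(0, 2));  /* prints x = 7 */
        return 0;
    }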

-- Fernando Pereira
pereira@ai.sri.com

lee@munnari.oz (Lee Naish) (04/14/89)

pereira@russell.UUCP (Fernando Pereira) writes:
>soon after your program starts paging
>you may as well forget about it

>compact encoding is essential in that it allows us to run an
>n times larger problem without paging (n varying from 4 to 10
>depending on the size ratio between native compiled code and threaded
>code)

When I was visiting SICS over the northern winter I ran the MTS natural
language system, written by Xiuming Huang, under Sicstus Prolog using
the bytecode emulator and the (new) native code system.  I have to agree
that paging really kills the system.  However, the code size factor was
not as great as Fernando suggests in this system.  The native code
version was between 14 and 15 Mb; the emulated system between 6 and 7 Mb
(if I recall correctly).  Sicstus emulated code is not particularly
compact (instructions are halfword or word aligned to speed up loading
of instructions and operands) but I think that is true for many Prolog
systems (eg, Quintus uses 16 bit opcodes, I've heard).
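
To make the alignment trade-off concrete, here is a toy C sketch with two
invented encodings of the same instruction (neither is SICStus' nor Quintus'
actual format): the packed form saves two bytes per instruction but must
fetch its operand with a byte-wise copy on machines that fault on unaligned
loads, while the aligned form pays padding to fetch the operand in one load.

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    static uint32_t operand_packed(const uint8_t *p)
    {
        uint32_t v;
        memcpy(&v, p + 2, sizeof v);   /* safe unaligned operand fetch */
        return v;
    }

    static uint32_t operand_aligned(const uint32_t *p)
    {
        return p[1];                   /* one aligned word load */
    }

    int main(void)
    {
        uint8_t  packed[6]  = { 0x10, 0x01 };    /* opcode, A1, operand */
        uint32_t aligned[2] = { 0x00100001, 0 }; /* opcode+A1 in a word */
        uint32_t operand = 12345;
        memcpy(packed + 2, &operand, sizeof operand);
        aligned[1] = operand;
        printf("packed: %lu in 6 bytes, aligned: %lu in 8 bytes\n",
               (unsigned long)operand_packed(packed),
               (unsigned long)operand_aligned(aligned));
        return 0;
    }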

About half the code in MTS is complex dcg rules and about half is the
lexicon (large sets of facts).  I'm not sure how the size and speed
ratios compare with these two rather different types of code.  It may be
important for optimization (eg, it might be best to compile one with
native code and emulate the other).  Locality is another important
issue.  The working set of the lexicon code is related to the number of
different words in the sentence/text being processed.  This is probably
reasonably small, even though the fine grain locality is poor.  For the
grammar rules, fine grain locality is probably better, but the quantity
of code used overall is larger.

In the longer term, parallelism may have a significant role to play.
Having lots of memory attached to a single processor means the memory is
not used efficiently.  Shared memory multiprocessors make better use of
memory and parallelism can also absorb some memory latency due to paging
(while one bit of the computation is waiting for a page, another bit can
be done).  In other words, the processor utilisation is increased also.
You have to be careful to avoid thrashing of course.

	lee

tim@quintus.UUCP (Tim Lindholm) (04/15/89)

lee@munmurra.UUCP (Lee Naish) writes:
>pereira@russell.UUCP (Fernando Pereira) writes:
>>soon after your program starts paging
>>you may as well forget about it

>>compact encoding is essential in that it allows us to run an
>>n times larger problem without paging (n varying from 4 to 10
>>depending on the size ratio between native compiled code and threaded
>>code)

> When I was visiting SICS over the northern winter I ran the MTS natural
> language system, written by Xiuming Huang, under Sicstus Prolog using
> the bytecode emulator and the (new) native code system.  I have to agree
> that paging really kills the system.  However, the code size factor was
> not as great as Fernando suggests in this system.  The native code
> version was between 14 and 15 Mb; the emulated system between 6 and 7 Mb
> (if I recall correctly).  Sicstus emulated code is not particularly
> compact (instructions are halfword or word aligned to speed up loading
> of instructions and operands) but I think that is true for many Prolog
> systems (eg, Quintus uses 16 bit opcodes, I've heard).

But remember that code compactness among emulated systems can also vary
widely!  While both Quintus and SICStus Prologs use a basic 16-bit opcode
when emulating, there are differences in compilation and encoding that make
Quintus' emulated code somewhat more compact than SICStus'.  (For one thing,
emulated SICStus Prolog has shallow backtracking support that is paid for by
some additional head code.)  I remember a difference of something like
50-100%, but as I don't have a SICStus 0.6 system to play with, please don't
take that number too seriously.

Would anyone at SICS (or anyone else with access to Quintus Prolog and
SICStus Prolog 0.6) be willing to compile some code and post the results?  I
would bet that they would bring SICStus' native code expansion over Quintus'
emulated code to within the factor of 4-10 that Fernando estimated.  (For
that matter, does anyone have both QP and the SICStus native code compiler
to compare directly?  Does the native code compiler do things like shallow
backtracking???)  Incidentally, that the expansion of SICStus' native code
over Quintus' emulated code is probably near the low end of Fernando's range
should be taken as a source of praise, not blame!

-- Tim

pereira@russell.STANFORD.EDU (Fernando Pereira) (04/16/89)

A few additional details prompted by Lee's comments.

My figures for native vs. threaded code came from comparing Quintus
Prolog with two commercial native code compilers for Prolog two years
ago, and from discussions with Richard O'Keefe about Prolog
compilation. Native code compilation techniques may have improved
since then.

The problem of access to rules is more complicated in parsers that
do not use the rules directly but instead use some kind of parsing
tables derived from the rules. Usually, the tables are much larger
than the original rule set.

The question of whether parallelism will be useful for parsing is also
rather tricky. To ensure termination and avoid an exponential
proliferation of analyses, most parsers these days use dynamic
programming techniques to merge parsing configurations. As a
consequence, parsing alternatives are no longer independent, leading
to subtle synchronization problems in parallel systems. Sigh...

-- Fernando Pereira

lang@sirius.PRC.Unisys.COM (Francois-Michel Lang) (04/18/89)

There has been much discussion recently about
the relative merits of inline expansion vs.
threaded code....

I am a relatively experienced Prolog user,
(I work on implementations of NL systems written
in Prolog) but I have never worked on implementations
of Prolog itself, or read much about that aspect of Prolog.
Consequently, I do not have much of an idea
of what this discussion is all about,
but would  like to find out more.
If some kind soul would be willing to provide
some background, or, better yet, give some references
to papers/books discussing this general area,
I would greatly appreciate the enlightenment.

Many thanks,

--Francois Lang

----------------------------------------------------------------------------
Francois-Michel Lang
Paoli Research Center, Unisys Corporation lang@prc.unisys.com (215) 648-7256
Dept of Comp & Info Science, U of PA      lang@cis.upenn.edu  (215) 898-9511

bimandre@kulcs.uucp (Andre Marien) (04/28/89)

(This reply became rather long ...)

First, a summary of responses in chronological order.

: It seems obvious to me that threaded code will page much more.  Inline code
: will avoid calls to potentially far-away memory locations that require paging
: in memory.  ... Just my intuitions ...
: Bruce Krulwich


: You may want to look into Envos, a recent Xerox spin-off marketing Xerox
: AI technology.  They have ported Interlisp-D, which runs on Xerox
: D-machines (microcoded Bytecode), to the Sun platform by writing a
: Bytecode emulator.  They apparently get quite good performance (emulator
: fits in 64K cache) and extremely dense code (5Meg for a full system vs.
: 30-40Meg for similar Symbolics image).  Native code would be faster, but
: perhaps not significantly, and would cause very significant code
: expansion. ...
: Johannes A. G. M. Koomen


: I'm sure you know it, but anyway: E. Tick, Memory Performance of Prolog
: Architectures, Kluwer.
: ... for very short, what is the essence of this book: the WAM is not bad,
: but its only pitfall is the great memory throughput because of
: shallow backtracking ... the primary problem is __PAGING__!!!
: ...
: The biggest problem is not inline or not, but the abstract machine. If it
: has inherent weaknesses, inline expansion will have no great effect. And
: there lies the difference between Prolog and other programming languages.
: For the WAM, inline expansion is indeed very attractive for determinate
: parts of programs. But when shallow backtracking is present, the savings on
: the instruction fetches are negligible. When designing a Prolog machine,
: the primary concern should be the locality of memory references; inline
: expansion should be seen as an improvement for some highly used code. If one
: considers big Prolog applications (e.g. TWAICE) one sees clearly that inline
: code can be used for the smallest parts only.
: Ulrich Neumerkel


: Our experience with large Prolog and C programs for natural-language
: analysis, grammar analysis, simulation and speech understanding is
: that soon after your program starts paging (even on an empty 32MB
: Sun-4/280 with local SMD disk), you may as well forget about it. The
: problem is that all of those applications create or use very large
: relations over complex terms leading to essentially random access to
: most of program memory for Prolog or data area for C.
: ...
: It doesn't matter much to us that native code might be
: between 1.5 and 3 times faster than threaded code on small or
: well-behaved benchmarks, since native code would be too bulky to run
: at all!
: ...
: Fernando Pereira
:
: My figures for native vs. threaded code came from comparing Quintus
: Prolog with two commercial native code compilers for Prolog two years
: ago, and from discussions with Richard O'Keefe about Prolog
: compilation. Native code compilation techniques may have improved
: since then.
: The problem of access to rules is more complicated in parsers that
: do not use the rules directly but instead use some kind of parsing
: tables derived from the rules. Usually, the tables are much larger
: than the original rule set.
: ...
: Fernando Pereira


: When I was visiting SICS over the northern winter I ran the MTS natural
: language system, written by Xiuming Huang, under Sicstus Prolog using
: the bytecode emulator and the (new) native code system.  I have to agree
: that paging really kills the system.
: ...
: About half the code in MTS is complex dcg rules and about half is the
: lexicon (large sets of facts).  I'm not sure how the size and speed
: ratios compare with these two rather different types of code.  It may be
: important for optimization (eg, it might be best to compile one with
: native code and emulate the other).
: Locality is another important
: issue.  The working set of the lexicon code is related to the number of
: different words in the sentence/text being processed.  This is probably
: reasonably small, even though the fine grain locality is poor.  For the
: grammar rules, fine grain locality is probably better, but the quantity
: of code used overall is larger.
: Lee Naish


: Point 1: the optimised native code is about twice as fast as the
: threaded code, because it doesn't have the overhead of the jmp *(SIMPC)++
: instructions, and perhaps more importantly because it is possible to
: combine the effects of a series of instructions moving things around into
: a simpler series putting things directly in their final place.
: Point 2: the emulated instructions might have occupied 6 (e.g. SB Prolog)
: or 12 (e.g. Quintus Prolog) bytes.  A VAX version of the optimised native
: code takes on the close order of 60.  That's for just one example, but it
: was to some extent a favourable example.
: Point 3: the emulator for an abstract machine is likely to be very small.
: Emulated code just won't be paging the emulator, and it is likely to get
: much higher locality of code reference than native code, which is good
: news for caches.  The emulated instructions are small, so they are good
: news for data caches.  The emulated instructions flow through the data
: cache, which is bad news for data caches, because the real data have to
: go through the same cache.
: Point 4: emulated systems can be easier to port.
: It's all a trade off.
: Richard A. O'Keefe



: Eckhouse's book on PDP-11 programming (mid 70's) contains a section on
: inline-vs.-threaded-vs-`knotted' code.  This is probably the best
: description that I have seen in print.  (`Knotted' code is his name
: for a cute optimization on threading).  He includes a single data
: point <basic, inline, threaded, knotted> for a `real' program.  The
: data point supports his claim that you sometimes get better
: performance from threaded code than from inline code.
: ...
: several observations:
: * For systems that rely heavily on caching, and particularly where
:   caching actions (e.g., prefetch) can hurt performance (e.g, bus
:   contention on a shared-memory multiprocessor), inlining is a loss on
:   most large functions.
: * Threading is harder to apply and probably less general.  It is of
:   obvious utility in interpreters and for iterations over a fixed
:   parameter list.  It is less obvious how to use threading for
:   collections of arbitrary functions.  It is still less obvious what
:   is `best' when the function must have two calling conventions (the
:   `most general' and the optimized calls).
: ...
: Peter M. Kogge, "An Architectural Trail to Threaded-Code Systems",
: Computer 15(3), pp. 22-33, March 1982.  (Keyword: Forth.)
: ...
: Pardo
: ...
:
: * Global register allocation.
: ...
: Pardo



: I think the first is the most relevant; the following you might find useful.
: 
: @InProceedings(clocksin85,
:         Key="Clocksin",
:         Author="W. F. Clocksin",
:         Title="Implementation Techniques for Prolog Databases",
:         BookTitle="Software: Practice and Experience",
:         Pages="1-8",
:         Year="1985")
:         @comment{Keywords= assert}
: ...
: Periklis Tsahageas

The basic reason why we believe native code generation is better than
threaded code is speed. Virtual machines, base register addressing,
optimization analysis and caches all start from the principle of locality of
code and data references. You find it in the rule of thumb that 90% of the
time is spent in 10% of the code, in the name 'working set', and so on.
So even if the amount of code produced when generating native code
is a lot higher, the working set itself will be moderate.
Having a complete emulator in 64K, as Koomen reports and as we assume Quintus
has, is nice of course, if your performance is not hurt too much. You can
plug in a 12Mbyte board instead of a 4Mbyte board; replacing a 4 Mips
processor by a 12 Mips one may be neither as easy nor as effective.

In EUUG '89 David Rosenthal writes about a related topic, observed
performance, and says that a workstation's performance depends heavily on
network/disk speed, which can be improved by increasing memory; only if all
that behaves well will you see the effect of lower cputime. People tend to
use the system's resources up to the edge where thrashing is going to occur.
This gives unexpected drops in performance if you go just a little further.
It is obvious that programs exist where the small increase in paging due to
larger code may do just that.

It is also true that for some programs this happens with both native and
threaded code, because the cause is heap or local stack references, not code
references. A benchmark program written by O'Keefe, doing an fft, would run
fine with the query u_time(9) and thrash with u_time(10) - obviously no
relation to code size! Of course, this is not new. See the remarks of
Neumerkel, and the well-known book by Tick.

The plm_compiler, which is not of benchmark size, runs happily as native
code. No thrashing. That should answer the 'well-behaved benchmarks' remark
of Pereira. It may come as a surprise, but benchmarks tend to favor
threaded code! Just think of those long repetitions in some benchmark
suites, all using the same emulator instructions over and over again,
whereas in native code it is a brand-new instruction each time.
There is a counterpart of course. Any small recursive benchmark predicate
will sit in the cache of a 68020 without much trouble, as native code.
With a thread, you may gain more than 10% in speed by tuning the
emulator to append/3 (concat/3). Do walk-list and walk-list-recursive ring
a bell? The first gives overrated timings for threaded code, the second
shows native code at its best.

A good point is made by Pereira and agreed upon by Naish. Both mention
the problem that Prolog programs contain two very different kinds of code:
you have the 'real' Prolog predicates, and you have the facts. These are
quite different beasts, and we agree almost completely with Naish that
it is best to compile one with native code and emulate the other.
This is precisely what is possible with BIM_Prolog. We support this idea at
different levels. First of all, there is a distinction between dynamic
and static code (code which is modifiable and code which is not). There are
cases where this distinction is sufficient. As both are code, we compile
both: static code to native code, dynamic code only to WAM.
For fast access, there is indexing on one user-selectable argument for
dynamic code, and on up to three arguments for static code.
If dynamic code access time is crucial, hashing can be added, which gives
a drastic paging improvement if paging would have been a problem with the
normal indexing.
We could construct an artificial test case - using emulated code, which is
about the same density as threaded code, and a lot of facts with multiple
arguments - where all of a sudden response dropped below an acceptable
level. Hashing removed the problem at once.
Conclusion: running a benchmark on different systems without any tuning
towards the implementation does not show their capabilities.
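
To show why hashing helps so drastically, here is a C sketch of a hash table
over the indexed argument of a dynamic predicate (the data structures are
invented for illustration; BIM_Prolog's internals differ). With a degenerate
index, a call walks a linear chain of clause records and can touch a page
per record; with the hash table it touches one short bucket chain, usually
on one page.

    #include <stdio.h>
    #include <string.h>

    #define NBUCKETS 1024

    struct clause {
        const char    *key;   /* value of the indexed argument        */
        const char    *body;  /* stand-in for the compiled clause     */
        struct clause *next;  /* next clause in the same bucket       */
    };

    static struct clause *bucket[NBUCKETS];

    static unsigned hash(const char *s)
    {
        unsigned h = 5381;
        while (*s) h = h * 33 + (unsigned char)*s++;
        return h % NBUCKETS;
    }

    static void add_clause(struct clause *c)  /* assertz on the predicate */
    {
        struct clause **b = &bucket[hash(c->key)];
        while (*b) b = &(*b)->next;           /* keep source order        */
        *b = c;
    }

    static struct clause *lookup(const char *key)  /* first match */
    {
        struct clause *c = bucket[hash(key)];
        while (c && strcmp(c->key, key) != 0) c = c->next;
        return c;
    }

    int main(void)
    {
        static struct clause c1 = { "socrates", "mortal(socrates).", NULL };
        add_clause(&c1);
        printf("%s\n", lookup("socrates")->body);
        return 0;
    }
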
A second point, of which we have been trying to convince people - with more
or less success - is the principle that data and program are not the same,
despite the theoretically interesting properties of treating them alike.
What we mean is that using assert/retract to simulate global variables is
bad. We know a 'global variable' is considered a bad thing, but we have
both feet on the ground: sometimes it is better to have them, just as with
goto's (never used longjmp in C?). Therefore we have a set of predicates
to handle them efficiently. The wise can use them to their advantage, and
the others should never write Prolog anyway.
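
The implementation idea is trivial, which is exactly the point. A C sketch
under hypothetical names (g_assign/g_read are invented here for
illustration): a global variable is one slot in a small table, whereas the
assert/retract route must build a clause, update the predicate's index and
leave the dead clause behind for space recovery.

    #include <stdio.h>
    #include <string.h>

    #define NGLOBALS 64

    static struct { const char *name; long value; } globals[NGLOBALS];
    static int nglobals;

    static void g_assign(const char *name, long value)
    {
        int i;
        for (i = 0; i < nglobals; i++)
            if (strcmp(globals[i].name, name) == 0) {
                globals[i].value = value;   /* destructive update */
                return;
            }
        globals[nglobals].name = name;      /* first assignment   */
        globals[nglobals++].value = value;
    }

    static long g_read(const char *name)
    {
        int i;
        for (i = 0; i < nglobals; i++)
            if (strcmp(globals[i].name, name) == 0)
                return globals[i].value;
        return 0;
    }

    int main(void)
    {
        g_assign("counter", 1);
        g_assign("counter", g_read("counter") + 1);
        printf("counter = %ld\n", g_read("counter"));  /* prints 2 */
        return 0;
    }
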
The third point is the use of a database on disk. This may be a standard
relational database or a special Prolog database. If your database is big,
paging will eventually be as slow as disk accesses - and the database's own
caching may even be better. We could easily construct cases where a
thrashing program was significantly slower than the same program using a
special database.

As (Prolog) compiler technology advances, fewer instructions will be needed.
And of course, if the code gets too long, you can always use a kind of call
to an out-of-line routine, gaining nothing but losing nothing compared
with the thread. Currently, many WAM instructions can be compiled into just
a few machine instructions. When using mode declarations, code size can
often be reduced substantially. Of course, you have to put them in your
programs to notice the reduction in code size.
There will be a paper at ICLP 89 showing what the future may bring.
An item mentioned by Pardo is only applicable when using native code:
global register allocation. WAM register allocation is still not fully
solved, due to tradeoffs and the interaction with things like
shallow backtracking. Not many attempts at global register allocation
have been made. As can be seen in Lisp, it could be useful.

O'Keefe says 'It's all a trade off'. With this we agree. The size of
code and the speedups are easy to trade off. If you develop programs in
Prolog which need to compete with C-coded programs, you need top speed.
The arguments about caching are not that clear; we doubt he has a point
there - see above. As for the ease of porting, this is hardly something
users care much about: it is an implementor's tradeoff. And then, it
depends on the people. Porting BIM_Prolog 2.4 to VAX BSD took 14 days to
get an alpha release product, meaning a lot of testing is included in that
time. We find this very acceptable, although the work itself is boring.

As a last remark, the reference given by Keppel to the article by Kogge
is interesting for all who want to read something about threaded code
and how it should be done, if at all.

Andre' Marien, B.I.M.
Bart Demoen,   K.U.Leuven

debray@arizona.edu (Saumya K. Debray) (05/03/89)

There seems to be a consensus that native code implementations of
Prolog may suffer because of non-locality of code references,
and the excessive paging that results.  However, it may be
possible to reduce the amount of paging, using code reorganization
techniques developed for cache management.  The following is the
abstract of a TR titled "Code Reorganization for Instruction Caches"
by A. Dain Samples and Paul Hilfinger (UC Berkeley, Oct 1988):

   "Program performance can be improved on machines with virtual
    memory by reorganizing the program's address space.  We
    address the question whether reorganization could benefit
    programs on machines with instruction caches.  We have
    performed experiments to determine the efficacy of
    restructuring using simple reordering algorithms and profile
    data, concluding that performance improvement can be
    obtained relatively cheaply.  Experiments show improvements
    in miss rates on the order of 30% to 50%, and sometimes as
    high as 50% to 80%, by performing a simple algorithm that
    relocates only 3% to 8% of the basic blocks of a program."
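
To give the flavour of such a pass, here is a deliberately oversimplified C
sketch (my own illustration, not Samples and Hilfinger's algorithm): take an
execution count per code unit from a profiling run, then emit the hot units
adjacently so that they share pages and cache lines, leaving the cold ones
at the end.

    #include <stdio.h>
    #include <stdlib.h>

    struct unit { const char *name; long count; };

    static int hotter_first(const void *a, const void *b)
    {
        long ca = ((const struct unit *)a)->count;
        long cb = ((const struct unit *)b)->count;
        return (cb > ca) - (cb < ca);   /* descending by count */
    }

    int main(void)
    {
        struct unit prof[] = {          /* profile from a training run */
            { "unify",  900000 }, { "error_exit",       3 },
            { "deref", 1200000 }, { "gc",              40 },
        };
        int i, n = sizeof prof / sizeof prof[0];
        qsort(prof, n, sizeof prof[0], hotter_first);
        for (i = 0; i < n; i++)         /* the new link order */
            printf("%s\n", prof[i].name);
        return 0;
    }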

-- 
Saumya Debray		CS Department, University of Arizona, Tucson

     internet:   debray@arizona.edu
     uucp:       arizona!debray