bimandre@kulcs.uucp (Andre Marien) (04/10/89)
We have found only partial answers: can you help us, either with your personal experience or with pointers to the literature?

1. Inline expansion versus threaded code

Inline expansion produces larger executables than threaded code, so one could think that while executing a threaded code program there is more locality and thus less paging. This could lead - especially for LARGE programs - to performance degradation for inline-expanded code, at least a degradation which sets in faster than with threaded code. Our experience is quite the contrary: threaded code degrades more suddenly and faster than inline code. Of course, threaded code has a fixed overhead in executing the thread - jmp a0@+ - but perhaps it is more important that threaded code has to fetch the operands for each instruction, which destroys locality. Another point is that the fixed code in a threaded code implementation can be quite small - 64Kb for instance - so that this code is probably always in real memory; and by tuning the placement of the pieces of code, one can achieve that certain benchmarks run completely in the instruction cache, which explains why threaded code performs quite well for small programs. We know that certain compilers for PCs use threaded code, because space is still a problem if you have just a few megabytes and no virtual memory. We wonder whether production quality compilers using threaded code exist for larger computers. It may be true that for low-end SUNs threaded code may decrease the speed gap, but for the larger systems this would surprise us.

2. Mixing interpreted and compiled code

In a lot of Prolog systems, Prolog code is divided into static code and dynamic code: static code cannot be changed at runtime, dynamic code can. Very often, dynamic code is interpreted - by different techniques - and static code is compiled. There are several ways to implement how compiled code calls interpreted code, or vice versa.
Obviously, most Prolog implementors have solved this, perhaps to their satisfaction. We are curious to know how others did this, and we would also like to know about literature on the subject - in the context of Prolog, Lisp or any other language. If you answer us directly, we'll post a summary to the net later.

Thanks for your help

bimbart@cs.kuleuven.ac.be (mcvax!prlb2!kulcs!bimbart)
andre@sunbim (mcvax!prlb2!sunbim!andre for europe) (sun!sunuk!sunbim!andre for US)
pereira@russell.STANFORD.EDU (Fernando Pereira) (04/12/89)
Our experience with large Prolog and C programs for natural-language analysis, grammar analysis, simulation and speech understanding is that soon after your program starts paging (even on an empty 32MB Sun-4/280 with local SMD disk), you may as well forget about it. The problem is that all of those applications create or use very large relations over complex terms, leading to essentially random access to most of program memory for Prolog, or of the data area for C. For such programs, compact encoding is essential in that it allows us to run an n times larger problem without paging (n varying from 4 to 10 depending on the size ratio between native compiled code and threaded code). It doesn't matter much to us that native code might be between 1.5 and 3 times faster than threaded code on small or well-behaved benchmarks, since native code would be too bulky to run at all! (Not to mention the cost of the disk space for enormous swap partitions.)

Incidentally, one of the complaints most often heard about Lisp on general-purpose machines is the enormous size of compiled code and what that does to paging. This is particularly bad if the compiled code is optimized for speed; in our experience, Sun/Lucid Common Lisp cannot be used effectively on a Sun with less than 12MB of real memory. In contrast, a Symbolics machine with 1M words (approx. 4.5MB) has adequate performance, and I believe this is in good part due to the fact that the machine directly executes high-level Lisp instructions, leading to much less bulky code.

One obvious reason why the conventional wisdom on threaded versus native code may not apply to Prolog or Lisp is that most work on language compilation and interpretation has been done for relatively low-level languages like C, in which the average number of machine instructions per basic language construct is substantially lower than for Prolog or Lisp.
Actually, this tradeoff is much less clear for more sophisticated imperative languages like Algol 68 or Simula 67, in which native-compiled code uses calls to out-of-line routines to do a lot of the work (e.g. access to nonlocal variables via displays). It was a very instructive experience to look at compiled code from the DEC-10 Simula 67 compiler...

-- Fernando Pereira
pereira@ai.sri.com
lee@munnari.oz (Lee Naish) (04/14/89)
pereira@russell.UUCP (Fernando Pereira) writes:
>soon after your program starts paging
>you may as well forget about it
>compact encoding is essential in that it allows us to run an
>n times larger problem without paging (n varying from 4 to 10
>depending on the size ratio between native compiled code and threaded
>code)

When I was visiting SICS over the northern winter I ran the MTS natural language system, written by Xiuming Huang, under Sicstus Prolog using the bytecode emulator and the (new) native code system. I have to agree that paging really kills the system. However, the code size factor was not as great as Fernando suggests in this system. The native code version was between 14 and 15 Mb; the emulated system between 6 and 7 Mb (if I recall correctly). Sicstus emulated code is not particularly compact (instructions are halfword or word aligned to speed up loading of instructions and operands) but I think that is true for many Prolog systems (eg, Quintus uses 16 bit opcodes, I've heard).

About half the code in MTS is complex dcg rules and about half is the lexicon (large sets of facts). I'm not sure how the size and speed ratios compare with these two rather different types of code. It may be important for optimization (eg, it might be best to compile one with native code and emulate the other).

Locality is another important issue. The working set of the lexicon code is related to the number of different words in the sentence/text being processed. This is probably reasonably small, even though the fine grain locality is poor. For the grammar rules, fine grain locality is probably better, but the quantity of code used overall is larger.

In the longer term, parallelism may have a significant role to play. Having lots of memory attached to a single processor means the memory is not used efficiently.
Shared memory multiprocessors make better use of memory, and parallelism can also absorb some of the memory latency due to paging (while one part of the computation is waiting for a page, another part can be done). In other words, processor utilisation is increased as well. You have to be careful to avoid thrashing, of course.

lee
tim@quintus.UUCP (Tim Lindholm) (04/15/89)
lee@munmurra.UUCP (Lee Naish) writes:
>pereira@russell.UUCP (Fernando Pereira) writes:
>>soon after your program starts paging
>>you may as well forget about it
>>compact encoding is essential in that it allows us to run an
>>n times larger problem without paging (n varying from 4 to 10
>>depending on the size ratio between native compiled code and threaded
>>code)
>
> When I was visiting SICS over the northern winter I ran the MTS natural
> language system, written by Xiuming Huang, under Sicstus Prolog using
> the bytecode emulator and the (new) native code system. I have to agree
> that paging really kills the system. However, the code size factor was
> not as great as Fernando suggests in this system. The native code
> version was between 14 and 15 Mb; the emulated system between 6 and 7 Mb
> (if I recall correctly). Sicstus emulated code is not particularly
> compact (instructions are halfword or word aligned to speed up loading
> of instructions and operands) but I think that is true for many Prolog
> systems (eg, Quintus uses 16 bit opcodes, I've heard).

But remember that code compactness among emulated systems can also vary widely! While both Quintus and SICStus Prologs use a basic 16-bit opcode when emulating, there are differences in compilation and encoding that make Quintus' emulated code somewhat more compact than SICStus'. (For one thing, emulated SICStus Prolog has shallow backtracking support that is paid for by some additional head code.) I remember a difference of something like 50-100%, but as I don't have a SICStus 0.6 system to play with, please don't take that number too seriously. Would anyone at SICS (or anyone else with access to Quintus Prolog and SICStus Prolog 0.6) be willing to compile some code and post the results? I would bet that the results would bring SICStus' native code expansion over Quintus' emulated code to within the factor of 4-10 that Fernando estimated.
(For that matter, does anyone have both QP and the SICStus native code compiler to compare directly? Does the native code compiler do things like shallow backtracking???)

Incidentally, that the expansion of SICStus' native code over Quintus' emulated code is probably near the low end of Fernando's range should be taken as a source of praise, not blame!

-- Tim
pereira@russell.STANFORD.EDU (Fernando Pereira) (04/16/89)
A few additional details prompted by Lee's comments.

My figures for native vs. threaded code came from comparing Quintus Prolog with two commercial native code compilers for Prolog two years ago, and from discussions with Richard O'Keefe about Prolog compilation. Native code compilation techniques may have improved since then.

The problem of access to rules is more complicated in parsers that do not use the rules directly but instead use some kind of parsing tables derived from the rules. Usually, the tables are much larger than the original rule set.

The question of whether parallelism will be useful for parsing is also rather tricky. To ensure termination and avoid an exponential proliferation of analyses, most parsers these days use dynamic programming techniques to merge parsing configurations. As a consequence, parsing alternatives are no longer independent, leading to subtle synchronization problems in parallel systems. Sigh...

-- Fernando Pereira
lang@sirius.PRC.Unisys.COM (Francois-Michel Lang) (04/18/89)
There has been much discussion recently about the relative merits of inline expansion vs. threaded code. I am a relatively experienced Prolog user (I work on implementations of NL systems written in Prolog), but I have never worked on implementations of Prolog itself, or read much about that aspect of Prolog. Consequently, I do not have much of an idea of what this discussion is all about, but would like to find out more. If some kind soul would be willing to provide some background, or, better yet, give some references to papers/books discussing this general area, I would greatly appreciate the enlightenment.

Many thanks,

--Francois Lang
----------------------------------------------------------------------------
Francois-Michel Lang
Paoli Research Center, Unisys Corporation   lang@prc.unisys.com   (215) 648-7256
Dept of Comp & Info Science, U of PA        lang@cis.upenn.edu    (215) 898-9511
bimandre@kulcs.uucp (Andre Marien) (04/28/89)
(This reply became rather long ...)

First, a summary of responses in chronological order.

: It seems obvious to me that threaded code will page much more. Inline code
: will avoid calls to potentially far-away memory locations that require
: paging in memory. ... Just my intuitions ...
: Bruce Krulwich

: You may want to look into Envos, a recent Xerox spin-off marketing Xerox
: AI technology. They have ported Interlisp-D, which runs on Xerox
: D-machines (microcoded Bytecode), to the Sun platform by writing a
: Bytecode emulator. They apparently get quite good performance (emulator
: fits in 64K cache) and extremely dense code (5Meg for a full system vs.
: 30-40Meg for a similar Symbolics image). Native code would be faster, but
: perhaps not significantly, and would cause very significant code
: expansion. ...
: Johannes A. G. M. Koomen

: I'm sure you know it, but anyway: E. Tick, Memory Performance of Prolog
: Architectures, Kluwer.
: ... in very short, the essence of this book: the WAM is not bad,
: but its only pitfall is the great memory throughput because of
: shallow backtracking ... the primary problem is __PAGING__!!!
: ...
: The biggest problem is not inlining or not, but the abstract machine. If it
: has inherent failures, inline expansion will have no great effect. And
: there lies the difference between Prolog and other programming languages.
: For the WAM, inline expansion is indeed very attractive for determinate
: parts of programs. But when shallow backtracking is present, the savings on
: instruction fetches are negligible. When designing a Prolog machine,
: the primary concern should be the locality of memory references; inline
: expansion should be seen as an improvement for some highly used code. If one
: considers big Prolog applications (e.g. TWAICE) one sees clearly that inline
: code can be used for the smallest parts only.
: Ulrich Neumerkel

: Our experience with large Prolog and C programs for natural-language
: analysis, grammar analysis, simulation and speech understanding is
: that soon after your program starts paging (even on an empty 32MB
: Sun-4/280 with local SMD disk), you may as well forget about it. The
: problem is that all of those applications create or use very large
: relations over complex terms leading to essentially random access to
: most of program memory for Prolog or data area for C.
: ...
: It doesn't matter much to us that native code might be
: between 1.5 and 3 times faster than threaded code on small or
: well-behaved benchmarks, since native code would be too bulky to run
: at all!
: ...
: Fernando Pereira

: My figures for native vs. threaded code came from comparing Quintus
: Prolog with two commercial native code compilers for Prolog two years
: ago, and from discussions with Richard O'Keefe about Prolog
: compilation. Native code compilation techniques may have improved
: since then.
: The problem of access to rules is more complicated in parsers that
: do not use the rules directly but instead use some kind of parsing
: tables derived from the rules. Usually, the tables are much larger
: than the original rule set.
: ...
: Fernando Pereira

: When I was visiting SICS over the northern winter I ran the MTS natural
: language system, written by Xiuming Huang, under Sicstus Prolog using
: the bytecode emulator and the (new) native code system. I have to agree
: that paging really kills the system.
: ...
: About half the code in MTS is complex dcg rules and about half is the
: lexicon (large sets of facts). I'm not sure how the size and speed
: ratios compare with these two rather different types of code. It may be
: important for optimization (eg, it might be best to compile one with
: native code and emulate the other).
: Locality is another important
: issue.
: The working set of the lexicon code is related to the number of
: different words in the sentence/text being processed. This is probably
: reasonably small, even though the fine grain locality is poor. For the
: grammar rules, fine grain locality is probably better, but the quantity
: of code used overall is larger.
: Lee Naish

: Point 1: the optimised native code is about twice as fast as the
: threaded code, because it doesn't have the overhead of the jmp *(SIMPC)++
: instructions, and perhaps more importantly because it is possible to
: combine the effects of a series of instructions moving things around into
: a simpler series putting things directly in their final place.
: Point 2: the emulated instructions might have occupied 6 (e.g. SB Prolog)
: or 12 (e.g. Quintus Prolog) bytes. A VAX version of the optimised native
: code takes on the close order of 60. That's for just one example, but it
: was to some extent a favourable example.
: Point 3: the emulator for an abstract machine is likely to be very small.
: Emulated code just won't be paging the emulator, and it is likely to get
: much higher locality of code reference than native code, which is good
: news for caches. The emulated instructions are small, so they are good
: news for data caches. The emulated instructions flow through the data
: cache, which is bad news for data caches, because the real data have to
: go through the same cache.
: Point 4: emulated systems can be easier to port.
: It's all a trade off.
: Richard A. O'Keefe

: Eckhouse's book on PDP-11 programming (mid 70's) contains a section on
: inline- vs. threaded- vs. `knotted' code. This is probably the best
: description that I have seen in print. (`Knotted' code is his name
: for a cute optimization on threading.) He includes a single data
: point <basic, inline, threaded, knotted> for a `real' program. The
: data point supports his claim that you sometimes get better
: performance from threaded code than from inline code.
: ...
: several observations:
: * For systems that rely heavily on caching, and particularly where
:   caching actions (e.g., prefetch) can hurt performance (e.g., bus
:   contention on a shared-memory multiprocessor), inlining is a loss on
:   most large functions.
: * Threading is harder to apply and probably less general. It is of
:   obvious utility in interpreters and for iterations over a fixed
:   parameter list. It is less obvious how to use threading for
:   collections of arbitrary functions. It is still less obvious what
:   is `best' when the function must have two calling conventions (the
:   `most general' and the optimized calls).
: ...
: %X /u1/rrh/bib/rrh.d/peterd.ref
: %T An Architectural Trail to Threaded-Code Systems
: %A Peter M. Kogge
: %P 22-33
: %J COMP
: %V 15
: %N 3
: %D MAR 1982
: %K forth
: ...
: Pardo
: ...
: * Global register allocation.
: ...
: Pardo

: I think the first is the most relevant; the following you might find useful.
:
: @InProceedings(clocksin85,
:   Key="Clocksin",
:   Author="W. F. Clocksin",
:   Title="Implementation Techniques for Prolog Databases",
:   BookTitle="Software: Practice and Experience",
:   Pages="1-8",
:   Year="1985")
: @comment{Keywords= assert}
: ...
: Periklis Tsahageas

The basic reason why we believe native code generation is better than threaded code is speed. Virtual machines, base register addressing, optimization analysis and caches all start from the principle of locality of code and data reference. You find this in the rule "90% of the time is spent in 10% of the code", in the name 'working set', and elsewhere. So even if the amount of code produced when generating native code is a lot higher, the working set itself will be moderate. Having a complete emulator in 64K, as Koomen reports and as we assume Quintus has, is nice of course, if your performance is not hurt too much. You can plug in a 12Mbyte board instead of a 4Mbyte board; doing the same with a processor, 12 Mips instead of 4 Mips, may be neither as easy nor as effective.
In EUUG '89, David Rosenthal writes about a related topic, observed performance. He says that a workstation's performance depends heavily on network/disk speed, which can be improved by increasing memory, and that only if all that behaves well will you see the effect of lower cputime. People tend to use a system's resources right up to the edge where thrashing is going to occur. This gives unexpected drops in performance if you just go a little further. It is obvious that programs exist where the small increase in paging due to larger code may do just that. It is also true that for some programs it happens with both native and threaded code, because the cause is heap references, or the local stack, not code references. A program used as a benchmark and written by O'Keefe, doing fft, would run fine with the query u_time(9) and thrash with u_time(10). No relation with the code size, obviously! Of course, this is not new. See the remarks of Neumerkel, and the well known book of Tick. The plm_compiler, which has a non-benchmark size, runs happily with native code. No thrashing. That should answer the 'well-behaved benchmarks' remark of Pereira.

It may come as a surprise, but benchmarks tend to be in favor of threaded code! Just think of those long repetitions in some benchmark suites, all using the same emulator instructions over and over again, whereas in native code it is a brand-new instruction each time... There is a counterpart, of course. Any small recursive benchmark predicate will fit in the cache of a 68020 without much trouble, for native code. With a thread, you may gain more than 10% in speed by tuning the emulator to append/3 (concat/3). Do walk-list and walk-list-recursive ring a bell? The first gives overrated timings for threaded code, the second shows native code at its best.

A good point is made by Pereira and agreed upon by Naish. Both mention the problem that Prolog programs contain two very different kinds of code.
You have the 'real' Prolog predicates, and you have the facts. These are quite different beasts, and we agree almost completely with Naish that it is best to compile one to native code and emulate the other. This is precisely what is possible with BIM_Prolog. We support this idea at different levels. First of all, there is a distinction between dynamic and static code (code which is modifiable and code which is not). There are cases where this is sufficient. As both are code, we compile both: static code to native code, dynamic code only to WAM. For fast access, there is indexing on one user-selectable argument for dynamic code, and on up to three for static code. If dynamic code access time is crucial, hashing can be added, which gives a drastic paging improvement if this would have been a problem with the normal indexing. We could construct an artificial test case where all of a sudden response dropped below an acceptable level, using the emulated code (which is about the same density as threaded code) and a lot of facts with multiple arguments. Hashing removed the problem at once. Conclusion: running a benchmark on different systems without any tuning towards the implementation does not show their capabilities.

A second point, about which we have been trying to convince people - with more or less success - is the principle that data and program are not the same, despite the theoretically interesting properties of treating them alike. What we mean is that using assert/retract to simulate global variables is bad. We know 'global variables' are a bad thing, but we have both feet on the ground: sometimes it is better to have them. Just as with goto's (never used longjmp in C?). Therefore we have a set of predicates to handle them efficiently. The wise can use it to their advantage, and the others should never write Prolog.

The third point is the use of a database on disk. This may be a standard relational database or a special Prolog database.
If your database is big, paging will eventually be as slow as disk accesses. Maybe your database caching will even be better. We could easily construct cases where a thrashing program was significantly slower than one using a special database.

As (Prolog) compiler technology advances, fewer instructions will be needed. And of course, if the code gets too long, you can always use a kind of call to an out-of-line routine, gaining nothing and losing nothing compared with the thread. Currently, many WAM instructions can be compiled into just a few instructions. When using mode declarations, code size can often be reduced substantially. Of course, you have to put them in your programs to notice the reduction in code size. There will be a paper at ICLP 89 showing what the future may bring.

An item mentioned by Pardo is only applicable when using native code: global register allocation. WAM register allocation is still not fully solved, due to tradeoffs and the interaction with things like shallow backtracking. Not many attempts at global register allocation have been made. As can be seen in Lisp, it could be useful.

O'Keefe says 'It's all a trade off'. With this, we agree. The size of code and the speedups are easy to trade off. If you develop programs in Prolog which need to compete with C-coded programs, you need top speed. The arguments about caching are not that clear; we doubt he has a point. See above. As for the ease of porting, this is hardly something users care much about. It is an implementors' tradeoff. And then, it depends on the people. Porting BIM_Prolog 2.4 to VAX BSD took 14 days to get an alpha release product, meaning a lot of testing is included in this time. We find this very acceptable, although the work itself is boring.

As a last remark, the reference given by Keppel to the article by Kogge is interesting for all who want to read something about threaded code and how it should be done, if at all.

Andre' Marien, B.I.M.
Bart Demoen, K.U.Leuven
debray@arizona.edu (Saumya K. Debray) (05/03/89)
There seems to be a consensus that native code implementations of Prolog may suffer because of non-locality of code references, and the excessive paging that results. However, it may be possible to reduce the amount of paging using code reorganization techniques developed for cache management. The following is the abstract of a TR titled "Code Reorganization for Instruction Caches" by A. Dain Samples and Paul Hilfinger (UC Berkeley, Oct 1988):

"Program performance can be improved on machines with virtual memory by reorganizing the program's address space. We address the question whether reorganization could benefit programs on machines with instruction caches. We have performed experiments to determine the efficacy of restructuring using simple reordering algorithms and profile data, concluding that performance improvement can be obtained relatively cheaply. Experiments show improvements in miss rates on the order of 30% to 50%, and sometimes as high as 50% to 80%, by performing a simple algorithm that relocates only 3% to 8% of the basic blocks of a program."

-- Saumya Debray
   CS Department, University of Arizona, Tucson
   internet: debray@arizona.edu    uucp: arizona!debray