micha@ecrc.de (Micha Meier) (08/01/90)
Some people argue that Prolog is too slow, others claim that it is fast enough, and yet others that it is even faster than languages like C for particular problems. I would like to contribute to this discussion from another point of view. I can say, as can everybody who has seriously tried to implement an efficient Prolog compiler, that Prolog could be made much faster than it is now, because numerous redundant operations are being performed all the time. Trying to optimize them all away requires generating a lot of code to catch the various possibilities and produce optimal code for each of them. With the use of abstract interpretation it might be possible to obtain enough information to generate code that is both good and short, but the current abstract interpretation techniques are not yet mature enough to be available in commercial systems.

Consequently, people try to locate the bottlenecks of Prolog execution and modify the compiler and the engine to get rid of them. This, however, may only make the whole story more complicated, because the criteria used to decide what the bottlenecks are can be rather questionable. Many published papers say something like "on the set of benchmarks B the feature F proved to be very important, and therefore we present a WAM modification that handles the feature F very efficiently". In almost all cases the benchmark suite in question includes only very short programs, which can in no way be representative of current real-life Prolog programs. *Current* is important here, because some years from now these programs may become more representative: the way of executing them may change, the current bottlenecks may disappear, and the feature F may really become significant. But until then, small benchmark programs cannot be trusted! I have an interesting example with several suites of benchmark programs which were used to test the ECRC Sepia system.
The first suite consists of the benchmark programs set up by the Computer Architecture group at ECRC several years ago; it tries to test single Prolog actions, like creating a choice point or an environment, dereferencing, unification, etc. The Pereira benchmarks, used for the same purpose, are not well suited because they often measure something different from what they say.

The second benchmark set was also put together by the Computer Architecture group; it includes small but well-known benchmark programs like naive reverse, map colouring, sieve, quicksort, queens, etc.

The third benchmark suite consists of large programs. I have selected them using the following criteria:

 - The program must run long enough to be safely measurable.
 - The program must contain a sufficient number of different procedure calls; 100 is sufficient.
 - The program has to make at least 100000 procedure calls, and these calls must not be concentrated in only one or two procedures. (SEPIA contains a profiler that shows the number of ports passed for each procedure.)

These programs were taken from the L.A. benchmarking workshop; some were written at ECRC and yet others come from various sources.

With these three benchmark suites (I also have others, but they are rather biased) one gets a reasonable feeling for the speed of a Prolog system. Now, what is interesting is that I have run these benchmarks with Quintus 2.0 and SEPIA 3.0.5 (academic release), and the results are quite different:

                        time Quintus/SEPIA
    Single features           0.79
    Small programs            0.77
    Large programs            1.33

The difference on the small benchmarks is mainly due to the fact that SEPIA uses 32 bits for the tag and 32 for the value, and to its ability to handle coroutining and asynchronous interrupts. For the large programs, however, one can see that other features may actually outweigh these.
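The selection criteria above can be sketched as a Prolog check. This is only an illustration: call_port_count/2 is a hypothetical predicate standing in for a profiler query (SEPIA's profiler records port counts per procedure, but its actual interface may differ), and the "not concentrated" test is just one possible reading, namely that the two most-called procedures account for less than half of all calls.

    % Succeeds if a profiled program satisfies the selection criteria.
    % call_port_count(Proc, N): hypothetical profiler query; N call
    % ports were recorded for procedure Proc.
    suitable_benchmark :-
        findall(N, call_port_count(_, N), Counts),
        length(Counts, Procs),
        Procs >= 100,                  % enough different procedures
        sum_all(Counts, Total),
        Total >= 100000,               % enough procedure calls
        msort(Counts, Sorted),         % duplicate-keeping sort
        append(_, [A, B], Sorted),     % the two most-called procedures
        A + B < Total / 2.             % calls not concentrated in them

    sum_all([], 0).
    sum_all([N|Ns], S) :- sum_all(Ns, S0), S is S0 + N.

The threshold of one half is arbitrary; any similar test of concentration would do.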
I don't want to claim here that SEPIA is faster than Quintus; there are enough programs even in the suite where it is 40% slower. What I do want to claim is that small benchmarks are not enough. My point is that Prolog needs an official benchmark suite. We have at least to start discussing which programs should or should not be in the suite, and what the criteria for selecting them are. Currently, every vendor uses some benchmark suite, of course the one that is good for his Prolog system, and other benchmarks are usually refused. We have to move away from this. I think that by now everyone can see that Prolog systems are going to be much faster and that a vendor will not be able to say "yes, on this benchmark the other Prolog is faster, but only because it optimizes feature F, which would not matter in real programs". Sooner or later, all features will have to be optimized. I know it will be hard, but until we get rid of the "naive reverse curse", Prolog will not be taken seriously.

Last year in Cleveland I talked to people from several commercial companies, and it turned out that everyone basically agrees that a benchmark suite is necessary, but everybody also wants to participate in the selection. We more or less agreed that benchmark programs should be used 'as is'. If a system has a way to generate e.g. mode declarations automatically, it can use it, but declarations cannot be added by hand. Everybody can of course add some declarations and say "our system runs the benchmark suite in X seconds, but with added mode declarations in Y seconds!".

There are always problems when running some programs with two different Prolog systems, because the same functionality can be obtained in different ways; e.g. using assert/1 to implement a counter can be fast in one system but hopelessly slow in another, which has its own efficient primitive to handle counters.
These problems have to be handled on a case-by-case basis. The idea is that some programs would use simple predicates that can be defined separately for each particular Prolog system; e.g. inc_counter/1 would be defined in one system as

    inc_counter(C) :-
        erase(C, Value),
        NewValue is Value + 1,
        recorda(C, NewValue).

in another one as

    inc_counter(C) :- ctr_inc(C).

or

    inc_counter(C) :- incval(C).

This approach should make each vendor reasonably happy, and could become obsolete after the ISO standard appears (but probably will not, because Prolog systems will continue to offer extra features to handle frequent operations). The benchmark programs should be written in a syntax which is not too distant from the current ISO draft, so that they can be made ISO compatible when needed.

I would like to collect more programs from which the benchmark suite could be selected. Using our profiler, I can collect the statistics for every program, both at the level of box ports and at the level of WAM instructions, and I can send these statistics to everyone who is interested. I can also make my benchmark suites available to people who want to participate in the selection of benchmark programs. Since I already have enough small programs (everybody has :-)), what I need is more large programs, but perhaps also programs tuned to test other single features. Please, if you have some real-life programs which are sufficiently large and which can be made public, send them to me together with sample data and reference output. If your program is good, you can in turn get SEPIA at a discount, so it pays off for you. Moreover, if your program is in the benchmark suite, a lot of implementors all around the world will be trying to speed up *your* program; wouldn't that be great? The best medium is a Sun cartridge, but mailing a compressed file would do as well. Everybody who contributes a program will be acknowledged in the suite.

--Micha Meier

Disclaimer as usual.
USA     micha%ecrc.de@pyramid.com, pyramid!ecrc!micha
EUROPE  micha@ecrc.de, unido!ecrc!micha
MAIL    Micha Meier, ECRC, Arabellastr. 17, 8000 Munich 81, West Germany
ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (08/02/90)
In article <708@ecrc.de>, micha@ecrc.de (Micha Meier) writes:
[ that benchmarking large programs gives different results from
benchmarking small ones, and that we need some large ones
]
Absolutely right, but there IS a collection of larger benchmarks,
that was collected at the Prolog Benchmarking Workshop at Aerospace
Corp a couple of years ago. Contact Karl Kesselman for details.
A tape is available. It isn't, I'm afraid, possible to use ISO-like
syntax yet, because nobody knows well enough what it will be like.
--
Distinguishing between a work written in Hebrew and one written in Aramaic
when we have only a Latin version made from a Greek translation is not easy.
(D.J.Harrington, discussing pseudo-Philo)
pereira@alice.UUCP (Fernando Pereira) (08/03/90)
In article <708@ecrc.de> micha@acrab4.UUCP (Micha Meier) writes:
> The Pereira benchmarks used for the same purpose are not well suited
> because they often measure something different than they say.

I've just looked at the comments in the source for my benchmarks, and nowhere is there a statement of what the various programs are intended to measure, so I don't see what evidence you have for your statement. Now, I clearly had in mind when preparing the benchmarks that they would exercise different aspects of a Prolog implementation, but I was familiar with enough different implementations to know that I didn't have the time to build a benchmark suite whose components I could be sure would measure the same individual aspects across implementations.

Otherwise, I entirely agree with and support your points, and I will be digging into my archives to find programs that might be useful for benchmarking.

Fernando Pereira
2D-447, AT&T Bell Laboratories
600 Mountain Ave, Murray Hill, NJ 07974
pereira@research.att.com
micha@ecrc.de (Micha Meier) (08/06/90)
In article <3502@goanna.cs.rmit.oz.au> ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:
>Absolutely right, but there IS a collection of larger benchmarks,
>that was collected at the Prolog Benchmarking Workshop at Aerospace
>Corp a couple of years ago.

Yes, I wrote in my message:

>These programs were taken from the L.A. benchmarking workshop,
>some were written at ECRC and yet others come from various sources.

Not all the programs from this workshop are large enough, and even fewer call a reasonable number of procedures a reasonable number of times.

--Micha
--
USA     micha%ecrc.de@pyramid.com       MAIL  Micha Meier
EUROPE  micha@ecrc.de                         ECRC, Arabellastr. 17
                                              8000 Munich 81
                                              West Germany
micha@ecrc.de (Micha Meier) (08/08/90)
Some issues concerning the benchmark suite:

 - Several people mailed me that they would like to obtain the benchmark suite. I'm going to make it available as soon as I have enough interesting programs in it, together with test data and output, so that the testing can be completely automatic. The first problem that comes to my mind here is that the output produced by different Prolog systems might not be identical: some Prologs put spaces after commas and around operators, others don't, and probably there are other differences as well. I guess the output will have to be preprocessed for some systems in order to check its correctness. If you are interested in obtaining the test suite, let me know. I'm not yet sure how I will distribute it, probably on a tape for the tape+mailing costs, maybe also by ftp.

 - Whoever else has interesting programs: you can also mail me where to ftp them; it is not necessary to send the program directly.

 - Although this might not be of general interest, I would also like to collect programs using coroutining, to be able to compare the performance of coroutining systems. I have some rather small programs, mostly puzzles, but I'm missing serious application programs using coroutining; are there any such programs around? Small programs would do as well (but no n-queens, primes, etc.).

 - It is clear to me that every commercial company would like to push its own programs into the benchmark suite, and to avoid programs from other companies. I would like to make clear that ECRC, my employer, is a non-profit organization, and I'm not dependent on the performance of our Sepia system (or others) on the benchmark. I am trying to select the programs using some objective criteria, based on profiling them. However, I do accept test programs even when they are supplied by commercial companies, if they fit into the suite's frame.

--Micha Meier
--
USA     micha%ecrc.de@pyramid.com       MAIL  Micha Meier
EUROPE  micha@ecrc.de                         ECRC, Arabellastr. 17
                                              8000 Munich 81
                                              West Germany
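[On the output-comparison point above: a crude first step would be to strip all layout characters from both outputs before comparing them. A sketch in plain Prolog, operating on character-code lists; it is only safe when no quoted atom or string in the output contains significant spaces.]

    % same_output(Cs1, Cs2): the two output texts, given as lists of
    % character codes, are equal once all layout characters are removed.
    same_output(Cs1, Cs2) :-
        strip_layout(Cs1, S),
        strip_layout(Cs2, S).

    strip_layout([], []).
    strip_layout([C|Cs], Out) :-
        ( layout_char(C) ->
            strip_layout(Cs, Out)
        ;   Out = [C|Rest],
            strip_layout(Cs, Rest)
        ).

    layout_char(32).    % space
    layout_char(9).     % tab
    layout_char(10).    % newline

A real comparison would have to be smarter than this, but it illustrates the kind of preprocessing needed.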
micha@ecrc.de (Micha Meier) (08/10/90)
In article <11122@alice.UUCP> pereira@alice.UUCP () writes:
>In article <708@ecrc.de> micha@acrab4.UUCP (Micha Meier) writes:
>> The Pereira benchmarks used
>>for the same purpose are not well suited because they often measure
>>something different than they say.
>
>I've just looked at the comments in the source for my benchmarks, and nowhere
>is there a statement of what the various programs are intended to measure,
>so I don't see what evidence you have for your statement.

I'm sorry I didn't express myself more clearly. What I had in mind is that when you run the Pereira benchmarks, you get output of the form

    tail_call_atom_atom took (1.45 - 0.533333) / 2000 = 0.000458333 seconds/iteration
    binary_call_atom_atom took (2.65 - 0.533334) / 2000 = 0.00105833 seconds/iteration
    cons_list took (2.83333 - 0.533334) / 2000 = 0.00115 seconds/iteration

etc., so without looking at the code and the comments there, one would assume that each program measures what its predicate name says, because for most of the benchmarks the names look self-explanatory. On the other hand, the comments in the source only say e.g. '% 19. Make 1000 asserts of unit clauses', and one assumes that this is what is being measured by that benchmark; at least that is how I understood it. However, this benchmark also measures list unification and tail-recursive calls.

--Micha Meier
--
USA     micha%ecrc.de@pyramid.com       MAIL  Micha Meier
EUROPE  micha@ecrc.de                         ECRC, Arabellastr. 17
                                              8000 Munich 81
                                              West Germany
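[The figures quoted in the last message follow a simple scheme: time N iterations of the benchmark goal, time N iterations of an empty control loop, and report (T - T0) / N. A sketch of that scheme, assuming a hypothetical cputime/1 predicate that returns the consumed CPU time in seconds; actual systems differ, e.g. Quintus would use statistics(runtime, ...) instead.]

    % bench(Goal, N, SecsPerIter): run Goal N times, subtract the cost
    % of N iterations of an empty loop, report seconds per iteration.
    bench(Goal, N, SecsPerIter) :-
        time_loop(N, Goal, T),
        time_loop(N, true, T0),
        SecsPerIter is (T - T0) / N.

    time_loop(N, Goal, Seconds) :-
        cputime(Before),
        run(N, Goal),
        cputime(After),
        Seconds is After - Before.

    run(0, _) :- !.
    run(N, Goal) :- call(Goal), N1 is N - 1, run(N1, Goal).

Note that the loop overhead subtracted by this scheme is exactly the part that a compiler may optimize differently from system to system, which is one more reason why such single-feature figures must be read with care.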