micha@ecrc.de (Micha Meier) (08/01/90)
Some people argue that Prolog is too slow, others claim that it is
fast enough, and yet others that, for particular problems, it is even
faster than languages like C. I would like to contribute to this
discussion from another point of view. I can say, and so can everybody
who has seriously tried to implement an efficient Prolog compiler,
that Prolog could be made much faster than it is now, because numerous
redundant operations are performed all the time. Optimizing them all
away requires generating a lot of code, to cover all the various
possibilities and produce optimal code for each of them. With abstract
interpretation it might be possible to obtain enough information to
generate code that is both good and short, but the current abstract
interpretation techniques are not yet mature enough to be used in
commercial systems.
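As a small illustration of the kind of redundancy I mean (the mode
assumptions in the comments are only an example, nothing a compiler is
actually told today), take the partition step of quicksort:

    % Purely illustrative sketch: if the compiler knew that the first
    % two arguments are always bound (the first to a proper list of
    % numbers) and the last two are unbound, it could replace general
    % head unification by simple tests and constructions and drop the
    % code paths that handle other instantiation patterns. Without
    % that knowledge it must emit code covering every pattern, most of
    % which never occur at run time.

    part([], _Pivot, [], []).
    part([X|Xs], Pivot, [X|Ls], Bs) :-
            X =< Pivot, !,
            part(Xs, Pivot, Ls, Bs).
    part([X|Xs], Pivot, Ls, [X|Bs]) :-
            part(Xs, Pivot, Ls, Bs).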
Consequently, people try to locate the bottlenecks of Prolog execution
and modify the compiler and the engine to get rid of them. This,
however, may only make the whole story more complicated, because the
criteria by which the bottlenecks are identified are often rather
questionable. Many papers have been published which say something like
"on the benchmark set B the feature F turned out to be very important,
and therefore we present a WAM modification that handles the feature F
very efficiently". In almost all cases the benchmark suite in question
includes only very short programs, which can in no way be
representative of current real-life Prolog programs. *Current* is
important here, because some years from now these programs may become
more representative: the way of executing them may change, the current
bottlenecks may disappear, and the feature F may really become
significant. But until then, small benchmark programs cannot be
trusted! I have an interesting example with several suites of
benchmark programs which were used to test the ECRC SEPIA system.
The first suite consists of the benchmark programs set up by the
Computer Architecture group at ECRC several years ago; it tries to
test single Prolog actions, like creating a choice point or an
environment, dereferencing, unification, etc. The Pereira benchmarks,
used for the same purpose, are not well suited because they often
measure something different from what they say.
The second benchmark set was also put together by the Computer
Architecture group; it includes small but well-known benchmark
programs like naive reverse, map colouring, sieve, quicksort, queens,
etc.
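For reference, since it reappears below as the "naive reverse curse",
this is the usual naive reverse kernel; its inner loop is almost
nothing but list unification and cheap procedure calls, which is
exactly why it says so little about large programs:

    % The classical benchmark calls nrev/2 on a 30-element list and
    % reports the result as LIPS (logical inferences per second).
    nrev([], []).
    nrev([H|T], R) :-
            nrev(T, RT),
            append(RT, [H], R).

    append([], L, L).
    append([H|T], L, [H|R]) :-
            append(T, L, R).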
The third benchmark suite consists of large programs.
I have selected them using the following criteria:
- The program must run long enough to be safely measurable.
- The program must contain a sufficient number of different
  procedure calls; about 100 different procedures is enough.
- The program has to make at least 100,000 procedure calls,
  and these calls must not be concentrated in only one or
  two procedures.
(SEPIA contains a profiler that shows the number of ports passed
for each procedure.)
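If your system has no such profiler, a very rough approximation of
these counts can be obtained with a small counting meta-interpreter;
the sketch below is only an illustration (it handles nothing but
conjunctions, and predicate_property/2 is not available in every
system):

    :- dynamic(call_count/2).

    % solve/1 runs a goal and counts one call per predicate invocation.
    solve(true) :- !.
    solve((A, B)) :- !, solve(A), solve(B).
    solve(Goal) :-
            functor(Goal, Name, Arity),
            bump(Name/Arity),
            (   predicate_property(Goal, built_in)
            ->  call(Goal)
            ;   clause(Goal, Body),
                solve(Body)
            ).

    % bump/1 increments the counter kept for Name/Arity.
    bump(Spec) :-
            (   retract(call_count(Spec, N))
            ->  N1 is N + 1
            ;   N1 = 1
            ),
            assert(call_count(Spec, N1)).

One would run solve(Goal) instead of Goal and afterwards inspect the
call_count/2 facts.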
These programs were taken from the L.A. benchmarking workshop,
some were written at ECRC and yet others come from various sources.
With these three benchmark suites (I also have others, but they are
rather biased) one gets a reasonable feeling for the speed of a Prolog
system. Now, what is interesting is that I have run these benchmarks
with Quintus 2.0 and SEPIA 3.0.5 (academic release), and the results
are quite different:
                         time(Quintus) / time(SEPIA)
        Single features             0.79
        Small programs              0.77
        Large programs              1.33
The difference on the small benchmarks is mainly due to the fact that
SEPIA uses 32 bits for the tag and 32 for the value, and to its
ability to handle coroutining and asynchronous interrupts. For the
large programs, however, one can see that other features may actually
outweigh this. I do not want to claim here that SEPIA is faster than
Quintus; there are enough programs even in this suite where it is 40%
slower. What I do want to claim is that small benchmarks are not
enough.
My point is that Prolog needs an official benchmark suite. We have at
least to start discussing which programs should or should not be in
the suite, and what the criteria for selecting them are. Currently,
every vendor uses some benchmark suite, of course the one which makes
its own Prolog system look good, and other benchmarks are usually
rejected. We have to move away from this. I think that by now everyone
can see that Prolog systems are going to become much faster and that a
vendor will no longer be able to say "yes, on this benchmark the other
Prolog is faster, but only because it optimizes feature F, which in
real programs would not matter". Sooner or later, all features will
have to be optimized. I know it will be hard, but until we get rid of
the "naive reverse curse", Prolog will not be taken seriously.
Last year in Cleveland I talked to people from several commercial
companies, and it turned out that everyone basically agrees that a
benchmark suite is necessary, but everybody also wants to participate
in the selection. We more or less agreed that benchmark programs
should be used 'as is'. If a system has a way to generate e.g. mode
declarations automatically, it can use it, but declarations cannot be
added by hand. Everybody can of course still add some declarations and
say "our system runs the benchmark suite in X seconds, but with added
mode declarations in Y seconds!".
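Just to make the distinction concrete, a hand-added declaration would
look like the following (DEC-10-style syntax, given purely as an
example; other systems use different forms), and this is exactly what
would be banned from the 'as is' runs:

    % A hand-written declaration like this one may only appear in the
    % separately reported "with declarations" figures, never in the
    % 'as is' run.
    :- mode concatenate(+, +, -).

    concatenate([], L, L).
    concatenate([H|T], L, [H|R]) :-
            concatenate(T, L, R).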
There are always problems when running the same programs with two
different Prolog systems, because the same functionality can be
obtained in different ways; e.g. using assert/1 to implement a counter
can be fast in one system but hopelessly slow in another, which has
its own efficient primitive to handle counters. These problems have to
be handled on a case-by-case basis. The idea is that such programs
would use simple interface predicates which can be defined separately
for each Prolog system; e.g. inc_counter/1 could be defined in one
system as
    inc_counter(C) :-
            erase(C, Value),
            NewValue is Value + 1,
            recorda(C, NewValue).
in another one as
    inc_counter(C) :-
            ctr_inc(C).
or
    inc_counter(C) :-
            incval(C).
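As an illustration of how this interface would be used
(count_solutions/2 and init_counter/1 below are only my example, not
part of any agreed set): the benchmark itself calls nothing but the
interface predicates, so the system-specific definitions stay in one
small file per system.

    % Example only: count the solutions of Goal through the portable
    % counter interface. init_counter/1 is a hypothetical companion of
    % inc_counter/1; here it is written with the same recorded-database
    % primitives as the first definition above.
    init_counter(C) :-
            recorda(C, 0).

    count_solutions(Goal, C) :-
            init_counter(C),
            (   call(Goal),
                inc_counter(C),
                fail
            ;   true
            ).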
This approach should make each vendor reasonably happy, and it could
become obsolete once the ISO standard appears (but it probably will
not, because Prolog systems will continue to offer extra features to
handle frequent operations). The benchmark programs should be written
in a syntax which is not too distant from the current ISO draft, so
that they can be made ISO-compatible when needed.
I would like to collect more programs from which the benchmark suite
could be selected. Using our profiler, I can collect statistics for
every program, both at the level of box ports and at the level of WAM
instructions, and I can send these statistics to everyone who is
interested. I can also make my benchmark suites available to people
who want to participate in the selection of benchmark programs.
Since I already have enough small programs (everybody has :-)), what I
need is more large programs, but perhaps also programs tuned to test
other single features. Please, if you have some real-life programs
which are sufficiently large and which can be made public, send them
to me together with sample data and reference output. If your program
is good, you can in turn get SEPIA at a discount, so it pays off for
you. Moreover, if your program is in the benchmark suite, then a lot
of implementors all around the world will be trying to speed up *your*
program; wouldn't that be great? The best medium is a Sun cartridge,
but mailing a compressed file would do as well. Everybody who
contributes a program will be acknowledged in the suite.
--Micha Meier
Disclaimer as usual.
USA     micha%ecrc.de@pyramid.com
        pyramid!ecrc!micha
EUROPE  micha@ecrc.de
        unido!ecrc!micha
MAIL    Micha Meier
        ECRC
        Arabellastr. 17
        8000 Munich 81
        West Germany
--
USA     micha%ecrc.de@pyramid.com, pyramid!ecrc!micha
EUROPE  micha@ecrc.de, unido!ecrc!micha
ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (08/02/90)
In article <708@ecrc.de>, micha@ecrc.de (Micha Meier) writes:
[ that benchmarking large programs gives different results from
benchmarking small ones, and that we need some large ones
]
Absolutely right, but there IS a collection of larger benchmarks
that was put together at the Prolog Benchmarking Workshop at Aerospace
Corp a couple of years ago. Contact Karl Kesselman for details.
A tape is available. It isn't, I'm afraid, possible to use ISO-like
syntax yet, because nobody knows well enough what it will be like.
--
Distinguishing between a work written in Hebrew and one written in Aramaic
when we have only a Latin version made from a Greek translation is not easy.
(D.J.Harrington, discussing pseudo-Philo)
pereira@alice.UUCP (Fernando Pereira) (08/03/90)
In article <708@ecrc.de> micha@acrab4.UUCP (Micha Meier) writes:
> The Pereira benchmarks,
> used for the same purpose, are not well suited because they often
> measure something different from what they say.

I've just looked at the comments in the source for my benchmarks, and
nowhere is there a statement of what the various programs are intended
to measure, so I don't see what evidence you have for your statement.
Now, clearly I had in mind when preparing the benchmarks that they
would exercise different aspects of a Prolog implementation, but I was
familiar with enough different implementations to know that I didn't
have the time to build a benchmark suite whose components I could be
sure would measure the same individual aspects across implementations.

Otherwise, I entirely agree with and support your points, and I will
be digging into my archives to find programs that might be useful for
benchmarking.

Fernando Pereira
2D-447, AT&T Bell Laboratories
600 Mountain Ave, Murray Hill, NJ 07974
pereira@research.att.com
micha@ecrc.de (Micha Meier) (08/06/90)
In article <3502@goanna.cs.rmit.oz.au> ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:
> Absolutely right, but there IS a collection of larger benchmarks
> that was put together at the Prolog Benchmarking Workshop at Aerospace
> Corp a couple of years ago.

Yes, I have written in my message:

> These programs were taken from the L.A. benchmarking workshop,
> some were written at ECRC and yet others come from various sources.

Not all the programs from this workshop are large enough, and even
fewer call a reasonable number of procedures a reasonable number of
times.

--Micha
--
USA     micha%ecrc.de@pyramid.com     MAIL  Micha Meier
EUROPE  micha@ecrc.de                       ECRC, Arabellastr. 17
                                            8000 Munich 81
                                            West Germany
micha@ecrc.de (Micha Meier) (08/08/90)
Some issues concerning the benchmark suite:

- Several people mailed me that they would like to obtain the
  benchmark suite. I'm going to make it available as soon as I have
  enough interesting programs in it, together with test data and
  output, so that the testing can be completely automatic. The first
  problem that comes to my mind here is that the output produced by
  different Prolog systems for the same program might not be
  identical: some Prologs put spaces after commas and around
  operators, others don't, and there are probably other differences
  as well. I guess the output of some systems will have to be
  preprocessed before its correctness can be checked by comparison.
  If you are interested in obtaining the test suite, let me know. I'm
  not yet sure how I will distribute it, probably on a tape for the
  tape + mailing costs, maybe also by ftp.

- Whoever else has interesting programs: you can also mail me where
  to ftp them, it is not necessary to send the program directly.

- Although this might not be of general interest, I would also like
  to collect programs using coroutining, to be able to compare the
  performance of coroutining systems. I have some rather small
  programs, mostly puzzles, but I'm missing serious application
  programs using coroutining. Are there any such programs around?
  Small programs would do as well (but no n-queens, primes etc.).

- It is clear to me that every commercial company would like to push
  its own programs into the benchmark suite, and to avoid programs
  from other companies. I would like to make clear that ECRC, my
  employer, is a non-profit organization, and I am not dependent on
  the performance of our SEPIA system (or others) on the benchmark.
  I am trying to select the programs using objective criteria, based
  on profiling them. However, I do accept test programs even when
  they are supplied by commercial companies, if they fit into the
  frame of the suite.

--Micha Meier
--
USA     micha%ecrc.de@pyramid.com     MAIL  Micha Meier
EUROPE  micha@ecrc.de                       ECRC, Arabellastr. 17
                                            8000 Munich 81
                                            West Germany
micha@ecrc.de (Micha Meier) (08/10/90)
In article <11122@alice.UUCP> pereira@alice.UUCP (Fernando Pereira) writes:
> In article <708@ecrc.de> micha@acrab4.UUCP (Micha Meier) writes:
> > The Pereira benchmarks,
> > used for the same purpose, are not well suited because they often
> > measure something different from what they say.
>
> I've just looked at the comments in the source for my benchmarks, and
> nowhere is there a statement of what the various programs are intended
> to measure, so I don't see what evidence you have for your statement.

I'm sorry I didn't express myself more clearly. What I had in mind is
that when you run the Pereira benchmarks, you get output of the form

    tail_call_atom_atom took (1.45 - 0.533333) / 2000 = 0.000458333 seconds/iteration
    binary_call_atom_atom took (2.65 - 0.533334) / 2000 = 0.00105833 seconds/iteration
    cons_list took (2.83333 - 0.533334) / 2000 = 0.00115 seconds/iteration

etc., so without looking at the code and the comments there, one would
assume that each program measures what its predicate name says,
because for most of the benchmarks the names look self-explanatory.
The comments in the source, on the other hand, only say e.g. '% 19.
Make 1000 asserts of unit clauses', and one assumes that this is what
the benchmark measures; at least that is how I understood it. However,
this benchmark also measures list unification and tail-recursive
calls.

--Micha Meier
--
USA     micha%ecrc.de@pyramid.com     MAIL  Micha Meier
EUROPE  micha@ecrc.de                       ECRC, Arabellastr. 17
                                            8000 Munich 81
                                            West Germany