[comp.lang.prolog] Prolog Efficiency and Benchmarking

micha@ecrc.de (Micha Meier) (08/01/90)

Some people argue that Prolog is too slow, others claim  that  it  is
fast enough, yet others that it is even faster than languages like C,
for particular problems. I would like to contribute to  this  discus-
sion  from another point of view. I can say, and so can everybody who
has seriously tried to implement an efficient Prolog  compiler,  that
Prolog  could  be  made much faster than it is now, because there are
numerous redundant operations being performed all the time. Trying to
optimize them all away requires generating a lot of code to cover all
the various possibilities and generate optimal code for each of them.
With  the  use of abstract interpretation it might be possible to ob-
tain enough information to generate both good and short code, but the
current  abstract interpretation techniques are not yet mature enough
to be available in commercial systems.

Consequently, people are trying to locate the bottlenecks of Prolog
execution and to modify the compiler and the engine to get rid of
them. This, however, may only make the whole story more complicated,
because the criteria used to identify the bottlenecks are often
rather questionable. Many papers have been published which say
something like "on the set of benchmarks B the feature F turned out
to be very important, and therefore we present a WAM modification
that handles the feature F very efficiently".  In almost all cases
the benchmark suite in question includes only very short programs,
which can in no way be representative of current real-life Prolog
programs.  *Current* is important here, because some years from now
these programs may become more representative: the way of executing
them may change, the current bottlenecks may disappear, and the
feature F may really become significant. But until then, small
benchmark programs cannot be trusted! I have an interesting example
with several suites of benchmark programs which were used to test the
ECRC Sepia system.

The first one consists of the benchmark programs set up by  the  Com-
puter  Architecture  group at ECRC several years ago, and it tries to
test single Prolog actions, like creating a choice point, an environ-
ment,  dereferencing,  unification, etc.  The Pereira benchmarks used
for the same purpose are not well suited because they  often  measure
something different than they say.

The second benchmark set was also set up by the Computer Architecture
group; it includes small but well-known benchmark programs like naive
reverse, map colouring, sieve, quicksort, queens, etc.

The third benchmark suite consists of large programs.
I have selected them using the following criteria:
        - The program must run long enough to be safely measurable
        - The program must contain calls to a sufficient number of
          different procedures. 100 is enough.
        - The program has to make at least 100000 procedure calls
          and these calls must not be concentrated into one or
          two procedures only.
(SEPIA contains a profiler that shows the number of ports passed
for each procedure.)
These programs were taken from the L.A. benchmarking workshop,
some were written at ECRC and yet others come from various sources.
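
(To give an idea of what such call counts involve, here is a minimal
call-counting meta-interpreter. This is only a sketch of the idea, not
SEPIA's profiler, which works at a much lower level; it handles only
pure user-defined clauses, ignores cuts and built-ins, and the names
count_goal/1, bump/1 and call_count/2 are invented for this example.)

% Sketch only: count how often each predicate is called.
:- dynamic(call_count/2).

count_goal(true) :- !.
count_goal((A, B)) :- !,
        count_goal(A),
        count_goal(B).
count_goal(Goal) :-
        functor(Goal, Name, Arity),
        bump(Name/Arity),               % one CALL port for Name/Arity
        clause(Goal, Body),
        count_goal(Body).

bump(Pred) :-
        (   retract(call_count(Pred, N))
        ->  N1 is N + 1
        ;   N1 = 1
        ),
        assert(call_count(Pred, N1)).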

With these three benchmark suites (I also have others, but they are
rather biased) one gets a reasonable feeling for the speed of a
Prolog system. Now, what is interesting is that I have run these
benchmarks with Quintus 2.0 and SEPIA 3.0.5 (academic release) and
the results are quite different:

                                time Quintus/SEPIA
        Single features                 0.79
        Small programs                  0.77
        Large programs                  1.33

The difference for the small benchmarks is mainly due to the fact
that SEPIA uses 32 bits for the tag and 32 for the value, and to its
ability to handle coroutining and asynchronous interrupts.  For the
large programs, however, one can see that other features may outweigh
these costs.  I don't want to claim here that SEPIA is faster than
Quintus; there are enough programs even in the suite where it is 40%
slower. But I do want to claim that small benchmarks are not enough.

My point is that Prolog needs an official benchmark suite. We have at
least to start discussing which programs should or should not be in
the suite, and what criteria to use for selecting them. Currently,
every vendor uses some benchmark suite, of course the one which is
favourable to its own Prolog system, and other benchmarks are usually
rejected.  We have to move away from this. I think that by now everyone can see
that Prolog systems are going to be much faster  and  that  a  vendor
will  not  be able to say "yes, on this benchmark the other Prolog is
faster, but only because it optimizes feature F which  in  real  pro-
grams  would not matter".  Sooner or later, all features will have to
be optimized.  I know it will be hard, but until we get  rid  of  the
"naive reverse curse", Prolog will not be taken seriously.

Last year in Cleveland I talked to people from several commercial
companies, and it turned out that everyone basically agrees that
a benchmark suite is necessary, but everybody also wants to  partici-
pate  in  the  selection.  We more or less agreed that benchmark pro-
grams should be used 'as is'.  If a system has a way to generate e.g.
mode declarations automatically, it can use it, but declarations can-
not be added by hand.  Everybody can of course add some  declarations
and  say  "our  system runs the benchmark suite in X seconds but with
added mode declarations in Y seconds!".

There are always problems when running some programs  with  two  dif-
ferent Prolog systems, because the same functionality can be obtained
in different ways, and so e.g. using assert/1 to make a  counter  can
be  fast in one system, but hopelessly slow in another, which has its
own efficient primitive to handle counters.  These problems  have  to
be handled on a case-by-case basis; the idea is that some programs
would have simple predicates that can be defined for a particular
Prolog system, e.g. inc_counter(C) would be defined in one system as

inc_counter(C) :-
        erase(C, Value),
        NewValue is Value + 1,
        recorda(C, NewValue).

in another one as

inc_counter(C) :-
        ctr_inc(C).

or

inc_counter(C) :-
        incval(C).

This approach should make each vendor reasonably happy, and could be-
come  obsolete after the ISO standard appears (but probably will not,
because Prolog systems will continue to offer extra features to  han-
dle  frequent  operations).  The benchmark programs should be written
in a syntax which is not too distant from the current ISO  draft,  so
that they can be made ISO compatible when needed.
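
(The timing primitive itself also differs between systems and could be
hidden in the same way. As a sketch only, a bench/2 wrapper (the name
is invented) for a system with a Quintus-style
statistics(runtime, [Time|_]) might be defined as below; other systems
would simply redefine it with their own timer, just like inc_counter/1
above.)

% Sketch only, assuming statistics(runtime, [Millis|_]) returns the
% accumulated runtime in milliseconds.
bench(Goal, Millis) :-
        statistics(runtime, [T0|_]),
        ( call(Goal) -> true ; true ),  % time the goal whether it succeeds or fails
        statistics(runtime, [T1|_]),
        Millis is T1 - T0.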

I would like to collect more programs from which the benchmark  suite
could  be selected.  Using our profiler, I can collect the statistics
for every program, both at the level of box ports and at the level of
WAM instructions, and I can send these statistics to everyone who is
interested.  I can also make my benchmark suites available to  people
who want to participate in the selection of benchmark programs.

Since I already have enough small programs (everybody has :-)), what
I need is more large programs, but maybe also programs tuned to test
other single features.  Please, if you have some real-life programs
which are sufficiently large and which can be made public, send them
to me together with sample data and reference output. If your program
is good, you can in turn get SEPIA at a discount, so it pays off for
you.  Moreover, if your program is in the benchmark suite, then a lot
of implementors all around the world will be trying to speed up
*your* program; wouldn't that be great?  The best medium is a Sun
cartridge, but mailing a compressed file would do as well. Everybody
who contributes a program will be acknowledged in the suite.

--Micha Meier

Disclaimer as usual.

USA     micha%ecrc.de@pyramid.com
	pyramid!ecrc!micha

EUROPE  micha@ecrc.de
	unido!ecrc!micha

MAIL	Micha Meier
	ECRC
	Arabellastr. 17
	8000 Munich 81
	West Germany

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (08/02/90)

In article <708@ecrc.de>, micha@ecrc.de (Micha Meier) writes:
[ that benchmarking large programs gives different results from
  benchmarking small ones, and that we need some large ones
]
Absolutely right, but there IS a collection of larger benchmarks,
which was collected at the Prolog Benchmarking Workshop at Aerospace
Corp a couple of years ago.  Contact Karl Kesselman for details.
A tape is available.  It isn't, I'm afraid, possible to use ISO-like
syntax yet, because nobody knows well enough what it will be like.
-- 
Distinguishing between a work written in Hebrew and one written in Aramaic
when we have only a Latin version made from a Greek translation is not easy.
(D.J.Harrington, discussing pseudo-Philo)

pereira@alice.UUCP (Fernando Pereira) (08/03/90)

In article <708@ecrc.de> micha@acrab4.UUCP (Micha Meier) writes:
> The Pereira benchmarks used
>for the same purpose are not well suited because they  often  measure
>something different than they say.

I've just looked at the comments in the source for my benchmarks, and nowhere
is there a statement of what the various programs are intended to measure,
so I don't see what evidence you have for your statement. Now, clearly I
had in mind when preparing the benchmarks that they would exercise different
aspects of a Prolog implementation, but I was familiar with enough different
implementations to know that I didn't have the time to build a benchmark
suite whose components I would be sure would measure the same individual
aspects across implementations.

Otherwise, I entirely agree with and support your points, and I will be
digging into my archives to find programs that might be useful for
benchmarking.

Fernando Pereira
2D-447, AT&T Bell Laboratories
600 Mountain Ave, Murray Hill, NJ 07974
pereira@research.att.com

micha@ecrc.de (Micha Meier) (08/06/90)

In article <3502@goanna.cs.rmit.oz.au> ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:
>Absolutely right, but there IS a collection of larger benchmarks,
>which was collected at the Prolog Benchmarking Workshop at Aerospace
>Corp a couple of years ago.

	Yes, I wrote in my message

	>These programs were taken from the L.A. benchmarking workshop,
	>some were written at ECRC and yet others come from various sources.

	Not all the programs from this workshop are large enough, and even
	fewer call a reasonable number of procedures a reasonable number
	of times.

--Micha
--
USA     micha%ecrc.de@pyramid.com	MAIL	Micha Meier
						ECRC, Arabellastr. 17
EUROPE  micha@ecrc.de				8000 Munich 81
						West Germany

micha@ecrc.de (Micha Meier) (08/08/90)

Some issues concerning the benchmark suite:

	- Several people mailed me that they would like to obtain
	  the benchmark suite. I'm going to make it available as soon
	  as I have enough interesting programs in it, together
	  with test data and output so that the testing
	  can be completely automatic. The first problem that comes
	  to my mind here is that the same output produced by
	  different Prolog systems might not necessarily be identical:
	  some Prologs put spaces after commas and around operators,
	  others don't, and probably there are other differences as well.
	  I guess the output will have to be preprocessed for some systems
	  before its correctness can be checked; a sketch of one possible
	  preprocessing step is given after this list.

	  If you are interested in obtaining the test suite, let me know.
	  I'm not yet sure how I will distribute it, probably
	  on a tape for the tape+mailing costs, maybe also by ftp.

	- If you have other interesting programs, you can also mail
	  me where to ftp them; it is not necessary to send the
	  program directly.

	- Although this might not be of general interest, I would also
	  like to collect programs using coroutining, to be able to
	  compare performance of coroutining systems.
	  I have some rather small programs, mostly puzzles, but I'm
	  missing serious application programs using coroutining;
	  are there any such programs around? Small programs would
	  do as well (but no n-queens, primes etc.).

	- It is clear to me that every commercial company would like
	  to push its own programs into the benchmark suite, and
	  to avoid programs from other companies. I would like
	  to make clear that ECRC, my employer, is a non-profit
	  organization, and I'm not dependent on the performance of our
	  Sepia system (or others) on the benchmark. I am trying
	  to select the programs using some objective criteria,
	  based on profiling them. However, I do accept test programs
	  even when they are supplied by commercial companies,
	  if they fit into the framework of the suite.
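
As an illustration of the output preprocessing mentioned in the first
point above, one simple step would be to delete blanks from both the
reference output and the actual output before comparing them. The
strip_blanks/2 predicate below is only a hypothetical sketch, not part
of any existing suite; real comparisons may need more than this.

% Sketch only: remove spaces (code 32) and tabs (code 9) from a list
% of character codes, so that two output lines can be compared
% independently of blank placement.
strip_blanks([], []).
strip_blanks([C|Cs], Stripped) :-
        (   ( C =:= 32 ; C =:= 9 )
        ->  strip_blanks(Cs, Stripped)
        ;   Stripped = [C|Rest],
            strip_blanks(Cs, Rest)
        ).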

--Micha Meier
--
USA     micha%ecrc.de@pyramid.com	MAIL	Micha Meier
						ECRC, Arabellastr. 17
EUROPE  micha@ecrc.de				8000 Munich 81
						West Germany

micha@ecrc.de (Micha Meier) (08/10/90)

In article <11122@alice.UUCP> pereira@alice.UUCP () writes:
>In article <708@ecrc.de> micha@acrab4.UUCP (Micha Meier) writes:
>> The Pereira benchmarks used
>>for the same purpose are not well suited because they  often  measure
>>something different than they say.
>
>I've just looked at the comments in the source for my benchmarks, and nowhere
>is there a statement of what the various programs are intended to measure,
>so I don't see what evidence you have for your statement.

I'm sorry I didn't express myself more clearly. What I had in mind is that
when you run the Pereira benchmarks, you get output of the form

tail_call_atom_atom took (1.45 - 0.533333) / 2000 = 0.000458333 seconds/iteration
binary_call_atom_atom took (2.65 - 0.533334) / 2000 = 0.00105833 seconds/iteration
cons_list took (2.83333 - 0.533334) / 2000 = 0.00115 seconds/iteration
etc.

so without looking at the code and the comments there, one would assume
that the program measures what the predicate names say, because for most
of the benchmarks the names look self-explanatory.
On the other hand, the comments in the source only
say e.g. '% 19. Make 1000 asserts of unit clauses', and one assumes
that this is what is being measured by the benchmark, or at least that
is how I understood it. However, this benchmark also measures list
unification and tail-recursive calls.
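
To make this concrete, the driver of such a benchmark typically has
the shape sketched below (this is only a schematic reconstruction, not
the actual Pereira code): the loop that performs the asserts also
unifies a list cell and makes a tail-recursive call on every
iteration, so those costs are measured together with assert/1.

% Schematic shape of such a driver, not the actual benchmark code.
assert_units([]).
assert_units([X|Xs]) :-         % unifies a list cell
        assert(unit(X)),        % the assert being measured
        assert_units(Xs).       % tail-recursive call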

--Micha Meier

--
USA     micha%ecrc.de@pyramid.com	MAIL	Micha Meier
						ECRC, Arabellastr. 17
EUROPE  micha@ecrc.de				8000 Munich 81
						West Germany