[comp.lang.prolog] Proof that Prolog can be faster than C

vanroy@pisces.uucp (06/13/90)

Benchmark timings:

Benchmark	Prolog  	C (no opt.)	C (opt. 4)

tak(24,16,8)    1.2		2.1		1.6

All three timings are in user seconds (with 'time') measured on a 25 MHz MIPS
processor.  The Prolog version is compiled with the Aquarius Prolog compiler
under development at Berkeley.  The C versions are compiled with the MIPS C
compiler, with no optimization and full optimization (level 4).

Disclaimer: this benchmark has a particularly easy translation to C.
The result does not necessarily hold for other programs.  But don't
you believe it anymore that Prolog is slow!

	Peter Van Roy

-------------------------------------------------------------------------------

/* C version of tak benchmark */

#include <stdio.h>

int tak(x,y,z)
int x, y, z;
{
  int a1, a2, a3;
  if (x <= y) return z;
  a1 = tak(x-1,y,z);
  a2 = tak(y-1,z,x);
  a3 = tak(z-1,x,y);
  return tak(a1,a2,a3);
}

main()
{
  printf("%d\n", tak(24, 16, 8));
}

-------------------------------------------------------------------------------

/* Prolog version of tak benchmark */

main :- tak(24,16,8,X), write(X), nl.

tak(X,Y,Z,A) :- X =< Y, Z = A.
tak(X,Y,Z,A) :- X > Y,
        X1 is X - 1, tak(X1,Y,Z,A1),
        Y1 is Y - 1, tak(Y1,Z,X,A2),
        Z1 is Z - 1, tak(Z1,X,Y,A3),
        tak(A1,A2,A3,A).

-------------------------------------------------------------------------------

christer@cs.umu.se (Christer Ericson) (06/13/90)

In article <36986@ucbvax.BERKELEY.EDU> vanroy@pisces.uucp () writes:
>
>[Stuff about some particular Prolog implementation being faster that some 
>other particular C implementation deleted]
>
>Disclaimer: this benchmark has a particularly easy translation to C.
>The result does not necessarily hold for other programs.  But don't
>you believe it anymore that Prolog is slow!
>

This is silly! Haven't you learned yet that no language is faster/slower than
another language - it's the implementation that's faster/slower than that
other implementation. The worst thing I know is people saying things like
C is faster than Pascal, because it isn't. Most C compilers generate faster
code than most Pascal compilers, true, but that's only because C heavily
depends upon (read allows) the programmer to optimize the code while the
Pascal compiler has to figure it all out by itself.

>	Peter Van Roy

/Christer

| Christer Ericson            Internet: christer@cs.umu.se [130.239.1.101] |
| Department of Computer Science, University of Umea, S-90187 UMEA, Sweden |
|                            "I bully sheep. I claim God doesn't exist..." |

richard@aiai.ed.ac.uk (Richard Tobin) (06/14/90)

>This is silly! Haven't you learned yet that no language is faster/slower than
>another language - it's the implementation that's faster/slower than that
>other implementation. 

I think Peter knows all about that.  There is a widespread belief that
languages like prolog can never be as fast as languages like C
(especially for numerical applications); the figures quoted are a
counterexample to that.

I would be interested to know if there are any Lisp compilers that
can produce comparable or better results. 

-- Richard
-- 
Richard Tobin,                       JANET: R.Tobin@uk.ac.ed             
AI Applications Institute,           ARPA:  R.Tobin%uk.ac.ed@nsfnet-relay.ac.uk
Edinburgh University.                UUCP:  ...!ukc!ed.ac.uk!R.Tobin

hawley@icot32.icot.or.jp (David John Hawley) (06/14/90)

In article <1990Jun13.100341.7148@cs.umu.se> christer@cs.umu.se (Christer Ericson) writes:
>This is silly! Haven't you learned yet that no language is faster/slower than
>another language - it's the implementation that's faster/slower than that

Well, of course the point is that Prolog is a much higher-level
language than C type languages, but we don't want to pay for all that
expressiveness all the time. I.e. the speed of our programs should
relate to the algorithms, not to the languages in which they are
expressed. So Prolog needs GREAT compilers and implementations, and
THEN we can write SHORT, SIMPLE, and FAST programs.

Well, I tried the given benchmark on my SUN/?:
	C version       	  - 8.7u
	SICstus Prolog(compiled)  - mem ovflw after > 1 min of real time

So how did he do it? Inquiring minds (and inquisitive noses) want to know.

---------------------------
David Hawley, ICOT, 4th Lab
csnet: hawley%icot.jp@relay.cs.net uucp:{enea,inria,mit-eddie,ukc}!icot!hawley
ICOT, 1-4-28 Mita, Minato-ku, Tokyo 108 JAPAN. TEL/FAX {81-3-456-}2514/1618

jwmills@iuvax.cs.indiana.edu (Jonathan Mills) (06/14/90)

From Peter Vanroy:

>Benchmark timings:
>
>Benchmark       Prolog          C (no opt.)     C (opt. 4)
>
>tak(24,16,8)    1.2             2.1             1.6
>fib(30)         1.5             2.0             1.6
>
>All three timings are in user seconds (with 'time') measured on a 25
>MHz MIPS processor.  The Prolog version is compiled with the
>Aquarius Prolog compiler under development at Berkeley.  The C
>versions are compiled with the MIPS C compiler, with no optimization
>and full optimization (level 4).
>
>Because older implementations of Prolog have been relatively slow,
>it is sometimes believed that this is inherent in the language.  The
>purpose of my original message was to dispel that notion by
>providing measurements of an optimizing Prolog compiler.



These are encouraging benchmarks, but should be generalized
cautiously.  First, to place the performance of the Aquarius Prolog
compiler in perspective, the times for the call:

	tak(18,12,6)

should be posted.  This would allow the direct comparison of the
times with those reported by by Richard P. Gabriel in "Performance
and Evaluation of Lisp Systems", MIT Press, 1974.  Comparing Prolog
and Lisp implementations is not unreasonable because of the
similarities between implementation techniques, particularly for
simple programs.



From Richard O'Keefe:

>This particular benchmark result tells us precisely two things:
>(1) _one_ Prolog system on that machine has cheaper procedure calls
>    than _one_ C compiler.  Good work!  But not very surprising
>     given that C compiler writers tend to worry more about other
>     things.
>
>(2) that particular Prolog system handles integer arithmetic better
>    than a lot of other Prolog systems.



Gabriel reports that the "significant Lisp-level operations" for
tak(18,12,6) are composed of:

     Calls to TAK                    63,609
     1-'s (decrements)               47,706

Since the Prolog code listed by Vanroy is a direct translation of the
TAK benchmark, these numbers and their significance _probably_ apply. 
The prevalence of function calls and decrements in this benchmark
limits its general significance.

The fib/1 benchmark is similarly likely to be a limited measure of
general performance since it is also composed primarily of function
calls (to FIB) and decrements (N - 1 and N - 2).



From Richard O'Keefe:

>It doesn't tell us how C and Prolog would compare on realistic
>applications, and there the pertinent question tends to be how easy
>it is to handle a good data structure.  I've an example where I
>speeded a program up quite substantially by translating from Fortran
>to Prolog, the speedup really came from using a much better data
>structure, but it was much easier to use that data structure in
>Prolog.



Gabriel has other benchmarks that are better suited to evaluate the
general performance of the Aquarius Compiler.  Two that appear well
suited are the BOYER and BROWSE benchmarks.

BOYER is a tiny theorem prover intended to give plausible estimates
of the performance of the Boyer-Moore theorem prover on various Lisp
systems.  BROWSE is a compendium of operations that an expert system
might use.  Both of these benchmarks could be translated to advantage
in Prolog, since they have a substantial amount of list manipulation
and pattern matching.

Ross Overbeek has a theorem prover whose performance in a variety of
Prologs was compared to C at a benchmarker's workshop held at The
Aerospace Corporation in 1988.  Carl Kesselman sponsored the
workshop, and may be able to provide source for a test.



From Richard Tobin:

>I would be interested to know if there are any Lisp compilers that
>can produce comparable or better results.



Gabriel has no results for a 25MHz MIPS machine (wasn't available),
but does list some other interesting results for TAK:

IMPLEMENTATION            CPU             RAW TIME (sec)
--------------            ---------       --------------
Portable Standard Lisp    Cray-1           0.044
Symbolics                 3600+IFU         0.43
Assembler                 68000            0.7
Franz Lisp                VAX 780          1.09
Common Lisp               VAX 785          1.18
Portable Standard Lisp    Sun/2 (68010)    1.44
Common Lisp               VAX 780          1.83
C                         68000            1.9
Franz Lisp                68000            2.37
INTERLISP                 Dolphin         11.12


It would also be interesting to compare the performance of commercial
Prolog compilers (ALS, BIM, Quintus) on various machines to see how
they handle TAK.  While older Prolog implementations were slow, newer
implementations have moved to native code, and impressive speeds. A
recent claim for ALS Prolog (using that hoary chestnut, NREV, sorry :-)
gave in excess of one megaLIPS on the 88000, and 100+ KLIPS on a Mac IIfx.
Perhaps they would be willing to key in TAK and post the timings.

In closing, designing and interpreting benchmarks offer tempting
targets for implementors and their critics.  Gabriel's book suggests
some strategies to follow (and avoid);  there are many other articles
out there also (Jack Dongarra has published advice for designers of
benchmarks for supercomputers, for example).

Vanroy's results are encouraging, and (hopefully) will lead to even
better comparisons.  Let's support his work:  I'd like to suggest
that anyone with additional "points" on the graph post their results
to the net.  This would help Prolog implementors today just as
Gabriel's work goaded Lisp implementors in the 1980's.

kdq@demott.COM (Kevin D. Quitt) (06/15/90)

In article <47658@iuvax.cs.indiana.edu> jwmills@iuvax.cs.indiana.edu (Jonathan Mills) writes:
>but does list some other interesting results for TAK:
>
>IMPLEMENTATION            CPU             RAW TIME (sec)
>--------------            ---------       --------------
>Portable Standard Lisp    Cray-1           0.044
>Symbolics                 3600+IFU         0.43
>Assembler                 68000            0.7
>Franz Lisp                VAX 780          1.09
>Common Lisp               VAX 785          1.18
>Portable Standard Lisp    Sun/2 (68010)    1.44
>Common Lisp               VAX 780          1.83
>C                         68000            1.9
>Franz Lisp                68000            2.37
>INTERLISP                 Dolphin         11.12
>

    Just for fun, on a Motorola Delta 3608 (25MHz 68030),

cc:	gnucc:

0.6	0.15	cc tak.c
0.3	0.3	cc -O tak.c


    and on a TwinHead 386/25, Microsoft C 6.0

0.28	/Ox /Gs
0.22	/Ox /Gsr

-- 
 _
Kevin D. Quitt         demott!kdq   kdq@demott.com
DeMott Electronics Co. 14707 Keswick St.   Van Nuys, CA 91405-1266
VOICE (818) 988-4975   FAX (818) 997-1190  MODEM (818) 997-4496 PEP last

                96.37% of all statistics are made up.

dan@sics.se (Dan Sahlin) (06/15/90)

David Hawley got the following result when trying tak(24,16,8,X) on SICStus:

> SICstus Prolog(compiled)  - mem ovflw after > 1 min of real time

SICStus Prolog runs out of memory because a lot of choicepoints are
created and a lot of trailing is done.  If a cut is put into the first
clause of "tak" no choicepoints will be created we will not run out of
memory.  Presumably Van Roy's compiler has detected that this cut may be
safely inserted.

Running tak(24,16,8,X) using SICStus Prolog on a SPARCstation 1 having
compiled the program into emulated code then takes about 77 seconds.  We
also compiled the program to native code for SPARC (this version of
SICStus has not yet been released) and it took about 25 seconds.  As the
SPARCstation has about half the number of MIPS as Van Roy's machine, we
still have a factor of ten of execution performance improvements to get
down to his figure of just 1.2 seconds.

>So how did he do it? Inquiring minds (and inquisitive noses) want to know.

Yes, please tell us how it was done!

	Dan Sahlin
email: dan@sics.se

mmh@cs.qmw.ac.uk (Matthew Huntbach) (06/16/90)

What is the point of using Prolog if you don't backtrack?
All you have is some syntactic sugar for a subset of functional
languages (i.e. instead of fun f(pat1) = f(g(h(p))) | f(pat2) = [];
you have f(pat1,q) :- h(p,p1),g(p1,p2),f(p2,q).
         f(pat2,[])

instead of fun f(x) = if a(x) then b(x) else c(x)
you have f(x,y) :- a(x),!,b(x,y).
         f(x,y) :- c(x,y). ).

The comparisons that are being made seem to be for
non-backtracking problems.

If backtracking is important, then comparisons should be made
with examples where Prolog backtracks. If it isn't, all we are
saying is that Prolog has a nicer sugar coating than its
rivals, but this sugar coating doesn't (as you might suppose)
hide any nasty-tasting inefficiency.

Matthew Huntbach

vanroy@pisces.uucp (06/16/90)

Several people have asked me to post the code and/or explain what kind of
optimizations are done.  Aw shucks, don't you know that a magician doesn't
like to give away his secrets?

I will answer the second question first.  The tak benchmark is able to
take advantage of several optimization techniques:

(1) Deterministic predicates.  The compiler is able to recognize determinacy
    in the program.  The Prolog code given for tak uses a tremendous amount of
    memory on systems like Quintus or Sicstus Prolog because they will create
    a choice point for each call.  A cut is needed to make it run for them.
(2) Simplifying variable binding.  The compiler recognizes that the variables
    used to bind the output of tak can be implemented in a simpler way than
    the default method of creating a memory cell pointing to itself, passing
    it into the routine, and binding it there.  Tak does recursive calculations
    on integers (i.e. objects that fit in a register), so the compiler is able
    to output the results directly in registers.  This is a general technique
    which is also used to implement integer arithmetic.
(3) Last call optimization.  The last call to tak is compiled as a jump, i.e.
    recursion is converted to iteration.  This is not done by the C compiler,
    and measurements with the pixie utility on the MIPS show that this explains
    a large part of the speed difference.  This optimization is essential for
    Prolog because the only way to denote iteration in Prolog is by using
    recursion.  C has other constructs to denote iteration (i.e. "for" and
    "while" loops) so it does not need last call optimization as badly as
    Prolog does.
(4) Dataflow analysis.  The compiler incorporates a global dataflow analysis
    module to do type inferencing.  For tak, the results of analysis are crucial
    for optimization (2) above, and they also make it unnecessary to do type
    checking, dereferencing, and trailing.  To do analysis the programmer has
    to declare only the types of the entry points.  For tak this is a single
    directive:
	:- entry((main:-true)).
    No other declarations are used.

For those who are interested, the paper "The Benefits of Global Flow Analysis
for an Optimizing Prolog Compiler", to appear in the 1990 North American
Conference on Logic Programming, describes techniques (2) and (4) in detail
and gives more numbers (for the BAM processor).

The compiler has only recently been ported to the MIPS and the run-time system
is not yet completely functional there.  I plan to extend the comparison to
other programs (such as Overbeek's theorem prover).  The BAM version is
completely functional, but the BAM processor does not have a C compiler.

The Aquarius compiler does not do memoization.  However, there is no reason
a smart (*very* smart) compiler couldn't do it for tak.  The C compiler could
do it as well.  This is an orthogonal issue to the performance comparison.

By the way, the execution time for tak(18,12,6), given by Gabriel in his study
of Lisp performance, is 31 ms for Aquarius Prolog on a 25 MHz MIPS processor.

Sincerely,
    Peter Van Roy

-------------------------------------------------------------------------------

In answer to the first question, here is the MIPS assembly code generated by
the Aquarius compiler for the tak benchmark:
(I've left out the code for the run-time system.)

	.data
	.align	3
$32:
	.word	7
	.word	atm_string_0+1
atm_string_0:
	.word	1463
	.word	1f+1
1:
	.word	1495
	.word	3
	.text
	.align	2
 #	procedure(tak/0)
	.globl	tak_0
	.ent	tak_0
tak_0:
$33:
	.set	noreorder
	sltu	$14, $17, $16
	bne	$14, 0, 1f
	move	$14, $16
	move	$16, $17
 #	label(1)
1:
	addu	$16, 8
	sw	$14, -8($16)
	sw	$31, -4($16)
	.set	reorder
	li	$2, 295
	li	$3, 199
	li	$4, 103
	jal	$34
	move	$2, $5
	jal	$35
	lw	$31, -4($16)
	lw	$16, -8($16)
	j	$36
	.end	tak_0
 #	procedure(tak/4)
	.globl	tak_4
	.ent	tak_4
tak_4:
$34:
	bgt	$2, $3, $37
	move	$5, $4
	j	$31
 #	label('$b_tak/4_2'/4)
$37:
	.set	noreorder
	sltu	$14, $17, $16
	bne	$14, 0, 1f
	move	$14, $16
	move	$16, $17
 #	label(1)
1:
	addu	$16, 32
	sw	$14, -8($16)
	sw	$31, -4($16)
	.set	reorder
	sw	$2, -24($16)
	sw	$3, -20($16)
	sw	$4, -28($16)
	addu	$2, $2, -16
	jal	$34
	sw	$5, -16($16)
	lw	$14, -20($16)
	addu	$2, $14, -16
	lw	$3, -28($16)
	lw	$4, -24($16)
	jal	$34
	sw	$5, -12($16)
	lw	$14, -28($16)
	addu	$2, $14, -16
	lw	$3, -24($16)
	lw	$4, -20($16)
	jal	$34
	lw	$2, -16($16)
	lw	$3, -12($16)
	move	$4, $5
	lw	$31, -4($16)
	lw	$16, -8($16)
	j	$34
	.end	tak_4

-------------------------------------------------------------------------------

dankg@volcano.Berkeley.EDU (Dan KoGai) (06/16/90)

In article <36986@ucbvax.BERKELEY.EDU> vanroy@pisces.uucp () writes:
> [benchmark omitted]
>
>/* C version of tak benchmark */
>
>#include <stdio.h>
>
>int tak(x,y,z)
>int x, y, z;
>{
>  int a1, a2, a3;
>  if (x <= y) return z;
>  a1 = tak(x-1,y,z);
>  a2 = tak(y-1,z,x);
>  a3 = tak(z-1,x,y);
>  return tak(a1,a2,a3);
>}

	But this C code is hardly optimum.  1st of all you don't need
a1, a2, a3 to waste space and time. try the following.

int tak(int x, int y, int z)
{
	return (x <= y) ? z : tak(tak(--x, y, z),tak(--y, z, x),tak(--z, x, y))

}

> [Prolog version omitted]

	And it's also unfair to compare 2 versions without posting the
assembly codes both versions made.  Both C and Prolog allow programmers
to write various codes doing the same thing.

----------------
____  __  __    + Dan The Walking C Obfscunator
    ||__||__|   + E-mail:       dankg@ocf.berkeley.edu
____| ______    + Voice:        +1 415-549-6111
|     |__|__|   + USnail:       1730 Laloma Berkeley, CA 94709 U.S.A
|___  |__|__|   +
    |____|____  + "Unix is not insecure. It's people who are"
  \_|    |      + "Unix doesn't have pain.  It's people who do"

vanroy@pisces (Peter Van Roy) (06/16/90)

In article <1990Jun15.205242.16314@agate.berkeley.edu> dankg@volcano.Berkeley.EDU (Dan KoGai) writes:

>	But this C code is hardly optimum.  1st of all you don't need
>a1, a2, a3 to waste space and time. try the following.
>
>int tak(int x, int y, int z)
>{
>	return (x <= y) ? z : tak(tak(--x, y, z),tak(--y, z, x),tak(--z, x, y))
>
>}
>
>> [Prolog version omitted]
>
>____  __  __    + Dan The Walking C Obfscunator

Did you try to run this program?
It behaves very strangely:
(after adding a semicolon to the return statement)

	mammoth:~/Peter <20> cc -g taktak.c
	mammoth:~/Peter <21> dbx a.out
	dbx version 2.0
	Copyright 1988 MIPS Computer Systems Inc.
	Type 'help' for help.
	reading symbolic information ...
	main:  15  printf("%d\n", tak(24, 16, 8));
	(dbx) run
	^C				<- interrupted after ~10 seconds
	Interrupt [tak:10 +0x28,0x400220]
        	tak(--z, x, y));
	(dbx) print x,y,z
	-44896 -44897 -44898		<- strange values

From page 186 of K&R:
(note that predecrement does a side-effect)

	[Regarding function evaluation] The order of evaluation of arguments
	is undefined by the language; take note that the various compilers
	differ.

By the way, I did try a different version of tak.c without the temporary
variables a1, a2, and a3.  It ran *slower* unoptimized and at exactly the
same speed optimized.

Please think before you post.
Sincerely,
	Peter Van Roy

grover@brahmand.Eng.Sun.COM (Vinod Grover) (06/16/90)

In article <1990Jun15.205242.16314@agate.berkeley.edu> dankg@volcano.Berkeley.EDU (Dan KoGai) writes:
>In article <36986@ucbvax.BERKELEY.EDU> vanroy@pisces.uucp () writes:
>a1, a2, a3 to waste space and time. try the following.
>
>int tak(int x, int y, int z)
>{
>  return (x <= y) ? z : tak(tak(--x, y, z),tak(--y, z, x),tak(--z, x, y))
>}
This program has a missing semicolon, and runs into an infinite loop. 
C does not require the side-effect in --x to happen immediately. You must
force the subtraction. Furthermore, eliminating explicit temp variables does
not mean that the compiler won't put them back in. However, declaring those
explicit temps to be register variables *may* improve the program, but is
not guarenteed to do so. 

Vinod Grover