[comp.lang.scheme] Benchmarking Scheme

jaffer@zurich.ai.mit.edu (Aubrey Jaffer) (05/02/91)

At the end of this message is benchmark code in C and in Scheme.  My
original hope was that computing the ratios of execution time for the
2 programs would allow me to roughly measure speed of an
implementation independent of the hardware.  There are problems with
this approach.

The benchmark programs take an argument and compute that many digits
of pi.  The programs take time proportional to the square of the
number of digits.  This allows one to compensate for widely differing
times of computation.

When I measure the times on my 10 MHz 68000 I get:
(pi 100)       takes 33.9 secs (scm2d)
time pi 200    takes 7.1 secs
33.9/(7.1/4) ==> 19 times slower than C

But on kleph.ai.mit.edu (which I think has a 68020) I get:
(pi 100)       takes 6.6 secs (scm2d)
time pi 800    takes 7.6 secs
6.6/(7.6/64) ==> 56 times slower than C ??!

What is going on here?  J. Finger suggests that kleph has an
instruction cache.  That would mean that the C inner loop stays in the
cache while the scm interpreter doesn't.

So what are the prospects for machine independent performance
measures?  An algorithm with a more complicated inner loop would keep
the C program out of cache.  But such algorithms tend either to be
artificially lengthened or lack the nice scaling properties of the pi
programs.  Part of the appeal to me of the pi program is that it does
an actual difficult computation with the best code I can devise.
Previously I used an intentionally less than optimal Fibonacci.

An argument can also be made that the cached figures are more
realistic for everyday programming problems.  This has an interesting
consequence in that the penalty for using interpreters on newer
machines is more severe than on old ones.

So if anyone can suggest solutions to these problems I would be very
interested.  By the way, on kleph MITScheme takes 22 secs for (pi 100)
interpreted (185 times slower) and about 8 secs for (pi 100) compiled
(about 67 times slower) so I guess my interpreter is not doing too
badly.

/*  'Spigot' algorithm originally due to Stanley Rabinowitz */
/*
main(c,v)
int c;char **v;{
  int n=200,j,m,b=2,k,t,r=1e5;
  long q;
  short *a;
  if(c==2)n=atoi(v[1])/5+1;
  k=m=3.322*n*5;
  a=calloc(1+m,2);
  while(k)a[--k]=2;
  for(a[m]=4;j<n;b=q%r){
    q=0;
    for(k=m;k;){
      q+=a[k]*r;
      t=(2*k+1);
      a[k]=q%t;
      q=q/t;
      q*=k--;}
    printf("%05d%s",b+q/r,++j%10?"  ":"\n");}
  puts("");}
*/

                       main
 (c,v)int c;char**v;{int n=200,j,m,
b=2,k,t,r=1e5;long q;short*a;if(c==2
 )n=atoi(v[1])/5+1;k=m=3.322*n*5;a=
       calloc(       1+m,2)
       ;while        (k)a[
       --k]=2        ;for(a
       [m]=4;        j<n;b=
       q%r){q        =0;for
       (k=m;k        ;){q+=
       a[k]*r        ;t=(2*
       k+1);         a[k]=q
       %t;q=         q/t;q*
       =k--;}        printf(
      "%05d%s"       ,b+q/r,
      ++j%10?        " ":"\n"
      );}puts         ("");}
  
===========================================================

(define r 100000)
(define (pi n)
  (let* ((n (+ (quotient n 5) 1))
	 (m (quotient (* n 5 3322) 1000))
	 (a (make-vector (+ 1 m) 2)))
    (vector-set! a m 4)
    (do ((j 1 (+ 1 j))
	 (q 0 0)
	 (b 2 (remainder q r)))
	((> j n))
      (do ((k m (- k 1)))
	  ((zero? k))
	(set! q (+ q (* (vector-ref a k) r)))
	(let ((t (+ 1 (* 2 k))))
	  (vector-set! a k (remainder q t))
	  (set! q (* k (quotient q t)))))
      (let ((s (number->string (+ b (quotient q r)))))
	(do ((l (string-length s) (+ 1 l)))
	    ((>= l 5) (display s))
	  (display #\0)))
      (display (if (zero? (modulo j 10)) #\newline #\space)))
    (newline)))

sfk@otter.hpl.hp.com (Steve Knight) (05/02/91)

Whenever the topic of benchmarks comes around, I am always impressed by
the subtle depth of the problems.  It seems to be an area in which the demand
for simple answers is much greater than the supply.

> So what are the prospects for machine independent performance
> measures?

Very poor.  Folks are simply too incautious about the interpretation of
benchmarks in any case.  

(A case in point -- some folks I know were doing a whole series of benchmarks 
on some software.  They wrote up all the tests real nice.  However, I couldn't
make head or tail of their figures when I compared it against other results
I'd checked out.  So I got on the machine & tried the tests.  My first test was
to check out the timing software they were using.  It was wrong by a factor
of 2.  This is by no means a unique example.)

I guess I can add one point.  When doing benchmarks across machines, it is 
important to establish what the timing and storage measurements are actually 
measuring.  Does the timing include process swapping, page swapping, and other
admin overheads?  Does it correlate with a stopwatch?  Does it include
time spent in the kernel?  Does it include garbage collection time (how many
were there)?  Do two C programs of known relative performance give equal
performance ratios on a range of machines?  If you're on 68xxx machines,
do the same test but avoiding 68020 instructions, to check if it's
ineffective use of the 68020 instruction set that causes the
difficulties.  And so on.

Steve

huntley@copper.ucs.indiana.edu (Haydn Huntley) (05/02/91)

In article <JAFFER.91May1163805@kleph.ai.mit.edu> jaffer@zurich.ai.mit.edu (Aubrey Jaffer) writes:
| At the end of this message is benchmark code in C and in Scheme.  My
| original hope was that computing the ratios of execution time for the
| 2 programs would allow me to roughly measure speed of an
| implementation independent of the hardware.  There are problems with
| this approach.
| 
| The benchmark programs take an argument and compute that many digits
| of pi.  The programs take time proportional to the square of the
| number of digits.  This allows one to compensate for widely differing
| times of computation.
| 
| When I measure the times on my 10Mhz 68000 I get:
| (pi 100)       takes 33.9 secs (scm2d)
| time pi 200    takes 7.1 secs
| 33.9/(7.1/4) ==> 19 times slower than C
| 
| But on kleph.ai.mit.edu (which I think has a 68020) I get:
| (pi 100)       takes 6.6 secs (scm2d)
| time pi 800    takes 7.6 secs
| 6.6/(7.6/64) ==> 56 times slower than C ??!
| 
| What is going on here?  J. Finger suggests that kleph has an
| instruction cache.  That would mean that the C inner loop stays in the
| cache while the scm interpreter doesn't.

I also performed some similar benchmarks on a VAX 8650 and a Sun 3,
and compared your C version compiled with GCC, and your Scheme version
compiled with Chez Scheme 3.2 and found that the C versions were 20 to
30 times faster.  The times I measured were:

On a VAX 8650 (copper.ucs.indiana.edu)
(pi 100)       takes   1.84 secs (Chez Scheme 3.2)
(pi 800)       takes 106.93 secs (Chez Scheme 3.2)
time pi 100    takes    .06 secs (GCC 1.39)
time pi 800    takes   3.54 secs (GCC 1.39)
ratio of Scheme to C is approx. 30

On a Sun 3 with a 16 MHz 68020 (cssun2.ucs.indiana.edu)
(pi 100)       takes   4.84 secs (Chez Scheme 3.2)
(pi 800)       takes 246.46 secs (Chez Scheme 3.2)
time pi 100    takes    .2  secs (GCC 1.39)
time pi 800    takes  10.0  secs (GCC 1.39)
ratio of Scheme to C is approx. 24

So what is going on here?  Well, simply put, Scheme is not nearly as
efficient a language for executing this benchmark as C is.  In
general, Scheme can be a more convenient language to program because
it does garbage collection for you, while a language like C requires
more care when doing dynamic memory allocation and reclamation.  On
the other hand, C is usually much more efficient, because C compilers
can usually store variables in registers, and at the worst, they can
always put them on the stack, while Scheme compilers sometimes must
store things out in the heap, and use registers less advantageously
than C compilers do.

In addition, Scheme has latent typing and generic arithmetic, which
require tests to be made at run-time, while C has static typing, so it
doesn't have to do any tests on the types of variables.

I would guess (and I'm just a lowly CS grad student) that the
difference between using a cache, and missing its use, typically
would only account for a speed difference of at most a factor of 4,
and more probably only a factor of 2.

Did you use the same compiler and same options on both your 68000 and
68020?  Do they both run at the same clock rate?  Your C benchmarks
show that kleph is 8 times faster than your 68000!

| So what are the prospects for machine independent performance
| measures?  An algorithm with a more complicated inner loop would keep
| the C program out of cache.  But such algorithms tend either to be
| artificially lengthened or lack the nice scaling properties of the pi
| programs.  Part of the appeal to me of the pi program is that it does
| an actual difficult computation with the best code I can devise.
| Previously I used an intentionally less than optimal Fibonacci.
| 
| An argument can also be made that the cached figures are more
| realistic for everyday programming problems.  This has an interesting
| consequence in that the penalty for using interpreters on newer
| machines is more severe than on old ones.
| 
| So if anyone can suggest solutions to these problems I would be very
| interested.  By the way, on kleph MITScheme takes 22 secs for (pi 100)
| interpreted (185 times slower) and about 8 secs for (pi 100) compiled
| (about 67 times slower) so I guess my interpreter is not doing too
| badly.

I'd say that your interpreter seems to be incredibly fast!  Is it
really an interpreter or actually a compiler?  Does it support generic
arithmetic?  (fixnums, bignums, and flonums)  Does it do run-time
checks to see if the heap is nearly full?  Does it perform any type
checks at run-time?  

Please don't laugh at these questions, but I've been working on a
Scheme compiler this semester, and I learned how expensive it is to
implement generic arithmetic, garbage collection, and type checking!
Each one of them costs something like a factor of 2 in performance!

--Haydn

--
;;  *****************************************************
;;  *  Haydn Huntley    huntley@copper.ucs.indiana.edu  *
;;  *****************************************************

boehm@parc.xerox.com (Hans Boehm) (05/03/91)

huntley@copper.ucs.indiana.edu (Haydn Huntley) writes:
>...  I learned how expensive it is to
>implement generic arithmetic, garbage collection, and type checking!
>Each one of them costs something like a factor of 2 in performance!

>--Haydn

At least in the case of garbage collection, I think this statement is a bit
misleading.  Depending on your ratio of accessible to available memory,
caching effects, frequency of side effects, etc., allocating objects on the
heap instead of stack allocating them can easily cost you a factor of 2
in performance.  But if you compare garbage collection to heap allocation
and explicit deallocation (as with C malloc/free), it's not clear that
garbage collection costs you anything in total execution time.  I've seen
cases, running with a trace and sweep allocator, where inserting explicit
deallocations slowed down the code.  It's often cheaper for a garbage
collector to restore a large collection of objects to free lists at once,
than it is to deallocate them one at a time.  Looking only at cpu cycles,
this effect should usually be more pronounced for copying collectors.  (Copying
collectors have other disadvantages, but that's another story.)

I suspect the problem with this example is some combination of superfluous
allocation and generic arithmetic.  The example isn't entirely fair
to Scheme either, since there are other ways to solve this problem that
actually use Scheme's builtin BigNums, and are probably much faster.  The
C program is reimplementing BigNums in a pretty clumsy way. (Our desk
calculator utility computes 3000 digits of PI in about 10 seconds on a
SPARCstation 2, using a fancier algorithm, and taking advantage of BigNum
arithmetic.  It's written in Russell, but spends most of its time in the
DEC/INRIA BigNum package, which is written in C.)

Hans
(boehm@xerox.com)

jinx@zurich.ai.mit.edu (Guillermo J. Rozas) (05/03/91)

    | So if anyone can suggest solutions to these problems I would be very
    | interested.  By the way, on kleph MITScheme takes 22 secs for (pi 100)
    | interpreted (185 times slower) and about 8 secs for (pi 100) compiled
    | (about 67 times slower) so I guess my interpreter is not doing too
    | badly.

    I'd say that your interpreter seems to be incredibly fast!  Is it
    really an interpreter or actually a compiler?  Does it support generic
    arithmetic?  (fixnums, bignums, and flonums)  Does it do run-time
    checks to see if the heap is nearly full?  Does it perform any type
    checks at run-time?  

The numbers don't tell the whole story.  When analyzing benchmarks you
have to be very careful.  The benchmark does not so much test the
speed of the interpreter or compiler used for evaluation as the
speed of the numeric and IO libraries in the system.  I will
justify my claims by showing the results of a few tests that I ran
using MIT Scheme's compiler.

I'm not trying to justify MIT Scheme's apparent slowness, nor make a
sales pitch for it.  I'm just using it as an exploratory tool because
I know it well.

I have three versions of the benchmark:

    A: tst-initial, the program sent by Aubrey Jaffer without modification.
    In MIT Scheme the program will do out-of-line calls for all numeric
    operations.

    B: tst-generic, the program modified (in an MIT-Scheme-specific way) to
    open code generic arithmetic.

    C: tst-fixnum, the program modified (in an MIT-Scheme-specific way) to
    use fixnum-only arithmetic instead of generic arithmetic.

Note that in MIT Scheme, the out-of-line versions of the generic
arithmetic procedures are VERY slow, thus if the code ever calls them,
it will not run very fast.  This is not a property of the
interpreter/compiler, but a property of the way the generic arithmetic
routines are built.  A short amount of work would fix this, but no one
has really worked on improving the speed of the standard library.

Here is a transcript, with some comments.

- The code was run on chamartin.ai.mit.edu, an HP9000s350 with 32
Mbytes of memory identical to Kleph, where Aubrey ran his tests.  The
HP9000s350 has a 25 MHz 68020.

- SHOW-TIME prints times in milliseconds.  

    1 ]=> (load "/zu/jinx/tmp/tst-initial")


    Loading "/zu/jinx/tmp/tst-initial.com" -- done
    ;Value: pi

    1 ]=> (show-time (lambda () (pi 100)))

    00003 14159 26535 89793 23846 26433 83279 50288 41971 69399
    37510 58209 74944 59230 78164 06286 20899 86280 34825 34211
    70679 

    process time: 8120; real time: 8386
    ;No value

1: Pretty much as reported by Aubrey.

    1 ]=> (load "/zu/jinx/tmp/tst-generic")


    Loading "/zu/jinx/tmp/tst-generic.com" -- done
    ;Value: pi

    1 ]=> (show-time (lambda () (pi 100)))

    00003 14159 26535 89793 23846 26433 83279 50288 41971 69399
    37510 58209 74944 59230 78164 06286 20899 86280 34825 34211
    70679 

    process time: 2900; real time: 3086
    ;No value

2: Not much of an improvement.

    1 ]=> (fluid-let ((r 10000))
      (show-time (lambda () (pi 128))))

    00003 01415 09265 03589 07932 03846 02643 03832 07950 02884
    01971 06939 09375 01058 02097 04944 05923 00781 06406 02862
    00899 08628 00348 02534 02117 00679 

    process time: 660; real time: 784
    ;No value

3: Noticeable improvement even though it is run on a larger parameter.
BTW, the leading 0 in all the strings should be ignored.  The program
is not parameterized correctly according to the value of R.

    1 ]=> (load "/zu/jinx/tmp/tst-fixnum")


    Loading "/zu/jinx/tmp/tst-fixnum.com" -- done
    ;Value: pi

    1 ]=> (fluid-let ((r 10000))
      (show-time (lambda () (pi 128))))

    00003 01415 09265 03589 07932 03846 02643 03832 07950 02884
    01971 06939 09375 01058 02097 04944 05923 00781 06406 02862
    00899 08628 00348 02534 02117 00679 

    process time: 520; real time: 591
    ;No value

4: Not much of an improvement.  See below.

Explanation:

1: Shows the results as reported by Aubrey.  Even when compiled, the
code runs slowly in MIT Scheme.

2: The results using open-coded generic arithmetic.  Not much faster.

3: Same program as above but with a smaller value of R and a larger
value of N to generate the same digits.  It should be slower (the
program is roughly quadratic on N) but runs much faster.  What's the
deal?  The answer is that the unmodified program occasionally uses
bignums in MIT Scheme because the range of fixnums in MIT Scheme is
not large enough to accommodate the range of values of Q.  By reducing
R, we avoid bignums, and the program runs much faster.  Note that in
MIT Scheme fixnums only cover the range -2^25 <= x <= (2^25)-1, and Q
in the loop in Aubrey's program exceeds this range every 8-10
iterations.  I believe that Aubrey's scm uses fixnums in the range 
-2^29 <= x <= (2^29)-1, but I'm not certain.

4: Not much of a difference between generic arithmetic and fixnum arithmetic.
Most of the time at this point is spent in NUMBER->STRING, DISPLAY,
etc., and only speeding those up will make the code run faster.

Clearly (because of 2 and 3) the time taken by the program is not so
much a consequence of the speed of the interpreter/compiler as a
consequence of the particular procedures being called and the
range of values in use.

Thus the benchmark tests the speed of the support library, not the
speed of the underlying interpreter or compiler.  

It is a valid benchmark (most programs are), but we need to draw the
correct conclusions from the results.

To test the inherent speed of an interpreter or compiler, I would
write a program that does not use any of the constants (i.e.
primitives) in the language.  For example, it could use Church pairs
and Church integers.  Thus most (or all) differences in the runtime
libraries would be removed.  Of course, we could not then compare it
to C.


PS: For those who care, the code in B was the benchmark as sent by
Aubrey, preceded by

(declare (usual-integrations)
	 (integrate-external "/scheme/runtime/fixart"))

(define-integrable (quo x y)
  (if (and (fix:fixnum? x) (fix:fixnum? y))
      (fix:quotient x y)
      (quotient x y)))

(define-integrable (rem x y)
  (if (and (fix:fixnum? x) (fix:fixnum? y))
      (fix:remainder x y)
      (remainder x y)))

(define (mod x y)
  (let ((r (rem x y)))
    (if (or (zero? r)
	    (eq? (negative? r) (negative? y)))
	r
	(+ r y))))

And with the renames quotient->quo, remainder->rem, modulo->mod.
The reason is that (due to oversight) none of the procedures above are
open-coded by the compiler.

ram+@cs.cmu.edu (Rob MacLachlan) (05/04/91)

In article <1991May2.155838.20830@bronze.ucs.indiana.edu> huntley@copper.ucs.indiana.edu (Haydn Huntley) writes:

>I also performed some similar benchmarks on a VAX 8650 and a Sun 3,
>and compared your C version compiled with GCC, and your Scheme version
>compiled with Chez Scheme 3.2 and found that the C versions were 20 to
>30 times faster.
>
>On a Sun 3 with a 16Mhz 68020 (cssun2.ucs.indiana.edu)
>(pi 100)       takes   4.84 secs (Chez Scheme 3.2)
>(pi 800)       takes 246.46 secs (Chez Scheme 3.2)
>time pi 100    takes    .2  secs (GCC 1.39)
>time pi 800    takes  10.0  secs (GCC 1.39)
>ratio of Scheme to C is approx. 24

On a DECStation 3100 running Mach:
(pi 800)       takes  2.7 secs (CMU Common Lisp Apr-28-1991, speed 3, safety 0)
time pi 800    takes  2.1 secs (MIPS cc -O4)
ratio of Lisp to C is 1.28

>So what is going on here?  Well, simply put, Scheme is not nearly as efficient
>a language for executing this benchmark as C is.  [...]  Scheme can be a more
>convenient language to program [...]  while a language like C requires more
>care. [...]  On the other hand, C is usually much more efficient, because C
>compilers can usually store variables in registers, and at the worst, they can
>always put them on the stack, while Scheme compilers sometimes must store
>things out in the heap, and use registers less advantagously than C compilers
>do.

Yes, register allocation and avoidance of number consing are crucial to good
numeric performance.  But Scheme could do it too.

>In addition, Scheme has latent typing and generic arithmetic, which
>require tests to be made at run-time, while C has static typing, so it
>doesn't have to do any tests on the types of variables.

Supporting dynamic typing and generic arithmetic doesn't mean you have to use
it all the time.  That's what type inference and declarations are for.

>| So what are the prospects for machine independent performance
>| measures? Part of the appeal to me of the pi program is that it does
>| an actual difficult computation with the best code I can devise.
>| Previously I used an intentionally less than optimal Fibonacci.

Both of these benchmarks can be meaningful, but they measure *totally*
different things.  The recursive fibonacci is basically a function call
benchmark (like the traditional TAK) which means that it can probably predict
the performance of most lisp programs within a factor of 3.  Given constant
compiler technology, integer Specmarks will be a much better predictor.

This PI program is doing extended precision arithmetic.  Unless you are
implementing extended precision arithmetic in Lisp (unusual but possible), then
you are better off with another benchmark.  And if you do find a compiler that
generates reasonable code, you are probably mainly testing the speed of integer
divide.

I suggest reading comp.benchmarks for a while if you are interested in general
benchmarking concepts.

>[...] I've been working on a Scheme compiler this semester, and I learned how
>expensive it is to implement generic arithmetic, garbage collection, and type
>checking!  Each one of them costs something like a factor of 2 in performance!

Well, they may, but they don't have to, as the CMU numbers demonstrate.  Common
Lisp is pretty much a superset of Scheme, modulo call/cc.  And I don't see why
supporting call/cc should make this program run any slower, since it doesn't
use call/cc, and doesn't call any functions that might.

I believe that the slowdown in non-CMU Lisps is entirely due to generic and
bignum arithmetic.  Note that the original C program was written for 32 bit
integers, although it presumably could have been written otherwise.  Because of
this, any Lisp without efficient support for full-word integers will lose
horribly, not only calling generic arithmetic routines, but also using bignum
arithmetic.

You should be able to get reasonable performance in pretty much any Lisp by
changing the program to use FIXNUM arithmetic and adding any necessary
declarations.  CMU CL has an edge here because:
 -- You don't have to change the program.
 -- You don't have to add declarations out the wazoo.
 -- Code runs reasonably fast, even with full type checking.

Here is the CL version:
________________________________________________________________
(in-package "USER")

(defconstant r 100000)
(defun pi (n)
  (declare (type (integer 0 1000) n) (inline ceiling))
  (let* ((n (ceiling n 5))
	 (m (truncate (* n 5 3322) 1000))
	 (a (make-array (+ 1 m) :initial-element 2
			:element-type '(unsigned-byte 12))))
    (setf (aref a m) 4)
    (do ((j 1 (+ 1 j))
	 (q 0 0)
	 (b 2 (rem q r)))
	((> j n))
      (declare (type (unsigned-byte 31) j b q))
      (do ((k m (- k 1)))
	  ((zerop k))
	(declare (type (unsigned-byte 12) k))
	(setq q (+ q (* (aref a k) r)))
	(multiple-value-bind (quo rem)
			     (truncate q (+ 1 (* 2 k)))
	  (declare (type (unsigned-byte 19) quo))
	  (setf (aref a k) rem)
	  (setq q (* k quo))))
      (format t "~5,'0D" (+ b (truncate q r)))
      (if (zerop (mod j 10))
	  (terpri)
	  (write-char #\space)))
    (terpri)))

jinx@zurich.ai.mit.edu (Guillermo J. Rozas) (05/04/91)

    Well, they may, but they don't have to, as the CMU numbers demonstrate.  Common
    Lisp is pretty much a superset of scheme, modulo call/cc.  And I don't see why
    supporting call/cc should make this program run any slower, since it doesn't
    use call/cc, and doesn't call any functions that might.

How do you know that QUOTIENT or FORMAT (for example) may not error,
thereby calling the condition system which may use
CALL-WITH-CURRENT-CONTINUATION to construct a condition object to be
used if the computation is going to be continued?

I think that it is almost impossible to know.  Because of this,
essentially any potentially assigned variables must be on the heap so
that invoking a continuation multiple times will not un-do side effects.

    You should be able to get reasonable performance in pretty much any Lisp by
    changing the program to use FIXNUM arithmetic and adding any necessary
    declarations.  CMU CL has an edge here because:
     -- You don't have to change the program.
     -- You don't have to add declarations out the wazoo.
     -- Code runs reasonably fast, even with full type checking.

The program that you sent is pretty different from the original, and
has 5 different declarations, some pretty arcane (e.g. (declare (type
(unsigned-byte 19) quo))).

Thus it seems that you changed the program, and that there are
declarations up the wazoo, although perhaps less than in other
dialects of CL.  I like the fact that you don't have to insert THE
everywhere, but I still think that this is a far cry from what you
seem to imply.

Perhaps this is a portable CL program, and not a CMU-CL program.
Please explain what you meant.

ram+@cs.cmu.edu (Rob MacLachlan) (05/04/91)

I'd like to clarify my intent here.  Of course, part of the reason I posted was
that I couldn't miss a chance to brag about my spiffy new compiler.  But the
main reason was that I didn't want to let go unchallenged the claim that Lisp
(or any language with dynamic typing, generic arithmetic and GC) will be 8
times slower than C.

Any powerful language feature sometimes hurts performance, but implementors can
ensure that you only get hurt when you play with the sharp edges.  This means
that there is not a straightforward product combination of inefficiency when
you have a language with several potentially inefficient features.

    How do you know that QUOTIENT or FORMAT (for example) may not error,
    [...] I think that it is almost impossible to know.  [...]
    any potentially assigned variables must be on the heap

Yeah, I guess it can get weedy if you aren't willing to restrict the extent of
continuations created in the debugger.  But if you don't have word integers,
this program will still perform terribly even if you do change to Scheme's
idiomatic iteration.

	You should be able to get reasonable performance in pretty much any
	Lisp by changing the program to use FIXNUM arithmetic and adding any
	necessary declarations.  CMU CL has an edge here because:
	 -- You don't have to change the program.
	 -- You don't have to add declarations out the wazoo.
	 -- Code runs reasonably fast, even with full type checking.

    [...] it seems that you changed the program, and that there are
    declarations up the wazoo, although perhaps less than in other
    dialects of CL.  I like the fact that you don't have to insert THE
    everywhere, but I still think that this is a far cry from what you
    seem to imply.

Well, I guess my wazoo is different than yours.  I meant:
 -- I didn't have to understand how the program worked to change it.  I only
    changed syntax and added some declarations (which I reverse-engineered from
    the knowledge that long == (signed-byte 32), and that the maximum N the C
    program allowed was 1000.)  [I also couldn't resist the temptation to
    clarify the program by using FORMAT...]
 -- There are no THE declarations.
 -- Safety does not mean generic arithmetic in CMU CL (though the time I 
    reported was unsafe.)

  Robert MacLachlan (ram+@cs.cmu.edu)

jaffer@zurich.ai.mit.edu (Aubrey Jaffer) (05/04/91)

huntley@copper.ucs.indiana.edu (Haydn Huntley) asks:
>Did you use the same compiler and same options on both your 68000 and
>68020?
No and Yes.  The 68000 runs a 1984-vintage Unix and the 68020 runs HP-UX.
I used no compiler options on either.

>Do they both run at the same clock rate?  Your C benchmarks
>show that kaleph is 8 times faster than your 68000!
My 68000 is 10 MHz.  The 68020 is 25 MHz.

>I'd say that your interpreter seems to be incredibly fast!  Is it
>really an interpreter or actually a compiler?
Scm has an interpreter written entirely in C (it uses no macros written
in scheme).  In addition, it memoizes the locations of all variable
references by modifying the code.  This is described in code.doc in
the distribution.

>Does it support generic arithmetic? (fixnums, bignums, and flonums)
Not yet.  I am adding this now.  However, 30 bit fixnum arithmetic
speed will not be affected.

>Does it do run-time checks to see if the heap is nearly full?
Yes.  The times include GC.

>Does it perform any type checks at run-time?
It checks for number of arguments, type of arguments, structure
references, numeric overflow, and unbound variables.

jeff@aiai.edinburgh.ac.UK (Jeff Dalton) (05/05/91)

> Date: 3 May 91 20:23:21 GMT
> From: "Guillermo J. Rozas" <jinx@zurich.ai.mit.edu>

> How do you know that QUOTIENT or FORMAT (for example) may not error,
> thereby calling the condition system which may use
> CALL-WITH-CURRENT-CONTINUATION to construct a condition object to be
> used if the computation is going to be continued?
> 
> I think that it is almost impossible to know.  Because of this,
> essentially any potentially assigned variables must be on the heap so
> that invoking a continuation multiple times will not un-do side effects.

If so, this looks like a very strong argument against having full
continuations.

hieb@heston.cs.indiana.edu (Robert Hieb) (05/06/91)

jeff@aiai.edinburgh.ac.UK (Jeff Dalton) writes:

>> Date: 3 May 91 20:23:21 GMT
>> From: "Guillermo J. Rozas" <jinx@zurich.ai.mit.edu>

>> essentially any potentially assigned variables must be on the heap so
>> that invoking a continuation multiple times will not un-do side effects.

>If so, this looks like a very strong argument against having full
>continuations.

In practice, assigned variables in Scheme are typically "state"
variables with indefinite extent anyway, and thus not suitable for
stack placement.

The assignments in the piece of code that occasioned this side track
seem to be an artifact of an overly-literal translation from C code.
Replacing the "do" loops with "named let" loops makes it easy to
eliminate the assignments to "q" by turning it into a formal parameter
of the inner loop.  "named let" often makes it possible to remove
assignments and breaks from C-style loops.

In general, tail recursion optimization in Scheme makes it natural to
create state machines that pass (rather than assign) parameters.

dyb@marilyn.cs.indiana.edu (R. Kent Dybvig) (05/16/91)

wa@raven.cad.mcc.com (Wayne Allen) writes:

>I've heard good things about Chez Scheme, and find it hard to 
>take Huntley's data on face value.  

Okay.  I had decided to stay out of this thread, since I thought that
Bill Rozas covered the subject sufficiently.  Huntley's numbers are
probably entirely correct for Chez Scheme Version 3.2 given the
original program, which uses generic arithmetic.  Furthermore, since
Version 3.2 uses 19 bit fixnums, the original program will not run
correctly with fixnum-specific operators.  However, by playing around
with the parameters (specifically, the "r" parameter), it is possible
to use fixnum-specific operators and get much better performance: on
the order of four or five times slower than C.

Using Version 3.9n (a prerelease of Chez Scheme Version 4), which uses
30-bit fixnums, it is still not possible to run "(pi 800)" with the
original program without causing fixnum overflow; the CMU Common Lisp
numbers must be valid only because of the "unsigned" declaration.  With
a slight change to the "r" parameter to allow fixnum-specific operators
to be used, Version 3.9n is slightly less than 3 times slower than
optimized GNU C on a Sun-3.

Although a factor of 3 is much more encouraging than a factor of 20,
we clearly have more work to do to make this benchmark perform as well
in Scheme as it does in C.

Incidentally, there's a bug in the original C code: j is not
initialized.  I tried the C code first on a Sun 4 and it did not run
properly until I initialized j to 0.

Kent

R. Kent Dybvig                 | Computer Science Department
dyb@iuvax.cs.indiana.edu       | Indiana University
812/855-6486                   | Bloomington, IN 47405