[comp.arch] Perils of comparison -- an example

casey@admin.cognet.ucla.edu (Casey Leedom) (08/14/88)

In article <282@quintus.UUCP> ok@quintus () writes:
> 
> ... kLI/s are defined solely by that particular benchmark, by the way.
> Other benchmarks may be "procedure calls per second", but _only_ Naive
> Reverse gives "logical instructions".

  I believe "kLI/s" is 1000's of Logical Inferences per second (but I may
be wrong of course).  This is normally abbreviated as kLIPS.  Really fast
PROLOG machines are rated in MLIPS (10^6 LIPS).

  LIPS is a logical analog to the floating point FLOPS metric.  Note that
both LIPS and FLOPS are useful measures while MIPS is of debatable use -
at least until the industry can standardize the measure and stop glomming
a system's entire performance profile into a single number.

Casey

ok@quintus.uucp (Richard A. O'Keefe) (08/15/88)

In article <15221@shemp.CS.UCLA.EDU> casey@cs.ucla.edu.UUCP (Casey Leedom) writes:
>In article <282@quintus.UUCP> ok@quintus () writes:
>> 
>> ... kLI/s are defined solely by that particular benchmark, by the way.
>> Other benchmarks may be "procedure calls per second", but _only_ Naive
>> Reverse gives "logical instructions".
>
>  I believe "kLI/s" is 1000's of Logical Inferences per second (but I may
>be wrong of course).  This is normally abbreviated as kLIPS.  Really fast
>PROLOG machines are rated in MLIPS (10^6 LIPS).

Right, it is "logical _inferences_ per second".  Silly me.

There is a single specific benchmark, called naive reverse, which happens
to do 496 procedure calls.  To determine the kLI/s rating, you run this
benchmark N times, for some large N.  If it takes T seconds, you report
(496*N)/T as the LIPS rating.
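
For readers outside the Prolog world, the arithmetic is easy to check with
a transliteration.  The sketch below is a hypothetical Python rendering of
nrev (the real benchmark is, of course, Prolog); counting every entry to
nrev and append the way the LIPS definition counts inferences gives
exactly 496 for the standard 30-element list.

```python
# Hypothetical Python transliteration of the naive reverse benchmark.
# Each entry to nrev or append stands in for one "logical inference".
calls = 0

def nrev(xs):
    """Naive reverse: reverse the tail, then append the head at the end."""
    global calls
    calls += 1
    if not xs:
        return []
    return append(nrev(xs[1:]), [xs[0]])

def append(xs, ys):
    """List concatenation by recursion, as Prolog's append/3 does it."""
    global calls
    calls += 1
    if not xs:
        return ys
    return [xs[0]] + append(xs[1:], ys)

result = nrev(list(range(30)))   # the benchmark's standard 30-element list
# nrev is entered 31 times and append 465 times: 496 calls in all
```

The quadratic cost is visible in the count: append is entered once per
element of its first argument plus one, summed over first arguments of
length 0 through 29, which is where 465 of the 496 come from.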

When you are benchmarking, it is necessary to be precise about what you
have measured.  Some people have taken any old small program and
reported the number of procedure calls it did per second as LIPS.  It
simply won't *DO*! Procedures can have different numbers of arguments,
and the cost of head unification can range from next to nothing to
exponential in the size of the arguments.

Don't get me wrong:  Naive Reverse is not a specially good benchmark.
(Think about the fact that native code for it fits comfortably into a
68020's on-chip instruction cache...) But using *different* benchmarks
when talking about different machines can't yield better comparisons!

There is a more comprehensive set of micro-benchmarks which was described
in AI Expert last year.  Instead of a single LI/s rating, it would be
better to report an "AIE spectrum".  But even the best micro-benchmarks
don't always predict the performance of real programs well, for reasons
explained in the Smalltalk books, amongst others.

One of the things which makes the DLM article credible is that it reports
figures for several other (small) benchmarks (I surmise that "quickstart"
really meant "quicksort").  I have seen enough papers that report really
high performance where the system described seems never to have run
anything _but_ Naive Reverse.  At least the DLM is realer than that!

eugene@eos.UUCP (Eugene Miya) (08/16/88)

In article <292@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>Don't get me wrong:  Naive Reverse is not a specially good benchmark.

I see you came from prolog and cross posted to arch.

>There is a single specific benchmark, called naive reverse, which happens
>to do 496 procedure calls.  To determine the kLI/s rating, you run this
>benchmark N times, for some large N.  If it takes T seconds, you report
>(496*N)/T as the LIPS rating.

I've stated this many times in comp.arch, and I'll repeat this once
for the Prolog community's benefit.  Measurement of repetition
isn't equivalent to repetition of measurement on a computer.  Cache,
paging, and optimization conspire against oversimplistic
measurements of this type.
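
[One way to take "repetition of measurement" seriously is to time each run
separately and look at the distribution, rather than folding everything
into a single aggregate.  A sketch, with Python used purely for
illustration; measure_each and the toy workload are invented names, not
part of any benchmark discussed here:]

```python
import statistics
import time

def measure_each(fn, repeats):
    """Repetition of measurement: one timing per run, not one overall."""
    times = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return times

# toy workload standing in for a real benchmark body
times = measure_each(lambda: sum(range(1000)), repeats=50)

# The minimum approximates the warm-cache best case; the median is more
# typical; a wide spread is exactly the cache/paging effect that a single
# "run it N times and divide" figure averages away.
best = min(times)
typical = statistics.median(times)
```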

>When you are benchmarking, it is necessary to be precise 

You said it all.

I've been trying to find out what "really constitutes a Logical
Instruction."  As far as I can tell, it's totally arbitrary, whereas
Instructions and Operations tend to correspond to discrete states
(barring instruction pipelining, yes yes....).  (Yes, I have Gabriel's
thesis and others.)

Your keyword about measuring Prolog is "naive."  This isn't a putdown,
but the Prolog community will have to recognize some of these problems.

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups.  If enough, I'll summarize."

ok@quintus.uucp (Richard A. O'Keefe) (08/20/88)

In article <1303@eos.UUCP> eugene@eos.UUCP (Eugene Miya) writes:
>In article <292@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>>Don't get me wrong:  Naive Reverse is not a specially good benchmark.
>
>I see you came from prolog and cross posted to arch.
>
Misemphasis: it was a joint posting to both groups because I thought the
original article (comments on a paper about a new architecture) was
relevant to both groups.

>I've stated this many times in comp.arch, and I'll repeat this once
>for the Prolog community's benefit.  Measurement of repetition
>isn't equivalent to repetition of measurement on a computer. Cache,
>paging, and optimization conspire against oversimplistic
>measurements of this type.

We *know* that.  But we *also* know that if you measure one iteration
of a typical micro-benchmark, it falls below the resolution of the clock.
Running nrev a few thousand times is precisely how you get a figure that
can be distinguished from clock quantisation.
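
[The quantisation point can be made concrete: on a coarse clock a single
run can measure as zero, so the benchmark loop amortises N runs over one
pair of clock readings.  A sketch; the workload below is a stand-in for
illustration, not real nrev, and the variable names are invented:]

```python
import time

def workload():
    """Stand-in for one nrev run on a 30-element list."""
    xs = []
    for i in range(30):
        xs = [i] + xs
    return xs

N = 10_000
t0 = time.perf_counter()
for _ in range(N):
    workload()
elapsed = time.perf_counter() - t0   # one elapsed reading for N runs

# 496 inferences per run, so the reported rating is:
lips = 496 * N / elapsed
klips = lips / 1000.0
```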

Let me summarise my position here:
(1) A paper describing a machine called DLM appeared in FGCS.
(2) The paper compared the DLM with a 68020 using *different* micro-benchmarks,
(3) one of which is the official definition of LI/s, but
(4) neither of which is good.
(5) Because of (2) and other reasons, it appears that the special-purpose
    machine is not as much of an advance over conventional chips as it seems.

>I've been trying to find out what "really constitutes a Logical
>Instruction."  As far as I can tell, it's totally arbitrary, whereas
>Instructions and Operations tend to correspond to discrete states
>(barring instruction pipelining, yes yes....).  (Yes, I have Gabriel's
>thesis and others.)

Gabriel's thesis?  Do you mean Tick's?  Logical Inferences per second
are defined by the naive reverse benchmark and by nothing else.  The
place to look for the definition is Warren's thesis.  The term has been
*mis*applied as "number of procedure calls per second" which can be almost
anything depending on what code you run.

>Your keyword about measuring Prolog is "Naive."  This isn't a putdown,
>but the Prolog community will have to recognize some of these problems.
>Another gross generalization from Eugene Miya.

Well, if it isn't a putdown, it'll do until a real one comes along.
We *know* that these micro-benchmarks don't extrapolate well, but they're
the best we've got.  ("Naive" refers to the algorithm, by the way.)
Some constructive advice about how to structure a benchmark suite to
compare implementations of a high-level language on a range of 32-bit
workstations would be really welcome, and if the comp.arch community is
really so sophisticated, such advice should be forthcoming, no?