[comp.benchmarks] Don't use bc

amos@taux01.nsc.com (Amos Shapir) (12/02/90)

[Quoted from the referenced article by ethan@thinc.UUCP (Ethan Lish of THINC)]
>
>Greetings -
>
>	This _benchmark_ does *NOT* have a legitimate value!
>

Sure it doesn't; I wonder how no one else noted this yet: "bc" is probably
the worst choice of a utility to benchmark by.  On most UNIX systems, it
just parses expressions, and forks "dc" to execute them ("dc" is a reverse-
polish string based numeric interpreter).  So the results depend on how
fast your system forks, and how "bc" and "dc" communicate.
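
On such systems you can separate the two effects by timing "dc" by itself;
the reverse-polish form of the same expression, as a BSD "bc" would hand it
to "dc", is something like this (take it as a rough sketch):

	# end-to-end: bc's parsing, the fork, the pipe, and dc's arithmetic
	echo '2^5000/2^5000' | /bin/time bc > /dev/null
	# dc alone on the equivalent reverse-polish input: just the arithmetic
	echo '2 5000^ 2 5000^/ps.' | /bin/time dc > /dev/null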

Besides, there are several versions of "bc" (some of which do not fork "dc"),
and since the original version of "dc" was rather buggy, there are several
versions of it too, some of which are major rewrites.

The bottom line is: comparing "bc" runs on different systems is necessarily
comparing apples and oranges (or at least plums & prunes) unless you're
sure you have the same version of "bc", "dc", and UNIX.  Results posted
here so far indicate most comparisons are indeed meaningless.

-- 
	Amos Shapir		amos@taux01.nsc.com, amos@nsc.nsc.com
National Semiconductor (Israel) P.O.B. 3007, Herzlia 46104, Israel
Tel. +972 52 522255  TWX: 33691, fax: +972-52-558322 GEO: 34 48 E / 32 10 N

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (12/02/90)

>>>>> On 1 Dec 90 21:20:17 GMT, amos@taux01.nsc.com (Amos Shapir) said:

[.... commenting on the "echo 2^5000/2^5000 | /bin/time bc" benchmark ...]

>    This benchmark has no validity!

Amos> Sure it doesn't; I wonder how no one else noted this yet: "bc"
Amos> is probably the worst choice of a utility to benchmark by.  On
Amos> most UNIX systems, it just parses expressions, and forks "dc" to
Amos> execute them [...]

Of course the biggest problem is that almost no one actually *uses*
`bc' for any large amount of computation, so no vendor has any
incentive to optimize its performance.

A secondary problem is that one could trivially optimize the benchmark
away by adding a constant-expression simplifier to `bc' before it
calls `dc', but everyone already knew that....

(Maple evaluated the expression on my SGI 4D/25 in 0.4 seconds wall
time).
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

dale@convex.com (Dale Lancaster) (12/04/90)

>Amos> Sure it doesn't; I wonder how no one else noted this yet: "bc"
>Amos> is probably the worst choice of a utility to benchmark by.  On
>Amos> most UNIX systems, it just parses expressions, and forks "dc" to
>Amos> execute them [...]

>Of course the biggest problem is that almost no one actually *uses*
>`bc' for any large amount of computation, so no vendor has any
>incentive to optimize its performance.

For this reason it makes as good a benchmark as any.  I suspect that most Unix
based bc's/dc's work the same.  This is a typical dusty-deck, no optimization
piece of code that shows the performance of the machine from not just the hardware
side but the software (compiler) side as well.  Today you have to benchmark the
compiler as much as the hardware.  But of course the only true measure of performance
is my own application :-)

dml

de5@ornl.gov (Dave Sill) (12/04/90)

In article <5042@taux01.nsc.com>, amos@taux01.nsc.com (Amos Shapir) writes:
>[Quoted from the referenced article by ethan@thinc.UUCP (Ethan Lish of THINC)]
>>
>>Greetings -
>>
>>	This _benchmark_ does *NOT* have a legitimate value!
>>
>
>Sure it doesn't; I wonder how no one else noted this yet: "bc" is probably
>the worst choice of a utility to benchmark by.

It may not be rigorous, but it does have value.  For one thing, it's
short enough to be memorized and easily typed at a box at, say, an
expo.  

>On most UNIX systems, it
>just parses expressions, and forks "dc" to execute them ("dc" is a reverse-
>polish string based numeric interpreter).  So the results depend on how
>fast your system forks, and how "bc" and "dc" communicate.

How does that invalidate the results?  That's like penalizing an
optimizing compiler for taking shortcuts the other one didn't.  If the
bc benchmark runs faster on system A than it does on B because vendor
A took the time to optimize bc, then good for them!  The danger is not
some inherent unreliability in the benchmark, it's in incorrectly
interpreting the results.

This highlights very well the fundamental danger of benchmarking:
generalization.  Just because one system outperforms another on, say,
a floating point benchmark, doesn't mean that it will *always*
outperform it on all floating point code.

>The bottom line is: comparing "bc" runs on different systems is necessarily
>comparing apples and oranges (or at least plums & prunes) unless you're
>sure you have the same version of "bc", "dc", and UNIX.  Results posted
>here so far indicate most comparisons are indeed meaningless.

Bc is bc.  If it takes 2^5000/2^5000 and correctly calculates the
result, what does it matter how it gets there?  I.e., this benchmark
measures bc's performance.  Interpreting it as a hardware benchmark is
fallacious, since hardware performance is only one factor in the
result.

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) replies to the
same article:
>
>Of course the biggest problem is that almost no one actually *uses*
>`bc' for any large amount of computation, so no vendor has any
>incentive to optimize its performance.

Ah, but would it be better to benchmark something the vendor has
expected and optimized?  If you're looking for actual performance
instead of theoretical peak performance, perhaps it's better to throw
something unexpected at them.

>A secondary problem is that one could trivially optimize the benchmark
>away by adding a constant-expression simplifier to `bc' before it
>calls `dc', but everyone already knew that....

Yes, but that would be readily apparent, wouldn't it?  And it wouldn't
invalidate the test.  You just need to keep in mind that you're
testing bc, and its dependence on hardware performance is only
indirect. 
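
One quick way to check for that sort of shortcut: change the operands so
the identity can't fire, and see whether the time changes out of all
proportion.  A rough sketch:

	echo '2^5000/2^5000' | /bin/time bc > /dev/null	# X/X -- foldable to 1
	echo '2^5000/2^4999' | /bin/time bc > /dev/null	# almost the same work, not foldable

On an honest bc the two times should be comparable.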

>(Maple evaluated the expression on my SGI 4D/25 in 0.4 seconds wall
>time).

Exactly, so Maple is faster than bc.  You can't interpret this to mean
that the SGI is faster than all the other systems that take longer to
do it with bc.

-- 
Dave Sill (de5@ornl.gov)
Martin Marietta Energy Systems
Workstation Support

jbuck@galileo.berkeley.edu (Joe Buck) (12/04/90)

In article <1990Dec3.191756.15280@cs.utk.edu>, de5@ornl.gov (Dave Sill) writes:
> It [the bc bmark] may not be rigorous, but it does have value.
> For one thing, it's short enough to be memorized and easily typed
> at a box at, say, an expo.  

Surely you've heard the one about the drunk looking for his keys under
the lamppost, even though he'd lost them in an alley.  He explains
that he's looking under the lamppost because the light is better.

The bc "benchmark" is easy to measure, but the number is worthless.
The drunk finds it easy to look under the lamppost, but he's not
going to find what he's looking for.

> >Of course the biggest problem is that almost no one actually *uses*
> >`bc' for any large amount of computation, so no vendor has any
> >incentive to optimize its performance.
 
> Ah, but would it be better to benchmark something the vendor has
> expected and optimized?  If you're looking for actual performance
> instead of theoretical peak performance, perhaps it's better to throw
> something unexpected at them.

Even if your argument was true, I've seen the bc "benchmark" proposed
years ago.  Vendors already know about it.

Here's a valid benchmark: how fast does a standard compiler run, with
a standard input, compiling for a standard architecture?  (This is the
"gcc" test in SPEC).  Other valid tests are: how fast does a widely
used application program, like TeX, troff, SPICE, etc, run given a
fairly complex input?  The idea is to see how fast the computer runs
programs that people actually use heavily.  If a vendor's machine
manages to do all these things fast, it will probably be fast on your
real workload.


--
Joe Buck
jbuck@galileo.berkeley.edu	 {uunet,ucbvax}!galileo.berkeley.edu!jbuck	

de5@ornl.gov (Dave Sill) (12/04/90)

In article <39871@ucbvax.BERKELEY.EDU>, jbuck@galileo.berkeley.edu (Joe Buck) writes:
>In article <1990Dec3.191756.15280@cs.utk.edu>, de5@ornl.gov (Dave Sill) writes:
>> It [the bc bmark] may not be rigorous, but it does have value.
>> For one thing, it's short enough to be memorized and easily typed
>> at a box at, say, an expo.  
>
>[Amusing story about a drunk looking for his keys under a lamppost,
>even though he'd lost them in an alley, because the light was
>better.] 
>
>The bc "benchmark" is easy to measure, but the number is worthless.

Why do you say that?  It's using a known tool to perform a known task,
right?  No, it won't tell whether machine X is faster than machine Y
with absolute certainty.  But it will tell me *something*.

>> >Of course the biggest problem is that almost no one actually *uses*
>> >`bc' for any large amount of computation, so no vendor has any
>> >incentive to optimize its performance.
> 
>> Ah, but would it be better to benchmark something the vendor has
>> expected and optimized?  If you're looking for actual performance
>> instead of theoretical peak performance, perhaps it's better to throw
>> something unexpected at them.
>
>Even if your argument was true, I've seen the bc "benchmark" proposed
>years ago.  Vendors already know about it.

Do they know *my* version of it?  Maybe I've changed the numbers
around, maybe I've changed the operations...

>Here's a valid benchmark: how fast does a standard compiler run, with
>a standard input, compiling for a standard architecture?  (This is the
>"gcc" test in SPEC).  Other valid tests are: how fast does a widely
>used application program, like TeX, troff, SPICE, etc, run given a
>fairly complex input?

I agree that both are valid tests.  I don't agree that the bc test is
fundamentally different.  It's a smaller version of the same idea.

>The idea is to see how fast the computer runs
>programs that people actually use heavily.  If a vendor's machine
>manages to do all these things fast, it will probably be fast on your
>real workload.

Exactly, but you can't carry a SPEC tape with you wherever you go.  I
*can* carry the bc test with me and very quickly determine which end
of the spectrum an unknown machine falls in.  Will I base purchase
decisions on such an admittedly trivial test?  Of course not.

Benchmarks are tools.  Some are higher quality than others.  Some take
more skill to use than others.  The existence of $12,000 table saws
doesn't negate the ability of a $15 hand saw to cut a limb off a tree.
They're different tools for different tasks.  Our job as computing
professionals is to select the appropriate tool for the task.

-- 
Dave Sill (de5@ornl.gov)
Martin Marietta Energy Systems
Workstation Support

amos@taux01.nsc.com (Amos Shapir) (12/04/90)

[Quoted from the referenced article by Dave Sill <de5@ornl.gov>]
|In article <5042@taux01.nsc.com>, amos@taux01.nsc.com (Amos Shapir) writes:
|
|>On most UNIX systems, it
|>just parses expressions, and forks "dc" to execute them ("dc" is a reverse-
|>polish string based numeric interpreter).  So the results depend on how
|>fast your system forks, and how "bc" and "dc" communicate.
|
|How does that invalidate the results?  That's like penalizing an
|optimizing compiler for taking shortcuts the other one didn't.  If the
|bc benchmark runs faster on system A than it does on B because vendor
|A took the time to optimize bc, then good for them!  The danger is not
|some inherent unreliability in the benchmark, it's in incorrectly
|interpreting the results.

But all I see in this group is articles posting *one number* for this test;
it may be ok to use it for a CPU-bound test (e.g., running "dc" alone),
but running "bc" means creating two processes, and also measuring the overhead
of context switches, etc.  The problem is that most people do not know
that this is happening, and use the results as if "bc" were doing the work
that only "dc" actually does.

|
|Bc is bc.  If it takes 2^5000/2^5000 and correctly calculates the
|result, what does it matter how it gets there?  I.e., this benchmark
|measures bc's performance.  Interpreting it as a hardware benchmark is
|fallacious, since hardware performance is only one factor in the
|result.

But show me one article in this thread that does *not* treat this as a
hardware benchmark; after all, that's exactly the charter of this group!
"Bc" alone is not useful enough to warrant such an attention, and even if it
was, benchmarks measuring its performance for its own sake would be
posted in comp.bc, not here.

-- 
	Amos Shapir		amos@taux01.nsc.com, amos@nsc.nsc.com
National Semiconductor (Israel) P.O.B. 3007, Herzlia 46104, Israel
Tel. +972 52 522255  TWX: 33691, fax: +972-52-558322 GEO: 34 48 E / 32 10 N

de5@ornl.gov (Dave Sill) (12/05/90)

In article <5045@taux01.nsc.com>, amos@taux01.nsc.com (Amos Shapir) writes:
>[Quoted from the referenced article by Dave Sill <de5@ornl.gov>]
>|
>|How does that invalidate the results?  That's like penalizing an
>|optimizing compiler for taking shortcuts the other one didn't.  If the
>|bc benchmark runs faster on system A than it does on B because vendor
>|A took the time to optimize bc, then good for them!  The danger is not
>|some inherent unreliability in the benchmark, it's in incorrectly
>|interpreting the results.
>
>But all I see in this group is articles posting *one number* for this test;
>it may be ok to use it for a CPU-bound test (e.g., running "dc" alone),
>but running "bc" means creating two processes, and also measuring the overhead
>of context switches, etc.  The problem is that most people do not know
>that this is happening, and use the results as if "bc" were doing the work
>that only "dc" actually does.

The man on the street could not care less how many context switches
are involved--or even what a context switch *is*--all he cares about
is how long it takes the computer to perform his task.  That's one
number: elapsed time.  *You* might be extremely concerned about what's
happening behind the scenes, but not everyone is.

>|Bc is bc.  If it takes 2^5000/2^5000 and correctly calculates the
>|result, what does it matter how it gets there?  I.e., this benchmark
>|measures bc's performance.  Interpreting it as a hardware benchmark is
>|fallacious, since hardware performance is only one factor in the
>|result.
>
>But show me one article in this thread that does *not* treat this as a
>hardware benchmark;

Show me one that does.  I haven't seen *any* conclusions drawn from
these results, except that one SPARCstation SLC seemed to take
entirely too long.  (Which points out another use of this trivial
benchmark: a quick diagnostic to compare identical systems or to
track a system through time.)

>after all, that's exactly the charter of this group!

No it isn't.  I specifically included *all* types of benchmarking.
Otherwise it would've been called comp.arch.benchmarks or
comp.hw.benchmarks.

>"Bc" alone is not useful enough to warrant such an attention, and even if it
>was, benchmarks measuring its performance for its own sake would be
>posted in comp.bc, not here.

First of all, there is no comp.bc.  Even if there was, this would be a
very appropriate place.  Just because the bc benchmark doesn't suit
your needs doesn't mean it's worthless.

-- 
Dave Sill (de5@ornl.gov)
Martin Marietta Energy Systems
Workstation Support

patrick@convex.COM (Patrick F. McGehearty) (12/06/90)

In article <1990Dec3.204027.16794@cs.utk.edu> Dave Sill <de5@ornl.gov> writes:
>In article <39871@ucbvax.BERKELEY.EDU>, jbuck@galileo.berkeley.edu (Joe Buck) writes:
>>In article <1990Dec3.191756.15280@cs.utk.edu>, de5@ornl.gov (Dave Sill) writes:
>>> It [the bc bmark] may not be rigorous, but it does have value.
>>> For one thing, it's short enough to be memorized and easily typed
>>> at a box at, say, an expo.  
>>
>>The bc "benchmark" is easy to measure, but the number is worthless.
>
>Why do you say that?  It's using a known tool to perform a known task,
>right?  No, it won't tell whether machine X is faster than machine Y
>with absolute certainty.  But it will tell me *something*.
>
>I agree that both are valid tests.  I don't agree that the bc test is
>fundamentally different.  It's a smaller version of the same idea.
>
>>The idea is to see how fast the computer runs
>>programs that people actually use heavily.  If a vendor's machine
>>manages to do all these things fast, it will probably be fast on your
>>real workload.
>
>Exactly, but you can't carry a SPEC tape with you wherever you go.  I
>*can* carry the bc test with me and very quickly determine which end
>of the spectrum an unknown machine falls in.  Will I base purchase
>decisions on such an admittedly trivial test?  Of course not.
>
I suggest that the bc benchmark is worse than worthless for several reasons.

First, as pointed out, it is not measuring the raw add/multiply rate
of the machine.  It measures the "multi-precision arithmetic"
capabilities as implemented by dc, which is mostly subroutine call/returns.
Further, I have never seen a system where bc/dc is a significant user of
cycles.  Thus, the less than expert user will believe the measurements
represent something different from reality.

Second, for most machines, little architecture or compiler work has
been (or should be) done to optimize this application.  So you will not be
able to tell the difference between those machines which have features
useful to your application and those which do not.

Third, widespread reporting of such a benchmark will encourage other,
less knowledgeable buyers to read more into the numbers than should be
read.

Fourth, if buyers use the benchmark, then vendors will be encouraged
to put resources into enhancing their performance on it instead of
enhancing something useful.  This is a bad thing and the primary reason
why I am posting.  Bad benchmarks lead to lots of wasted effort.

I use the Whetstone benchmark as a "proof by example".  I know of several
vendor development efforts (going back much more than 10 years, this is not
a new phenomenon) which went to extreme efforts to improve their Whetstone
results, including adding special microcode for certain instructions which
the compiler only generated for the Whetstone benchmark.  Obviously,
this particular trick only makes sense for the old style CISCy
architectures, but you get the idea of what vendors will do to improve their
benchmark results.  There are similar stories for the Dhrystone benchmark.
In these cases, the development efforts were not totally wasted.  Efforts to
speed up the transcendental functions (SIN, COS, etc) used in the Whetstones
helped those applications which used the transcendentals.  I see no value
to most users of general purpose computing (scientific or business) in
optimizing bc/dc.

Many procurements require some minimum rate on some well-known benchmark
for a vendor to be even allowed to bid.  If you can't make this number, you
don't get a chance to show how good your architecture and compilers are for
executing the customer's real application.  There are even a significant
number of customers who do not run benchmarks before purchase.  They
just rely on quoted numbers for well-known benchmarks.  It is our duty
as responsible professionals to develop and measure benchmarks that mean
something, and to explain what they mean.

For workstations, the SPECmark benchmarks provide programs which are
sufficiently complex as to avoid the trivial trick optimizations.  If an
optimization can make that set run faster, it will probably also apply to
real application code.  For scientific computers, the Perfect Club
benchmarks serve the same purpose.  They represent a dozen scientific
applications with inner loops which cover a variety of code patterns which
are found in real applications.  The Livermore loops also have codes
which are representative of inner loops of real applications.  Improving
these codes will improve real application performance.  In a few years,
their solution times will become so short as to require new problem
definitions or data sets, but meanwhile, we in system development will have
some meaningful metrics to work towards improving.

If you really must have a "quick and dirty" benchmark, how about the
following:

	program main
	real*8 a(256,256),b(256,256),c(256,256)
	call matmul(a,b,c,256)
	end
	subroutine matmul(a,b,c,n)
	real*8 a(n,n),b(n,n),c(n,n)
	do i = 1, n
	do j = 1, n
	  c(i,j) = 0.0
	  do k = 1, n
	    c(i,j) = c(i,j) + a(i,k)*b(k,j)
	  enddo
	enddo
	enddo
	return
	end

This is a basic matrix multiply loop which takes less than a second
on a Convex C210.  If you are running on a fast machine you might want to
change 256 to 1024 (for 64 times more work).  The matmul routine is separate
from the main routine so that the optimizer cannot eliminate the work unless
it performs interprocedural optimization or routine inlining.  Be sure
not to invoke any such options.

This toy benchmark focuses on the floating point performance of the
machine.  It should show the architecture in a relatively favorable
light if floating point is an important part of its product segment.
It is large enough to blow most current cache systems if there is
too great a disparity between cache and non-cache processor performance.

It is not hard to memorize or carry a copy around and type in.
On a Convex, execute with: fc -O2 test.f -o test; /bin/time -e test
-O2 requests the vectorization optimization.
-O3 requests the parallel/vectorization optimization.
The -e switch is a Convex extension to /bin/time that provides extended
accuracy to the microsecond level.  Otherwise timing is only recorded
to the nearest 100th of a second for compatibility with previous releases.
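
For non-Convex systems, something along these lines should work (the
compiler name and flags vary by vendor, so treat this as a sketch):

	f77 -O -o test test.f	# or whatever your vendor calls its Fortran compiler
	/bin/time ./test	# plain /bin/time; the -e switch is Convex-only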

djh@xipe.osc.edu (David Heisterberg) (12/06/90)

In article <109872@convex.convex.com> patrick@convex.COM (Patrick F. McGehearty) writes:
>	program main
>	real*8 a(256,256),b(256,256),c(256,256)
>	call matmul(a,b,c,256)
>...
>	  do k = 1, n
>	    c(i,j) = c(i,j) + a(i,k)*b(k,j)
>	  enddo

Won't Convex performance suffer here due to bank conflicts?  At least on
a CRAY, the above algorithm would not (or should not) find a place in any
production code.  The inner loop would be run over "i".
--
David J. Heisterberg		djh@osc.edu		And you all know
The Ohio Supercomputer Center	djh@ohstpy.bitnet	security Is mortals'
Columbus, Ohio  43212		ohstpy::djh		chiefest enemy.

mhcoffin@watmsg.uwaterloo.ca (Michael Coffin) (12/06/90)

In article <109872@convex.convex.com> patrick@convex.COM (Patrick F. McGehearty) writes:
>I suggest that the bc benchmark is worse than worthless for several reasons.
>[ ... ]
>For workstations, the SPECmark benchmarks provide programs which are
>sufficiently complex as to avoid the trivial trick optimizations.
>[...]

It is only worse than worthless if it doesn't correlate well with real
performance.  Does anyone have SPECmark numbers for these machines?
It would be interesting to run correlations between SPECmark and
BCmark.  I'll bet the correlation is high.

-mike

lars@spectrum.CMC.COM (Lars Poulsen) (12/06/90)

I can hear Eugene chuckling. Most of the objections to the "bc
benchmark" are valid, but they apply in some form or another to any
benchmark. It's just that this one is so simple that the issues become
obvious.
-- 
/ Lars Poulsen, SMTS Software Engineer
  CMC Rockwell  lars@CMC.COM

de5@ornl.gov (Dave Sill) (12/06/90)

In article <109872@convex.convex.com>, patrick@convex.COM (Patrick F. McGehearty) writes:
>
>I suggest that the bc benchmark is worse than worthless for several reasons.
>
>First, as pointed out, it is not measuring the raw add/multiply rate
>of the machine.

Thus it is worthless for measuring raw arithmetic speed.  Granted.

>It measures the "multi-precision arithmetic"
>capabilities as implemented by dc, which is mostly subroutine call/returns.

So how does that make it worthless?  E.g., assume I want to measure an
inexact mix of arithmetic, function calls, context switches, etc.
Does the bc test not depend on these types of tasks?

>Further, I have never seen a system where bc/dc is a significant user of
>cycles.  Thus, the less than expert user will believe the measurements
>represent something different from reality.

I don't see how you got from the premise to your conclusion.  As for
the "less that expert user", if we fall into the trap of making
twit-proof benchmarks and outlawing anything that's not obvious and
general, we're not going to get anywhere.

>Second, for most machines, little architecture or compiler work has
>been (or should be) done to optimize this application.

Lack of optimization does not invalidate a benchmark.

>So you will not be
>able to tell the difference between those machines which have features
>useful to your application and those which do not.

The bc test shouldn't be used to attempt to predict the performance of
one's application unless one has specifically determined that such a
comparison is valid.  Granted.  

>Third, widespread reporting of such a benchmark will encourage other,
>less knowledgeable buyers to read more into the numbers than should be
>read.

The twit-proof trap, again.  In this case, though, it's not really a
big issue because we're not proposing the bc test be added to the SPEC
suite.  It's a limited-usefulness, extremely trivial benchmark.

>Fourth, if buyers use the benchmark, then vendors will be encouraged
>to put resources into enhancing their performance on it instead of
>enhancing something useful.  This is a bad thing and the primary reason
>why I am posting.  Bad benchmarks lead to lots of wasted effort.

Oh come on!  This is the twit-proof argument again.  If someone's
stupid enough to use bc to specify minimum performance in a
procurement, then they absolutely deserve what they get.  I'm
certainly not going to lose any sleep over the possibility.

>I use the Whetstone benchmark as a "proof by example".

It seems like you're using it as a proof by counterexample.

> :
>In these cases, the development efforts were not totally wasted.  Efforts to
>speed up the transcendental functions (SIN, COS, etc) used in the Whetstones
>helped those applications which used the transcendentals.  I see no value
>to most users of general purpose computing (scientific or business) in
>optimizing bc/dc.

No vendor I know is stupid enough to devote their resources to
optimizing bc just because a handful of people use it as a trivial 
benchmark.  Even if it happened, though, what harm would there be in
that? 

>Many procurements require some minimum rate on some well-known benchmark
>for a vendor to be even allowed to bid.  If you can't make this number, you
>don't get a chance to show how good your architecture and compilers are for
>executing the customer's real application.  There are even a significant
>number of customers who do not run benchmarks before purchase.  They
>just rely on quoted numbers for well-known benchmarks.  It is our duty
>as responsible professionals to develop and measure benchmarks that mean
>something and which explain what they mean.

Exactly.  This is what the major commercial suites are for.  As I've
already said, though, one doesn't want to have to carry a SPEC tape
with them wherever they go.  It's not always feasible to run a major,
rigorous suite, and in many cases it's overkill.

For example, and I've already pointed out this use of the bc test,
let's say I'm working at my DECstation one day and it seems to be
sluggish.  Is it just my imagination, or is it really slower?  Should
I devote a day or so to running the SPEC suite just to find out, or
should I type "echo 2^5000/2^5000 | /bin/time bc" and compare it to
previous runs on the same machine or to a similarly configured system?
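
Since /bin/time writes its report to stderr, it's easy to keep a running
log to compare against (the log file name here is just an example):

	echo '2^5000/2^5000' | /bin/time bc > /dev/null 2>> $HOME/bc-times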

>If you really must have a "quick and dirty" benchmark, how about the
>following:
>
>	[FORTRAN program deleted]

Some of my systems don't have FORTRAN compilers.  It's too long to
easily remember and type in.  It doesn't do a mix of function calls,
arithmetic, forks, context switches, etc.

As I've already said: benchmarks are tools.  They come in all sizes,
perform many different tasks, vary greatly in quality, etc.  The bc
benchmark doesn't do everything, and, like any tool, can be used for
things it's not intended to be used for, but it *does* have its niche.

-- 
Dave Sill (de5@ornl.gov)
Martin Marietta Energy Systems
Workstation Support

jbuck@galileo.berkeley.edu (Joe Buck) (12/07/90)

In article <1990Dec6.000702.16354@watmath.waterloo.edu>, mhcoffin@watmsg.uwaterloo.ca (Michael Coffin) writes:
> It [bc benchmark] is only worse than worthless if it doesn't
> correlate well with real
> performance.  Does anyone have SPECmark numbers for these machines?
> It would be interesting to run correlations between SPECmark and
> BCmark.  I'll bet the correlation is high.

SPECmark is actually 10 numbers, not one.  It might be interesting
to see which of the 10 spec numbers correlates most strongly with
the bc number, and what the coefficient of correlation is.  Anyone
have SPEC numbers to do the experiment with?

--
Joe Buck
jbuck@galileo.berkeley.edu	 {uunet,ucbvax}!galileo.berkeley.edu!jbuck	

ciemo@bananapc.wpd.sgi.com (Dave Ciemiewicz) (12/07/90)

In article <dale.660251271@convex.convex.com>, dale@convex.com (Dale Lancaster) writes:
|> >Amos> Sure it doesn't; I wonder how no one else noted this yet: "bc"
|> >Amos> is probably the worst choice of a utility to benchmark by.  On
|> >Amos> most UNIX systems, it just parses expressions, and forks "dc" to
|> >Amos> execute them [...]
|> 
|> >Of course the biggest problem is that almost no one actually *uses*
|> >`bc' for any large amount of computation, so no vendor has any
|> >incentive to optimize its performance.
|> 
|> For this reason it makes as good a benchmark as any.  I suspect that most Unix
|> based bc's/dc's work the same.  This is a typical dusty-deck, no optimization
|> piece of code that shows the performance of the machine from not just the hardware
|> side but the software (compiler) side as well.  Today you have to benchmark the
|> compiler as much as the hardware.

I really can't believe that this joke is being perpetuated.  The idea of
benchmarking is to create a reference standard by which different machines
may *MEANINGFULLY* be compared.  Since bc may actually be coded differently
between systems, the comparisons become weaker.  I just diff'ed the sources
between BSD and SYSV versions of dc, which is the compute engine for bc.
There are changes to the SYSV version for robustness that may sway results
one way or the other.  For all intents and purposes, you might as well be
comparing the execution of bc using different stopwatches.  Since it is
purported that AIX is a complete rewrite of UNIX, they may have even rewritten
something as simple as dc to avoid licensing, which would also be a plausible
explanation for the faster performance on the RS6000.  It may not have
anything to do with the IBM compilers.  (Please correct me if I'm wrong.)

This mockery of a benchmark does not fall into the category of "unchanged,
dusty deck" code as many would believe.  We are measuring numbers with
different rulers.  Not only that, the rulers are the cheap plastic kind we
used in grade school.  When you put two of them side-by-side, you find out
they're different lengths.  Meaningless for precise comparison.  Possibly
meaningless for gross comparisons.

|> But of course the only true measure of performance
|> is my own application :-)
|> 
|> dml

These are the best kinds of comparisons of all: when I run my application on
this system, what is the performance difference when I try to accomplish
some real task?

The ANSYS people port their product to a number of architectures and then
run the products through scripted sets of tasks for comparisons.  This may
not be as swell or buzzwordy as MIPS or Dhrystones, yet these comparisons
are better for an individual trying to make comparisons between systems
for running the ANSYS products than any of the other "metrics".

Imagine buying a car based only on the horsepower, torque, and G ratings
without actually taking the car for a test drive.  The differences can
be astounding.

						--- David Ciemiewicz

P.S. I am speaking for myself.  I like to live the delusion that my company
supports my opinions; however, it really depends on what I'm talking about.

patrick@convex.COM (Patrick F. McGehearty) (12/07/90)

In article <1211@sunc.osc.edu> djh@xipe.osc.edu (David Heisterberg) writes:
>In article <109872@convex.convex.com> patrick@convex.COM (Patrick F. McGehearty) writes:
>>	program main
>>	real*8 a(256,256),b(256,256),c(256,256)
>>	call matmul(a,b,c,256)
>>...
>>	  do k = 1, n
>>	    c(i,j) = c(i,j) + a(i,k)*b(k,j)
>>	  enddo
>
>Won't Convex performance suffer here due to bank conflicts?  At least on
>a CRAY, the above algorithm would not (or should not) find a place in any
>production code.  The inner loop would be run over "i".

Just goes to show you (and me) how easy it is to not measure what you
think you are measuring.  There are many valid matrix-multiply patterns,
but different patterns give different performance levels, with some
better on some machines than others.  The one I listed was on the top
of my memory because it tests several of the optimization features
of the Convex fc compiler.  Thus, I use it occasionally to be sure we don't
lose these features.  In particular, it tests "loop splitting" and
"loop interchange".  Thus, the optimized version of the loop (which is
generated invisibly to the user) is:

	do i = 1,n	! split off to allow interchange
	do j = 1,n
	c(i,j) = 0.0
	enddo
	enddo
	do k = 1,n
	do j = 1,n
	do i = 1,n	! interchanged to avoid bank conflicts
	  c(i,j) = c(i,j) + a(i,k)*b(k,j)
	enddo
	enddo
	enddo

This version would be better for testing compute capabilities on most
vector machines than the original.  The following version would be better
for testing compute capabilities on some machines with weak optimizers,
but worse for vector machines:

	do i = 1,n
	do j = 1,n
	sum = 0.0
	do k = 1,n
	  sum = sum + a(i,k)*b(k,j)
	enddo
	c(i,j) = sum
	enddo
	enddo

The weak optimizers repeatedly compute the address of c(i,j) in the
first version, but not the second.  Unfortunately, the scalar sum
causes less efficient code to be generated for many vector machines.
So the same problem (matrix multiply) can show either of two
machines to be better depending on small details of the problem statement.

Which reemphasizes the point that trivial benchmarks mislead the less than
expert.  Note that I carefully do not say 'twit'.  It is very easy to make
errors in benchmark development and analysis.  I have seen all sorts of
errors made in these areas by people who are competent in their own
specialties.  Most Computer Science and Computer Engineering curricula
provide very little training in measurement methodology.  I learned much
more about measurement from my Physics and Experimental Psychology courses
than I did from my CS training.  Physics labs teach about experimental
error, and Psychology teaches about experimental design.  The CS I was
exposed to focused on modeling and analysis, with some discussion of
modeling errors.  Given this lack of training in measurement, specialists
in the field need to be aware of the naivety of the users of our results.

A benchmark should have some well-understood relation to the real purposes
for which a machine is to be used in order to have value.  If the
machine is for program development, then measure compilation of non-trivial
programs.  If the machine is for numerical work, measure some non-trivial
application kernels.  If databases, then run some db ops.  Etc.
Any single performance number without a definition of the proposed workload
is of little value.

jrbd@craycos.com (James Davies) (12/07/90)

In article <dale.660251271@convex.convex.com> dale@convex.com (Dale Lancaster) writes:
>
>For this reason it makes as good a benchmark as any.  I suspect that most Unix
>based bc's/dc's work the same.  This is a typical dusty-deck, no optimization
>piece of code that shows the performance of the machine from not just the 
>hardware side but the software (compiler) side as well.

I suspect it would take only a few hours of work on bc to add the peephole
optimization X/X ==> 1.  This would be totally pointless in practical terms,
but it might sell machines if everyone keeps taking this "benchmark" so
seriously :-)

>Today you have to benchmark the compiler as much as the hardware.

Unfortunately, this benchmark is too easy to cheat on (the above change
relies on neither the compiler nor the hardware, just bc itself).

>But of course the only true measure of performance is my own application :-)

That's the rationale for benchmarks using more realistic programs, like
SPEC and the Perfect Club.  I think the bc "benchmark" points up a major
reason that people like SPEC more than Perfect:  it sums up a machine's
performance in a single number.  If the Perfect people started computing
a "PerfectMark" by averaging their times, they might get more press coverage,
but it wouldn't necessarily mean they were measuring anything more useful.
People like easy answers to hard questions, I guess.

tim@proton.amd.com (Tim Olson) (12/07/90)

In article <1990Dec6.000702.16354@watmath.waterloo.edu> mhcoffin@watmsg.uwaterloo.ca (Michael Coffin) writes:
| It is only worse than worthless if it doesn't correlate well with real
| performance.  Does anyone have SPECmark numbers for these machines?
| It would be interesting to run correlations between SPECmark and
| BCmark.  I'll bet the correlation is high.

Well, it is fairly high, but that may be due to the limited
intersection of machines which have SPECMarks and machines listed on
the bc benchmark.  Here's what I have found so far:

SYSTEM                   TIME   SPEC   comp	ratio
------                   ----   ----   ----   -----
MIPS RC6280              2.8    44.0   44.0	1.00
IBM RS/6000 Model 530	 3.4    28.9   36.2	1.25
IBM RS/6000 Model 730    3.5    29.0   35.1	1.21
IBM RS/6000 320          4.4    22.3   27.9	1.25
Motorola 8612 33MHz 88k  5.2    17.8   23.6	1.33
SGI 4D25                 8.1    12.2   15.2	1.25
SPARCstation 1+         12.2    11.8   10.1	0.86
VAX 6410 (4.0)          12.8     6.8    9.6	1.41
SPARCstation 1          16.0     8.4	7.7	0.92

The "comp" column is a computed SPECMark based upon the bc time (1/t *
123) -- the 123 was chosen as the scale factor (assuming a linear
relationship of 1/t with SPECMark) so that the first machine (MIPS
RC6280) matched.
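
For the record, the arithmetic behind the "comp" and "ratio" columns is
just the following (the input file name and one-word machine names are
made up for illustration):

	# expects lines of the form: <machine> <bc_time> <SPECmark>
	awk '{ c = 123/$2; printf "%-16s %5.1f %4.2f\n", $1, c, c/$3 }' bc_vs_spec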

Note that the only CISC machine in the listing, a VAX 6410, has a high
ratio of "computed" SPECMark to real SPECMark, compared to the other
machines.  Also, the SPARCstations are predicted ~10% low, while the
MIPS-based machines are predicted 25% high.


*** PLEASE *** don't try to turn this into a quick way to guess a
SPECMark; numbers are for informational purposes, only! ;-)



--
	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (12/07/90)

In article <1990Dec6.221422.5676@mozart.amd.com> 
	tim@amd.com (Tim Olson) writes:
>The "comp" column is a computed SPECMark based upon the bc time (1/t *
>123) -- the 123 was chosen as the scale factor (assuming a linear
>relationship of 1/t with SPECMark) so that the first machine (MIPS
>RC6280) matched.

The Number You've All Been Waiting For: VAX 11/780 = 73.3

Evidently BCmarks come out smaller than SPECmarks. But then, we
already knew that it was a budget benchmark.
-- 
Don		D.C.Lindsay

hooft@ruunsa.fys.ruu.nl (Rob Hooft) (12/07/90)

In <1211@sunc.osc.edu> djh@xipe.osc.edu (David Heisterberg) writes:
>In article <109872@convex.convex.com> patrick@convex.COM (Patrick F. McGehearty) writes:
>>	program main
>>	call matmul(a,b,c,256)
>>...
>>	  do k = 1, n
>>	    c(i,j) = c(i,j) + a(i,k)*b(k,j)
>>	  enddo

>Won't Convex performance suffer here due to bank conflicts?

Convex is good! See below for the compiler results.


            Optimization by Loop for Routine MATMUL

Line     Iter.   Reordering            Optimizing / Special            Exec.
Num.     Var.    Transformation         Transformation                  Mode 
-----------------------------------------------------------------------------
   5     I       Dist                                                         
   5-1   I       FULL VECTOR Inter                                            
   5-2   I       FULL VECTOR Inter                                            
   6-1   J       Scalar                                                       
   6-2   J       Scalar                                                       
   8-2   K       Scalar                                                       

Line     Iter.   Analysis
Num.     Var.             
-----------------------------------------------------------------------------
   5-1   I       Interchanged to innermost
   5-2   I       Interchanged to innermost

-- 
Rob Hooft, Chemistry department University of Utrecht.
hooft@hutruu54.bitnet hooft@chem.ruu.nl hooft@fys.ruu.nl

zeeff@b-tech.ann-arbor.mi.us (Jon Zeeff) (12/07/90)

>>I suggest that the bc benchmark is worse than worthless for several reasons.
>>[ ... ]
>
>It is only worse than worthless if it doesn't correlate well with real
>performance.  Does anyone have SPECmark numbers for these machines?
>It would be interesting to run correlations between SPECmark and
>BCmark.  I'll bet the correlation is high.

I agree.  Nobody claims it's a great benchmark, but I'd like to see more 
proof (possibly in the form of comparisons to SPEC) instead of so much 
speculation.  

I'm surprised at how well it agreed with the rankings I expected.
-- 
Jon Zeeff (NIC handle JZ)	 zeeff@b-tech.ann-arbor.mi.us

jfc@athena.mit.edu (John F Carr) (12/08/90)

>Today you have to benchmark the compiler as much as the hardware

True: I noticed that dc for the IBM RT was compiled with pcc (the portable C
compiler, which does only peephole optimization).  I recompiled with a
better compiler and the time for the test dropped from 45 seconds to 25
seconds.  On a VAX, using gcc instead of pcc made about a 10% difference.
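
For anyone who wants to try the same thing, a rebuild along these lines
should do it (the source location and make variables here are guesses;
adjust for your system):

	cd /usr/src/bin/dc			# wherever your dc sources live
	make CC=gcc "CFLAGS=-O"			# rebuild with the better compiler
	echo '2 5000^ 2 5000^/ps.' | /bin/time ./dc	# time the new dc directly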

The VAX 3 series and the IBM RT make an interesting comparison.  DEC claims
4.9 MIPS for the faster versions of the VAX 3 (VAX 3100 model 38, VAX 3900).
The RT is generally considered to be around 3-4 MIPS (Dhrystone v1 says
4.4).  The "dc" benchmark (bc 2^5000/2^5000 pipes "2 5000^ 2 5000^/ps." to
dc on BSD UNIX) runs twice as fast on a VAX 3900 when the best compilers
available are used.  On the other hand, a big number math package I've been
using runs faster on the RT even though the VAX version is written in
assembly and the RT version in C.  Floating point is usually faster on the
VAX, but because the RT uses a 68881 coprocessor it does better on
transcendental functions.  Let me pick the benchmark and I'll demonstrate
that my RT is faster than a DECstation 3100 (~12 MIPS).

--
    John Carr (jfc@athena.mit.edu)