[comp.arch] benchmarking

johnw@astroatc.UUCP (John F. Wardale) (05/13/87)

In article <28200036@ccvaxa> preece@ccvaxa.UUCP writes:
>
>  grenley@nsc.nsc.com:
>> How about, instead, compiles?  They are usually CPU intense (unless you
>----------
>I don't think compilers are sufficiently comparable to make good
>benchmarks, unless you wanted to specify the compiler, too (say,

Ok, so here I am, a new developer....   I need to buy a unix box.
(I'm developing code for some flavor of unix.)  So I select about
a dozen "likely" candidates; put my sources on each; then run the
following on each:
		time "touch types.h;make unix" 
or some other, similar or reasonable thing.  I compare the speeds,
and costs, and buy the one that's most effective for me.

While this is the best approach for selecting a box to compile
kernels on, it has the following problems:

1)  It's an expensive (time consuming) exercise.
2)  One's actual uses for a system are *LIKELY* to change in the
	future.

-----------------------------

Grenley is right!  A lot of people want/need a "system
performance" benchmark!   I wish that benchmarks like dhrystone
*INCLUDED* the time-to-compile-link-etc. in the time for dhry's
per second!   This would make (generally slow) super-optimizing
compilers look less good, while favoring the lightning fast
(direct to memory -- a la turbo-pascal) compilers that may generate
slightly poorer than average code.
What difference does it make to me if super-O's `C' runs 15k dhry/sec
if it takes 2 or 3 times longer to compile than speedy's `C' which
only gets 8k dhry/sec?

Given numbers like this, I would REALLY want both, but then I like
to use interpreters (fast, threaded beasts, *NOT* like BASIC -- yuk!)
to develop code, and only "compile" it once.

-----------------------------

Picking a machine is ***A LOT*** more than finding the one with the
highest number on the XYZ benchmark that you can afford!

Anyone care to write an AI-ish program that collects prices, sw,
reliability, etc. etc. etc. (OK, gobs of benchmarks too) and
several formulas for calculating a single figure of merit, and helps
in matching a formula to your expected needs?   I would expect
results like: the following 10-30 machines would be good for you,
and I rate them as follows on a 1-5 scale (maybe a 1-10 scale).
(I assume that compressing to a single figure would be questionable
to even one significant digit, but I also realize I'm lazy
and need help picking the machine for me (or my company, etc.).)


			John W

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - 
Name:	John F. Wardale
UUCP:	... {seismo | harvard | ihnp4} !uwvax!astroatc!johnw
arpa:   astroatc!johnw@rsch.wisc.edu
snail:	5800 Cottage Gr. Rd. ;;; Madison WI 53716
audio:	608-221-9001 eXt 110

To err is human, to really foul up world news requires the net!

grenley@nsc.nsc.com (George Grenley) (05/13/87)

It's nice to see some discussion on this issue.  I think, based on the
response so far, that we need to fork this discussion into two categories.
One would be "CPU Benchmarks", which us chip peddlers would like, and
a "System Benchmark", which is what real users want.

In article <272@astroatc.UUCP> johnw@astroatc.UUCP (John F. Wardale) writes:
>Ok, so here I am, a new developer....   I need to buy a unix box.
>(I'm developing code for some flavor of unix.)  So I select about
>a dozen "likely" candidates; put my sources on each; then run the
>following on each:
>		time "touch types.h;make unix" 
>or some other, similar or reasonable thing.  I compare the speeds,
>and costs, and buy the one that's most effective for me.
>
>While this is the best approach for selecting a box to compile
>kernels on, it has the following problems:
>
>1)  It's an expensive (time consuming) exercise.

Why?  Assuming media portability, one ought to be able to run it in 
a reasonable amount of time - either don't compile ALL of Unix, or just
start it running and go on about your other duties.

>Grenley is right!  A lot of people want/need a "system
>performance" benchmark!   I wish that benchmarks like dhrystone
>*INCLUDED* the time-to-compile-link-etc. in the time for dhry's
>per second!   This would make (generally slow) super-optimizing
>compilers look less good, while favoring the lightning fast
>(direct to memory -- a la turbo-pascal) compilers that may generate
>slightly poorer than average code.

The answer, of course, depends on whether you're a programmer or a
user.  On most machines a program is compiled once and run many, many
times, so efficiency of compiled code is more important than
compile time.

Unless, of course, you are the programmer...

In general, though, if machine A runs a dumb compiler twice as fast as
machine B, it will run a smart compiler pretty close to twice as fast,
too.  So, for benchmarking, we can use any compiler.

My employer is justifiably proud of its new optimizing compiler which
gets about 20% faster code than the old one - so the system performance
goes up 20% with the same H/W - 20% performance bump for free!  But,
you have to ask yourself, is it `fair' to include compiler improvements
in CPU benchmarks?  Some say not; I disagree.  We are interested in the
H/W that produces the most overall performance.  After all, the whole
idea behind RISC machines is that they're easy to write optimizing
compilers for.  It wouldn't be fair not to allow them to use them.

eugene@pioneer.UUCP (05/20/87)

Gee whiz!  I go to a conference and take a little vacation, and then
there are 150 architecture articles with lots on benchmarking, all bad.
I've 35 to go and have yet to see a completely scientific one.  No wonder
physicists don't think of computing "science" as a science {Sorry, John,
nothing personal against you}.  You guys should go out and get books on
experiment design: Campbell and Stanley, Cochran and Cox, or find
one of Phil Heidelberger's (IBM, IEEE) articles.  No sense in trying to
knock down what's wrong with your various horse races {Note: I just made
a video tape about this for a supercomputer class at U of Idaho
yesterday}.  More shortly; we will publish a bibliography for PER.

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  {hplabs,hao,ihnp4,decwrl,allegra,tektronix,menlo70}!ames!aurora!eugene

reiter@endor.harvard.edu (Ehud Reiter) (05/20/87)

In article <1589@ames.UUCP> eugene@pioneer.arpa (Eugene Miya N.) writes:
>150 architecture articles with lots on benchmarking all bad ...
>You guys should go out and get books on
>experiment design: Campbell and Stanley, Cochran and Cox, or find
>one of Phil Heidelberger's (IBM, IEEE) articles.

I've seen Eugene Miya say many times that good scientific experimental design
should be used for benchmarking, but I'm still a bit puzzled as to how this
should be done.  Ideally, if we had good data on exactly what programs were typically
run by each class of user, then we could measure the performance of a machine
for a carefully chosen set of "benchmark" programs, and then use the above data
to extrapolate the machine's performance for each user class.  I assume this
is what Eugene means by a good experiment.

However, I've never seen good data on what programs typical users run, and
without this data we can not perform the above "experiment".  Perhaps this
just means we should try to gather good data on what programs users run -
I think this is a great idea, as long as someone else does it!

					Ehud Reiter
					reiter@harvard	(ARPA,BITNET,UUCP)
					reiter@harvard.harvard.EDU  (new ARPA)

eugene@pioneer.arpa (Eugene Miya N.) (05/21/87)

In article <6024@steinmetz.steinmetz.UUCP> William E. Davidsen Jr writes:
>
>After doing benchmarks for about 15 years now, I will assure everyone
>that the hard part is not getting reproducible results, but in (a)
>deciding how these relate to the problem you want to solve, and (b) getting
>people to believe that there is no "one number" which can be used to
>characterize performance. If pressed I use the reciprocal of the total
>real time to run the suite. It's as good as any other voodoo number...

Yes, I agree, and I have not had to do it that long.  Let's take a moment
to study ways to relate or characterize end-users' applications:
1) without gross generalizations, but with real quantitative data,
and 2) using common ideas and tools.  Okay?  Static as well as dynamic
tools.  What can we tell independent of machines and languages?

Second:
There are lots of disciplines that abuse and use single figures of merit
and manage to get away from them.  Consider: earlier in the season (end of
ski season, really), the base of NS was a sea of mud, while 2/3 of the way
up the mountain, in a sheltered area, the snow gauge read 5.5 feet.  You
think we have problems with measurement?  Is an average (the integral of
the snow-depth function over the whole area of the ski resort, divided by
that area) a reasonable way to characterize resort coverage?

Do we buy cars on single figures of merit?  If not, then how many?

Consider cardiology: heart function.  Single figures are used: heart
rates, but EKGs are much better; they portray more.  A picture is worth a
thousand words?  Try embedding one on the net with any good resolution.

Yes, we can get away, but we have to take others with us.  I'd better stop
before Alan Smith totally loses respect (he probably has already).

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  {hplabs,hao,ihnp4,decwrl,allegra,tektronix,menlo70}!ames!aurora!eugene

nerd@percival.UUCP (Michael Galassi) (05/25/87)

In article <415@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:

>As larry says, real page-thrashers are highly dependent on a lot of attributes.
>That doesn't mean they're bad tests, merely that they're extremely hard
>to do in a controlled way.  In particular, you often see radically different
>results according to buffer cache sizes, for example.
>
>-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>

I've not seen this stated around here so I'll do it.
Benchmarks can be divided into two major categories:
Those which exercise the processor (CPU, FPU, MMU, etc...) and those which
exercise the WHOLE computer (i.e. the i/o system too).  For the person who
is evaluating a CPU family for a new design I can see where the first
class of benchmarks comes in VERY handy, but for the rest of us (those who
want to buy a computer, install UNIX, and generate accounts) the MIPS,
FLOPS, *stones, etc. that the cpu will do are rarely of much interest.
I care much more about how the system will handle a dozen users
all doing real tasks (vi, cc, f77, rn, rogue, or whatever) than I do
about the time it takes the cpu to find the first X primes when it
is not installed in its cardcage where god wanted it to be.
I guess I don't care much about the "a lot of attributes" individually,
but rather how they all work together.  Give me anything that overall
performs well (so long as there is no intel cpu in it) and I'll be
pleased as pie.
-michael
-- 
If my employer knew my opinions he would probably look for another engineer.

	Michael Galassi, Frye Electronics, Tigard, OR
	..!{decvax,ucbvax,ihnp4,seismo}!tektronix!reed!percival!nerd

mash@mips.UUCP (05/26/87)

In article <642@percival.UUCP> nerd@percival.UUCP (Michael Galassi) writes:
>In article <415@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:

>>That doesn't mean they're bad tests, merely that they're extremely hard
>>to do in a controlled way.  In particular, you often see radically different
>>results according to buffer cache sizes, for example.

>Benchmarks can be divided into two major categories:
>Those which exercise the processor (CPU, FPU, MMU, etc...) and those which
>exercise the WHOLE computer (i.e. the i/o system too).  For the person who
>is evaluating a CPU family for a new design I can see where the first
>class of benchmarks comes in VERY handy, but for the rest of us (those who
>want to buy a computer, install UNIX, and generate accounts) the MIPS,
>FLOPS, *stones, etc. that the cpu will do are rarely of much interest.
>I care much more about how the system will handle a dozen users
>all doing real tasks (vi, cc, f77, rn, rogue, or whatever) than I do...
>I guess I don't care much about the "a lot of attributes" individually,
>but rather how they all work together.  Give me anything that overall
>performs well (so long as there is no intel cpu in it) and I'll be
>pleased as pie.

1) There ARE people who mostly care about computational benchmarks;
some of the CAD folks are perfect examples, as are those who run troff, etc.
But that's not the point.

2) I think most people in this newsgroup understand that system benchmarks
are important.  I'll try one more time: THEY'RE JUST HARD TO DO. That
doesn't stop people from doing them, which makes especially good sense if they
have some job streams that really represent their loads.  We do these sorts
of benchmarks all the time; I've been doing UNIX system-type benchmarks of
one ilk or another for a lot of years.  The trouble is, it's going to be hard
enough to agree on some compute-bound benchmarks, without the hassle of trying
to normalize all the rest of the stuff.  For example, do you normalize
on system cost?  Do you normalize memory sizes?  Do you normalize on
disk number and type?  All that we're saying is that system benchmarks
are painfully hard to get representative; there are many pitfalls
and benchmarking weirdnesses to look out for; "overall performs well"
is a REAL hard metric, for example.

Note: I don't yet see a strong sense of agreement on a set of
CPU benchmarks that we believe.  From past experience, getting a
set of system benchmarks that people agree on will be much harder.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

eugene@pioneer.arpa (Eugene Miya N.) (05/26/87)

I am completing another iteration of one of my prototype test programs.
The test is written in FORTRAN, but could easily be written in C or
(yes) Pascal [in fact I would encourage writing a Pascal version].  It
is a preliminary program to be run prior to a CPU/memory test (no
floating-point, that's third).  But back to this first test.

It's very simple, but also appears a little dumb.  It's designed to test
for one type of optimization as well as test the quality of a system
clock.  Just give me a week or so.  If there is enough interest I will
post it, otherwise I will take selected interested parties.  John can be
certain he will get a copy: remember, it's simple, obvious, and appears
somewhat stupid.  Remember, it tests clock quality which is certainly
important to any subsequent tests.

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  {hplabs,hao,ihnp4,decwrl,allegra,tektronix,menlo70}!ames!aurora!eugene

eugene@pioneer.UUCP (05/27/87)

In article <3490003@wdl1.UUCP> bobw@wdl1.UUCP (Robert Lee Wilson Jr.) writes:
>
>I've never been quite sure what that accomplishes. To put it another
>way, what is the benchmark supposed to be measuring: SYSTEM
>performance, or HARDWARE performance.
>
>-----------------------------------------------------------------
>I disclaim almost everything, probably including this line.

Let me ask: HOW DO YOU SEPARATE THEM?  [I think it's possible.]
People talk about CPU and memory performance benchmarks: how do you
separate these?  Can you tell me when something is hardware bound or
software bound?  What does it mean when you say system?  Is the WHOLE
[Another poster's term] of a system equal to the sum of its parts?
[Take optimizers into account.]  Do we have to say: nope we can't
separate them, there is a Gestalt working here, and we have to assume
the applications and the machine are ATOMIC (indivisible) for to divide
the problem into parts would destroy the character of the problem
(benchmarking the machine).

For those people only concerned about running their applications: while
you have valid concerns [i.e., getting the job done], there are a few
people who seek progress.  They seek to understand where their problems
run, and to look to the future to improve their performance rather than
treat their work solely like a black box.  Give these architects,
engineers, and scientists some credit some time, for they are the ones who
look to the future (to improvements).  Sure, computers are a tool, but
you have to hone your tools; thank God for Seymour Cray.

--eugene

reiter@endor.harvard.edu (Ehud Reiter) (05/27/87)

I think some people are perhaps missing the point.  Of course, we would all like
system benchmarks which accurately predict the performance for our workloads.
But such benchmarks are usually impossible, because performance varies quite
a bit depending on workload, and most users just don't have a very good idea
of what their workload is and will evolve into.  Even when the workload is
accurately known, this kind of benchmarking is expensive and time-consuming.

The point is, there is a great demand out there for simple, single figure
performance numbers which are in the public domain.  No matter how much we
complain that single figures are meaningless, people out there in the real
world are going to continue using them.  There's a reason why MIPS and
Dhrystones are so often quoted.

And, we can do better than Dhrystone!  We all know what the problems with
Dhrystone are - can't be globally optimized, too much string handling,
too small, etc.  We can certainly write a benchmark which, although still
"bad", will be much better than Dhrystone.

I think we can even get away with replacing single-number benchmarks by
two number benchmarks, which would give a high and low performance figure
instead of just a single performance figure (that is, the benchmark would
consist of lots of programs.  The performance numbers would be normalized
against some standard (good old 4.2BSD VAX-11/780?), and the summary
statistics would be the highest and lowest of the normalized numbers).
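
As a purely made-up illustration of the two-number idea: take each
program's elapsed time on the reference machine, divide by its time on
the machine under test, and report the smallest and largest ratios.  A
minimal sketch in C (the program names and timings are invented):

#include <stdio.h>

#define NPROG 4

int main(void)
{
    /* hypothetical elapsed times in seconds; ref[] is the VAX-11/780 */
    static const char *name[NPROG] = { "progA", "progB", "progC", "progD" };
    static const double ref[NPROG]  = { 100.0, 250.0, 40.0, 600.0 };
    static const double test[NPROG] = {  12.5,  20.0,  8.0,  30.0 };
    double lo = 1.0e30, hi = 0.0;
    int i;

    for (i = 0; i < NPROG; i++) {
        double ratio = ref[i] / test[i];    /* >1.0 means faster than the VAX */
        if (ratio < lo) lo = ratio;
        if (ratio > hi) hi = ratio;
        printf("%-6s %6.2f x VAX\n", name[i], ratio);
    }
    printf("summary: low %.2f, high %.2f (times a VAX-11/780)\n", lo, hi);
    return 0;
}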

In summary, we can't write a perfect benchmark, but we can write a better
benchmark.

					Ehud Reiter
					reiter@harvard	(ARPA,BITNET,UUCP)
					reiter@harvard.harvard.EDU  (new ARPA)

ps@celerity.UUCP (Pat Shanahan) (05/29/87)

In article <2100@husc6.UUCP> reiter@endor.UUCP (Ehud Reiter) writes:
>...
>The point is, there is a great demand out there for simple, single figure
>performance numbers which are in the public domain.  No matter how much we
>complain that single figures are meaningless, people out there in the real
>world are going to continue using them.  There's a reason why MIPS and
>Dhrystones are so often quoted.

This is very unfortunate, if true. People who believe simple, single-figure
performance numbers are doomed to be surprised by reality.

>
>And, we can do better than Dhrystone!  We all know what the problems with
>Dhrystone are - can't be globally optimized, too much string handling,
>too small, etc.  We can certainly write a benchmark which, although still
>"bad", will be much better than Dhrystone.

I agree. I don't know of any real C program that does as much structure
assignment as the C Dhrystone. I think that C performance is important
enough to justify a benchmark that reflects how the language is actually
used.

>
>I think we can even get away with replacing single-number benchmarks by
>two number benchmarks, which would give a high and low performance figure
>instead of just a single performance figure (that is, the benchmark would
>consist of lots of programs.  The performance numbers would be normalized
>against some standard (good old 4.2BSD VAX-11/780?), and the summary
>statistics would be the highest and lowest of the normalized numbers).

I think a better approach would be the one taken in the Livermore loops
benchmark. The report includes the performance for the individual loops, as
well as summary information such as the harmonic mean. I am not sure if
high and low would really help much, except in convincing people that single
numbers are meaningless. The extreme outliers can be due to architectural
choices that are good for most programs but bad for certain exceptional
programs. For example, pipelining may be good for real programs, but bad for
an artificial test of jump performance. If you are going to report high and
low it is very important to make all the benchmark programs reasonably
mixed. If you are going to report individual results this is less critical.
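
For what it's worth, that style of summary is easy to show.  A toy
sketch in C with invented per-kernel rates; the harmonic mean weights
by time, so a couple of very fast loops can't hide the slow ones:

#include <stdio.h>

int main(void)
{
    /* invented per-kernel rates, e.g. MFLOPS for each Livermore loop */
    static const double rate[] = { 4.0, 12.0, 0.8, 6.5, 2.2 };
    int n = sizeof rate / sizeof rate[0];
    double sum_inv = 0.0;
    int i;

    for (i = 0; i < n; i++) {
        printf("kernel %2d: %6.2f MFLOPS\n", i + 1, rate[i]);
        sum_inv += 1.0 / rate[i];
    }
    printf("harmonic mean: %.2f MFLOPS\n", (double)n / sum_inv);
    return 0;
}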

>
>In summary, we can't write a perfect benchmark, but we can write a better
>benchmark.
>
>					Ehud Reiter
>					reiter@harvard	(ARPA,BITNET,UUCP)
>					reiter@harvard.harvard.EDU  (new ARPA)


It should certainly be possible to write a better benchmark of C performance
than the Dhrystone.
-- 
	ps
	(Pat Shanahan)
	uucp : {decvax!ucbvax || ihnp4 || philabs}!sdcsvax!celerity!ps
	arpa : sdcsvax!celerity!ps@nosc

mdr@reed.UUCP (Mike Rutenberg) (10/04/88)

In article <6729@nsc.nsc.com> grenley@nsc.UUCP (George Grenley) writes:
>some official number, rather than measured data.  Also, there are frequently
>variations in supposedly standard code.  I have different versions of Dhry1.1
>which vary over 40%, even though they are supposedly the same code.

But it is so hard to make it run and yet be "the same code."

The problem is that to get good results with a given benchmark within a
given system, you often do have to tweak things to get comparable
numbers, often holding your breath that it all works out.

In compiling the dhrystone benchmark, a C compiler I use will remove
the loop overhead calculation since it is simply a loop that has
no body and a trivial side effect on the iteration variable.  Even
procedure calls are eliminated by the compiler if it is clear the
called procedure does not do anything.

@BEGIN(Black Magic)
You can do things to trick the compiler into keeping the loop.  A null
procedure the loop calls will do the trick if compiled separately.  But
then you have to also put a call to this null procedure in the main
dhrystone loop.  But this may do bad things to your numbers, especially
if it affects your cache hit-rate.  And this will change the numbers
you get, not in a positive way.
@END(Black Magic)
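
For the curious, a bare sketch of the separate-compilation trick
described above (the file names and identifiers are made up, and this
is not the actual Dhrystone code):

/* dummy.c -- compiled separately, so callers can't see that it's empty */
void dummy(void)
{
}

/* overhead.c -- because dummy() is in another compilation unit, the
 * compiler must assume it has side effects and keep both the call
 * and the loop around it */
extern void dummy(void);

void loop_overhead(long iterations)
{
    long i;

    for (i = 0; i < iterations; i++)
        dummy();
}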

I wish benchmarks would be rewritten to be ultimately portable and
really, really smart about outwitting too-smart compilers.  It would be
nice to be able to run a benchmark program totally unchanged.  This
would avoid the temptation or need to modify the tests.

Mike

mash@mips.COM (John Mashey) (10/07/88)

In article <10498@reed.UUCP> mdr@reed.UUCP (Mike Rutenberg) writes:
...
>But it is so hard to make it run and yet be "the same code."

>The problem is that to get good results with a given benchmark within a
>given system, you often do have to tweak things to get comparable
>numbers, often holding your breath that it all works out.

>In compiling the dhrystone benchmark, a C compiler I use will remove
>the loop overhead calculation since it is simply a loop that has
>no body and a trivial side effect on the iteration variable.  Even
>procedure calls are eliminated by the compiler if it is clear the
>called procedure does not do anything.

>@BEGIN(Black Magic)
>You can do things to trick the compiler into keeping the loop.  A null
>procedure the loop calls will do the trick if compiled separately.  But
>then you have to also put a call to this null procedure in the main
>dhrystone loop.  But this may do bad things to your numbers, especially
>if it affects your cache hit-rate.  And this will change the numbers
>you get, not in a positive way.
>@END(Black Magic)

>I wish benchmarks would be rewritten to be ultimately portable and
>really, really smart about outwitting too-smart compilers.  It would be
>nice to be able to run a benchmark program totally unchanged.  This
>would avoid the temptation or need to modify the tests.

(back from 3 weeks' Down Under; it will take a while to catch up!)

ONE MORE TIME:
	use large, real programs as benchmarks.
	do NOT use small programs as benchmarks
	be especially careful of small synthetic benchmarks

Two of the most counterproductive things people can be doing are:
	a) Tuning compilers to optimize small benchmarks, especially
	with optimizations that don't really matter much on real
	programs. (Optimizations that actually matter elsewhere are fine.)
	b) Continually reworking synthetic benchmarks to stay ahead
	of advances in compiler optimization.
It is sad how much effort across this business has gone down the rat-holes.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

grenley@nsc.nsc.com (George Grenley) (10/08/88)

The following discussion started with a posting of mine about organizing some
head to head benchmark comparisons.  I wanted to give all interested parties
a chance to look at one another's hardware...  The primary reason is that
it is difficult if not impossible to reproduce most vendors' benchmark
numbers - and I specifically include my employer, NSC, in this category.  We
publish 16600 for Dhry1.1 at 30 MHz, no wait states - but no '532 has ever
run exactly that number.  (It came from the simulator.)

One reason is simply that most companies won't spend the money to go out and get
other companies' hardware to test.

In article <4655@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>In article <10498@reed.UUCP> mdr@reed.UUCP (Mike Rutenberg) writes:
 ...
>>But it is so hard to make it run and yet be "the same code."

(deleted, reference to why Dhry is susceptible to over-optimization...)

>
>>@BEGIN(Black Magic)
>>You can do things to trick the compiler into keeping the loop.  A null
>>procedure the loop calls will do the trick if compiled separately.  But
>>then you have to also put a call to this null procedure in the main
>>dhrystone loop.  But this may do bad things to your numbers, especially
>>if it affects your cache hit-rate.  And this will change the numbers
>>you get, not in a positive way.
>>@END(Black Magic)
>
>>I wish benchmarks would be rewritten to be ultimately portable and
>>really, really smart about outwitting too-smart compilers.  It would be
>>nice to be able to run a benchmark program totally unchanged.  This
>>would avoid the temptation or need to modify the tests.

AGREED! SO LET'S DO IT!  Time for Dhry 3.0, or whatever.  It seems to me the
easiest way to tackle the loop-that-does-nothing problem is to have it do
something, preferably process a variable that is supplied at run time, so 
the compiler cannot know what it is going to do...
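
By way of illustration only (this is not a proposed Dhry 3.0), the
shape of such a loop in C: the trip count and operand arrive on the
command line, so the compiler can neither fold the loop away nor
precompute its result, and the checksum is printed so the work is not
dead:

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    long n = (argc > 1) ? atol(argv[1]) : 1000000L;  /* unknown until run time */
    unsigned long sum = 0;
    long i;

    for (i = 0; i < n; i++)
        sum += (unsigned long)(i ^ n);  /* every iteration depends on the run-time value */

    printf("checksum %lu\n", sum);      /* the result is used, so the loop is live */
    return 0;
}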

But in any case, some new CPU benchmarks need to be developed.  Perhaps we can
all agree that an existing one is suitable, or perhaps we need to create a
new one.

>(back from 3 weeks' Down Under; it will take a while to catch up!)
>
>ONE MORE TIME:
>	use large, real programs as benchmarks.
>	do NOT use small programs as benchmarks
>	be especially careful of small synthetic benchmarks
>
>Two of the most counterproductive things people can be doing are:
>	a) Tuning compilers to optimize small benchmarks, especially
>	with optimizations that don't really matter much on real
>	programs. (Optimizations that actually matter elsewhere are fine.)
>	b) Continually reworking synthetic benchmarks to stay ahead
>	of advances in compiler optimization.
>It is sad how much effort across this business has gone down the rat-holes.

Agreed on all counts, especially regarding wasted effort (quadbyte string
compare on a certain new processor - can you say "dhrystone in microcode"?)

The only drawback to John's generally correct suggestion is the lack of any
standards for larger integer benchmark programs.  Whet and Linpack seem
pretty well established as FP b'marks, although I wonder whether they're not
a bit cooked sometimes... (I heard once that a Fortran compiler was released
which SPECIFICALLY checked the source to see if it was Whet, and if it was,
stuck in a VERY fast routine).

Someone on this net suggested using GNU's public domain versions of various
Unix utilities (grep, nroff, etc).  Sounds like a good plan to me - it doesn't
matter if they're unix compatible, just so they compile and run.

Perhaps the first step is to convene a working group to standardize this
stuff and promote its use.  I volunteer.  Any others?

George Grenley
NSC

bcase@cup.portal.com (10/09/88)

George Grenley says:
>Agreed on all counts, especially regarding wasted effort (quadbyte string
>compare on a certain new processor - can you say "dhrystone in microcode"?)

I don't think you really want to start this line of argument; I mean, if
we are talking about wasted effort, there are many who would agree that
pursuing aggressive implementations of certain processor architectures
is plenty wasted (please believe me, I don't necessarily mean the 32000).
Besides, that quadbyte string compare isn't dhrystone
in microcode, there ain't no microcode!  :-) :-) :-) (by traditional
definitions anyway).  If SPARC or MIPS or somebody takes over the market
completely, then what effort was wasted and what wasn't?

Let's all have a nice day.

henry@utzoo.uucp (Henry Spencer) (10/09/88)

In article <6868@nsc.nsc.com> grenley@nsc.nsc.com.UUCP (George Grenley) writes:
>Someone on this net suggested using GNU's public domain versions of various
>Unix utilities (grep, nroff, etc).  Sounds like a good plan to me...

Note three complications:

	1. The GNU stuff is ***NOT*** public domain.

	2. It tends to be huge.

	3. It really knows it's on a 32-bit machine.

These are not necessarily fatal problems, although #3 may be a problem for
wider-word machines, but they are worth concern.
-- 
The meek can have the Earth;    |    Henry Spencer at U of Toronto Zoology
the rest of us have other plans.|uunet!attcan!utzoo!henry henry@zoo.toronto.edu

rik@june.cs.washington.edu (Rik Littlefield) (10/09/88)

Many postings in this stream seem to assume that "large, real" programs are
somehow the most fair to use for benchmarking.  That's not necessarily true.
Any program that has had all or most of its development on a single system
has undoubtedly been tuned for best performance ON THAT SYSTEM.  Look at the
series of postings on "Duff's device" (an unrolled loop) -- systems without
instruction caches (or with large ones :-) tend to produce programs that use
Duff's device, those with small caches encourage using tight loops instead.
If somebody's compiler doesn't do induction on array index expressions, they
tend to write critical loops using pointers.  Etc, etc.  I'd guess that an
awful lot of Unix programs have been tuned to whatever it is that pcc does
or doesn't do.  The point is, large real programs tend to have long
histories that bias them in favor of old compiler technology and
architectures.

Another problem with large real programs is that it's often very difficult
to tell what the benchmark results mean.  Does nroff run fast on system Q
because Q does stream I/O especially well, or because Q is really good at
optimizing some 10-line inner loop that shoves around characters?  If I
can't read the code or tell where it's spending its time, how can I possibly
relate a benchmark result to some different program or application?
Personally, I get a lot more insight out of a few hundred lines of good test
cases that I can understand in detail.  

Now, I'm all in favor of benchmarking large real programs, particularly the
ones that *I* like to run.  They also make a very nice sanity check to guard
against silly benchmark deficiencies like do-nothing loops and results that
can be determined at compile time.  But if cost constraints make me pick one
or the other, I'll take the suite of synthetic tests any day.

--Rik

pardo@june.cs.washington.edu (David Keppel) (10/10/88)

rik@june.cs.washington.edu (Rik Littlefield) writes:
>[ large "real" program benchmarks vs. synthetic benchmarks ]

Oh, gee, an opportunity to apply the scientific method :-)

(a) Benchmark a bunch of computer systems (hardware/os/compiler)
    using synthetic benchmarks.
(b) Compare the benchmark performance to observations in the
    "real" world.
(c) Learn something about benchmarks, refine your synthetic
    benchmarks.
(d) go to (a)   (Oh no, not a GOTO!)

People like Kahan bitched to other computer scientists about floating
point inconsistency.  Now people formally study floating point numbers
as a subject.  Performance modelling is a formal area, but I don't
know anybody studying "benchmarks" as a formal subject.  When people
do, benchmarks may get much better.

	;-D on  ( We're in the dark about benchmurking )  Pardo
-- 
		    pardo@cs.washington.edu
    {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo

mash@mips.COM (John Mashey) (10/10/88)

In article <1988Oct9.011633.13259@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:

>Note three complications:

>	1. The GNU stuff is ***NOT*** public domain.

>	2. It tends to be huge.

Reasons why people have liked these as potential benchmarks are:
	1. Although not public domain, "generally available" is much better
	than "proprietary", which is unfortunately true for many otherwise
	desirable benchmarks.
	2. Tends to be huge.  GOOD!  Most of the common integer benchmarks are
	toys or near-toys, unlike the floating-point ones.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

grenley@nsc.nsc.com (George Grenley) (10/10/88)

In article <4853@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>In article <1988Oct9.011633.13259@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>
>>Note three complications:
>
>>	1. The GNU stuff is ***NOT*** public domain.

Let me apologize to the nice folks (creatures?) at GNU.  I meant that GNU source
was available & standard, but I in no way meant to imply it was in the public
domain.  Thanx to Henry at UT for pointing this out...

>>	2. It tends to be huge.
>
>Reasons why people have liked these as potential benchmarks are:
>	1. Although not public domain, "generally available" is much better
>	than "proprietary", which is unfortunately true for many otherwise
>	desirable benchmarks.
>	2. Tends to be huge.  GOOD!  Most of the common integer benchmarks are
>	toys or near-toys, unlike the floating-point ones.
>-- 
>-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>

I want to clarify my earlier posting about b'marks - I am primarily interested
in CPU benchmarks, not system b'marks.  This is partly because my employer
makes chips, not systems (we tried that once, and we still get mail about it),
and partly because, as a hardware engineer, I am interested in providing the
kind of info that will assist other h/w engineers in making a CPU selection
on the merits, not based on whimsy/guesswork etc.

Unfortunately, this argues against large, OS-dependent benchmark programs.  I
imagine that trying to make grep run "standalone" is not trivial....

I've received a lot of email on b'marking; one individual pointed out that
the database community "scales" the size of the b'mark (i.e., size of dbase)
to the size of machine.  An interesting idea.  I think we should consider
taking some of the small integer b'marks and "enlarging" them by having the
program call itself recursively in a non-trivial way.  Then, the test would
consist of running the program at, say, 1 through 1000 levels of recursion,
or whenever you run out of RAM.  Then, publish the performance numbers.
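
Roughly what I have in mind, as a hedged sketch only - work() here is
just a stand-in for a real benchmark body, and all the names are
invented:

#include <stdio.h>
#include <stdlib.h>

/* stand-in for the real benchmark body (Dhrystone inner work, etc.) */
static unsigned long work(unsigned long seed)
{
    return seed * 1103515245UL + 12345UL;
}

/* recurse to the requested depth, then do the work at every level on
 * the way back up, so the active stack depth scales with the parameter */
static unsigned long bench(int depth, unsigned long seed)
{
    if (depth > 0)
        seed = bench(depth - 1, seed);
    return work(seed);
}

int main(int argc, char **argv)
{
    int maxdepth = (argc > 1) ? atoi(argv[1]) : 1000;
    unsigned long result = 1;
    int d;

    for (d = 1; d <= maxdepth; d *= 10)   /* time each depth externally, e.g. with time(1) */
        result ^= bench(d, result);

    printf("checksum %lu\n", result);
    return 0;
}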

Comments?  I am willing to volunteer to drive this if anyone (like, f'rinstance,
someone who can code better than me) wants to help.  Maybe we'll even get it
O-f'shally blessed by IEEEEEEEE.

Regards,
George Grenley
NSC

ps to Henry Spencer - is UT still using any series 32000 iron?  If so, drop me
a note - I may have some news of interest to you. 

mash@mips.COM (John Mashey) (10/12/88)

In article <6001@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield) writes:
>Many postings in this stream seem to assume that "large, real" programs are
>somehow the most fair to use for benchmarking.  That's not necessarily true.
As we've said numerous times, the best benchmark for anybody is for them
to run their own real applications, because such applications obviously
have the highest correlation with what they'll see in real use.
When I keep saying "use large, real programs", it's because I usually
have in front of me numerous statistics about the behavior of programs
that show that most of the toy benchmarks aren't very good predictors of
the real applications, especially when applied to the higher-performance
designs.  Why is this?
	a) Toys don't stress cache designs, so that small caches and large
	ones act about the same, which is simply untrue for many real programs.
	(in this case, "cache" includes any place in the memory hierarchy,
	including registers, stack caches, register windows, 1-to-n levels
	of memory caches, disk caches in main memory, etc.)
	b) Toys don't stress limits.  For example, consider the performance
	differences attributable to the different X86 memory models.
	c) Toys don't stress software.  Anybody can compile Dhrystone or
	Whetstone, and many can optimize them.  Compiling/optimizing Spice
	tells you a lot more.

>Any program that has had all or most of its development on a single system
>has undoubtedly been tuned for best performance ON THAT SYSTEM.  Look at the
>series of postings on "Duff's device" (an unrolled loop) -- systems without
>instruction caches (or with large ones :-) tend to produce programs that use
>Duff's device, those with small caches encourage using tight loops instead.
>If somebody's compiler doesn't do induction on array index expressions, they
>tend to write critical loops using pointers.  Etc, etc.  I'd guess that an
>awful lot of Unix programs have been tuned to whatever it is that pcc does
>or doesn't do.  The point is, large real programs tend to have long
>histories that bias them in favor of old compiler technology and
>architectures.
Most application software doesn't worry about this kind of thing very much:
the 3rd-party folks worry most about making things work across lots of
machines.
>
>Another problem with large real programs is that it's often very difficult
>to tell what the benchmark results mean.  Does nroff run fast on system Q
>because Q does stream I/O especially well, or because Q is really good at
>optimizing some 10-line inner loop that shoves around characters?  If I
>can't read the code or tell where it's spending its time, how can I possibly
>relate a benchmark result to some different program or application?
>Personally, I get a lot more insight out of a few hundred lines of good test
>cases that I can understand in detail.  
This is certainly true, although good measurement tools help you figure out
where the time is going.  Of course, if you have small benchmarks that
give you good correlation with what you actually use, then you're OK,
and there's nothing wrong with using them, i.e., by definition, you're
using something correlated with your real applications.  One of the points
we've tried to make is that one must be very careful when using simple
benchmarks to predict the performance across wider ranges of architecture
and software.  For example, simple benchmarks used to analyze PC-class
machines don't necessarily work very well for larger ones.  (For PC-class
machines, you can probably get a first-order prediction by knowing
clock rate, CPU type, and memory latency.)
>
>Now, I'm all in favor of benchmarking large real programs, particularly the
>ones that *I* like to run.  They also make a very nice sanity check to guard
>against silly benchmark deficiencies like do-nothing loops and results that
>can be determined at compile time.  But if cost constraints make me pick one
>or the other, I'll take the suite of synthetic tests any day.

It is, of course, a goal for many people in this to create small synthetic
benchmarks that accurately predict the behavior on large real applications,
and this is a very desirable goal.  It's merely hard! 
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

lindsay@k.gp.cs.cmu.edu (Donald Lindsay) (10/12/88)

In article <6899@nsc.nsc.com> grenley@nsc.nsc.com.UUCP (George Grenley) writes:
>I've received a lot of email on b'marking; one individual pointed out that
>the database community "scales" the size of the b'mark (i.e., size of dbase)
>to the size of machine.  An interesting idea.  I think we should consider
>taking some of the small integer b'marks and "enlarging" them by having the
>program call itself recursively in a non-trivial way.  Then, the test would
>consist of running the program at, say, 1 through 1000 levels of recursion,
>or whenever you run out of RAM.  Then, publish the performance numbers.
>Comments?  I am willing to volunteer to drive this if anyone (like, f'rinstance,
>someone who can code better than me) wants to help.

First, I am solidly behind the idea that the best benchmark is the user's
application.

That said, synthetic benchmarks might as well be as good as they can be.
So, some guidelines:
 - the code working set must be adjustable, without upper bound.
 - the data working set, likewise.
 - the compiler must be prevented from inlining.
 - the compiler must be prevented from eliminating dead code.
 - the benchmark must be small, so that it can be presented in full in
   reports. (This avoids the "slight change" problem, as well as permitting
   easy shipment.)

There is a fairly simple way to achieve these ends. Do not write a
benchmark program: write a program which writes out the benchmark program.
A simple loop in the Generator program allows the creation of arbitrarily
large source files. (Since compilers can get bent by this, the Generator
should also generate multiple source files.) The procedure names will be
somewhat unimaginative: f0001, f0002, and so on. If the source files are in
C, then it's fair to generate macros and macro calls, simply to reduce the
file space requirements.

Next, the Generator should write the code to fill an array with pointers
to these functions. Similarly, we need a data array.

Next, we need a portable routine which generates pseudo-random numbers.
(Portable mostly means that it avoids arithmetic overflow.) The quality of
the randomness is unimportant, as long as it doesn't get stuck at 0 or
other such silliness.  The generated program will use the randoms to form
subscripts, either into the data array, or into the function pointer array.
In this way, we may control the size of the working sets.

Since the functions should (largely) be accessed via the array, inlining is
defeated. Avoid dead code.

I have no comment concerning the contents of the routines: the Generator is
independent of this, and should be able to generate several benchmarks (for
instance, an integer one, and a float one).

Since the benchmarks must be told how "big" to be, the benchmark report
form should be written as part of the benchmark. This must specify how many
runs must be made, and with exactly what parameters. 
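
To make the idea concrete, here is a very stripped-down sketch of such
a Generator.  Everything about it (the placeholder function bodies, the
single output stream, the particular random-number method) is only
illustrative; a real one would emit multiple source files and the
report form described above:

/* gen.c -- writes a scalable benchmark source to stdout, e.g.
 *     gen 500 > bench.c
 * The code working set scales with the number of generated functions. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int n = (argc > 1) ? atoi(argv[1]) : 100;
    int i;

    printf("static unsigned long data[%d];\n", n);
    for (i = 0; i < n; i++)
        printf("unsigned long f%04d(unsigned long x) { return data[%d] += x ^ %d; }\n",
               i, i, i);

    printf("unsigned long (*ftab[%d])(unsigned long) = {\n", n);
    for (i = 0; i < n; i++)
        printf("    f%04d,\n", i);
    printf("};\n");

    /* the generated main() drives calls through the pointer table using a
     * simple portable pseudo-random generator (Park-Miller via Schrage's
     * method, which avoids overflow in 32-bit signed arithmetic) */
    printf("static long seed = 1;\n"
           "static long prng(void) {\n"
           "    long hi = seed / 127773, lo = seed %% 127773;\n"
           "    seed = 16807 * lo - 2836 * hi;\n"
           "    if (seed <= 0) seed += 2147483647;\n"
           "    return seed;\n"
           "}\n"
           "int main(void) {\n"
           "    unsigned long i, r = 0;\n"
           "    for (i = 0; i < 1000000; i++)\n"
           "        r ^= ftab[prng() %% %d](i);\n"
           "    return (int)(r & 1);\n"
           "}\n", n);
    return 0;
}
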
-- 
Don		lindsay@k.gp.cs.cmu.edu    CMU Computer Science

tim@crackle.amd.com (Tim Olson) (10/13/88)

In article <3285@pt.cs.cmu.edu> lindsay@k.gp.cs.cmu.edu (Donald Lindsay) writes:
| Next, the Generator should write the code to fill an array with pointers
| to these functions. Similarly, we need a data array.
| 
| Next, we need a portable routine which generates pseudo-random numbers.
| (Portable mostly means that it avoids arithmetic overflow.) The quality of
| the randomness is unimportant, as long as it doesn't get stuck at 0 or
| other such silliness.  The generated program will use the randoms to form
| subscripts, either into the data array, or into the function pointer array.
| In this way, we may control the size of the working sets.

I once received a benchmark program that was similar in nature to this. 
It used a very large number of functions in separate source files, and a
big switch statement that selected a function to call based upon a
random number.  I ran this benchmark, then I profiled it.  It turned out
that 30% of the runtime was in calculating the random number to use for
the function selection!
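
One way around that, if the point is to measure the calls rather than
the selector, is to draw all the random numbers before starting the
clock.  A hedged sketch (the stand-in functions and counts are
invented):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define NFUNC 4
#define CALLS 1000000L

/* stand-ins for the generated benchmark routines */
static long f0(long x) { return x + 1; }
static long f1(long x) { return x ^ 3; }
static long f2(long x) { return x * 5; }
static long f3(long x) { return x - 7; }
static long (*ftab[NFUNC])(long) = { f0, f1, f2, f3 };

int main(void)
{
    static int which[CALLS];          /* selections made before timing starts */
    long i, r = 0;
    clock_t t0, t1;

    for (i = 0; i < CALLS; i++)
        which[i] = rand() % NFUNC;    /* pay the random-number cost here */

    t0 = clock();
    for (i = 0; i < CALLS; i++)
        r ^= ftab[which[i]](i);       /* only the indirect calls are timed */
    t1 = clock();

    printf("%ld calls, %.2f sec, checksum %ld\n",
           CALLS, (t1 - t0) / (double)CLOCKS_PER_SEC, r);
    return 0;
}

Of course the precomputed index array then competes for the data cache,
which is exactly the kind of side effect mentioned earlier in this
thread, so it is a trade-off rather than a free fix.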

	-- Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)

jon@jim.odr.oz (Jon Wells) (10/13/88)

From article <4655@winchester.mips.COM>, by mash@mips.COM (John Mashey):
> In article <10498@reed.UUCP> mdr@reed.UUCP (Mike Rutenberg) writes:
> ...
>>But it is so hard to make it run and yet be "the same code."
> 
>>The problem is that to get good results with a given benchmark within a
>>given system, you often do have to tweak things to get comparable
>>numbers, often holding your breath that it all works out.
>> [ stuff deleted ...]
> 
> ONE MORE TIME:
> 	use large, real programs as benchmarks.
> 	do NOT use small programs as benchmarks
> 	be especially careful of small synthetic benchmarks
>[ stuff deleted... ]

Seems to me that there are two quite distinct classes of benchmarks
required (maybe three). There are at least three different levels
of information required....

A) The raw execution rate of a processor under ideal conditions.

B) How fast does the thing go in a particular system (memory config etc.).

C) How fast does the system, as a whole, go.

   The following comments do not apply to number crunchers which run
   under `ideal' conditions most of the time so A is perhaps the most
   important thing.

A and B could be solved either by simulation or benchmarking; they are
both *very* processor-specific things, and as such the same `code' can
and must be used for both. By the same code I mean the same
*instructions*: if you have to write it in assembler because your
compiler optimizes it out, then do it. What you're trying to find are the
performance limits of a particular processor/configuration; that is, A
gives you the upper limit of B, and B tells you how well you're doing.

Both these things are *only* of interest to the architects of processors
and systems, and C is the *only* thing that the people buying these
systems are interested in. I neither know nor care how many whetstones
this machine does, I do know that it takes a *very* long time to walk
the directory hierarchy, so long that I've never bothered waiting for
such a program to complete. 

C can only be found by running large complex things that approximate
the systems' end use. I can see no better benchmark than the software
that the system will run. Ken McDonell's unix benchmark suite,
MUSBUS, is one such example. The suite is floating around on various
servers but you'll need a complete and well debugged unix system
before you'll be able to use it.

The point is, that both types of benchmarks are useful and of interest,
just to different classes of people.

jon.

eugene@eos.UUCP (Eugene Miya) (10/13/88)

Oh no benchmarking wars again...... (sigh)

In article <6868@nsc.nsc.com> grenley@nsc.nsc.com.UUCP (George Grenley) writes:
>... (I heard once that a Fortran compiler was released
>which SPECIFICALLY checked the source to see if it was Whet, and if it was,
>stuck in a VERY fast routine).

I checked this story out (months ago).  Without mentioning specific
names within a VERY large computer company, I discovered it was an
APL compiler, not a Fortran compiler.  The benchmark was a simple
Gaussian sum (3 APL characters).  The benchmark adds 1 through n; the
compiler did what Gauss did: you know, n(n+1)/2.  It was placed there by
the compiler writer, who knew the person in the APL community who used
this as a benchmark.  Serves the benchmarker right.
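
The same trap is easy to set in C: a sufficiently clever optimizer can
turn the toy loop below into the closed form n*(n+1)/2, and then the
"benchmark" measures nothing.  A trivial, purely illustrative example:

#include <stdio.h>

/* sums 1..n the slow way; an aggressive compiler may simply emit
 * n*(n+1)/2, which is exactly what happened to the APL benchmark above */
static long gauss_sum(long n)
{
    long i, sum = 0;

    for (i = 1; i <= n; i++)
        sum += i;
    return sum;
}

int main(void)
{
    printf("%ld\n", gauss_sum(10000L));   /* 50005000 either way */
    return 0;
}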

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups.  If enough, I'll summarize."

eugene@eos.UUCP (Eugene Miya) (10/14/88)

In article <3285@pt.cs.cmu.edu> lindsay@k.gp.cs.cmu.edu (Donald Lindsay) writes:
>First, I am solidly behind the idea that the best benchmark is the user's
>application.

This is fine if you are Livermore and devote two Crays to running many
runs of a single program.  You have problems if you run a diversity of codes.

Big programs (sorry John, I can't agree completely) can be as deceptive as
small.  Big programs have more paths to test.  They're just as blind (but in
different ways) as small programs.  Now, using both, you might get more.

>There is a fairly simple way to achieve these ends. Do not write a
>benchmark program: write a program which writes out the benchmark program.
> ... much deleted

I am working on this (with a company).  How much are you willing to pay?
Naw, just kidding, our ideas are too crude for a product.  We are just
writing tools to help.

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups.  If enough, I'll summarize."

eugene@eos.UUCP (Eugene Miya) (10/15/88)

In article <6005@june.cs.washington.edu> pardo@cs.washington.edu (David Keppel) writes:
>rik@june.cs.washington.edu (Rik Littlefield) writes:
>>[ large "real" program benchmarks vs. synthetic benchmarks ]
>
>Oh, gee, an opportunity to apply the scientific method :-)
>
>(a) Benchmark a bunch of computer systems (hardware/os/compiler)
>    using synthetic benchmarks.
>(b) Compare the benchmark performance to observations in the
>    "real" world.
>(c) Learn something about benchmarks, refine your synthetic
>    benchmarks.
>(d) go to (a)   (Oh no, not a GOTO!)

I am sorry.

I don't see the scientific method in this.  I don't see a theory,
a hypothesis, a controlled experiment, nor even a control. 8-)
Actually, don't worry, I get this all the time from the other
"real" sciences myself.  I do see the beginnings of empirical work.
Better luck next time.

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups.  If enough, I'll summarize."

eugene@eos.UUCP (Eugene Miya) (10/18/88)

In article <5356@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>In article <6001@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield) writes:
>>Many postings in this stream seem to assume that "large, real" programs are
>>somehow the most fair to use for benchmarking.  That's not necessarily true.
>As we've said numerous times, the best benchmark for anybody is for them
>to run their own real applications, because such applications obviously
>have the highest correlation with what they'll see in real use.

>When I keep saying "use large, real programs", it's because I usually
>have in front of me numerous statistics about the behavior of programs
>that show that most of the toy benchmarks aren't very good predictors of
>the real applications, especially when applied to the higher-performance
>designs.  Why is this?
>	a) Toys don't stress cache designs,
>	b) Toys don't stress limits.
>	c) Toys don't stress software.

REAL programs have certain biases and shortcomings, but I am unable to
come up with an eloquent example.  The problem comes with "large, real."
Does large mean memory requirements?  (Crank the arrays bigger; ever see
an array proposed with 1 teraword of memory?  Read Cray Channels.)
Does large mean computationally complex?  Each of these is true to a degree.
Then there is the question of what constitutes "real," and I don't mean
in the metaphysical sense.  I call this "the tension of simplicity."
It affects all we do with measurement: portability, interpretability,
and how we run.

I believe we have to play with some toys before jumping into "real"
programs.  We have to find out what makes them "real." [Many have ideas,
but few are good.]  I think if it weren't for toys, we wouldn't have
things like the NeXT, the Mac, the Apple II.  Computers would be big boxes
behind glass windows, and we would be hung up on whose card deck would be
submitted next.

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups.  If enough, I'll summarize."
  Actually, I can think about one example at LLNL, but it's classified.

peter@stca77.stc.oz (Peter Jeremy) (10/19/88)

My comments in the following are very C orientated.  I realise this is not
very portable but, most of you will be familiar with C, most other languages
are (or could be) capable of doing the same thing and I am not familiar with
recent compiler capabilities in other languages.

In article <3285@pt.cs.cmu.edu> lindsay@k.gp.cs.cmu.edu (Donald Lindsay) writes:
>In article <6899@nsc.nsc.com> grenley@nsc.nsc.com.UUCP (George Grenley) writes:
>> [ Offers to write scalable synthetic benchmark, if no-one else wants to ]
>
>First, I am solidly behind the idea that the best benchmark is the user's
>application.

I think we can all take this as read.  Unfortunately in most cases it is
impractical.  Synthetic benchmarks are our best substitute, as long as we
know what we are doing (marketroid "benchmark" results are a glaring
example of not knowing what they are doing :-).

>That said, synthetic benchmarks might as well be as good as they can be.
>So, some guidelines:
> [ code and data working sets fully adjustable, small benchmark presented
> in full in the report ]

> - the compiler must be prevented from inlining.
I think this statement may need some more thought.  putc() is a 'function'
that has been 'inlined' since the beginning of C - it was implemented as a
macro because, until very recently, C compilers didn't allow function
inlining.  Inlining small functions makes sense because the function call
overhead (both size and time) is a significant portion of the size of the
function.

What is needed is a way to differentiate between the following classes of
functions:
	1) small library routines (eg strcpy, strcmp)
	2) large very general library routines (printf, scanf)
	3) other library routines
	4) small synthetic routines simulating small routines
	5) small synthetic routines simulating large routines
Some recent C compilers are capable of inlining functions in group 1, and
analysing parameters to functions in group 2 to possibly replace them with
less generalised (and smaller) library routines.  I see no reason to stop
the compiler doing this (although it is generally possible by compiler
switches or include file changes) because it will do the same to _all_ code
and a synthetic benchmark should be "typical" in this regard.

Small routines that are simulating large routines must not be inlined.  I
think this is what Donald was talking about.

Small routines that are simulating small routines are a grey area.  In a
typical large application that was written knowing that inlining functions
was an option, the author might choose to inline some functions.  Thus
inline functions could be used in application programs and a benchmark
should take this into account.

I believe that a benchmark should take into account the capabilities of the
software development environment since having a system that can execute
"good" code (eg hand-crafted assembler) blindingly fast is not much good if
the only compilers available generate atrocious code.  This means that the
benchmark should attempt to use all the compiler's capabilities, whilst
preventing the compiler from mangling those routines that are simulating
large blocks of application code. 

> - the compiler must be prevented from eliminating dead code.
Why?  If code is dead, it stays dead whether it is a synthetic benchmark or
an application.  What is needed is some way of differentiating between
compilers that are capable of detecting (and removing) dead code in a large
application, and those that are only capable of detecting dead code in "toy"
situations (ie synthetic benchmarks).

The problem with this requirement is that many (most?) compilers don't have
the switches to allow this - they always remove the dead code they find. 
And a compiler that does support this switch probably does a better job of
dead code detection.
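
For what it's worth, the usual defence in a synthetic benchmark is not a
compiler switch at all, but making sure the measured work is never dead in
the first place: fold the results into a value that is eventually printed
(or stored through a volatile pointer).  A minimal sketch, with an arbitrary
kernel:

    #include <stdio.h>

    int main(void)
    {
        unsigned long i, checksum = 0;

        for (i = 0; i < 1000000UL; i++)
            checksum += i ^ (i >> 3);           /* the "work" being timed */

        printf("checksum = %lu\n", checksum);   /* result is used, so the
                                                   loop is not dead code  */
        return 0;
    }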

> [ Write a program to generate the benchmark program.  Description of what the
> generator program should do mostly deleted. ]
>
>Next, we need a portable routine which generates pseudo-random numbers.
>(Portable mostly means that it avoids arithmetic overflow.) The quality of
>the randomness is unimportant, as long as it doesn't get stuck at 0 or
>other such silliness.
It needs to be sufficiently random that the OS/hardware memory management
and caching routines can't take advantage of the number or order of
references.
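
For the curious, something along the lines of the "minimal standard"
multiplicative congruential generator (Park and Miller, CACM, October 1988)
fits the bill: coded with Schrage's decomposition, no intermediate value
exceeds 2^31 - 1, so 32-bit signed arithmetic never overflows, and the
sequence is quite random enough to defeat cache or paging heuristics.
A sketch:

    #include <stdio.h>

    static long seed = 1;               /* anything in 1 .. 2147483646 */

    long prand(void)
    {
        long hi = seed / 127773L;       /* 127773 = (2^31 - 1) div 16807 */
        long lo = seed % 127773L;

        seed = 16807L * lo - 2836L * hi;    /* 2836 = (2^31 - 1) mod 16807 */
        if (seed <= 0)
            seed += 2147483647L;
        return seed;
    }

    int main(void)
    {
        int i;

        for (i = 0; i < 5; i++)
            printf("%ld\n", prand());
        return 0;
    }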

>Since the functions should (largely) be accessed via the array, inlining is
>defeated. Avoid dead code.

This automatically biases the result.  Whilst I don't have figures, I
suspect that very _few_ function calls in a typical application are
indirect.  Whilst this does prevent a compiler from using any global
optimization tricks it might know, it also provides an unfair advantage to
processors that can efficiently execute indirect function calls.
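
For reference, the arrangement being objected to looks something like this
(names purely illustrative): the synthetic routines are reached only through
a table of pointers, so the compiler cannot tell at compile time which
routine is called and therefore cannot inline or specialise it.

    #include <stdio.h>

    static long task_add(long x)  { return (x + 17) & 0xffff; }
    static long task_mul(long x)  { return (x * 3) & 0xffff; }
    static long task_xor(long x)  { return x ^ 0x55; }

    static long (*task_table[])(long) = { task_add, task_mul, task_xor };
    #define NTASKS  (sizeof task_table / sizeof task_table[0])

    int main(void)
    {
        long x = 1;
        unsigned i;

        for (i = 0; i < 1000; i++)
            x = (*task_table[i % NTASKS])(x);   /* indirect call each time */
        printf("%ld\n", x);                     /* keep the result live    */
        return 0;
    }

A benchmark made almost entirely of such calls rewards machines with cheap
indirect branches far more than typical applications would.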
-- 
Peter Jeremy (VK2PJ)         peter@stca77.stc.oz
Alcatel-STC Australia        ...!munnari!stca77.stc.oz!peter
41 Mandible St               peter%stca77.stc.oz@uunet.UU.NET
ALEXANDRIA  NSW  2015

bzs@xenna (Barry Shein) (10/19/88)

The short story on benchmarks is: Make a careful hypothesis and design
an experiment which will provide relevant data relating to that
hypothesis.

Hypothesis: This processor/memory combination is faster than that
one on simple integer operations.

Experiment: Design and run a small benchmark which will allow you
to time each.

Hypothesis: This system is faster at running large programs which
stress the virtual memory system.

Experiment: Design and run a benchmark which will allow you to time
each.
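
As a concrete (and deliberately tiny) illustration of the first pair, a
sketch using the standard clock() routine.  The kernel and iteration count
are arbitrary; the only requirement is that the experiment measures what the
hypothesis talks about (simple integer operations) and nothing else.

    #include <stdio.h>
    #include <time.h>

    int main(void)
    {
        long i, sum = 0;
        clock_t start, stop;

        start = clock();
        for (i = 0; i < 5000000L; i++)
            sum += i & 0xff;                    /* simple integer work */
        stop = clock();

        /* CLOCKS_PER_SEC is the ANSI name; older libraries call it CLK_TCK */
        printf("sum = %ld, cpu seconds = %.2f\n",
               sum, (double)(stop - start) / CLOCKS_PER_SEC);
        return 0;
    }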

Unfortunately one has to know how to relate one's hypothesis to the
benchmark design. Instrumentation and careful refinement of the
hypothesis (eg. define "stress the virtual memory system") helps.

Then there's the old rule of thumb that the speed of a computer system
is measured from the moment you get an idea in your head to the moment
you have the answer in your hands, any other measure is superfluous.

I knew a scientist who insisted his Vax730 was many times faster than
the huge campus IBM mainframe based on that rule. He could usually go
from conception to answer on the 730 in less time than it took
standing on line waiting for a user services person to explain what
IEH700104 meant. I think he was right.

	-Barry Shein, ||Encore||

eugene@eos.UUCP (Eugene Miya) (10/19/88)

Barry (bless his soul!) gave us hypothesis (null) and experiment in
benchmarking.  Now, a suggested exercise for the reader: where is the control?
Not the control as in flow control or control of flow, but experimental
control?  ((8-) for you, Barry!)

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,ncar,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups.  If enough, I'll summarize."

peter@stca77.stc.oz (Peter Jeremy) (10/20/88)

In article <1710@eos.UUCP> eugene@eos.UUCP (Eugene Miya) writes:
>In article <6868@nsc.nsc.com> grenley@nsc.nsc.com.UUCP (George Grenley) writes:
>>... (I heard once that a Fortran compiler was released
>>which SPECIFICALLY checked the source to see if it was Whet, and if it was,
>>stuck in a VERY fast routine).
>
>I checked this story out (months ago).  Without mentioning specific
>names within a VERY large computer company I discovered it was an
>APL compiler not a Fortran compiler.  The benchmark was a simple
>Gaussian sum (3 APL characters).  The benchmark adds 1 thru n, the compiler
>did what Gauss did: you know n(n+1)/2.  It was placed there by the compiler
>writer who knew the person in the APL community who did this as a
>benchmark.  Serves the benchmarker right.

Presumably the benchmark was +/{iota}n.  At least one APL _interpreter_
that I am aware of (IBM VSAPL) has an internal representation format designed
to efficiently handle arithmetic progression vectors.  All it stores is the
tag, number of elements, value of first element and increment. This makes
simple arithmetic and indexing into and by the array very efficient.  Whether
the extra code in the interpreter necessary to support this 'type' is
justified on typical applications, or whether it was just put in for sales
reasons, I don't know.
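
To sketch the representation in C terms (the struct and field names below
are only illustrative, not the documented VSAPL layout): an arithmetic
progression vector carries a type tag plus three numbers, and a reduction
like +/{iota}n collapses to closed form - Gauss again.

    #include <stdio.h>

    struct apv {
        int  tag;       /* marks this as an arithmetic progression vector */
        long count;     /* number of elements        */
        long first;     /* value of the first one    */
        long step;      /* increment between elements */
    };

    /* +/ on an APV: count*first + step*(0 + 1 + ... + (count-1)) */
    static long apv_sum(const struct apv *v)
    {
        return v->count * v->first
             + v->step * (v->count * (v->count - 1) / 2);
    }

    int main(void)
    {
        struct apv iota_n = { 0, 100, 1, 1 };   /* {iota}100 in origin 1 */
        printf("%ld\n", apv_sum(&iota_n));      /* prints 5050           */
        return 0;
    }

For {iota}n in origin 1 that is n + n(n-1)/2 = n(n+1)/2, which is exactly
the shortcut in the story above.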

Given that this is documented (in the manual on writing VSAPL Auxiliary
Processors), it is hardly a great secret.  For that matter +/{iota}n is
hardly a great benchmark.  I far prefer things like {domino}?100 100{rho}1E6 -
it might not be a good benchmark, but it's sure good for soaking up CPU time
(and sure beats trying to do it in any other language :-).

And whilst the rumour mill is running:  Rumour has it that at least one
PClown C compiler recognizes the Sieve of Eratosthenes Benchmark (so beloved
by BYTE magazine) and spits out special code, or at least the optimiser
was written with the Sieve in mind.
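
For anyone who has not seen it, the benchmark in question is tiny - which is
exactly what makes it so easy for an optimiser to recognise.  A sketch in
roughly the shape BYTE published (from memory, so details may differ; if
memory serves, the classic run reports 1899 primes):

    #include <stdio.h>

    #define SIZE 8190

    static char flags[SIZE + 1];

    int main(void)
    {
        int i, k, prime, count = 0, iter;

        for (iter = 0; iter < 10; iter++) {     /* ten passes, as BYTE did */
            count = 0;
            for (i = 0; i <= SIZE; i++)
                flags[i] = 1;
            for (i = 0; i <= SIZE; i++) {
                if (flags[i]) {
                    prime = i + i + 3;          /* flags[i] stands for 2i+3 */
                    for (k = i + prime; k <= SIZE; k += prime)
                        flags[k] = 0;           /* cross out multiples      */
                    count++;
                }
            }
        }
        printf("%d primes\n", count);
        return 0;
    }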
-- 
Peter Jeremy (VK2PJ)         peter@stca77.stc.oz
Alcatel-STC Australia        ...!munnari!stca77.stc.oz!peter
41 Mandible St               peter%stca77.stc.oz@uunet.UU.NET
ALEXANDRIA  NSW  2015

tbray@watsol.waterloo.edu (Tim Bray) (10/20/88)

In article <3913@encore.UUCP> bzs@xenna (Barry Shein) writes:
>Hypothesis: This processor/memory combination is faster than that
>one on simple integer operations.
>Hypothesis: This system is faster at running large programs which
>stress the virtual memory system.
There is an implicit claim here that one can divide programs into
equivalence classes with names such as 'simple integer operations' and
'virtual memory stressers'.  I don't believe that.  Benchmarks are like 
trying to count feathers with boxing gloves on, but they won't go away,
sigh...

>Then there's the old rule of thumb that the speed of a computer system
>is measured from the moment you get an idea in your head to the moment
>you have the answer in your hands, any other measure is superfluous.
Hear, hear.

Tim Bray, New Oxford English Dictionary Project, U of Waterloo

pardo@june.cs.washington.edu (David Keppel) (10/23/88)

peter@stca77.stc.oz (Peter Jeremy) writes:
>[ prevent/classify function inlining ]

Inlining a given function on a given machine may speed the computation,
while inlining the same function on another machine -- or even the same
machine, but with a different level of optimization -- may *slow* the
observed performance.

In particular, I can hypothesize machines/situations in which calling
getc() as a function is *faster* because the whole thing stays
in-cache and lets other code stay in-cache too.

In running a "real" system it is the effective speed that bothers me,
not fast hardware/slow software or slow hardware/fast software.
Conclusions:

* Ultimately, benchmarking is very hard.
* Even with a benchmark that tests a particular feature and a very
  good description of the workload, it may not be possible to
  extrapolate the combined performance from the individual
  performance.
-- 
		    pardo@cs.washington.edu
    {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo

eugene@eos.UUCP (Eugene Miya) (02/28/90)

In article <132232@sun.Eng.Sun.COM> lm@sun.UUCP (Larry McVoy) writes:
>>In article <3300102@m.cs.uiuc.edu> gillies@m.cs.uiuc.edu writes:
>> [doesn't like SPEC]
>
>In article <36438@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>I'm sad to hear that what we've done so far is "no better than Dhrystone",
>>because if that's true, a whole bunch of us have wasted, in toto, at
>>least several million $ to try to do something better....
>
>I, for one, think SPEC is great.
Oh well.  Too bad.
>On the other hand, SPEC is not the end all to beat all.  No benchmark
>is.  If I could design the ideal benchmark, I'd design something that
>had a bunch of knobs that I could turn, like an I/O knob, a CPU knob, a
>memory knob, etc.  I don't have this, so I run several different
>benchmarks that measure these sorts of things.  SPEC is one, Musbus is
>another, and we have several internal/proprietary benchmarks as well.
>Some people don't like you to quote one figure from one benchmark - I
>like to see all the figures from all the benchmarks.  The more data you
>have the easier it is to weed out the spikes.

Sorry, John, I tend to suspect SPEC spent a lot of money.

Larry is not talking about a single program.  This is something I
am working on in parts, when I get tiny bits of time.  And like most
research, 90% of it is failure.  I do not believe the future lies
in simply having more numbers.  More numbers can just be more confusing.
You want a number?  Try 42.  Douglas Adams published that.

The fundamental idea which separates people is whether or not you believe
the whole of a benchmark equals or exceeds the sum of its parts.
If you believe in "magic" - something beyond known optimizations, features,
etc. - that makes wholes greater than their parts, then you aren't being
scientific about the problem.  A person won't get anywhere that way; you
might as well posit little green men who only come on Tuesdays to explain
why your code runs fast.  I am not saying timings of parts should sum to
the whole code, but as you work at higher and higher conceptual views of
programs, you can factor these optimizations, etc. into performance.

Users simply concerned with pure speed will inevitably be disappointed.
I can point to analogies of performance measurement in other areas: the idea
of placing a VAX under a bell jar, gold-plating a code, etc.
That's all covered in an article entitled Foundations of Metrology, in an
NBS journal, which I read after visiting the NBS.  There are ways of doing
this, but just like the platinum bar, there are limits to its usefulness:
hence why we use other measuring tools, why we refine atomic clocks, etc.
Until we are willing to do that with computers, benchmarking won't get far.

I don't get any warm fuzzy feeling from the Nelson, the Loops, Dongarra, etc.
Sure, there's a bit of truth in them, but you have to be willing to consider
surrogates.
We want to run (with benchmarks), but we have to crawl before walking
and playing.  We are going to need a progression of research.
But most of you don't have the time or inclination to listen, so I
will go back to my hacking.

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:

  "You trust the `reply' command with all those different mailers out there?"
  "If my mail does not reach you, please accept my apology."
  {ncar,decwrl,hplabs,uunet}!ames!eugene
  Do you expect anything BUT generalizations on the net?
  [If it ain't source, it ain't software -- D. Tweten]

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (03/01/90)

In article <6336@eos.UUCP> eugene@eos.UUCP (Eugene Miya) writes:
>In article <132232@sun.Eng.Sun.COM> lm@sun.UUCP (Larry McVoy) writes:
>>>In article <3300102@m.cs.uiuc.edu> gillies@m.cs.uiuc.edu writes:
>>In article <36438@mips.mips.COM> mash@mips.COM (John Mashey) writes:
:
>>I, for one, think SPEC is great.
>Oh well.  Too bad.
:
>Sorry, John, I tend to suspect SPEC spent a lot of money.

I, for another, like SPEC reasonably well.  It does a good job of balancing
integer and floating point requirements.  It matches reasonably well what
some vendors use as a definition of "MIPS".  Overall, I think it does a
good job.

>You want number?  Try 42.  Douglas Adams published that.
:
>The fundamental idea which separates people is whether or not you believe
>the whole a of benchmark equals or exceeds the sum of its parts.

Of course, every techie realizes that benchmark numbers are just numbers.
I can give you a list of 100 things that no existing benchmark measures well.
But, against marketing droids, it is a reasonable first line of defense.

>Users simply concerned with pure speed will inevitably be disappointed.

Those looking for a single number to characterize speed will inevitably be
disappointed.  Those looking for a number to demonstrate that certain kinds
of programs will not experience bottlenecks may be much better served.  The
purpose of rating systems with numbers like SPECMARK is not to say,
"My system is better than yours, because it is 17.6 SPECMARKS and yours is
only 16.9."  The purpose is to eliminate systems from the solution space because
they won't be fast enough.  Marketing types will misuse it all the same, but
so what?  They have freedom of speech too.

>I don't get any warm fuzzy feeling from the Nelson, the Loops, Dongarra, etc.

Most people don't get warm fuzzies from benchmark programs.  But, they can
narrow your solution space if used correctly.  You don't have to consider
systems which are too slow for your job.  You still need to apply
other measures to make sure that the system meets all your requirements.

The main problem that I see with using benchmark programs is that some
*marketING* driven companies have a tendency to neglect things which aren't
being measured.  For example, context switching speed.  I see some evidence
that after initial euphoria over faster CPU speeds on micros, people are 
beginning to go back to fundamentals, and building more balanced systems.
Of course, *MARKET* driven companies have been doing it all along. 

>  [If it ain't source, it ain't software -- D. Tweten]

Agreed.


  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)604-6117