[comp.arch] Towards A Meaningful Performance Measure

crowl@rochester.UUCP (11/02/87)

In article <864@tut.cis.ohio-state.edu> manson@tut.cis.ohio-state.edu
(Bob Manson) writes:
>Alright, I'm a little pissed over the recent posting of these supposed
>"performance figures". Exactly what are these things supposed to mean?  Well,
>they are compiled programs on different systems that are run and supposed to
>represent the speed of the various processors, MIPS etc.  We all know that
>MIPS is mostly a meaningless figure (Okay, so there's some debate there, but I
>see no meaning to how many instructions per second a processor runs-"The
>COPYMEM instruction blockmoves the entire memory to disk space, but takes
>1,000,000 usec to execute, with an effective MIPS of 1/1000000").

This is partly a terminology problem.  John Mashey's performance brief expresses
mips relative to a VAX 11/780.  (Note that the 780 executes about 500,000
instructions per second; its original one-mips figure came from early
comparisons with IBM machines claiming one mips.)  I suggest we stop calling
performance "mips" and start being more specific about what we really mean.
I suggest the term "Vax Relative Performance".

Unfortunately, that is not enough.  We must define what configuration of Vax
we use as the baseline.  I suggest an 11/780 with full memory and a floating
point accelerator.  CPU-oriented benchmarks should run completely in physical
memory.

The compiler and operating system also affect performance.  To make the base
machine highly available, both should be common.  I suggest Unix BSD 4.2 as
the base operating system and the portable C compiler as the base C language
compiler.  This allows realistic Unix/C benchmarks like grep, nroff, etc.  Note
that such benchmarks must have the same source.  Putting a better compiler on
the Vax will increase its relative performance, so DEC can honestly sell a 780
as having a Vax Relative Performance greater than one.

Of course, 780's are becoming scarce.  We may have to pick another machine just
to keep the base machine readily available.  Suggestions?
-- 
  Lawrence Crowl		716-275-9499	University of Rochester
		      crowl@cs.rochester.edu	Computer Science Department
...!{allegra,decvax,rutgers}!rochester!crowl	Rochester, New York,  14627

hansen@mips.UUCP (Craig Hansen) (11/03/87)

In article <3806@sol.ARPA>, crowl@cs.rochester.edu (Lawrence Crowl) writes:
> The compiler and operating system also affect performance.  To make the base
> machine highly available, both should be common.  I suggest Unix BSD 4.2 as
> the base operating system and the portable C compiler as the base C language
> compiler.

BSD 4.3 has been out long enough to be "common," and has significant
improvements in the performance of compiled Fortran code, and some (more
modest) improvements in C code. VMS has better compilers, both C and Fortran,
in terms of how the resulting code performs. The availability of a common
machine is important, however. MIPS has both 4.3 and VMS 780's, but no longer
keeps a 4.2 around; it's been obsoleted by 4.3.

> Of course, 780's are becoming scarce.  We may have to pick another machine just
> to keep the base machine readily available.  Suggestions?

Digital Review uses the roughly equivalent MicroVAX as the base machine in its
comparisons.

Ultimately, the Vax Relative Performance measure will never be much more
precise than defining the inch as "three round and dry barleycorns laid end to
end." In fact, it's more like a cubit: it depends on what the leading machine
of the day is (or was). The 780 VRP at best provides rough comparisons of
machine performance, and by all rights ought to be a moving target. Since we
(at MIPS) clearly make an effort to use the best compiler system we can
muster when benchmarking our machines, we consider it only fair to use the
best compiler system that can be made available for the VAX 780. We are trying
to be as conservative as possible in our performance claims in an environment
where inflated and misleading claims are the order of the day, but at times it
seems to work against us....

-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen

alan@pdn.UUCP (Alan Lovejoy) (11/03/87)

In article <3806@sol.ARPA> crowl@cs.rochester.edu (Lawrence Crowl) writes:
>[Re: "mips", let's] start being more specific about what we really mean.
>I suggest the term "Vax Relative Performance".
>
>Unfortunately, that is not enough.  We must define what configuration of Vax
>we use as the baseline.  I suggest an 11/780 with full memory and a floating
>point accellerator.  CPU oriented benchmarks should run completely in physical
>memory.
>
>The compiler and operating system also affect performance.  To make the base
>machine highly available, both should be common.  I suggest Unix BSD 4.2 as
>the base operating system and the portable C compiler as the base C language
>compiler.  This allows realistic Unix/C benchmarks like grep, nroff, etc.
>
>Of course, 780's are becoming scarce.  We may have to pick another machine just
>to keep the base machine readily available.  Suggestions?

The installed base of 11/780's is puny by comparison to the popular
micros.  Why not choose one of them?  I suggest the IBM PS/2 Model 60 as
the baseline.  Most systems will be faster, so that the relative
performance ratios will mostly be above 1.  Most benchmarkers will be
able to get access to such a machine.  Most readers will have some idea
of the real performance of the baseline machine.  And the machine is
likely to be around for a long time, so that familiarity with it will
not be short-lived.  And even after its demise, people will remember it.

I strongly vote against PCC as the compiler.  It is known to vary
considerably in code-generation quality across architectures.
Not good.  It's better if people use the best compiler available for
the system, whatever that is.

As for the operating system, I suggest an average based on standalone
operation, UNIX V, and the most widely used OS on that system, if that is not
UNIX V.  UNIX V is clearly headed for standard status.  Leave UNIX out
if it's not available.  I also strongly urge against UNIX utilities as
benchmarks:  not everyone uses or can use UNIX (a sacrilege, I know).

--alan@pdn

P.S. I'm a Mac enthusiast and despise the Intel architecture, but that
only makes the idea of having the PS/2 m60 as a baseline even *more*
attractive since it would create an atmosphere where everyone would
market their systems as being x times faster than Big Blue's pride and
joy.

crowl@cs.rochester.edu (Lawrence Crowl) (11/04/87)

In article <1714@pdn.UUCP> alan@pdn.UUCP (0000-Alan Lovejoy) writes:
>I strongly vote against PCC as the compiler.  It is known to vary considerably
>in code-generation quality on different architectures.  Not good.  It's better
>if people use the best compiler available for the system, whatever that is.

My intent was that the base architecture/operating system/compiler be constant.
This means that some readily available compiler must be part of the base
system.  I wanted to avoid cases where people benchmark relative to two
different compilers.  I do want people to use the compiler appropriate to
their machine.

>As for operating system, I suggest an average based on standalone, UNIX V and
>the most widely used OS on that system, if that is not UNIX V.  UNIX V is
>clearly headed for standard status.  Leave UNIX out if it's not available.  I
>also strongly urge against UNIX utilities as benchmarks:  not everyone uses or
>can use UNIX (a sacrilege, I know).

I do not understand your reasoning for averaging UNIX and something else.  Why
not just provide two numbers, one for each operating system on that machine?
I suspect that the OS will have little effect on CPU-bound jobs (relative to
the compiler, anyway).  The reason for choosing UNIX utilities is that those
people interested in a UNIX box then have more realistic measures of
performance.
-- 
  Lawrence Crowl		716-275-9499	University of Rochester
		      crowl@cs.rochester.edu	Computer Science Department
...!{allegra,decvax,rutgers}!rochester!crowl	Rochester, New York,  14627

hansen@mips.UUCP (Craig Hansen) (11/05/87)

In article <3907@sol.ARPA>, crowl@cs.rochester.edu (Lawrence Crowl) writes:
> My intent was that the base architecture/operating system/compiler be constant.
> This means that some readily available compiler be part of the base system.  I
> wanted to avoid cases where people benchmark relative to two different
> compilers.  I want people to use the compiler appropriate to their machine.

Unfortunately, the base architecture/operating system/compiler, from a
competitive point of view, isn't constant. Each prospective customer has a
set of machines and applications that are relevant to them. VAX (TM DEC) VMS
(TM DEC) users generally turn up their noses at VAX BSD 4.2-relative
measurements, because they know that the BSD compilers are inferior. The VAX
780 is fast becoming an irrelevant machine, because faster/cheaper VAXes are
available. UNIX V (TM ATT) comes with a file system that's only good for
coffee breaks and lunch hours, so file-system performance relative to an
off-the-shelf sys-V machine isn't meaningful... you get the idea.

When you're talking about (decimal) orders of magnitude of performance above
a VAX 780, the two machines being compared aren't in the same regime
anymore. (Remember that the VAX 780 is approaching its tenth anniversary, for
goshsakes.) The real problem is that the characteristics of the architecture
are going to be different for a 10-20-50 MIPS machine than they are for a VAX
780, in terms of cache and memory system design, pipelining, compiler
optimizations, register windows, etc., so the selection of programs run
on the machine heavily influences the performance ratio. We've all seen
how much Dhrystone overestimates the performance of "10 MIPS" RISC machines,
compared to larger, more realistic workloads. The only good thing I'll say
about the VAX 780 is that its scalar floating point performance (about a
MegaDoubleWhetstone) is fairly well balanced against its scalar integer
performance (about 1 MIPS).  Thus, if you build a machine in which FP
applications and integer applications both achieve about the same V.R.P.,
you'll have a reasonably well-balanced machine.

As to standardizing on a single compiler/OS, remember that the company
producing the base architecture has an interest in making its machines
look competitive. When competitors say their machine is 10X a VAX 780, while
using trussed-up benchmarks and a markedly inferior compiler/OS on the VAX,
DEC should by all rights be screaming bloody murder. Should DEC claim that
their VAX 780 is 1.5 times faster than their VAX 780?

-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen

reiter@endor.harvard.edu (Ehud Reiter) (11/05/87)

In article <881@mips.UUCP> hansen@mips.UUCP (Craig Hansen) writes:
>As to standardizing on a single compiler/OS, remember that the company
>producing the base architecture has an interest in making their machines
>look competitive. When competitors say their machine is 10X a VAX 780, when
>using trussed up benchmarks and a markedly inferior compiler/OS on the VAX,
>DEC should by all rights be screaming bloody murder. Should DEC claim that
>their VAX 780 is 1.5 times faster than their VAX 780?

At one time, I was convinced that there was a well-defined "inflation"
pattern in MIPS as you went from big computer company to little computer
company.  So, for example, an IBM "1 MIPS" machine would have the same
"performance" as a DEC "2 MIPS" machine, which had the same performance as
a SUN "4 MIPS" machine, which had the same performance as an "8 MIPS" machine
from brand X start-up computer company.  Each company in the hierarchy would
ignore its smaller competitors (as being beneath its dignity to comment on),
and proudly claim that it had a large price/performance advantage over its
larger competitors.

Today, as the computer industry finally starts being somewhat competitive
(as opposed to being a monopoly by you know who), I think the big companies
(IBM, DEC) are being forced to be a bit more sophisticated in their
marketing, and stress things customers really care about, like software,
reliability, peripherals, and performance in specific applications (usually
floating-point or I/O intensive applications, not "integer crunching" ones).
It's mainly the little companies (and university people) who still go around
bragging about magic numbers which they pull from thin air and call "MIPS".

The problem with MIPS is that it attempts to measure "integer crunching"
performance, and

	(a)   It is impossible to summarize "integer crunching" performance
in one number.
	(b)   In any case, not many customers care about integer crunching
performance.


So, I think that attempts to give good definitions for "MIPS" will remain
fairly academic exercises, of little relevance to the "real world" of computing.

					Ehud Reiter
					reiter@harvard	(ARPA,BITNET,UUCP)
					reiter@harvard.harvard.EDU  (new ARPA)

crowl@cs.rochester.edu (Lawrence Crowl) (11/05/87)

In article <881@mips.UUCP> hansen@mips.UUCP (Craig Hansen) writes:
>When competitors say their machine is 10X a VAX 780, when using trussed up
>benchmarks and a markedly inferior compiler/OS on the VAX, DEC should by all
>rights be screaming bloody murder. Should DEC claim that their VAX 780 is 1.5
>times faster than their VAX 780?

DEC can and should claim that their 780/VMS/vcc configuration is 1.5 times the
(hypothetical) standard 780/4.2/pcc configuration.  DEC may also pick the
benchmarks that show their machine in its best light and publish those figures.
Remember that a Vax Relative Performance figure must be taken in the context
of the benchmarks used to measure the performance.  If someone does not feel
the benchmark set is fair to their machine, they are free to use another.
-- 
  Lawrence Crowl		716-275-9499	University of Rochester
		      crowl@cs.rochester.edu	Computer Science Department
...!{allegra,decvax,rutgers}!rochester!crowl	Rochester, New York,  14627

davidsen@steinmetz.steinmetz.UUCP (William E. Davidsen Jr) (11/05/87)

In article <3806@sol.ARPA> crowl@cs.rochester.edu (Lawrence Crowl) writes:
[ ... ]
|Unfortunately, that is not enough.  We must define what configuration of Vax
|we use as the baseline.  I suggest an 11/780 with full memory and a floating
|point accellerator.  CPU oriented benchmarks should run completely in physical
|memory.
|
|The compiler and operating system also affect performance.  To make the base
|machine highly available, both should be common.  I suggest Unix BSD 4.2 as
|the base operating system and the portable C compiler as the base C language
|compiler.  This allows realistic Unix/C benchmarks like grep, nroff, etc.  Note
|that such benchmarks must have the same source.  Putting a better compiler on
|the Vax will increase its relative performance, so DEC can honestly sell a 780
|as having a Vax Relative Performance greater than one.

I think this depends on what you want to test: if you want to run 4.3BSD,
it's a good way to test; if you want to know how fast your C and FORTRAN
programs will run, you should use the best compilers, etc.  I suspect the
latter is what most people want to know.

Programs which measure the raw speed of the hardware will give results
which often don't match the high level language results. This doesn't
imply that either is wrong, but that you have to know what you want to
measure.

I have a benchmark suite which I use for UNIX (about 70 machines so
far), and I run with the default compiler and whatever you get with "-O"
as an option. I may repeat with other optimization options if
available, and often see a major change in performance, not always for
the better.

Among other things, I measure the highest scalar speed available from C
for short, long, float, and double. I measure the speed of transcendental
functions and the time to do a compare and branch for integer and float.
I do a Turing machine simulation, and Gray-to-binary and binary-to-Gray
conversions. If I have the machine to myself I run multitasking benchmarks,
and if it has vector capability I test that also. I look at compile speed
as well, and disk performance (to see what I gain by using a "better"
drive).

What this suite gives me is a profile of the machine's capabilities.
There is no one number I can find which is a meaningful index of
performance; if pressed, I use the real time to run the entire test,
relative to a VAX 11/780. This is as meaningful as any other single number.

I suggest that benchmarks using the "standard" are only valid if you are
testing raw machine performance rather than the typical time it takes to
do things using the best tools on a system. Even then, PCC code quality
varies from machine to machine.
-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

davidsen@steinmetz.steinmetz.UUCP (William E. Davidsen Jr) (11/05/87)

In article <1714@pdn.UUCP> alan@pdn.UUCP (0000-Alan Lovejoy) writes:
[ ... ]
|The installed base of 11/780's is puny by comparison to the popular
|micros.  Why not choose one of them?  I suggest the IBM PS/2 Model 60 as
|the baseline.  Most systems will be faster, so that the relative
|performance ratios will mostly be above 1.  Most benchmarkers will be
|able to get access to such a machine.  Most readers will have some idea
|of the real performance of the baseline machine.  And the machine is
|likely to be around for a long time, so that familiarity with it will
|not be short-lived.  And even after its demise, people will remember it.

I like your idea. However, the Model 60 is pretty low as a starting point,
and very limited. Perhaps a Model 80 would be better, since it represents
something closer to current practice, perhaps running AIX or Xenix/386, since
the 32-bit performance is about double the 16-bit performance (as I measured
it, using Xenix/[23]86). This also allows benchmarks on paging, which the 60
doesn't support.

|As for operating system, I suggest an average based on standalone, 
|UNIX V and the most widely used OS on that system, if that is not
|UNIX V.  UNIX V is clearly headed for standard status.  Leave UNIX out 
|if it's not available.  I also strongly urge against UNIX utilities as 
|benchmarks:  not everyone uses or can use UNIX (a sacrilege, I know).

I guess I would rather see the most common UNIX version than specify
something which raises strong feelings. For some machines, such as the
RT/PC or the VAX, I would like to see all UNIX versions. I would
rather see both sets of numbers, since I will probably have to choose
one or the other.
-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

mash@mips.UUCP (John Mashey) (11/06/87)

In article <3113@husc6.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes:
>So, I think that attempts to give good definitions for "MIPS" will remain
>fairly academic exercises, of little relevance to the "real world" of computing.

Unfortunately, one can do the whole song-and-dance about
	a) No one number is enough
	b) Benchmark the real applications
	c) Individual machines vary a lot
and somebody who hears that, and intellectually accepts it, will STILL
then ask you: "But how many mips is it?"  In particular, the trade press
ends up doing this by default, as they end up showing mips-ratings,
even in the same issues that contain good explanations of why this is only
a gross approximation.  Telling an accurate story takes a tremendous amount
of work, and is nontrivial to understand; that's why people want mips-ratings.
Sigh.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

davidsen@steinmetz.steinmetz.UUCP (William E. Davidsen Jr) (11/06/87)

In article <3113@husc6.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes:
|The problem with MIPS is that it attempts to measure "integer crunching"
|performance, and
|
|	(a)   It is impossible to summarize "integer crunching" performance
|in one number.
|	(b)   In any case, not many customers care about integer crunching
|performance.

Here I don't feel that you are correct... machine usage seems to fall
into two categories of user, the "number crunchers" who need f.p.
performance, and the rest of the software development, word processing,
E-mail, record keeping world. The speed of integer arithmetic is *very*
important to most groups.

One of the things I have noted in my own benchmarking is that the one
thing which best predicts overall performance is the integer test-and-branch
time (usual disclaimers about no one number), and that machines which do
well in "transient response" are more pleasant to use.  Transient response
is the time to do little things, like ls, cat, etc.  The RT/PC scored very
well there, and even though it was not competitive with a Sun overall, it
was more pleasant to use.

<<<< all my own opinions >>>>
-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

reiter@endor.harvard.edu (Ehud Reiter) (11/06/87)

In article <884@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>Unfortunately, one can do the whole song-and-dance about
>	a) No one number is enough
>	b) Benchmark the real applications
>	c) Individual machines vary a lot
>and somebody who hears that, and intellectually accepts that, will STILL
>then ask you: "But how many mips is it?"  In particular, the trade press
>ends up doing this by default, as they end up showing mips-ratings,
>even in the same issues that contain good explanations of why this is only
>a gross approximation.

There's no question that the trade press is in love with MIPS, and that people
who should know better still ask about MIPS.  The point, though, is that since
it is impossible to define a single figure that measures performance, we, the
(enlightened?) readers of comp.arch, should not waste our time trying to do so.

We should also realize that MIPS has one great advantage over newer and
more "scientific-sounding" performance measures: since most people do realize
that there is something funny about MIPS, they take MIPS figures with a large
grain of salt.  I have a feeling that this is less true of newer and more
"scientifically defined" benchmarks like Dhrystone, which many people take
much more seriously than MIPS, even though Dhrystone suffers from the same
fundamental problem as MIPS, that single figures are meaningless (not to
mention Dhrystone's numerous technical difficulties, which have been
discussed at length on comp.arch).

					Ehud Reiter
					reiter@harvard	(ARPA,BITNET,UUCP)
					reiter@harvard.harvard.EDU  (new ARPA)

jss@hector.UUCP (Jerry Schwarz) (11/06/87)

In article <3113@husc6.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes:
>
>Today, as the computer industry finally starts being somewhat competitive
>(as opposed to being a monopoly by you know who), I think the big companies
>(IBM, DEC) are being forced to be a bit more sophisticated in their
>marketing, and stress things customers really care about, like software,
>reliability, peripherals, and performance in specific applications (usually
>floating-point or I/O intensive applications, not "integer crunching" ones).

Maybe I'm wrong, but it has been my impression that IBM has always
(at least for 35 years) had very sophisticated marketing strategies,
and has usually (again over the course of 35 years) stressed things
other than raw speed.

Jerry Schwarz

henry@utzoo.UUCP (Henry Spencer) (11/08/87)

> ... UNIX V (TM ATT) comes with a file-system that's only good for
> coffee breaks and lunch hours...

Whereas Berklix comes with a file-system that's only good for producing big
numbers on *single-user* benchmarks... :-)
-- 
Those who do not understand Unix are |  Henry Spencer @ U of Toronto Zoology
condemned to reinvent it, poorly.    | {allegra,ihnp4,decvax,utai}!utzoo!henry

reiter@endor.harvard.edu (Ehud Reiter) (11/10/87)

In article <7786@steinmetz.steinmetz.UUCP> davidsen@crdos1.UUCP (bill davidsen) writes:
>In article <3113@husc6.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes:
>|The problem with MIPS is that it attempts to measure "integer crunching"
>|performance, and  ...
>|	(b)   In any case, not many customers care about integer crunching
>|performance.
>
>Here I don't feel that you are correct... machine usage seems to fall
>into two categories of user, the "number crunchers" who need f.p.
>performance, and the rest of the software development, word processing,
>E-mail, record keeping world. The speed of integer arithmetic is *very*
>important to most groups.

My own impression has been that people doing the above tasks care more
about I/O performance (speed of terminals, disks, etc) than CPU performance.
The main exception to this is that people want the OS to quickly perform the
bookkeeping overhead associated with doing I/O (many systems are more
limited by the speed at which the OS can do the bookkeeping than the actual
speed at which the I/O devices perform).

However, since different machines have vastly different OS's, MIPS ratings
(or any hardware-only performance measure) give very little insight into
how quickly machines can do I/O bookkeeping.  Machine X could perform integer
adds at half the speed of machine Y, but still maintain an I/O throughput
rate ten times as high as machine Y's, simply because X had an OS which was
much better suited to the application.

So, what I'm saying is that what I believe most people care about is not raw
hardware integer compute speed, but the speed at which the OS can perform
its chores, and there is not necessarily much correlation between the two
numbers.

					Ehud Reiter
					reiter@harvard	(ARPA,BITNET,UUCP)
					reiter@harvard.harvard.EDU  (new ARPA)

radford@calgary.UUCP (Radford Neal) (11/11/87)

In article <881@mips.UUCP>, hansen@mips.UUCP (Craig Hansen) writes:

> As to standardizing on a single compiler/OS, remember that the company
> producing the base architecture has an interest in making their machines
> look competitive. When competitors say their machine is 10X a VAX 780, when
> using trussed up benchmarks and a markedly inferior compiler/OS on the VAX,
> DEC should by all rights be screaming bloody murder. Should DEC claim that
> their VAX 780 is 1.5 times faster than their VAX 780?

This looks like a good argument *for* standardizing on an obsolete
machine. You've *got* to standardize all aspects of the base system;
otherwise numbers today aren't comparable with numbers last year. An
obsolete machine's software won't be improving much, so this won't mislead
people. And the obsolete machine's manufacturer won't care what
people think of its performance any more.

So the ideal benchmark standard is an obsolete machine (with a static
operating system and compiler) that is nevertheless still common
and likely to remain so. The Vax/780 and the IBM PC are about as
close as one is likely to get, but the IBM PC has a "typical" architecture
only if you restrict it to "small model" programs, which is not
tolerable. Thus the current use of a Vax/780 seems as good as we
can hope for, though standardizing on some old operating system (e.g. 4.2)
is likely to be a problem (who wants to keep it around just for
benchmarking?).

    Radford Neal
    The University of Calgary