[comp.benchmarks] bc benchmark sigh

mash@mips.COM (John Mashey) (12/26/90)

Having finally caught up with the net after a long trip, I'm sad to see
that 1 out of 3 postings in this newsgroup concern the "bc" benchmark
or some variety thereof.  I had higher hopes for this, especially as
at least some people have read previous discussions in comp.arch.
This %#@!$% thing is like a vampire: every time you think you've finally
put a stake thru its heart, it returns one more time.

1. Small benchmarks are very prone to misinterpretation and compiler
gimmickry, and seldom exercise modern machines very well.
About their only even-slightly-rational use is to compare machines
with the same chips running at different clock rates.
Small, synthetic benchmarks can easily over- or under-emphasize language and/or
machine features out of all proportion to mixtures found in more realistic
benchmarks.

As a matter of faith, I consider small benchmarks guilty until proven
innocent, i.e., if you can prove their results correlate well, across
product lines, with much more substantial real programs, then maybe
you have something (and in fact, this is a good thing to have;
for instance, I've often thought of offering a small prize for anyone
who can create a small program that predicts performance on the 10
SPEC benchmarks across machine lines, but I haven't figured out
how to describe this well enough to figure out if someone has achieved it.)


2. Filling the net with timings for a benchmark where no one even explains
what code is being executed, how big it is, whether or not it correlates
with ANYTHING, etc, etc, is like trying to predict the speed of automobiles
by ripping out their steering wheels, and seeing how fast they roll.

3.  NOW, here are SOME FACTS about this benchmark:
	1) It is tiny:
		99.57% of the instruction cycles (on a MIPS machine)
		are accounted for by 10 LINES OF CODE
		71% of the cycles are consumed in 3 LINES OF CODE
		In addition, unlike matrix kernels, whose code is small,
		but whose data references are big, this doesn't even
		have that property: all the code & data fit in tiny caches.
	2) Its instruction usage bears little resemblance to much of
	anything: see Hennessy & Patterson for typical characteristics
	of code.  In particular, this code almost never makes function calls,
	and (on a MIPS machine, which HAS integer multiply and divide)
	spends 50% of the total cycles doing integer multiply and divide.
	I assure you, this is typical of very few programs; this is NOT
	the kind of statistics that any computer architect I know designs
	machines around, etc, etc.  (Of course, I should love this benchmark,
	as it REALLY hurts machines with no integer multiply.)
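	For context, dc's mult is a schoolbook bignum multiply, and the shape
	of that inner loop is why integer multiply/divide dominates.  The
	following is NOT the actual dc.c source, just a sketch of the
	technique (base-100 digits here, least significant first):

```c
/* Schoolbook multiply of base-100 bignums, digits stored least-
 * significant first.  Not the real dc.c code, only the shape of it:
 * each inner-loop trip costs one integer multiply plus two integer
 * divides (the % and /) for the carry, which matches the mult/div
 * interlock numbers in the pixstats output. */
void bignum_mult(const int *a, int na, const int *b, int nb, int *out)
{
    for (int i = 0; i < na + nb; i++)
        out[i] = 0;
    for (int i = 0; i < na; i++) {
        int carry = 0;
        for (int j = 0; j < nb; j++) {
            int t = out[i + j] + a[i] * b[j] + carry;
            out[i + j] = t % 100;       /* low digit: integer divide */
            carry = t / 100;            /* carry: another divide */
        }
        out[i + nb] = carry;
    }
}
```

	Almost no other work appears in the loop, so on a machine with slow
	(or no) integer multiply and divide, this one kernel is the whole
	benchmark.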

	At the end of this posting are the slices of prof & pixstats output.

4. PLEASE STOP WASTING TIME WITH THIS BENCHMARK
	(Please, let this be the last stake in its heart :-)

5. ABOUT THE ONLY USEFUL THING I CAN THINK OF TO DO WITH THIS is for somebody
to run this benchmark on many of the machines for which SPEC integer benchmarks
exist, plot the two together, and compute a correlation for them;
or even, pick any one of the SPEC integer benchmarks and do it for that.
(Or pick some other realistic integer benchmark for which well-controlled
results exist.)

----------
Profile listing generated Tue Dec 25 13:42:35 1990 with:
   prof -pixie dc 

*  -p[rocedures] using basic-block counts;                                 *
*  sorted in descending order by the number of cycles executed in each     *
*  procedure; unexecuted procedures are excluded                           *

84303520 cycles

    cycles %cycles  cum %     cycles  bytes procedure (file)
                               /call  /line

  84058950   99.71  99.71    1827369     36 mult (dc.c)
    132423    0.16  99.87       4905     37 div (dc.c)
     31153    0.04  99.90        538     21 nalloc (dc.c)
.....

OH GOOD: it spends 99.7% of its time in one function...
IN FACT, going to the next level of detail, where we see the number of
cycles spent in the statements that consumed the time, we discover
that 83.7% of the instruction cycles are spent IN JUST 4 LINES OF C....:


*  -h[eavy] using basic-block counts;                                      *
*  sorted in descending order by the number of cycles executed in each     *
*  line; unexecuted lines are excluded                                     *

procedure (file)                           line bytes     cycles      %  cum %
mult (dc.c)                                1097   100   22754044  26.99  26.99
mult (dc.c)                                1094    96   20317562  24.10  51.09
mult (dc.c)                                1093    68   16755620  19.88  70.97
mult (dc.c)                                1095    36   10771470  12.78  83.74
mult (dc.c)                                1098    40    8383670   9.94  93.69
mult (dc.c)                                1096    16    4787320   5.68  99.37
mult (dc.c)                                1084    80      83600   0.10  99.47
mult (dc.c)                                1102    96      45066   0.05  99.52
mult (dc.c)                                1087    68      41076   0.05  99.57
nalloc (dc.c)                              1974    36      29529   0.04  99.60
div (dc.c)                                  665   144      24070   0.03  99.63
mult (dc.c)                                1101    96      23606   0.03  99.66
div (dc.c)                                  657   124      22139   0.03  99.69
mult (dc.c)                                1104    40      20630   0.02  99.71
......
------------
Following is an analysis of instruction usage, on a MIPS R3000-based
machine:
pixstats dc:
 174126742 (2.065) cycles (6.97s @ 25.0MHz)
  84303520 (1.000) instructions  [# instructions]
      1283 (0.000) calls  [basically: never does function calls]
  28881440 (0.343) loads  [a little high]
   8458964 (0.100) stores
  89823222 (1.065) multiply/divide interlock cycles (12/35 cycles)
		(amazingly high: 50% of the time in this code is doing
		integer multiply/divide.  Real programs do exist
		like this, but this is completely unrepresentative of
		the vast bulk of integer code....)

1.36e+05 cycles per call  ... like I said: hardly ever does function calls
6.57e+04 instructions per call


Instruction concentration:
         1   1.4%
         2   2.8%
         4   5.7%
         8  11.4%
        16  22.7%
        32  45.4%
        64  90.8%
       128  99.6%
       256  99.8%
       512  99.9%
      1024 100.0%
      2048 100.0%
      3697 100.0%

THIS SAYS: in a perfect fully-associative cache, 90.8% of the instruction
cycles would be spent in only 64 words (64 instructions), and 99.9% would
fit into 1024 words.... i.e., it fits into almost any machine's cache...

opcode distribution: [dynamic]
     div    2395317    2.84%
   multu    1197623    1.42%

A PROGRAM WITH TWICE AS MANY INTEGER DIVIDES AS MULTIPLIES....
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

borasky@ogicse.ogi.edu (M. Edward Borasky) (12/27/90)

Thank you for at least driving another stake in "bc benchmark"'s heart.
However, as you and I know, there is a tremendous need out there for
[sigh] [gasp] A SINGLE NUMBER to characterize JUST EXACTLY HOW FAST
ANY GIVEN COMPUTER IS.  I have my own personal favorite which I will
not belabor because everyone has his own personal favorite.  My question
is this: just as you and I believe that vampires don't exist, do you
believe that a single number that measures a computer's speed doesn't
exist?  I won't state MY belief to avoid bias in the discussion.  My
use of the word "bias" in the preceding sentence is a HINT on my belief!

pbickers@tamaluit.phys.uidaho.edu (Paul Bickerstaff) (12/27/90)

In article <15379@ogicse.ogi.edu>, borasky@ogicse.ogi.edu (M. Edward
Borasky) writes:
> My question
> is this: just as you and I believe that vampires don't exist, do you
> believe that a single number that measures a computer's speed doesn't
> exist?  I won't state MY belief to avoid bias in the discussion.  

There is NO such number!!

It does not take an expert to dig up two programs: #1 runs faster on
machine A than on machine B, but #2 runs faster on machine B.  Both
programs could, e.g., be in Fortran.  Further, a suite of
"representative" programs could run faster on A when the load factor
is 1.0 but faster on B when the load factor is, say, 8.  The list of
possibilities goes on.

Paul Bickerstaff                 Internet: pbickers@tamaluit.phys.uidaho.edu
Physics Dept., Univ. of Idaho    Phone:    (208) 885 6809
Moscow ID 83843, USA             FAX:      (208) 885 6173

choll@telesoft.com (Chris Holl @adonna) (12/28/90)

In article <15379@ogicse.ogi.edu>, borasky@ogicse.ogi.edu
(M. Edward Borasky) writes:

>   ...do you believe that a single number that measures a computer's
>   speed doesn't exist?

When I worked at Boeing Computer Services we typically looked for one
number to compare two machines.  The comparison was very narrow in
scope however.  We compared one vendor's computer to their next box.
As long as the architecture stays the same, such comparisons are
valid.  This was needed because as a computer service, there had to be
a way to consistently charge customers independent of which box their
job actually ran on.  The CPU times needed to be normalized so it
wouldn't matter if a job ran on a Cyber 175 or 760, or a Cray 1S or X-MP.
In fact, we had to guarantee this for our government customers who
insisted that their bill for the same job should always be within some
percentage (5%, I think).

The CPU ratios were determined by running 10 to 14 CPU kernels such as
linear code, loops, subroutine calls, memory fetches (in and out of
stride), matrix reductions, etc.  Again, as long as the architecture
was the same (or close enough) the ratios stayed pretty constant (and
typically close to the ratio of the clocks, which was usually the
biggest difference).  Where the architecture changed we had trouble 
justifying one number.

For example, we compared a Cray 1 to a Cray X-MP.  The clocks were
12.5 to 9.5 nanoseconds.  All the ratios looked fine (1.32 give or
take a bit) except scatter/gather which was 10 to 14 times faster on
the X!  (Hardware scatter/gather - architecture change.)  A job that
performed a lot of scatter/gather would burn different amounts of CPU
seconds on the different Crays.

The other "one number" we used was for capacity planning.  After
maturing through many yardsticks of throughput, one of my fellows
(Dr. Howard "Doc" Schemising - wonderful guy) developed a capacity 
test that would precisely model the current workload on a machine.
This was used with great accuracy to measure the capacity of different
machines (for that workload).

Anyway, summing up my ramblings, one number is okay for a given
architecture or a given application.  Unfortunately that is not what
most people are looking for.  They want you to tell them how fast
their jobs are going to be on machine X if they are this fast on Y.
My answer has always been "Depends what you're doing." which rarely
satisfies 'em.  :-).

Chris Holl
TeleSoft (formerly of BCS)
5959 Cornerstone Ct. W.
San Diego, CA  92121

borasky@ogicse.ogi.edu (M. Edward Borasky) (12/28/90)

In article <1142@telesoft.com> choll@telesoft.com (Chris Holl @adonna) writes:
>When I worked at Boeing Computer Services we typically looked for one
>number to compare two machines.  
[...]
>  This was needed because as a computer service, there had to be
>a way to consistently charge customers independent of which box their
>job actually ran on.  
[...]
>In fact, we had to guarantee this for our government customers who
>insisted that their bill for the same job should always be within some
>percentage (5%, I think).
I was hoping for a response like this.  There are two types of computer
users -- those like you and me, who realize that computing costs money
and is a resource that must and can be managed, and those, like students,
computer science faculty, and dreamer/architects, who think that computing
should be, can be and often is essentially free.  Granted, you can pick
YOUR favorite speed number (let's say SPECmarks) and come up with a
very low cost box that sits on your desk and delivers it, complete
with stunning 3D graphics and UNIX and some kind of windowing.  But
although the user of this box may think of $10K as very little money,
the company or university that bought 100 of them (now we're talking a
million) PLUS the Ethernet PLUS the guy who comes and bails you out
when you accidentally delete your whole directory PLUS the guy who
backs your files up once a week so you CAN get bailed out, etc. -- that
company/university has a large investment here.
>
>For example, we compared a Cray 1 to a Cray X-MP.  The clocks were
>12.5 to 9.5 nanoseconds.  All the ratios looked fine (1.32 give or
>take a bit) except scatter/gather which was 10 to 14 times faster on
>the X!  (Hardware scatter/gather - architecture change.)  A job that
>performed a lot of scatter/gather would burn different amounts of CPU
>seconds on the different Crays.
I'll bet that the COST difference between the two machines was such
that you could afford to give away the extra speed on the X-MP from
the hardware scatter/gather -- bill the X-MP as if it were strictly
1.32 times the Cray 1.
>
>The other "one number" we used was for capacity planning.  After
You just said the secret word -- "capacity planning"!  I wish the duck
were still around to drop down and give you fifty dollars!
>(Dr. Howard "Doc" Schemising - wonderful guy) developed a capacity 
>test that would precisely model the current workload on a machine.
>This was used with great accuracy to measure the capacity of different
>machine (for that workload).
Is this published?  Could you post it?  The guys here and in "comp.arch"
would LOVE to see it!
>
[...]
>Unfortunately that is not what
>most people are looking for.  They want you to tell them how fast
>their jobs are going to be on machine X if they are this fast on Y.
Yes, for that you DO need more than ONE number.  But for supercomputers
and supercomputer applications, you can do a damn fine job with THREE
numbers!  Two numbers to describe the computer and one for the
application.
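Borasky doesn't say which three numbers he means; one classic candidate
from the supercomputer literature is the Hockney model, where the
machine is described by its asymptotic rate r_inf and half-performance
length n_1/2, and the application contributes its typical vector length
n.  A sketch under that assumption (mine, not Borasky's):

```c
/* Hockney-style performance model (an assumption about which "three
 * numbers" are meant, not something stated in the posting):
 *     rate(n) = r_inf / (1 + n_half / n)
 * Two numbers (r_inf, n_half) characterize the machine; one number
 * (n, the typical vector length) characterizes the application. */
double predicted_rate(double r_inf, double n_half, double n)
{
    return r_inf / (1.0 + n_half / n);
}
```

At n == n_half the machine runs at exactly half its asymptotic rate,
which is what makes the pair a compact machine description.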

mash@mips.COM (John Mashey) (12/29/90)

In article <15424@ogicse.ogi.edu> borasky@ogicse.ogi.edu (M. Edward Borasky) writes:
>>The other "one number" we used was for capacity planning.  After
>You just said the secret word -- "capacity planning"!  I wish the duck
>were still around to drop down and give you fifty dollars!
>>(Dr. Howard "Doc" Schemising - wonderful guy) developed a capacity 
>>test that would precisely model the current workload on a machine.
>>This was used with great accuracy to measure the capacity of different
>>machine (for that workload).
>Is this published?  Could you post it?  The guys here and in "comp.arch"
>would LOVE to see it!

Yes, it would be good to see this.  Note the important fact that there are
two steps:
	a) Characterizing the workload
	b) Predicting the performance on that workload
Part a) is why SPEC advises people to try to correlate their own workloads
with some subset of SPEC benchmarks, and then ignore the other SPEC benchmarks,
and in fact, I've started to see users doing this already.

Also, I've seen some pretty good benchmarks, with workloads tailored to
different departments within a company ... unfortunately, the best ones
I've seen were all proprietary...
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

alan@shodha.enet.dec.com ( Alan's Home for Wayward Notes File.) (12/31/90)

In article <15379@ogicse.ogi.edu>, borasky@ogicse.ogi.edu (M. Edward Borasky) writes:
> My question is this: just as you and I believe that vampires don't 
> exist, do you believe that a single number that measures a computer's 
> speed doesn't exist?

	I do not believe that a SINGLE (atomic) number exists which
	usefully measures a computer's speed.  The problem with a
	single number is that it may measure only one or a few aspects
	of the computer's speed.  For example: how long it takes to
	execute an instruction stream which contains 60% integer
	divides and 39% integer multiplies and which fits into the
	system data and instruction caches.  For some small population
	that aspect may be interesting.  For many others that is not
	a reflection of reality.

	Some people will be interested in how fast the computer does
	integer adds and subtracts.  Others, floating point arithmetic.
	Still others byte copies and compares.  And still others will
	have applications that are dominated by I/O.

	What is required are MANY (atomic) numbers, each of which
	measures one or more of the interesting aspects of a computer's
	speed.  To go along with these numbers are reasonably detailed
	descriptions of what that number measures.  That way you can
	examine your application to characterize its use of the system
	and find the number which is the best match.  Only then is it
	safe to compare single numbers.

	Now, one thing that can be done is to take these MANY (atomic) 
	numbers and run them through some mathematical function to get
	a single number that represents the combination of all of the
	numbers.  The function has to be carefully constructed so that
	one exceptional number (good or bad) doesn't dominate the final
	answer.

	If all bc(1)'s are created equal and you know that its use
	of the system reflects how you use the system, then it may
	well be a good single benchmark.  I believe that using bc(1)
	is only safe as a benchmark if your primary application is in
	fact bc(1).  If a vendor decides to make bc(1) go very fast
	at the expense of improving the performance of their compilers
	and libraries, then only the people that use bc(1) will win.

    My application:

	Read 24 bytes and extract four poorly aligned fields: a
	read/write flag, LBN (logical block number), transfer size and
	unit number.
	If this record is interesting, perform a variety of operations 
	some of which are:

		Look up the transfer size in a list based on the
		unit number to count how many transfers of that
		size there are (pointer chasing ending with an
		integer increment).

		Increment a counter (based on the unit number) for
		each read or write.

		Determine the absolute distance from the previous
		LBN to the current LBN (for this unit number).

		Determine what logical partition this LBN was in to
		increment another set of counters (still more pointer
		chasing for the transfer size).

	    So far the application has stressed integer add, compares
	    and moving bytes around.

		Look up the LBN in a database to determine whether or 
		not the LBN is file system metadata and what kind.

	    It just became seriously I/O bound (a large buffer cache
	    helps A LOT).

	Once all this has been done for every 24 byte record in the 
	input stream, print a summary of the results.  Most of the 
	math is done in floating point (adds with a small number of 
	multiplies and divides).
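	That "extract four poorly aligned fields" step is the kind of
	shift-and-mask work no single benchmark number captures.  A
	sketch of a generic bit-field extractor follows; the real record
	layout isn't given in this posting, so any offsets and widths
	passed to it would be invented:

```c
#include <stdint.h>

/* Pull an arbitrary (possibly unaligned) bit field out of a raw
 * record, MSB-first within each byte.  The record format this would
 * be applied to is hypothetical; the posting doesn't give it. */
unsigned extract_field(const uint8_t *rec, int bit_off, int width)
{
    unsigned v = 0;
    for (int i = 0; i < width; i++) {
        int b = bit_off + i;
        v = (v << 1) | ((rec[b >> 3] >> (7 - (b & 7))) & 1u);
    }
    return v;
}
```

	The loop is all shifts, masks and integer adds: exactly the mix
	the surrounding text says dominates the first phase of the
	application.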

	If you like them, vampires are a nice fantasy.  As are single
	number benchmarks.
-- 
Alan Rollow				alan@nabeth.enet.dec.com

eachus@linus.mitre.org (Robert I. Eachus) (01/01/91)

     There is a way to have and use a single meaningful (balanced)
benchmark number, but only to do initial selection:

     Say you choose SPECmarks.  (I use Dhrystones, but that is a minor
detail.)  What you end up with are two things, one a standard single
number rating, and the other a set of comments or annotations telling
the particular strengths and weaknesses of various machines.  First of
all, realize that there are really only three shades of gray where
buying hardware is concerned: more than fast enough, very tight, and
no way Jose.  If you pick a machine in the gray area, there is a major
tradeoff between the cost of optimizing code for the machine selected and
the cost of more than adequate hardware.  Unfortunately, supercomputer
users and real-time people sometimes find that there IS no other
choice, but I digress.

     If your single number correlates well with price, then choosing
the right machine for the job entails first benchmarking your
application on whatever machine you have around.  (I often see
"estimates" of code size and run-time that are off by two orders of
magnitude. If you can't get within a factor of two or three, why
bother spending time benchmarking the hardware?)  At this point we now
have an estimate that say application Y will require a 50 MIPS VAX.
You can now characterize the application in terms such as integer vs.
floating, vectorizable vs. scalar, I/O intensive vs. compute bound,
or single task intensive vs. multitasking, and check out the machines
in the range you need with the strengths you want.  This will often
come down to a single choice or at least a single processor family
that you need to run your application specific benchmark on.

     This method of choosing hardware DOES require running application
specific benchmarks twice.  But, at least in the applications that I
care about, you are fooling yourself if you don't write and run a good
application specific benchmark to start with.  Using the SPECmark
suite and tailoring it to your particular application might give a
better fit, but the amount of work required to determine the
coefficients is usually more than that involved in this approach.

--

					Robert I. Eachus

with STANDARD_DISCLAIMER;
use  STANDARD_DISCLAIMER;
function MESSAGE (TEXT: in CLEVER_IDEAS) return BETTER_IDEAS is...

choll@telesoft.com (Chris Holl @adonna) (01/01/91)

>From: borasky@ogicse.ogi.edu (M. Edward Borasky)

> I was hoping for a response like this.  There are two types of computer
> users -- those like you and me who realize that computing costs money,
> is a resource that must and can be managed, and those like students,
> computer science faculty, dreamer/architects who think that computing
> should be, can be and often is essentially free.  

Boeing Computer Services occasionally received criticism of their
"high" bills for computing services.  In response they produced a paper
called "The Real Cost of Computing" to educate their customers.  It
described many of the things you mentioned including support of
the hardware, software, configuration, backups, etc.  I don't know if
the paper is available, but I could find out if there is interest.

> I'll bet that the COST difference between the two machines was such
> that you could afford to give away the extra speed on the X-MP from
> the hardware scatter/gather -- bill the X-MP as if it were strictly
> 1.32 times the Cray 1.

That's exactly what we wound up doing.  We couldn't justify a larger
figure because if a job ran on the X that didn't use scatter/gather it
would get a higher bill than it would have on the S, and that wouldn't
do.  Using 1.32, a job that used scatter/gather simply got a better deal
on the X.  We made a point of telling users this (so they got the
proper perspective :-) and to encourage them to take advantage of the
hardware.  If their jobs became more efficient, throughput would go up.

The algorithm for billing was slightly different however.  The 1-S had
a single processor and 2 Meg while the X-MP had 2 processors and 4 Meg. 
We billed for CPU seconds and memory residency.  When a job started
using more than 2 Meg it started to pay a percentage of the other
processor, even if it wasn't using it.  If a job used the entire 4 Meg
it paid for both processors, because no one else could use the other
processor without occupying memory.
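One reading of that memory rule, with the stated endpoints (2 Meg pays
for one processor, the full 4 Meg pays for both) linearly interpolated;
the actual BCS formula isn't given, so this is only my guess at it:

```c
/* Hypothetical reading of the BCS billing rule on the 2-CPU, 4-Meg
 * X-MP: a job within its 2-Meg share pays for one processor; above
 * that it picks up a pro-rated fraction of the second processor,
 * reaching 2.0 (both CPUs) at 4 Meg.  The exact formula BCS used is
 * not stated in the posting. */
double cpu_billing_factor(double mem_meg)
{
    const double share = 2.0, total = 4.0;
    if (mem_meg <= share)
        return 1.0;                     /* pays for one processor */
    /* linearly pick up the second CPU between 2 Meg and 4 Meg */
    return 1.0 + (mem_meg - share) / (total - share);
}
```

The interpolation is the natural choice because a job denying others
half the remaining memory is, on average, denying them half the second
processor too.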

BCS took the 1-S out of the configuration after users had migrated to
the X, so after the overlap period everyone got a better deal.  Now they
have a Y-MP.  I wasn't there for that transition, so I don't know
exactly how they managed it.

>  You just said the secret word -- "capacity planning"!  I wish the duck
>  were still around to drop down and give you fifty dollars!  

Duck?  $50?

> >(Dr. Howard "Doc" Schemising - wonderful guy) developed a capacity 
> >test that would precisely model the current workload on a machine.  
> >This was used with great accuracy to measure the capacity of different 
> >machine (for that workload).  
> 
> Is this published?  Could you post it?  The guys here and in "comp.arch"
> would LOVE to see it!  

In article <44371@mips.mips.COM>, mash@mips.COM (John Mashey) writes:

> Yes, it would be good to see this.  Note the important fact that there are
> two steps:
> 	a) Characterizing the workload
> 	b) Predicting the performance on that workload

> Also, I've seen some pretty good benchmarks, with workloads tailored to
> different departments within a company ... unfortunately, the best ones
> I've seen were all proprietary...

Yes, step a) is critical.  And yes, Doc Schmeising's benchmark is
proprietary.  (Sorry I typo-ed his name wrong the first time.)  I have
had a few requests for more information on this capacity benchmark,
and I don't think the following violates any of BCS' rights.

Doc Schmeising's benchmark was called QBM (Quick BenchMark) and is
owned by Boeing Computer Services (BCS).  There was talk at one time of
marketing it, but they never did.  A shame, because it is a great tool.
Doc has retired and I'm not there any more and I don't even know if
they are still using it.  

The basic premise is to take a "slice" of your system's workload that
runs in some fixed period of time (we used 10 minutes - Quick) and dump
it into another system to see how long it takes.  If the work can't
complete in 10 minutes, the target system has less throughput than the
base system.  If it can complete in 10 minutes the target system has
equal or greater throughput.

The tricky part is to quantify throughput, and this was QBM's real
strength.  

BCS collected data on a variety of resources used by jobs.  This
included things like

     .  CPU seconds burned
     .  Amount of memory used
     .  Duration of memory residency
     .  Disk blocks transferred
     .  Disk accesses

and a few others.  This data was stored on tapes and went back years.
It was collected for CDC Cybers and Crays (the two main workhorses at
BCS).  You, as the performance guru and benchmarker, would have to
select a 10 minute window where your machine was "full."  Full does not
mean the system was on its knees and response time was horrible.  It
means processing a reasonable workload with acceptable response time,
good CPU usage, no thrashing, etc.  I would pick a period of a busy
day; say, a busy hour, or 2 hours, or 30 minutes, or whatever, and feed
it to QBM.  QBM would filter out noise, and select 10 minute samples
from the time period.  It would select as many as you asked for (since
the 10 minute slices could start on any fraction of a second) and print
out a variety of statistics about the sample.  A GREAT DEAL OF CARE WAS
TAKEN TO PICK GOOD DATA.  A lot of accounting and performance data was
reviewed, and eventually one was picked and that was called your
baseline.  That represented the "full" capacity of your base machine.
This selection process was done only once or twice a year, when you
felt the work profile (the type of work being done) had changed enough
to affect your results.  This would happen: for example, when users
migrated from a Cyber to a Cray eventually they would start writing
code to take better advantage of vectorization.

QBM would then take the data from that sample and produce a synthetic
workload consisting of the same number of jobs, starting and stopping
at the same times during the 10 minute window (very important) and
using the same resources.

Some assumptions were made about how the CPU seconds were used (matrix
reductions?  straight line code?  etc.) and how they were distributed
during the job.  You can't use all the CPU seconds and then do all the
I/O.  On the other hand, an even distribution is not representative
either.  A variety of CPU kernels were used (the 10 to 14 I mentioned
in my first posting) and given different weightings dependent on what
we thought our machines were being used for.

Now here's the real beauty of QBM:  Not only could it create a
workload that was the same as your real-life sample, it could create a
workload that was 1.5 times that sample.  Or 5 times, or 0.5 times.
You could now create any multiple of that workload.

A lot of testing went into QBM to be sure that the jobs it created
would actually run in exactly 10 minutes on the base machine.  It is
much to Doc's credit that they did.  A QBM workload of 1.0 ran in 10
minutes.  If 1.1 ran in 10 minutes, then the sample you picked wasn't
really when your machine was full.  After some experience with it, we
had samples were 0.9 would complete in 10 minutes, 1.0 would also, but
1.1 would not.

We would then take these samples to a new machine and scale the load up
or down until the jobs JUST ran in 10 minutes.  In this way we could
report that "For our current workload AND configuration, the X-MP will
provide 3.85 times the throughput of the 1-S."  (This was our actual
result.)  Of course the configurations of both base and target system
had to be taken into account.
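That "scale the load up or down until the jobs JUST run in 10 minutes"
step is essentially a bisection on the load multiplier.  A sketch,
where pretend_fits() merely stands in for actually running the
synthetic workload, and its 3.85 threshold just echoes the X-MP result
quoted above:

```c
/* Bisection on the QBM-style load multiplier: find the largest scale
 * factor whose synthetic workload still completes inside the 10-minute
 * window.  runs_in_window() stands in for actually generating and
 * running the scaled job mix. */
double find_capacity(int (*runs_in_window)(double), double lo, double hi)
{
    for (int i = 0; i < 40; i++) {      /* ~12 digits of precision */
        double mid = (lo + hi) / 2.0;
        if (runs_in_window(mid))
            lo = mid;                   /* fits: try a bigger load */
        else
            hi = mid;                   /* doesn't fit: back off */
    }
    return lo;
}

/* Hypothetical stand-in: pretend anything up to 3.85x the baseline
 * workload fits in the window (the thread's actual X-MP vs 1-S
 * throughput figure, reused here only as an illustrative threshold). */
int pretend_fits(double scale) { return scale <= 3.85; }
```

In real use each probe is a 10-minute run, so the bisection would be
coarse (a handful of probes), but the logic is the same.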

This is significant also, because we could benchmark different
configurations on the same machine.  Suppose we had more channels?
Fewer channels and faster disks?  More disks?  All these questions
could be answered by setting up test configurations and measuring the
throughput.

No papers were written about QBM (unfortunately) although I did
present benchmarking results at a couple of CUGs (Cray Users Group) and
those papers are available.

Hope this has helped,

Chris.


Christopher Holl
TeleSoft
5959 Cornerstone Ct. W.
San Diego, CA  92121
(619) 457-2700