[comp.benchmarks] dyad operation speed

patrick@convex.COM (Patrick F. McGehearty) (04/05/91)

In article <MCCALPIN.91Apr4093107@pereland.cms.udel.edu> mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
With some clarifications from (jbs@watson.ibm.com)
[Apologies in advance if I have misattributed some of the following]
{in a discussion of the HP Snake peak speed}
> (2) Except for specially coded, cache-friendly stuff like Matrix Multiply
> and Gaussian Elimination of Dense Matrices (i.e. LINPACK), most
> calculations will be limited by the memory bandwidth.  The formula for
> 64-bit vector dyad operations (e.g. DSCAL) is
>
> 	Streaming MFLOPS = (MBytes/sec)/(24 bytes/FLOP)
>
>This is almost correct.  Just delete the word DSCAL and consider
>operations of the form:
>
>	a(i) = b(i) op c(i)
>
>where op is one of +,-,*.  This requires 2 8-byte reads and one 8-byte
>write per FP operation.
>
>This operation dominates most scientific codes (in my experience), and
>is therefore to be preferred over DAXPY for estimating machine speed.
>DAXPY requires the same amount of memory traffic, but squeezes in an
>extra FP op by using a loop-invariant scalar:
>	y(i) = y(i) + scalar*x(i)
>
>DSCAL requires only one read per op, and so scales like:
>
>	DSCAL MFLOPS = (MB/s)/(16 bytes/FLOP)
>
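
To put rough numbers on these formulas: a machine sustaining, say,
240 MB/s of memory traffic (a figure picked purely for illustration)
would stream the dyad at about 240/24 = 10 MFLOPS and run DSCAL at
about 240/16 = 15 MFLOPS, no matter how high its peak FP rate is.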


I would like to add that serious users of scientific computing should
not underestimate the effect of compiler technology on these rates.

While there is not much that can be done about:
	a(i) = b(i) op c(i)
in isolation, small variations can yield significant improvements.
For example, the trivial expansion to:
	a(i) = b(i) op c(i)
	d(i) = a(i) op e(i)
requires only five memory operations for two FP operations, since a(i)
can be reused from a register.  Further, if a(i) is strictly a
temporary, computing d(i) requires only four memory operations for two
floating point operations, yielding a 50% speed gain if the compiler is
good enough to take advantage of it.
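
Concretely, the pair can be fused into one loop with a(i) carried in a
register; a minimal Fortran sketch (the subroutine name, the use of
'+' for op, and the declarations are my own):

      subroutine fused(n, a, b, c, d, e)
      integer n, i
      double precision a(n), b(n), c(n), d(n), e(n)
c     Each iteration reads b, c, e and writes a, d: five memory
c     operations for two FP operations, because a(i) is reused from a
c     register in the second statement rather than reloaded.  If a()
c     were purely a temporary, the store to a(i) could be dropped as
c     well, leaving four.
      do 10 i = 1, n
         a(i) = b(i) + c(i)
         d(i) = a(i) + e(i)
   10 continue
      return
      end
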
Another common form is:
	a(i,j) = b(j) op c(i,j)
which also requires only two memory operations per iteration instead of
three if the compiler can handle it, since b(j) is invariant in the
inner loop over i.
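
A similar sketch for this form, again with '+' standing in for op and
with names of my own choosing; the scalar t makes the invariance of
b(j) explicit:

      subroutine form2(n, a, b, c)
      integer n, i, j
      double precision a(n,n), b(n), c(n,n), t
      do 20 j = 1, n
c        b(j) is loaded into t once per column and held in a register,
c        so each inner iteration costs one read of c(i,j) and one
c        write of a(i,j): two memory operations instead of three for
c        its single FP operation.
         t = b(j)
         do 10 i = 1, n
            a(i,j) = t + c(i,j)
   10    continue
   20 continue
      return
      end
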
The DAXPY loop has been well studied and can be enhanced significantly.
With inlining,
	y(i) = y(i) + scalar*x(i)
becomes a doubly nested loop over i and j (the DMXPY form):
	y(i) = y(i) + s(j)*m(i,j)
Only 3*n + n*n memory operations are required for 2*n*n floating point
operations, if the operations are done in the correct order (so that
both y(i) and s(j) are reused from registers rather than refetched).
So a sufficiently good compiler could help a machine approach a rate of
	DMXPY MFLOPS = (MB/s)/(8 bytes/2 FLOPs)
for DAXPY, which is 4 times as good as the DSCAL rate shown above, or
6 times as good as the dyad rate.
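
A minimal sketch of the inlined form (this is not the actual LINPACK
DMXPY routine; the argument list and names are simplified):

      subroutine dmxpy(n, y, s, m)
      integer n, i, j
      double precision y(n), s(n), m(n,n), t
c     m(i,j) is streamed through once (n*n reads) and s(j) is loaded
c     once per column.  As written, y(i) is still loaded and stored in
c     the inner loop; the compiler (or hand unroll-and-jam) has to keep
c     a strip of y in registers to approach 3*n + n*n memory operations
c     for 2*n*n flops.
      do 20 j = 1, n
         t = s(j)
         do 10 i = 1, n
            y(i) = y(i) + t*m(i,j)
   10    continue
   20 continue
      return
      end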

These are the sorts of speed-ups which hand-tuning code frequently yields.

How often can compilers do these sorts of things?  Well, that depends on
whose compiler we are talking about, and whose code.  I have seen some
amazing examples, both of super optimizations and of really bad code
generation, in my years [decades?] of system measurement.  In the absence
of other data, here are some heuristics for predicting compiler quality:

(1) Compilers for new architectures will not be as well tuned as those
	which are at least two years old.
(2) Compilers for architectures that require use of esoteric features
	to get max performance take longer to mature than compilers
	for simple architectures.
	Esoteric features include: vectors, parallelism, super-scalar
	characteristics, VLIW, special purpose instructions (like string
	compare or block move), caches that are much faster than main
	memory, etc, etc.
(3) Companies that emphasize high performance or good cost/performance
	in their advertising tend to spend on compiler technology.
(4) Careful reading of benchmark articles such as Dongarra's LINPACK
	report or the Perfect Club report can provide indications of relative
	compiler quality vs. the hand-tuning skills of the benchmarker.

But there are exceptions to all of the above rules, so know what you are
buying: run YOUR benchmarks on candidate systems before making a major
computer purchase.  Try to make them representative of the way the system
will be used.  If it is a single-user workstation or a batch engine, then
one job at a time may be sufficient.  If it is a compute server for many
users at once, then run 1, 2, ..., n copies of your tasks and see what
that does to throughput.

In summary, rough measures such as the dyad rate and the peak LINPACK rate
are better than peak MFLOPS, but do not put too much value on them when
making a purchase decision.

preston@ariel.rice.edu (Preston Briggs) (04/05/91)

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) writes:
>> (2) Except for specially coded, cache-friendly stuff like Matrix Multiply
>> and Gaussian Elimination of Dense Matrices (i.e. LINPACK), most
>> calculations will be limited by the memory bandwidth.  The formula for
>> 64-bit vector dyad operations (e.g. DSCAL) is
>>
>> 	Streaming MFLOPS = (MBytes/sec)/(24 bytes/FLOP)
>>
>>This is almost correct.  Just delete the word DSCAL and consider
>>operations of the form:
>>
>>	a(i) = b(i) op c(i)
>>
>>where op is one of +,-,*.  This requires 2 8-byte reads and one 8-byte
>>write per FP operation.
>>
>>This operation dominates most scientific codes (in my experience), and
>>is therefore to be preferred over DAXPY for estimating machine speed.
>>DAXPY requires the same amount of memory traffic, but squeezes in an
>>extra FP op by using a loop-invariant scalar:
>>	y(i) = y(i) + scalar*x(i)
>>
>>DSCAL requires only one read per op, and so scales like:
>>
>>	DSCAL MFLOPS = (MB/s)/(16 bytes/FLOP)


patrick@convex.COM (Patrick F. McGehearty) writes:

>While there is not much that can be done about:
>	a(i) = b(i) op c(i)
>in isolation, small variations can yield significant improvements.
>For example, the trivial expansion to:
>	a(i) = b(i) op c(i)
>	d(i) = a(i) op e(i)
>requires only five memory operations for two FP operations, since a(i)
>can be reused from a register.  Further, if a(i) is strictly a
>temporary, computing d(i) requires only four memory operations for two
>floating point operations, yielding a 50% speed gain if the compiler is
>good enough to take advantage of it.  Another common form is:
>	a(i,j) = b(j) op c(i,j)
>which also requires only two memory operations per iteration instead of
>three if the compiler can handle it, since b(j) is invariant in the
>inner loop over i.  The DAXPY loop has been well studied and can be
>enhanced significantly.  With inlining,
>	y(i) = y(i) + scalar*x(i)
>becomes a doubly nested loop over i and j (the DMXPY form):
>	y(i) = y(i) + s(j)*m(i,j)
>Only 3*n + n*n memory operations are required for 2*n*n floating point
>operations, if the operations are done in the correct order (so that
>both y(i) and s(j) are reused from registers rather than refetched).
>So a sufficiently good compiler could help a machine approach a rate of
>	DMXPY MFLOPS = (MB/s)/(8 bytes/2 FLOPs)
>for DAXPY, which is 4 times as good as the DSCAL rate shown above, or
>6 times as good as the dyad rate.


In an important paper,

	Estimating interlock and improving balance for pipelined architectures
	David Callahan, John Cocke, Ken Kennedy
	International Conference on Parallel Processing
	1987

the authors introduce the term "machine balance" for the ratio

	fetches/cycle
	-------------
	 flops/cycle

They also define the "loop balance" of a loop as the ratio

	fetches in the loop
	-------------------
	 flops in the loop

They note that if the loop balance of a particular loop
is greater than the machine balance, then the loop is memory-bound.
Similarly, if the loop balance is less than the machine balance,
then the loop is compute-bound.
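
For a concrete, hypothetical example: a machine that can sustain one
fetch per cycle and one flop per cycle has machine balance 1.  The dyad
loop a(i) = b(i) op c(i) does 2 fetches per flop (leaving the store of
a(i) aside), so its loop balance is 2.  Since 2 > 1, the loop is
memory-bound: the fetches alone hold it to about half of peak, no
matter how the code is scheduled.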

They also consider the effect of a variety of loop transformations
on the loop balance, noting several that can improve the balance
(reduce the number of fetches).

Note also that the basic motivation for the introduction of
the Level 2 and Level 3 BLAS routines was to allow better performance
on machines with limited memory bandwidth.

For example, the i860 doesn't have the bandwidth to perform vector
operations (BLAS Level-1) on large vectors at peak speed.
It doesn't matter how good the compiler is, or how carefully you
tweak the assembly; the machine is limited by the speed at which it can
fetch data from off-chip.

On the other hand, the level-2 (matrix-vector) and level-3 (matrix-matrix)
routines allow much higher performance, even on machines with limited
memory bandwidth.

Therefore, for portable performance, code using the highest level
routines possible.
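
For instance, a whole matrix update can be handed to a single Level-3
call rather than written as nested loops or repeated Level-1 calls.  A
small Fortran sketch against the standard DGEMM interface (the wrapper
routine and its names are my own):

      subroutine update(n, a, b, c)
      integer n
      double precision a(n,n), b(n,n), c(n,n)
c     Compute C := C + A*B with one Level-3 BLAS call.  A tuned DGEMM
c     blocks for registers and cache internally, so the same source can
c     run near peak on machines with very different memory bandwidths.
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 1.0d0, c, n)
      return
      end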

When comparing architectures, we should be able to determine peak rates
for the level-1 routines fairly easily.  The level-2 and level-3 routines
are harder.  I believe, though, that the level-3 routines can be sufficiently
warped (via CCK's loop transforms) to match _any_ machine balance,
given adequate registers.  So, given the machine balance, the size
of the register set, and perhaps the length of the various pipelines,
we should be able to determine the peak performance of the level-2
and level-3 routines.

Preston Briggs