[comp.benchmarks] A question about flat out Snake speed.

ssr@stokes.Princeton.EDU (Steve S. Roy) (04/03/91)

With all the discussion of the speed of HP's hot new Snake systems,
I've been wondering what their peak speeds are.  Suppose you hand
coded a matrix multiply, trig function, FFT or whatever.  What is the
maximum speed you could get and what would the limiting factors be?
Is a daxpy ( y = a*x+y ) limited by the FPU or by cache or by main
memory?  How flexible is the multiply-accumulate?  Can the integer and
fp units run in parallel as on the i860?

Yes, I know that this sort of theoretical peak speed is often
irrelevant for whole applications.  I still want to know because in my
codes I know which parts I can apply this treatment to and I can then
calculate the speedup I would see.  I'm also curious to find out how
good the compiler is at exploiting the architecture.

Thanks.
Steve Roy

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (04/04/91)

>>>>> On 3 Apr 91 14:34:49 GMT, ssr@stokes.Princeton.EDU (Steve S. Roy) said:

Steve> With all the discussion of the speed of HP's hot new Snake
Steve> systems, I've been wondering what their peak speeds are.
Steve> Suppose you hand coded a matrix multiply, trig function, FFT or
Steve> whatever.  What is the maximum speed you could get and what
Steve> would the limiting factors be?  Is a daxpy ( y = a*x+y )
Steve> limited by the FPU or by cache or by main memory?  How flexible
Steve> is the multiply-accumulate?  Can the integer and fp units run
Steve> in parallel as on the i860?

(1) It looks like the "Peak Speed" of the Snakes will be 50 MFLOPS for
the 720 and 66 MFLOPS for the 730.  This is based on the comment from
someone at HP that the adder and multiplier could each accept a new
pair of operands every other clock.  Since they are fully pipelined
and can run simultaneously, this gives a peak MFLOPS = MHz.

Note that this is different from the IBM RS/6000, whose adder and
multiplier can each accept operands every clock, so that add/multiply
peak MFLOPS = 2*MHz.




(2) Except for specially coded, cache-friendly stuff like Matrix Multiply
and Gaussian Elimination of Dense Matrices (i.e. LINPACK), most
calculations will be limited by the memory bandwidth.  The formula for
64-bit vector dyad operations (e.g. DSCAL) is

	Streaming MFLOPS = (MBytes/sec)/(24 bytes/FLOP)

The HP has a 32-bit memory bus capable of 1 word/clock transfer rates,
so the "Streaming MFLOPS" for these machines is:

	HP/9000 Models:
		720	= 200/24 =  8.3 MFLOPS
		730	= 264/24 = 11.0 MFLOPS

The formula for the IBM RS/6000 is slightly different since the
bottleneck is between cache and the FPU, not between main memory and
cache.  The "Streaming MFLOPS" for these machines are

	IBM RS/6000 Models:
	320,520 	= 160/24 =  6.8 MFLOPS
	530,730,930	= 200/24 =  8.3 MFLOPS
	540		= 240/24 = 10.0 MFLOPS
	550		= 328/24 = 13.7 MFLOPS

The corresponding numbers for triadic vector operations like DAXPY are
exactly twice these estimates.



(3) How fast your code will run will depend to a great degree on how
much re-use you get of cached data.  Long streaming vector ops will
run at the memory bandwidth-limited speed, while short (cacheable),
reused vectors will run at closer to the "Peak Speed".
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (04/04/91)

>>>>> On 3 Apr 91 19:27:04 GMT, (mccalpin@perelandra.cms.udel.edu) I
wrote about the "Peak" vs "Streaming" speeds (MFLOPS) of the new HP
Snakes and IBM RS/6000 computers.

Thanks to jbs@watson.ibm.com for pointing out some errors that I would
like to clear up below:

-----------------------
Me> (2) Except for specially coded, cache-friendly stuff like Matrix Multiply
Me> and Gaussian Elimination of Dense Matrices (i.e. LINPACK), most
Me> calculations will be limited by the memory bandwidth.  The formula for
Me> 64-bit vector dyad operations (e.g. DSCAL) is

Me> 	Streaming MFLOPS = (MBytes/sec)/(24 bytes/FLOP)

This is almost correct.  Just delete the word DSCAL and consider
operations of the form:

	a(i) = b(i) op c(i)

where op is one of +, -, *.  This requires two 8-byte reads and one
8-byte write per FP operation.

This operation dominates most scientific codes (in my experience), and
is therefore to be preferred over DAXPY for estimating machine speed.
DAXPY requires the same amount of memory traffic, but squeezes in an
extra FP op by using a loop-invariant scalar:
	y(i) = y(i) + scalar*x(i)

DSCAL requires one read and one write per op, and so scales like:

	DSCAL MFLOPS = (MB/s)/(16 bytes/FLOP)

-------------------------
Me> The formula for the IBM RS/6000 is slightly different since the
Me> bottleneck is between cache and the FPU, not between main memory and
Me> cache.  The "Streaming MFLOPS" for these machines are

To clarify: What I was trying to say was that the memory bandwidth to
be used to calculate streaming MFLOPS on these machines is the
bandwidth of 8 bytes/clock from the cache to the registers, NOT the
bandwidth of 16 bytes/clock from the main memory to the cache.

This extra bandwidth from main memory to cache is not wasted (since it
considerably reduces the cache refill time), but the machine is not
capable of 16 byte/clock transfers to and from the FPU, where the work
is done.

-------------------------
Finally:

Me> 	IBM RS/6000 Models:
Me> 	320,520 	= 160/24 =  6.8 MFLOPS
                                    ^^^
This should of course be 6.7 MFLOPS.
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

daryl@hpcupt3.cup.hp.com (Daryl Odnert) (04/07/91)

Here are the key things to consider in formulating an answer to
Steve's question:

   o  One instruction is issued to either the CPU or the floating-point
      coprocessor in each cycle.

   o  The central processor and coprocessor both run at the same
      speed (either 50 MHz or 66 MHz, depending on the model).

   o  The CPU can execute instructions in parallel with multi-cycle
      floating-point operations.

   o  Floating-point add and multiply operations have three-cycle
      latencies, regardless of precision (double or single).

   o  There are independent ALU and MPY functional units within the
      floating-point coprocessor which can operate in parallel.

   o  Because they are pipelined, the ALU and MPY units can each
      start a flop every other cycle as long as no data dependencies
      are present.

   o  Assuming no cache misses, loads take 2 cycles to complete.
      No interlocks will occur as long as the instruction executed
      immediately after the load does not reference the load target
      register.

   o  The pipeline penalty for stores is zero, one, or two cycles,
      depending on the distance between the store instruction and
      the next memory reference.  In other words, a store followed
      immediately by another memory reference will suffer a two
      cycle stall.

   o  The FMPYADD instruction allows *independent* multiplication
      and addition operations to be dispatched in a single cycle.
      This is NOT a multiply-and-accumulate operation.

If there are no memory references and you alternate multiply and add
operations, the coprocessor has a peak performance rate of 66 megaflops.

The DAXPY loop consists of 5 operations per vector element: 2 double-precision
loads, 1 multiply, 1 add, and 1 double-precision store.  If the loop can be
scheduled such that there are no interlocks, a non-superscalar PA-RISC machine
would hit a peak rate of 2 flops every 5 cycles in the inner loop.

Thus, peak performance for the DAXPY loop on the 66MHz Snakes box is:

66 million instructions per second * (2 flops / 5 instructions) = 26.4 MFLOPS

The FORTRAN compiler available at the first release of the HP 9000/700 will
automatically unroll the inner loop of DAXPY 4 times.  The optimizer is
able to use the FMPYADD instructions and schedule the loop in such a way
that each iteration executes in 22 cycles.  Each iteration executes
8 flops.  This results in a performance rating of 66 * (8 / 22) = 24 MFLOPS
in the inner loop.  Thus at the present time, the compilers are achieving
about 90% of the peak performance potential on this particular loop (assuming
all data fits in the cache).  I expect that future releases of the compiler
will achieve 100% of the potential.

Regards,
Daryl Odnert       daryl@hpcllla.cup.hp.com
Hewlett-Packard
California Language Lab
Cupertino, California

daryl@hpcupt3.cup.hp.com (Daryl Odnert) (04/09/91)

John McCalpin writes:

> The HP has a 32-bit memory bus capable of 1 word/clock transfer rates,
> so the "Streaming MFLOPS" for these machines is:
> 
>        HP/9000 Models:
>               720     = 200/24 =  8.3 MFLOPS
>               730     = 264/24 = 11.0 MFLOPS
>

John's data on the Snakes machine is incorrect.  The HP9000 models 720 and 730
have a 64-bit memory bus and are capable of 2 word/clock transfer rates.
Thus, the correct "Streaming MFLOPS" for these machines are:

        HP/9000 Models:
               720      = 400/24 = 16.7  MFLOPS
               730      = 528/24 = 22.0  MFLOPS


Daryl Odnert      daryl@hpcllla.cup.hp.com
Hewlett-Packard
California Language Lab

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (04/09/91)

>>>>> On 6 Apr 91 20:38:43 GMT, daryl@hpcupt3.cup.hp.com (Daryl Odnert) said:

Daryl> Here are the key things to consider in formulating an answer to
Daryl> Steve's question:

Thanks for the helpful posting....

I just thought I would point out one detail that got buried near the
bottom of your posting:

Daryl> Thus, peak performance for the DAXPY loop on the 66MHz Snakes
Daryl> box is:  66 million instructions per second * (2 flops / 5 instructions)
Daryl> = 26.4 MFLOPS

Daryl> This (compiled code) result(s) in performance rating of 66 *
Daryl> (8/ 22) = 24 MFLOPS in the inner loop.  Thus at the present
Daryl> time, the compilers are achieving about 90% of the peak
Daryl> performance potential on this particular loop 
Daryl> (assuming all data fits in the cache.)
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This is, of course, a big assumption!  It is true that the HP caches
are rather large, but the 256kB data cache will not even hold one
200x200 double-precision matrix (320,000 bytes).

The memory bandwidth of the machine limits the long-vector DAXPY
performance to under 22 MFLOPS.  How well does the Fortran-produced
code perform for long vectors?  (My estimate would be in the range of 
18-20 MFLOPS).

It is also important to remember that the 66 MHz machines are not the
$12000 machines.  For these 50 MHz boxes, the numbers scale down by a
factor of 0.76, giving long-vector DAXPY performance of about 14-15
MFLOPS.  

For comparison, the IBM RS/6000 machines run uncached DAXPY at:

	Model 320	6.25 MFLOPS (measured)		13.3 Theoretical
	Model 320H	7.81 MFLOPS (est)		16.7 Theoretical
	Model 530      10.53 MFLOPS (observed)		16.7 Theoretical
	Model 540      12.64 MFLOPS (est)		20.0 Theoretical
	Model 550      17.26 MFLOPS (est)		27.3 Theoretical

The 320H is slower than the 530 even though they both have the same
clock speed (25 MHz) because the 530 has a wider bus (128-bit vs
64-bit), larger cache line size (128 bytes vs 64 bytes), and a larger
data cache (64kB vs 32kB).

Perhaps part of the reason that the IBM performance is so much less
than the memory-bandwidth-limited performance (labeled "Theoretical"
above) is that stores cannot overlap with reads or computations....
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

mccalpin@perelandra.cms.udel.edu (John D. McCalpin) (04/10/91)

>>>>> On 8 Apr 91 21:12:21 GMT, daryl@hpcupt3.cup.hp.com (Daryl
Odnert) said:

Daryl> John McCalpin writes:
> The HP has a 32-bit memory bus capable of 1 word/clock transfer rates,
> so the "Streaming MFLOPS" for these machines is:
>        HP/9000 Models:
>               720     = 200/24 =  8.3 MFLOPS
>               730     = 264/24 = 11.0 MFLOPS
>

Daryl> John's data on the Snakes machine is incorrect.  The HP9000
Daryl> models 720 and 730 have a 64-bit memory bus and are capable of
Daryl> 2 word/clock transfer rates.  Thus, the correct "Streaming
Daryl> MFLOPS" for these machines are:

Daryl>         HP/9000 Models:
Daryl>                720      = 400/24 = 16.7  MFLOPS
Daryl>                730      = 528/24 = 22.0  MFLOPS
                                 ^^^
I think we have a little misunderstanding here. Daryl is claiming the
peak transfer rate from cache to fpu, while I am concerned about the
sustained transfer rate from main memory through the cache to the fpu.

The document that I have in my hands right now (5091-0977E) is very
specific:  for the 50 MHz machines it claims 400 MB/sec to cache and
200 MB/sec to main memory for the duration of a 64-byte transfer.  For
the 66 MHz machines, the numbers are (presumably) 528 MB/s and 264
MB/s, respectively.

The *sustained* transfer rate will be less than the peak rate, but the
peak rate is close enough for my purposes right now.  I am interested
in seeing the speed for long DAXPY's, though....
--
John D. McCalpin			mccalpin@perelandra.cms.udel.edu
Assistant Professor			mccalpin@brahms.udel.edu
College of Marine Studies, U. Del.	J.MCCALPIN/OMNET

mccalpin@brahms.udel.edu (John D McCalpin) (04/16/91)

In article <MCCALPIN.91Apr8194820@perelandra.cms.udel.edu> I wrote:

>The memory bandwidth of the machine limits the long-vector DAXPY
>performance to under 22 MFLOPS.  How well does the Fortran-produced
>code perform for long vectors?  (My estimate would be in the range of 
>18-20 MFLOPS).

I realized that this number is way too high.  Looking at reasonable
values for the cache miss latency leads me to estimate the long
DAXPY performance of the 66 MHz machines as about 9 MFLOPS, or 
under 1/2 of the performance based on the "peak" memory bandwidth.


>For comparison, the IBM RS/6000 machines run uncached DAXPY at:

I have revised these numbers taking into account the type of memory
access pattern and cache miss pattern exhibited by long DAXPY's.  These
new results are shown below:

Model     Measured   Estimated    Theory    Measured/
           MFLOPS     MFLOPS      MFLOPS     Theory
------------------------------------------------------------------
320          6.25       ----        6.67      93.7%
320H         ----       7.81        8.33       ----
530         10.53       ----       11.11      94.8%
540          ----      12.64       13.33       ----
550          ----      17.26       18.22       ----
------------------------------------------------------------------
HP 720        ???       ----        6.67
HP 750        ???       ----        8.80
------------------------------------------------------------------

The "sustained" MFLOPS of the machine is now modelled by:

                                    N words
 MFLOPS = (peak MB/sec) * ----------------------------- / (12 bytes/FLOP)
                          (N cycles + 8 cycles latency)


where N=8 for the 320, 320H, 520, and N=16 for the 530, 540, 550.

Note that this means that the first three machines can only sustain
1/2 of the advertised "peak" memory bandwidth, while the last three
can sustain 2/3 of the "peak" memory bandwidth.


For the HP machines, I can only guess, but "reasonable" guesses give
an estimate like:

                                    N words
 MFLOPS = (peak MB/sec) * ------------------------------ / (12 bytes/FLOP)
                          (N cycles + 12 cycles latency)

I have HP glossies that tell me that the cache line size (N) on the 720
is 8 words.  I estimate a larger latency because the HP does not have
the "critical-word-first" cache refill hardware that IBM does.  So if the
memory latency is the same for the first word, then the HP will wait
(on the average) another 4 cycles before it gets the one it was waiting 
for.
I have no info on the cache line size for the 750.

Anybody from HP want to correct my errors on the latencies and provide
some numbers for long DAXPY operations to see how well the compiler 
manages?