[net.arch] CRAY Question

reiter@harvard.UUCP (Ehud Reiter) (04/30/86)

In one of Jack Dongarra's articles on LINPACK performance (Computer Architecture
News, vol 11, no 5 (Dec 83)), he says that a CRAY 1-M executes the benchmark
faster than a CRAY 1-S because the 1-M has slower memory.  I fail to see how it
is even theoretically possible for slower memory to mean higher performance,
and would appreciate someone who knows about CRAYs explaining this to me
(Dongarra talks about a "missed chain-slot").

Thanks.
						Ehud Reiter
						harvard!reiter.UUCP
						reiter@harvard.ARPA

eugene@ames.UUCP (Eugene Miya) (05/01/86)

> In one of Jack Dongarra's articles on LINPACK performance
> (Computer Architecture News, vol 11, no 5 (Dec 83)), he says that a
> CRAY 1-M executes the benchmark faster than a CRAY 1-S because the
> 1-M has slower memory.  I fail to see how it
> is even theoretically possible for slower memory to mean higher performance,
> and would appreciate someone who knows about CRAYs explaining this to me
> (Dongarra talks about a "missed chain-slot").
> 						Ehud Reiter

George Spix from Cray Research should be able to answer this (gas@lanl), but
I've not seen him lately, so I'll give it a shot.

First point of clarification.  Technically, there are no Cray-1Ms anymore.
We had one here, and it was redesignated a Cray X-MP/1.  This is a
machine which is moving to UC Berkeley next month.  Second, you should
realize that X-MPs represent cleaned-up Cray-1s (not 1Ss).  They have
a faster cycle time (9.5 ns versus 12.5 ns), they have vector chaining
(I assume you know what chaining is; otherwise check an architecture
book, since your message did not sound like a specific request for a
description of chaining), and they have three paths between memory and CPU
rather than one.
The X-MP/1 has slower MOS memory rather than the bipolar memory which comes
with the 'top of the line' models (read: the current fastest X-MPs; the
Cray-2's memory is MOS too, and it also has only one data path to a given
quadrant of memory).
Lastly, machines like these are not like micros and minis in that
you really tune them for the slowness of memory (any memory): delay
loops are unacceptable.  You count clock periods and make architectural
features to compensate for them (e.g. chaining).  It takes four clocks
to get a word of memory into a CPU (assuming no bank contention).
I have also been told by a Cray site engineer here that the newer
MOS memories also have a slightly different internal organization.

This is why I pointed out (via mail) to the fellow at NC that the MIPS/MHz
thing has a von Neumann bottleneck problem.  Lastly, I
have measured the effects of why this has happened, and I posted this
to the Net over a year ago, but in cleaning my old author_copy file,
I decided to remove it (I included a graph in that posting).
As an aside, the X-MP also has a nice hardware box known as the Hardware
Performance Monitor, which does instruction counts non-obtrusively
(another reason why a VAX is a poor machine to do performance
research on).

From the Rock of Ages Home for Retired Hackers:
--eugene miya
  NASA Ames Research Center
  com'on do you trust Reply commands with all these different mailers?
  {hplabs,ihnp4,dual,hao,decwrl,tektronix,allegra}!ames!aurora!eugene
  eugene@ames-aurora.ARPA

brooks%lll-crg@lll-crg.UUCP (05/02/86)

How can the Cray 1 M with slower memory be faster than a Cray 1 S for
a special set of circumstances?

The S has memory with a 4 clock cycle time.  There are 16 banks.  For
stride 1 vector fetch the cycle time of the memory is 4 times faster than
it really needs to be.  Suppose you want good scalar performance?  Suppose
you want good performance on stride 4 vector fetch?  Suppose you want
good performance on stride 2 vector fetch? (in this case you only need 8
banks)
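
The bank arithmetic above can be sketched with a toy model.  This is only
an illustration under stated assumptions, not a description of the real
hardware: it assumes one word is requested per clock, that a bank stays
busy for its full cycle time after each request, and that bank numbers are
simply the word address modulo the number of banks.  The function name
words_per_clock and its parameters are made up for this example.

```python
from math import gcd

def words_per_clock(stride, banks=16, bank_busy=4):
    """Steady-state words delivered per clock for a strided vector fetch.

    Toy interleaved-memory model (an assumption, not measured Cray data):
    a fetch with the given stride touches banks // gcd(stride, banks)
    distinct banks in round-robin order, and each bank is busy for
    bank_busy clocks after it accepts a request.
    """
    distinct = banks // gcd(stride, banks)  # banks actually touched
    # At one request per clock, a bank is revisited every `distinct`
    # clocks, so the fetch runs at full rate only if distinct >= bank_busy.
    return min(1.0, distinct / bank_busy)

if __name__ == "__main__":
    for s in (1, 2, 4, 8, 16):
        print("stride", s, "->", words_per_clock(s), "words/clock")
```

In this model stride 1 touches all 16 banks against a 4 clock bank cycle,
which is the factor of 4 in headroom mentioned above; strides 2 and 4 still
run at full rate, and only strides 8 and 16 fall below one word per clock.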

How can a Cray 1 M be as fast?  Suppose your application is stride 1 vector
fetch and 99.99% vectorized.  Even if the MOS RAM is 4 times slower, taking
a 16 clock latency, the CPU will get full bandwidth.  Now suppose the slightly
slower memory causes a chain slot that would otherwise be missed, one enabling
parallel use of the adder and multiplier, to be hit.  The machine with MOS
memory could then be faster.  This is of course for very special circumstances,
so do not look for faster performance on average.  High scalar speed is where
it's at!

							Eugene