reiter@harvard.UUCP (Ehud Reiter) (04/30/86)
In one of Jack Dongarra's articles on LINPACK performance (Computer Architecture News, vol 11, no 5, Dec 83), he says that a CRAY 1-M executes the benchmark faster than a CRAY 1-S because the 1-M has slower memory. I fail to see how it is even theoretically possible for slower memory to mean higher performance, and would appreciate someone who knows about Crays explaining this to me (Dongarra talks about a "missed chain-slot"). Thanks.

Ehud Reiter
harvard!reiter.UUCP
reiter@harvard.ARPA
eugene@ames.UUCP (Eugene Miya) (05/01/86)
> In one of Jack Dongarra's articles on LINPACK performance
> (Computer Architecture News, vol 11, no 5 (Dec 83)), he says that a
> CRAY 1-M executes the benchmark faster than a CRAY 1-S because the
> 1-M has slower memory. I fail to see how it
> is even theoretically possible for slower memory to mean higher performance,
> and would appreciate someone who knows about Crays explaining this to me
> (Dongarra talks about a "missed chain-slot").
> Ehud Reiter

George Spix from Cray Research should be able to answer this (gas@lanl), but I've not seen him lately, so I'll give it a shot.

First, a point of clarification. Technically, there are no Cray-1Ms anymore. We had one here, and it was redesignated a Cray X-MP/1. This is the machine that is moving to UC Berkeley next month.

Second, you should realize that X-MPs are cleaned-up Cray-1s (not 1Ss). They have a faster cycle time (9.5 ns versus 12.5 ns), they have vector chaining (I assume you know what chaining is; otherwise check an architecture book -- your message did not sound like a specific request for a description of chaining), and they have three paths between memory and CPU rather than one. The X-MP/1 has slower MOS memory rather than the bipolar memory that comes with 'top of the line' machines (read: the current fastest model Xs; the 2 is MOS, and it also has only one data path to a given quadrant of memory).

Lastly, machines like these are not like micros and minis in that you really tune them for the slowness of memory (any memory): delay loops are unacceptable. You count clock periods and add architectural features to compensate for them (i.e., chaining). It takes four clocks to get a word of memory into the CPU (assuming no bank contention). I have also been told by a Cray site engineer here that the newer MOS memories also have a slightly different internal organization. This is all why I pointed out (via mail) to the fellow at NC that the MIPS/MHz thing has a von Neumann bottleneck problem.
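The clock-counting argument above can be sketched with a toy model (my own sketch, not actual Cray timing; the 4-clock memory path is from the post, the 7-clock multiplier latency and 64-element vector are illustrative assumptions). It shows why hitting or missing a chain slot matters so much: chained, the functional units behave like one long pipeline; unchained, each unit must drain all its results before the next can start.

```python
def vector_op_clocks(n, unit_latencies, chained=True):
    """Toy clock count for n vector elements flowing through a series
    of pipelined functional units (e.g. a memory load, then a multiply).

    chained:   results stream directly from one unit into the next,
               so the units act as one long pipeline.
    unchained: each unit must deliver all n results to a register
               before the next unit can begin.
    """
    if chained:
        return sum(unit_latencies) + n
    return sum(lat + n for lat in unit_latencies)

# With the 4-clock memory path and an assumed 7-clock multiplier on a
# 64-element vector, chaining hides all but the startup latencies:
chained_clocks = vector_op_clocks(64, [4, 7], chained=True)     # 75
unchained_clocks = vector_op_clocks(64, [4, 7], chained=False)  # 139
```

The gap between the two numbers is the vector length paid again for every unchained unit, which is why a one-clock slip past the chain slot can cost tens of clocks.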
Lastly, I have measured the effects behind why this happened, and I posted it to the net over a year ago, but in cleaning my old author_copy file I decided to remove it (I included a graph in that posting). As an aside, the X-MP also has a nice hardware box known as the Hardware Performance Monitor, which does instruction counts non-obtrusively (another reason why a VAX is a poor machine to do performance research on).

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  com'on do you trust Reply commands with all these different mailers?
  {hplabs,ihnp4,dual,hao,decwrl,tektronix,allegra}!ames!aurora!eugene
  eugene@ames-aurora.ARPA
brooks%lll-crg@lll-crg.UUCP (05/02/86)
How can the Cray 1-M with slower memory be faster than a Cray 1-S under a special set of circumstances? The 1-S has memory with a 4-clock cycle time, and there are 16 banks. For a stride-1 vector fetch, the cycle time of the memory is 4 times faster than it really needs to be. Why so fast? Suppose you want good scalar performance. Suppose you want good performance on a stride-4 vector fetch. Suppose you want good performance on a stride-2 vector fetch (in this case you only need 8 banks).

How, then, can a Cray 1-M be as fast? Suppose your application is a stride-1 vector fetch and 99.99% vectorized. Even if the MOS RAM is 4 times slower, taking a 16-clock latency, the CPU will still get full bandwidth. Now suppose the slightly slower memory causes a chain slot that would otherwise be missed to be hit, enabling parallel use of the adder and multiplier. The machine with MOS memory could then be faster. This is of course for very special circumstances; do not look for faster performance on average. High scalar speed is where it's at!

Eugene
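The bandwidth claim above can be checked with a small bank-interleaving simulation (my own sketch: the 16 banks and the 4- and 16-clock bank cycle times come from the post, while the model of one fetch issued per clock is an assumption). Element i lives in bank (i*stride) mod 16, and a bank cannot accept a new request until its cycle time has elapsed.

```python
def vector_fetch_clocks(n, stride, banks=16, bank_cycle=4):
    """Clocks to fetch n vector elements from interleaved memory.
    Element i lives in bank (i*stride) % banks; a bank stays busy for
    bank_cycle clocks after each access, and at most one fetch is
    issued per clock."""
    ready = [0] * banks            # clock at which each bank is free again
    t = 0
    for i in range(n):
        b = (i * stride) % banks
        t = max(t, ready[b])       # stall until the bank recovers
        ready[b] = t + bank_cycle
        t += 1
    return t

# Stride-1 fetch of 64 elements: full bandwidth even with memory four
# times slower (16-clock bank cycle), just as the post argues, because
# each bank is revisited only once every 16 clocks.
fast = vector_fetch_clocks(64, 1, bank_cycle=4)    # 64 clocks
slow = vector_fetch_clocks(64, 1, bank_cycle=16)   # 64 clocks

# A stride-16 fetch hits the same bank every time and pays the full
# bank cycle per element, so slower memory would hurt badly here.
worst = vector_fetch_clocks(64, 16, bank_cycle=4)
```

The model makes the trade-off concrete: stride-1 vectorized code cannot tell the two memories apart, while scalar and large-stride access patterns feel the full bank cycle time.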