[comp.hypercube] Memory Bandwidth

Donald.Lindsay@K.GP.CS.CMU.EDU (02/17/88)
In article <975@hubcap.UUCP>, 
	rmw6x@hudson.acc.virginia.edu (Robert M. Wise) writes:
>Could you define what you mean by memory bandwidth?  Do you mean the
>width of the data bus, or do you mean the overall throughput (for lack
>of a better term)?  Is the memory bandwidth on a hypercube found by
>multiplying the bandwidth of one node by the number of nodes in use?
> ...
>By the way, do you know anyone that HAS a 1024 node Ncube?  
>How many nodes is his Ncube-10, and how much memory per node?

Sandia started with 256 nodes and upgraded to 1024 last fall. It has 0.5MB
per node, hence 0.5GB total. Ignoring ECC overhead, there are 4096 RAM
chips, and each is productively involved in every memory cycle. 

A simple traditional machine, with a 32-bit data path, would involve (say)
32 RAM chips in each memory cycle. The memory might consist of 4096 RAM
chips, but 4064 of them would sit idle during each cycle.

Of course, a Cray isn't simple. I don't have data on the Y-MP, so let us
assume 8 CPUs, each quad-ported to (some) memory, with 64-bit data paths.
That gives a 2K-bit-wide data path (8 x 4 x 64), versus the 16K on the
NCUBE. However, we are measuring the Cray, not at the RAM chips, but at a
pipelined interface. Assuming for simplicity that pipelining gives the Cray
an 8:1 speedup, the Y-MP and the NCUBE have similar aggregate potential
memory bandwidths.
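
The width arithmetic above can be checked with a few lines of Python. This
is only a back-of-envelope sketch: every number is one of the assumptions
stated in the text (8 CPUs, 4 ports, 64-bit paths, an 8:1 pipelining
factor), and the constant and function names are mine, chosen for
illustration. The "bits per chip" value is simply what the 16K and 4096
figures imply, not a datasheet number.

```python
# Back-of-envelope check of the data-path widths quoted in the post.
# Every figure is an assumption taken from the text, not a measurement.

NCUBE_NODES = 1024
NCUBE_RAM_CHIPS = 4096           # ignoring ECC chips
NCUBE_PATH_BITS = 16 * 1024      # the "16K" aggregate width quoted above

CRAY_CPUS = 8                    # assumed, since Y-MP data was not at hand
CRAY_PORTS_PER_CPU = 4           # "quad-ported"
CRAY_PORT_BITS = 64
CRAY_PIPELINE_SPEEDUP = 8        # the assumed 8:1 factor

TRADITIONAL_CHIPS = 4096
TRADITIONAL_ACTIVE_CHIPS = 32    # one chip per bit of a 32-bit path


def cray_path_bits():
    """Raw width of the assumed Cray memory interface, in bits."""
    return CRAY_CPUS * CRAY_PORTS_PER_CPU * CRAY_PORT_BITS


def cray_effective_bits():
    """Cray width scaled by the assumed pipelining speedup."""
    return cray_path_bits() * CRAY_PIPELINE_SPEEDUP


chips_per_node = NCUBE_RAM_CHIPS // NCUBE_NODES       # implied by the totals
bits_per_chip = NCUBE_PATH_BITS // NCUBE_RAM_CHIPS    # implied, not a spec
idle_chips = TRADITIONAL_CHIPS - TRADITIONAL_ACTIVE_CHIPS

print("NCUBE chips per node:      ", chips_per_node)
print("NCUBE bits per chip/cycle: ", bits_per_chip)
print("Traditional idle chips:    ", idle_chips)
print("Cray raw path bits:        ", cray_path_bits())
print("Cray effective path bits:  ", cray_effective_bits())
print("NCUBE aggregate path bits: ", NCUBE_PATH_BITS)
```

Under these assumptions the Cray's effective width and the NCUBE's
aggregate width come out equal, which is all "similar" is meant to claim.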

Unfortunately, that's only one of the meaningful measures. For example, the
Cray only fetches 8 instruction streams, not 1024. Plus, a cube loses
bandwidth to the communications channels. (An NCUBE/10 node is essentially
locked out of memory if 7 of its 22 DMA channels are running at once.)

If we look at registers, the NCUBE has 64KB of GPRs, and the Y-MP probably
has 32KB of vector registers (alone). Again, there's rough equivalence.
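
As a rough illustration, the register totals can be broken down per
processor. The totals (64KB and 32KB) and the processor counts are from the
text. The 32-bit and 64-bit register widths are taken from the data-path
figures elsewhere in this post, and the variable names are mine.

```python
# Per-processor breakdown of the register totals quoted in the post.
# Totals and processor counts come from the text; the register widths
# (32-bit NCUBE, 64-bit Cray) are taken from its data-path figures.

KB = 1024

NCUBE_TOTAL_BYTES = 64 * KB
NCUBE_NODES = 1024
NCUBE_REG_BYTES = 4              # 32-bit registers

CRAY_TOTAL_BYTES = 32 * KB       # vector registers only, "probably"
CRAY_CPUS = 8                    # assumed
CRAY_WORD_BYTES = 8              # 64-bit words

ncube_bytes_per_node = NCUBE_TOTAL_BYTES // NCUBE_NODES
ncube_regs_per_node = ncube_bytes_per_node // NCUBE_REG_BYTES

cray_bytes_per_cpu = CRAY_TOTAL_BYTES // CRAY_CPUS
cray_words_per_cpu = cray_bytes_per_cpu // CRAY_WORD_BYTES

print("NCUBE: %d bytes/node = %d registers of 32 bits"
      % (ncube_bytes_per_node, ncube_regs_per_node))
print("Cray:  %d bytes/cpu  = %d words of 64 bits"
      % (cray_bytes_per_cpu, cray_words_per_cpu))
print("Total ratio (NCUBE:Cray) = %.1f"
      % (NCUBE_TOTAL_BYTES / CRAY_TOTAL_BYTES))
```

The totals differ by a factor of two, which is within what I am calling
rough equivalence here.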

If we want the data bandwidth available to the functional units, then I
assume the NCUBE gets 1 register/clock/cpu = 32K bits/clock. The Y-MP should
get at least 4 registers/clock/cpu, for 2K bits/clock. If the clocks have
about a 16:1 ratio, then again, there's rough equivalence. The Cray may be
faster than that, but for both machines, there's the issue of percent
utilization. On the Cray, the issue comes from trying to schedule the
application code onto each CPU's parallel resources. On the NCUBE, the issue
comes from things like instruction decoding, and overlapping messages with
computation.
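
The functional-unit comparison is the same kind of arithmetic, sketched
below. The per-clock register rates and the 16:1 clock ratio are the
assumptions made above, not measured values, and the helper name
bits_per_clock is mine. Percent utilization is deliberately left out,
since that is the part neither machine makes easy to estimate.

```python
# Peak register bandwidth to the functional units, per the post's
# assumptions. Utilization is ignored: these are upper bounds only.

def bits_per_clock(cpus, regs_per_clock, reg_bits):
    """Aggregate bits delivered per clock across all processors."""
    return cpus * regs_per_clock * reg_bits


# NCUBE: 1024 cpus, 1 register/clock/cpu, 32-bit registers (assumed).
ncube_bits = bits_per_clock(1024, 1, 32)

# Y-MP: 8 cpus, at least 4 registers/clock/cpu, 64-bit words (assumed).
cray_bits = bits_per_clock(8, 4, 64)

CLOCK_RATIO = 16                 # assumed Cray:NCUBE clock-rate ratio

width_ratio = ncube_bits // cray_bits
# Cray bits delivered during one (slower) NCUBE clock period.
cray_bits_per_ncube_clock = cray_bits * CLOCK_RATIO

print("NCUBE bits/clock:           ", ncube_bits)
print("Cray bits/clock:            ", cray_bits)
print("Width ratio (NCUBE:Cray):   ", width_ratio)
print("Cray bits per NCUBE clock:  ", cray_bits_per_ncube_clock)
```

With a 16:1 width ratio and an assumed 16:1 clock ratio, the two peak
figures cancel out exactly; real utilization will move both numbers down.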

On the money side, at least, things are clear. NCUBE is getting
approximately the maximum possible bandwidth per dollar. (Air-cooled CMOS,
conventional PCBs, medium-speed DRAMs.) Cray prefers to pay a premium.

>Consider the matrix multiplication problem.

Odd that you should mention it. Sandia's Laplace solver (Dirichlet boundary
conditions, Green's function) first solved subproblems. Then, it combined
the intermediate results, using a matrix multiply to do a linear
superposition. The final results then went (in parallel) to a graphics
terminal. We can hope that they will publish details soon.

Don Lindsay			CMU Computer Science