[comp.hypercube] bandwidth balance

Donald.Lindsay@K.GP.CS.CMU.EDU (02/12/88)

When building a parallel machine, a designer chooses the balance between
computational resources and memory bandwidth. For example, both Intel and
Thinking Machines recently announced new hypercubes, which had about the
same memory bandwidth as the previous models, but with vector arithmetic
units spread through the cube.

In general, today's hypercubes are bandwidth-heavy compared to conventional
machines. A 256-node Butterfly has an 8K-bit-wide path to memory. (Yes, I
know it's not quite a cube.) A 1024-node NCUBE has a 16K-bit-wide path to
memory.  A 64K-processor Connection Machine has a 64K-bit-wide path. This is
somewhat more than any Cray - regardless of where in the Cray you choose to
measure.

I recently heard a talk by Gil Weigand of Sandia National Labs. He claims
considerable success in getting near-linear scaleup on his NCUBE/10. In
particular, he mentioned a Laplacian solver which was deliberately memory
intensive. It used 128 times the memory (2MB --> 256MB) in return for 300
times less computation. He claimed his time-to-result was dramatically
better than on the Sandia Cray, even though the Cray is the superior in
MFLOPS.
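
(He didn't give enough detail for me to reproduce the method, so the
fragment below is only a generic illustration of buying computation with
memory - tabulating an expensive coefficient once rather than recomputing
it at every grid point. The names and sizes are invented.)

    /* Generic memory-for-computation trade - NOT Weigand's algorithm.
       Spend a table's worth of memory once; save a transcendental call
       in every inner-loop iteration thereafter. */
    #include <math.h>

    #define NVALS 65536
    static double table[NVALS];            /* the memory we spend      */

    void build_table(void)
    {
        for (int i = 0; i < NVALS; i++)
            table[i] = exp(-(double)i / NVALS);  /* computed only once */
    }

    double coeff(int i)                    /* inner loop does one load */
    {                                      /* instead of one exp()     */
        return table[i];
    }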

This raises several interesting questions.
- Could this algorithm work on the Cray, or is the massive memory bandwidth
  the whole secret?
- Is a 64-processor Cray-4 going to compare more favorably with the (bigger)
  cubes it will compete with?
- Can we find other problems that fall to such attacks?

I'd call this good news.


	Don		lindsay@k.gp.cs.cmu.edu    CMU Computer Science

rmw6x@hudson.acc.virginia.edu (Robert M. Wise) (02/15/88)

In article <964@hubcap.UUCP> Donald.Lindsay@K.GP.CS.CMU.EDU writes:
>When building a parallel machine, a designer chooses the balance between
>computational resources and memory bandwidth. For example, both Intel and
>Thinking Machines recently announced new hypercubes, which had about the
>same memory bandwidth as the previous models, but with vector arithmetic
>units spread through the cube.
> 

Could you define what you mean by memory bandwidth?  Do you mean the
width of the data bus, or do you mean the overall throughput (for lack
of a better term)?  Is the memory bandwidth on a hypercube found by
multiplying the bandwidth of one node by the number of nodes in use?

>In general, today's hypercubes are bandwidth-heavy compared to conventional
>machines. A 256-node Butterfly has an 8K-bit-wide path to memory. (Yes, I
>know it's not quite a cube.) A 1024-node NCUBE has a 16K-bit-wide path to
>memory.  A 64K-processor Connection Machine has a 64K-bit-wide path. This is
>somewhat more than any Cray - regardless of where in the Cray you choose to
>measure.
>

By the way, do you know anyone that HAS a 1024-node Ncube?  I think
the largest one in use (at this time, and last I heard, etc. etc.,
standard disclaimer...) is a 512-node version at Ncube, kind of their
in-house development machine.

>I recently heard a talk by Gil Weigand of Sandia National Labs. He claims
>considerable success in getting near-linear scaleup on his NCUBE/10. In
>particular, he mentioned a Laplacian solver which was deliberately memory
>intensive. It used 128 times the memory ( 2MB --> 256MB ) in return for 300
>times less computation. He claimed his time-to-result was dramatically
>better than on the Sandia Cray, even though the Cray is the superior in
>MFLOPS.
>
How many nodes does his Ncube-10 have, and how much memory per node?

>This raises several interesting questions.
>- Could this algorithm work on the Cray, or is the massive memory bandwidth
>  the whole secret ?
>- Is a 64-processor Cray-4 going to compare more favorably with the (bigger)
>  cubes it will compete with ?
>- Can we find other problems that fall to such attacks ?
>
>I'd call this good news.
>
>

With architectures like the hypercube, you can often get speedup
by using more memory.  Consider the matrix multiplication problem
(AB=C).  If every node computes a subset of the final matrix elements by
means of the standard inner product, then for it to compute a portion
of the final matrix that is N x N, it must look at (at minimum) N rows
and N columns.  If the problem is divided among four nodes, each node
would have to look at half of the elements of A and half of the
elements of B.  If memory were very restricted, then the problem could
be solved by "pipelining" the matrix rows/columns (or even individual
elements) to each node in a ring or mesh topology.  However, if every
node has all the elements it needs to start with, then no communication
between nodes is ever needed; all that remains is to put the results
somewhere.
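
A minimal sketch of that scheme (the block size, names, and layout are
mine, purely for illustration):

    /* Each node owns BN rows of A and BN columns of B, and computes its
       own BN x BN block of C = A*B by the standard inner product.  No
       communication with other nodes until the results are collected. */
    #define N  64                  /* order of the full matrices    */
    #define BN 16                  /* side of this node's block     */

    void multiply_block(const double rowsA[BN][N],    /* our rows of A */
                        const double colsB[N][BN],    /* our cols of B */
                        double       blockC[BN][BN])  /* our part of C */
    {
        for (int i = 0; i < BN; i++)
            for (int j = 0; j < BN; j++) {
                double sum = 0.0;
                for (int k = 0; k < N; k++)
                    sum += rowsA[i][k] * colsB[k][j];
                blockC[i][j] = sum;
            }
    }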

I suspect that there are a lot of algorithms which benefit from this
approach, although not as much as the matrix multiplication kind of
thing.  Any thoughts on this?  Might make an interesting paper.
Hmmmmm.  Never mind, I didn't say that...



-Bob Wise

bitnet: rmw6x@virginia
internet: rmw6x@hudson.acc.virginia.edu

lisper@YALE.ARPA (Bjorn Lisper) (02/17/88)

[ An editorial reminder:  You cannot include more text than
  you add - the mailer checks that, not me.  If it looks simple,
  I'll occasionally fix it for you, but I don't if it takes
  editorializing.  Please help out.

	Steve
]

In article <975@hubcap.UUCP> rmw6x@hudson.acc.virginia.edu (Robert M. Wise)
writes:
>In article <964@hubcap.UUCP> Donald.Lindsay@K.GP.CS.CMU.EDU writes:
>>  ....
>>I recently heard a talk by Gil Weigand of Sandia National Labs. He claims
>>considerable success in getting near-linear scaleup on his NCUBE/10.
>....
>I suspect that there are a lot of algorithms which benefit from this
>approach, although not as much as the matrix multiplication kind of
>thing.  Any thoughts on this?  Might make an interesting paper.
>Hmmmmm.  Never mind, I didn't say that...

Don't forget that not only must you put the results somewhere; the elements
of A must also be sent to the proper processors before the computation
starts. On the other hand, this distribution of data in "chunks" is what
gives you the speedup. This is especially true on an architecture where
there is a large startup cost for every message sent. Your scheme uses the
extra memory to make "chunk communication" possible, whereas a pipelined
scheme must send intermediate results every now and then. But it doesn't
save on the number of operations. The Laplacian solver mentioned above is
said to require 300 times less COMPUTATION in exchange for the increased
memory requirements. If that statement is to be taken literally, then some
mechanism other than communications savings must be responsible for the
speedup.
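
To make the startup-cost point concrete: under the usual assumption that
a message of b bytes costs (startup + b/bandwidth) seconds, moving the
same data in more, smaller messages only adds startup charges. The
constants below are placeholders, not measured NCUBE figures:

    /* Cost of moving 1 MB in k messages: k*startup + total/bandwidth.
       The bandwidth term is fixed; only the startup term grows with k. */
    #include <stdio.h>
    int main(void)
    {
        const double startup = 350e-6;  /* sec per message (a guess)  */
        const double bw      = 1.0e6;   /* bytes/sec per link (guess) */
        const double total   = 1.0e6;   /* bytes moved altogether     */
        for (int k = 1; k <= 10000; k *= 100)
            printf("%5d messages: %.4f sec\n",
                   k, k * startup + total / bw);
        return 0;
    }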

Bjorn Lisper

spr@eli.cray.com (Steve Reinhardt) (02/18/88)

In article <964@hubcap.UUCP> Donald.Lindsay@K.GP.CS.CMU.EDU writes:
> In general, today's hypercubes are bandwidth-heavy compared to conventional
> machines. A 256-node Butterfly has an 8K-bit-wide path to memory. (Yes, I
> know it's not quite a cube.) A 1024-node NCUBE has a 16K-bit-wide path to
> memory.  A 64K-processor Connection Machine has a 64K-bit-wide path. This is
> somewhat more than any Cray - regardless of where in the Cray you choose to
> measure.

Yes, but...  You describe how wide the data-paths are, but not what the
bandwidth is (bandwidth defined as bits/second transferred).  A Cray Y-MP
has 8 CPUs x 4 ports apiece x 166.7M cycles per second x 64 bits/reference
for a total memory bandwidth of ~340 Gbits/sec.  What are the clock rates
of the NCUBE, etc. (to multiply by datapath widths to get bandwidths)?
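
Spelled out, using my own numbers from above:

    /* Y-MP aggregate memory bandwidth:
       8 CPUs x 4 ports x 166.7 MHz x 64 bits per reference. */
    #include <stdio.h>
    int main(void)
    {
        double bw = 8.0 * 4.0 * 166.7e6 * 64.0;
        printf("Y-MP: %.0f Gbits/sec\n", bw / 1e9);   /* ~341 */
        return 0;
    }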

> I recently heard a talk by Gil Weigand of Sandia National Labs. He claims
> considerable success in getting near-linear scaleup on his NCUBE/10. In
> particular, he mentioned a Laplacian solver which was deliberately memory
> intensive. It used 128 times the memory ( 2MB --> 256MB ) in return for 300
> times less computation. He claimed his time-to-result was dramatically
> better than on the Sandia Cray, even though the Cray is the superior in
> MFLOPS.
> This raises several interesting questions.
> - Could this algorithm work on the Cray, or is the massive memory bandwidth
>   the whole secret ?

The 256MB won't fit in main memory on a Cray-1 or X-MP.  Was this just an
in-core/out-of-core difference?  (Granted, it's probably easier to put
a larger memory on a distributed memory machine than a shared memory
machine.)

> I'd call this good news.

I agree.  We still don't know enough about what kinds of machines
are best for what kinds of problems.


Steve Reinhardt		{ihnp4,sun!tundra}!cray!spr
Cray Research, Inc.  	spr@hall.cray.com@uc.msc.umn.edu
Mendota Heights, MN
		Standard disclaimer.	

bradley@M.CS.UIUC.EDU (02/22/88)

/* Written  9:10 pm  Feb 15, 1988 by bradley@uiucdcsm.cs.uiuc.edu in uiucdcsm:comp.hypercube */

   According to a September, 1987 CalTech
   technical bulletin, CalTech also has a 512 node Ncube, with plans to expand 
   it to 1024 nodes.
/* End of text from uiucdcsm:comp.hypercube */


I misread the bulletin.  CalTech does NOT currently intend to expand their
Ncube to 1024 nodes.  Sorry.

David Bradley

Donald.Lindsay@K.GP.CS.CMU.EDU (02/22/88)

I originally said:
>Of course, a Cray isn't simple. I don't have data on the Y-MP, so let us
>assume 8 CPUs, each quad-ported to (some) memory, with 64-bit data paths.
>That gives a 2K-bit-wide data path, versus the 16K on the NCUBE. However, we
>are measuring the Cray, not at the RAM chips, but at a pipelined interface.
>Assuming for simplicity that this gives the Cray an 8:1 speedup, then the
>Y-MP and the NCUBE have similar aggregate potential memory bandwidths.

Steve Reinhardt reports the clock rate of the Y-MP interface as 166.7 MHz.
The NCUBE uses a 10MHz clock, hence the ratio is 17:1, not 8:1. Worse,
the NCUBE can't cycle memory in one clock, and therefore is slowed by
another factor of 2 or 3.

The Y-MP is therefore ahead by perhaps a factor of six. The MFLOPS ratio
would typically be higher, as is the price ratio. As previously pointed
out, all of these have serious caveats.
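
Working that factor out explicitly (splitting the difference on the
memory-cycle derating of 2 or 3 above):

    /* Aggregate bandwidth = path width * effective clock rate. */
    #include <stdio.h>
    int main(void)
    {
        double ymp   = 2048.0  * 166.7e6;       /* ~341 Gbits/sec */
        double ncube = 16384.0 * 10.0e6 / 2.5;  /* ~ 66 Gbits/sec */
        printf("Y-MP : %4.0f Gb/s\n", ymp   / 1e9);
        printf("NCUBE: %4.0f Gb/s\n", ncube / 1e9);
        printf("ratio: %.1f\n", ymp / ncube);   /* ~5.2; derating  */
        return 0;                               /* by 3 gives ~6.3 */
    }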

IBM's proposed TF-1 will have

    32K CPUs * 50 MHz * ( 32 + 64 ) bit-wide-path = ~157,000 Gb/s

i.e. 462 times more than the Y-MP. I firmly believe that IBM can achieve
this (or some significant fraction of it). I am, however, a bit dubious
about the ease with which they expect to get them synchronized (!) and
communicating. Also, the reliability of 3.5 MW of CMOS (and 2048 disks) is
certainly quite the topic.