Donald.Lindsay@K.GP.CS.CMU.EDU (02/12/88)
When building a parallel machine, a designer chooses the balance between
computational resources and memory bandwidth. For example, both Intel and
Thinking Machines recently announced new hypercubes, which had about the
same memory bandwidth as the previous models, but with vector arithmetic
units spread through the cube.

In general, today's hypercubes are bandwidth-heavy compared to conventional
machines. A 256-node Butterfly has an 8K-bit-wide path to memory. (Yes, I
know it's not quite a cube.) A 1024-node NCUBE has a 16K-bit-wide path to
memory. A 64K-processor Connection Machine has a 64K-bit-wide path. This is
somewhat more than any Cray - regardless of where in the Cray you choose to
measure.

I recently heard a talk by Gil Weigand of Sandia National Labs. He claims
considerable success in getting near-linear scaleup on his NCUBE/10. In
particular, he mentioned a Laplacian solver which was deliberately
memory-intensive: it used 128 times the memory ( 2MB --> 256MB ) in return
for 300 times less computation. He claimed his time-to-result was
dramatically better than on the Sandia Cray, even though the Cray is
superior in MFLOPS.

This raises several interesting questions.
- Could this algorithm work on the Cray, or is the massive memory bandwidth
  the whole secret?
- Is a 64-processor Cray-4 going to compare more favorably with the
  (bigger) cubes it will compete with?
- Can we find other problems that fall to such attacks?

I'd call this good news.

Don				lindsay@k.gp.cs.cmu.edu
CMU Computer Science
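The aggregate widths above are just node count times each node's memory
path. A minimal sketch of that arithmetic (the per-node widths are my own
inference, chosen to match the totals quoted in the post, not vendor
figures):

```python
# Aggregate path width = number of nodes * bits per node's memory path.
# Per-node widths below are assumptions that reproduce the post's totals.
machines = {
    "Butterfly":          (256,   32),  # 256 nodes, assumed 32-bit paths
    "NCUBE/10":           (1024,  16),  # 1024 nodes, assumed 16-bit paths
    "Connection Machine": (65536,  1),  # 64K one-bit processors
}

for name, (nodes, bits) in machines.items():
    print(f"{name}: {nodes * bits} bits wide")
```

This reproduces the 8K-, 16K-, and 64K-bit figures, but note it measures
width only; without a clock rate it says nothing about bits per second.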
rmw6x@hudson.acc.virginia.edu (Robert M. Wise) (02/15/88)
In article <964@hubcap.UUCP> Donald.Lindsay@K.GP.CS.CMU.EDU writes:
>When building a parallel machine, a designer chooses the balance between
>computational resources and memory bandwidth. For example, both Intel and
>Thinking Machines recently announced new hypercubes, which had about the
>same memory bandwidth as the previous models, but with vector arithmetic
>units spread through the cube.

Could you define what you mean by memory bandwidth? Do you mean the width
of the data bus, or do you mean the overall throughput (for lack of a
better term)? Is the memory bandwidth on a hypercube found by multiplying
the bandwidth of one node by the number of nodes in use?

>In general, today's hypercubes are bandwidth-heavy compared to conventional
>machines. A 256-node Butterfly has an 8K-bit-wide path to memory. (Yes, I
>know it's not quite a cube.) A 1024-node NCUBE has a 16K-bit-wide path to
>memory. A 64K-processor Connection Machine has a 64K-bit-wide path. This is
>somewhat more than any Cray - regardless of where in the Cray you choose to
>measure.

By the way, do you know anyone that HAS a 1024-node NCUBE? I think the
largest one in use (at this time, and last I heard, etc. etc., standard
disclaimer...) is a 512-node version at NCUBE - kind of their in-house
development machine.

>I recently heard a talk by Gil Weigand of Sandia National Labs. He claims
>considerable success in getting near-linear scaleup on his NCUBE/10. In
>particular, he mentioned a Laplacian solver which was deliberately memory
>intensive. It used 128 times the memory ( 2MB --> 256MB ) in return for 300
>times less computation. He claimed his time-to-result was dramatically
>better than on the Sandia Cray, even though the Cray is the superior in
>MFLOPS.

How many nodes is his NCUBE/10, and how much memory per node?

>This raises several interesting questions.
>- Could this algorithm work on the Cray, or is the massive memory bandwidth
>  the whole secret ?
>- Is a 64-processor Cray-4 going to compare more favorably with the (bigger)
>  cubes it will compete with ?
>- Can we find other problems that fall to such attacks ?
>
>I'd call this good news.

With architectures like the hypercube, you can often get speedup by using
more memory. Consider the matrix multiplication problem (AB = C). If every
node computes a subset of the final matrix elements by means of the
standard inner product, then to compute an N x N portion of the final
matrix it must look at (at minimum) N rows and N columns. If the problem is
divided among four nodes, each node would have to look at half of the
elements of A and half of the elements of B. If memory were very
restricted, the problem could be solved by "pipelining" the matrix
rows/columns (or even single elements) to each node in a ring or mesh
topology. However, if every node has all the elements it needs to start
with, then no communication between nodes is ever done; all that need be
done is to put the results somewhere.

I suspect that there are a lot of algorithms which benefit from this
approach, although not as much as the matrix multiplication kind of thing.
Any thoughts on this? Might make an interesting paper. Hmmmmm. Never mind,
I didn't say that...

-Bob Wise
bitnet:   rmw6x@virginia
internet: rmw6x@hudson.acc.virginia.edu
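Bob's counting argument can be made concrete. A minimal sketch (the
partitioning and function name are mine, assuming C is split into square
blocks and each node holds whole rows of A and whole columns of B):

```python
# For an n x n product C = A*B split among p nodes, a node computing an
# (n/sqrt(p)) x (n/sqrt(p)) block of C needs n/sqrt(p) full rows of A and
# n/sqrt(p) full columns of B, i.e. a 1/sqrt(p) fraction of each input.
import math

def per_node_elements(n, p):
    """Elements of A and B one node must hold to compute its block of C
    with no inter-node communication (replicated-data scheme)."""
    side = n // math.isqrt(p)        # rows of A (= cols of B) per node
    return side * n + n * side       # its rows of A plus its cols of B

n, p = 1024, 4
print(per_node_elements(n, p))       # half of A plus half of B per node
```

For p = 4 that is indeed half of A plus half of B on each node, matching
the argument above; total memory across all p nodes grows as sqrt(p) times
the input size, which is the memory-for-communication trade.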
lisper@YALE.ARPA (Bjorn Lisper) (02/17/88)
[ An editorial reminder: you cannot include more quoted text than you add -
the mailer checks that, not me. If it looks simple, I'll occasionally fix
it for you, but I don't if it takes editorializing. Please help out.
-- Steve ]

In article <975@hubcap.UUCP> rmw6x@hudson.acc.virginia.edu (Robert M. Wise)
writes:
>In article <964@hubcap.UUCP> Donald.Lindsay@K.GP.CS.CMU.EDU writes:
>> ....
>>I recently heard a talk by Gil Weigand of Sandia National Labs. He claims
>>considerable success in getting near-linear scaleup on his NCUBE/10.
>....
>I suspect that there are a lot of algorithms which benefit from this
>approach, although not as much as the matrix multiplication kind of thing.
>Any thoughts on this? Might make an interesting paper.
>Hmmmmm. Never mind, I didn't say that...

Don't forget that you not only must put the results somewhere; the elements
of A must also be sent to the proper processors before the computation
starts. On the other hand, this distribution of data in "chunks" is what
gives you the speedup. This is especially true on an architecture where
there is a large startup cost for every message sent. Your scheme uses the
extra memory to make "chunk communication" possible, whereas a pipelined
scheme must send intermediate results every now and then.

But it doesn't save on the number of operations. The Laplacian solver
mentioned above is said to require 300 times less COMPUTATION in exchange
for the increased memory requirements. If that statement is to be taken
literally, then some mechanism other than communications savings must be
responsible for the speedup.

Bjorn Lisper
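Bjorn's point about startup cost can be put in the usual linear
message-cost model. A sketch with purely illustrative numbers (alpha, beta,
and the message counts are my assumptions, not measurements of any
machine):

```python
# Linear cost model: sending m words costs alpha + beta*m, where alpha is
# the per-message startup time and beta the per-word transfer time.
# "Chunk" communication moves the same total volume in far fewer messages,
# so it pays alpha far fewer times - but the operation count is unchanged.

alpha, beta = 500.0, 1.0          # illustrative: startup >> per-word cost
total_words = 100_000

def comm_time(messages, words):
    return messages * alpha + words * beta

pipelined = comm_time(messages=1000, words=total_words)  # many small sends
chunked   = comm_time(messages=10,   words=total_words)  # few large sends

print(pipelined, chunked)  # same volume, very different startup overhead
```

With these numbers the pipelined scheme spends most of its time in startup
overhead, which is the communications saving - and, as Bjorn notes, it says
nothing about a 300-fold drop in computation.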
spr@eli.cray.com (Steve Reinhardt) (02/18/88)
In article <964@hubcap.UUCP> Donald.Lindsay@K.GP.CS.CMU.EDU writes:
> In general, today's hypercubes are bandwidth-heavy compared to conventional
> machines. A 256-node Butterfly has an 8K-bit-wide path to memory. (Yes, I
> know it's not quite a cube.) A 1024-node NCUBE has a 16K-bit-wide path to
> memory. A 64K-processor Connection Machine has a 64K-bit-wide path. This is
> somewhat more than any Cray - regardless of where in the Cray you choose to
> measure.

Yes, but... You describe how wide the data paths are, but not what the
bandwidth is (bandwidth defined as bits/second transferred). A Cray Y-MP
has 8 CPUs x 4 ports apiece x 166.7M cycles per second x 64 bits/reference,
for a total memory bandwidth of ~340 Gbits/sec. What are the clock rates of
the NCUBE, etc. (to multiply by the datapath widths to get bandwidth)?

> I recently heard a talk by Gil Weigand of Sandia National Labs. He claims
> considerable success in getting near-linear scaleup on his NCUBE/10. In
> particular, he mentioned a Laplacian solver which was deliberately memory
> intensive. It used 128 times the memory ( 2MB --> 256MB ) in return for 300
> times less computation. He claimed his time-to-result was dramatically
> better than on the Sandia Cray, even though the Cray is the superior in
> MFLOPS.
> This raises several interesting questions.
> - Could this algorithm work on the Cray, or is the massive memory bandwidth
>   the whole secret ?

The 256MB won't fit in main memory on a Cray-1 or X-MP. Was this just an
in-core/out-of-core difference? (Granted, it's probably easier to put a
larger memory on a distributed-memory machine than a shared-memory
machine.)

> I'd call this good news.

I agree. We still don't know enough about what kinds of machines are best
for what kinds of problems.

Steve Reinhardt			{ihnp4,sun!tundra}!cray!spr
Cray Research, Inc.		spr@hall.cray.com@uc.msc.umn.edu
Mendota Heights, MN		Standard disclaimer.
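Steve's Y-MP figure is just the product of ports, word width, and clock. A
sketch of that arithmetic, with the NCUBE side filled in under an assumed
10 MHz node clock and 16-bit paths (assumptions on my part, not NCUBE
specs):

```python
# Aggregate bandwidth (bits/s) = concurrent ports * path width * clock.

def bandwidth_gbits(paths, width_bits, clock_hz):
    return paths * width_bits * clock_hz / 1e9

# Cray Y-MP: 8 CPUs x 4 ports apiece, 64-bit words, 166.7 MHz clock.
ymp = bandwidth_gbits(8 * 4, 64, 166.7e6)

# NCUBE/10: 1024 nodes; 16-bit path and 10 MHz clock are assumed here.
ncube = bandwidth_gbits(1024, 16, 10e6)

print(f"Y-MP  ~{ymp:.0f} Gbit/s")    # ~341
print(f"NCUBE ~{ncube:.0f} Gbit/s")  # ~164
```

Even before allowing for memory needing more than one clock per cycle, the
raw peak numbers land within a small factor of each other, which is why the
clock rates matter so much more than the path widths alone.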
bradley@M.CS.UIUC.EDU (02/22/88)
/* Written 9:10 pm Feb 15, 1988 by bradley@uiucdcsm.cs.uiuc.edu in
   uiucdcsm:comp.hypercube */
According to a September, 1987 CalTech technical bulletin, CalTech also has
a 512-node Ncube, with plans to expand it to 1024 nodes.
/* End of text from uiucdcsm:comp.hypercube */

I misread the bulletin. CalTech does NOT currently intend to expand their
Ncube to 1024 nodes. Sorry.

David Bradley
Donald.Lindsay@K.GP.CS.CMU.EDU (02/22/88)
I originally said:
>Of course, a Cray isn't simple. I don't have data on the Y-MP, so let us
>assume 8 CPUs, each quad-ported to (some) memory, with 64-bit data paths.
>That gives a 2K-bit-wide data path, versus the 16K on the NCUBE. However, we
>are measuring the Cray, not at the RAM chips, but at a pipelined interface.
>Assuming for simplicity that this gives the Cray an 8:1 speedup, then the
>Y-MP and the NCUBE have similar aggregate potential memory bandwidths.

Steve Reinhardt reports the clock rate of the Y-MP interface as 166.7 MHz.
The NCUBE uses a 10 MHz clock, so the ratio is 17:1, not 8:1. Worse, the
NCUBE can't cycle memory in one clock, and is therefore slowed by another
factor of 2 or 3. The Y-MP is therefore ahead by perhaps a factor of six.
The MFLOPS ratio would typically be higher, as is the price ratio. As
previously pointed out, all of these figures carry serious caveats.

IBM's proposed TF-1 will have

    32K CPUs * 50 MHz * ( 32 + 64 ) bit-wide path  =  ~157,000 Gb/s

i.e. about 462 times the Y-MP's ~340 Gb/s. I firmly believe that IBM can
achieve this (or some significant fraction of it). I am, however, a bit
dubious about the ease with which they expect to get the processors
synchronized (!) and communicating. Also, the reliability of 3.5 MW of CMOS
( and 2048 disks ) is certainly quite the topic.
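The ratios in this post come from a few multiplications, which can be
checked directly (the clock rates and widths are the ones stated in the
thread; nothing new is assumed beyond them):

```python
# Clock ratio: Y-MP memory interface vs. NCUBE node clock.
ymp_clock, ncube_clock = 166.7e6, 10e6
print(round(ymp_clock / ncube_clock))    # 17, hence "17:1, not 8:1"

# With the NCUBE needing 2-3 clocks per memory cycle, the Y-MP's effective
# edge is roughly 17/3 to 17/2, i.e. about a factor of six.

# IBM TF-1 proposal: 32K CPUs, 50 MHz, (32 + 64)-bit-wide paths.
tf1_gbits = 32 * 1024 * 50e6 * (32 + 64) / 1e9
ymp_gbits = 8 * 4 * 166.7e6 * 64 / 1e9
print(round(tf1_gbits))                  # ~157,000 Gb/s
print(round(tf1_gbits / ymp_gbits))      # roughly 460x the Y-MP
```

Note the product comes out nearer 157,000 Gb/s than a round 160,000, which
is what makes the "462 times" figure consistent with the ~340 Gb/s Y-MP
number.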