[comp.parallel] scalability of n-cubes, meshes

wilson@carcoar.Stanford.EDU (Paul Wilson) (11/23/89)

My admittedly naive intuitions would say that only meshes are truly
scalable, since you have to pack things into real (<= 3D) space.
Hypercubes end up needing long wires to project a higher-dimensional
graph into 2- or 3-space.  As processor speeds increase (and the
speed of light presumably doesn't) these end up being slower
than other links and destroy the scalability of n-cubes.

It would *seem* to me that a 3D mesh is the only way to go
because that's the highest dimensionality you can embed into
a 3D reality.  You get constant time per hop, no problem.

Now could somebody summarize Dally's argument that 2D meshes
(and toruses) are better?  I can see why you can wrap a 2D mesh
around to make a torus one way, but can't you get much
better packing by using a 3D mesh with its higher connectivity?
It will have edges rather than wrapping around transparently,
but that's the only way to avoid long wires in the limit.

Any comments?  I'm also curious about the pros and cons of
CM-style combinations of mesh and n-cube routing systems.

thanks prematurely,

   Paul


Paul R. Wilson                         
Software Systems Laboratory               lab ph.: (312) 996-9216
U. of Illin. at C. EECS Dept. (M/C 154)   wilson@carcoar.stanford.edu
Box 4348   Chicago,IL 60680 

david@cs.washington.edu (David Callahan) (11/27/89)

In article <7178@hubcap.clemson.edu> wilson@carcoar.Stanford.EDU (Paul Wilson) writes:
>Now could somebody summarize Dally's argument that 2D meshes
>(and toruses) are better?  I can see why you can wrap a 2D mesh
>around to make a torus one way, but can't you get much
>better packing by using a 3D mesh with its higher connectivity?
>It will have edges rather than wrapping around transparently,
>but that's the only way to avoid long wires in the limit.

We are building a machine with a 3d mesh. The "folding" that
eliminates edges in a 2d torus can be applied in three dimensions so
that the resulting cube has no edges or corners.

The machine is a shared-memory machine and the network is packet
switched rather than using worm-hole routing since the messages are
short. Latency to a remote memory module is 2 cycles per network node
traversed, one cycle routing and one cycle "on the wire". 

For a 256 processor machine, we plan a 16x16x16 cube. This size
assures adequate flux across any cut of the machine. For such a
machine, the expected one-way latency to memory is 12 hops or 24
cycles. Simulations indicate that contention raises this average only
slightly.
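The 12-hop figure is easy to check. Assuming the 16x16x16 network has wraparound (torus) links in every dimension, as the folding remark above suggests, and uniformly random destinations, a short sketch reproduces it:

```python
# Checking the quoted figure: expected one-way distance in a 16x16x16
# network, assuming torus (wraparound) links in each dimension and
# uniformly random destinations (possibly the source itself).

def ring_avg_distance(k):
    """Average shortest distance to a uniformly random node (possibly
    itself) on a wraparound ring of k nodes."""
    return sum(min(d, k - d) for d in range(k)) / k

def torus_avg_hops(k, dims=3):
    # Per-dimension distances are independent, so the averages just add.
    return dims * ring_avg_distance(k)

hops = torus_avg_hops(16)
print(hops, 2 * hops)   # 12.0 hops, 24.0 cycles at 2 cycles/hop
```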

>Paul R. Wilson                         

-- 
David Callahan  (david@tera.com, david@june.cs.washington.edu,david@rice.edu)
Tera Computer Co. 	400 North 34th Street  		Seattle WA, 98103

byrd@portia.Stanford.EDU (Greg Byrd) (11/27/89)

In article <7178@hubcap.clemson.edu> wilson@carcoar.Stanford.EDU (Paul Wilson) writes:
{stuff deleted}
>
>It would *seem* to me that a 3D mesh is the only way to go
>because that's the highest dimensionality you can embed into
>a 3D reality.  You get constant time per hop, no problem.
>
>Now could somebody summarize Dally's argument that 2D meshes
>(and toruses) are better?


I assume you mean better than 3D?  That is not Dally's argument
at all.  The argument is that given constant bisection width,
lower dimensional k-ary n-cubes give better latency and throughput
than higher dimensional ones.  The main reason is that there are
fewer channels which contribute to the bisection, so you can make
them wider.

Latency is dependent on channel width, number of hops, and the size
of the message.  For lower dimensions, the number of hops goes up,
but so does the channel width.  So for a given number of nodes
(k^n), you trade off dimension (n) against radix (k) to get the 
best average latency.

Assuming that delay across a wire is constant (independent of the
length) then the optimal dimensionality for 256, 16K, and 1M nodes
is 2, 4, and 5, respectively.  If a logarithmic delay is assumed,
the optimal points are 2, 3, and 5.  If a linear delay is assumed,
then 2 is optimal for all three network sizes.
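The shape of this trade-off can be sketched numerically. The model below (average torus hops plus message length over channel width, with width proportional to radix under a fixed wire-bisection budget) is a simplification of Dally's analysis, and the 150-bit message length is an assumption for illustration; it reproduces the low-dimension optimum at 256 nodes:

```python
# Sketch of the radix (k) vs. dimension (n) trade-off for k-ary n-cubes
# under a fixed wire-bisection budget.  Wormhole latency is modeled as
# (average hops) + (message length)/(channel width), with channel width
# proportional to the radix: fewer, wider channels in low dimensions.
# The 150-bit message length is an illustrative assumption.

def latency(n, k, msg_bits=150):
    avg_hops = n * k / 4.0        # average distance in a k-ary n-torus
    width = k / 2.0               # channels widen as dimension drops
    return avg_hops + msg_bits / width

def best_dimension(N, msg_bits=150):
    """Among integer factorizations N = k**n (k >= 2), return the
    dimension n that minimizes the modeled latency."""
    best = None
    for n in range(1, N.bit_length()):
        k = round(N ** (1.0 / n))
        if k >= 2 and k ** n == N:
            cand = (latency(n, k, msg_bits), n)
            best = cand if best is None or cand < best else best
    return best[1]

print(best_dimension(256))   # low dimension (2) wins at this size
```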

With regard to long wires for tori, one can "fold" the torus in
such a way that all wires are twice as long as in the mesh, but without
any wires that must traverse the entire edge.

Note also that meshes have half the bisection width of tori, so
they can have twice the channel width, again at the expense of a
larger average number of hops.  (All of the latency numbers, by the
way, assume every message travels the "average" distance, and there
is no contention in the network.)

For more details, see Dally's thesis (CalTech, 1986 -- also published
by Kluwer) or see:

  William J. Dally, "Wire-Efficient VLSI Multiprocessor Communication
  Networks," in _Advanced_Research_in_VLSI_, MIT Press, 1987,
  pp. 391-415.

-----------------------------------------------------------------------------
Greg Byrd                 byrd@sumex-aim.stanford.edu
Knowledge Systems Lab     701 Welch Rd., Bldg. C, Palo Alto, CA  94304
Stanford University       (415) 725-3849

grunwald@ncar.UCAR.EDU (Dirk Grunwald) (11/28/89)

>>> On 22 Nov 89 21:31:51 GMT, wilson@carcoar.Stanford.EDU (Paul Wilson) said:
Paul> It would *seem* to me that a 3D mesh is the only way to go
Paul> because that's the highest dimensionality you can embed into
Paul> a 3D reality.  You get constant time per hop, no problem.

Paul> Now could somebody summarize Dally's argument that 2D meshes
Paul> (and toruses) are better?  I can see why you can wrap a 2D mesh
Paul> around to make a torus one way, but can't you get much
Paul> better packing by using a 3D mesh with its higher connectivity?
Paul> It will have edges rather than wrapping around transparently,
Paul> but that's the only way to avoid long wires in the limit.

Paul> Any comments?  I'm also curious about the pros and cons of
Paul> CM-style combinations of mesh and n-cube routing systems.
----

You can embed a torus in a plane if you allow wires to overlap.  E.g.,
package the torus:

	    +-----------------------------+
	    |		      		  |
	    A <-> B <-> C <-> D <-> E <-> F
	 
as

	+-------+-------+
	|       |       |
	A---B   F   C   E---D
	    |       |       |
	    +-------+-------+

You decrease the maximum wire length at the expense of increased
average wire length. Similar constructions are possible for 2- and 3-d
tori.
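A quick sketch of the wire lengths under the two layouts (node labels as in the diagram above; only the slot positions matter):

```python
# Wire lengths for a 6-node ring (A..F) in two board layouts: plain
# ring order, and the interleaved ("folded") order A B F C E D shown
# above.  Folding caps the longest wire at 2 slot widths.

def wire_lengths(order, k):
    """Lengths of the k ring edges (i, i+1 mod k) when node i sits at
    the slot given by its position in `order`."""
    pos = {node: slot for slot, node in enumerate(order)}
    return [abs(pos[i] - pos[(i + 1) % k]) for i in range(k)]

k = 6
ring_order = list(range(k))       # A B C D E F: wrap wire spans 5 slots
folded = [0, 1, 5, 2, 4, 3]       # A B F C E D: nothing longer than 2

print(max(wire_lengths(ring_order, k)))   # 5
print(max(wire_lengths(folded, k)))       # 2
```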

The problem is that you can't extend this scheme as easily; adding
more processors becomes a pain. There are other possibilities because
the world is not planar. You could also package the above as:

	A B
       +++++
	D C

where +++++ is the PC board and you use bond-through connections.
Or where +++++ is a cabinet and you use wire connectors.
Last I heard, Dally is now advocating a 3-D mesh.

Of course, multi-layer PC boards make hypercubes appear better.
Consider something like the Ncube system that wires hypercubes
together using PC boards.

I think that if you use free lasers or fiber optics, you'd probably go
for hypercubes again, or a hypercube of shared memory machines.

chi@unc.cs.unc.edu (Vernon Chi) (11/29/89)

It's not clear why you couldn't have transparent wrap-around on 3-D
meshes analogous to torus-connected 2-D meshes.  In principle, the
one-dimensional wrap scheme which avoids long wires for any array
size, e.g.,

                     _______   _______   __
                    /       \ /       \ /  \
                   X    X    X    X    X    X
                    \__/ \_______/ \_______/


where "X" represents a node, should generalize to 2- and 3-D meshes.

Following your discussion of physically embedding in 3 dimensions, however,
it's worth noting that any realizable technology for packaging depends
on a layered hierarchy of successively larger functional entities, e.g.,
IC carriers, PC boards, card cages, racks, multiple rack equipment bays,
etc.  In the harsh reality of actually constructing a system, it's hard to
see how to avoid the inherently longer wires required by successively
higher packaging levels regardless of the interconnection topology.

This suggests that data locality may play as prominent a role in
scalability as interconnect topology, which might make 2- and 3-D
meshes look even better in the limit as compared to higher-dimensional
topologies.

paulv@piring.cwi.nl (Paul Vitanyi) (11/29/89)

In article <7178@hubcap.clemson.edu> wilson@carcoar.Stanford.EDU (Paul Wilson) writes:
>My admittedly naive intuitions would say that only meshes are truly
>scalable, since you have to pack things into real (<= 3D) space.
>Hypercubes end up needing long wires to project a higher-dimensional
>graph into 2- or 3-space.  As processor speeds increase (and the
>speed of light presumably doesn't) these end up being slower
>than other links and destroy the scalability of n-cubes.
>
>It would *seem* to me that a 3D mesh is the only way to go
>because that's the highest dimensionality you can embed into
>a 3D reality.  You get constant time per hop, no problem.
<<<stuff deleted>>>
>Any comments?  I'm also curious about the pros and cons of
>CM-style combinations of mesh and n-cube routing systems.

This type of question is addressed in my paper:
P. Vitanyi, "Locality, communication, and interconnect length in
multicomputers," SIAM J. on Computing, 17 (1988), pp. 659-672.

E.g., it shows that while networks of small diameter,
like trees, necessarily have *some* long wires,
symmetric networks with small diameter, like n-cubes and the CCC,
necessarily have long wires *on average*.

Donald.Lindsay@MATHOM.GANDALF.CS.CMU.EDU (11/30/89)

One recent poster suggested that hypercubes and meshes had the same
routing latency per hop. I don't agree: wider is faster.  The reason
is that the routing hardware at each node cannot make a decision
until it has seen the entirety of the packet header.  On an NCUBE-2,
the header is 32 bits, sent serially: the latency is 32+ bit times
(44, to be precise). A byte-wide path, such as BBN uses, could make
the same decision in 4+ clocks. (And both could be faster if the
header's likeliest command mode was decodable from, say, the first
half of the header.) 
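The arithmetic here is just a ceiling division; a sketch (the NCUBE-2's quoted 44 bit times include overhead beyond the bare 32 modeled below):

```python
# Routing-decision latency as a function of header size and channel
# width: the router can't decide until the whole header has arrived,
# so the cost is a ceiling division of header bits by channel width.

def header_cycles(header_bits, channel_width):
    """Cycles until the last header bit is in (ceiling division)."""
    return -(-header_bits // channel_width)

print(header_cycles(32, 1))   # bit-serial link: 32 cycles
print(header_cycles(32, 8))   # byte-wide link:   4 cycles
```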

This particular latency is often dominated by the message time.  But,
it's the kind of point that designers have to be sensitive to.


Don		D.C.Lindsay 	Carnegie Mellon Computer Science

gene@cs.bu.edu (Gene Itkis) (12/01/89)

In article <7178@hubcap.clemson.edu> wilson@carcoar.Stanford.EDU (Paul Wilson) writes:
>My admittedly naive intuitions would say that only meshes are truly
>scalable, since you have to pack things into real (<= 3D) space.
>...
>It would *seem* to me that a 3D mesh is the only way to go
>because that's the highest dimensionality you can embed into
>a 3D reality.  You get constant time per hop, no problem.

One problem with a 3D mesh is that as the processors work, heat is
generated(*). Obviously this heat has to be dissipated somehow. The heat
created is proportional to the volume of the mesh, while dissipation can
happen only in proportion to the surface area. So, if you care about
scalability, 3D meshes are no good - they'll overheat and cook themselves.
Thus the best you can hope for in our physics is a 2D mesh.

>...
>Hypercubes end up needing long wires to project a higher-dimensional
>graph into 2- or 3-space.  As processor speeds increase (and the
>speed of light presumably doesn't) these end up being slower
>than other links and destroy the scalability of n-cubes.

Somewhat relevant issues are considered in a FOCS '89 paper, "Power of Fast
VLSI Models is Insensitive to Wires' Thinness". As one consequence of
the result, if the communication time (of a length-x wire) is, say,
x log^(1.1) x, then designers can design their chips using "imaginary" wires
of width 0, and the chips so designed can be simulated by a mesh of finite
processors on-line without increase in space or time.
--------------------
 (*) Some people doubt that energy dissipation is necessary in reversible
computations.  But non-dissipating gates have not been discovered yet. In
addition, some computations, e.g. error-correcting, must be irreversible, and
therefore will be sure to produce heat. So, the assumption that heat has to
be generated by computing elements seems to be safe, at least for a while.

wen-king@csvax.caltech.edu (Wen-King Su) (12/01/89)

>From Donald.Lindsay@MATHOM.GANDALF.CS.CMU.EDU
>One recent poster suggested that hypercubes and meshes had the same
>routing latency per hop. I don't agree: wider is faster.  The reason
>is that the routing hardware at each node cannot make a decision
>until it has seen the entirety of the packet header.  On an NCUBE-2,
>the header is 32 bits, sent serially: the latency is 32+ bit times
>(44, to be precise). A byte-wide path, such as BBN uses, could make
>the same decision in 4+ clocks. (And both could be faster if the
>header's likeliest command mode was decodable from, say, the first
>half of the header.) 

If it is done right, the first unit of the message to enter a router of
either kind can carry enough information for the router to decide
whether it should 1) strip off that unit, and/or 2) determine the
output channel of the message.  For an n-cube, the minimum size of the
unit is 1 bit.  For a grid, the size is log(sqrt(N)) bits.  You can make do
with 3 bits if you can afford 2 clocks instead of 1.  Our wormhole
routing chip has 8-bit channels, and it requires 1 clock for decoding
the header.  For more details, see John Ngai's PhD thesis (Caltech
CS-TR-89-09) and Charles Flaig's Master's thesis (Caltech CS 5241:TR:87).
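The 1-bit-per-hop case can be illustrated with a sketch of dimension-order ("e-cube") routing; this is an illustrative model, not the header format of any of the chips mentioned:

```python
# Dimension-order ("e-cube") routing on a 2**n-node hypercube: the
# header can be viewed as src XOR dst, and each router along the way
# examines only the one bit for its dimension, "using it up" as the
# corresponding link is traversed.

def ecube_route(src, dst, n):
    """Return the sequence of nodes visited after src, correcting one
    differing address bit (dimension) per hop."""
    node, hops = src, []
    for dim in range(n):
        if (node ^ dst) >> dim & 1:   # header bit for this dimension set?
            node ^= 1 << dim          # traverse the link in that dimension
            hops.append(node)
    return hops

print(ecube_route(0b000, 0b101, 3))   # [1, 5]: one hop per differing bit
```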

/*------------------------------------------------------------------------*\
| Wen-King Su  wen-king@vlsi.caltech.edu  Caltech Corp of Cosmic Engineers |
\*------------------------------------------------------------------------*/

lkaplan@BBN.COM (Larry Kaplan) (12/02/89)

In article <7252@hubcap.clemson.edu> Donald.Lindsay@MATHOM.GANDALF.CS.CMU.EDU writes:
>
>...On an NCUBE-2,
>the header is 32 bits, sent serially: the latency is 32+ bit times
>(44, to be precise). A byte-wide path, such as BBN uses, could make
>the same decision in 4+ clocks. (And both could be faster if the
>header's likeliest command mode was decodable from, say, the first
>half of the header.) 

Actually, the current generation BBN Butterfly Switch does better than this.
By "eating" the addressing bits used by a particular switch element, the
relevant addressing bits for the next switch element arrive there in the FIRST
byte.  So routing decisions are made after 1 clock.
_______________________________________________________________________________
				 ____ \ / ____
Laurence S. Kaplan		|    \ 0 /    |		BBN Advanced Computers
lkaplan@bbn.com			 \____|||____/		10 Fawcett St.
(617) 873-2431			  /__/ | \__\		Cambridge, MA  02238

kbreinho%ug@cs.utah.edu (Keith Breinholt) (12/06/89)

>}>One problem with 3D mesh is that as the processors work the heat is
>}>generated(*).
>}>Thus the best you can hope for in our physics is 2D mesh.
>}> lots edited out -- Moderator

> I did not intend to claim that 3D mesh cannot be simulated in 2D per se (as a
>matter of fact it can be simulated in 1D - Turing machine). But if you notice,
>the diameter (max distance between nodes) of, say, yours (or any other) 2D
>representation of 3D mesh is going to be asymptotically much larger than the
>actual 3D mesh diameter (for the same number of nodes). So, in fact your
>suggestion can be read as "pretend you have 3D, but implement it in 2D in
>reality", and this does not contradict anything I have said.

I think you misunderstood me, partly due to the wording of my reply.  What I was
trying to say is that a 3D mesh CAN be reproduced on a 2D plane,
not simulated or pretended in 2D.

A hexagon has 6 sides.  A cube has 6 faces.  If I can produce 6 sets of
connections to neighboring processors in a 2D plane, I have effectively
produced a 3D connection scheme in a 2D plane.  And as stated before, this
type of representation requires no more cooling than existing planar boards,
which does contradict your statement.

If 3D meshes "cook themselves" it will be due to a design flaw, not because physics
dictates it so.

Keith L. Breinholt             kbreinho@ug.utah.edu or usiklb@vm.usi.utah.edu
Utah Supercomputing Institute
(801) 581-4439

bjornl@tds.kth.se (Björn Lisper) (12/08/89)

In article <7274@hubcap.clemson.edu> gene@cs.bu.edu (Gene Itkis) writes:

%One problem with a 3D mesh is that as the processors work, heat is
%generated(*). Obviously this heat has to be dissipated somehow. The heat
%created is proportional to the volume of the mesh, while dissipation can
%happen only in proportion to the surface area. So, if you care about
%scalability, 3D meshes are no good - they'll overheat and cook themselves.
%Thus the best you can hope for in our physics is a 2D mesh.

You implicitly assume that the 3D mesh is a solid body. This is not
necessarily true. It could for instance consist of a stack of cards, with
air flowing between them, or have channels with coolant flowing through. In
this way, you get the same cooling area as for the 2D mesh.

Bjorn Lisper

gene@CS.BU.EDU (Gene Itkis) (12/08/89)

In article <7373@hubcap.clemson.edu> bjornl@tds.kth.se (Björn Lisper) writes:
>In article <7274@hubcap.clemson.edu> gene@cs.bu.edu (Gene Itkis) writes:

>... So, if you care about
>>scalability, the 3D meshes are no good - they'll overheat, cook themselves.
>>Thus the best you can hope for in our physics is 2D mesh.
>
>You implicitly assume that the 3D mesh is a solid body. This is not
>necessarily true. It could for instance consist of a stack of cards, with
>air flowing between them, or have channels with coolant flowing through. In
>this way, you get the same cooling area as for the 2D mesh.

Yes, provided that the distance between the "cards" is not constant, i.e., grows
with the number of automata.  This architecture would still remain effectively
2D. If you are to pack automata densely (leaving a constant volume around every
automaton) you eventually run into the overheating problem even if you run
coolant between automata. Think of water being "generated" instead of heat. Do
not forget, we want scalability.

kbreinho%ug@cs.utah.edu (Keith Breinholt) (12/09/89)

Before I get any more fan mail (read "flame mail"), let me post an apology to
Gene Itkis and all of the others who sought to enlighten me on 3d meshes.

In my posting I was assuming a cube-connected 3d mesh, hence the reference
to 6 sides, which can be represented in a 2d hexagonal mesh. However, as 
Gene pointed out, this is not a 3d mesh (for you purists out there) although
I have the same signal paths and possibly signal lengths.

And as more signal paths are added in a 3d mesh, it becomes impractical to
mimic them in 2d.

The fact remains that at some huge scale 3d meshes cannot be properly cooled.
(And I think we need to add the words, "without some special method of
cooling.")  I say this because even 2d or 1d circuits at some point of scale
will "cook" themselves.

Now, did I leave something out?

Keith L. Breinholt             kbreinho@ug.utah.edu or usiklb@vm.usi.utah.edu
Utah Supercomputing Institute
(801) 581-4439

landman@hanami.Eng.Sun.COM (Howard A. Landman x61391) (12/14/89)

>In article <7274@hubcap.clemson.edu> gene@cs.bu.edu (Gene Itkis) writes:
>%One problem with 3D mesh is that as the processors work the heat is
>%generated(*).....

In article <7373@hubcap.clemson.edu> bjornl@tds.kth.se (Björn Lisper) writes:
>You implicitly assume that the 3D mesh is a solid body. This is not
>necessarily true. It could for instance consist of a stack of cards, with
>air flowing between them, or have channels with coolant flowing through. In
>this way, you get the same cooling area as for the 2D mesh.

No, he doesn't assume that.  This is important.  At some point the heat must
exit the surface of the 3D cube.  Regardless of what method you use, it will
have some limit per area, and so the surface area limits the amount of heat
you can dissipate.  Even if the machine is spewing out white-hot steam at the
speed of light, this limit exists.

As noted by Gene, you either figure out how to do reversible computation that
doesn't dissipate a fixed amount of heat per operation, or you live with the
limit.

	Howard A. Landman
	landman%hanami@eng.sun.com

pase@orville.nas.nasa.gov (Douglas M. Pase) (12/14/89)

In <7341@hubcap.clemson.edu> kbreinho%ug@cs.utah.edu (Keith Breinholt) writes:
-[...] A hexagon has 6 sides.  A cube has 6 sides.  If I can produce 6 sets
-of connections to neighboring processors in a 2D plane I have effectively
-produced a 3D connection scheme in a 2D plane.

Hexagons and 3-d cubes are in fact planar figures, but 3-d meshes are not.
Yes you can lay out the processors in 2-d, but the mesh connections will
require more than one layer.  The physical distance between nearest neighbors
can also be quite large in a big 3-d mesh, though I gather not as large as
for the same sized n-cube.

Dr. Douglas M. Pase				Computer Sciences Corporation
95 Sierra Vista Ave				NAS MS 258-6
Mountain View, CA  94043			NASA Ames Research Center
(415) 940-1197					Moffett Field, CA  94035
pase@orville.nas.nasa.gov			(415) 694-6394

cleary@cpsc.ucalgary.ca (John Cleary) (12/18/89)

In article <7427@hubcap.clemson.edu>, landman@hanami.Eng.Sun.COM (Howard A. Landman x61391) writes:
> >In article <7274@hubcap.clemson.edu> gene@cs.bu.edu (Gene Itkis) writes:
> >%One problem with 3D mesh is that as the processors work the heat is
> >%generated(*).....
> 
It is worth noting, however, that the temperature of a 3D cube rises very slowly
with the number of nodes.  The heat lost by radiation (probably the only
feasible mechanism in the limit) is proportional to A*T^4, where A is the
surface area and T is the temperature.  If the radius of the 3D body is R, then
A scales as R^2, and the number of processors (N), and hence the total heat (per
unit time), scales as R^3.  The result is that T scales as N^(1/12), which is pretty slow.
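A numeric check of the scaling (power per node and packing density below are made-up constants; only the N^(1/12) behavior matters):

```python
# Balance generated heat (proportional to N, i.e. R**3) against
# black-body radiation from the surface (sigma * A * T**4, A ~ R**2).
# Power per node and node density are illustrative assumptions.
import math

SIGMA = 5.67e-8   # Stefan-Boltzmann constant, W / (m^2 K^4)

def surface_temp(n_nodes, watts_per_node=10.0, nodes_per_m3=1.0e4):
    """Surface temperature needed to radiate away the heat of n_nodes
    processors packed at fixed density into a sphere."""
    volume = n_nodes / nodes_per_m3
    radius = (3 * volume / (4 * math.pi)) ** (1 / 3)
    area = 4 * math.pi * radius ** 2
    power = n_nodes * watts_per_node          # total heat, grows as N
    return (power / (SIGMA * area)) ** 0.25   # from P = sigma * A * T^4

# 4096x the nodes (16x the radius) only doubles the temperature:
print(surface_temp(4096 * 1000) / surface_temp(1000))
```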

Of course if you could carry some of the heat away with neutrinos then 
you could build a big one before the temperature got too high :-)


	John G. Cleary
	Department of Computer Science
	University of Calgary
	2500 University Drive 
	N.W. Calgary
	Alberta T2N 1N4
	Canada

	cleary@cpsc.UCalgary.ca
	Phone: (403)282-5711 or (403)220-6087

wen-king@csvax.caltech.edu (Wen-King Su) (12/19/89)

>From: cleary@cpsc.ucalgary.ca (John Cleary)
>In article <7427@hubcap.clemson.edu>, landman@hanami.Eng.Sun.COM (Howard A. Landman x61391) writes:
>>>In article <7274@hubcap.clemson.edu> gene@cs.bu.edu (Gene Itkis) writes:
>>>%One problem with 3D mesh is that as the processors work the heat is
>>>%generated(*).....
>>
>It is worth noting, however, that the temperature of a 3D cube rises very slowly
>with the number of nodes.  The heat lost by radiation (probably the only
>feasible mechanism in the limit) is proportional to A*T^4, where A is the
>surface area and T is the temperature.  If the radius of the 3D body is R, then
>A scales as R^2, and the number of processors (N), and hence the total heat (per
>unit time), scales as R^3.  The result is that T scales as N^(1/12), which is pretty slow.

'T' is the temperature at the surface of the 3D cube.  In order to allow
heat to dissipate, there must be a thermal gradient from the core to the
surface.  In order to keep the core temperature at an acceptable level,
the surface temperature must drop (not rise, nor remain constant) as
R is increased.  At a certain R, the required surface temperature will
hit 0 kelvin, and we can go no further.
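This can be made quantitative with the textbook conduction result for a uniformly heated solid sphere, where the core-to-surface temperature drop is q*R^2/(6*kappa):

```python
# Core-to-surface temperature drop for a uniformly heated solid sphere:
# the textbook conduction result  dT = q * R**2 / (6 * kappa), with q
# the volumetric heat generation (W/m^3) and kappa the thermal
# conductivity (W/(m K)).  The constants are illustrative assumptions;
# the point is that the required drop grows as R**2.

def core_surface_drop(radius_m, q=1.0e4, kappa=10.0):
    return q * radius_m ** 2 / (6 * kappa)

# Doubling the radius quadruples the drop the package must sustain,
# so the surface must run ever colder to hold the core temperature fixed.
print(core_surface_drop(2.0) / core_surface_drop(1.0))   # 4.0
```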

/*------------------------------------------------------------------------*\
| Wen-King Su  wen-king@vlsi.caltech.edu  Caltech Corp of Cosmic Engineers |
\*------------------------------------------------------------------------*/

cline@cheetah.ece.clarkson.edu (Marshall Cline) (12/20/89)

In article <7480@hubcap.clemson.edu> cleary@cpsc.ucalgary.ca (John Cleary) writes:

>In article <7427@hubcap.clemson.edu>, landman@hanami.Eng.Sun.COM (Howard A. Landman x61391) writes:
>It is worth noting, however, that the temperature of a 3D cube rises very slowly
>with the number of nodes.  The heat lost by radiation (probably the only
>feasible mechanism in the limit) is proportional to A*T^4, where A is the
>surface area and T is the temperature.  If the radius of the 3D body is R, then
>A scales as R^2, and the number of processors (N), and hence the total heat (per
>unit time), scales as R^3.  The result is that T scales as N^(1/12), which is pretty slow.
...
>	John G. Cleary

In case anyone doubts John's analysis, a quick example from nature
should help.  Small animals are generally very active and have high
metabolisms (consider a bird's heart rate vs. a grizzly bear's).  Why?
Answer: small animals have a low body_volume/surface_area ratio (R^3 /
R^2 gets small for small R).  But if you drag a whale up on a beach,
it will literally *cook* from its own body heat, having more pounds of
heat-producing tissue per square inch of surface.

The problem of designing big computers that don't cook themselves
therefore isn't new -- God had the same problem when designing big
animals!  The solution thus appears to be that big computers will need
better cooling systems.

Marshall
--
___________________________________________________________________
Marshall Cline/ECE Dept/Clarkson Univ/Potsdam NY 13676/315-268-3868
Internet: cline@sun.soe.clarkson.edu -or- bh0w@clutx.clarkson.edu
BitNet: bh0w@clutx  Usenet: uunet!sun.soe.clarkson.edu!cline