wilson@carcoar.Stanford.EDU (Paul Wilson) (11/23/89)
My admittedly naive intuitions would say that only meshes are truly scalable, since you have to pack things into real (<= 3D) space. Hypercubes end up needing long wires to project a higher-dimensional graph into 2- or 3-space. As processor speeds increase (and the speed of light presumably doesn't) these end up being slower than other links and destroy the scalability of n-cubes.

It would *seem* to me that a 3D mesh is the only way to go because that's the highest dimensionality you can embed into a 3D reality. You get constant time per hop, no problem.

Now could somebody summarize Dally's argument that 2D meshes (and toruses) are better? I can see why you can wrap a 2D mesh around to make a torus one way, but can't you get much better packing by using a 3D mesh with its higher connectivity? It will have edges rather than wrapping around transparently, but that's the only way to avoid long wires in the limit.

Any comments? I'm also curious about the pros and cons of CM-style combinations of mesh and n-cube routing systems.

thanks prematurely,
Paul

Paul R. Wilson                    Software Systems Laboratory
lab ph.: (312) 996-9216           U. of Illin. at C. EECS Dept. (M/C 154)
wilson@carcoar.stanford.edu       Box 4348, Chicago, IL 60680
david@cs.washington.edu (David Callahan) (11/27/89)
In article <7178@hubcap.clemson.edu> wilson@carcoar.Stanford.EDU (Paul Wilson) writes:
>Now could somebody summarize Dally's argument that 2D meshes
>(and toruses) are better? I can see why you can wrap a 2D mesh
>around to make a torus one way, but can't you get much
>better packing by using a 3D mesh with its higher connectivity?
>It will have edges rather than wrapping around transparently,
>but that's the only way to avoid long wires in the limit.

We are building a machine with a 3d mesh. The "folding" that eliminates edges in a 2d torus can be applied in three dimensions so that the resulting cube has no edges or corners. The machine is a shared-memory machine and the network is packet switched rather than using worm-hole routing, since the messages are short. Latency to a remote memory module is 2 cycles per network node traversed: one cycle routing and one cycle "on the wire".

For a 256 processor machine, we plan a 16x16x16 cube. This size assures adequate flux across any cut of the machine. For such a machine, the expected one-way latency to memory is 12 hops or 24 cycles. Simulations indicate that contention raises this average only slightly.

>Paul R. Wilson

--
David Callahan (david@tera.com, david@june.cs.washington.edu, david@rice.edu)
Tera Computer Co.  400 North 34th Street  Seattle WA, 98103
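Where the 12-hop figure comes from: in a k-ary torus with wraparound, traffic can take the short way around each ring, so the average distance along one dimension under uniform traffic is about k/4 hops. A minimal sketch of that arithmetic, taking the 2-cycles-per-hop number from the post above (C used purely for illustration):

    #include <stdio.h>

    int main(void)
    {
        int k = 16;                    /* radix: a 16x16x16 torus */
        int dims = 3;                  /* three dimensions */
        double cycles_per_hop = 2.0;   /* 1 cycle routing + 1 "on the wire" */

        /* With wraparound, the average distance per dimension is ~k/4. */
        double hops = dims * (k / 4.0);

        printf("expected hops = %.0f, one-way latency = %.0f cycles\n",
               hops, hops * cycles_per_hop);   /* prints 12 hops, 24 cycles */
        return 0;
    }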
byrd@portia.Stanford.EDU (Greg Byrd) (11/27/89)
In article <7178@hubcap.clemson.edu> wilson@carcoar.Stanford.EDU (Paul Wilson) writes:
{stuff deleted}
>It would *seem* to me that a 3D mesh is the only way to go
>because that's the highest dimensionality you can embed into
>a 3D reality. You get constant time per hop, no problem.
>
>Now could somebody summarize Dally's argument that 2D meshes
>(and toruses) are better?

I assume you mean better than 3D? That is not Dally's argument at all. The argument is that given constant bisection width, lower-dimensional k-ary n-cubes give better latency and throughput than higher-dimensional ones. The main reason is that there are fewer channels which contribute to the bisection, so you can make them wider.

Latency depends on channel width, number of hops, and the size of the message. For lower dimensions, the number of hops goes up, but so does the channel width. So for a given number of nodes (k^n), you trade off dimension (n) against radix (k) to get the best average latency.

Assuming that delay across a wire is constant (independent of its length), the optimal dimensionality for 256, 16K, and 1M nodes is 2, 4, and 5, respectively. If a logarithmic delay is assumed, the optimal points are 2, 3, and 5. If a linear delay is assumed, then 2 is optimal for all three network sizes.

With regard to long wires for tori, one can "fold" the torus in such a way that all wires are twice as long as the mesh's, but without any wires that must traverse the entire edge. Note also that meshes have half the bisection width of tori, so they can have twice the channel width, again at the expense of a larger average number of hops.

(All of the latency numbers, by the way, assume every message travels the "average" distance and that there is no contention in the network.)

For more details, see Dally's thesis (Caltech, 1986 -- also published by Kluwer) or see:

  William J. Dally, "Wire-Efficient VLSI Multiprocessor Communication
  Networks," in _Advanced_Research_in_VLSI_, MIT Press, 1987, pp. 391-415.

-----------------------------------------------------------------------------
Greg Byrd                  byrd@sumex-aim.stanford.edu
Knowledge Systems Lab      701 Welch Rd., Bldg. C, Palo Alto, CA 94304
Stanford University        (415) 725-3849
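A back-of-the-envelope version of the trade-off Byrd describes (a sketch with illustrative constants, not Dally's actual equations): hold the total bisection wire count fixed, so channel width grows with the radix k, and weigh hop count against serialization time.

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double N = 256.0;   /* total nodes (k^n) */
        double L = 150.0;   /* message length in bits (assumed) */

        for (int n = 1; n <= 8; n++) {
            double k = pow(N, 1.0 / n); /* radix of the k-ary n-cube */
            double hops = n * k / 4.0;  /* average torus distance */
            /* A torus bisection cuts ~2N/k channels, so at constant
               bisection wire count the channel width scales as k. */
            double width = k;
            double latency = hops + L / width; /* hops + serialization */
            printf("n=%d  k=%6.2f  hops=%6.2f  width=%6.2f  latency=%7.2f\n",
                   n, k, hops, width, latency);
        }
        return 0;
    }

With these constants the minimum falls at n=2 for N=256, consistent with the constant-wire-delay numbers in the post.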
grunwald@ncar.UCAR.EDU (Dirk Grunwald) (11/28/89)
>>> On 22 Nov 89 21:31:51 GMT, wilson@carcoar.Stanford.EDU (Paul Wilson) said:

Paul> It would *seem* to me that a 3D mesh is the only way to go
Paul> because that's the highest dimensionality you can embed into
Paul> a 3D reality. You get constant time per hop, no problem.

Paul> Now could somebody summarize Dally's argument that 2D meshes
Paul> (and toruses) are better? I can see why you can wrap a 2D mesh
Paul> around to make a torus one way, but can't you get much
Paul> better packing by using a 3D mesh with its higher connectivity?
Paul> It will have edges rather than wrapping around transparently,
Paul> but that's the only way to avoid long wires in the limit.

Paul> Any comments? I'm also curious about the pros and cons of
Paul> CM-style combinations of mesh and n-cube routing systems.

----

You can embed a torus in a plane if you allow wires to overlap. E.g., package the torus:

    +-----------------------------+
    |                             |
    A <-> B <-> C <-> D <-> E <-> F

as

    +-----------+       +----------+
    |           |       |          |
    A <-> B  F--+    C--+    E <-> D
          |  |       |       |
          +--|-------+       |
             +---------------+

You decrease the maximum wire length at the expense of increased average wire length. Similar constructions are possible for 2- and 3-d tori. The problem is that you can't extend this system as easily, i.e., adding more processors becomes a pain.

There are other possibilities, because the world is not planar. You could also package the above as:

    A B
    +++++
    D C

where +++++ is the PC board and you use bond-through connections, or where +++++ is a cabinet and you use wire connectors.

Last I heard, Dally is now advocating a 3-D mesh. Of course, multi-layer PC boards make hypercubes appear better. Consider something like the Ncube system that wires hypercubes together using PC boards. I think that if you use free-space lasers or fiber optics, you'd probably go for hypercubes again, or a hypercube of shared-memory machines.
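The folding trick is easy to state as an interleaving. A sketch, assuming one particular interleaving (nodes go out on the even physical slots and fold back on the odd ones); it prints every link's wire length for the six-node ring above:

    #include <stdio.h>
    #include <stdlib.h>

    /* Physical slot for node i of an n-node folded ring: nodes 0..n/2-1
       occupy the even slots left to right; the rest fold back on the
       odd slots right to left. */
    static int slot(int i, int n)
    {
        return (i < n / 2) ? 2 * i : 2 * (n - 1 - i) + 1;
    }

    int main(void)
    {
        int n = 6, maxlen = 0;

        for (int i = 0; i < n; i++) {
            int len = abs(slot(i, n) - slot((i + 1) % n, n));
            printf("link %c-%c: wire length %d\n",
                   'A' + i, 'A' + (i + 1) % n, len);
            if (len > maxlen)
                maxlen = len;
        }
        printf("max wire length %d, vs %d for an unfolded ring\n",
               maxlen, n - 1);
        return 0;
    }

The maximum length comes out 2 regardless of n, while the unfolded ring's wrap wire grows as n-1, which is exactly the trade (bounded maximum, larger average) the post describes.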
chi@unc.cs.unc.edu (Vernon Chi) (11/29/89)
It's not clear why you couldn't have transparent wrap-around on 3-D meshes analogous to torus-connected 2-D meshes. In principle, the one-dimensional wrap scheme which avoids long wires for any array size, e.g.,

      _________   _________   ___
     /         \ /         \ /   \
    X     X     X     X     X     X
     \___/ \_________/ \_________/

where "X" represents a node, should generalize to 2- and 3-D meshes.

Following your discussion of physically embedding in 3 dimensions, however, it's worth noting that any realizable technology for packaging depends on a layered hierarchy of successively larger functional entities, e.g., IC carriers, PC boards, card cages, racks, multiple-rack equipment bays, etc. In the harsh reality of actually constructing a system, it's hard to see how to avoid the inherently longer wires required by successively higher packaging levels, regardless of the interconnection topology.

This suggests that data locality may play as prominent a role in scalability as interconnect topology. Which might make 2- and 3-D meshes look even better in the limit as compared to higher-dimensional topologies.
paulv@piring.cwi.nl (Paul Vitanyi) (11/29/89)
In article <7178@hubcap.clemson.edu> wilson@carcoar.Stanford.EDU (Paul Wilson) writes:
>My admittedly naive intuitions would say that only meshes are truly
>scalable, since you have to pack things into real (<= 3D) space.
>Hypercubes end up needing long wires to project a higher-dimensional
>graph into 2- or 3-space. As processor speeds increase (and the
>speed of light presumably doesn't) these end up being slower
>than other links and destroy the scalability of n-cubes.
>
>It would *seem* to me that a 3D mesh is the only way to go
>because that's the highest dimensionality you can embed into
>a 3D reality. You get constant time per hop, no problem.

<<<stuff deleted>>>

>Any comments? I'm also curious about the pros and cons of
>CM-style combinations of mesh and n-cube routing systems.

This type of question is addressed in my paper:

  P. Vitanyi, Locality, communication, and interconnect length in
  multicomputers, SIAM J. on Computing, 17(1988), pp. 659-672.

For example, it shows that while networks of small diameter like trees necessarily have *some* long wires, symmetric networks with small diameter, like n-cubes and CCC, necessarily have *average* long wires.
Donald.Lindsay@MATHOM.GANDALF.CS.CMU.EDU (11/30/89)
One recent poster suggested that hypercubes and meshes had the same routing latency per hop. I don't agree: wider is faster.

The reason is that the routing hardware at each node cannot make a decision until it has seen the entirety of the packet header. On an NCUBE-2, the header is 32 bits, sent serially: the latency is 32+ bit times (44, to be precise). A byte-wide path, such as BBN uses, could make the same decision in 4+ clocks. (And both could be faster if the header's likeliest command mode was decodable from, say, the first half of the header.)

This particular latency is often dominated by the message time. But it's the kind of point that designers have to be sensitive to.

Don

D.C.Lindsay   Carnegie Mellon Computer Science
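The arithmetic behind "wider is faster": a router cannot decide until it has clocked in the whole header, i.e. ceil(header_bits / channel_width) transfers. A minimal sketch using the 32-bit header from the post (the 16-bit case is added for comparison):

    #include <stdio.h>

    int main(void)
    {
        int header_bits = 32;              /* NCUBE-2 style header */
        int widths[] = { 1, 8, 16 };       /* serial, byte-wide, wider */

        for (int i = 0; i < 3; i++) {
            int w = widths[i];
            int clocks = (header_bits + w - 1) / w;  /* ceiling division */
            printf("width %2d bits: header fully visible after %2d clocks\n",
                   w, clocks);
        }
        return 0;
    }

That reproduces the 32-bit-time serial case and the 4-clock byte-wide case before any per-node overhead is added.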
gene@cs.bu.edu (Gene Itkis) (12/01/89)
In article <7178@hubcap.clemson.edu> wilson@carcoar.Stanford.EDU (Paul Wilson) writes:
>My admittedly naive intuitions would say that only meshes are truly
>scalable, since you have to pack things into real (<= 3D) space.
>...
>It would *seem* to me that a 3D mesh is the only way to go
>because that's the highest dimensionality you can embed into
>a 3D reality. You get constant time per hop, no problem.

One problem with a 3D mesh is that as the processors work, heat is generated(*). Obviously this heat has to be dissipated somehow. The heat created is proportional to the volume of the mesh, but dissipation can happen only in proportion to the surface area. So, if you care about scalability, 3D meshes are no good - they'll overheat and cook themselves. Thus the best you can hope for in our physics is a 2D mesh.

>...
>Hypercubes end up needing long wires to project a higher-dimensional
>graph into 2- or 3-space. As processor speeds increase (and the
>speed of light presumably doesn't) these end up being slower
>than other links and destroy the scalability of n-cubes.

Somewhat relevant issues are considered in a FOCS-89 paper, "Power of Fast VLSI Models is Insensitive to Wires' Thinness". As one of the consequences of the result, if the communication time (of a length-x wire) is, say, $x log^(1.1) x$, then designers can design their chips using "imaginary" wires of width 0, and the chips so designed can be simulated by a mesh of finite processors on-line without increase in space or time.

--------------------
(*) Some people doubt that energy dissipation is necessary in reversible computations. But non-dissipating gates have not been discovered yet. In addition, some computations, e.g. error-correcting, must be irreversible, and therefore will be sure to produce heat. So the assumption that heat has to be generated by computing elements seems safe, at least for a while.
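The scaling in the first paragraph, made concrete: heat grows with the number of nodes, N ~ R^3, while the surface grows only as R^2, so the flux each unit of surface must carry grows as N^(1/3). A sketch with an assumed 1 W per processor (the constant is illustrative; the growth rate is the point):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double p = 1.0;  /* watts per processor (assumed) */

        for (double n = 1e3; n <= 1e9; n *= 1e3) {
            double r = cbrt(n);                /* cube edge, in node pitches */
            double flux = p * n / (6 * r * r); /* watts per unit face area */
            printf("N = %.0e: required surface flux ~ %6.1f  (grows as N^(1/3))\n",
                   n, flux);
        }
        return 0;
    }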
wen-king@csvax.caltech.edu (Wen-King Su) (12/01/89)
>From Donald.Lindsay@MATHOM.GANDALF.CS.CMU.EDU:
>One recent poster suggested that hypercubes and meshes had the same
>routing latency per hop. I don't agree: wider is faster. The reason
>is that the routing hardware at each node cannot make a decision
>until it has seen the entirety of the packet header. On an NCUBE-2,
>the header is 32 bits, sent serially: the latency is 32+ bit times
>(44, to be precise). A byte-wide path, such as BBN uses, could make
>the same decision in 4+ clocks. (And both could be faster if the
>header's likeliest command mode was decodable from, say, the first
>half of the header.)

If it is done right, the first unit of the message to enter a router of either kind can carry enough information for the router to decide whether it should 1) strip off that unit, and/or 2) determine the output channel of the message. For an n-cube, the minimum size of the unit is 1 bit. For a grid, the size is log(sqrt(N)). You can make do with 3 bits if you can afford 2 clocks instead of 1. Our wormhole routing chip has 8-bit channels, and it requires 1 clock for decoding the header.

For more details, see John Ngai's PhD thesis, Caltech CS-TR-89-09, and Charles Flaig's Master's thesis, Caltech CS 5241:TR:87.

/*------------------------------------------------------------------------*\
| Wen-King Su  wen-king@vlsi.caltech.edu  Caltech Corp of Cosmic Engineers |
\*------------------------------------------------------------------------*/
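To make those unit sizes concrete: a binary n-cube router can pick its output channel from a single bit of the relative address, while a 2D grid router must examine a per-dimension offset of up to sqrt(N), which takes about log2(sqrt(N)) bits. A sketch of how that minimum grows (illustrative only; real headers carry extra control bits):

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        for (double n = 256; n <= 1048576; n *= 16) {
            printf("N = %8.0f nodes: n-cube unit = 1 bit, grid unit ~ %2.0f bits\n",
                   n, ceil(log2(sqrt(n))));
        }
        return 0;
    }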
lkaplan@BBN.COM (Larry Kaplan) (12/02/89)
In article <7252@hubcap.clemson.edu> Donald.Lindsay@MATHOM.GANDALF.CS.CMU.EDU writes:
>...On an NCUBE-2,
>the header is 32 bits, sent serially: the latency is 32+ bit times
>(44, to be precise). A byte-wide path, such as BBN uses, could make
>the same decision in 4+ clocks. (And both could be faster if the
>header's likeliest command mode was decodable from, say, the first
>half of the header.)

Actually, the current generation BBN Butterfly Switch does better than this. By "eating" the addressing bits used by a particular switch element, the relevant addressing bits for the next switch element arrive there in the FIRST byte. So routing decisions are made after 1 clock.

_______________________________________________________________________________
Laurence S. Kaplan      BBN Advanced Computers      lkaplan@bbn.com
(617) 873-2431          10 Fawcett St., Cambridge, MA 02238
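A toy version of the "eating" trick Kaplan describes (the header format below is made up for illustration and is not BBN's): each stage routes on the leading digit, then strips it, so the digit the next stage needs arrives at the front.

    #include <stdio.h>

    #define STAGES 3
    #define BITS_PER_STAGE 2   /* 4-way switch elements (assumed) */

    int main(void)
    {
        /* Low bits arrive first: 0x27 = 100111b encodes ports 3, 1, 2. */
        unsigned header = 0x27;

        for (int s = 0; s < STAGES; s++) {
            int port = header & ((1u << BITS_PER_STAGE) - 1);
            header >>= BITS_PER_STAGE;  /* "eat" the digit just used, so the
                                           next stage sees its digit first */
            printf("stage %d routes on port %d\n", s, port);
        }
        return 0;
    }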
kbreinho%ug@cs.utah.edu (Keith Breinholt) (12/06/89)
>}>One problem with 3D mesh is that as the processors work the heat is
>}>generated(*).
>}>Thus the best you can hope for in our physics is 2D mesh.
>}>
[lots edited out -- Moderator]
>I did not intend to claim that 3D mesh cannot be simulated in 2D per se (as a
>matter of fact it can be simulated in 1D - Turing machine). But if you notice,
>the diameter (max distance between nodes) of, say, your (or any other) 2D
>representation of a 3D mesh is going to be asymptotically much larger than the
>actual 3D mesh diameter (for the same number of nodes). So, in fact, your
>suggestion can be read as "pretend you have 3D, but implement it in 2D in
>reality", and this does not contradict anything I have said.

I think you misunderstood me, partly due to the wording of my reply. What I was trying to say is that a 3D mesh CAN be reproduced on a 2D plane, not simulated or pretended in 2D.

A hexagon has 6 sides. A cube has 6 sides. If I can produce 6 sets of connections to neighboring processors in a 2D plane, I have effectively produced a 3D connection scheme in a 2D plane. And as stated before, this type of representation requires no more cooling than existing planar boards. Which does contradict your statement.

If 3D meshes "cook themselves" it will be due to a design flaw, not because physics dictates it so.

Keith L. Breinholt
kbreinho@ug.utah.edu or usiklb@vm.usi.utah.edu
Utah Supercomputing Institute  (801) 581-4439
bjornl@tds.kth.se (Bj|rn Lisper) (12/08/89)
In article <7274@hubcap.clemson.edu> gene@cs.bu.edu (Gene Itkis) writes:
%One problem with 3D mesh is that as the processors work the heat is
%generated(*). Obviously this heat has to be dissipated somehow. The heat
%created is proportional to the volume of the mesh, and the dissipation can
%happen only proportionally to the surface area. So, if you care about
%scalability, the 3D meshes are no good - they'll overheat, cook themselves.
%Thus the best you can hope for in our physics is 2D mesh.
You implicitly assume that the 3D mesh is a solid body. This is not
necessarily true. It could for instance consist of a stack of cards, with
air flowing between them, or have channels with coolant flowing through. In
this way, you get the same cooling area as for the 2D mesh.
Bjorn Lisper
gene@CS.BU.EDU (Gene Itkis) (12/08/89)
In article <7373@hubcap.clemson.edu> bjornl@tds.kth.se (Bj|rn Lisper) writes:
>In article <7274@hubcap.clemson.edu> gene@cs.bu.edu (Gene Itkis) writes:
>>... So, if you care about
>>scalability, the 3D meshes are no good - they'll overheat, cook themselves.
>>Thus the best you can hope for in our physics is 2D mesh.
>
>You implicitly assume that the 3D mesh is a solid body. This is not
>necessarily true. It could for instance consist of a stack of cards, with
>air flowing between them, or have channels with coolant flowing through. In
>this way, you get the same cooling area as for the 2D mesh.

Yes, provided that the distance between the "cards" is not constant, i.e., it grows with the number of automata. Such an architecture would still remain effectively 2D. If you are to pack automata densely (leaving a constant volume around every automaton) you eventually run into the overheating problem even if you run coolant between the automata. Think of water being "generated" instead of heat. Do not forget, we want scalability.
kbreinho%ug@cs.utah.edu (Keith Breinholt) (12/09/89)
Before I get any more fan mail (read "flame mail"), let me post an apology to Gene Itkis and all of the others who sought to enlighten me on 3D meshes.

In my posting I was assuming a cube-connected 3D mesh, thus the reference to 6 sides, which can be represented in a 2D hexagonal mesh. However, as Gene pointed out, this is not a 3D mesh (for you purists out there), although I have the same signal paths and possibly signal lengths. And as more signal paths are added in a 3D mesh, it becomes impractical to mimic them in 2D.

The fact remains that at some huge scale 3D meshes cannot be properly cooled. (And I think we need to add the words "without some special method of cooling.") I say this because even 2D or 1D circuits at some point of scale will "cook" themselves.

Now, did I leave something out?

Keith L. Breinholt
kbreinho@ug.utah.edu or usiklb@vm.usi.utah.edu
Utah Supercomputing Institute  (801) 581-4439
landman@hanami.Eng.Sun.COM (Howard A. Landman x61391) (12/14/89)
>In article <7274@hubcap.clemson.edu> gene@cs.bu.edu (Gene Itkis) writes:
>%One problem with 3D mesh is that as the processors work the heat is
>%generated(*).....

In article <7373@hubcap.clemson.edu> bjornl@tds.kth.se (Bj|rn Lisper) writes:
>You implicitly assume that the 3D mesh is a solid body. This is not
>necessarily true. It could for instance consist of a stack of cards, with
>air flowing between them, or have channels with coolant flowing through. In
>this way, you get the same cooling area as for the 2D mesh.

No, he doesn't assume that. This is important.

At some point the heat must exit the surface of the 3D cube. Regardless of what method you use, it will have some limit per area, and so the surface area limits the amount of heat you can dissipate. Even if the machine is spewing out white-hot steam at the speed of light, this limit exists.

As noted by Gene, you either figure out how to do reversible computation that doesn't dissipate a fixed amount of heat per operation, or you live with the limit.

Howard A. Landman
landman%hanami@eng.sun.com
pase@orville.nas.nasa.gov (Douglas M. Pase) (12/14/89)
In <7341@hubcap.clemson.edu> kbreinho%ug@cs.utah.edu (Keith Breinholt) writes:
-[...] A hexagon has 6 sides. A cube has 6 sides. If I can produce 6 sets
-of connections to neighboring processors in a 2D plane I have effectively
-produced a 3D connection scheme in a 2D plane.
Hexagons and 3-d cubes are in fact planar figures (as graphs), but larger 3-d meshes are not.
Yes you can lay out the processors in 2-d, but the mesh connections will
require more than one layer. The physical distance between nearest neighbors
can also be quite large in a big 3-d mesh, though I gather not as large as
for the same sized n-cube.
Dr. Douglas M. Pase Computer Sciences Corporation
95 Sierra Vista Ave NAS MS 258-6
Mountain View, CA 94043 NASA Ames Research Center
(415) 940-1197 Moffett Field, CA 94035
pase@orville.nas.nasa.gov (415) 694-6394
cleary@cpsc.ucalgary.ca (John Cleary) (12/18/89)
In article <7427@hubcap.clemson.edu>, landman@hanami.Eng.Sun.COM (Howard A. Landman x61391) writes:
>>In article <7274@hubcap.clemson.edu> gene@cs.bu.edu (Gene Itkis) writes:
>>%One problem with 3D mesh is that as the processors work the heat is
>>%generated(*).....

It is worth noting, however, that the temperature of a 3D cube rises very slowly with the number of nodes. The heat lost by radiation (probably the only feasible mechanism in the limit) is proportional to A*T^4, where A is the surface area and T is the temperature. If the radius of the 3D body is R, then A scales as R^2, and the number of processors (N), and hence the total heat (per unit time), scales as R^3. The result is that T scales as N^(1/12) - pretty slow.

Of course, if you could carry some of the heat away with neutrinos then you could build a big one before the temperature got too high :-)

John G. Cleary
Department of Computer Science, University of Calgary
2500 University Drive N.W., Calgary, Alberta T2N 1N4, Canada
cleary@cpsc.UCalgary.ca   Phone: (403)282-5711 or (403)220-6087
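Plugging numbers into Cleary's balance: N*p watts generated must equal sigma*A*T^4 radiated, with A ~ R^2 and N ~ R^3, so T ~ N^(1/12). The per-node power and node spacing below are assumed for illustration:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double p = 1.0;          /* watts per node (assumed) */
        double pitch = 0.1;      /* node spacing in meters (assumed) */
        double sigma = 5.67e-8;  /* Stefan-Boltzmann constant, W/(m^2 K^4) */

        for (double n = 1e3; n <= 1e12; n *= 1e3) {
            double r = cbrt(n) * pitch;   /* cube edge length, meters */
            double area = 6 * r * r;      /* radiating surface */
            double t = pow(n * p / (sigma * area), 0.25);
            printf("N = %.0e: surface temperature ~ %5.0f K\n", n, t);
        }
        return 0;
    }

Each thousand-fold increase in N multiplies T by only 10^(3/12), about 1.8x, which is the "pretty slow" growth in the post.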
wen-king@csvax.caltech.edu (Wen-King Su) (12/19/89)
>From: cleary@cpsc.ucalgary.ca (John Cleary)
>In article <7427@hubcap.clemson.edu>, landman@hanami.Eng.Sun.COM (Howard A. Landman x61391) writes:
>>>In article <7274@hubcap.clemson.edu> gene@cs.bu.edu (Gene Itkis) writes:
>>>%One problem with 3D mesh is that as the processors work the heat is
>>>%generated(*).....
>
>It is worth noting however that the temperature of a 3D cube rises very slowly
>with the number of nodes. The heat lost by radiation (probably the only
>feasible mechanism in the limit) is proportional to A*T^4 where A is the
>surface area and T is the temperature. If the radius of the 3D body is R then
>A scales as R^2 and the number of processors (N) and hence the total heat (per
>unit time) as R^3. The result is that T scales as N^(1/12) - pretty slow.

'T' is the temperature at the surface of the 3D cube. In order to allow heat to dissipate, there must be a thermal gradient from the core to the surface. In order to keep the core temperature at an acceptable level, the surface temperature must drop (not rise, nor remain constant) as R is increased. At a certain R, the required surface temperature will hit 0 degrees Kelvin, and we can go no further.

/*------------------------------------------------------------------------*\
| Wen-King Su  wen-king@vlsi.caltech.edu  Caltech Corp of Cosmic Engineers |
\*------------------------------------------------------------------------*/
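The gradient argument, quantified with textbook conduction (this formula is standard physics, not from the thread, and the constants are assumed): a uniformly heated solid ball of radius R develops a core-to-surface rise of about q*R^2/(6*kappa), so the rise grows as R^2 ~ N^(2/3), far faster than the N^(1/12) surface term.

    #include <stdio.h>

    int main(void)
    {
        double q = 1000.0;     /* heat per unit volume, W/m^3 (assumed) */
        double kappa = 150.0;  /* conductivity, W/(m K) (assumed, ~silicon) */

        for (double r = 1.0; r <= 1000.0; r *= 10.0) {
            /* Uniformly heated ball: dT = q R^2 / (6 kappa). */
            double dt = q * r * r / (6.0 * kappa);
            printf("R = %6.0f m: core runs ~ %12.1f K above the surface\n",
                   r, dt);
        }
        return 0;
    }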
cline@cheetah.ece.clarkson.edu (Marshall Cline) (12/20/89)
In article <7480@hubcap.clemson.edu> cleary@cpsc.ucalgary.ca (John Cleary) writes:
>In article <7427@hubcap.clemson.edu>, landman@hanami.Eng.Sun.COM (Howard A. Landman x61391) writes:
>It is worth noting however that the temperature of a 3D cube rises very slowly
>with the number of nodes. The heat lost by radiation (probably the only
>feasible mechanism in the limit) is proportional to A*T^4 where A is the
>surface area and T is the temperature. If the radius of the 3D body is R then
>A scales as R^2 and the number of processors (N) and hence the total heat (per
>unit time) as R^3. The result is that T scales as N^(1/12) - pretty slow.
>...
> John G. Cleary

In case anyone doubts John's analysis, a quick example from nature should help. Small animals are generally very active and have high metabolisms (consider a bird's heart-rate vs a grizzly bear's). Why? Answer: small animals have a low body_volume/surface_area ratio (R^3 / R^2 gets small for small R). But if you drag a whale up on a beach, it will literally *cook* from its own body heat, having more pounds per square inch.

The problem of designing big computers that don't cook themselves therefore isn't new -- God had the same problem when designing big animals! The solution thus appears to be that big computers will need better cooling systems.

Marshall
--
___________________________________________________________________
Marshall Cline/ECE Dept/Clarkson Univ/Potsdam NY 13676/315-268-3868
Internet: cline@sun.soe.clarkson.edu -or- bh0w@clutx.clarkson.edu
BitNet: bh0w@clutx   Usenet: uunet!sun.soe.clarkson.edu!cline