brooks@lll-crg.ARPA (Eugene D. Brooks III) (02/27/85)
WARNING: Serious flames here!

It really upsets me that an individual would claim that a bus architecture is better than a Hypercube when the first thing he does in his argument is give the bus a 300 megabyte per second bandwidth and give the hypercube the equivalent of 9600 baud serial lines to communicate with. If you need that kind of edge to prove your point you are in DEEP TROUBLE.

In the Caltech project the communication speed of each channel was on a par with the floating point speed, as it should be in numerical applications work. If the floating point speed of a node is 10 MFLOPS, as it is for current hardware with a good cost/performance ratio, the communication channels should support a transfer rate to match.

For those who wish to compare the Hypercube to the x, xy, xyz (or xyzt...) bus: just what do you think a hypercube is? It's an xyzt... bus with the number of nodes per dimension optimized. Just two CPUs connect to each bus (the word "bus" is used loosely here). What is better? Nothing!

For those who want to complain that a Cube of order=20 would require each processor to manage 20 messages on average, there is of course a better way to build the hardware if this is a problem. The problems to which the Cube has been put in the past used only communication with neighboring processors and rather straightforward maps of the problem onto the machine. If your problems are of a sort that require randomly addressed message forwarding, then you need to consider the alternate (and more expensive) hardware design. There is the old adage "You get what you pay for." In fact you probably should be considering a shared memory machine. Perhaps even a Hypercube!
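To put that in concrete terms, here is a minimal sketch (node labels assumed to be d-bit integers; the function is purely illustrative): each node's d neighbors are exactly the addresses that differ from it in one bit, i.e. one two-CPU "bus" per dimension.

def hypercube_neighbors(node, d):
    """The d neighbors of `node` in a d-dimensional binary hypercube:
    flip one address bit per dimension, i.e. one two-CPU 'bus' per axis."""
    return [node ^ (1 << k) for k in range(d)]

# Example: in a 3-cube (8 nodes), node 5 = 101b borders 100b, 111b, and 001b.
print(hypercube_neighbors(5, 3))   # -> [4, 7, 1]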
rpw3@redwood.UUCP (Rob Warnock) (03/01/85)
+---------------
| WARNING: Serious flames here!
| It really upsets me that an individual would claim that a bus architecture
| is better than a Hypercube when the first thing he does in his argument is
| give the bus a 300 megabyte per second bandwidth and give the hypercube the
| equivalent of 9600 baud serial lines to communicate with. If you need that
| kind of edge to prove your point you are in DEEP TROUBLE.
+---------------

Sorry to have inspired such flames. (My allocations of bandwidth were not quite as skewed as you say, but that could be the heat of the flame speaking... In <171@redwood.UUCP>, I believe I said 33 megaBITs and 50 kilobits, referring to a 1 meganode array, or 20 pt-pt links. True, I did goof the pt-pt number somewhat -- it should have been more like 1-2 megabits/sec [see below].)

When I entered the discussion, "x,y,z" bus systems were being harshly criticized, compared to pure hypercubes of 2**N processors with N point-to-point links each, on the basis that a given bus will only support a moderate number of nodes due to contention for the bus bandwidth (the term being used was "bus squabbling", as I recall). I wished to show that some areas were being overlooked in the comparison, particularly (1) the effect of memory bandwidth within a node, and (2) some queuing considerations. Let's try again, and I'll be more careful this time...

My underlying assumptions, not stated clearly enough before, are that the node processors are low cost and therefore constructed very simply, using current network chips such as the Intel 82586 or the AMD/Mostek LANCE, which DMA directly into and out of a node processor's memory. Further, I assume that the link controllers do NOT have packet-sized FIFOs or external buffers, nor the corresponding problems with delays due to copying to/from such buffers. (This also follows from the assumption of simple, low-cost nodes.)

Next, I assume that we are discussing fairly large arrays, such as 64 to 1024 processors or more, so that the "fanout" (of each bus in the bus case, or of each processor in the hypercube case) is 6 to 10 or more. Finally, I assume that the processors have some work to do other than simply sending packets around, so that the AVERAGE communications rate is MUCH lower than the processor's memory speed. Specifically, the link rates cited previously with respect to the 1 Mnode array (100 nodes/bus or 20 links) carried this assumption. (See below, where I discuss bus saturation.)

NOTE: In applications which are not at all cost-sensitive, this last assumption may not hold strongly! My assumptions were based on the kinds of applications I have in mind for such large arrays, and it may be that our apparent disagreement is due to differing assumptions. However, given these assumptions (which are reasonable for a large class of problems), I felt the "x,y,z" bus was being given short shrift.

My first point is simply that (with these assumptions) the memory bandwidth of each processor's local memory bus limits the PEAK communications rate. If one is to avoid overruns in a straightforward way, the total peak data rate of all links (bus or point-to-point) connected to a given processor should not exceed the memory bus bandwidth at that processor. Hypercubes of the type being discussed require a large number (6 to 20+) of point-to-point links per processor; the "x,y,z" bus system has exactly three controllers per processor, regardless of the size of the bus.
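Stated as a small sketch (under the assumptions above; the 30 Mbit/sec memory figure is simply the example worked through next, and the function name is mine):

import math

# Sketch: if every link controller DMAs into the same node memory, the peak
# per-link rate is the memory bandwidth split evenly across the controllers.
def per_link_budget(mem_bw_mbit, n_controllers):
    return mem_bw_mbit / n_controllers

nodes = 1024
mem_bw = 30.0                                           # Mbit/s of node memory bandwidth (assumed)
print(per_link_budget(mem_bw, int(math.log2(nodes))))   # hypercube: 10 links  -> 3.0 Mbit/s each
print(per_link_budget(mem_bw, 3))                       # "x,y,z" bus: 3 ctlrs -> 10.0 Mbit/s each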
In these configurations, and also assuming that the data rates of the links (of either kind) do NOT exceed the memory rate, the pt-pt link data rates must be significantly lower than the "x,y,z" bus rates. (The overhead of switching DMA masters will further favor systems with few links rather than many.)

The second point is that in the contention for a processor's memory bus, by simple queuing theory a small number of high-speed channels is much better than a large number of low-speed channels if the total bandwidth is the same: splitting a given bandwidth across many channels stretches each packet's transmission time without reducing the number of packets to be sent.

Some example numbers will be useful here. Let's assume (to keep the arithmetic simple) a 1024-node array, of either the hypercube or "x,y,z" bus type, so each hypercube node has 10 links and each "x,y,z" bus has 10 nodes. Further, let's assume that the memory bus is 30 Mbits/sec, so each of the three "x,y,z" busses is 10 Mbits/sec (Ethernet?), and each hypercube pt-pt link is 3 Mbits/sec. The average packet latency in this case (under light to moderate loads) is three times less for the bus system, for next-door neighbors. Each node of the hypercube is directly connected to 10 other nodes, and each node of the "x,y,z" bus is directly connected to 27 other nodes. Putting it another way, the maximum number of links (not nodes) a packet must traverse from one processor to another is 3 for the bus and 10 for the hypercube, or roughly 10 times better latency to go "across town" on the bus. ;-}

Now, while in this example the "x,y,z" bus is favored for latency under light to moderate loads, it DOES run out of steam if the average communications load at every node gets close to a megabit per bus, or about 3 Mbits/sec of total memory load per node. Whether this is significant depends on the kind of problem the processor array is trying to solve, and on whether one considers 10% of the memory bandwidth being spent on bus/link DMA to be too much or too little. But certainly there is a range of problems for which the "x,y,z" bus is better suited than the hypercube, and vice-versa, and the COST of the bus system is certainly much less (assuming the application has any cost sensitivity!).

For a meganode array (at the same example memory speeds), the bus system still uses 10 Mbit/sec links, but the hypercube links drop to 1.5 Mbit/sec (T-1 carrier on twisted pairs, maybe?). The "x,y,z" bus latency is more than 40 times better than the hypercube's, "cross town" under light load. Conversely, the "x,y,z" bus system saturates at an average communications load per node of only 1% of the memory bandwidth. (The arithmetic behind both examples is collected in the short sketch below.) Although there will still be applications for which the "x,y,z" bus system is appropriate, I feel that NEITHER the hypercube nor the "x,y,z" bus is an ideal solution for such large arrays.

+---------------
| For those who want to complain that a Cube of order=20 would require each
| processor to manage 20 messages on average, there is of course a better way
| to build the hardware if this is a problem...
+---------------

I am interested in hearing your "better way" (or anybody's). Please note that just because I defended the "x,y,z" bus versus the hypercube with pt-pt links, that doesn't mean I like either one especially! The method I currently favor (which I have talked about several times) is a hybrid of the hypercube and the "x,y,z" bus, which has exactly two busses per processor. (For more, see previous postings.) Since it has only two busses, it's better than even the "x,y,z" bus at peak data rate (and hence lower latency at light net load).
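Here is that sketch. The 1000-bit packet size is my assumption for illustration; the memory bandwidth and topologies are the example figures above, and the last line shows the peak-rate edge the two-bus hybrid gets from the same memory-bandwidth split.

import math

# Assumptions: 30 Mbit/s node memory bandwidth split evenly across a node's
# link controllers (as argued above), 1000-bit packets (my choice, purely
# illustrative), light load, and "cross town" latency taken as worst-case
# hops times the time to clock one packet through a channel.
MEM_BW = 30e6      # bits/sec of node memory bandwidth
PACKET = 1000      # bits per packet (assumed)

def report(nodes):
    d = int(math.log2(nodes))              # hypercube: d pt-pt links, d hops worst case
    nodes_per_bus = round(nodes ** (1/3))  # "x,y,z" bus: nodes along each axis
    cube_link = MEM_BW / d                 # per-link rate within the memory budget
    bus_link = MEM_BW / 3                  # per-bus rate (3 controllers per node)
    cube_lat = d * PACKET / cube_link      # seconds "cross town", light load
    bus_lat = 3 * PACKET / bus_link
    # The bus saturates when the summed average traffic of its nodes fills it;
    # express that per-node load as a fraction of the memory bandwidth.
    saturation = 3 * (bus_link / nodes_per_bus) / MEM_BW
    print(f"{nodes} nodes: cube {cube_lat*1e6:.0f} us vs bus {bus_lat*1e6:.0f} us "
          f"cross town; bus saturates near {saturation:.0%} of memory bandwidth")

report(2**10)      # -> cube 3333 us vs bus 300 us; saturates near 10%
report(2**20)      # -> cube 13333 us vs bus 300 us; saturates near 1%
print(f"hybrid, 2 busses/node: {MEM_BW/2/1e6:.0f} Mbit/s peak per bus "
      f"vs {MEM_BW/3/1e6:.0f} for the 'x,y,z' bus")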
Degree of fanout can be controlled by the number of processors in each "hyper-point" and the degree of each "hyper-edge", allowing a tradeoff between heavy-load net carrying capacity and "cross town" delay. In that way, it's like a hypercube in which groups of pt-pt links have been folded into busses.

A related arrangement, which I haven't looked at closely, is to take the "topological dual" of a hypercube; that is, instead of a meganode array having processors with 20 links, you'd have busses with 20 processors on them. (How many busses per processor? I dunno. "Details at 11..." ;-} )

Still, there are no free lunches! I feel that a priori knowledge of the average link data rate (expressed as a percentage of node memory bandwidth) will remain a critical factor in the choice of a topology. This means that the optimum topology is (surprise!) application dependent.

Rob Warnock
Systems Architecture Consultant

UUCP:  {ihnp4,ucbvax!dual}!fortune!redwood!rpw3
DDD:   (415)572-2607
USPS:  510 Trinidad Lane, Foster City, CA  94404