[net.arch] Hypercubes

brooks@lll-crg.ARPA (Eugene D. Brooks III) (02/27/85)

WARNING: Serious flames here!

It really upsets me that an individual would claim that a bus architecture
is better than a Hypercube when the first thing he does in his argument is
give the bus a 300-megabyte-per-second bandwidth and give the hypercube the
equivalent of 9600-baud serial lines to communicate over.  If you need that
kind of edge to prove your point, you are in DEEP TROUBLE.

In the Caltech project the communication speed of each channel was on a par
with the floating-point speed, as it should be in numerical applications work.
If the floating-point speed of a node is 10 MFLOPS, as it is these days for
hardware with a good cost/performance ratio, the communication channels should
support a transfer rate to match.
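
As a back-of-the-envelope sketch of what "a transfer rate to match" means
(the 64-bit operand size and the worst case of one operand moved per
floating-point operation are my assumptions, not figures from the Caltech
work):

/* Sketch: what "a transfer rate to match" means for a 10 MFLOPS node.
 * The 8-bytes-per-flop figure is an assumption (64-bit operands, worst
 * case of one operand moved per floating-point operation).
 */
#include <stdio.h>

int main(void)
{
    double mflops         = 10.0e6;   /* floating-point ops per second */
    double bytes_per_flop = 8.0;      /* assumed: one 64-bit operand   */
    double bytes_per_sec  = mflops * bytes_per_flop;

    printf("matching channel rate: %.0f Mbyte/sec (%.0f Mbit/sec)\n",
           bytes_per_sec / 1.0e6, bytes_per_sec * 8.0 / 1.0e6);
    /* prints 80 Mbyte/sec, i.e. 640 Mbit/sec -- nowhere near 9600 baud */
    return 0;
}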

For those who wish to compare the Hypercube to the x, xy, xyz (or xyzt...) bus:
just what do you think a hypercube is?  It's an xyzt... bus with the number of
CPUs per bus optimized.  Just two CPUs connect to each bus (the word bus is
used loosely here).  What is better?  Nothing!
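
To make that concrete (this "k CPUs per bus, d dimensions" way of counting
is my framing, not something from the argument above): such a machine has
k**d nodes, each node sits on d busses, and a worst-case path crosses d
busses.  Set k = 2 and you have the hypercube; set d = 3 and you have an
"x,y,z" bus machine.  A quick sketch:

#include <stdio.h>

int main(void)
{
    /* Two members of the same "k CPUs per bus, d dimensions" family:
     * the hypercube (k = 2) and an "x,y,z" bus machine (d = 3).
     */
    int ks[] = {  2, 10 };
    int ds[] = { 10,  3 };
    int i, j;

    for (i = 0; i < 2; i++) {
        long nodes = 1;
        for (j = 0; j < ds[i]; j++)
            nodes *= ks[i];
        printf("k = %2d, d = %2d: %5ld nodes, %2d busses per node, "
               "%2d CPUs per bus, worst case %2d bus hops\n",
               ks[i], ds[i], nodes, ds[i], ks[i], ds[i]);
    }
    return 0;
}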

For those who want to complain that a Cube of order 20 would require each
processor to manage 20 messages on average, and that this is a bad thing:
there is of course a better way to build the hardware if this is a problem.
The problems to which the Cube has been put in the past used only communication
with neighboring processors and rather straightforward maps of the problem
onto the machine.  If your problems are of a sort that requires randomly
addressed message forwarding, then you need to consider the alternative (and
more expensive) hardware design.  There is the old adage "You get what you pay
for."  In fact, you probably should be considering a shared-memory machine.

	Perhaps even a Hypercube!

rpw3@redwood.UUCP (Rob Warnock) (03/01/85)

+---------------
| WARNING: Serious flames here!
| It really upsets me that an individual would claim that a bus architecture
| is better than a Hypercube when the first thing he does in his argument is
| give the bus a 300-megabyte-per-second bandwidth and give the hypercube the
| equivalent of 9600-baud serial lines to communicate over.  If you need that
| kind of edge to prove your point, you are in DEEP TROUBLE.
+---------------

Sorry to have inspired such flames. (My allocations of bandwidth were not
quite as skewed as you say, but that could be the heat of the flame speaking...
In <171@redwood.UUCP>, I believe I said 33 megaBITs and 50 kilobits, referring
to a 1 meganode array, or 20 pt-pt links. True, I did goof the pt-pt number
somewhat -- it should have been more like 1-2 megabits/sec [see below].)

When I entered the discussion, "x,y,z" bus systems were being harshly
criticized, compared to pure hypercubes of 2**N processors with N point-
to-point links each, on the basis that a given bus will only support a
moderate number of nodes due to contention for the bus bandwidth (the term
being used was "bus squabbling", as I recall). I wished to show that some
areas were being overlooked in the comparison, particularly (1) the effect
of memory bandwidth within a node, and (2) some queuing considerations.
Let's try again, and I'll be more careful this time...

My underlying assumptions, not stated clearly enough before, are that the
node processors are of low cost and are therefore constructed very simply,
using current network chips such as the Intel 82586 or the AMD/Mostek LANCE,
which DMA directly into and out of a node processor's memory.  Further, I
assume that the link controllers do NOT have packet-sized FIFOs or external
buffers, nor the corresponding problems with delays due to copying to/from
such buffers.  (This also follows from the assumption of simple, low-cost
nodes.)

Next, I assume that we are discussing fairly large arrays, such as 64 to
1024 processors or more, such that the "fanout" (of either busses in the
bus case or processors in the hypercube case) is 6 to 10 or more. Finally,
I assume that the processors have some work to do other than simply sending
packets around, so that the AVERAGE communications rate is MUCH lower than
the processor's memory speed. Specifically, the link rates cited previously
with respect to the 1 Mnode array (100 nodes/bus or 20 links) carried this
assumption. (See below, where I discuss bus saturation.) NOTE: In applications
which are not at all cost-sensitive, this last assumption may not hold strongly!

My assumptions were based on the kinds of applications I have in mind
for such large arrays, and it may be that our apparent disagreement is
due to differing assumptions. However, given these assumptions (which
are reasonable for a large class of problems), I felt the "x,y,z" bus
was being given short shrift.


My first point is simply that (with these assumptions) the memory bandwidth
of each processor's local memory bus is a limit to the PEAK communications
rate. If one is to avoid overruns in a straightforward way, the total peak
data link rate for all links (bus or point-to-point) connected to a given
processor should not exceed the memory bus bandwidth at that processor.

Hypercubes of the type being discussed will require a large number (6 to
20+) of point-to-point links per processor. The "x,y,z" bus system has
exactly three controllers per processor, regardless of the size of the bus.
In these configurations, and also assuming that the aggregate data rate of
the links (of either kind) does NOT exceed the memory rate, the pt-pt link
data rates must be significantly lower than the "x,y,z" bus rates. (The overheads of
switching DMA masters will further favor systems with few links, rather
than many.)

The second point is that in the contention for a processor's memory bus,
by simple queuing theory a small number of high-speed channels is much better
than a large number of low-speed channels, if the total bandwidth is the same.
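
A minimal sketch of that queuing point, under simplifying assumptions of my
own (each channel treated as an independent M/M/1 queue, traffic split
evenly, nominal 1000-bit packets): splitting a fixed total capacity into k
channels makes the mean per-packet delay grow linearly with k.

/* Each channel modeled as an independent M/M/1 queue:
 *   mean time in system  W = 1/(mu - lam)
 * with per-channel service rate mu = (C/k)/S packets/sec and
 * per-channel arrival rate lam = (lambda/k)/S packets/sec, so
 *   W = S*k / (C - lambda)
 * i.e. delay grows linearly with the number of channels k even
 * though the aggregate bandwidth C is unchanged.
 */
#include <stdio.h>

int main(void)
{
    double C      = 30.0e6;   /* total channel capacity: 30 Mbit/sec */
    double lambda =  3.0e6;   /* total offered load:      3 Mbit/sec */
    double S      = 1000.0;   /* packet size in bits (assumed)       */
    int kvals[] = { 3, 10, 20 };
    int i;

    for (i = 0; i < 3; i++) {
        double w = S * kvals[i] / (C - lambda);
        printf("k = %2d channels: mean packet delay = %6.1f usec\n",
               kvals[i], w * 1.0e6);
    }
    return 0;
}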

Some example numbers will be useful here. Let's assume (to keep the arithmetic
simple) a 1024 node array, of either the hypercube or "x,y,z" bus type. So each
hypercube node has 10 links, and each "x,y,z" bus has 10 nodes. Further, let's
assume that the memory bus is 30 Mbits/sec, so each of the three "x,y,z" busses
is 10 Mbits/sec (Ethernet?), and each hypercube pt-pt link is 3 Mbit/sec.

The average packet latency in this case (under light to moderate loads) is a
factor of three lower for the bus system, for next-door neighbors.  Each node
of the hypercube is directly connected to 10 other nodes, while each node of
the "x,y,z" bus is directly connected to 27 other nodes.  Put another way, the
maximum number of links (not nodes) a packet must traverse from one processor
to another is 3 for the bus and 10 for the hypercube, which works out to
roughly 10 times better latency to go "across town" on the bus. ;-}
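
For the record, here is the arithmetic behind those factors, assuming
store-and-forward at each hop and a nominal 1000-bit packet (both
assumptions of mine, just to make the numbers concrete):

#include <stdio.h>

int main(void)
{
    double pkt       = 1000.0;   /* packet size in bits (assumed)         */
    double bus_rate  = 10.0e6;   /* one "x,y,z" bus:          10 Mbit/sec */
    double cube_rate =  3.0e6;   /* one hypercube pt-pt link:  3 Mbit/sec */
    int    bus_hops  = 3;        /* worst case: one bus per dimension     */
    int    cube_hops = 10;       /* worst case: one link per dimension    */

    double bus_hop  = pkt / bus_rate;   /* store-and-forward time per hop */
    double cube_hop = pkt / cube_rate;

    printf("next door:   bus %4.0f usec, cube %4.0f usec (%4.1fx)\n",
           bus_hop * 1e6, cube_hop * 1e6, cube_hop / bus_hop);
    printf("across town: bus %4.0f usec, cube %4.0f usec (%4.1fx)\n",
           bus_hops * bus_hop * 1e6, cube_hops * cube_hop * 1e6,
           (cube_hops * cube_hop) / (bus_hops * bus_hop));
    return 0;
}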

Now, while in this example the "x,y,z" bus is favored for latency under light
to moderate loads, it DOES run out of steam if the average communications load
at every node gets close to a megabit per bus, or about 3 Mbit/sec of total
memory load per node.  Whether this is significant depends on the kind of
problem the processor array is trying to solve, and whether one considers
10% of the memory bandwidth being spent on bus/link DMA to be too much or
too little. But certainly there is a range of problems for which the "x,y,z"
bus is better suited than the hypercube, and vice-versa, and the COST of
the bus system is certainly much less (assuming the application has any
cost sensitivity!).

For a meganode array (for the same example memory speeds), the bus system
still uses 10 Mbit/sec links, but the hypercube links drop to 1.5 Mbit/sec
(T-1 carrier on twisted pairs maybe?). The "x,y,z" bus latency is more than
40 times better than the hypercube's, "cross town" under light load. Conversely,
the "x,y,z" bus system saturates at an average communications load per node
of only 1% of the memory bandwidth. Although there will still be applications
for which the "x,y,z" bus system is appropriate, I feel that NEITHER the
hypercube nor the "x,y,z" bus is an ideal solution for such large arrays.
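
The saturation and "across town" figures for both array sizes fall out of
the same arithmetic; a sketch, again with my assumed parameters and ignoring
queuing effects near saturation:

#include <stdio.h>

/* For a given array size: the pt-pt link rate is the memory bandwidth
 * divided by the number of links; the bus system saturates when each
 * node's fair share of its three busses is used up; the "across town"
 * latency ratio compares worst-case hops divided by link rate for the
 * two topologies.
 */
static void show(long nodes, int nodes_per_bus, int cube_links)
{
    double mem_bw    = 30.0e6;               /* node memory bus, bit/sec */
    double bus_rate  = 10.0e6;               /* one-third of memory bw   */
    double cube_rate = mem_bw / cube_links;  /* pt-pt link rate          */
    double sat   = 3.0 * (bus_rate / nodes_per_bus);
    double ratio = (cube_links / cube_rate) / (3.0 / bus_rate);

    printf("%7ld nodes: cube link %.1f Mbit/sec; bus saturates at "
           "%.1f Mbit/sec/node (%.0f%% of memory bw); "
           "cross-town latency ratio %.0fx\n",
           nodes, cube_rate / 1e6, sat / 1e6, 100.0 * sat / mem_bw, ratio);
}

int main(void)
{
    show(1024L, 10, 10);       /* the 1024-node example above */
    show(1048576L, 100, 20);   /* the meganode example        */
    return 0;
}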

+---------------
| For those who want to complain that a Cube of order 20 would require each
| processor to manage 20 messages on average, and that this is a bad thing:
| there is of course a better way to build the hardware if this is a problem...
+---------------

I am interested in hearing your "better way" (or anybody's).  Please note
that just because I defended the "x,y,z" bus versus the hypercube with pt-pt
links, that doesn't mean I especially like either one!

The method I currently favor (which I have talked about several times)
is a hybrid of the hypercube and the "x,y,z" bus, which has exactly two
busses per processor.  (For more, see previous postings.) Since it has
only two busses, it's better than even the "x,y,z" bus at peak data rate
(and hence has lower latency at light net load).  The degree of fanout can be
controlled by the number of processors in each "hyper-point" and the
degree of each "hyper-edge", allowing a tradeoff between heavy-load net
carrying capacity and "cross town" delay. In that way, it's like a
hypercube for which groups of pt-pt links have been folded into busses.

A related arrangement, at which I haven't looked closely, is to take the
"topological dual" of a hypercube; that is, instead of a meganode array
having processors with 20 links, you'd have busses with 20 processors on
them. (How many busses per processor? I dunno. "Details at 11..." ;-} )

Still, there are no free lunches!  I feel that a priori knowledge of the
average link data rate (expressed as a percentage of node memory bandwidth)
will remain a critical factor in the choice of topology.  This means that the
optimum topology is (surprise!) application dependent.


Rob Warnock
Systems Architecture Consultant

UUCP:	{ihnp4,ucbvax!dual}!fortune!redwood!rpw3
DDD:	(415)572-2607
USPS:	510 Trinidad Lane, Foster City, CA  94404