[net.arch] intel's cube

wjafyfe@watmath.UUCP (Andy Fyfe) (03/04/85)

Several concerns about the cube have been raised by Ian Kaplan & Rob Warnock.

Network Routing.

The communication channels are set up so that 2 processors communicate
via channel i if their binary node ids differ only in bit position i.
As an example, to go from node 19 to node 42 in a 6-cube, one path is
	010011		19	(via channel 0 to)
	010010		18	(via channel 3 to)
	011010		26	(via channel 4 to)
	001010		10	(via channel 5 to)
	101010		42
Network routing is easy, and the further a message has to go, the greater
the number of shortest routes.  As Ian states, processor intervention is
required to transfer the message from one channel to another.  With
some care the re-routing load could be, on average, well distributed.
(There are separate transmit and receive lines connecting the node-to-node
ethernet chips, so there is no contention there.  There may be
contention getting from memory to the chip, though.)
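
The routing rule above can be sketched in a few lines.  This is only an
illustration, not Intel's routing code: the function names are made up,
and correcting the differing bits from lowest to highest is just one of
the possible orders.  The route count assumes that every ordering of the
differing bits gives a distinct shortest path, i.e. d! paths for d
differing bits.

```python
from math import factorial


def route(src, dst):
    """Return the (channel, node) hops from src to dst.

    Assumption: differing bits are corrected from the lowest channel
    to the highest; any other order is an equally short route.
    """
    hops = []
    node = src
    diff = src ^ dst          # set bits mark the channels that must be crossed
    channel = 0
    while diff:
        if diff & 1:
            node ^= 1 << channel      # cross channel `channel`
            hops.append((channel, node))
        diff >>= 1
        channel += 1
    return hops


def shortest_route_count(src, dst):
    """Number of shortest routes: d! for d differing bits."""
    d = bin(src ^ dst).count("1")
    return factorial(d)


if __name__ == "__main__":
    for channel, node in route(19, 42):
        print(f"via channel {channel} to {node:06b} ({node})")
    print("shortest routes:", shortest_route_count(19, 42))
```

For 19 to 42 this reproduces the path listed above (channels 0, 3, 4, 5),
and the count comes out to 4! = 24 shortest routes.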

Broadcasting.

If the host needs to broadcast a message, the global ethernet could be used.
Suppose a node wants to broadcast a message.  It too could use the global
channel.  However, I would expect, in most cases, that if one node wants
to broadcast a message, all of the nodes would.  There are then 2 choices.
The broadcast could be serialized, with each processor sending its message
to all of the nodes.  Alternatively, it is very easy to map the cube onto
a ring (via Gray codes) and have each processor broadcast its message in
parallel to the next one in the ring, and have messages circulate.  In
either case, one iteration is required for each node.  The latter turns
out to be very natural in loop-type algorithms.
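
A rough sketch of the ring mapping and the circulating broadcast, as a
simulation rather than real node code.  The names are made up for the
example, and it assumes the standard reflected Gray code (i XOR i>>1),
which makes ring neighbours differ in exactly one bit, so each ring
link is a real cube channel.

```python
def gray(i):
    """Standard reflected Gray code of i."""
    return i ^ (i >> 1)


def ring_order(dim):
    """Cube node ids in ring order for a cube of dimension `dim`."""
    return [gray(i) for i in range(1 << dim)]


def ring_all_to_all(dim):
    """Simulate every node broadcasting by passing messages round the ring.

    Each step, every node forwards the message it last received (its own
    message on the first step) to its successor.  Returns a dict mapping
    each node to the set of originators it has heard from.
    """
    ring = ring_order(dim)
    n = len(ring)
    received = {node: {node} for node in ring}    # each node knows its own message
    in_flight = {node: node for node in ring}     # message each node forwards next
    for _ in range(n - 1):                        # n - 1 forwarding steps
        arriving = {}
        for pos, node in enumerate(ring):
            successor = ring[(pos + 1) % n]
            arriving[successor] = in_flight[node]
        in_flight = arriving
        for node, msg in in_flight.items():
            received[node].add(msg)
    return received


if __name__ == "__main__":
    print(ring_order(3))
    print(ring_all_to_all(3)[0])
```

In this simulation, n - 1 forwarding steps are enough for all n nodes to
see every message, which is in line with one iteration per node.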

Software.

Software will be a problem, at least for a while.  Again, via Gray codes,
it is very easy to map the cube onto a ring, a 2D mesh, etc., so software
designed to run on such a machine can be ported quickly to the cube.
Library routines will help.  I haven't seen the intel software, so I
don't know exactly what will be provided.  This machine, like any other,
will need a lot of useful software if it is to be successful (in my
opinion, at least).  Time will tell.
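
To make the 2D mesh mapping concrete, here is one possible sketch (not
taken from any Intel library; the names are invented).  It assumes the
mesh sides are powers of two and Gray-codes the row and column indices
separately, so that stepping one place along either axis changes
exactly one bit of the node id.

```python
def gray(i):
    """Standard reflected Gray code of i."""
    return i ^ (i >> 1)


def mesh_to_node(row, col, col_bits):
    """Cube node id for mesh position (row, col).

    Assumption: the mesh has 2**col_bits columns; the row's Gray code
    occupies the high bits and the column's Gray code the low bits.
    """
    return (gray(row) << col_bits) | gray(col)


def check_mesh(row_bits, col_bits):
    """True if every pair of mesh neighbours differs in exactly one bit."""
    rows, cols = 1 << row_bits, 1 << col_bits
    for r in range(rows):
        for c in range(cols):
            here = mesh_to_node(r, c, col_bits)
            if c + 1 < cols:
                right = mesh_to_node(r, c + 1, col_bits)
                if bin(here ^ right).count("1") != 1:
                    return False
            if r + 1 < rows:
                below = mesh_to_node(r + 1, c, col_bits)
                if bin(here ^ below).count("1") != 1:
                    return False
    return True


if __name__ == "__main__":
    # An 8 x 8 mesh laid onto a 6-cube.
    print(check_mesh(3, 3))
```

Since every mesh neighbour is then a directly connected cube node, code
written for a mesh should need no multi-hop routing for its
nearest-neighbour traffic.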

As part of the Waterloop project at Waterloo, some research has been done
on loop algorithms for such things as matrix manipulation, Monte Carlo
simulations, etc., and all of these algorithms will run on the cube.

Bus Contention.

This was raised by Rob Warnock.  It's something we're concerned about.
There are 8 ethernet (dma based) controllers, and a cpu on the local bus.
There is not enough bandwidth for all of them to be active at the same
time.  How bad the problem will be is another question, though.  Again,
I expect that the same code will run on all nodes, and will tend to
be synchronous (all nodes will be doing the same thing at about the
same time) with the synchronization being encouraged by message passing
(one node may have to busy wait because it's waiting for a message, for
example).  And the more message passing that goes on, the less efficient
the cube will be.  It may be that in practice at most 3 of the ethernet
channels are typically active at any point in time: one send and one
receive channel (for inter-node communication), plus the
global receive channel accepting a message from the host.  I like
Rob's hyper-point (alias fat point) though -- it doesn't require
good luck.  It's worth a serious look (maybe Intel is listening).
Of course, we (Waterloop people) would just map the machine onto a
loop anyway! :-)

-----------------------------------------------------------------------

The cube is not a replacement for a Cray.  But multi-processors are the way
I believe we have to go.  Even on a Cray, to be efficient, you need to
have "vectors" in your algorithm, which is just another way of saying
"parallelism".  And even the Crays are becoming multi-processor machines.

--Andy Fyfe		...!{decvax, allegra, ihnp4, et al.}!watmath!wjafyfe
			wjafyfe@waterloo.csnet