[net.arch] topology of highly connected computer systems

bradley@think.ARPA (Bradley C. Kuszmaul) (04/26/85)

Of course the "triangular" extensions are totally connected, which
really makes them very interesting or very boring depending on your
point of view.

They are interesting in that it is very easy to do routing, since
everything is connected.  The software for a totally connected computer
is easy, and easy software (being an oxymoron) is interesting.

They are boring in that you can't implement them for very large nets.
In general, if you have $n$ processors you need $n(n-1)/2$ wires (i.e.
$O(n^2)$) to completely connect them, and $n-1$ connections (ports) at
each processor.  The wire mileage quickly gets out of hand as $n$ grows,
and the layout becomes tricky as well.

In a cube containing $n=2^m$ processors (i.e. an $m$-cube), there are
$m$ ports on each processor, hence $nm/2$ wires (i.e. $O(n\log n)$)
which is much better than $n^2$.  This is still a lot of wires to lay
out, but for something like the cosmic cube (64 processors?) we are only
talking about 192 wires, (probably expensive coaxial wires), which is
not too much of a pain to actually construct in 3-space.
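
Here is a quick back-of-the-envelope check of those counts, as a Python
sketch (the node counts and function names are just examples of mine):

  # Compare wire counts for a completely connected network versus an
  # m-cube with the same number of processors.
  from math import log2

  def complete_graph_wires(n):
      # every pair of processors gets its own wire
      return n * (n - 1) // 2

  def hypercube_wires(n):
      # n = 2^m processors, m ports each, every wire shared by two ports
      m = int(log2(n))
      assert 2 ** m == n, "a cube needs a power-of-two processor count"
      return n * m // 2

  for n in (64, 1024, 2 ** 16, 2 ** 20):
      print(n, complete_graph_wires(n), hypercube_wires(n))
  # n = 64 gives 192 cube wires versus 2016 for total connection.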

I am much more interested in what happens as $n$ becomes much larger
(e.g. $n=10^6$).  The wiring problem becomes very hard again.  The
Connection Machine with 64K processors in the prototype faces an
engineering challenge on this front.

One nice thing about cube topologies is that they are easy to perform
routing on (the relative address says which dimensions to cross, and
there is a wire crossing every dimension from every node), and they are
fairly redundant: there are lots of ways to get from A to B, so that if
some node or wire fails, the machine might be able to keep running
(fault tolerance is important for a million processor machine).
Redundancy is also nice because it means that if lots of messages want
to go from A to B, then we may be able to send more than one at a time
through the net.
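
To make the redundancy concrete: two nodes at Hamming distance $d$ have
one shortest path for every order in which the differing dimensions can
be crossed, i.e. $d!$ of them.  A small sketch (my own illustration, with
made-up function names):

  # Enumerate the shortest paths between two hypercube nodes.  Each path
  # corresponds to one ordering of the dimensions that differ.
  from itertools import permutations

  def shortest_paths(a, b, dims):
      differing = [i for i in range(dims) if (a ^ b) >> i & 1]
      paths = []
      for order in permutations(differing):
          node, path = a, [a]
          for dim in order:
              node ^= 1 << dim          # cross one dimension per hop
              path.append(node)
          paths.append(path)
      return paths

  print(len(shortest_paths(0b0000, 0b0111, 4)))   # 3! = 6 distinct routes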

One bad thing about cube topologies (at least for cubes bigger than
about 10 dimensions) is that they do not use their bandwidth very
effectively.  It has been claimed (with proof - references may be
available on demand) that a 14 cube sending an "average" set of messages
only uses about one fourth of the bandwidth because of collisions.
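
I have not reproduced that analysis here, but the flavor of the problem
shows up even in a toy experiment: route one random permutation with
fixed dimension-order routing and see how much of the traffic has to
share a wire with other traffic.  (This is only a back-of-the-envelope
model of mine, not the proof referred to above, and its output should
not be read as the one-fourth figure.)

  # Route a random permutation on an m-cube with dimension-order routing,
  # count how many times each directed wire is used, and report the
  # fraction of hops that land on a wire needed by more than one message.
  import random
  from collections import Counter

  def contention_fraction(m, trials=10):
      n = 2 ** m
      results = []
      for _ in range(trials):
          dest = list(range(n))
          random.shuffle(dest)             # an "average" set of messages
          use = Counter()                  # directed wire -> times used
          for src in range(n):
              node = src
              for dim in range(m):         # fix dimensions low bit first
                  if (node ^ dest[src]) >> dim & 1:
                      use[(node, node ^ (1 << dim))] += 1
                      node ^= 1 << dim
          hops = sum(use.values())
          shared = sum(c for c in use.values() if c > 1)
          results.append(shared / hops)
      return sum(results) / len(results)

  print(contention_fraction(10))           # 10 dimensions, for speed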

It has been claimed, in this discussion, that cube topologies are nice
because they can simulate other topologies easily.  Actually, this is
not really a good argument, since most of the other seriously considered
topologies are equally good at simulating each other.

(Handwaving proof:  Suppose topology X can be easily simulated on a
cube.  There must be an "easy" mapping from the nodes in X to the nodes
in the cube (i.e.  an enumeration, since the nodes in the cube have an
"easy" mapping from themselves to their addresses).  This "easy" mapping
should be bijective, or we will be wasting computational resources.
"Easy" bijective mappings should be easy to invert, so the cube can be
simulated on X "easily".  Thus if X and Y can be simulated easily on
a cube, then X and Y can easily simulate each other.)

I suspect that the real reason that people are building N-cubes instead
of skewed rings, minimum spanning trees, or some other topology, is that
computer scientists have a bias towards binary representations of
things, and the cube topology is nice and binary, and because routing is
so easy.  (Routing is easy because if $a$ is the absolute address of
processor $X$ and $b$ is the absolute address of processor $Y$, then
$a \oplus b$ (bitwise XOR) is the relative address of $Y$ from $X$, and
of $X$ from $Y$.)
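
A minimal sketch of that routing rule (illustration only; the function
name is made up):

  # The XOR of the two absolute addresses is the relative address; each
  # 1 bit names a dimension the message still has to cross.
  def route(a, b):
      """Return one path of node addresses from processor a to processor b."""
      path, relative, dim = [a], a ^ b, 0
      while relative:
          if relative & 1:
              a ^= 1 << dim              # cross that dimension
              path.append(a)
          relative >>= 1
          dim += 1
      return path

  print(route(0b0101, 0b0110))           # [5, 4, 6] in a 3-cube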

-Brad


-- 
bradley@think.uucp  (i.e. {decvax!cca,ihnp4!mit-eddie}!think!bradley)
bradley@THINK.ARPA

brooks@lll-crg.ARPA (Eugene D. Brooks III) (04/28/85)

> out, but for something like the cosmic cube (64 processors?) we are only
> talking about 192 wires, (probably expensive coaxial wires), which is
> not too much of a pain to actually construct in 3-space.

The 64 processor machine was wired in a single backplane with wire-wrap
wire, not coaxial cable.  The wiring was not even twisted pairs.  Of
course it probably would have been a lot less flaky with twisted pairs
and the appropriate drivers.


> I am much more interested in what happens as $n$ becomes much larger
> (e.g. $n=10^6$).  The wiring problem becomes very hard again.  The
> Connection Machine with 64K processors in the prototype faces an
> engineering challenge on this front.

A million processors is not a problem at all (well, perhaps I am stretching
it a bit; I don't actually advocate a million processors.  I think more along
the lines of a few hundred 20 MIP/MFLOP processors as being interesting).
How to construct a very large N-cube can easily be worked out on the back of
an envelope.  You accomplish the feat by building and wiring the machine in 3
dimensions.  32K processors is a piece of cake (it's only 32 processors on a
side) and one million processors is within the realm of possibility.
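
One such back-of-the-envelope sketch (the particular address-to-position
mapping below is only an illustration of mine, not necessarily the layout
intended here): split the 15-bit address into three 5-bit fields and use
them as x, y, z positions in a 32 by 32 by 32 array, so that every cube
wire runs parallel to one axis.

  # Place a 32K-node (15-dimensional) cube in three dimensions by using
  # three 5-bit fields of the address as x, y, z coordinates.  Every wire
  # then connects positions differing in one coordinate by a power of two,
  # so the longest run is 16 positions along one axis.
  def position(addr):
      return (addr & 0x1F, (addr >> 5) & 0x1F, (addr >> 10) & 0x1F)

  def longest_run():
      longest = 0
      for addr in range(2 ** 15):
          for dim in range(15):
              x0, y0, z0 = position(addr)
              x1, y1, z1 = position(addr ^ (1 << dim))
              longest = max(longest, abs(x1 - x0), abs(y1 - y0), abs(z1 - z0))
      return longest

  print(longest_run())                   # 16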

> if some node or wire fails, the machine might be able to keep running.
> (fault tolerance is important for a million processor machine)).
> Redundancy is also nice because it means that if lots of messages want
> to go from A to B, then we may be able to send more than one at a time
> through the net.
Deadlock-free routing requires that a specific path be used for each
sender/receiver pair.

> One bad thing about cube topologies (at least for cubes bigger than
> about 10 dimensions) is that they do not use their bandwidth very
> effectively.  It has been claimed (with proof - references may be
> available on demand) that a 14 cube sending an "average" set of messages
> only uses about one fourth of the bandwidth because of collisions.
I would like to see the reference list; please send it to me via email.
By the same token, you might send me your SNAIL MAIL address if you are
interested in a paper describing a solution to the bandwidth problem.
Request the paper titled "Shared Memory Hypercube".
The switching technique gives N log N bandwidth, tested by simulation to
N=1024, and it adaptively removes conflicts.  My use of the network is to
create a shared memory server for a hypercube machine with CRAY style
vector processors.  With simultaneous vector access by all processors the
packet conflict problem is as severe as it can be.  The network provides
FULL bandwidth to all processors simultaneously.
 
> It has been claimed, in this discussion, that cube topologies are nice
> because they can simulate other topologies easily.  Actually, this is
> not really a good argument, since most of the other seriously considered
> topologies are equally good at simulating each other
The cube topology is nice as it has other useful topologies EMBEDDED within
it.  We are not talking about easy simulation here; efficient execution is
the target.  The wires are there!
 
> I suspect that the real reason that people are building N-cubes instead
> of skewed rings, minimum spanning trees, or some other topology, is that
> computer scientists have a bias towards binary representations of
> things, and the cube topology is nice and binary, and because routing is
> so easy.
Admittedly the routing is easy; this is one of the reasons for the cube's
attractiveness.  The real reason for the interest in it, however, is that
it has a genuine bandwidth edge over the other topologies.

bradley@think.ARPA (Bradley C. Kuszmaul) (04/30/85)

I am going to disagree a lot with brooks@lll-crg.ARPA, but don't take it
personally.  It sounds like Mr. Brooks has given the topic a lot of
thought, and most of our differences seem to come from different
assumptions.  Once we start with different assumptions, it's hard to reach
the same conclusions.  Some of my assumptions are:
  - Lots and lots and lots of small processors are better than fewer big
processors.
  - Any particular connection topology will be not quite right for most
applications, so we might be willing to sacrifice best-case behavior
for more generality or better average-case behavior (for example, nearest
neighbor communication in a 2-dimensional grid is very good, but not
very general; a cube is well connected, but may use too many wires to
achieve its bandwidth).

In article <551@lll-crg.ARPA> brooks@lll-crg.ARPA (Eugene D. Brooks III) writes:
>> out, but for something like the cosmic cube (64 processors?) we are only
>> talking about 192 wires, (probably expensive coaxial wires), which is
>> not too much of a pain to actually construct in 3-space.
>
>The 64 processor machine was wired in a single backplane with wire-wrap
>wire, not coaxial cable.  The wiring was not even twisted pairs.  Of
>course it probably would have been a lot less flaky with twisted pairs
>and the appropriate drivers.

Sorry about that.  I was under the impression from the discussion in
this group that the cosmic cube's connections are implemented with
ethernet controllers.  My main point was that a 6-cube is easy to wire
up, and it only gets easier by using smaller wires.

>> I am much more interested in what happens as $n$ becomes much larger
>> (e.g. $n=10^6$).>
>A million processors is not a problem at all (well, perhaps I am stretching
>it a bit; I don't actually advocate a million processors.  I think more
>along the lines of a few hundred 20 MIP/MFLOP processors as being
>interesting).

I beg to differ.  I think that a million processors is not nearly
enough.  I believe that our differences probably stem from different
approaches to mapping applications onto the parallel processors.  If I
were doing something like electrical simulation of a VLSI
circuit, I would like to have each processor simulate a single
transistor.  If I am doing finite element analysis, each processor might
simulate one of the elements.  While it is true that by speeding up the
processors, and using fewer of them, we should be able to get the same
performance on some applications, I tend to think that it is possible to
get more MIPS per dollar by using smaller, cheaper processing elements.

>>  The wiring problem becomes very hard again.  The
>> Connection Machine with 64K processors in the prototype faces an
>> engineering challenge on this front.
>How to
>construct a very large N cube can easily be worked out on the back of an
>envelope.  You accomplish the feat by building and wiring the machine in 3
>dimensions.  32k processors is a piece of cake (its only 32 processors on a
>side) and one million processors is within the realm of possibility.

I don't see how wiring a 3-d cube with 32 processors on a side gives us a
15-cube.  It is important to realize that if you bisect an N-cube
"through" one of its dimensions (i.e. pick some $0 \le i < N$, put all the
processors with 0 in the $i$th bit of their address on one side of the
cut, and all the other processors on the other side), then every processor
has one wire crossing the bisection.  (E.g. for a 32K machine, there are
16K processors on each side, and each processor can communicate directly
with one processor on the other side of the bisection, so there are 16K
wires crossing the bisection.)  For a million processor machine, there
are half a million wires crossing the bisection.  I still claim that it
is hard to lay it out in three-space.
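
The count is easy to check mechanically (sketch, illustrative only):

  # Count the wires that cross a bisection of an m-cube made by cutting
  # on bit i.  Each node has exactly one wire across the cut, and each
  # such wire is shared by one node on each side, so the total is n/2.
  def wires_across_bisection(m, i):
      crossing = 0
      for node in range(2 ** m):
          for dim in range(m):
              neighbor = node ^ (1 << dim)
              if node < neighbor and (node >> i & 1) != (neighbor >> i & 1):
                  crossing += 1                    # count each wire once
      return crossing

  print(wires_across_bisection(15, 0))   # 16384 wires for a 32K machine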

>> if some node or wire fails, the machine might be able to keep running.
> (fault tolerance is important for a million processor machine)).
>> Redundancy is also nice because it means that if lots of messages want
>> to go from A to B, then we may be able to send more than one at a time
>> through the net.
>Deadlock-free routing requires that a specific path be used for each
>sender/receiver pair.
We probably are thinking of routing from different points of view.  I
don't believe that deadlock-free routing requires a specific path be
used for each pair.  Proof: Suppose we have a cube-connected,
packet-switched network with enough storage at each vertex to store one
message from every other vertex (usenet in the year 2010?).  Suppose
that we further place the requirement that no PE (= vertex) is allowed
to inject a message into the network unless its previously injected
messages have already been delivered.  (We might require that every
processor which wants to send a message do it at the same time, and then
wait for the network to empty out before we send any more messages.)
There will be no deadlock, since even if all the messages want to go to
the same place, there will be enough storage for them.
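
Here is a toy model of that argument (my own sketch, not a real router):
give every node one buffer slot per possible source and let each source
have one outstanding message; then even the worst case, everyone sending
to node 0, drains without any forwarding step ever finding its slot full.

  # Every node keeps one buffer slot per possible source, and a source may
  # not inject a new message until its previous one is delivered.  The slot
  # for source s at the next hop is always free, because the only message
  # from s is the one being moved, so nothing can deadlock.
  def next_hop(node, dest):
      diff = node ^ dest
      dim = (diff & -diff).bit_length() - 1     # lowest differing dimension
      return node ^ (1 << dim)

  def run(m):
      n = 2 ** m
      slots = [{src: None for src in range(n)} for _ in range(n)]
      messages = {src: src for src in range(1, n)}   # source -> current node
      for src, node in messages.items():
          slots[node][src] = "msg"                   # parked at the source
      while messages:
          for src, node in list(messages.items()):
              nxt = next_hop(node, 0)                # everyone sends to 0
              assert slots[nxt][src] is None, "per-source slot must be free"
              slots[node][src] = None
              if nxt == 0:
                  del messages[src]                  # delivered and consumed
              else:
                  slots[nxt][src] = "msg"
                  messages[src] = nxt
      print("all", n - 1, "messages delivered, no slot was ever full")

  run(6)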


>> One bad thing about cube topologies (at least for cubes bigger than
>> about 10 dimensions) is that they do not use their bandwidth very
>> effectively.  It has been claimed (with proof - references may be
>> available on demand) that a 14 cube sending an "average" set of messages
>> only uses about one fourth of the bandwidth because of collisions.

>I would like to see the reference list, please send it to me via email.
Ok, this may take a second...

>By the same token, you might send me your SNAIL MAIL address if you are
>interested in a paper describing a solution to the bandwidth problem.
>Request the paper titled "Shared Memory Hypercube".

Please send the paper to

  Bradley C. Kuszmaul
  Thinking Machines Corporation
  245 First Street
  Cambridge, MA 02142

>The switching technique gives N log N bandwidth, tested by simulation to N=1024,
>and it adaptively removes conflicts.
Ah, but there are on the order of 2^N wires in the machine.  Sounds like
the wires are not being used very effectively.

> My use of the network is to create a
>shared memory server for a hypercube machine with CRAY style vector processors.
>With simultaneous vector access by all processors the packet conflict problem
>is as severe as it can be.  The network provides FULL bandwidth to all
>processors simultaneously.

It sounds to me like the packet conflict problem is as nice as it could
possibly be for that problem.  Presumably processor number $i$ is
accessing memory $f(i)$ where $f$ is one-to-one.  What happens if $f$ is
not one-to-one (e.g. if $f(i)=c$ for some constant $c$)?  If $f$ were
always the same function, I would just use local memory and flush the
topology.  Thus, for your application, $f$ must be a different function
on different memory accesses.  For how many different functions $f$ can
you get the full bandwidth?  I doubt that you can do it for every
one-to-one function.
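
To put a number on the difference (a throwaway sketch of mine): with a
one-to-one $f$ every memory module sees one request, while with $f(i)=c$
one module sees all of them, and the busiest module bounds how long the
access takes no matter what the network in between does.

  # Count requests per memory module for a one-to-one access pattern
  # versus a constant one.  The busiest module bounds the access time.
  from collections import Counter

  def worst_module_load(f, n):
      return max(Counter(f(i) for i in range(n)).values())

  n = 1024
  print(worst_module_load(lambda i: (i * 7 + 3) % n, n))   # one-to-one: 1
  print(worst_module_load(lambda i: 42, n))                # constant: 1024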

Random thoughts and questions:
Why do you have shared memory?  Presumably this is so that processors
can communicate; why don't you just use the cube network to communicate
directly?  What is the start-up time for your vectors (i.e. how big does
a vector have to be before the vector processing part wins over the
scalar processing part)?  Typical vector processors are limited in their
speed by lack of memory bandwidth (this is true even for a single
processor with high-bandwidth memories; e.g. the CRAY uses a 16-way
interleaved memory in order to try to increase the memory bandwidth).  I
don't see how putting a cube topology between the memory and the
processors, and adding processors, helps this situation.  Is it the case
that your vector processors are an order of magnitude slower than a
CRAY?  (You mention 20 MFLOPS earlier in the article.)


>> It has been claimed, in this discussion, that cube topologies are nice
>> because they can simulate other topologies easily.  Actually, this is
>> not really a good argument, since most of the other seriously considered
>> topologies are equally good at simulating each other
>The cube topology is nice as it has other useful topologies EMBEDDED within
>it.  We are not talking about easy simulation here; efficient execution is
>the target.  The wires are there!

Too many wires, and not very effectively used.  Suppose you spent the
same amount of money on a different network which gave you twice the
bandwidth for most of your applications.  Would the N-cube still make
sense?  The nice thing about the TOTALLY connected topology is that
EVERY other topology is embedded within it, but no one seriously
considers wiring up a totally connected network with a million
processors in it.  (10^12 wires?)

>> I suspect that the real reason that people are building N-cubes instead
>> of skewed rings, minimum spanning trees, or some other topology, is that
>> computer scientists have a bias towards binary representations of
>> things, and the cube topology is nice and binary, and because routing is
>> so easy.
>Admittedly the routing is easy; this is one of the reasons for the cube's
>attractiveness.  The real reason for the interest in it, however, is that
>it has a genuine bandwidth edge over the other topologies.

That is not really true.  One of the topologies mentioned in this
discussion (somewhere in an earlier posting) is the $k$-shuffle.  This
network achieves better bandwidth by decreasing the distance between
nodes in the graph, while using the same vertex degree and about the
same number of wires as a cube.  $k$-shuffles are just harder to
understand:

  Write your address in base $k$.

  Now take this address and shift it left one digit (i.e. $\log_2 k$
bits), dropping any digit that overflows (i.e. working modulo $n$).  Now
add any number from $0$ to $k-1$ to the result; your original processor
is connected to that processor.

  If there are $n$ processors in the system, then the maximum distance
between two processors is $\log n / \log k$.  For 64K processors in a
16-shuffle (which uses the same number of wires as a 64K 16-cube), the
maximum distance between two processors is 4 jumps (compared to 16 jumps
for the cube).  This means that there are likely to be only a quarter as
many conflicts, which increases the bandwidth.  If you actually had some
problem which only did nearest-neighbor communication in an n-cube, we
could run your program almost as quickly on the k-shuffle topology,
because at worst there would be a four-processor jump to perform your
communication (instead of one jump).  Very few programs actually use the
n-cube directly (although there are a few, e.g. subcube induction to
count the number of processors in a given state [D.P. Christman,
"Programming the Connection Machine", MIT masters thesis, January,
1984]), and for most programs the $k$-shuffle will be at least as good.
Routing can be tricky for a $k$-shuffle.  There are topologies with even
better bandwidth which do have easy routing.
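
A small sketch of the construction as I read it (including the wraparound
modulo $n$), with the maximum distance checked by breadth-first search;
$k$ and $n$ here are small example values, not the 64K case:

  # Neighbors of node a in a k-shuffle on n nodes: shift the base-k
  # address left one digit (mod n) and add any digit 0 .. k-1.  A
  # breadth-first search confirms the maximum distance is log n / log k.
  from collections import deque
  from math import log

  def neighbors(a, n, k):
      return [(k * a + d) % n for d in range(k)]

  def max_distance(n, k):
      worst = 0
      for start in range(n):
          dist = {start: 0}
          queue = deque([start])
          while queue:
              node = queue.popleft()
              for nxt in neighbors(node, n, k):
                  if nxt not in dist:
                      dist[nxt] = dist[node] + 1
                      queue.append(nxt)
          worst = max(worst, max(dist.values()))
      return worst

  n, k = 256, 4
  print(max_distance(n, k), log(n) / log(k))   # both come out to 4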


Happy connections.
 -Bradley
-- 
bradley@think.uucp  (i.e. {decvax!cca,ihnp4!mit-eddie}!think!bradley)
bradley@THINK.ARPA