[net.arch] "The Shared Memory Hypercube"

brooks@lll-crg.ARPA (Eugene D. Brooks III) (05/05/85)

> A machine in a box 32 feet on a side.  Wow!  I've been considering
> building big machines (I've been thinking about machines the size of a
> small building too), and I'm glad someone else thinks the way I do :-)
Have you ever seen a CDC7600 or a CYBER205?  If you drop by LLNL sometime
I would be glad to show you a 7600.  Neither of them fits in your pocket.
In spite of this fact, I did not propose actually building a 15 cube;
as noted, I want no more than 1024 10 MFLOP nodes for quite some time
and will gladly accept 128 of them as a start.  We are in the process
of examining parallel processors for a future experimental machine now, and a
128 node shared memory hypercube would be a clear winner.

> foot cubed processors).  And the speed of light across a 32
> foot region is going to be fun to deal with (I know I am looking
> forward to it:  Let's see now, what time is it anyway?)  The speed of
> light problem can, of course, be dealt with in lots of ways (decrease
> the clock speed of the network, but keep the processors going faster,
> etc.)
Now you know why I look at a performance level of 10 MFLOP per node.  You can
do this with a 100 ns clock.  Travel time over coax is about 32 ns for 32 feet.
I don't think that CRAY 2 speeds (4ns clock) will be possible in modest
N (100-1000) shared memory multiprocessors.
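
For what it is worth, here is the arithmetic in C, using the usual rule of
thumb of roughly a foot per nanosecond (a real coax run is somewhat slower
than that, so the number below is if anything optimistic):

    /* Travel time across the box versus the clock period.  One foot per
     * nanosecond is the free space rule of thumb; coax is a bit slower. */
    #include <stdio.h>

    int main(void)
    {
        double ft_per_ns = 1.0;      /* rough speed of light              */
        double run_ft    = 32.0;     /* longest wire across a 32 foot box */
        double delay_ns  = run_ft / ft_per_ns;

        printf("one-way travel time: %.0f ns\n", delay_ns);
        printf("fraction of a 100 ns clock: %.2f\n", delay_ns / 100.0);
        printf("fraction of a   4 ns clock: %.2f\n", delay_ns / 4.0);
        return 0;
    }

A third of a 100 ns clock is workable; eight 4 ns clocks is not.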

> >> >Deadlock free routing requires that a specific path be used for each
> >> >sender/receiver pair.
> >Sorry, I should have been more precise.  The only known proof for deadlock
> >free routing, without assuming infinite diskpacks at each node or any
> >other unacceptable bandwidth reducer like inhibiting a PE from sending
> >another message until a return receipt is received
> 
> Such a bandwidth reducer is not necessarily unacceptable.
I want an operand from shared memory to every CPU on every clock.  The factor
of 10 reduction in performance that your restriction imposes (for 1024 nodes)
is unacceptable for a machine of interest here at LLNL.
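
A rough way to see where the factor of 10 comes from: with only one request in
flight per processor, and a service time on the order of logN network clocks,
a processor gets an operand about once every logN clocks instead of every
clock.  At N = 1024 that is the factor of 10.  In C, just to have the numbers
in front of us (simple arithmetic, not a simulation result):

    /* Rough source of the factor-of-ten figure: one request in flight per
     * processor, against a service time of order logN network clocks. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        for (int n = 128; n <= 1024; n *= 2) {
            double hops      = log2((double)n);  /* order of the service time   */
            double streamed  = 1.0;              /* operands/clock, pipelined   */
            double stop_wait = 1.0 / hops;       /* operands/clock, one pending */
            printf("N=%4d  pipelined %.2f  one-outstanding %.2f  loss %2.0fx\n",
                   n, streamed, stop_wait, streamed / stop_wait);
        }
        return 0;
    }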

> at each node.  Thus I conclude that you are acknowledging
> messages, but that you are allowed to send more messages before the
> acknowledgement comes back.  I don't know what the acknowledgements do
> for you, but it seems like you could create exactly the same deadlock
In the case of a write the acknowledgement allows a processor to be sure that a
shared memory location has indeed been updated before emitting a sync
instruction that would allow another processor to access the location.
Fortunately a sync instruction does not have to be issued after every memory
access.
In the case of a read the acknowledgement contains the data.
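
To make the discipline concrete, here is a toy version in C, under my own
naming and with an invented fixed latency; it is a sketch of the idea, not the
actual hardware.  The point is that ordinary writes never wait, and only the
sync instruction pays for whatever is still outstanding:

    /* Writes are issued without waiting, but counted; SYNC drains the count
     * before the processor proceeds; a read's acknowledgement is just the
     * reply carrying the data.  Network latency is faked as a fixed value. */
    #include <stdio.h>

    #define LATENCY 10                       /* pretend round trip, in clocks */

    static int in_flight[64];                /* clocks until each ack returns */
    static int n_in_flight = 0;

    static void shared_write(int addr, int value)
    {
        (void)addr; (void)value;
        in_flight[n_in_flight++] = LATENCY;  /* issue and keep going */
    }

    static void network_tick(void)
    {
        int j = 0;
        for (int i = 0; i < n_in_flight; i++)   /* age every pending ack */
            if (--in_flight[i] > 0)
                in_flight[j++] = in_flight[i];
        n_in_flight = j;                        /* drop the ones that arrived */
    }

    static int sync_instruction(void)
    {
        int stalled = 0;
        while (n_in_flight > 0) {               /* only SYNC waits */
            network_tick();
            stalled++;
        }
        return stalled;
    }

    int main(void)
    {
        for (int i = 0; i < 8; i++)
            shared_write(i, i * i);             /* eight writes, no waiting */
        printf("SYNC stalled %d clocks for 8 outstanding writes\n",
               sync_instruction());
        return 0;
    }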

> >The utilization of the wires is around 50% as packets travel half the maximal
> >logN distance on average.
> 
> How big a cube is that?  (I am still trying to dig up the reference about
> 14 cubes only using 25% of their potential bandwidth...)  A handwaving
> argument that larger cubes make less effective use of their wires than
> smaller cubes goes like this.  There are NlogN/2 wires, but only N
> messages could possibly be inserted into the net at a time, so there
> is a logN factor that is being wasted.  Furthermore, as the number of
> processors grows the number of possible collisions increases, further
> decreasing the percentage of used bandwidth.  (You don't have to
> believe the argument if you don't want to...)
I will correct your argument for you.
Yes, N messages get inserted into the network on each clock.  Each message
must travel half the maximal logN distance on average.  Hence on average
N(logN)/2 messages are being switched on each clock cycle.  Each switching
uses a wire.  Half the wires are in use on average.  The argument has no N
dependence, and simulations (to 1024 processors) show relative N independence.
The simulations clearly show convergence to the large N limit already.
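
Written out as steady state bookkeeping (counting a wire per link direction,
i.e. NlogN of them, which is the counting that makes the 50% come out), the
figure and its lack of an N dependence look like this:

    /* The steady-state arithmetic behind the 50% figure; the ratio has no
     * N dependence at all.  Wires are counted one per link direction. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        for (int n = 16; n <= 4096; n *= 4) {
            double d         = log2((double)n);       /* cube dimension        */
            double injected  = (double)n;             /* messages in per clock */
            double avg_hops  = d / 2.0;               /* mean distance         */
            double switching = injected * avg_hops;   /* messages on wires     */
            double wires     = (double)n * d;         /* one per link direction */
            printf("N=%5d  on wires %7.0f  wires %7.0f  utilization %.2f\n",
                   n, switching, wires, switching / wires);
        }
        return 0;
    }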

> >> Why do you have shared memory?
> >A shared memory machine can do anything a message passing machine can do.
> >You can even efficiently run message passing software.  The shared memory
> >machine is then more flexible and useful on a wider range of problems.
> 
> Conversely, a message passing machine can do anything a shared memory
> machine can do.  You can even efficiently run shared memory software.
> (Apparently that is how you are implementing shared memory) The
> message passing machine is then more flexible and useful on a wider
> range of problems.
Indeed you are right here!  If you get right down to it the packet switch used
to implement the shared memory server is a message passing machine.  It is a
very high performance message passer that adaptively responds to conflicts,
sorting them out and producing lock step operation if it is possible.  The
network is composed of a large number of very dumb parts that collectively
produce this new behavior.  It is actually quite interesting; neurons are
relatively dumb parts, you know.  Perhaps you guys at TMC are barking up the
wrong tree!  Perhaps the mechanisms of these networks are worth studying.

> >> discussion (somewhere in an earlier posting) is the $k$-shuffle.  This
> >> network achieves better bandwidth by decreasing the distance between
> >The k shuffle does not achieve better bandwidth; it achieves a lower latency.
> >Bandwidth is bandwidth, latency is latency and parts is parts. :-)  The lower
> >latency could translate to shorter vector lengths but as is shown in my paper
> >the vector lengths required for the hypercube are fine.  The worry is that
> >the k shuffle will have the start address pickiness of the butterfly which
> >is unacceptable.  The answer to this question is not known at present.
> 
> But because the messages spend less time floating around in the
> network, the network may be able to achieve higher throughput in the
> $k$ shuffle than in the cube.
For vector fetch of long vectors you can't do better than a bandwidth of N.
Indeed, if the k shuffle does display the adaptive behavior (and I am confident
that with the right switching node it will) without the restrictions on start
addresses (this I am not so sure about, as the butterfly did not work out
too well here), then the k shuffle might achieve better throughput for short
vectors.  This is precisely why I would like to see someone running memory
server simulations for the k shuffle.  I note this possibility as interesting
in the discussion of one of my papers.  Are there any takers for this job?
It would certainly lead to a good master's thesis and, with some work, perhaps
even further.  I would be more than happy to help anyone get started.  I am
very busy building a simulator for a complete shared memory hypercube that will
run application codes.
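
If anyone wants a feel for what is at stake before writing the real memory
server simulator, a throwaway model like the one below shows why the latency
only matters for short vectors.  The model (startup latency plus one element
per clock, radix 8 assumed for the shuffle) is my own simplification, not a
result:

    /* Toy latency/throughput model of a pipelined vector fetch.  Long
     * vectors pin both networks at the same asymptotic rate; a lower
     * latency network only helps on short ones. */
    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        int    n = 1024;                                   /* processors      */
        int    k = 8;                                      /* shuffle radix   */
        double lat_cube    = log2((double)n);              /* ~10 hops        */
        double lat_shuffle = log((double)n) / log((double)k); /* ~3.3 hops    */

        for (int len = 4; len <= 4096; len *= 8) {
            double rate_cube    = len / (lat_cube    + len);  /* of peak */
            double rate_shuffle = len / (lat_shuffle + len);
            printf("vector length %5d: cube %.2f  k-shuffle %.2f of peak\n",
                   len, rate_cube, rate_shuffle);
        }
        return 0;
    }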

> >> Very few programs actually use the n-cube directly
> >Take a look at the FFT,  it's an important problem, it blows smoke on the
> >hypercube.  As I have said before, the wires is there.  The FFT uses all of
> >them.
> I was afraid you might bring up FFT.  Just as a matter of curiosity,
> what percentage of total computing time do you use up on FFT's?
I personally haven't used the FFT recently, but it forms the compute intensive
kernel for an algorithm that has been worked into the ground by a couple
of my collaborators here at LLNL for several years now.  The algorithm is
used in solving the wave equation in waveguides and also in solving the
Schrodinger equation in time dependent applications.  (My PhD is in
theoretical physics.)  Of course it is one algorithm of many, and a special
purpose machine would not be appropriate.  This is why we are interested in
general purpose machines!
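
For anyone who has not looked at why the mapping is so clean: at butterfly
stage d of a one-point-per-node FFT, node p exchanges with node p XOR 2^d,
which is exactly its neighbor across cube dimension d, so each stage loads one
full dimension of wires and the logN stages between them use every wire in the
machine.  A toy illustration on a 4 cube (my own, just to show the pattern):

    /* Each FFT stage pairs node p with p ^ (1 << stage), i.e. its neighbor
     * across one cube dimension, so every dimension's wires get used. */
    #include <stdio.h>

    int main(void)
    {
        int n = 16, dims = 4;                      /* tiny 4-cube for display */
        for (int stage = 0; stage < dims; stage++) {
            printf("stage %d:", stage);
            for (int p = 0; p < n; p++) {
                int partner = p ^ (1 << stage);    /* differs in bit 'stage' */
                if (p < partner)                   /* print each exchange once */
                    printf(" %2d<->%-2d", p, partner);
            }
            printf("\n");
        }
        return 0;
    }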

xxxxxxxxxxxxxxx

At this point I will refrain from discussing the points mentioned above
any further.  If there is anything new and interesting to talk about
I will be glad to reply.