[net.arch] "The Shared Memory Hypercube" Do you smell any smoke?

brooks@lll-crg.ARPA (Eugene D. Brooks III) (05/01/85)

> I am going to disagree a lot with brooks@lll-crg.ARPA, but don't take it
> personally.  It sounds like Mr. Brooks has given the topic a lot of thought
Not to mention a lot of work,
and it's Dr. Brooks. :-)

>Some of my assumptions are:
>   - Lots and lots and lots of small processors are better than fewer big
> processors.
A very bad assumption: you want as many of the most COST EFFECTIVE processors
as you can afford.  To do anything else is to waste your computational dollar.
At the moment the most cost effective processor for vectorizable scientific
work is a VLSI vector processor of about 5-10 MFLOP capability.  The speed
figure is climbing steadily and will probably saturate at not too many factors
of two greater.  In the serial area the current 32-bit micros are certainly very
cost effective.  Would you pay a dollar for an apple if you could buy it for
50 cents?

>   - Any particular connection topology will be not quite right for most
> applications, so we might be willing to sacrifice the best case behavior
> for more generality or better average case behavior (for example nearest
> neighbor communication in a 2 dimensional grid is very good, but not
> very general.  A cube is well connected, but may use too many wires to
> achieve its bandwidth)
These are points for the cube; it has very general utility.  It is one of the
most general topologies due to the many topologies embedded in it.  Yes, it has
a lot of wires.  Whether a k-connected network (4x4, 8x8, ... switching nodes
instead of 2x2 switching nodes) is better is an open research question.  You trade
wires for silicon.  You have to do the relevant simulations to find the answers
to this question.  The switching technique described in my recent work applies
to these systems as well.  You might consider writing the simulations and
publishing the papers for these!


> I tend to think that it is possible to
> get more MIPS per dollar by using smaller, cheaper processing elements.
You get the most MIPS for your dollar by using the most COST EFFECTIVE
processing elements.  These do not happen to be the smallest and cheapest.

> I don't see how wiring a 3d cube with 32 processors on a side gives us a
> 15 cube.
Consider an n=5 hypercube with the processors aligned along a line spaced at say
one foot intervals.  Each processor is labeled by a 5 bit number.  These 5 bits
form the first 5 bits of node addresses of the n=15 cube we are constructing.
Take 32 of these linear arrays. Place them side by side forming a plane.
Connect all of the 0 processors together in a line with the same wiring pattern
as used for the linear array.  Do the same for all of the other processors.
You now have an n=10 hypercube with 1024 processors.  The row index forms the
next 5 bits in the now 10 bit processor enumeration. All wires run in either
the x or y directions and the machine is a physical plane.  Now take 32 of these
planes and stack them in the z direction one foot apart.  Connect the processors
in each vertical line (there are 32x32 of them) with the same wiring pattern
as for the original line of 32 processors.  You now have an n=15 hypercube.
It is physically constructed in a 3 dimensional box 32'x32'x32'.  All wires
run in the x,y or z directions with a very regular pattern.  No wires run
diagonally.  The 1M-processor machine is within reach, but as I noted earlier
I would be quite happy with a 1024-processor cube of 10 MFLOP nodes.
Will anyone sell me one?  I'll settle for 128 nodes as a start.
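
To make the wiring concrete, here is a rough Python sketch of the node
numbering just described (the helper names are invented for illustration;
nothing below comes from the papers):

    def node_id(x, y, z):
        # 15-bit node address: bits 0-4 = x, bits 5-9 = y, bits 10-14 = z
        return x | (y << 5) | (z << 10)

    def neighbors(x, y, z):
        # The 15 hypercube neighbors of the node at grid position (x, y, z).
        # Flipping one of bits 0-4 changes only x, bits 5-9 only y, and
        # bits 10-14 only z, so every wire runs parallel to a single axis.
        me = node_id(x, y, z)
        for d in range(15):
            other = me ^ (1 << d)
            yield (other & 31, (other >> 5) & 31, (other >> 10) & 31)

    # Every neighbor of (3, 0, 0) differs from it in exactly one coordinate:
    for nbr in neighbors(3, 0, 0):
        assert sum(a != b for a, b in zip((3, 0, 0), nbr)) == 1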
 
> >Deadlock free routing requires that a specific path be used for each
> >sender/receiver pair.
Sorry, I should have been more precise.  The only known proof for deadlock
free routing, without assuming infinite disk packs at each node or any
other unacceptable bandwidth reducer like inhibiting a PE from sending
another message until a return receipt is received (see the further caveat
below), requires that a specific path be chosen for each sender/receiver pair.
This proof is documented in Dick Lang's '81 (I think) Caltech thesis.  There is
a caveat not clearly documented in Lang's thesis as well.  If each message is
acknowledged with a response indicating the successful reception of the message
(as part of the reception action) then you need two separate networks for the
messages and responses, or infinite disk packs at each node.  I ran into a
deadlock condition in one of my hypercube simulations.  The "cure" was to use
separate networks for the messages and return receipts.  Of course there may
be a better routing algorithm.  If you find it, publish it and send me a copy
of the paper.  In fact, send me a copy of the paper before you publish it;
I would like to take advantage of the algorithm as soon as possible in my
simulations and wouldn't want to wait for the publication lead time.
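
For reference, the standard way to realize "one specific path per
sender/receiver pair" on a hypercube is dimension-order (e-cube) routing;
a small Python sketch follows.  This is the generic scheme, not necessarily
the exact algorithm of Lang's thesis or of my simulations.

    def ecube_path(src, dst, n_dims):
        # Fixed route from src to dst: correct the differing address bits
        # in increasing dimension order, one hop per differing bit.
        path = [src]
        cur = src
        for d in range(n_dims):
            if (src ^ dst) & (1 << d):
                cur ^= 1 << d
                path.append(cur)
        return path

    # On an n=4 cube, 0b0101 -> 0b1010 always takes the same four hops:
    assert ecube_path(0b0101, 0b1010, 4) == [0b0101, 0b0100, 0b0110, 0b0010, 0b1010]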


>   Thinking Machines Corporation
Have you guys over there taught any of your machines to "Think" yet?

Sorry, I know it's old and you have heard it before but I couldn't resist. :-)


> Ah, but there are on the order of 2^N wires in the machine.
There are NlogN/2 wires: there are logN ports on each network node, there are
N nodes, and each wire has two ends.  Where does the 2^N come from?
The utilization of the wires is around 50% as packets travel half the maximal
logN distance on average.
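
Spelled out for the 32x32x32 machine above (back-of-the-envelope only):

    N = 2 ** 15                 # nodes in the 15-cube
    logN = N.bit_length() - 1   # 15 dimensions, hence 15 ports per node

    wires = N * logN // 2       # logN ports per node, two ends per wire: 245760 wires
    avg_hops = logN / 2         # a random destination differs in half the bits: 7.5 hops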

> > My use of the network is to create a
> >shared memory server for a hypercube machine with CRAY style vector processors.
> >With simultaneous vector access by all processors the packet conflict problem
> >is as severe as it can be.  The network provides FULL bandwidth to all
> >processors simultaneously.
> 
> It sounds to me like the packet conflict problem is as nice as it could
> possibly be for that problem.  (Presumably processor number $i$ is
> accessing memory $f(i)$ where $f$ is one-to-one.
There are several addressing modes: random start addresses with a common
stride among the processors, random scatter/gather, and random start with
random stride.  Here the starting f(i) is not one-to-one as the start addresses
are unconstrained and may overlap.  The magic of the network is that it moves
toward an attractor in its motion, falls into lock step making f(i) one-to-one,
and achieves full bandwidth.  The vector lengths required to do this are of
course a function of the machine size N and are documented in the papers
"The butterfly processor-memory interconnection in a vector processing
environment" and "The shared memory hypercube".  The Hypercube does well on
all of the addressing patterns, the butterfly does well on some of them.

> local memory
The shared memory hypercube has local memory.  Whether a segment is local or
global in usage is determined by how you arrange the addressing for the segment.
There are N local segments and one global segment in the simplest arrangement.
The ith processor has absolute priority and low latency for the ith local
segment.  It can still access all of the other local segments with lower
priority and higher latency.  This is one of the BIG advantages of the cube.
It can be N scalar machines, two N/2-processor parallel machines, ... or one
big parallel machine at the whim of the software.  You put the local scalar
variables of a task in a local segment.  It's very flexible!
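
As a toy illustration of one way such an arrangement could be addressed (the
segment count and the word-level interleave below are invented stand-ins, not
the scheme documented in the papers):

    N = 1024                 # processors (nodes); segment N is the global segment

    def home_node(segment, offset):
        # Local segment i lives entirely in node i's memory, so node i gets
        # low-latency, high-priority access to it, while any other node can
        # still reach it through the network.  The single global segment is
        # interleaved word by word across all nodes so that vector accesses
        # spread over the whole machine.
        if segment < N:          # segments 0 .. N-1 are the local segments
            return segment
        return offset % N        # global segment: word k lives on node k mod N

    # Node 7's local scalars (segment 7) stay on node 7; global word 5000
    # lands on node 5000 % 1024 == 904.
    assert home_node(7, 123) == 7
    assert home_node(N, 5000) == 904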

> Why do you have shared memory?
A shared memory machine can do anything a message passing machine can do.
You can even efficiently run message passing software.  The shared memory
machine is then more flexible and useful on a wider range of problems.
It turns out that the types of codes run at LLNL are very difficult to run
on anything but a shared memory machine.  Message passing machines, to the
extent that they can live with a lower message bandwidth than aggregate memory
bandwidth, i.e. the processes are loosely coupled, are of course cheaper as the
network is cheaper.  If we could use one at LLNL it would have been bought a
long time ago.


> What is the start up time for your vectors (i.e. how big does
> a vector have to be before the vector processing part wins over the
> scalar processing part.)
As you might guess, it gets longer as the number of processors used in the
problem gets bigger.  Notice that I say the number of processors used
in the PROBLEM here.  For the butterfly it's the number of processors in the
MACHINE, which is potentially a different animal.  Luckily it's a very slowly
growing function (like (logN)^2 with a suitably small constant out front).
   You would like it; it's a large-N winner.

> Typical vector processors are limited in their
> speed by lack of memory bandwidth (this is true for a single processor
> with high bandwidth memories (e.g. the CRAY uses a 16-way interleaved
> memory in order to try to increase the memory bandwidth).
Your choice is whether to interleave several hypercube-topology memory
servers, interleave the memory modules, or accept a 10-20 MIP instruction
rate.  It is not yet known whether the interleaving will break the adaptive
nature of the network that is so crucial to vector performance.  This simulation
has not yet been done but if you know me, and I am assuming that you have
learned a little about me by now, you can be sure the simulation will be done
soon.  If the interleave does not break the adaptive nature of the network
then interleave; if it does break the adaptive nature, stay with 10 MFLOP
processors and 100ns RAM.  Remember that we are following a line of maximum
cost effectiveness and not maximum absolute speed per node.  A 10 MFLOP node
is maximally cost effective at the present time and is what you would use
anyway.  Yes, the processor node is an order of magnitude slower than a Cray.
It is also an order of magnitude more cost effective.  A 128 node machine
seems like a good start.  It would cost the same amount as a Cray and be 10
times faster in terms of aggregate performance.

> Too many wires, and not very effectively used.
Wrong on both counts; see the comments made above.

> Suppose you spent the
> same amount of money on a different network which gave you twice the
> bandwidth for most of your applications.
Find it, prove it's better than the cube for most of LLNL's applications,
build it, and we will buy it.  That's the bottom line.  If after spending
some time studying the matter you don't find something better, build a
shared memory hypercube to the proper specs.  We would have bought one
of those yesterday.  Now isn't that better than Russian roulette?

> discussion (somewhere in an earlier posting) is the $k$-shuffle.  This
> network achieves better bandwidth by decreasing the distance between
The k-shuffle does not achieve better bandwidth; it achieves a lower latency.
Bandwidth is bandwidth, latency is latency, and parts is parts. :-)  The lower
latency could translate to shorter vector lengths, but as is shown in my paper
the vector lengths required for the hypercube are fine.  The worry is that
the k-shuffle will have the start address pickiness of the butterfly, which
is unacceptable.  The answer to this question is not known at present.

> Very few programs actually use the n-cube directly (although there are
> a few (e.g. subcube induction to
> count the number of processors in a given state - [D.P. Christman,
> "Programming the Connection Machine", MIT masters thesis, January,
> 1984])) And for most programs the $k$-shuffle will be at least as good.
Take a look at the FFT: it's an important problem, and it blows smoke on the
hypercube.  As I have said before, the wires is there.  The FFT uses all of
them.
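
The reason the FFT exercises every wire: at stage d of a radix-2 transform,
the butterfly pairs elements whose indices differ only in bit d, which with
one element per node is exactly the set of dimension-d hypercube edges.  A
generic Python sketch (standard iterative form, input assumed to be in
bit-reversed order; nothing here is specific to the shared memory hypercube):

    import cmath

    def fft_by_dimension(a):
        # In-place radix-2 DIT FFT on a list whose length is a power of two,
        # written so the per-stage communication pattern is visible: stage d
        # only ever combines a[i] with a[i ^ (1 << d)], a hypercube edge.
        n = len(a)
        for d in range(n.bit_length() - 1):
            span = 1 << d
            for i in range(n):
                if i & span:
                    continue                       # handle each edge pair once
                j = i ^ span                       # partner across dimension d
                w = cmath.exp(-2j * cmath.pi * (i % span) / (2 * span))
                a[i], a[j] = a[i] + w * a[j], a[i] - w * a[j]
        return a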

Speaking of smoke, when it clears read the papers I sent you via SNAIL MAIL.

Any other takers for those papers? Pun intended. :-)
Send me your SNAIL MAIL address!

bradley@think.ARPA (Bradley C. Kuszmaul) (05/03/85)

In this posting, my latest entry in the "dualing hypercubists munch
Cray" story, I am going to "almost concede" several of Dr. Brooks' points.

For example, the assertion that a 5-10 MFLOP processor is more cost
effective than a smaller one may very well be true.  (One of the
perceived problems of the Connection Machine (CM) is that its
floating point performance (~60 to 600 MFLOPS, depending on how you
measure it) is not as good as one might expect from a machine with 4K
chips and 64K processors: four square feet of special purpose
silicon, plus memory and glue.)  If Thinking Machines (TMC) had used 4K
chips to maximize floating point speed (with, perhaps, 2K 5 MFLOP
chips and 2K network chips), the CM might have 4 GFLOPS of
performance.

 The truth is that the Connection Machine was not designed to be a high
speed number cruncher, but instead was designed to do something
called "logical inferences" (no one really knows how to measure
logical inferences per second, but everyone at TMC seems to believe
that the CM will do them faster than anything else).

 My research (apparently, with my ongoing master's research, I lose
in the "battle of the titles" to Brooks' completed doctorate)
involves trying to run applicative languages (languages without side
effects) on the CM.  One of the big arguments for running applicative
languages is that the parallelism is easy to exploit, and thus
programs running on an architecture well suited for such languages
should be able to run very quickly (hah!).  In order to protect my
reputation when the CM does not run my programs very quickly, I claim
to be "simulating applicative architectures".  :-)

 Sometimes, I believe that the CM was designed correctly after all,
since all the programs I have written are dominated by communication
time; what's the point of spending more on processors if you can't
communicate fast enough to keep them busy?

 Sometimes, I believe that the CM was not designed to be fast enough
(it was conservatively designed so that it would work the first
time).  If the processors were bigger, maybe there would be less of a
communications bottleneck (hopefully by initiating less of it) .  If
the router were faster, or of a different topology, maybe the
communications traffic would run faster.  If we upped the clock to
10ns, it might be enough faster to make everyone happy.  (When I told
one of the designers that I want the model two to have a million
processors and a 10ns clock, he said I was on drugs, and elaborated
with "I don't want to design with ECL, if you look at it the wrong
way it fails.  How do you distribute and use a 10ns clock with CMOS?
How do you get bits across four meter wires at 10ns?  You design it,
we'll build it")


In article <560@lll-crg.ARPA> brooks@lll-crg.ARPA (Eugene D. Brooks III) writes:

>> I don't see how wiring a 3d cube with 32 processors on a side gives us a
>> 15 cube.
>Consider an n=5 hypercube with the processors aligned along a line spaced at say
>one foot intervals.  Each processor is labeled by a 5 bit number.  These 5 bits
>form the first 5 bits of node addresses of the n=15 cube we are constructing.

A machine in a box 32 feet on a side.  Wow!  I've been considering
building big machines (I've been thinking about machines the size of a
small building too), and I'm glad someone else thinks the way I do :-)

Of course the million processor machine will be something like 100
feet on a side.  (Hopefully that foot includes allowances for
accessing the machine to service it, since I can't imagine why else a
processor would want to be that far away from its neighbor :-)

Ok, so you are wiring up a 5-cube in a linear region, and the wiring
bundle will be about 16 wires at the thickest point.  That is not too
tough (in fact TMC wired up a 5-cube on one printed circuit board - no
doubt with an infinite number of layers :-)  To get up to the 2M
processor machine (sorry about jumping up to 2M from 1M, but 2M is a
cube in base 2) you need 128 processors lined up in a row, and the
wiring bundle will be 64 wires thick at the thickest point.  While it
may be possible to do,  I still claim that the packaging is very hard.
(I suppose it would help to have some custom wires too.  The CM uses
standard ribbon cable for its longer connections (i.e. the connections
along the highest dimensions), and printed circuit board for the lower
connections.  I guess PCB counts as custom wire.)  And to think that
TMC is getting a 64K machine into a 5 foot cube (apples and oranges of
course: 64K of really small processors is not the same as 32K of
one-foot-cubed processors).  And the speed of light across a 32
foot region is going to be fun to deal with (I know I am looking
forward to it: let's see now, what time is it anyway?).  The speed of
light problem can, of course, be dealt with in lots of ways (decrease
the clock speed of the network but keep the processors going faster,
etc.).

>I would be quite happy with a 1024 processor cube of 10 MFLOP nodes.
>Will any one sell me one?  I'll settle for 128 nodes as a start.

I understand that qualified buyers can obtain information about buying
a CM from TMC.  (Don't look at me that way... I only use TMC's
computers (at least during the school year).)  The prototype is 64K
processors with 1 to 10 KFLOPS per processor.  That's the best I
can offer :-)

>> >Deadlock free routing requires that a specific path be used for each
>> >sender/receiver pair.
>Sorry, I should have been more precise.  The only known proof for deadlock
>free routing, without assuming infinite disk packs at each node or any
>other unacceptable bandwidth reducer like inhibiting a PE from sending
>another message until a return receipt is received

Such a bandwidth reducer is not necessarily unacceptable.

> (see the further caveat
>below), requires that a specific path be chosen for each sender/receiver pair.
>This proof is documented in Dick Lang's '81 (I think) Caltech thesis.  There is
>a caveat not clearly documented in Lang's thesis as well.  If each message is
>acknowledged with a response indicating the successful reception of the message
>(as part of the reception action) then you need two separate networks for the
>messages and responses, or infinite disk packs at each node.

I still am not sure I understand.  Suppose that each message is
acknowledged with a receipt before the next message is allowed to be
sent.  Then with an N processor machine, there can be at most N
messages in the network at one time, so you only need that much memory
at each node.  Thus I conclude that you are acknowledging
messages, but that you are allowed to send more messages before the
acknowledgement comes back.  I don't know what the acknowledgements do
for you, but it seems like you could create exactly the same deadlock
by using messages with no acknowledgement (by "simulating" the
acknowledgement in software:  Whenever a processor would have sent an
acknowledgement, it can just send a regular message.)


>>   Thinking Machines Corporation
>Have you guys over there taught any of your machines to "Think" yet?

Let's play the Turing test:  Can you prove that I am not one of those
machines (please don't respond by saying that my postings display a
lack of thought). :-)

>Sorry, I know it's old and you have heard it before but I couldn't resist. :-)

Could be worse.  It used to be International Thinking Machines.
(Imagine large blue letters with horizontal stripes spelling out ITM.)

>> Ah, but there are on the order of 2^N wires in the machine.
>There are NlogN/2 wires: there are logN ports on each network node, there are
>N nodes, and each wire has two ends.  Where does the 2^N come from?

My N was the number of dimensions, while your N is the number of
nodes.  Sorry about the confusion.  Still seems like a lot of wires
to me.

>The utilization of the wires is around 50% as packets travel half the maximal
>logN distance on average.

How big a cube is that?  (I am still trying to dig up the reference about
14 cubes only using 25% of their potential bandwidth...)  A handwaving
argument that larger cubes make less effective use of their wires than
smaller cubes goes like this.  There are NlogN/2 wires, but only N
messages could possibly be inserted into the net at a time, so there
is a logN factor that is being wasted.  Furthermore, as the number of
processors grows the number of possible collisions increases, further
decreasing the percentage of used bandwidth.  (You don't have to
believe the argument if you don't want to...)

>> Why do you have shared memory?
>A shared memory machine can do anything a message passing machine can do.
>You can even efficiently run message passing software.  The shared memory
>machine is then more flexible and useful on a wider range of problems.

Conversely, a message passing machine can do anything a shared memory
machine can do.  You can even efficiently run shared memory software.
(Apparently that is how you are implementing shared memory.)  The
message passing machine is then more flexible and useful on a wider
range of problems.
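
For what it's worth, a toy sketch of the sense in which message passing
subsumes shared memory: a remote load or store is just a request message plus
a reply.  The queue-and-thread transport below is an invented stand-in for a
real network, not anybody's actual machine.

    import queue, threading

    class MemoryNode(threading.Thread):
        # One node's memory server: it owns some words and answers READ/WRITE
        # request messages; the reply doubles as the acknowledgement.
        def __init__(self, inbox):
            super().__init__(daemon=True)
            self.inbox = inbox
            self.store = {}

        def run(self):
            while True:
                op, addr, value, reply = self.inbox.get()
                if op == "WRITE":
                    self.store[addr] = value
                reply.put(self.store.get(addr))

    def remote_read(inbox, addr):
        # A "shared memory" load: send a request message, block on the reply.
        reply = queue.Queue()
        inbox.put(("READ", addr, None, reply))
        return reply.get()

    def remote_write(inbox, addr, value):
        # A "shared memory" store: send the request, wait for the acknowledgement.
        reply = queue.Queue()
        inbox.put(("WRITE", addr, value, reply))
        reply.get()

    inbox = queue.Queue()
    MemoryNode(inbox).start()
    remote_write(inbox, 0x10, 42)
    assert remote_read(inbox, 0x10) == 42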

>> Too many wires, and not very effectively used.
>Wrong on both counts; see the comments made above.
Too many wires, and not very effectively used.  (Boring, aren't I?)

>> discussion (somewhere in an earlier posting) is the $k$-shuffle.  This
>> network achieves better bandwidth by decreasing the distance between
>The k-shuffle does not achieve better bandwidth; it achieves a lower latency.
>Bandwidth is bandwidth, latency is latency, and parts is parts. :-)  The lower
>latency could translate to shorter vector lengths, but as is shown in my paper
>the vector lengths required for the hypercube are fine.  The worry is that
>the k-shuffle will have the start address pickiness of the butterfly, which
>is unacceptable.  The answer to this question is not known at present.

But because the messages spend less time floating around in the
network, the network may be able to achieve higher throughput in the
$k$ shuffle than in the cube.

>> Very few programs actually use the n-cube directly
>Take a look at the FFT: it's an important problem, and it blows smoke on the
>hypercube.  As I have said before, the wires is there.  The FFT uses all of
>them.
I was afraid you might bring up FFT.  Just as a matter of curiosity,
what percentage of total computing time do you use up on FFTs?

>Speaking of smoke, when it clears read the papers I sent you via SNAIL MAIL.

Oh well.  My response is wimpy this time.  I'll read the papers, and
then flame on...
 -Brad-- 
bradley@think.uucp  (i.e. {decvax!cca,ihnp4!mit-eddie}!think!bradley)
bradley@THINK.ARPA

karsh@geowhiz.UUCP (Bruce Karsh) (05/05/85)

> 
> >> Very few programs actually use the n-cube directly
> >Take a look at the FFT: it's an important problem, and it blows smoke on the
> >hypercube.  As I have said before, the wires is there.  The FFT uses all of
> >them.
> I was afraid you might bring up FFT.  Just as a matter of curiosity,
> what percentage of total computing time do you use up on FFTs?

  Well, in the geophysical industry, it's not uncommon to have mainframes
doing almost nothing but FFTs.  The geophysical industry is probably one
of the biggest consumers of computers in the world.
  (I'm not implying that it's necessary for a parallel machine to do FFTs
quickly.  But the oil industry would be a very large purchaser of a parallel
machine that did FFTs quickly.)
-- 
Bruce Karsh                           |
U. Wisc. Dept. Geology and Geophysics |
1215 W Dayton, Madison, WI 53706      | This space for rent.
(608) 262-1697                        |
{ihnp4,seismo}!uwvax!geowhiz!karsh    |

eugene@ames.UUCP (Eugene Miya) (05/07/85)


> >Some of my assumptions are:
> >   - Lots and lots and lots of small processors are better than fewer big
> > processors.
> A very bad assumption,  you want as many of the most COST EFFECTIVE
> processors
> . . .
> > I tend to think that it is possible to
> > get more MIPS per dollar by using smaller, cheaper processing elements.
> You get the most MIPS for your dollar by using the most COST EFFECTIVE
> processing elements.  These do not happen to be the smallest and cheapest.

To reinforce Eugene's comments:
We just had a SIG meeting with Joe Oliger (the CS chair at Stanford)
recently.  Joe came to the conclusion [during the course of thinking] that
"fewer, high performance CPU" proponents of multiprocessing had the
advantage in being able to fit reasonable portions of problems into
individual processor/memories.

Do not forget! We are not developing these machines in a vacuum.
We have to look at the applications which may be run on these machines.
Consider a 100 x 100 x 100 array with 30 variables (and increasing
as our knowledge of the natural sciences increases).  I can barely fit
a fluid dynamics code on a 32-node Hypercube because the storage requirements
per CPU are fierce [a different example, not the 100^3x30 example].
If I am repeating myself from any earlier postings, sorry.  Jack Dennis
had to revise the way he thought dataflow machines need to be built
after two weeks here: he needs more memory and faster I/O.

If there is any one thing multiprocessors allow us to do, it is add yet
more memory.  Hum? :-)

> > What is the start up time for your vectors (i.e. how big does
> > a vector have to be before the vector processing part wins over the
> > scalar processing part.)

I have just tested this recently on four different CRAY architectures.
Short vector startup time is very good.  The vector registers are faster
than the scalar registers after you pass 3 elements.  I have a graph
that shows this.  [To: Eugene Brooks: I sent a copy of this graph to
Frank McMahon, the guy who developed the Livermore Loops, for scatter-
gather operations in hardware on the XMP/48.]  DON'T WORRY about
vector STARTUP on a CRAY, the difference is insignificant, you are
wasting your time optimizing this arena.  A 205 might be another story,
more later.

> > Typical vector processors are limited in their
> > speed by lack of memory bandwidth (this is true for a single processor
> > with high bandwidth memories (e.g. the CRAY uses a 16-way interleaved

Our Cray has a much higher degree of interleave.  The new C-2 will have
128-way.

--eugene miya
  NASA Ames Research Center
  {hplabs,ihnp4,dual,hao,decwrl,allegra}!ames!aurora!eugene

henry@utzoo.UUCP (Henry Spencer) (05/09/85)

> Short vector startup time is very good.  The vector registers are faster
> than the scalar registers after you pass 3 elements.  ...
> ...  DON'T WORRY about
> vector STARTUP on a CRAY, the difference is insignificant, you are
> wasting your time optimizing this arena.  ...

It should be noted that this is probably the single biggest reason why
the Cray machines have been so successful, compared to most previous
supercomputer efforts:  they're scorching fast even on scalars, and get
full vector performance on even short vectors.  This is why sales of
Cray machines are already an order of magnitude larger than the combined
sales of the Star-100 and the ASC, both of which were Real Fast if they
could ever get their pipelines full...  Life is much easier for the poor
users when they don't need to sweat blood to make their vectors long
enough for full-speed operation.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry

bradley@think.ARPA (Bradley C. Kuszmaul) (05/10/85)

    eugene@ames.UUCP (Eugene Miya) writes:

    > >Some of my assumptions are:
    > >   - Lots and lots and lots of small processors are better than fewer big
    > > processors.
    > A very bad assumption,  you want as many of the most COST EFFECTIVE
    > processors
    > . . .
    > > I tend to think that it is possible to
    > > get more MIPS per dollar by using smaller, cheaper processing elements.
    > You get the most MIPS for your dollar by using the most COST EFFECTIVE
    > processing elements.  These do not happen to be the smallest and cheapest.

    We just had a SIG meeting with Joe Oliger (the CS chair at Stanford)
    recently.  Joe came to the conclusion [during the course of thinking] that
    "fewer, high performance CPU" proponents of multiprocessing had the
    advantage in being able to fit reasonable portions of problems into
    individual processor/memories.

The claimed advantage is predicated on the idea that each processor
solving more of the problem is better, either in programming ease,
communications overhead, or some other measure of cost.  This advantage
may go away again if small portions of the problems are the RIGHT thing
for each processor to handle.  Such an case might be some finite element
analysis problem, in which it might be both easy to program each
processor to simulate exactly one element, and there is not a lot of
communication between different parts of the problem (e.g. if all the
interaction is with the nearest neighbors of the finite element).  (Of
course, if it is too hard to write the program, then I lose, and if
every element wants to talk to every other element in every discrete
time unit, then I lose.  I think it is actually easier to program lots
and lots of processors than "just" several processors, and I think that
most simulations have strong locality-of-communication properties.)

    Do not forget! We are not developing these machines in a vacuum.
    We have to look at the applications which may be run on these machines.
    Consider a 100 x 100 x 100 array with 30 variables (and increasing
    as our knowledge of the natural sciences increases).

I claim that what you really want is 100^3 processors each with only
enough memory for the 30 variables that you need.  I think that the
problems that you are addressing are due to the fact that the natural
granularity of the problem does not match the natural granularity of the
machine you are using.

If you are iteratively solving PDE's on a 100^3 matrix, then there is
not that much communication between different parts of the matrix (there
is some local communication), so you might want to put each element of
the matrix in a different processor, where the processors are (at least
logically, if not physically) arranged in a 3-D grid network.
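
Here is a serial Python sketch of that communication pattern, with the grid
shrunk for illustration: each cell's update needs only its six axis neighbors,
so a one-cell-per-processor layout generates nothing but nearest-neighbor
traffic.  (The averaging update is a generic relaxation step, not any
particular production code.)

    import numpy as np

    def jacobi_sweep(u):
        # One relaxation sweep: each interior cell becomes the mean of its six
        # axis neighbors, which is the only off-node data a cell-per-processor
        # layout would have to fetch each step.
        v = u.copy()
        v[1:-1, 1:-1, 1:-1] = (
            u[:-2, 1:-1, 1:-1] + u[2:, 1:-1, 1:-1] +     # +/- x neighbors
            u[1:-1, :-2, 1:-1] + u[1:-1, 2:, 1:-1] +     # +/- y neighbors
            u[1:-1, 1:-1, :-2] + u[1:-1, 1:-1, 2:]       # +/- z neighbors
        ) / 6.0
        return v

    u = np.zeros((16, 16, 16))       # a small stand-in for the 100x100x100 case
    u[8, 8, 8] = 1.0                 # point source
    for _ in range(10):
        u = jacobi_sweep(u)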

(Handwaving cost argument:  If adding a processor to every couple of
thousand bits of memory increases the memory cost by only a few percent
(including all the costs of power supplies, boards, air conditioners
etc.), and furthermore, you can use an otherwise cheaper memory system
because you don't need all the interleave hardware, and the memory does
not have to be that fast, and you don't need caches between the
processor(s) and the memory, then it might very well be the case that
for essentially the cost of a large (but relatively low cost per bit)
memory you can have a supercomputer.  Why put the central processor(s)
in?)

    I can barely fit a fluid dynamics code on a 32-node Hypercube
    because the storage requirements per CPU are fierce [a different
    example, not the 100^3x30 example].

Details?

    Jack Dennis had to revise the way he thought dataflow machines need
    to be built after two weeks here: he needs more memory and faster
    I/O.

That is not quite how I interpreted the report of the RIACS study that
he gave to his research group.  He mostly seemed pleased that FORTRAN
programmers could actually learn to program in an applicative language.
Adding memory and faster I/O may be just as expensive as just adding
more processors.  (Of course, Jack thinks I am an extremist for
believing that a million processors is not enough, so my interpretation
of what he said is also subject to correction (e.g. I would not want to
try to explain to him how I ever got the idea that he said or meant the
things I am attributing to him. :-))

    If there is any one thing multiprocessors allow us to do, it is add yet
    more memory.  Hum? :-)

If there is one thing that adding more memory does, it is increase the
cost of the system, especially when you have to do more hacking to get
the processors to talk to the memory (e.g. shared memory,
interleaving...)

    > > What is the start up time for your vectors (i.e. how big does
    > > a vector have to be before the vector processing part wins over the
    > > scalar processing part.)

    I have just tested this recently on four different CRAY architectures.
    Short vector startup time is very good.

I am aware that the Cray is good at short vectors (that's one of the
reasons that the Cray is so popular.)  The Cray is one of the few vector
machine architectures which is good at short vectors.  My question had
to do with how good Dr. Brooks' computer is at vectors (he said that he
was connecting several "Cray class" processors together, which is not
the same thing as several Cray processors.)

    > > Typical vector processors are limited in their
    > > speed by lack of memory bandwidth (this is true for a single processor
    > > with high bandwidth memories (e.g. the CRAY uses a 16-way interleaved

    Our Cray has a much higher degree of interleave.  The new C-2 will have
    128-way.

  Which only reinforces the idea that vector processors are often
limited by their bandwidth.  I understand that the C-2 (Cray 2?) is
running a faster clock (~4ns) and slower memory (>100ns) than the Cray
1 (12.5 (or 9) ns and 50 (or faster) ns respectively).  This means that
an interleave of something like 25 is the minimum that you can get away
with without loss of performance, and just as the Cray 1 "has to have"
an interleave of four to keep the processor busy on memory operations,
there are a number of reasons for increasing the interleave far beyond
the minimum (e.g. the processor hitting the memory interleave, multiple
processors with one memory, independent I/O processors).  I have also
heard rumors to the effect that Cray has broken down and put multiport
memory in.
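
The back-of-the-envelope behind "something like 25", using the rough cycle
times quoted above (the real figures may differ):

    import math

    clock_ns  = 4      # quoted Cray-2 processor cycle
    memory_ns = 100    # quoted memory cycle

    # To deliver one word per processor cycle, enough banks must cycle in
    # parallel to cover the memory cycle time:
    min_interleave = math.ceil(memory_ns / clock_ns)   # == 25
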
  These high bandwidth memories are very expensive, and I think there is
an argument for avoiding them.  Note that this argument, by itself, is
good for arguing for anything from the 10 KFLOP processors that the
Connection Machine uses up to the 10 MIPS processors that Brooks
advocates, but that it becomes hard to justify a 1 GFLOP processor simply
because of the high cost of the memory system.

  I have some specific questions for Dr. Brooks (which I suspect he has
good answers for): 

  In the discussion of the architecture which you advocate, we have been
concentrating on the bandwidth considerations.  I am now wondering about
latency issues.

    What is the latency of your shared memory?  (If you are running a
10 MIPS processor, does that mean that a memory operation on shared
memory can complete in time for the next memory operation to continue?
Do you use pipelining to help deal with any such latency problems?  If
you do, how "full" and "deep" does the pipeline have to be to keep the
processor busy?  Is there local memory for each processor, and does it
have lower latency than the global memory?)
-- 
bradley@think.uucp  (i.e. {decvax!cca,ihnp4!mit-eddie}!think!bradley)
bradley@THINK.ARPA