[comp.arch] Connection Machine Argument

reiter@endor.harvard.edu (Ehud Reiter) (12/03/86)

The idea behind the Connection Machine architecture, of having an expensive
hypercube interconnect with extremely simple 1-bit SIMD processors at the
nodes, has bothered me quite a bit.  The below tries to rigorously argue that
such an architecture is inappropriate.  I would very much welcome any comments
on this, either to me personally or to the net.

     According to J. Ullman in COMPUTATIONAL ASPECTS OF  VLSI,  chapter  6,  the
cost  of  a butterfly network in VLSI is O(n*n).  That is, if the entire network
is on a chip, the size of the network is a quadratic function of the  number  of
nodes connected.  This is a lower bound (i.e. any butterfly network MUST take at
least this amount of space).  A similar result applies to a hypercube, and a
shuffle-exchange network is only slightly cheaper at O((n/log(n))**2).

     This has two implications for the Connection Machine  architecture  of  many
1-bit  processors connected in a hypercube, if the goal is to eventually be able
to realize such an architecture on a single chip/wafer.

     First, the dense hypercube interconnect should only be used when necessary,
and  a  mesh  interconnect  should  be used when possible, since it will be much
cheaper, in hardware terms.  Since most of the applications that it  is  claimed
can run on a C.M. (vision, database, graphics, etc) can be run on a grid system,
in the long term they will not be cost-effective to run on a Connection  Machine
(indeed,  they  are  not  cost-effective today, compared to alternative parallel
architectures).

     Second, for applications which absolutely require the  dense  interconnect,
the  processors  should be made as large as possible and should NOT be simple 1-
bit machines.  Since the size of the interconnect grows as n squared, it will be
by far the largest component of a large system, and if we can replace each m
simple processors by 1 complex processor, we will reduce the size of the
interconnect by a factor of m*m, and thus shrink the size of the system.

     For example, suppose I wish to build a complete system on a  single  wafer,
and  I  have 10,000,000,000 units of silicon.  Suppose a simple processor with 1
unit of processing power has an area of 1000, while a complex processor with  10
units  of processing power has an area of 100,000.  Suppose the hypercube inter-
connect requires n*n area.  Then, if I use simple processors,  the  interconnect
area limits me to about 100,000 processors, while if I use complex processors, I
am limited to about 60,000 processors.

     In other words, even though each individual simple processor  is  10  times
more  efficient (1 processing unit per 1000 area units) than a complex processor
(10 processing units per 100,000 area units), the system built  out  of  complex
processors  is  more powerful  (600,000 processing units) than a system built of
simple processors (100,000 processing units).
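
For anyone who wants to check the arithmetic, here is a small sketch in Python.  The numbers (a budget of 10**10 area units, simple processor: area 1000 / power 1, complex processor: area 100,000 / power 10, interconnect area n*n) come from the example above; the binary search is just a convenient way to find the largest system that fits.

```python
# Reproduce the wafer example: find the largest n such that
# interconnect (n*n) plus processors (n * proc_area) fits the budget.
def max_processors(budget, proc_area):
    """Largest n satisfying n*n + n*proc_area <= budget."""
    lo, hi = 0, budget
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid + mid * proc_area <= budget:
            lo = mid            # mid still fits; try larger
        else:
            hi = mid - 1        # mid too big; shrink
    return lo

BUDGET = 10**10
n_simple = max_processors(BUDGET, 1000)      # about 100,000 processors
n_complex = max_processors(BUDGET, 100000)   # about 60,000 processors
print("simple: ", n_simple, "processors,", n_simple * 1, "power units")
print("complex:", n_complex, "processors,", n_complex * 10, "power units")
```

Run as-is, this reproduces the figures in the text: roughly 100,000 simple processors (about 100,000 power units) against roughly 60,000 complex ones (about 600,000 power units).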

     This result is not dependent on the numbers in the example, but is general.
Given  any  two  processors, one of size SA and power PA, and one of size SB and
power PB, for sufficiently large systems I am always better off using  the  pro-
cessor with more power, no matter what the actual size and power ratios are.
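
A sketch of why the result generalizes (the sizes and powers below are made-up illustrations, not from the example above): with interconnect area n*n, a budget A admits n = (-S + sqrt(S*S + 4*A))/2 processors of size S, which tends to sqrt(A) for any fixed S.  Total power then tends to P*sqrt(A), so for a big enough budget only the per-processor power P matters, no matter how unflattering the size ratio.

```python
# Total power of a system of n processors of size S and power P,
# where n solves n*n + n*S = budget (interconnect dominates growth).
import math

def total_power(budget, size, power):
    n = int((-size + math.sqrt(size * size + 4 * budget)) / 2)
    return n * power

SA, PA = 1000, 1         # small, area-"efficient" processor
SB, PB = 1000000, 2      # a processor 1000x bigger but only 2x as powerful
for budget in (10**8, 10**12, 10**16):
    print(budget, total_power(budget, SA, PA), total_power(budget, SB, PB))
```

At the smallest budget the small processor still wins; at the larger budgets the bigger, more powerful processor pulls ahead, even with this deliberately extreme 1000-to-1 size penalty for a mere factor of 2 in power.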

     In summary, if we assume that the continuing  progress  of  microelectronic
technology and/or a shift to wafer-scale technology requires complete systems to
be fabricated  as a single VLSI unit, then  the  Connection  Machine  is  not  a
viable architecture.  The analogy sometimes drawn to the brain is misleading,
because the brain is a 3-D structure, in which the interconnect is much cheaper
(n**1.5), while VLSI systems are 2-D structures.


						Ehud Reiter
						reiter@harvard  (ARPA,UUCP)

ram@spice.cs.cmu.edu (Rob MacLachlan) (12/04/86)

    The Connection Machine is not committed to a hypercube architecture.
Routing of messages is done by the hardware below the instruction-set level.
Of course the actual cost of sending a message to a particular address will
depend on the topology of the connection network.  The existing languages
for programming the CM don't allow the underlying network topology to be
seen.

I would say that you have misunderstood the idea behind the CM, at least at
the routing level.  I think a better summary would be to say that there are
O(n) processors connected by an appropriate interconnect with a smart
controller.  Intelligence and flexibility at the routing level is definitely
part of the original concept, although it may not be fully realized in the
current design due to a desire to get something out the door.

There certainly is something to the claim that 1-bit processors aren't the
ultimate for everything, but they seem to be well suited to the symbolic
processing tasks that the CM was originally designed for.  For things such
as floating point, it would be quite possible to associate a floating point
processor with each bit-serial processor.

  Rob

hsu@eneevax.UUCP (Dave Hsu) (12/04/86)

In article <745@husc6.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes:
>The idea behind the Connection Machine architecture, of having an expensive
>hypercube interconnect with extremely simple 1-bit SIMD processors at the
>nodes, has bothered me quite a bit.  The below tries to rigorously argue that
>such an architecture is inappropriate.  I would very much welcome any comments
>on this, either to me personally or to the net.
>...
>     This has two implications on the Connection Machine  architecture  of  many
>1-bit  processors connected in a hypercube, if the goal is to eventually be able
>to realize such an architecture on a single chip/wafer.

Precisely here ^^^ is the problem.

>     First, the dense hypercube interconnect should only be used when necessary,
>and  a  mesh  interconnect  should  be used when possible, since it will be much
>cheaper, in hardware terms.  Since most of the applications that it  is  claimed
>can run on a C.M. (vision, database, graphics, etc) can be run on a grid system,
>in the long term they will not be cost-effective to run on a Connection  Machine
>(indeed,  they  are  not  cost-effective today, compared to alternative parallel
>architectures).

I am decidedly not an expert on the CM, not even a novice yet; I haven't
seen our department's machine and I don't have my copy of Hillis' book
nearby, but it seems to me that the router does in fact perform the first
task to a degree (the first few bits' worth, anyway), and for nearby nodes,
talking is cheap.

On the second point, it seems to me that the CM is not quite as
cumbersome at convolving an image as the MPP (comparable in
individual node power as well as in number) would be if you tried
to associate lisp nodes on it.  The interconnect scheme makes a
tremendous difference, but there is more flexibility in one direction
than in the other.

>     Second, for applications which absolutely require the  dense  interconnect,
>the  processors  should be made as large as possible and should NOT be simple 1-
>bit machines...
>     For example, suppose I wish to build a complete system on a  single  wafer,
>and  I  have 10,000,000,000 units of silicon.  Suppose a simple processor with 1
>unit of processing power has an area of 1000, while a complex processor with  10
>units  of processing power has an area of 100,000.  Suppose the hypercube inter-
>connect requires n*n area...
>
>     In other words, even though each individual simple processor  is  10  times
>more  efficient (1 processing unit per 1000 area units) than a complex processor
>(10 processing units per 100,000 area units), the system built  out  of  complex
>processors  is  more powerful  (600,000 processing units) than a system built of
>simple processors (100,000 processing units).
>
>     This result is not dependent on the numbers in the example, but is general.
>
>						Ehud Reiter

As you observe, the 2-D VLSI interconnect loses big to the 3-D brain
interconnect.  But wait!  Why do we want VLSI interconnects?  As your
numbers show, the architecture that packs more gates per unit of
silicon will have more power.  In the wafer argument, you suppose (1)
that putting everybody on one or two wafers is superior, and (2) that
the interconnect size for a complex node does NOT increase in
complexity over that for a simple node.

(1) Having everybody on one wafer is advantageous only if the
interconnect speed between processors on different chips is that much
slower.  However, by keeping tremendous numbers of processors on a
single wafer, what speed you gain by throwing out bus drivers,
backplanes and the like is clobbered by the speed you lose by not using
3-D interconnects in a hypercube architecture.  In a 3-D world, if we
suppose you have only 2,000,000 units of silicon available, for the
simple processor model you can pack 1,000 processors (not far from
realistic figures) but for the complex model, you may have fewer than 20.
Where did the rest of the complex power go?  Nowhere.  The simple model
caught up by disposing of idle 2-D interconnects.

(2) In your example we now have a model of 10-bit processors with a
bit-serial I/O path.  Obviously, there is a bottleneck, and that
bottleneck will cost you the load time for 10 bits, whereupon you end
up with a very idle processor that acts like...60,000 simple ones.  Or
you can spend the extra complexity on silicon...

-dave
-- 
David Hsu 	   Comm. & Signal Processing Lab, Systems Research Center /or/
"Him? No Comment." Systems Staff, Engineering Computer Facility, Dept. of EE
	-UMd	   The University of Maryland, College Park, MD 20742
ARPA: hsu@eneevax.umd.edu	UUCP: [seismo,allegra]!umcp-cs!eneevax!hsu

roger@telesoft.UUCP (Roger Arnold @prodigal) (12/05/86)

> The idea behind the Connection Machine architecture, of having an expensive
> hypercube interconnect with extremely simple 1-bit SIMD processors at the
> nodes, has bothered me quite a bit.  ...
> 
>      According to J. Ullman in COMPUTATIONAL ASPECTS OF  VLSI, chapter  
> 6,  the cost  of  a butterfly network in VLSI is O(n*n).  ...
> 
> ...  Since the size of the interconnect grows as n squared, it will be
> by  far  the  largest  component of a large system, and if we can
> replace each m simple processors by 1 complex processor, we will reduce
> the size of the  interconnect by m*m, and thus the size of the system...
> 
> 					Ehud Reiter
> 					reiter@harvard  (ARPA,UUCP)

I'm no authority on the connection machine, and perhaps I'm 
misinterpreting your argument, but is it not precisely the high cost
of the interconnect that dictates the use of 1-bit processors?  You
seem to be assuming that the size of the interconnect is independent
of the width of the data path, whereas it is probably close to linear.
It's most efficient to pump the bits serially through the network, and
if that's how they go over the network, that's how you want to process
them too.  As long as the processor cycle is matched to the network 
transmission speed, buffering for a wider processor just adds overhead, 
and buys nothing.

			      Roger Arnold
			      ..sdcsvax!telesoft!roger

danny@think.COM (Danny Hillis) (12/05/86)

	Your argument mixes two ideas. If by "Connection Machine" you
mean the current 64K-processor implementation (the CM1) then it is not
implemented on a wafer, so the arguments do not apply. Those wires are
not a large part of the overall cost of a CM1. If on the other hand you
mean the Connection Machine architecture, in the sense of a family of
software-compatible machines, then there is no reason to constrain the
implementation to either single-bit processors or to any particular
interconnection topology. Connection Machine programs do not depend on
those details.
	The main point of the Connection Machine architecture is to
provide a level of abstraction at which these things don't matter.
Things like the optimal interconnection topology and processor size are
dependent on time-varying details like the cost of connectors, the
feasibility of wafer-scale integration, and the size of the machine.
The programmer would presumably like to write code that keeps running
faster as these things improve, without having to rethink everything each
time someone invents a new interconnection topology. That's why the
Connection Machine has a message router.
	I certainly agree with your arguments that hypercubes are not
likely to be the best interconnection topology for wafer-scale machines
and that we can use larger processors as it becomes easier to make them,
but I think these are reasons why the Connection Machine architecture
makes sense.			-danny

reiter@endor.harvard.edu (Ehud Reiter) (12/05/86)

I recently posted a comment on the net criticizing the Connection Machine
architecture of a large hypercube of 1-bit processors.  Very briefly, my
argument was that it was foolish to build such a system: since
the hypercube interconnect was so expensive (O(n*n) in the number of processors),
it would dominate the cost of the system, and thus there was little cost
savings, but a large performance penalty, in using small processors at the
nodes instead of large ones.  I've received a number of responses, and here
is my reply to these replies:

1)  My original argument was based on trying to build such a system with
wafer-scale integration, and a number of people criticized this assumption.
Now, obviously the question of how important wafer-scale integration will
be in the future has not been settled, but the same arguments also apply
to conventional systems, although in a much messier fashion.  Just as you
can prove that a hypercube requires O(n**2) space in a 2-D system, you can
prove that a hypercube requires O(n**1.5) in a 3-D system, so the space
devoted to the hypercube interconnect will eventually dominate the space
devoted to the processors, which grows as O(n).  The problem is that the
relationship between space and cost is much clearer in a chip, where
everything is implemented in silicon, than it is in a conventional system,
where wires, back-plane connectors, and chips all have different costs.
Nevertheless, one can point out that, for example, in a hypercube the amount
of wiring needed on the PC boards and the size of the backplane connectors
is at least a linear function of the number of processors on the PC board.
In other words, the density of a hypercube may be limited by board and
connector constraints, not by chip density - and any architecture that can't
take advantage of improving chip technology is in trouble.
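
The scaling claim in point 1 is easy to see numerically.  In the sketch below, only the exponents (n**2 for 2-D, n**1.5 for 3-D, n for processors) come from the argument above; the per-processor size of 1000 units is an arbitrary assumed constant.

```python
# Fraction of total system area/volume consumed by the hypercube
# interconnect, in 2-D (grows as n**2) and 3-D (grows as n**1.5).
# Processor space grows only as n, so both fractions approach 1 -
# it just takes longer in 3-D.
PROC_SIZE = 1000
for n in (10**3, 10**6, 10**9):
    frac_2d = n**2 / (n**2 + PROC_SIZE * n)
    frac_3d = n**1.5 / (n**1.5 + PROC_SIZE * n)
    print(n, round(frac_2d, 3), round(frac_3d, 3))
```

By a billion processors the interconnect is over 96% of the system even in 3-D, which is the point: the hypercube eventually dominates whichever way you build it.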

2)  A number of people have made arguments to the effect that a 32 bit
processor, for example, would require a 32-bit wide interconnect, which would
then increase the cost of the interconnect system by a factor of 32.
I think, though, that the questions of how big the processors are and how
wide the interconnect is must be addressed separately.  I am not trying
to argue that a 32-bit interconnect is better than a 1-bit interconnect -
I am arguing that a 32-bit processor costs little more than a 1-bit
processor, in this context, and, at least for some applications, would
give better performance than a 1-bit processor.  There may of course
be applications (perhaps including the semantic nets which originally
motivated the Connection Machine) where in fact a 32-bit processor is
no better than a 1-bit processor, but as long as the hypercube of 32-bit
processors costs the same as the hypercube of 1-bit processors, and the 32-bit
system gives more performance for at least some applications, then I maintain
that the 32-bit system is superior.  If A and B cost the same, and B at least
sometimes performs better, then I should buy B, not A.
    Note, by the way, that the size of the network is only linear in the
network width, while it is quadratic (in VLSI) in the number of processors - i.e.
a system of N 32-bit processors with a 32-bit wide interconnect is far cheaper
than a system of 32*N 1-bit processors with a 1-bit wide interconnect.
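
The closing comparison can be checked in a line or two of Python, assuming (per the VLSI bound cited earlier) an interconnect area proportional to width times the square of the processor count:

```python
# Interconnect area model: linear in path width, quadratic in the
# number of processors connected.
def interconnect_area(n_procs, width):
    return width * n_procs ** 2

N = 1000                                 # arbitrary processor count
wide = interconnect_area(N, 32)          # N 32-bit processors
narrow = interconnect_area(32 * N, 1)    # 32*N 1-bit processors
print(narrow // wide)                    # -> 32: the 1-bit system costs 32x more
```

Squaring the factor-of-32 increase in processor count outweighs the factor-of-32 savings in width, independent of N.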

3)  A number of people have also said that I'm missing the point of the
Connection Machine, which is a broader concept than the current implementation.
Fine - the only point I'm trying to make here is that a hypercube of tiny
processors is a bad idea.

4)  A number of people have mentioned Hillis's book THE CONNECTION MACHINE.
I have in fact read the book, and I have a number of criticisms of it:
   a)  The book does not address the question of how costly the interconnect is.
Indeed, the book seems to imply that the cost of a Connection Machine
will rise linearly with the number of processors, and I think this is
simply not true, because in either 2-D or 3-D, the size of a hypercube
interconnect rises much faster than linearly.
   b)  The book justifies using small processors because they give more
processing power per unit of area than large processors.  The whole thrust
of my argument, of course, has been that when one looks at the area of the
complete system, as opposed to the area devoted to processors, then it is
the large processors which are more efficient.
   c)  When comparing algorithms on a conventional system to algorithms on
a Connection Machine, the book only looks at simple algorithms.  For
instance, claims that the Connection Machine is much faster at database
lookup than uniprocessors assume that the uniprocessors are searching
the database linearly, instead of using indexes;  claims that the
Connection Machine is much faster at convolutions assume that the
uniprocessors are convolving in the straightforward way, instead of
using FFT's; etc.  Comparisons should be made against the best uniprocessor
algorithms, not the worst.

					Ehud Reiter
					reiter@harvard	  (ARPA,UUCP)

faustus@ucbcad.BERKELEY.EDU (Wayne A. Christopher) (12/06/86)

Maybe I'm missing something, but why is the cost of interconnect in a 
hypercube O(n^2) instead of O(n log n)?  I can understand that in 2D
you can't just connect each processor to its log n "nearest neighbors"
because they aren't so near anymore, but why should the cost be quadratic?

	Wayne

reiter@endor.harvard.edu (Ehud Reiter) (12/07/86)

In article <1157@ucbcad.BERKELEY.EDU> faustus@ucbcad.BERKELEY.EDU (Wayne A. Christopher) writes:
>Maybe I'm missing something, but why is the cost of interconnect in a 
>hypercube O(n^2) instead of O(n log n)?  I can understand that in 2D
>you can't just connect each processor to its log n "nearest neighbors"
>because they aren't so near anymore, but why should the cost be quadratic?

See Ullman's COMPUTATIONAL ASPECTS OF VLSI for a real proof.  In very crude
terms, the idea is to consider a box containing n processors and a hypercube
network.  Draw a vertical line splitting the box, with half the processors on
one side and half on the other.  Now, the hypercube is a "global parallel"
interconnect, which allows each processor on one side of the dividing line to
have a circuit to a processor on the other side of the line.  Thus, we must
have n/2 circuits crossing the dividing line.  Since there are only a small,
fixed number of layers in which wires can run, this means that the length of
the splitting vertical line is O(n) - i.e. the height of the box is O(n).  A
similar argument shows that the width of the box is O(n).  Thus, the area of
the box is O(n^2).

This is of course just a crude sketch - again, see Ullman's book for a real
proof.
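
The bisection count itself can be verified by brute force.  The sketch below assumes the dividing line separates processors by their top address bit, which is one valid bisection: in a d-dimensional hypercube of n = 2**d nodes, exactly n/2 edges cross the cut.

```python
# Count hypercube edges crossing a bisection of the node set.
from itertools import combinations

def crossing_edges(d):
    n = 1 << d              # number of nodes in a d-cube
    top = 1 << (d - 1)      # top address bit defines the cut
    count = 0
    for a, b in combinations(range(n), 2):
        if bin(a ^ b).count("1") == 1:      # a-b is a hypercube edge
            if (a & top) != (b & top):      # the edge crosses the cut
                count += 1
    return count

for d in range(1, 8):
    assert crossing_edges(d) == (1 << d) // 2
print("n/2 edges cross the bisection for every cube tested")
```
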
						Ehud Reiter
						reiter@harvard	(ARPA,UUCP)