reiter@endor.harvard.edu (Ehud Reiter) (12/03/86)
The idea behind the Connection Machine architecture, of having an expensive hypercube interconnect with extremely simple 1-bit SIMD processors at the nodes, has bothered me quite a bit. What follows tries to rigorously argue that such an architecture is inappropriate. I would very much welcome any comments on this, either to me personally or to the net.

According to J. Ullman in COMPUTATIONAL ASPECTS OF VLSI, chapter 6, the cost of a butterfly network in VLSI is O(n*n). That is, if the entire network is on a chip, the size of the network is a quadratic function of the number of nodes connected. This is a lower bound (i.e., any butterfly network MUST take at least this amount of space). A similar result applies to a hypercube, and a shuffle-exchange network is only slightly cheaper at O((n/log(n))**2).

This has two implications for the Connection Machine architecture of many 1-bit processors connected in a hypercube, if the goal is to eventually be able to realize such an architecture on a single chip/wafer.

First, the dense hypercube interconnect should only be used when necessary, and a mesh interconnect should be used when possible, since it will be much cheaper in hardware terms. Since most of the applications that it is claimed can run on a C.M. (vision, database, graphics, etc.) can be run on a grid system, in the long term they will not be cost-effective to run on a Connection Machine (indeed, they are not cost-effective today, compared to alternative parallel architectures).

Second, for applications which absolutely require the dense interconnect, the processors should be made as large as possible and should NOT be simple 1-bit machines. Since the size of the interconnect grows as n squared, it will be by far the largest component of a large system, and if we can replace every m simple processors by 1 complex processor, we will reduce the size of the interconnect by a factor of m*m, and thus the size of the system.
For example, suppose I wish to build a complete system on a single wafer, and I have 10,000,000,000 units of silicon. Suppose a simple processor with 1 unit of processing power has an area of 1000, while a complex processor with 10 units of processing power has an area of 100,000. Suppose the hypercube interconnect requires n*n area. Then, if I use simple processors, the interconnect area limits me to about 100,000 processors, while if I use complex processors, I am limited to about 60,000 processors.

In other words, even though each individual simple processor is 10 times more efficient (1 processing unit per 1000 area units) than a complex processor (10 processing units per 100,000 area units), the system built out of complex processors is more powerful (600,000 processing units) than the system built out of simple processors (100,000 processing units).

This result is not dependent on the numbers in the example, but is general. Given any two processors, one of size SA and power PA, and one of size SB and power PB, for sufficiently large systems I am always better off using the processor with more power, no matter what the actual size and power ratios are.

In summary, if we assume that the continuing progress of microelectronic technology and/or a shift to wafer-scale technology requires complete systems to be fabricated as a single VLSI unit, then the Connection Machine is not a viable architecture. The analogy sometimes made to the brain is misleading, because the brain is a 3-D structure, where the interconnect is much cheaper (n**1.5), while VLSI systems are 2-D structures.

	Ehud Reiter
	reiter@harvard (ARPA,UUCP)
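[The wafer arithmetic above can be checked mechanically. This is just a sketch of the post's own model - n*n interconnect area plus linear processor area, against the stated budget; the numbers are from the example, and the quadratic-formula solver is the only thing added.]

```python
import math

BUDGET = 10_000_000_000  # total silicon area units, from the example

def max_processors(proc_area, budget=BUDGET):
    # Largest n satisfying n*n (interconnect) + proc_area*n <= budget,
    # via the positive root of n**2 + proc_area*n - budget = 0.
    return int((-proc_area + math.sqrt(proc_area ** 2 + 4 * budget)) / 2)

simple = max_processors(1_000)      # 1 unit of processing power each
complex_ = max_processors(100_000)  # 10 units of processing power each

print(simple, simple * 1)        # ~100,000 processors, ~100,000 power units
print(complex_, complex_ * 10)   # ~60,000 processors, ~600,000 power units
```

Run as-is, this gives roughly 99,500 simple processors (99,500 power units) versus roughly 61,800 complex ones (618,000 power units), matching the figures in the post.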
ram@spice.cs.cmu.edu (Rob MacLachlan) (12/04/86)
The Connection Machine is not committed to a hypercube architecture. Routing of messages is done by the hardware, below the instruction-set level. Of course the actual cost of sending a message to a particular address will depend on the topology of the connection network, but the existing languages for programming the CM don't allow the underlying network topology to be seen.

I would say that you have misunderstood the idea behind the CM, at least at the routing level. I think a better summary would be to say that there are O(n) processors connected by an appropriate interconnect with a smart controller. Intelligence and flexibility at the routing level is definitely part of the original concept, although it may not be fully realized in the current design due to a desire to get something out the door.

There certainly is something to the claim that 1-bit processors aren't the ultimate for everything, but they seem to be well suited to the symbolic processing tasks that the CM was originally designed for. For things such as floating point, it would be quite possible to associate a floating-point processor with each bit-serial processor.

	Rob
hsu@eneevax.UUCP (Dave Hsu) (12/04/86)
In article <745@husc6.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes:
>The idea behind the Connection Machine architecture, of having an expensive
>hypercube interconnect with extremely simple 1-bit SIMD processors at the
>nodes, has bothered me quite a bit. The below tries to rigorously argue that
>such an architecture is inappropriate. I would very much welcome any comments
>on this, either to me personally or to the net.
>...
> This has two implications on the Connection Machine architecture of many
>1-bit processors connected in a hypercube, if the goal is to eventually be able
>to realize such an architecture on a single chip/wafer.

Precisely here ^^^ is the problem.

> First, the dense hypercube interconnect should only be used when necessary,
>and a mesh interconnect should be used when possible, since it will be much
>cheaper, in hardware terms. Since most of the applications that it is claimed
>can run on a C.M. (vision, database, graphics, etc) can be run on a grid system,
>in the long term they will not be cost-effective to run on a Connection Machine
>(indeed, they are not cost-effective today, compared to alternative parallel
>architectures).

I am decidedly not an expert on the CM, not even a novice yet; I haven't seen our department's machine, and I don't have my copy of Hillis' book nearby. But it seems to me that the router does in fact perform the first task to a degree (for the first few bits' worth, anyway), and for nearby nodes, talking is cheap. On the second point, it seems to me that the CM is not quite as cumbersome at convolving an image as the MPP (comparable in individual node power as well as in number) would be if you tried to associate lisp nodes on it. The interconnect scheme makes a tremendous difference, but there is more flexibility in one direction than in the other.

> Second, for applications which absolutely require the dense interconnect,
>the processors should be made as large as possible and should NOT be simple 1-
>bit machines...
> For example, suppose I wish to build a complete system on a single wafer,
>and I have 10,000,000,000 units of silicon. Suppose a simple processor with 1
>unit of processing power has an area of 1000, while a complex processor with 10
>units of processing power has an area of 100,000. Suppose the hypercube inter-
>connect requires n*n area...
>
> In other words, even though each individual simple processor is 10 times
>more efficient (1 processing unit per 1000 area units) than a complex processor
>(10 processing units per 100,000 area units), the system built out of complex
>processors is more powerful (600,000 processing units) than a system built of
>simple processors (100,000 processing units).
>
> This result is not dependent on the numbers in the example, but is general.
>
> Ehud Reiter

As you observe, the 2-D VLSI interconnect loses big to the 3-D brain interconnect. But wait! Why do we want VLSI interconnects? As your numbers show, the architecture that packs more gates per unit of silicon will have more power. In the wafer argument, you suppose (one) that putting everybody on one or two wafers is superior, and (two) that the interconnect size for a complex node does NOT increase in complexity over that for a simple node.

(1) Having everybody on one wafer is advantageous only if the interconnect speed between processors on different chips is that much slower. However, by keeping tremendous numbers of processors on a single wafer, what speed you gain by throwing out bus drivers, backplanes and the like is clobbered by the speed you lose by not using 3-D interconnects in a hypercube architecture. In a 3-D world, if we suppose you have only 2,000,000 units of silicon available, for the simple processor model you can pack 1,000 processors (not far from realistic figures), but for the complex model you may have fewer than 20. Where did the rest of the complex power go? Nowhere. The simple model caught up by disposing of idle 2-D interconnects.
(2) In your example we now have a model of 10-bit processors with a bit-serial I/O path. Obviously there is a bottleneck, and that bottleneck will cost you the load time for 10 bits, whereupon you end up with a very idle processor that acts like...60,000 simple ones. Or you can spend the extra complexity on silicon...

-dave
--
David Hsu		Comm. & Signal Processing Lab, Systems Research Center /or/
"Him? No Comment."	Systems Staff, Engineering Computer Facility, Dept. of EE -UMd
The University of Maryland, College Park, MD 20742
ARPA: hsu@eneevax.umd.edu  UUCP: [seismo,allegra]!umcp-cs!eneevax!hsu
roger@telesoft.UUCP (Roger Arnold @prodigal) (12/05/86)
> The idea behind the Connection Machine architecture, of having an expensive
> hypercube interconnect with extremely simple 1-bit SIMD processors at the
> nodes, has bothered me quite a bit. ...
>
> According to J. Ullman in COMPUTATIONAL ASPECTS OF VLSI, chapter
> 6, the cost of a butterfly network in VLSI is O(n*n). ...
>
> ... Since the size of the interconnect grows as n squared, it will be
> by far the largest component of a large system, and if we can
> replace each m simple processors by 1 complex processor, we will reduce
> the size of the interconnect by m*m, and thus the size of the system...
>
> Ehud Reiter
> reiter@harvard (ARPA,UUCP)

I'm no authority on the connection machine, and perhaps I'm misinterpreting your argument, but is it not precisely the high cost of the interconnect that dictates the use of 1-bit processors? You seem to be assuming that the size of the interconnect is independent of the width of the data path, whereas it is probably close to linear in it.

It's most efficient to pump the bits serially through the network, and if that's how they go over the network, that's how you want to process them too. As long as the processor cycle is matched to the network transmission speed, buffering for a wider processor just adds overhead, and buys nothing.

	Roger Arnold
	..sdcsvax!telesoft!roger
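[Roger's point - that network size is roughly linear in data-path width, while by Reiter's earlier post it is quadratic in processor count - can be put into a toy cost model. A sketch under exactly those two assumptions; the formula area = width * count**2 is a composite of the two posts' claims, not a formula from either one.]

```python
def network_area(n_procs, width):
    # Assumed model: quadratic in processor count, linear in path width.
    return width * n_procs ** 2

N = 1_000                         # illustrative processor count
wide = network_area(N, 32)        # N 32-bit processors, 32-bit paths
narrow = network_area(32 * N, 1)  # 32*N 1-bit processors, 1-bit paths

# Under this model, spreading the same total bit-width over 32 times
# as many serial processors makes the network 32 times larger:
print(narrow // wide)
```

In general the ratio comes out to the width factor itself: (m*N)**2 * 1 versus N**2 * m is m-to-1, so under these assumptions widening the processors always beats multiplying them.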
danny@think.COM (Danny Hillis) (12/05/86)
Your argument mixes two ideas. If by "Connection Machine" you mean the current 64K-processor implementation (the CM1), then it is not implemented on a wafer, so the arguments do not apply. Those wires are not a large part of the overall cost of a CM1.

If on the other hand you mean the Connection Machine architecture, in the sense of a family of software-compatible machines, then there is no reason to constrain the implementation to either single-bit processors or to any particular interconnection topology. Connection Machine programs do not depend on those details.

The main point of the Connection Machine architecture is to provide a level of abstraction at which these things don't matter. Things like the optimal interconnection topology and processor size are dependent on time-varying details like the cost of connectors, the feasibility of wafer-scale integration, and the size of the machine. The programmer would presumably like to write code that keeps running faster as these things improve, without having to rethink everything each time someone invents a new interconnection topology. That's why the Connection Machine has a message router.

I certainly agree with your arguments that hypercubes are not likely to be the best interconnection topology for wafer-scale machines, and that we can use larger processors as it becomes easier to make them, but I think these are reasons why the Connection Machine architecture makes sense.

	-danny
reiter@endor.harvard.edu (Ehud Reiter) (12/05/86)
I recently posted a comment on the net criticizing the Connection Machine architecture of a large hypercube of 1-bit processors. Very briefly, my argument was that it was foolish to build such a system because the hypercube interconnect is so expensive (O(n*n) in the number of processors) that it would dominate the cost of the system, and thus there is little cost savings, but a large performance penalty, in using small processors at the nodes instead of large ones. I've received a number of responses, and here are my replies:

1) My original argument was based on trying to build such a system with wafer-scale integration, and a number of people criticized this assumption. Now, obviously the question of how important wafer-scale integration will be in the future has not been settled, but the same arguments also apply to conventional systems, although in a much messier fashion. Just as you can prove that a hypercube requires O(n**2) space in a 2-D system, you can prove that a hypercube requires O(n**1.5) space in a 3-D system, so the space devoted to the hypercube interconnect will eventually dominate the space devoted to the processors, which grows as O(n). The problem is that the relationship between space and cost is much clearer in a chip, where everything is implemented in silicon, than it is in a conventional system, where wires, back-plane connectors, and chips all have different costs. Nevertheless, one can point out that, for example, in a hypercube the amount of wiring needed on the PC boards and the size of the backplane connectors is at least a linear function of the number of processors on the PC board. In other words, the density of a hypercube may be limited by board and connector constraints, not by chip density - and any architecture that can't take advantage of improving chip technology is in trouble.
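[The growth rates in point 1) can be tabulated numerically. A sketch that just plugs n into the asserted interconnect sizes (n**2 in 2-D, n**1.5 in 3-D) against a linear processor area; the per-processor constant of 1000 is an arbitrary illustrative figure, not a number from the post.]

```python
PROC_AREA = 1_000  # arbitrary per-processor area, for illustration only

def interconnect_fraction(n, exponent):
    # Fraction of the total system taken by the interconnect, given
    # interconnect ~ n**exponent and processor space ~ n * PROC_AREA.
    wires = n ** exponent
    return wires / (wires + n * PROC_AREA)

for n in (10_000, 1_000_000, 100_000_000):
    print(n, round(interconnect_fraction(n, 2.0), 3),   # 2-D case
             round(interconnect_fraction(n, 1.5), 3))   # 3-D case
```

In both geometries the fraction tends toward 1 as n grows - the interconnect eventually dominates - just more slowly in 3-D, which is the point of the paragraph above.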
2) A number of people have made arguments to the effect that a 32-bit processor, for example, would require a 32-bit wide interconnect, which would then increase the cost of the interconnect system by a factor of 32. I think, though, that the questions of how big the processors are and how wide the interconnect is must be addressed separately. I am not trying to argue that a 32-bit interconnect is better than a 1-bit interconnect - I am arguing that a 32-bit processor costs little more than a 1-bit processor, in this context, and, at least for some applications, would give better performance than a 1-bit processor. There may of course be applications (perhaps including the semantic nets which originally motivated the Connection Machine) where in fact a 32-bit processor is no better than a 1-bit processor, but as long as the hypercube of 32-bit processors costs the same as the hypercube of 1-bit processors, and the 32-bit system gives more performance for at least some applications, then I maintain that the 32-bit system is superior. If A and B cost the same, and B at least sometimes performs better, then I should buy B, not A.

Note, by the way, that the size of the network is only linear in the network width, while it is quadratic (in VLSI) in the number of processors - i.e., a system of N 32-bit processors with a 32-bit wide interconnect is far cheaper than a system of 32*N 1-bit processors with a 1-bit wide interconnect.

3) A number of people have also said that I'm missing the point of the Connection Machine, which is a broader concept than the current implementation. Fine - the only point I'm trying to make here is that a hypercube of tiny processors is a bad idea.

4) A number of people have mentioned Hillis's book THE CONNECTION MACHINE. I have in fact read the book, and I have a number of criticisms of it:

a) The question of how costly the interconnect is is not addressed.
Indeed, the book seems to imply that the cost of a Connection Machine will rise linearly with the number of processors, and I think this is simply not true, because in either 2-D or 3-D, the size of a hypercube interconnect rises much faster than linearly.

b) The book justifies using small processors because they give more processing power per unit of area than large processors. The whole thrust of my argument, of course, has been that when one looks at the area of the complete system, as opposed to the area devoted to processors, then it is the large processors which are more efficient.

c) When comparing algorithms on a conventional system to algorithms on a Connection Machine, the book only looks at simple algorithms. For instance, claims that the Connection Machine is much faster at database lookup than uniprocessors assume that the uniprocessors are searching the database linearly, instead of using indexes; claims that the Connection Machine is much faster at convolutions assume that the uniprocessors are convolving in the straightforward way, instead of using FFTs; etc. Comparisons should be made against the best uniprocessor algorithms, not the worst.

	Ehud Reiter
	reiter@harvard (ARPA,UUCP)
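[Criticism (c) is easy to illustrate for the database-lookup case. A sketch with made-up data; the "index" here is simply binary search over sorted keys, standing in for whatever index structure a real uniprocessor database would use.]

```python
import bisect

records = list(range(0, 1_000_000, 2))  # a sorted "database" of even keys

def linear_lookup(key):
    # The straw-man uniprocessor algorithm: O(n) scan of every record.
    for i, k in enumerate(records):
        if k == key:
            return i
    return None

def indexed_lookup(key):
    # The competitive algorithm: O(log n) binary search on the sorted keys.
    i = bisect.bisect_left(records, key)
    return i if i < len(records) and records[i] == key else None

# Same answers, wildly different work per query:
assert linear_lookup(123_456) == indexed_lookup(123_456) == 61_728
assert linear_lookup(7) is None and indexed_lookup(7) is None
```

A fair speedup claim would compare the parallel machine against the roughly 20 comparisons per query of the indexed version, not the 500,000 of the linear scan.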
faustus@ucbcad.BERKELEY.EDU (Wayne A. Christopher) (12/06/86)
Maybe I'm missing something, but why is the cost of interconnect in a hypercube O(n^2) instead of O(n log n)? I can understand that in 2D you can't just connect each processor to its log n "nearest neighbors" because they aren't so near anymore, but why should the cost be quadratic? Wayne
reiter@endor.harvard.edu (Ehud Reiter) (12/07/86)
In article <1157@ucbcad.BERKELEY.EDU> faustus@ucbcad.BERKELEY.EDU (Wayne A. Christopher) writes:
>Maybe I'm missing something, but why is the cost of interconnect in a
>hypercube O(n^2) instead of O(n log n)? I can understand that in 2D
>you can't just connect each processor to its log n "nearest neighbors"
>because they aren't so near anymore, but why should the cost be quadratic?

See Ullman's COMPUTATIONAL ASPECTS OF VLSI for a real proof. In very crude terms, the idea is to consider a box containing N processors and a hypercube network. Draw a vertical line splitting the box, with half the processors on one side and half on the other. Now, the hypercube is a "global parallel" interconnect, which allows each processor on one side of the dividing line to have a circuit to a processor on the other side of the line. Thus, we must have N/2 circuits crossing the dividing line. Since there are only a small finite number of layers in which wires can run, this means that the length of the dividing line is O(N) - i.e., the height of the box is O(N). A similar argument shows that the width of the box is O(N). Thus, the area of the box is O(N^2). This is of course just a crude sketch - again, see Ullman's book for a real proof.

	Ehud Reiter
	reiter@harvard (ARPA,UUCP)
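[The N/2 figure in the sketch above can be verified by brute force on small hypercubes. A sketch: nodes are d-bit addresses, edges join addresses differing in exactly one bit, and the "dividing line" separates addresses by their top bit.]

```python
def hypercube_edges(d):
    # d-dimensional hypercube on n = 2**d nodes: one edge per node per
    # dimension, counting each edge once (from its lower-numbered end).
    n = 1 << d
    return [(v, v ^ (1 << b)) for v in range(n)
            for b in range(d) if v < v ^ (1 << b)]

for d in range(1, 9):
    n = 1 << d
    # Edges forced across a half/half split on the top address bit:
    crossing = sum(1 for u, v in hypercube_edges(d)
                   if (u < n // 2) != (v < n // 2))
    assert crossing == n // 2  # the N/2 circuits in the argument above
    print(d, n, crossing)
```

The total edge count grows only as (n/2)*log2(n), which is why O(n log n) sounds plausible - but it is the n/2 bisection width, not the edge count, that forces the O(n^2) layout area.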