bradley@think.ARPA (Bradley C. Kuszmaul) (04/26/85)
Of course the "triangular" extensions are totally connected, which makes them very interesting or very boring depending on your point of view. They are interesting in that routing is trivial, since everything is connected. The software for a totally connected computer is easy, and easy software (being an oxymoron) is interesting. They are boring in that you can't implement them for very large nets. In general, to completely connect $n$ processors you need $n(n-1)/2$ wires (i.e. $O(n^2)$) and $n-1$ connections (ports) at each processor. The wire mileage quickly gets out of hand as $n$ grows, and the layout becomes tricky as well.

In a cube containing $n=2^m$ processors (i.e. an $m$-cube), there are $m$ ports on each processor, hence $nm/2$ wires (i.e. $O(n\log n)$), which is much better than $O(n^2)$. This is still a lot of wires to lay out, but for something like the cosmic cube (64 processors?) we are only talking about 192 wires (probably expensive coaxial wires), which is not too much of a pain to actually construct in 3-space. I am much more interested in what happens as $n$ becomes much larger (e.g. $n=10^6$). The wiring problem becomes very hard again. The Connection Machine, with 64K processors in the prototype, faces an engineering challenge on this front.

One nice thing about cube topologies is that routing is easy (the relative address says which dimensions to cross, and there is a wire crossing every dimension from every node), and they are fairly redundant: there are lots of ways to get from A to B, so if some node or wire fails, the machine might be able to keep running (fault tolerance is important for a million-processor machine). Redundancy is also nice because it means that if lots of messages want to go from A to B, we may be able to send more than one at a time through the net.
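[A minimal sketch of the dimension-crossing rule just described, in Python. The function name and the low-to-high order of crossing dimensions are my own choices for illustration, not anything specified in the post.]

```python
def route(a: int, b: int) -> list[int]:
    """Greedy hypercube route from node a to node b.

    The relative address a ^ b has a 1 bit for every dimension that
    must be crossed; crossing dimension i means flipping bit i of the
    current address, which is exactly following one wire of the cube.
    """
    path = [a]
    relative = a ^ b
    for i in range(relative.bit_length()):
        if relative >> i & 1:
            path.append(path[-1] ^ (1 << i))
    return path

# In a 6-cube (64 nodes), nodes 0b000101 and 0b100100 differ in
# dimensions 0 and 5, so the route is two hops:
print(route(0b000101, 0b100100))  # [5, 4, 36]
```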
One bad thing about cube topologies (at least for cubes bigger than about 10 dimensions) is that they do not use their bandwidth very effectively. It has been claimed (with proof - references may be available on demand) that a 14-cube sending an "average" set of messages only uses about one fourth of the bandwidth because of collisions.

It has been claimed, in this discussion, that cube topologies are nice because they can simulate other topologies easily. Actually, this is not really a good argument, since most of the other seriously considered topologies are equally good at simulating each other. (Handwaving proof: Suppose topology X can be easily simulated on a cube. There must be an "easy" mapping from the nodes in X to the nodes in the cube (i.e. an enumeration, since the nodes in the cube have an "easy" mapping from themselves to their addresses). This "easy" mapping should be bijective, or we will be wasting computational resources. "Easy" bijective mappings should be easy to invert, so the cube can be simulated on X "easily". Thus if X and Y can be simulated easily on a cube, then X and Y can easily simulate each other.)

I suspect that the real reason people are building N-cubes instead of skewed rings, minimum spanning trees, or some other topology is that computer scientists have a bias towards binary representations of things, the cube topology is nice and binary, and routing is so easy. (Routing is easy because if $a$ is the absolute address of processor $X$ and $b$ is the absolute address of processor $Y$, then $a \oplus b$ is the relative address of $Y$ from $X$ (and $X$ from $Y$).)

 -Brad
-- 
bradley@think.uucp (i.e. {decvax!cca,ihnp4!mit-eddie}!think!bradley)
bradley@THINK.ARPA
brooks@lll-crg.ARPA (Eugene D. Brooks III) (04/28/85)
> out, but for something like the cosmic cube (64 processors?) we are only
> talking about 192 wires, (probably expensive coaxial wires), which is
> not too much of a pain to actually construct in 3-space.

The 64-processor machine was wired in a single backplane with wire-wrap wire, not coaxial cable. The wiring was not even twisted pairs. Of course it probably would have been a lot less flaky with twisted pairs and apropos drivers.

> I am much more interested in what happens as $n$ becomes much larger
> (e.g. $n=10^6$). The wiring problem becomes very hard again. The
> Connection Machine with 64K processors in the prototype faces an
> engineering challenge on this front.

A million processors is not a problem at all. (Well, perhaps I am stretching it a bit. I don't actually advocate a million processors. I think more along the lines of a few hundred 20 MIP/MFLOP processors as being interesting.) How to construct a very large N-cube can easily be worked out on the back of an envelope. You accomplish the feat by building and wiring the machine in 3 dimensions. 32K processors is a piece of cake (it's only 32 processors on a side), and one million processors is within the realm of possibility.

> if some node or wire fails, the machine might be able to keep running.
> (fault tolerance is important for a million processor machine)).
> Redundancy is also nice because it means that if lots of messages want
> to go from A to B, then we may be able to send more than one at a time
> through the net.

Deadlock-free routing requires that a specific path be used for each sender/receiver pair.

> One bad thing about cube topologies (at least for cubes bigger than
> about 10 dimensions) is that they do not use their bandwidth very
> effectively. It has been claimed (with proof - references may be
> available on demand) that a 14 cube sending an "average" set of messages
> only uses about one fourth of the bandwidth because of collisions.
I would like to see the reference list; please send it to me via email. By the same token, you might send me your SNAIL MAIL address if you are interested in a paper describing a solution to the bandwidth problem. Request the paper titled "Shared Memory Hypercube". The switching technique gives NlogN bandwidth, tested by simulation to N=1024, and it adaptively removes conflicts. My use of the network is to create a shared memory server for a hypercube machine with CRAY-style vector processors. With simultaneous vector access by all processors the packet conflict problem is as severe as it can be. The network provides FULL bandwidth to all processors simultaneously.

> It has been claimed, in this discussion, that cube topologies are nice
> because they can simulate other topologies easily. Actually, this is
> not really a good argument, since most of the other seriously considered
> topologies are equally good at simulating each other

The cube topology is nice because it has other useful topologies IMBEDDED within it. We are not talking about easy simulation here; efficient execution is the target. The wires are there!

> I suspect that the real reason that people are building N-cubes instead
> of skewed rings, minimum spanning trees, or some other topology, is that
> computer scientists have a bias towards binary representations of
> things, and the cube topology is nice and binary, and because routing is
> so easy.

Admittedly the routing is easy; this is one of the reasons for the cube's attractiveness. The real reason for the interest in it, however, is that it has a genuine bandwidth edge over the other topologies.
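[The point about useful topologies being imbedded in the cube can be illustrated with the classic Gray-code mapping of a ring onto a hypercube; this sketch is an editor's illustration, not something from either author's postings.]

```python
def gray(i: int) -> int:
    """Binary-reflected Gray code: consecutive values differ in one bit."""
    return i ^ (i >> 1)

def ring_in_cube(m: int) -> list[int]:
    """Map a ring of 2**m nodes onto an m-cube so that every pair of
    ring neighbors is joined by a single cube wire."""
    return [gray(i) for i in range(2 ** m)]

nodes = ring_in_cube(3)
# Each consecutive pair (including the wrap-around) differs in exactly
# one address bit, i.e. is one cube hop apart -- the ring runs at full
# speed on the cube's existing wires.
assert all(bin(a ^ b).count("1") == 1
           for a, b in zip(nodes, nodes[1:] + nodes[:1]))
print(nodes)  # [0, 1, 3, 2, 6, 7, 5, 4]
```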
bradley@think.ARPA (Bradley C. Kuszmaul) (04/30/85)
I am going to disagree a lot with brooks@lll-crg.ARPA, but don't take it personally. It sounds like Mr. Brooks has given the topic a lot of thought, and most of our differences seem to come from different assumptions. Once we start with different assumptions, it's hard to get the same conclusions. Some of my assumptions are:

 - Lots and lots and lots of small processors are better than fewer big processors.
 - Any particular connection topology will be not quite right for most applications, so we might be willing to sacrifice best-case behavior for more generality or better average-case behavior. (For example, nearest-neighbor communication in a 2-dimensional grid is very good, but not very general. A cube is well connected, but may use too many wires to achieve its bandwidth.)

In article <551@lll-crg.ARPA> brooks@lll-crg.ARPA (Eugene D. Brooks III) writes:
>> out, but for something like the cosmic cube (64 processors?) we are only
>> talking about 192 wires, (probably expensive coaxial wires), which is
>> not too much of a pain to actually construct in 3-space.
>
>The 64-processor machine was wired in a single backplane with wire-wrap wire,
>not coaxial cable. The wiring was not even twisted pairs. Of course it
>probably would have been a lot less flaky with twisted pairs and apropos
>drivers.

Sorry about that. I was under the impression from the discussion in this group that the cosmic cube's connections are implemented with ethernet controllers. My main point was that a 6-cube is easy to wire up, and it only gets easier by using smaller wires.

>> I am much more interested in what happens as $n$ becomes much larger
>> (e.g. $n=10^6$).
>
>A million processors is not a problem at all. (Well, perhaps I am stretching
>it a bit. I don't actually advocate a million processors. I think more along
>the lines of a few hundred 20 MIP/MFLOP processors as being interesting.)

I beg to differ. I think that a million processors is not nearly enough.
I believe that our differences probably stem from different approaches to mapping applications onto the parallel processors. If I were doing something like electrical simulation of a VLSI circuit, I would like to have each processor simulate a single transistor. If I am doing finite element analysis, each processor might simulate one of the elements. While it is true that by speeding up the processors, and using fewer of them, we should be able to get the same performance on some applications, I tend to think that it is possible to get more MIPS per dollar by using smaller, cheaper processing elements.

>> The wiring problem becomes very hard again. The
>> Connection Machine with 64K processors in the prototype faces an
>> engineering challenge on this front.
>
>How to construct a very large N-cube can easily be worked out on the back of
>an envelope. You accomplish the feat by building and wiring the machine in 3
>dimensions. 32K processors is a piece of cake (it's only 32 processors on a
>side) and one million processors is within the realm of possibility.

I don't see how wiring a 3d cube with 32 processors on a side gives us a 15-cube. It is important to realize that if you bisect an N-cube "through" one of its dimensions (i.e. pick some $0 \le i < N$, put all the processors with 0 in the $i$th bit of their address on one side of the cut, and all the other processors on the other side), every processor has one wire crossing the bisection. (E.g. for a 32K machine, there are 16K processors on each side, and each processor can communicate directly with one processor on the other side of the bisection, so there are 16K wires crossing the bisection.) For a million-processor machine, there are half a million wires crossing the bisection. I still claim that it is hard to lay out in three-space.

>> if some node or wire fails, the machine might be able to keep running
>> (fault tolerance is important for a million-processor machine).
>> Redundancy is also nice because it means that if lots of messages want
>> to go from A to B, then we may be able to send more than one at a time
>> through the net.
>
>Deadlock-free routing requires that a specific path be used for each
>sender/receiver pair.

We are probably thinking of routing from different points of view. I don't believe that deadlock-free routing requires a specific path for each pair. Proof: Suppose we have a cube-connected packet-switched network with enough storage at each vertex to store one message from every other vertex (usenet in the year 2010?). Suppose we further require that no PE (= vertex) is allowed to inject a message into the network unless its previously injected messages have already been delivered. (We might require that every processor which wants to send a message do so at the same time, and then wait for the network to empty out before we send any more messages.) There will be no deadlock, since even if all the messages want to go to the same place, there will be enough storage for them.

>> One bad thing about cube topologies (at least for cubes bigger than
>> about 10 dimensions) is that they do not use their bandwidth very
>> effectively. It has been claimed (with proof - references may be
>> available on demand) that a 14 cube sending an "average" set of messages
>> only uses about one fourth of the bandwidth because of collisions.
>
>I would like to see the reference list, please send it to me via email.

Ok, this may take a second...

>By the same token you might send me your SNAIL MAIL address if you are
>interested in a paper describing a solution to the bandwidth problem.
>Request the paper titled "Shared Memory Hypercube".

Please send the paper to:

	Bradley C. Kuszmaul
	Thinking Machines Corporation
	245 First Street
	Cambridge, MA 02142

>The switching technique gives NlogN bandwidth, tested by simulation to N=1024,
>and it adaptively removes conflicts.
Ah, but there are on the order of $2^N$ wires in the machine. Sounds like the wires are not being used very effectively.

>My use of the network is to create a
>shared memory server for a hypercube machine with CRAY style vector processors.
>With simultaneous vector access by all processors the packet conflict problem
>is as severe as it can be. The network provides FULL bandwidth to all
>processors simultaneously.

It sounds to me like the packet conflict problem is as nice as it could possibly be for that problem. Presumably processor number $i$ is accessing memory $f(i)$, where $f$ is one-to-one. What happens if $f$ is not one-to-one (e.g. if $f(i)=c$ for some constant $c$)? If $f$ is always the same function, I would just use local memory and flush the topology. Thus, for your application, $f$ must be a different function on different memory accesses. How many different functions $f$ are there for which you can get the full bandwidth? I doubt that you can do it for every one-to-one function.

Random thoughts and questions: Why do you have shared memory? Presumably this is so that processors can communicate; why don't you just use the cube network to communicate directly? What is the start-up time for your vectors (i.e. how big does a vector have to be before the vector processing part wins over the scalar processing part)? Typical vector processors are limited in their speed by lack of memory bandwidth; this is true even for a single processor with high-bandwidth memories (e.g. the CRAY uses a 16-way interleaved memory in order to try to increase the memory bandwidth). I don't see how putting a cube topology between the memory and the processor, and adding processors, helps this situation. Is it the case that your vector processors are an order of magnitude slower than a CRAY? (You mention 20 MFLOPS earlier in the article.)

>> It has been claimed, in this discussion, that cube topologies are nice
>> because they can simulate other topologies easily.
>> Actually, this is not really a good argument, since most of the other
>> seriously considered topologies are equally good at simulating each other.
>
>The cube topology is nice as it has other useful topologies IMBEDDED within it.
>We are not talking about easy simulation here, efficient execution is the
>target. The wires are there!

Too many wires, and not very effectively used. Suppose you spent the same amount of money on a different network which gave you twice the bandwidth for most of your applications. Would the N-cube still make sense? The nice thing about the TOTALLY connected topology is that EVERY other topology is embedded within it, but no one seriously considers wiring up a totally connected network with a million processors in it. ($10^{12}$ wires?)

>> I suspect that the real reason that people are building N-cubes instead
>> of skewed rings, minimum spanning trees, or some other topology, is that
>> computer scientists have a bias towards binary representations of
>> things, and the cube topology is nice and binary, and because routing is
>> so easy.
>
>Admittedly the routing is easy, this is one of the reasons for the cube's
>attractiveness. The real reason for the interest in it however is it has
>a genuine bandwidth edge over the other topologies.

That is not really true. One of the topologies mentioned in this discussion (somewhere in an earlier posting) is the $k$-shuffle. This network achieves better bandwidth by decreasing the distance between nodes in the graph, while using the same degree of vertices and about the same number of wires as a cube. $k$-shuffles are just harder to understand: write your address in base $k$, shift it left one digit (i.e. $\log_2 k$ bits), and add any number from $0$ to $k-1$; your original processor is connected to that processor. If there are $n$ processors in the system, then the maximum distance between two processors is $\log n / \log k$.
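[A minimal sketch of the neighbor rule just described, in Python. The function name is mine, and I am assuming the shifted address is taken mod $n$ (i.e. the digit that falls off the top is dropped), which seems implied by the description.]

```python
def shuffle_neighbors(x: int, n: int, k: int) -> list[int]:
    """Neighbors of node x in a k-shuffle on n nodes: shift the
    base-k address of x left one digit (dropping the digit that
    falls off the top, hence mod n) and append any digit 0..k-1."""
    return [(x * k + j) % n for j in range(k)]

# 256 processors, k = 16: node 3 (base-16 "03") connects to
# "30" through "3F", i.e. nodes 48 through 63.
print(shuffle_neighbors(3, 256, 16))
```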
For 64K processors in a 16-shuffle (which uses the same number of wires as a 64K 16-cube), the maximum distance between two processors is 4 jumps (compared to 16 jumps for the cube). This means that there are likely to be only a quarter as many conflicts, which increases the bandwidth. If you actually had some problem which only did nearest-neighbor communication in an n-cube, we could run your program almost as quickly on the k-shuffle topology, because at worst there would be a four-processor jump to perform your communication (instead of one jump). Very few programs actually use the n-cube directly (although there are a few, e.g. subcube induction to count the number of processors in a given state [D.P. Christman, "Programming the Connection Machine", MIT master's thesis, January 1984]), and for most programs the $k$-shuffle will be at least as good. Routing can be tricky for a $k$-shuffle. There are topologies with even better bandwidth which do have easy routing.

Happy connections.
 -Bradley
-- 
bradley@think.uucp (i.e. {decvax!cca,ihnp4!mit-eddie}!think!bradley)
bradley@THINK.ARPA
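[The 4-jump versus 16-jump comparison above can be checked by brute force; an editor's sketch, under the same mod-$n$ shift assumption as the neighbor rule described earlier. The 16-cube figure is just the number of 1 bits in the worst-case relative address.]

```python
from collections import deque

def shuffle_eccentricity(n: int, k: int) -> int:
    """Max hops from node 0 to any node, following the k-shuffle
    edges x -> (x*k + j) mod n, found by breadth-first search."""
    dist = {0: 0}
    queue = deque([0])
    while queue:
        x = queue.popleft()
        for j in range(k):
            y = (x * k + j) % n
            if y not in dist:
                dist[y] = dist[x] + 1
                queue.append(y)
    return max(dist.values())

n = 2 ** 16                              # 64K processors
print(shuffle_eccentricity(n, 16))       # 4 hops in the 16-shuffle
print(bin(0 ^ (n - 1)).count("1"))       # 16 hops across the 16-cube
```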