perry@hp-dcde.UUCP (perry) (02/18/85)
> /***** hp-dcde:net.arch / oliveb!jerry / 5:49 pm Feb 21, 1985*/
>
> If you're thinking that 1M processors is unreasonable then think again.
> Depending on the memory in each processor, 1 to several processors could
> be placed on a single chip.  As the only I/O required is for the
> hyper-channel connections, the number of pin-outs is minimal.  As a 1M
> array would, by definition, get volume pricing, each chip might cost
> only a dollar or so.
>
> If each processor had 16K bytes of memory, a 1M array would result in a
> computer with 16,000 Meg (16 gigabytes) of RAM.  If the entire wafer of
> silicon was used then the wasted area used for cutting the chips apart
> could be eliminated.  It would be possible to get many processors on one
> wafer with only 30 or so external connections required.
>
> Jerry Aguirre @ Olivetti ATC
> {hplabs|fortune|idi|ihnp4|tolerant|allegra|tymix}!oliveb!jerry
> /* ---------- */

Although it makes sense to put entire chipsets on the same wafer, you're
going to run up against several problems:

1) Heat dissipation.  I looked up the dissipation for Motorola's 68000 CPU
   and their 256Kx1 RAM.  The 68000 dissipates 1.5W, and the 256Kx1
   dissipates 350mW.  For the configuration you suggest, each processing
   element will require 1.5 + 2*.350 = 2.2W.  This does not include
   connecting circuitry, such as hyperchannel, DMA controllers, etc.
   Although I don't know the maximum amount of heat that a wafer can
   dissipate, it would appear that even 4 CPU/RAM combinations would be
   pushing it.

2) Yield.  As circuit complexity increases, yield decreases.  Having all
   those circuits be correct simultaneously may be a statistical
   impossibility.

3) Cost.  If (1) requires different (more expensive) cooling technology,
   and (2) makes yields even lower, the volume price would still be higher.
   CPU chips still cost big bucks all by themselves.  Adding more to the
   complexity may result in a commercially (not to mention technically)
   infeasible design.

Perry Scott, HP-FSD
...{allegra|ihnp4|decvax|ucbvax}!hplabs!hpfcla!perry-s
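To put rough numbers on the dissipation point, here is a small back-of-the-
envelope sketch.  The 2.2W-per-element figure is from the posting above; the
50W wafer budget is purely an assumption for illustration, not a real spec:

    #include <stdio.h>

    int main(void)
    {
        /* Per-element figures quoted above (68000 + two 256Kx1 RAMs). */
        const double cpu_watts  = 1.5;
        const double ram_watts  = 0.350;
        const double elem_watts = cpu_watts + 2.0 * ram_watts;   /* 2.2 W */

        /* Assumed wafer dissipation budget -- purely illustrative. */
        const double wafer_budget_watts = 50.0;

        printf("watts per processing element: %.2f\n", elem_watts);
        printf("elements per wafer at a %.0f W budget: %d\n",
               wafer_budget_watts, (int)(wafer_budget_watts / elem_watts));
        printf("dissipation for 1M elements: %.1f kW\n",
               1.0e6 * elem_watts / 1000.0);
        return 0;
    }

Whatever budget you pick, a million of these elements lands in the megawatt
range, which is where the cooling-technology objection comes from.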
jerry@oliveb.UUCP (Jerry Aguirre) (02/22/85)
Incorporating an x,y,z bus design may simplify the resulting configuration
but does place limits on the number of processors.

Also a correction: the data would have to pass through at most 2 intermediate
processors to reach its destination.  Along one axis to the plane containing
the destination, then into the line containing the destination, then to the
destination itself.  One intermediate connection would be required if the
processors were in a plane (x, y only).  The x,y,z bus would allow 3 equal
paths for any source and destination, so fault tolerance and even bus loading
could be handled.

When you begin to expand the number of processors, the limits of the x,y,z
design are obvious.  The originators of the hyper-cube are talking about VERY
large arrays of processors.  As I understand it, it takes about 6 months of
compute time on a Cray computer to create about 1 hour of visual images (like
the graphics for the movie "The Last Starfighter").  And even then the
resolution is not as good as it could be.  This is one kind of problem for
which a very large array of processors is suited.

At some point the number of processors on an x,y,z bus is going to exceed its
capacity.  The limit on the number of processors is inherent in the design
spec of the bus.  With the hyper-cube the number of connections grows with the
log-2 of the number of processors.  So 1,000 processors require 10 connections
per processor and 1,000,000 (1M) require 20.  An x,y,z design for 1M
processors would have 100 processors sharing each bus.

If you're thinking that 1M processors is unreasonable then think again.
Depending on the memory in each processor, 1 to several processors could be
placed on a single chip.  As the only I/O required is for the hyper-channel
connections, the number of pin-outs is minimal.  As a 1M array would, by
definition, get volume pricing, each chip might cost only a dollar or so.

If each processor had 16K bytes of memory, a 1M array would result in a
computer with 16,000 Meg (16 gigabytes) of RAM.  If the entire wafer of
silicon was used then the wasted area used for cutting the chips apart could
be eliminated.  It would be possible to get many processors on one wafer with
only 30 or so external connections required.

				Jerry Aguirre @ Olivetti ATC
    {hplabs|fortune|idi|ihnp4|tolerant|allegra|tymix}!oliveb!jerry
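To make the scaling claim concrete, a small sketch (an addition, not Jerry's)
that tabulates the per-processor link count of a hypercube against the bus
sharing of an x,y,z design, plus the total RAM at 16K bytes per node:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* Machine sizes discussed above: 1,000 and 1,000,000 processors. */
        const double sizes[] = { 1000.0, 1000000.0 };
        int i;

        for (i = 0; i < 2; i++) {
            double n = sizes[i];
            /* Hypercube: one point-to-point link per dimension, ~log2(N). */
            int cube_links = (int)ceil(log(n) / log(2.0));
            /* x,y,z busses: N = k*k*k processors, so k share each bus.    */
            int per_bus = (int)(pow(n, 1.0 / 3.0) + 0.5);
            /* 16K bytes of RAM per processor, as in the example above.    */
            double total_meg = n * 16.0 / 1000.0;

            printf("N = %9.0f: hypercube links/node = %2d, "
                   "processors/bus (x,y,z) = %3d, total RAM ~= %.0f Meg\n",
                   n, cube_links, per_bus, total_meg);
        }
        return 0;
    }

This reproduces the figures in the posting: 10 links and 10 per bus at 1,000
processors, 20 links and 100 per bus at 1M, and roughly 16,000 Meg of RAM.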
rpw3@redwood.UUCP (Rob Warnock) (02/23/85)
Jerry Aguirre <oliveb!jerry> writes:
+---------------
| At some point the number of processors on an x,y,z bus is going to
| exceed its capacity.  The limit on the number of processors is inherent
| in the design spec of the bus.  With the hyper-cube the number of
| connections grows with the log-2 of the number of processors.  So 1,000
| processors require 10 connections per processor and 1,000,000 (1M)
| require 20.  An x,y,z design for 1M processors would have 100 processors
| sharing each bus.
+---------------

Good points, but...

1. 100 processors per bus is NOT an outrageous number, especially if (as you
   suggest later) several processors share a silicon substrate.  Standard
   Ethernet allows 100 taps per cable segment, and 1000 nodes per logical
   cable (set by the backoff algorithm).

2. You may object to "1." on the basis of throughput: all 100 processors have
   to share the bus.  On the other hand, each processor of a 1M hyper-cube
   with point-to-point connections has to handle 20 TIMES the throughput of
   one of the links!  An x,y,z system may have to handle 3 times the
   bandwidth of a bus, peak, but hardware address filtering will lower this.
   Assuming packets addressed at random, in a 1M x,y,z bus system a processor
   has to handle "bus * 3 / 100", or 3% of one bus.  So pick whatever speed
   of point-to-point link you really CAN handle 20 of, and a similar bus
   system could simply use a bus 300 times the speed of one pt-to-pt link.

This may also sound "outrageous", but start with the processing capacity and
work outwards.  Assuming each processor can handle a megabit/sec of
communications in addition to its "real work" (a figure that is quite high
with current protocols), the pt-pt links of a 1M hyper-cube might be 50
kilobits/sec, and the busses of an x,y,z bus system might be some 33
megabits/sec each.  Neither figure is unreasonable, depending on various
engineering tradeoffs.

The bus arrangement will give lower latency, due to the queueing advantage of
single-queue/multi-server (the bus) over multi-queue/multi-server (the pt-pt
links).  The pt-pt links can be sped up to improve latency (assuming the
packet rate doesn't change), but you risk causing "data-lates" or "overruns"
if many of the 20 links happen to contain a packet at once.  In an x,y,z bus
system, the maximum peak memory load from I/O is 3 times the bus rate.
Assuming equal memory bandwidths, hyper-cube pt-pt links could only use 3/20
the bus rate (or 5 Mbit/sec in the above example), which is not enough to
overcome the queueing disadvantage.

Incidentally, while I prefer busses for the interconnections, an x,y,z hookup
of the form you mention may not be the best way to exploit the connectivity
of a bus.  Consider a hybrid form which groups processors into smaller
"hyper-points" on a bus (say 10 to 1000 of them) and then interconnects
hyper-points into "hyper-hyper-cubes" (with "fat" corners) by using one
additional bus connection on each processor of a hyper-point.  Each
hyper-point thus acts like one processor of a hyper-cube as far as
communications goes, but each processor of the hyper-point only talks to two
busses: the intra-point bus and one of the inter-point (edge) busses.
(Giving each processor a third bus connection would allow
hyper-hyper-hyper-cubes.)  The path length would be one hop longer than for a
hyper-cube (due to the intra-point bus), but far fewer I/O ports would be
required per processor, thus lessening the strain on the memory (and
processing) bandwidth.
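As a sanity check on the link-rate figures, a small sketch (an addition, not
Rob's) that works through the assumed 1 Mbit/sec-per-processor budget for
both topologies at the 1M-processor size:

    #include <stdio.h>

    int main(void)
    {
        /* Assumed per-processor communication budget (from the text above). */
        const double per_cpu_bps = 1.0e6;      /* 1 megabit/sec            */
        const double n_procs     = 1.0e6;      /* 1M-processor machine     */

        /* Hypercube: 20 point-to-point links per node share the budget.   */
        const int cube_links = 20;
        double link_bps = per_cpu_bps / cube_links;

        /* x,y,z busses: 100 processors per bus, each spreading its budget
         * over the 3 busses it sits on, so each bus carries 100/3 budgets. */
        const int per_bus        = 100;
        const int busses_per_cpu = 3;
        double bus_bps = per_cpu_bps / busses_per_cpu * per_bus;

        printf("hypercube pt-pt link rate : %6.0f kbit/sec\n", link_bps / 1e3);
        printf("x,y,z bus rate            : %6.1f Mbit/sec\n", bus_bps / 1e6);
        printf("(machine size assumed     : %.0f processors)\n", n_procs);
        return 0;
    }

This gives the 50 kbit/sec link and roughly 33 Mbit/sec bus figures quoted
above.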
Rob Warnock
Systems Architecture Consultant

UUCP:	{ihnp4,ucbvax!dual}!fortune!redwood!rpw3
DDD:	(415)572-2607
USPS:	510 Trinidad Lane, Foster City, CA  94404
cdshaw@watrose.UUCP (Chris Shaw) (02/24/85)
The whole point of point-to-point communication channels is to eliminate all
forms of bus contention that may occur between processors.  Hence Caltech's
use of a large, complicated backplane setup, and Intel's use of
point-to-point ethernet channels in its cube products.  The point-to-point
setup allows data to be passed from one processor to the next in a ring, like
a systolic loop, or through any number of other patterns that may be
available given an n-dimensional hypercube.

Ultimately, perhaps, one may want point-to-point between all processors.
This would require N*(N-1)/2 channels/ethernets/whatever, given N processors.
This would allow any processor to send data direct to any other processor in
the machine.  For 64 processors, this would require 2016 inter-processor
wires, with 63 I/O devices per processor board.  This is excessive.

The hypercube requires n * 2**(n-1) wires, where there are 2**n processors.
For 64 nodes, this amounts to 192 wires, with 6 I/O devices per board.  This
arrangement can be built fairly easily (once the bugs are ironed out), but
you pay for it with direct access to only 6 of the 64 processors.  However,
access to the remainder of the processors is a maximum of 6 communication
links away.  The average communication length is 3, with no bus contention
along the way.  The contention is really hidden in the point-to-point waiting
for receive and transmit.

The xyz bus doesn't deal with more than 8 processors very nicely, since you
still have bus contention possibilities on each of the busses, AND a wait
involved with changing from the x to y busses to get to a processor not on
the x bus (x & y by way of example).

The x.y.z (or u.v.w.x.y.z) bus schemes don't buy you anything.  You get
buckets of bus squabbling for each "plane" AND you must route data to a
different plane if the destination processor isn't connected to your plane.
Point-to-point, on the other hand, buys zero bus squabbling (i.e. full bus
bandwidth on each wire), at the price of data for "distant" processors having
to be routed through some intermediaries.

Hope this answers the "why point-to-point" questions....

Chris Shaw
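To spell out the wire-count arithmetic, a small sketch (an addition) that
compares a full mesh of point-to-point links with a hypercube for 2**n
processors:

    #include <stdio.h>

    int main(void)
    {
        int n;

        printf(" dim  procs  full-mesh wires  hypercube wires  max hops  avg hops\n");
        for (n = 2; n <= 10; n++) {
            long procs = 1L << n;                   /* 2**n processors       */
            long mesh  = procs * (procs - 1) / 2;   /* every pair wired      */
            long cube  = (long)n * (procs / 2);     /* n * 2**(n-1) links    */
            double avg = n / 2.0;                   /* mean Hamming distance */

            printf(" %3d  %5ld  %15ld  %15ld  %8d  %8.1f\n",
                   n, procs, mesh, cube, n, avg);
        }
        return 0;
    }

At n = 6 this gives the numbers above: 64 processors, 2016 wires for the full
mesh versus 192 for the hypercube, a maximum of 6 hops and an average of 3.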
zben@umd5.UUCP (02/25/85)
In article <268@oliveb.UUCP> jerry@oliveb.UUCP (Jerry Aguirre) writes:

> ... If the entire wafer of silicon was used then the wasted area used for
> cutting the chips apart could be eliminated. ...

Well, until they start growing REALLY perfect silicon crystals in zero-gee
(if in fact that is the real problem with growing perfect crystals) you would
still have to deal with yield here.  I suspect 60-75% yield on as complex a
circuit as a microprocessor would be pretty good for the current state of the
art (any flames?).

So you might have to burn a map of the good processors into the chip after
testing, like those Intel mag bubble chips...

Network routing algorithms, anyone?  ;-)
-- 
Ben Cranston  ...seismo!umcp-cs!cvl!umd5!zben  zben@umd2.ARPA
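A rough wafer-scale yield calculation (an addition, using the 60-75%
per-processor figure above and an assumed 100-site wafer) shows why a map of
good processors is needed rather than hoping for a fully working wafer:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        /* Per-processor yields quoted above, and an assumed 100-site wafer. */
        const double yields[] = { 0.60, 0.75 };
        const int    sites    = 100;
        int i;

        for (i = 0; i < 2; i++) {
            double p = yields[i];
            double all_good = pow(p, (double)sites);   /* whole wafer works  */
            double expected = p * sites;               /* usable processors  */

            printf("yield %.0f%%: P(all %d sites good) = %.2e, "
                   "expected good sites = %.0f\n",
                   p * 100.0, sites, all_good, expected);
        }
        return 0;
    }

The chance of every site working is astronomically small, but 60-75 of the
100 sites should be usable if the bad ones can be mapped out and routed
around.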
zben@umd5.UUCP (02/25/85)
In article <342@umd5.UUCP> zben@umd5.UUCP (Ben Cranston) writes:

>So you might have to burn a map of the good processors into the chip after
>testing, like those Intel mag bubble chips...

Er, um, let's be charitable and say ZBEN wasn't really thinking when he said
"Intel".  He must have been thinking of the "T.I." bubble chips...

And it's even worse than that.  A bad processor area could short one of the
power supply rails, or (in the case of the Ethernet scheme) could flood the
Ethernet with bogus messages.  The chip would probably have to have areas
where a bad processor (identified during testing) could be electrically
isolated from the power supply rails and the network connection, using a
directed-energy beam to vaporize strategic areas of metallization.
-- 
Ben Cranston  ...seismo!umcp-cs!cvl!umd5!zben  zben@umd2.ARPA
rpw3@redwood.UUCP (Rob Warnock) (02/27/85)
+---------------
| The whole point of point-to-point communication channels is to eliminate all
| forms of bus contention that may occur between processors...
| ...Point-to-point, on the other hand, buys zero bus squabbling (i.e. full bus
| bandwidth on each wire), at the price of data for "distant" processors
| having to be routed through some intermediaries.
| Hope this answers the "why point-to-point" questions ....
| Chris Shaw
+---------------

Yes, this sounds nice, but unfortunately, you have just pushed the "bus
squabbling" into each processor node!  For "N" greater than 2 or 3, "N" times
the "full bus bandwidth" is not available WITHIN each processor to SERVICE
such point-to-point channels!  (...unless each channel is terribly slow.)

Take a look sometime at the memory-bus characteristics of the "full-function
DMA" Ethernet chips, such as the Intel 82586 and the AMD/Mostek LANCE.  (To
avoid confusion, I will use "M-bus" to mean a processor's internal memory
bus, and "E-bus" to mean the external Ethernet or similar bit-serial bus.)
Because of the time the chip spends holding the M-bus, you can't run more
than about two simultaneous controllers on the same M-bus at 10 Mbits/sec.
In order to get 6 or 8 or more point-to-point channels on one M-bus, you have
to slow each channel down so much that you would be better off with the "bus
contention" on a full-speed E-bus!

Yes, I know that by supplying a whole mess of external data and address
buffers (and control logic for them), you can cut the M-bus-occupancy time of
these chips, but that only raises "N" to 3 or 4 before you run out of memory
bandwidth.  (Even the "x,y,z" configuration is going to need some help to
support just 3 controllers.)  And before I would try to interface either of
the above to a 32-bit (or wider) memory bus, I would "drop back and punt" and
use a simple serializer such as the Seeq or Fujitsu Ethernet chips and a
state machine to do the DMA.  But by the time you have built 8 fast channels
and have widened the memory enough to support them, you could afford perhaps
DOUBLE the number of the cheaper "x,y,z" E-bus processors!  Remember, with
fast source-path routing and smart DMA controllers such as the Intel or
AMD/Mostek parts, sending a packet from E-bus "x" to E-bus "y" to E-bus "z"
at 10 Mbits/sec can easily be FASTER than sending it point-to-point over a
1 Mbit/sec channel.

[Note: Slight subject change coming -- away from "point-to-point vs. bus"]

I happen to favor a hybrid approach, as I mentioned in an earlier posting,
which uses an E-bus to group a number (possibly small, say 8 perhaps? ;-} )
of processors together to make a "fat point" (I believe I used the word
"hyper-point" earlier).  Each processor needs but two (2) E-bus controllers:
one used for the internal or intra-point communication, and one used for
inter-point or "edge" connections.  (This is easily achieved at 10 Mbits/sec
with existing controller chips and 16-bit data paths.)  If each "edge" E-bus
contains but two processors, you can exactly model the hyper-cube style, but
with each "point" being 6-10 processors with 2 controllers each rather than
one processor with 6-10 controllers.  Yes, each transmission from one "edge"
E-bus to another requires an intermediate hop through the "point" E-bus, but
this is compensated by the higher speed of the "edge" links, which run at a
full 10 Mbits/sec.
But higher-degree "edges" are possible -- one can fold edges together (by
connecting edge E-busses) either to form "hyper-hyper-cubes" (whatever that
might mean) or to simply save on processors.  If every other point (using
Hamming distance to give meaning to "every other") E-bussed all of its edges
together, such "distinguished" points need have only one processor (and no
"internal" E-bus), and you end up with something I can only describe as the
"half-dual" of a hyper-cube.  Intermediate forms are of course possible, as
well as the other extreme.  I leave as an exercise the construction of an
"x,y,z bus" system from hyper-points whose processors contain only two E-bus
controllers per M-bus...

Rob Warnock
Systems Architecture Consultant

UUCP:	{ihnp4,ucbvax!dual}!fortune!redwood!rpw3
DDD:	(415)572-2607
USPS:	510 Trinidad Lane, Foster City, CA  94404
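To compare port counts, here is a small sketch of one reading of the
"hyper-point" idea (an addition, not Rob's own figures): a d-dimensional cube
of hyper-points, with d processors per point, one processor per edge
direction, versus a plain hypercube built from the same number of processors:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        int d;

        printf(" dim  processors  ports/proc (hybrid)  ports/proc (plain cube)\n");
        for (d = 6; d <= 10; d++) {
            /* Hybrid: a d-cube of hyper-points, d processors per point. */
            long procs = (long)d << d;        /* d * 2**d processors     */
            int hybrid_ports = 2;             /* intra-point + one edge  */
            /* A plain hypercube with the same processor count would need
             * roughly log2(d * 2**d) point-to-point ports per node.     */
            int plain_ports = (int)ceil(log((double)procs) / log(2.0));

            printf(" %3d  %10ld  %19d  %23d\n",
                   d, procs, hybrid_ports, plain_ports);
        }
        return 0;
    }

The hybrid stays at two controllers per processor no matter how big the
machine gets, at the cost of one extra hop through the intra-point bus.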
wall@fortune.UUCP (Jim Wall) (02/27/85)
>The x.y.z (or u.v.w.x.y.z) bus schemes don't buy you anything. You get
>buckets of bus squabbling for each "plane" AND you must route data to a
>different plane if the destination processor isn't connected to your plane.
>Point-to-point, on the other hand, buys zero bus squabbling (i.e. full bus
>bandwidth on each wire), at the price of data for "distant" processors
>having to be routed through some intermediaries.
>
>Chris Shaw

Ah, surely you jest.  Given that each node on the pseudo-cube has its own
associated memory, the vast majority of that processor's time will be spent
without touching the 'bus'.  And in most cases, the only times the bus is
used is for infrequent data movement (let's say for MMU misses) and for
interprocessor communication.

As long as there are not too many processors on any one bus or ethernet
link, the number of times where you would have to wait for the bus would be
minimal.  The trade off as to how many would be allowed is part of the
architect's job: to analyze the usage of the machine, the tasks it must do,
the performance requirements and the cost.  Outside of academia and
government, cost is much more important than unnecessary performance.

These ideas are, of course, not new.  Rob Warnock's (redwood!rpw3) message on
'fat corners' implies the same thing.  For some cases, a node may have to
communicate through a gateway node to get to the processor it wishes to talk
to.  This approach is equivalent to more processors on one plane, as far as
the delay aspects are concerned.

						-Jim Wall
						...amd!fortune!wall
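Whether the wait really is minimal depends on the offered load.  A crude
estimate (an addition; the 2% per-processor bus demand and the single-server
queueing approximation are both assumptions, not measurements):

    #include <stdio.h>

    int main(void)
    {
        /* Assumed fraction of time each processor wants the bus.  A node
         * that mostly runs out of local memory might offer only a few
         * percent of its time as bus traffic.                            */
        const double per_cpu_load = 0.02;
        int k;

        printf(" processors/bus  bus utilization  relative queueing delay\n");
        for (k = 5; k <= 55; k += 10) {
            double rho = k * per_cpu_load;
            /* Crude single-server queueing estimate: delay blows up as
             * the bus approaches saturation (rho -> 1).                  */
            double delay = (rho < 1.0) ? rho / (1.0 - rho) : -1.0;

            if (delay < 0.0)
                printf(" %14d  %15.2f  %23s\n", k, rho, "saturated");
            else
                printf(" %14d  %15.2f  %23.2f\n", k, rho, delay);
        }
        return 0;
    }

At light per-node loads a few dozen processors share a bus comfortably, but
the delay curve turns sharply upward well before the bus is nominally full,
which is the point the following replies make.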
rej@cornell.UUCP (Ralph Johnson) (03/01/85)
In article <5056@fortune.UUCP> wall@fortune.UUCP (Jim Wall) writes:

>   Ah, surely you jest.  Given that each node on the pseudo-cube has
>its own associated memory, the vast majority of that processor's time
>will be spent without touching the 'bus'.  And in most cases, the only
>times the bus is used is for infrequent data movement (let's say for
>MMU misses) and for interprocessor communication.

I assume that you haven't built any large multiprocessors.  Forgive me if I
am wrong.  I haven't either, but I have talked to people who have, and I read
a lot of papers.  Most experts agree that lack of bandwidth is death to a
multiprocessor.  Go down the list of big multiprocessors that have actually
been built (Illiac IV, Cm*, etc.) and you will find that the biggest problem
with each has been that communication was too expensive.  Most of the
problems that require parallelism need a lot of communication between
processors.

Also, asynchronous buses with multiple masters are slower than simple
point-to-point communication links.  A shared bus may work with a handful of
processors, but not the hundreds or thousands that are being discussed for
cubes.

Ralph Johnson
cdshaw@watrose.UUCP (Chris Shaw) (03/02/85)
>   Ah, surely you jest.  Given that each node on the pseudo-cube has
> its own associated memory, the vast majority of that processor's time
> will be spent without touching the 'bus'.  And in most cases, the only
> times the bus is used is for infrequent data movement (let's say for
> MMU misses) and for interprocessor communication.

No I don't jest.. and here's why.  The purpose of the cube is NOT to have a
machine which dozens of people can log on to and have their own 286-based
micro.  The motivation for the cube is to get a machine in which all of the
processors work together on the same VERY large problem.  In other words, the
cube is a PARALLEL processing machine, not just a machine with lots of
processors.  As a previous reply to your posting indicated, you can't have
too much parallelism in a machine of this ilk.

>   As long as there are not too many processors on any one bus or
>ethernet link, the number of times where you would have to wait for the
>bus would be minimal.  The trade off as to how many would be allowed
>is part of the architect's job, to analyze the usage of the machine, the
>tasks it must do, the performance requirements and the cost.

There are two things I see wrong with this suggestion:

1) It seems to imply that there would be several versions of a machine, say a
   linear algebra engine, a database engine, etc.  This sounds kind of hokey
   to me.

2) Your estimation of the communications load is, I think, far too small.
   Caltech has a cube running on which they solve the 7-body problem.  This
   problem requires that you calculate 21 pairs of interactions (I don't know
   for sure).  The structure of the solution was to have 7 body processes and
   one I/O process on a 4-node square (2 processes per node).  Each body
   process sent its position to 3 other processes, and received similar data
   from the other 3.  With info for each body, calculations were done, and
   the results passed on to the remaining processes.  (See Jan '85 CACM for
   the real description.)

The point is, the solution to this problem is highly communication based.
Many applications where matrix-bashing is needed are also likely to be
dependent on getting partial results sent to them from other processes within
the cube.  Basically, as much time could conceivably be spent on sending and
receiving data as is spent on doing actual calculations.

>-Jim Wall
>...amd!fortune!wall

Chris Shaw
University of Waterloo
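For scale, a rough sketch of the per-step counts for an all-pairs N-body
exchange (an addition, not the Caltech code; the larger body counts are
arbitrary, and the real Caltech decomposition forwards partial results rather
than shipping every position to every process directly):

    #include <stdio.h>

    int main(void)
    {
        /* The 7-body example above, plus a couple of larger sizes.
         * These counts are per time step.                               */
        const int sizes[] = { 7, 64, 1024 };
        int i;

        printf(" bodies  interacting pairs  other bodies needing each position\n");
        for (i = 0; i < 3; i++) {
            int n = sizes[i];
            int pairs = n * (n - 1) / 2;   /* every pair interacts        */
            int needs = n - 1;             /* each body's position is     */
                                           /* needed by all the others    */
            printf(" %6d  %17d  %34d\n", n, pairs, needs);
        }
        return 0;
    }

Even at 7 bodies that is 21 pair interactions per step, each fed by data that
had to cross the machine, which is why communication time rivals calculation
time.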