[net.arch] Cube designs vs. x,y,z bus

perry@hp-dcde.UUCP (perry) (02/18/85)

> /***** hp-dcde:net.arch / oliveb!jerry /  5:49 pm  Feb 21, 1985*/
> 
> If you're thinking that 1M processors is unreasonable then think again.
> Depending on the memory in each processor 1 to several processors could
> be placed on a single chip.  As the only IO required is for the
> hyper-channel connections the number of pin-outs is minimal.  As a 1M
> array would, by definition, get volume pricing each chip might cost
> only a dollar or so.
> 
> If each processor had 16K bytes of memory, a 1M array would result in a
> computer with 16,000 Meg (16 gigabytes) of ram.  If the entire wafer of
> silicon was used then the wasted area used for cutting the chips apart
> could be eliminated.  It would be possible to get many processors on one
> wafer with only 30 or so external connections required.
> 
> 				Jerry Aguirre @ Olivetti ATC
> {hplabs|fortune|idi|ihnp4|tolerant|allegra|tymix}!oliveb!jerry
> /* ---------- */

Although it makes sense to put entire chipsets on the same wafer, you're
going to run up against several problems:

1)  Heat dissipation.  I looked up the dissipation for Motorola's 68000 CPU
    and their 256Kx1 RAM.  The 68000 dissipates 1.5W, and the 256Kx1 dissipates
    350mW.  For the configuration you suggest, each processing element will
    require 1.5 + 2*.350 = 2.2W.  This does not include connecting circuitry,
    such as hyperchannel, DMA controllers, etc.  Although I don't know the
    maximum amount of heat that a wafer can dissipate, it would appear that
    even 4 CPU/RAM combinations would be pushing it.  (A rough numeric check
    of this point and the next follows after this list.)
    
2)  Yield.  As circuit complexity increases, yield decreases.  Having all those
    circuits be correct simultaneously may be a statistical impossibility.

3)  Cost.  If (1) requires different (more expensive) cooling technology, and
    (2) makes yields even lower, the volume price would still be higher.
    CPU chips still cost big bucks all by themselves.  Adding more to the
    complexity may result in a commercially (not to mention technically)
    infeasible design.
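
As a rough check on points (1) and (2), here is a minimal sketch in modern
Python (purely illustrative; the 2.2W figure comes from the estimate in point
(1), and the 70% per-element yield is only an assumed value in the range
quoted later in this thread):

    # Back-of-the-envelope: wafer heat and the odds of an all-good wafer.
    watts_per_element = 1.5 + 2 * 0.350       # 68000 plus two 256Kx1 RAMs

    per_element_yield = 0.70                  # assumption, not a measurement

    for elements_on_wafer in (4, 16, 64):
        total_watts = elements_on_wafer * watts_per_element
        all_good = per_element_yield ** elements_on_wafer
        print(f"{elements_on_wafer:3d} elements: {total_watts:5.1f} W, "
              f"P(every element works) = {all_good:.2e}")

Even at a generous 70% per element, the chance that every element on a
64-element wafer works is vanishingly small, which is why later postings in
this thread talk about mapping out the bad processors instead of demanding a
perfect wafer.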

Perry Scott, HP-FSD
...{allegra|ihnp4|decvax|ucbvax}!hplabs!hpfcla!perry-s

jerry@oliveb.UUCP (Jerry Aguirre) (02/22/85)

Incorporating an x,y,z bus design may simplify the resulting
configuration but does place limits on the number of processors.  Also
a correction:  the data would have to pass thru at most 2 intermediate
processors to reach its destination.  Along 1 axis to the plane
containing the destination, then into the line containing the
destination, then to the destination.  One intermediate connection
would be required if the processors were in a plane (x, y only).  The
x,y,z bus would allow for 3 equal paths for any source and destination
so fault tolerance and even bus loading could be handled.

When you begin to expand the number of processors the limits of the
x,y,z design are obvious.  The originators of the hyper-cube are
talking about VERY large arrays of processors.  As I understand it, it
takes about 6 months of compute time on a Cray computer to create about
1 hour of visual images (like the graphics for the movie "The Last
Starfighter").  And even then the resolution is not as good as it could
be.  This is one kind of problem for which a very large array of
processors is suited.

At some point the number of processors on an x,y,z bus is going to
exceed its capacity.  The limit on the number of processors is inherent
in the design spec of the bus.  With the hyper-cube the number of
connections grows with the log-2 of the number of processors.  So 1,000
processors require 10 connections per processor and 1,000,000 (1M)
require 20.  An x,y,z design for 1M processors would have 100 processors
sharing each bus.
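
To make the two growth rates concrete, here is a minimal sketch in modern
Python (purely illustrative; the function names are mine):

    import math

    # Hyper-cube: one link per dimension, i.e. log2 of the processor count
    # (rounded up for counts that are not an exact power of two).
    def hypercube_links_per_processor(n_processors):
        return math.ceil(math.log2(n_processors))

    # x,y,z bus: processors per bus is the cube root of the total,
    # assuming an equal number of processors along each axis.
    def xyz_processors_per_bus(n_processors):
        return round(n_processors ** (1.0 / 3.0))

    for n in (1_000, 1_000_000):
        print(n, "processors:",
              hypercube_links_per_processor(n), "hyper-cube links each,",
              xyz_processors_per_bus(n), "processors sharing each x/y/z bus")

The hyper-cube figures match the 10 and 20 connections quoted above; the
x,y,z figure is simply the cube root of the processor count.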

If you're thinking that 1M processors is unreasonable then think again.
Depending on the memory in each processor 1 to several processors could
be placed on a single chip.  As the only IO required is for the
hyper-channel connections the number of pin-outs is minimal.  As a 1M
array would, by definition, get volume pricing each chip might cost
only a dollar or so.

If each processor had 16K bytes of memory, a 1M array would result in a
computer with 16,000 Meg (16 gigabytes) of ram.  If the entire wafer of
silicon was used then the wasted area used for cutting the chips apart
could be eliminated.  It would be possible to get many processors on one
wafer with only 30 or so external connections required.

				Jerry Aguirre @ Olivetti ATC
{hplabs|fortune|idi|ihnp4|tolerant|allegra|tymix}!oliveb!jerry

rpw3@redwood.UUCP (Rob Warnock) (02/23/85)

Jerry Aguirre <oliveb!jerry> writes:
+---------------
| At some point the number of processors on an x,y,z bus is going to
| exceed its capacity.  The limit on the number of processors is inherent
| in the design spec of the bus.  With the hyper-cube the number of
| connections grows with the log-2 of the number of processors.  So 1,000
| processors require 10 connections per processor and 1,000,000 (1M)
| require 20.  An x,y,z design for 1M processors would have 100 processors
| sharing each bus.
+---------------

Good points, but...

1. 100 processors per bus is NOT an outrageous number, especially if
   (as you suggest later) several processors share a silicon substrate.
   Standard Ethernet allows 100 taps per cable segment, and 1000 nodes
   per logical cable (set by the backoff algorithm).

2. You may object to "1." on the basis of throughput: All 100 processors
   have to share the bus. On the other hand, each processor of a 1M hyper-cube
   with point-to-point connections has to handle 20 TIMES the throughput of one
   of the links!  An x,y,z system may have to handle 3 times the bandwidth of
   a bus, peak, but hardware address filtering will lower this.  Assuming
   packets are addressed at random, a processor in a 1M x,y,z bus system has
   to handle "bus * 3 / 100", or 3% of one bus.

So pick whatever speed of point-to-point link you really CAN handle 20 of,
and a similar bus system could simply use a bus several hundred times the
speed of one pt-to-pt link. This may also sound "outrageous", but start with
the processing
capacity and work outwards. Assuming each processor can handle a megabit/sec
of communications in addition to its "real work" (a figure that is quite
high with current protocols), the pt-pt links of a 1M hyper-cube might
be 50 kilobits/sec, and the busses of an x,y,z bus system might be some
33 megabits/sec each. Neither figure is unreasonable, depending on various
engineering tradeoffs.
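
A minimal sketch of that budget in modern Python (purely illustrative; the
1 Mbit/sec per-processor figure is the assumption stated above):

    # Split a fixed per-processor communications budget across the two schemes.
    comm_budget_bps = 1_000_000              # assumed 1 Mbit/sec per processor

    # 1M hyper-cube: 20 point-to-point links per processor.
    links_per_processor = 20
    pt_pt_link_bps = comm_budget_bps / links_per_processor

    # 1M x,y,z bus system: 3 buses per processor, 100 processors per bus.
    # Each processor offers about a third of its budget to each bus, and
    # 100 processors together fill that bus.
    buses_per_processor = 3
    processors_per_bus = 100
    bus_bps = (comm_budget_bps / buses_per_processor) * processors_per_bus

    print(f"pt-pt link: {pt_pt_link_bps / 1e3:.0f} kbit/sec")   # 50 kbit/sec
    print(f"each bus:   {bus_bps / 1e6:.0f} Mbit/sec")          # ~33 Mbit/sec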

The bus arrangement will give lower latency, due to the queueing advantage
of single-queue/multi-server (the bus) over multi-queue/multi-server (the
pt-pt links). The pt-pt links can be sped up to improve latency (assuming
the packet rate doesn't change), but you risk causing "data-lates" or
"overruns" if many of the 20 links happen to contain a packet at once. In
an x,y,z bus system, the maximum peak memory load from I/O is 3 times the
bus rate. Assuming equal memory bandwidths, hyper-cube pt-pt links could
only use 3/20 the bus rate (or 5 Mbit/sec in the above example), which is
not enough to overcome the queueing disadvantage.
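
The pooling effect being claimed can be illustrated with the textbook M/M/1
delay formula; this is only an idealization, not a model of either machine,
and the load figure is an arbitrary assumption:

    # Mean delay of a single-server queue with Poisson arrivals
    # and exponential service: W = 1 / (mu - lambda).
    def mm1_mean_delay(service_rate, arrival_rate):
        assert arrival_rate < service_rate, "queue must be stable"
        return 1.0 / (service_rate - arrival_rate)

    n_links = 20            # links per processor in a 1M hyper-cube
    total_service = 1.0     # total channel capacity, arbitrary units
    total_arrivals = 0.5    # assumed offered load: 50% of capacity

    # (a) One shared channel with all the capacity (the bus case).
    shared = mm1_mean_delay(total_service, total_arrivals)

    # (b) 20 dedicated channels, each with 1/20 of the capacity and,
    #     on average, 1/20 of the traffic (the point-to-point case).
    dedicated = mm1_mean_delay(total_service / n_links,
                               total_arrivals / n_links)

    print(f"shared channel:     mean delay {shared:.1f}")
    print(f"dedicated channels: mean delay {dedicated:.1f}"
          f"  ({dedicated / shared:.0f}x worse)")

With equal total capacity and equal total load, splitting the capacity into
20 dedicated slow channels multiplies the mean delay by roughly the number of
channels, which is the latency advantage being claimed for the bus.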

Incidentally, while I prefer busses for the interconnections, an x,y,z hookup of
the form you mention may not be the best way to exploit the connectivity of a
bus.  Consider a hybrid form which groups processors into smaller "hyper-points"
on a bus (say 10 to 1000 of them) and then interconnects hyper-points into
"hyper-hyper-cubes" (with "fat" corners) by using one additional bus connection
on each processor of a hyper-point. Each hyper-point thus acts like one
processor of a hyper-cube as far as communications goes, but each processor
of the hyper-point only talks to two busses: the intra-point bus and one of
the inter-point (edge) busses. (Giving each processor a third bus connection
would allow hyper-hyper-hyper-cubes.) The path length would be one hop longer
than for a hyper-cube (due to the intra-point bus), but far fewer I/O ports
would be required per processor, thus lessening the strain on the memory
(and processing) bandwidth.
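
A minimal sketch of the port-count/hop-count trade being described, in modern
Python (purely illustrative; the group size of 32 is an arbitrary choice
within the 10-to-1000 range mentioned above):

    import math

    # Pure hyper-cube: one port per dimension on every processor.
    def pure_hypercube(n_processors):
        dims = math.ceil(math.log2(n_processors))
        return {"ports_per_processor": dims, "max_hops": dims}

    # Hyper-point hybrid: processors share an intra-point bus, and each
    # processor also sits on one inter-point ("edge") bus.  Assumes the
    # group is at least as large as the number of edges per point, so
    # every edge gets its own processor.
    def hyper_point(n_processors, group_size):
        n_points = n_processors // group_size
        dims = math.ceil(math.log2(n_points))
        return {"ports_per_processor": 2, "max_hops": dims + 1}

    print("pure 1M hyper-cube:", pure_hypercube(1_000_000))
    print("hyper-points of 32:", hyper_point(1_000_000, 32))

Ports per processor drop from 20 to 2, at the cost of one extra hop through
the intra-point bus.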


Rob Warnock
Systems Architecture Consultant

UUCP:	{ihnp4,ucbvax!dual}!fortune!redwood!rpw3
DDD:	(415)572-2607
USPS:	510 Trinidad Lane, Foster City, CA  94404

cdshaw@watrose.UUCP (Chris Shaw) (02/24/85)

The whole point of point-to-point communication channels is to eliminate all
forms of bus contention that may occur between processors. Hence CALTECH's
use of a large, complicated backplane setup, and Intel's use of point-to-
point ethernet channels in its cube products.

The point-to-point setup allows data to be passed from one processor to the
next in a ring, like a systolic loop, or through any number of other patterns
that may be available given an n-dimensional hypercube. Ultimately, perhaps,
one may want point-to-point links between all processors. This would require
N*(N-1)/2 channels/ethernets/whatever, given N processors, and would allow
any processor to send data directly to any other processor in the machine.
For 64 processors, this would require 2016 inter-processor wires,
with 63 I/O devices per processor board. This is excessive.

The hypercube requires n * 2**(n-1) wires, where there are 2**n processors.
For 64 nodes, this amounts to 192 wires, with 6 I/O devices per board.
This arrangement can be built fairly easily (once the bugs are ironed out),
but you pay for it with direct access to only 6 of the 64 processors. However,
the remaining processors are at most 6 communication links away.
The average communication length is 3, with no bus contention along the way.
The contention is really hidden in the point-to-point waiting for receives and
transmits.
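
The wiring and path-length figures above are easy to reproduce; a minimal
sketch in modern Python (purely illustrative):

    import math

    n_processors = 64
    n = int(math.log2(n_processors))        # hyper-cube dimension, here 6

    # Fully connected: one channel per unordered pair of processors.
    full_wires = n_processors * (n_processors - 1) // 2     # 2016
    full_ports = n_processors - 1                           # 63 per board

    # Hyper-cube: n links per node, each link shared by two nodes.
    cube_wires = n * 2 ** (n - 1)                           # 192
    cube_ports = n                                          # 6 per board

    # Hops from node 0 to every node = Hamming weight of the node number
    # (node 0 itself counts as distance 0).
    hops = [bin(dest).count("1") for dest in range(n_processors)]

    print("fully connected:", full_wires, "wires,", full_ports, "ports/board")
    print("hyper-cube:     ", cube_wires, "wires,", cube_ports, "ports/board")
    print("max hops:", max(hops), " average hops:", sum(hops) / len(hops))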

The xyz bus doesn't deal with more than 8 processors very nicely, since you
still have bus contention possibilities on each of the busses, AND a wait
involved in changing from the x to the y bus to get to a processor not
on your x bus (x and y by way of example).

The x.y.z (or u.v.w.x.y.z) bus schemes don't buy you anything. You get
buckets of bus squabbling for each "plane" AND you must route data to a 
different plane if the destination processor isn't connected to your plane.
Point-to-point, on the other hand, buys zero bus squabbling (i.e. full bus
bandwidth on each wire), at the price of data for "distant" processors
having to be routed through some intermediaries.

Hope this answers the "why point-to-point" questions ....

Chris Shaw

zben@umd5.UUCP (02/25/85)

In article <268@oliveb.UUCP> jerry@oliveb.UUCP (Jerry Aguirre) writes:
> ...  If the entire wafer of
>silicon was used then the wasted area used for cutting the chips apart
>could be eliminated.  ...   

Well, until they start growing REALLY perfect silicon crystals in zero-gee
(if in fact that is the real problem with growing perfect crystals)
you would still have to deal with yield here.  I suspect 60-75% yield on
as complex a circuit as a microprocessor would be pretty good for the
current state of the art (any flames?).

So you might have to burn a map of the good processors into the chip after
testing, like those Intel mag bubble chips...  

Network routing algorithms, anyone?   ;-)


-- 
Ben Cranston        ...seismo!umcp-cs!cvl!umd5!zben    zben@umd2.ARPA

zben@umd5.UUCP (02/25/85)

In article <342@umd5.UUCP> zben@umd5.UUCP (Ben Cranston) writes:
>So you might have to burn a map of the good processors into the chip after
>testing, like those Intel mag bubble chips...  

Er, um, let's be charitable and say ZBEN wasn't really thinking when he
said "Intel".  He must have been thinking of the "T.I." bubble chips...

And it's even worse than that.  A bad processor area could short one of the
power supply rails, or (in the case of the Ethernet scheme) could flood the
Ethernet with bogus messages.  The chip would probably have to have areas
where a bad processor (identified during testing) could be electrically
isolated from the power supply rails and the network connection, using a
directed-energy beam to vaporize strategic areas of metallization.

-- 
Ben Cranston        ...seismo!umcp-cs!cvl!umd5!zben    zben@umd2.ARPA

rpw3@redwood.UUCP (Rob Warnock) (02/27/85)

+---------------
| The whole point of point-to-point communication channels is to eliminate all
| forms of bus contention that may occur between processors...
| ...Point-to-point, on the other hand, buys zero bus squabbling (i.e. full bus
| bandwidth on each wire), at the price of data for "distant" processors
| having to be routed through some intermediaries.
| Hope this answers the "why point-to-point" questions ....
| Chris Shaw
+---------------

Yes, this sounds nice, but unfortunately, you have just pushed the
"bus squabbling" into each processor node! For "N" greater than 2 or
3, "N" times the "full bus bandwidth" is not available WITHIN each
processor to SERVICE such point-to-point channels! (...unless each
channel is terribly slow.) Take a look sometime at the memory-bus
characteristics of the "full-function DMA" Ethernet chips, such as the
Intel 82586 and the AMD/Mostek LANCE.  (To avoid confusion, I will
use "M-bus" to mean a processor's internal memory bus, and "E-bus" to
mean the external Ethernet or similar bit-serial bus.)

Because of the time the chip spends holding the M-bus, you can't run
more than about two simultaneous controllers on the same M-bus at 10
Mbits/sec. In order to get 6 or 8 or more point-to-point channels on
one M-bus, you have to slow each channel down so much that you would
be better off with the "bus contention" on a full-speed E-bus!

Yes, I know that by supplying a whole mess of external data and address
buffers (and control logic for them), you can cut the M-bus-occupancy
time of these chips, but that only raises "N" to 3 or 4 before you run
out of memory bandwidth. (Even the "x,y,z" configuration is going to
need some help to support just 3 controllers.) And before I would try
to interface either of the above to a 32-bit (or wider) memory bus, I
would "drop back and punt" and use a simple serializer such as the Seeq
or Fujitsu Ethernet chips and a state machine to do the DMA. But by the
time you have built 8 fast channels and have widened the memory enough
to support them, you could afford perhaps DOUBLE the number of the
cheaper "x,y,z" E-bus processors!

Remember, with fast source-path routing and smart DMA controllers such
as the Intel or AMD/Mostek parts, sending a packet from E-bus "x" to
E-bus "y" to E-bus "z" at 10 Mbits/sec can easily be FASTER than sending
it point-to-point over a 1 Mbit/sec channel.
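
Counting only raw serialization time, the comparison works out as claimed; a
minimal sketch in modern Python (purely illustrative; the 1 Kbyte packet size
is an assumption, and per-hop forwarding overhead is ignored):

    packet_bits = 8 * 1024                 # assumed 1 Kbyte packet

    hops_bus, bus_bps = 3, 10_000_000      # x -> y -> z, 10 Mbit/sec buses
    hops_ptp, ptp_bps = 1, 1_000_000       # one direct 1 Mbit/sec link

    t_bus = hops_bus * packet_bits / bus_bps     # about 2.5 ms
    t_ptp = hops_ptp * packet_bits / ptp_bps     # about 8.2 ms

    print(f"3 hops over 10 Mbit/sec buses: {t_bus * 1e3:.1f} ms")
    print(f"1 hop over a 1 Mbit/sec link:  {t_ptp * 1e3:.1f} ms")

The store-and-forward delay at the two intermediate nodes would eat into that
margin, which is exactly why the speed of the DMA controllers matters.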

[Note: Slight subject change coming -- away from "point-to-point vs. bus" ]

I happen to favor a hybrid approach, as I mentioned in an earlier posting,
which uses an E-bus to group a number (possibly small, say 8 perhaps? ;-} )
of processors together to make a "fat point" (I believe I used the word
"hyper-point" earlier). Each processor needs but two (2) E-bus controllers:
one used for the internal or intra-point communication, and one used for
inter-point or "edge" connections. (This is easily achieved at 10 Mbits/sec
with existing controller chips and 16-bit data paths.)

If each "edge" E-bus contains but two processors, you can exactly model
the hyper-cube style, but with each "point" being 6-10 processors with
2 controllers each rather than one processor with 6-10 controllers.
Yes, each transmission from one "edge" E-bus to another requires an
intermediate hop through the "point" E-bus, but this is compensated by
the higher speed of the "edge" links, which run at a full 10 Mbits/sec.

But higher-degree "edges" are possible -- one can fold edges together (by
connecting edge E-busses) either to form "hyper-hyper-cubes" (whatever that
might mean) or to simply save on processors. If every other point (using
Hamming distance to give meaning to "every other") E-bussed all of its
edges together, such "distinguished" points need have only one processor
(and no "internal" E-bus), and you end up with something I can only describe
as the "half-dual" of a hyper-cube.

Intermediate forms are of course possible, as well as the other extreme.
I leave as an exercise the construction of an "x,y,z bus" system from
hyper-points whose processors contain only two E-bus controllers per M-bus...


Rob Warnock
Systems Architecture Consultant

UUCP:	{ihnp4,ucbvax!dual}!fortune!redwood!rpw3
DDD:	(415)572-2607
USPS:	510 Trinidad Lane, Foster City, CA  94404

wall@fortune.UUCP (Jim Wall) (02/27/85)

>The x.y.z (or u.v.w.x.y.z) bus schemes don't buy you anything. You get
>buckets of bus squabbling for each "plane" AND you must route data to a 
>different plane if the destination processor isn't connected to your plane.
>Point-to-point, on the other hand, buys zero bus squabbling (i.e. full bus
>bandwidth on each wire), at the price of data for "distant" processors
>having to be routed through some intermediaries.
>
>Chris Shaw

   Ah, surely you jest.  Given that each node on the pseudo-cube has
its own associated memory, the vast majority of that processor's time
will be spent without touching the 'bus'.  And in most cases, the only
times the bus is used are for infrequent data movement (let's say for
MMU misses) and for interprocessor communication.

   As long as there are not too many processors on any one bus or
ethernet link, the number of times you would have to wait for the
bus would be minimal. The trade-off as to how many would be allowed
is part of the architect's job: to analyze the usage of the machine, the
tasks it must do, the performance requirements and the cost. Outside of
academia and government, cost is much more important than unnecessary
performance.

   These ideas are, of course, not new. Rob Warnock's (redwood!rpw3)
message on 'fat corners' implies the same thing. In some cases, a node
may have to communicate through a gateway node to get to the processor
it wishes to talk to. This approach is equivalent to more processors
on one plane, as far as the delay aspects are concerned.

						-Jim Wall
					     ...amd!fortune!wall

rej@cornell.UUCP (Ralph Johnson) (03/01/85)

In article <5056@fortune.UUCP> wall@fortune.UUCP (Jim wall) writes:
>
>   Ah, surely you jest.  Given that each node on the pseudo-cube has
>its own associated memory, the vast majority of that processor's time
>will be spent without touching the 'bus'.  And in most cases, the only
>times the bus is used are for infrequent data movement (let's say for
>MMU misses) and for interprocessor communication.

I assume that you haven't built any large multiprocessors.  Forgive me
if I am wrong.  I haven't either, but I have talked to people who have,
and I read a lot of papers.  Most experts agree that lack of bandwidth
is death to a multiprocessor.  Go down the list of big multiprocessors
that have actually been built (Illiac IV, Cm*, etc.) and you will find that
the biggest problem with each has been that communication was too
expensive.  Most of the problems that require parallelism need a lot of
communication between processors.  Also, asynchronous buses with
multiple masters are slower than simple point-to-point communication
links.  A shared bus may work with a handful of processors, but not the
hundreds or thousands that are being discussed for cubes.

Ralph Johnson

cdshaw@watrose.UUCP (Chris Shaw) (03/02/85)

>   Ah, surely you jest.  Given that each node on the pseudo-cube has
> its own associated memory, the vast majority of that processor's time
> will be spent without touching the 'bus'.  And in most cases, the only
> times the bus is used are for infrequent data movement (let's say for
> MMU misses) and for interprocessor communication.

No, I don't jest, and here's why.
The purpose of the cube is NOT to have a machine which dozens of people
can log on to and have their own 286-based micro. The motivation for the
cube is to get a machine in which all of the processors work together
on the same VERY large problem. In other words, the cube is a PARALLEL
processing machine, not just a machine with lots of processors. As a 
previous reply to your posting indicated, you can't have
too much parallelism in a machine of this ilk.


>   As long as there are not too many processors on any one bus or
>ethernet link, the number of times you would have to wait for the
>bus would be minimal. The trade-off as to how many would be allowed
>is part of the architect's job: to analyze the usage of the machine, the
>tasks it must do, the performance requirements and the cost.

There are two things I see wrong with this suggestion:

1) It seems to imply that there would be several versions of a machine,
say a linear algebra engine, a database engine, etc. This sounds kind of
hokey to me.
2) I think your estimate of the communications load is far too small. Caltech
has a cube running on which they solve the 7-body problem. This problem
requires that you calculate 21 pairs of interactions (I don't know for sure).
The structure of the solution was to have 7 body processes and one I/O process
on a 4-node square (2 processes per node). Each body process sent its position
to 3 other processes, and received similar data from the other 3. With
info for each body, calculations were done, and the results passed on to the
remaining processes. (See the Jan '85 CACM for the real description.) The point
is, the solution to this problem is highly communication-based. Many
applications where matrix-bashing is needed are also likely to depend on
getting partial results sent to them from other processes within the cube.
Basically, as much time could conceivably be spent on sending and receiving
data as is spent on doing the actual calculations.
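
A minimal sketch of the communication pattern being described, in modern
Python (purely illustrative; the actual Caltech decomposition is in the
Jan '85 CACM article and this only checks the counts):

    from itertools import combinations

    n_bodies = 7

    # Every pair of bodies interacts once per step.
    pairs = list(combinations(range(n_bodies), 2))
    print("interacting pairs per step:", len(pairs))         # 21

    # In the scheme described, each body process sends its position to 3
    # of the others and receives from the other 3, so the 21 pairs are
    # each covered by exactly one message.
    messages_per_step = n_bodies * 3
    print("position messages per step:", messages_per_step)  # 21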


>-Jim Wall
>...amd!fortune!wall

Chris Shaw
University of Waterloo