[comp.parallel] IPSC Communications

coop@cerc.wvu.wvnet.edu (11/18/89)

A little while ago I saw a video lecture by a designer from Intel that
predicted that the next generation of their hypercube IPSC systems is going
to use a two-dimensional grid communications scheme with a more sophisticated
router instead of their currently n-cube communication scheme. Wouldn't this
make the architecture less flexible and expandable, besides making it necessary
to restructure the instruction set ? Does anyone have any details on this ?

P.S. Since IPSC/1 was based on 80286, and the IPSC/2 on the 80386, is an
     i860-based IPSC in the future ? It could make a cray look really
     bad for a fraction of the price ...

Boris Pelakh				"Software - a spell one casts on a
pelakh@cerc.wvu.wvnet.edu		 computer to transform input into
coop@cerc.wvu.wvnet.edu			 errors."        -- Me

Disclaimer :    If my employer knew what I did with the time I get paid for,
		I would be out of a job. Let's keep this between us, OK ?

brooks@maddog.llnl.gov (Eugene Brooks) (11/20/89)

In article <7110@hubcap.clemson.edu> coop@cerc.wvu.wvnet.edu writes:
>A little while ago I saw a video lecture by a designer from Intel that
>predicted that the next generation of their hypercube IPSC systems is going
>to use a two-dimensional grid communications scheme with a more sophisticated
>router instead of their currently n-cube communication scheme. Wouldn't this
>make the architecture less flexible and expandable, besides making it necessary
>to restructure the instruction set ? Does anyone have any details on this ?
A 2-D grid is more expandable than a hypercube, in that the node only
requires 4 ports independent of machine size.  Bandwidth across a bisection
of the machine is less, but this does not in any way change the
instruction set.  If the network wormhole routes you just send a message
to the destination processor and thats it.  There are just a lot more ways
for the message to run into a conflict on the 2-D grid, but it will eventually
get to its destination.  It is interesting to note that the orginal message
passing machine at Caltech was proposed as a grid, not a hypercube.  The
Computer scientists there thought that hypercubes were prettier at the time
and forced the physicists to learn Grey codes to get their physics problem
mappings done.

With regards to the possible use of i860,  I am sure that Intel would only
use their best in a machine for serious supercomputing.

          No one will survive the attack of the Killer Micros
brooks@maddog.llnl.gov, brooks@maddog.uucp

gld@CUNIXD.CC.COLUMBIA.EDU (Gary L Dare) (11/20/89)

In article <7110@hubcap.clemson.edu> Boris Pelakh wrote:

>A little while ago I saw a video lecture by a designer from Intel that
>predicted that the next generation of their hypercube IPSC systems is
>going to use a two-dimensional grid communications scheme with a more
>sophisticated router instead of their currently n-cube communication

Gee, this sounds like they're going to try the imbedded 2-D grid a la
Connection Machine.  The CM has two modes of communication, hypercube
or grid, but they're both the same system.  Certain efficiencies can
arise from using the grid as opposed to the hypercube, and vice-versa.
-- 
~~~~~~~~~~~~~~~~~~~~~~~~ Je me souviens ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Gary L. Dare				In memory of Victor Davis:
> gld@cunixd.cc.columbia.EDU 		 Canada's Greatest Swimmer
> gld@cunixc.BITNET			 d. 89/11/13, Montreal, Quebec

pase@orville.nas.nasa.gov (Douglas M. Pase) (11/21/89)

In article <7110@hubcap.clemson.edu> coop@cerc.wvu.wvnet.edu writes:
>Wouldn't [Intel's two-dimensional grid communications scheme] make the
>architecture less flexible and expandable [than the older n-cube topology],
>besides making it necessary to restructure the instruction set?

A lot of Intel's ideas are based (at least initially) on William Dally's PhD.
thesis.  Grossly simplified, the idea is that one can trade the wire layout
complexity of an n-cube arrangement for higher bandwidth connections (more +
shorter wires) in a grid/torus.  Most important, he shows that such trades
favor the 2d arrangements.  With a simple argument it is easily shown that
a grid/torus constructed in that way has lower latency and contention than
an n-cube, *even for problems which prefer an n-cube*.  The question I still
have is whether Intel is allocating the wire space as Dally recommends, or if
they're using low wire-count connections to connect the grid PEs as they do
for the current n-cube machines.  The performance improvements DEPEND on the
higher wire-count connections.

No programming changes are necessary.  Routing has always been handled by the
system.

>P.S. is an i860-based iPSC in the future?
Well, not so much the future as the present...
Dr. Douglas M. Pase				Computer Sciences Corporation
95 Sierra Vista Ave				NAS MS 258-6
Mountain View, CA  94043			NASA Ames Research Center
(415) 940-1197					Moffett Field, CA  94035
pase@orville.nas.nasa.gov			(415) 694-6394

workman@decwrl.dec.com (Will Workman) (11/21/89)

The MasPar MP-1 implements a true grid communications scheme
based on the NASA funded "Blitzen" eight way nearest neighor
plus pipelined extensions for regular communication patterns
plus a totally independent global router using custom VLSI
based on a hierarchical three stage cross bar switch.  
Peformance of nearest neighbors is 45 times faster than the
CM-2 when passing 32 bit values, and 2.5 times faster on 
global messages of both constant offset and random patterns.
The CM-2 really suffers a performance penalty by not having
a physical grid interconnect system on applications like
image and signal processing which are heavily dependent on
fast nearest neighbor and reduction techniques.

Will Workman, Dir of Fed Mkt, MasPar Computer

parker@vienna3.tmc.edu (Bruce Parker) (11/22/89)

In article <7142@hubcap.clemson.edu> pase@orville.nas.nasa.gov (Douglas M. Pase) writes:
>
>A lot of Intel's ideas are based (at least initially) on William Dally's PhD.
>thesis.  Grossly simplified, the idea is that one can trade the wire layout
>complexity of an n-cube arrangement for higher bandwidth connections (more +
>shorter wires) in a grid/torus.  Most important, he shows that such trades
>favor the 2d arrangements.  With a simple argument it is easily shown that
>a grid/torus constructed in that way has lower latency and contention than
>an n-cube, *even for problems which prefer an n-cube*.

How is it possible for a sqrt(n) by sqrt(n) mesh with
O(sqrt(n)) diameter and bisection width to have lower
latency and contention than an n-node hypercube with O(lg n)
diameter and O(n) bisection?  Is the analysis specialized
for certain problems as opposed to examining a worst- or
even average-case?

Bruce Parker
305 Weston Hall					(201) 596-3369
Computer and Information Science Department 	parker@mars.njit.edu
New Jersey Institute of Technology

parker@vienna3.tmc.edu (Bruce Parker) (11/22/89)

In article <7143@hubcap.clemson.edu> argosy!workman@decwrl.dec.com (Will Workman) writes:
>The CM-2 really suffers a performance penalty by not having
>a physical grid interconnect system on applications like
>image and signal processing which are heavily dependent on
>fast nearest neighbor and reduction techniques.
>
>Will Workman, Dir of Fed Mkt, MasPar Computer

Those of you who are less biased towards selling a
particular architecture might consider that these guys may
not have the best mapping of mesh nodes to hypercube nodes.
There does exist a "real-time" (O(1) delay) emulation of a
mesh on a butterfly network ("Work-preserving emulations of
fixed-connection networks", by Koch, Leighton, Maggs, Rao,
and Rosenberg, 1989 STOC, pp. 227-240).  Hypercube emulation
of a butterfly is trivial (simulate each column by a single
node), although the load is now O(lg n).  I haven't read
Koch, et al's construction yet, but it may be possible to
sidestep the butterfly-hypercube emulation and implement the
emulation directly on the hypercube.  All this suggests that
mesh-style communication may be comparable on a hypercube,
albeit with a slight slowdown (perhaps a small constant
times slower).

Bruce Parker
305 Weston Hall					(201) 596-3369
Computer and Information Science Department 	parker@mars.njit.edu
New Jersey Institute of Technology

stramm%beowulf@ucsd.edu (Bernd Stramm) (11/27/89)

In article <7159@hubcap.clemson.edu> parker@vienna3.tmc.edu (Bruce Parker) writes:
>In article <7142@hubcap.clemson.edu> pase@orville.nas.nasa.gov (Douglas M. Pase) writes:
>>
>>A lot of Intel's ideas are based (at least initially) on William Dally's PhD.
>>thesis.  
>How is it possible for a sqrt(n) by sqrt(n) mesh with
>O(sqrt(n)) diameter and bisection width to have lower
>latency and contention than an n-node hypercube with O(lg n)
>diameter and O(n) bisection?  Is the analysis specialized
>for certain problems as opposed to examining a worst- or
>even average-case?

It is possible if you consider wire length, communication word size,
and bandwidth in the real world, i.e. in 2 or 3 dimensions. 
Since you have to map the hypercube to 3 dimensions, you get
many long wires. Worse, you get many wires across a limited
cross-section, so you have to make the communication word size
smaller than if you use a 2 or 3 dimensional cube (or torus)
with more elements in each dimension. He considers k-ary d-cubes (and
tori), i.e. cubes of d dimensions with k elements along each dimension.
For the number of PEs which are currently feasible, things work
out best in 2 or 3 dimensions. 

For people seriously interested in parallel architecture, the 
thesis is well worth reading. It also investigates worm-hole 
routing, and presents the design and implementation of a self-timed 
routing chip (i.e. one that doesn't need a clock).

>Bruce Parker

Bernd Stramm
UC San Diego
stramm%cs@ucsd.edu  or  stramm@cs.ucsd.edu  or bstramm@ucsd.bitnet

wen-king@csvax.caltech.edu (Wen-King Su) (11/27/89)

In article <7159@hubcap.clemson.edu> you write:
>In article <7142@hubcap.clemson.edu> pase@orville.nas.nasa.gov (Douglas M. Pase) writes:
>>
<>A lot of Intel's ideas are based (at least initially) on William Dally's PhD.
>>thesis.  Grossly simplified, the idea is that one can trade the wire layout
<>complexity of an n-cube arrangement for higher bandwidth connections (more +
>>shorter wires) in a grid/torus.  Most important, he shows that such trades
<>favor the 2d arrangements.  With a simple argument it is easily shown that
>>a grid/torus constructed in that way has lower latency and contention than
<>an n-cube, *even for problems which prefer an n-cube*.
>
<How is it possible for a sqrt(n) by sqrt(n) mesh with
>O(sqrt(n)) diameter and bisection width to have lower
<latency and contention than an n-node hypercube with O(lg n)
>diameter and O(n) bisection?

To make the comparison fair, a channel for a 2d grid has O(sqrt(n))
times more wires than a channel for an n-cube.   Latency goes down
because the tail end of a message reaches the destination O(sqrt(n))
times earlier for the grid than for the n-cube.  Throughput for random
message traffic is at least as good as n-cube's because, for that case,
the throughput is bounded by the number of wires crossing the bisection.

When Bill Dally first proposed the idea of low dimension torus, he
was met with a lot of skeptism among members of the group because we
are all used to one way of thinking about networks.  One major
difficulty at the time was the potential for message-routing deadlock in
a torus network, but he came up with a channel-sharing mechanism to avoid
deadlock.  My first connection to this issue is that I ran several
hundred simulations on Bill Dally's torus networks and on n-cube networks
under various conditions.  The results from the simulation changed the
way the rest of us feel about torus.

The idea of grids did not arise until a little later.  We missed it
because we were only looking for periodic networks (all nodes are
identical).  My second contribution was to point out that grids would
do better than torus because its channels are twice as wide as a torus's,
and because a simpler deadlock-free routing method exists for the grid.
There are other advantages, but this was the spark and the idea of torus 
was quickly dropped in favor of grids.  From that point on, the ideas
of grids began to diffuse into the industry.  Symult was first, than
comes Intel.

bjornl@tds.kth.se (Bj|rn Lisper) (11/27/89)

In article <7161@hubcap.clemson.edu> parker@vienna3.tmc.edu (Bruce Parker)
writes:
%In article <7143@hubcap.clemson.edu> argosy!workman@decwrl.dec.com (Will
Workman) writes:
%>The CM-2 really suffers a performance penalty by not having
%>a physical grid interconnect system on applications like
%>image and signal processing which are heavily dependent on
%>fast nearest neighbor and reduction techniques.

%Those of you who are less biased towards selling a
%particular architecture might consider that these guys may
%not have the best mapping of mesh nodes to hypercube nodes.
%There does exist a "real-time" (O(1) delay) emulation of a
%mesh on a butterfly network ....  Hypercube emulation
%of a butterfly is trivial (simulate each column by a single
%node), although the load is now O(lg n). .... but it may be possible to
%sidestep the butterfly-hypercube emulation and implement the
%emulation directly on the hypercube.

I was under the impression that the CM-2 uses binary-reflected Gray encoding
of the grid coordinates. That gives a direct embedding of the grid into the
hypercube topology. 

Bjorn Lisper

grunwald@ncar.UCAR.EDU (Dirk Grunwald) (11/28/89)

>>>>> On 21 Nov 89 19:19:10 GMT, parker@vienna3.tmc.edu (Bruce Parker) said:
Bruce> In article <7142@hubcap.clemson.edu> pase@orville.nas.nasa.gov (Douglas M. Pase) writes:
>
>A lot of Intel's ideas are based (at least initially) on William Dally's PhD.
>thesis.  Grossly simplified, the idea is that one can trade the wire layout
>complexity of an n-cube arrangement for higher bandwidth connections (more +
>shorter wires) in a grid/torus.  Most important, he shows that such trades
>favor the 2d arrangements.  With a simple argument it is easily shown that
>a grid/torus constructed in that way has lower latency and contention than
>an n-cube, *even for problems which prefer an n-cube*.

Bruce> How is it possible for a sqrt(n) by sqrt(n) mesh with
Bruce> O(sqrt(n)) diameter and bisection width to have lower
Bruce> latency and contention than an n-node hypercube with O(lg n)
Bruce> diameter and O(n) bisection?  Is the analysis specialized
Bruce> for certain problems as opposed to examining a worst- or
Bruce> even average-case?

---

In a single phrase, ``fat wires''.  The argument being that 8 wires
transmit the info. of 1 at 8x the speed.  Thus, the latency of the
*last byte* (not first) decreases because of increased bandwidth. The
software setup is the same for 1-wire hypercube or n-wire mesh.

The cut-through time per hop is going to be about the same - fairly small.
You'll get more hops with the torus, but not that many more, on average.
Actually, dally is now advocating a 3-D mesh, not a simple 2-D torus.
A mesh also gets you wire locality. This is important from physical
perspectives of short interconnects, lower cap's & higher bandwidths.

Of course, once you go to optic interconnects with electronic
computers, it doesn't matter & you'll probably see the hypercube win
out again. The entire argument is rooted in extant technology, not
theoretical niceness.

Dirk Grunwald -- Univ. of Colorado at Boulder	(grunwald@foobar.colorado.edu)
						(grunwald@boulder.colorado.edu)

ran@cs.utah.edu (Ran Ginosar) (11/30/89)

In article <7210@hubcap.clemson.edu> boulder!foobar!grunwald@ncar.UCAR.EDU (Dirk Grunwald) writes:

   In a single phrase, ``fat wires''.  The argument being that 8 wires
   transmit the info. of 1 at 8x the speed.  Thus, the latency of the
   *last byte* (not first) decreases because of increased bandwidth. ...

It turns out that, at least in edge-cutting technologies, this is not
exactly true. The limiting factor in driving N wires OFF CHIP is the
power-speed product. It takes N times more power to drive N wires off
the I/O pins of the node-chip than a single one, assuming you're
driving them as fast as the technology lets you. Given a constant
amount of power to disipate, it takes N times longer (more slowly!) to
drive N wires than one. While Dally's constancy factor, bisection
width, was key to the ongoing discussion, the constant power-speed
factor was overlooked.

Incidentally, the point was made (I believe) by Peter Denyer in his
1982 paper on bit-serial VLSI architectures, and by Richard Fujimoto
and Carlo Sequin (also 1982?) in their paper on X-trees and
Y-components. 
--
	Dr. Ran Ginosar			ran@cs.utah.edu
	Computer Science Department, University of Utah
	Salt Lake City, UT 84112.   Phone: 801-581-7705, fax 801-581-5843.

wen-king@csvax.caltech.edu (Wen-King Su) (12/01/89)

>From: ran@cs.utah.edu (Ran Ginosar)
<In article <7210@hubcap.clemson.edu> boulder!foobar!grunwald@ncar.UCAR.EDU (Dirk Grunwald) writes:
>
<   In a single phrase, ``fat wires''.  The argument being that 8 wires
>   transmit the info. of 1 at 8x the speed.  Thus, the latency of the
<   *last byte* (not first) decreases because of increased bandwidth. ...
>
<It turns out that, at least in edge-cutting technologies, this is not
>exactly true. The limiting factor in driving N wires OFF CHIP is the
<power-speed product. It takes N times more power to drive N wires off
>the I/O pins of the node-chip than a single one, assuming you're
<driving them as fast as the technology lets you. Given a constant
>amount of power to disipate, it takes N times longer (more slowly!) to
<drive N wires than one. While Dally's constancy factor, bisection
>width, was key to the ongoing discussion, the constant power-speed
<factor was overlooked.

The factor is not N.  If each channel of the cube is 1 bit wide, a
cube router will have a total of 2*log2(n) channel-wires (n is the
number of nodes).  A 2-D grid router, on the other hand will have
8*N wires where N = sqrt(n)/2  ---> a total of 4*sqrt(n) wires.
The wire-count ratio of 2-D grid router to cube router is therefore
2*sqrt(n)/log2(n).  The ratio becomes less than N (and keeps getting
comparatively less) when n is greater than 16.

We also have to take into account that many wires in a cube are long.
We can overcome that difficulty by driving long wires as transmission
lines (pump more than bits into a single wire before the first bit emerge
from the other end of the wire, ie wave length of the data pulses is
shorter than the wire length), but the cost and effort is probably
more than the gain.

/*------------------------------------------------------------------------*\
| Wen-King Su  wen-king@vlsi.caltech.edu  Caltech Corp of Cosmic Engineers |
\*------------------------------------------------------------------------*/