[comp.hypercube] Question on hypercube routing

prism::warner.DECNET@hubcap.UUCP (07/14/87)

The answer to David Pase's question on hypercube routing is that quite a
bit of research has been done on optimal communication algorithms.  Probably
the most important paper is "Spanning Graphs for Optimum Broadcasting and
Persnalized Communication in Hypercubes" by S. Lennart Johnsson and Chieng-
Tien Ho, Department of Computer Science Tech Report #500, Yale University,
November 1986.  The bottom line is that while there are a number of very
simple routing algorithms, optimal algorithms are quite subtle.

Wen-King@hubcap.UUCP (07/14/87)

In article <278@hubcap.UUCP> pase%oregon-grad.csnet@RELAY.CS.NET (Douglas M. Pase) writes:
>	000 -> 001
<	000 -> 010
>	000 -> 001 -> 011
	... etc
<The channel between nodes 0 and 1 is used in four different cases, between 0
>and 2 in two cases, and 0 and 4 in only one case.  This exponential piling
<of messages into the lower numbered channels (which is compounded if one
>considers similar usage of channels by other nodes as well) means that in
<heavy message traffic there will be a lot of contention for the lower numbered
>channels where the channels corresponding to the higher bits go unused.

    000 -(0)-> 001
    000 -(1)-> 010
    000 -(0)-> 001 -(1)-> 011
    000 -(2)-> 100
    000 -(0)-> 001 -(2)-> 101
    000 -(1)-> 010 -(2)-> 110
    000 -(0)-> 001 -(1)-> 011 -(2)-> 111

On the countrary, I counted that channels 0, 1, and 2 are each used
exactly 4 times, though they are not all channels coming out of node
0.  The composite usage, assuming all nodes in the cube are doing
similar things, will be the same for all of the individual channels.

+--------------------------------------------------------------------------+
| Wen-King Su  wen-king@vlsi.caltech.edu  Caltech Corp of Cosmic Engineers |
+--------------------------------------------------------------------------+

Wen-King@hubcap.UUCP (07/14/87)

In article <285@hubcap.UUCP> ravi@CS.UCLA.EDU writes:
<>The obvious solution to this problem is to randomly choose the direction
 <in which the message should be routed, i.e. randomly choose one of the
 >bits that differ instead of scanning from right to left or left to right. 
 <Another possibility is that one could start the left to right scanning for 
 >each message at a randomly chosen bit for that message. 

If you do this, you may have problems with deadlock due to lack of queue
space along the path of the message.  One reason for using the routing
algorithm described by pase in his article is that the algorithm is proven
to be deadlock free provided that the destination node will eventually
consumes all messages it receives.  This particular routing algorithm is
call the e-cube routing algorithm.
+--------------------------------------------------------------------------+
| Wen-King Su  wen-king@vlsi.caltech.edu  Caltech Corp of Cosmic Engineers |
+--------------------------------------------------------------------------+

ravi@hubcap.UUCP (07/14/87)

In article <291@hubcap.UUCP> wen-king@VLSI.CALTECH.EDU (Wen-King Su) writes:
>
>In article <285@hubcap.UUCP> ravi@CS.UCLA.EDU writes:
><>The obvious solution to this problem is to randomly choose the direction
> <in which the message should be routed, i.e. randomly choose one of the
> >bits that differ instead of scanning from right to left or left to right. 
> <Another possibility is that one could start the left to right scanning for 
> >each message at a randomly chosen bit for that message. 
>
>If you do this, you may have problems with deadlock due to lack of queue
>space along the path of the message.  One reason for using the routing
>algorithm described by pase in his article is that the algorithm is proven
>to be deadlock free provided that the destination node will eventually
>consumes all messages it receives.  This particular routing algorithm is
>call the e-cube routing algorithm.

I can see the possibility that a message will never reach its
destination when it is routed in an arbitrary direction (no longer
shortest path routing) instead of a direction towards the destination. 
A case can be made for doing this in a congested situation when buffers 
in the chosen direction are full. 

However when one chooses a direction toward the destination I 
see no reason why it should deadlock. I would appreciate if you
can give us the argument as to why it deadlocks or any references pointing
to the deadlock proof. 

ravi

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
T. M. Ravi 
UCLA Computer Science Department, 3680 Boelter Hall, UCLA
Los Angeles, CA 90024	          Phone: (213) 825-2266
ARPA : ravi@CS.UCLA.EDU
UUCP : {...sdcrdcf, ihnp4, trwspp, ucbvax}!ucla-cs!ravi
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

grunwald@M.CS.UIUC.EDU (07/16/87)

If two nodes are K hops apart, there are K bits on in their routing mask
(the bit-mask generated by the from ^ to computation).

By always looking at the L.S.B of the routing mask, you're choosing a
single fixed path for communication between each pair of nodes.

If you choose one of the on bits at random, or if you consider every 'on bit'
and send the message out the channel with the smallest message queue, you'll
have a better link utilization. From simulation data for a network switch 
which allows all links to be utilized all the time, this packet routing method
will buy you some improvement (about 10%) when queues form. 

However, the Intel & Ametek cubes allow a single (or in the case of Intel,
maybe two) concurrent transmissions.
	In this case, it doesn't matter which routing method you use. Link
utilization is not the question: it's memory traffic within a node.

I'm not certain how many channels Ncube will fire up at once.

bradley@M.CS.UIUC.EDU (07/16/87)

According to "A High Performance Operating System for the NCUBE", pp. 90 of
_Hypercube Multiprocessors 1987_", "The broadcast instruction [in the NCUBE
instruction set] allows a message to be sent on any subset of the output 
channels of a node in a single DMA action."  In other words, only a 
single message can be sent at a time, but it can be sent out over several
channels at once.

David Bradley

bradley@a.cs.uiuc.edu
Department of Computer Science
University of Illinois at Urbana-Champaign

"Picasso: Official hypercube operating system of the 1988 Olympic Games"

agn@UNH.CS.CMU.EDU (Andreas Nowatzyk) (07/17/87)

Contrary to David Bradley's interpretation of the broadcast instruction
on the Ncube-machine (BPTR and BCNT), it is possible to use each of the
22 DMA-channels (10 pairs of receiver/transmitter for communication inside
the cube and 1 pair for I/O) independently:

To start a transfer, you should check the ready-flag to see that the channel
is idle (you may use the interrupt facility to do so instead). The LPTR
instruction loads the address of the data to be transfered into the address
register of an specific channel and the LCNT instruction loads the byte-count
and initiates the transfer - no sweat. This is much easier than programming
the ethernet-chip in the iPSC.

There is however a bandwidth limitation: 22 active DMA channels would cause
about 28 Mbyte/sec peak memory traffic in addition to the CPU. They all share
the same data path to the off-chip memory. This would exceed the memory I/O
bandwidth by a factor of 3. So a node cannot have more than 8 active,
independent channels at a time.

Anyway, this is only of interest to bare-metal system programmer: The
typical user sees generalized send- and receive-calls that hide the topology,
routing issues and the queueing of unsolicited messages.

-- 
   --  Andreas Nowatzyk  (DC5ZV)

   Carnegie-Mellon University	     Arpa-net:   agn@unh.cs.cmu.edu
   Computer Science Department         Usenet:   ...!seismo!unh.cs.cmu.edu!agn

Irving@hubcap.UUCP (07/18/87)

In article <278@hubcap.UUCP> pase%oregon-grad.csnet@RELAY.CS.NET (Douglas M. Pase) writes:
>Message routing, it would seem, is a simple problem on a hypercube [...]

>			*** The Question ***
>
>Does anyone know of a routing algorithm which utilizes more evenly the channels
>available for *all* combinations of node sources and destinations?
>--
>Doug Pase   --   ...ucbvax!tektronix!ogcvax!pase  or  pase@Oregon-Grad.csnet

Well, it just so happens that my supervisor and I have a paper submitted for
publication on just this problem...

First, to apologise for the long delay; turnaround time out here is pretty
bad.  I have seen most of the other responses (they all arrived at the same
time).

Anyhow, what we have done is to compare fixed-order (the one you outlined),
random order routing (at each router, a direction is chosen at random from
the directions in which the message must move), and two "adaptive" schemes
where the dimensions chosen depend on the all the messages passing through
the node at the same time.

The locally optimal assignment of messages to outgoing wires (that is,
moving the most messages forward at once) is exactly the Bipartite Matching
problem, for which the best know solution is a very tricky O(n^2.5)
algorithm that would be almost impossible to implement effectively in
silicon.  We used this as one of our adaptive schemes; the other was a
simple "first-fit" method.

We simulated all of these under random message loads at a variety of load
levels, and compared queue lengths, total message transit time (in cube
steps) and a few other things.  Basically, the "fixed order" methods perform
dismally compared to the "adaptive" methods.

The second part of the paper investigates what happens with limited queues
at each corner of the cube.  Messages are assigned moves away from their
destination if necessary to prevent queue overflows.

What we found was that very little queue space is needed (ie, 2 or 4 total
(not per dimension) spaces for an 8-cube) to get the best throughput/delay
under heavy loads.  Under light loads it doesn't make any difference, since
messages don't have to wait in the queues much.  Also, our first-fit method
comes very close to Bipartite Matching without the complex algorithm.

The paper is available as a tech report from the University of Saskatchewan;
e-mail me (reid@sask.UUCP) your postal address if you would like a copy.

I have also done a fairly complete design of a router chip for an 8-cube
using the first-fit algorithm; it maps quite well into VLSI.  The design
can be extended to attach an arbitrary number (limited by pin count and chip
size) of PEs to each corner of the cube to make better use of the high load
capacity of the first-fit routers.

-- 
 - irving -   (reid@sask.uucp or {alberta, ihnp4, utcsri}!sask!reid)

Whose idea was this, anyway?

bradley@seri.UUCP (07/20/87)

According to "A High Performance Operating System for the NCUBE", pp. 90 of
_Hypercube Multiprocessors 1987_", "The broadcast instruction [in the NCUBE
instruction set] allows a message to be sent on any subset of the output 
channels of a node in a single DMA action."  In other words, only a single message cal be sent at a time, but it can be sent out over several
channels at once.

David @radley

bradley@a.cs.uiuc.edu
Department of Computer Science
University of Illinois at Urbana-Champaign

"Picasso: Official hypercube operating system of the 1988 Olympic Games"