[comp.parallel] Wormhole Routing

vidya@ENEEVAX.UMD.EDU (Vidya Rajagopalan) (10/27/88)

Recently somebody on this newsgroup mentioned that the IPSC/2
uses wormhole routing. What exactly is wormhole routing and how
does it differ from schemes such as cut-through ?

frazier@CS.UCLA.EDU (10/31/88)

Somebody asked what wormhole routing (Chuck Seitz) was, and what
makes it different from virtual cut through (Kleinrock).  Well,
there is no difference.  Virtual cut through is a scheme
whereby one begines to forward a packet as soon as it is routed
and the necessary output port is free, i.e. before the entire 
packet has arrived.  This is basically an optimization to
current packet switching practice, and has been implemented on
the Post Office (Stevens, Robinson, Davis), along with several
other current communications chips.  Wormhole routing, proposed
by Seitz and Dally on the Torus Chip, is simply virtual cut
through where there isn't room on the node for the entire packet
to be stored, so that one is forced to do the cut through.  The
advantage of not supporting "regular" packet switching is that
one needs only minimal buffer space at each node.

Greg
--------------------------------------------------------------

Greg Frazier	    o	Internet: frazier@CS.UCLA.EDU
CS dept., UCLA	   /\	UUCP: ...!{ucbvax,rutgers}!ucla-cs!frazier
	       ----^/----
		   /

wen-king@cit-vax.Caltech.Edu (King Su) (11/02/88)

In article <3388@hubcap.UUCP> grunwald@m.cs.uiuc.edu writes:

>In a hypercube architecture, circuit switching messages using the standard
<``ecube'' routing method leads to skewed resource utilization. E.g., you'll
>find 80% utilization of channel #0 but only ~10% utilization of channel #6
<because of the routing algorithm.

All channels are used equally if message pattern is random.  If the k-th bit
of the source and destination node ID differs, the message must go through
the k-th channel of some node somewhere.  If probability is the same for
any bit in the addresses to differ, the probability that the message will
go through a k-th channel somewhere in the cube, is the same for any k.
Since the number of k-th channels in the cube is the same for any k, the
channel ultilization must also be the same for any k.
-- 
/*------------------------------------------------------------------------*\
| Wen-King Su  wen-king@vlsi.caltech.edu  Caltech Corp of Cosmic Engineers |
\*------------------------------------------------------------------------*/

joel@uunet.UU.NET (Joel Clark) (11/02/88)

			Wormhole Routing on the iPSC/2

				extracts from

	     "The iPSC/2 Direct-Connect Communications Technology"
			      by Steven F. Nugent

				as reprinted in

  	    "A Technical summary of the iPSC/2 Concurrent Supercomputer"
	

( This publication is available from Intel Scientific Computers Sales
  literature   15201 S.W. Greenbrier Parkway,  Beaverton, Or   97006
  (503) 629-7600.  Steven's article and others also appear in the 
"Proceedings of the Third Hypercube Conference", copyright ACM 1988)


The Direct-Connect Module router is a hardware controlled message passing
system.  The DCM routers form a circuit-switch network that dynamically 
create a synchronous path, from a source to a destination node, which
remains open for the duration of a message.  The DCM router supports 
connections for eight full duplex, bit-serial channels that connect nearest 
neighbor nodes in a hypercube, and can be interconnected to form networks of
up to seven dimensions containing 128 nodes.  Each of the eight channels is 
routed independently allowing up to eight messages to be routed simultaneously.

Routing is based on the e-cube routing algorithym (references 1,2) which
guarantees a deadlock free network.  Paths are dynamically constructed for 
each message prior to its transmission. A complete path is built in a 
step-by-step process involving arbitrations for additional path segments 
at each router.  The channels that constitute a path are held for the 
duration of the message.  When a destination node is ready to accept a 
message, transmission begins.  A channel is released when the tail of a 
message passes between the routers connected by that channel.

Direct-Connect routing is a variation of Wormhole routing (references 3,4)
with the primary difference being that with the Direct-Connect the message 
is transmitted after the route has been built.  This eliminates the need for 
flow-control buffering in the intermediate routers.

A routing operation can be broken into four phases: establishing a path, 
acknowledgement, message transmission and releasing connections.  To initiate
the routing of a message, the source node must transfer at least one 32 bit
word to its DCM.  The low order 8 bits of this word are a channel routing mask 
calculated by taking the XOR of the source and destination node's binary 
addresses.  

The DCM serializer establishes the first segment of the path via the channel
corresponding the the lowest order bit set in the channel routing mask.
In order to avoid deadlock, messages in the hypercube are routed in increasingly
higher numbered channels until the destination is reached.  The local DCM 
arbitrates requests for the same channel and grants them one at a time in a 
"round robin" fashion.

When the channel is granted, the channel routing mask is forwarded via that
channel, to the next DCM.  The new DCM then scans the channel routing mask from
the arriving channel's bit upward, for the next channel to request.  When 
that channel is available the routing request (channel routing mask) is 
forwarded, and so on, until it reaches its destination node, establishing
the "message path" from source to destination node.. 

If the destination DCM can accept a message it will acknowledge with the
RDY signal establishing a "status path", for carrying flow-control information,
matching, in the opposite direction, the message path.  The status path, like
the message path, maintains its connection for the duration of the message.
Once the RDY signal is received at the source node, message transmission
begins.  The source DCM can transmit data continously into the network
until the End of Message is reached or a not ready indication is received 
over the status path.  When the RDY is again detected on the status path 
message transmission resumes.  When a DCM sees an End of Message it frees 
the channel for other requests.   There is no buffering of messages or need 
of CPU cycles on the intermediate nodes.  On source and destination nodes,
a DMA transfers the message in 32 bit words directly from the DCM to memory.


References:

(1)	C. R. Lang jr.,  "The Extension of Object-Oriented Languages to a
	Homogeneous, Concurrent Architecture", Dept. of Computer Science,
	Calif. Inst. of Technology, Tech. Reprot 5014, May 1982

(2)	H. Sullivan  & T. R. Bashkow,  "A Large Scale Homogeneous Machine",
	Proc. 4th Annual Symposium on Computer Architecture, 1977, pp 105-124. 

(3) 	C. L. Seitz, et al.,  "The Hybercube Communications Chip", Dept of 
	Comp. Sci., Calif. Inst. of Technology, Display File 5182 March 1985

(4)	W. J. Dally, "A VLSI Architectur for Concurrent Data Structures", 
	Ph.D. Thesis, Dept. of Comp. Sci., Calif. Inst. of Technology, 
	Technical Report 5209, March 1986



Joel Clark				joel@intelisc.intel.COM
Intel Scientific Computers		{tektronix}!ogcvax!intelisc!joel
Beaverton, OR				(503) 629-7732

grunwald@m.cs.uiuc.edu (11/02/88)

uh..make that be ``the wormhole defn. by *Dally* and Seitz.''

grunwald@m.cs.uiuc.edu (11/08/88)

Actually, the iPSC/2 doesn't use worm-hole routing, it uses blocking ecube
circuit switching.

Wormhole routing, in the defn. by Athas & Seitz, consists of sending small
packets (flits) through a network. When your message acquires a link, it
owns it until the last flit passes through. If you can't acquire a link,
you sit there until you can.

Routing and packet forwarding is all done by autonomous routing units. There
is no impact on the computation processor. The Ametek 2010, Mosaic and
Jellybean machine use this routing/switching method.

Drawbacks are: fixed path routing (needed for deadlock free communication),
	although this is being fixed by an adaptive variant.

Basically, the iPSC/2 use a similar method, but it establishs an end-to-end
circuit for the the message, sends the message and then tears down the circuit.
Analysis of this method with the former would show you that it differs
only in the circuit tear-down time, assuming that your message is sufficiently
long that you can't hold the entire message in the network routing units.

Drawbacks: again, fixed path routing. Also, wormhole would work better for
	   short messages or networks that have more buffering space at
	   each routing unit.

In a hypercube architecture, circuit switching messages using the standard
``ecube'' routing method leads to skewed resource utilization. E.g., you'll
find 80% utilization of channel #0 but only ~10% utilization of channel #6
because of the routing algorithm.

grunwald@m.cs.uiuc.edu (11/08/88)

Consider the following:

	2 <- C -> 3
	^	  ^
	|	  |
	D	  B
	|	  |
	v	  v
	0 <- A -> 1

have every node send messags to every other node. The following is the
``worse case'' analysis:

Assume that it takes a single time unit to establish connection over a
single link.

Consider the following messages (same case applies to all other nodes by
symmetry)
	Msgs	   Route				Holds
----------------------------------------------------------------------	
	0 -> 1	uses link 0 at 0			#0 for 1 unit
	0 -> 2  uses link 1				#1 for 1
	0 -> 3  uses link 0 at 0, link 1 at 1		#0 for 2, #1 for 1

where ``holds'' means how many time steps the link is held when establishing
a circuit (i.e. during routing).

in general, with uniform routing, ~1/2 the msgs start on link #0,
~1/4 on link #1, etc.

the only other nodes using link zero are adjacent nodes (assuming we're
using a left-to-right ordering for resolving bits in the ecube routing),
and that's for traffic directed to node 0.

However, two messages are queued at node 0 to use link 0. The utilization
of link 0 is going to be higher than for the other links. Admittedly, this
is for the case where we're only routing messages, and not sending data.
As you send data that takes X time units to send, the routing time is
minimized and the difference in link utilization becomes meaningless.

In a worm-hole network, you have a similar problem if the route probe
blocks along the way. You're holding all those resources for several
units of time. If you model the probability of blocking at a link as
the expected time to wait for that link acquisition, you'll find that
link #0 will again be used more often than higher numbered links.

Of course, when you begin to send any reasonable amount of data, or if
you restrict yourself to small networks, this isn't a problem.
However, it's silly for real hardware to use straight Ecube anyway
unless you need to have deadlock free routing. You can do an
exhaustive search (which, remarkably, is very effective) and get much
better performance (balanced link utilization + avoid hot spots).