vidya@ENEEVAX.UMD.EDU (Vidya Rajagopalan) (10/27/88)
Recently somebody on this newsgroup mentioned that the iPSC/2 uses wormhole routing. What exactly is wormhole routing, and how does it differ from schemes such as cut-through?
frazier@CS.UCLA.EDU (10/31/88)
Somebody asked what wormhole routing (Chuck Seitz) was, and what makes it different from virtual cut-through (Kleinrock). Well, there is no difference. Virtual cut-through is a scheme whereby one begins to forward a packet as soon as it has been routed and the necessary output port is free, i.e. before the entire packet has arrived. This is basically an optimization of current packet-switching practice, and has been implemented on the Post Office (Stevens, Robinson, Davis), along with several other current communications chips.

Wormhole routing, proposed by Seitz and Dally on the Torus Chip, is simply virtual cut-through where there isn't room on the node for the entire packet to be stored, so that one is forced to do the cut-through. The advantage of not supporting "regular" packet switching is that one needs only minimal buffer space at each node.

Greg
--------------------------------------------------------------
Greg Frazier                    Internet: frazier@CS.UCLA.EDU
CS dept., UCLA                  UUCP: ...!{ucbvax,rutgers}!ucla-cs!frazier
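[To make the store-and-forward vs. cut-through comparison concrete, here is a minimal sketch in C -- my own illustration, not from the post; the function names and parameter values are made up -- of the usual first-order latency models, contention ignored.]

#include <stdio.h>

/* First-order latency models, ignoring contention.
 * L  = packet length (bytes), Lh = header length (bytes),
 * H  = hops traversed, B = channel bandwidth (bytes per time unit). */

double store_and_forward(double L, double H, double B) {
    /* The whole packet is received and re-sent at every hop. */
    return H * (L / B);
}

double cut_through(double L, double Lh, double H, double B) {
    /* Only the header is delayed per hop; the body pipelines behind it. */
    return H * (Lh / B) + (L - Lh) / B;
}

int main(void) {
    double L = 1024, Lh = 4, H = 7, B = 1;   /* e.g. a 7-hop path in a 128-node cube */
    printf("store-and-forward: %g\n", store_and_forward(L, H, B));
    printf("cut-through:       %g\n", cut_through(L, Lh, H, B));
    return 0;
}

[With these models the cut-through path pays the per-hop cost only on the header, so its advantage over store-and-forward grows with both packet length and hop count.]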
wen-king@cit-vax.Caltech.Edu (King Su) (11/02/88)
In article <3388@hubcap.UUCP> grunwald@m.cs.uiuc.edu writes:
>In a hypercube architecture, circuit switching messages using the standard
>``ecube'' routing method leads to skewed resource utilization. E.g., you'll
>find 80% utilization of channel #0 but only ~10% utilization of channel #6
>because of the routing algorithm.

All channels are used equally if the message pattern is random. If the k-th bit of the source and destination node IDs differ, the message must go through the k-th channel of some node somewhere. If the probability is the same for any bit in the addresses to differ, the probability that the message will go through a k-th channel somewhere in the cube is the same for any k. Since the number of k-th channels in the cube is the same for any k, the channel utilization must also be the same for any k.
--
/*------------------------------------------------------------------------*\
| Wen-King Su  wen-king@vlsi.caltech.edu  Caltech Corp of Cosmic Engineers |
\*------------------------------------------------------------------------*/
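[One quick way to check this argument -- a sketch of my own, not part of the original post -- is to count, over every ordered source/destination pair in a d-cube, how many e-cube routes cross a dimension-k channel; the tally is identical for every k.]

#include <stdio.h>

/* Count, over every ordered (src, dst) pair in a d-dimensional hypercube,
 * how many e-cube routes use a dimension-k channel.  A route crosses
 * dimension k exactly when bit k of src XOR dst is set. */
int main(void) {
    int d = 7;                  /* 128-node cube, as on the iPSC/2 */
    int n = 1 << d;
    long count[32] = {0};

    for (int src = 0; src < n; src++)
        for (int dst = 0; dst < n; dst++) {
            int mask = src ^ dst;
            for (int k = 0; k < d; k++)
                if (mask & (1 << k))
                    count[k]++;
        }

    for (int k = 0; k < d; k++)
        printf("dimension %d channels carry %ld routes\n", k, count[k]);
    return 0;
}

[For d = 7 every dimension carries 128*128/2 = 8192 routes, so under uniformly random addressing the per-dimension load is equal, as argued above.]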
joel@uunet.UU.NET (Joel Clark) (11/02/88)
Wormhole Routing on the iPSC/2

Extracts from "The iPSC/2 Direct-Connect Communications Technology" by Steven F. Nugent, as reprinted in "A Technical Summary of the iPSC/2 Concurrent Supercomputer". (This publication is available from Intel Scientific Computers Sales Literature, 15201 S.W. Greenbrier Parkway, Beaverton, OR 97006, (503) 629-7600. Steven's article and others also appear in the "Proceedings of the Third Hypercube Conference", copyright ACM 1988.)

The Direct-Connect Module (DCM) router is a hardware-controlled message-passing system. The DCM routers form a circuit-switched network that dynamically creates a synchronous path from a source to a destination node, which remains open for the duration of a message. The DCM router supports connections for eight full-duplex, bit-serial channels that connect nearest-neighbor nodes in a hypercube, and can be interconnected to form networks of up to seven dimensions containing 128 nodes. Each of the eight channels is routed independently, allowing up to eight messages to be routed simultaneously.

Routing is based on the e-cube routing algorithm (references 1, 2), which guarantees a deadlock-free network. Paths are dynamically constructed for each message prior to its transmission. A complete path is built in a step-by-step process involving arbitration for additional path segments at each router. The channels that constitute a path are held for the duration of the message. When a destination node is ready to accept a message, transmission begins. A channel is released when the tail of a message passes between the routers connected by that channel.

Direct-Connect routing is a variation of wormhole routing (references 3, 4), the primary difference being that with Direct-Connect the message is transmitted after the route has been built. This eliminates the need for flow-control buffering in the intermediate routers.

A routing operation can be broken into four phases: establishing a path, acknowledgement, message transmission, and releasing connections. To initiate the routing of a message, the source node must transfer at least one 32-bit word to its DCM. The low-order 8 bits of this word are a channel routing mask, calculated by taking the XOR of the source and destination nodes' binary addresses. The DCM serializer establishes the first segment of the path via the channel corresponding to the lowest-order bit set in the channel routing mask. In order to avoid deadlock, messages in the hypercube are routed on increasingly higher-numbered channels until the destination is reached.

The local DCM arbitrates requests for the same channel and grants them one at a time in a "round robin" fashion. When the channel is granted, the channel routing mask is forwarded via that channel to the next DCM. The new DCM then scans the channel routing mask, from the arriving channel's bit upward, for the next channel to request. When that channel is available the routing request (channel routing mask) is forwarded, and so on, until it reaches its destination node, establishing the "message path" from source to destination node.

If the destination DCM can accept a message, it acknowledges with the RDY signal, establishing a "status path" for carrying flow-control information, matching the message path in the opposite direction. The status path, like the message path, maintains its connection for the duration of the message. Once the RDY signal is received at the source node, message transmission begins.
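[A rough sketch of the channel-selection step just described -- my own illustration, not Intel's code; the routine name and the choice of starting the scan just above the arriving channel are assumptions.]

#include <stdio.h>

/* Return the channel (dimension) on which a routing request should leave
 * this router, scanning the routing mask from 'from_channel' upward.
 * Returns -1 if no bit is set at or above from_channel, i.e. this node
 * is the destination. */
int next_channel(unsigned routing_mask, int from_channel, int dims) {
    for (int k = from_channel; k < dims; k++)
        if (routing_mask & (1u << k))
            return k;
    return -1;
}

int main(void) {
    int dims = 7;               /* up to a 128-node cube */
    unsigned src = 0x05, dst = 0x62;
    unsigned mask = src ^ dst;  /* channel routing mask */
    unsigned node = src;
    int ch = next_channel(mask, 0, dims);

    printf("route %02x -> %02x:", src, dst);
    while (ch >= 0) {
        node ^= 1u << ch;       /* cross the dimension-ch channel */
        printf(" ch%d->%02x", ch, node);
        ch = next_channel(mask, ch + 1, dims);
    }
    printf("\n");
    return 0;
}

[Because each hop only ever picks a set bit at or above the one it arrived on, the path visits dimensions in strictly increasing order, which is what makes the e-cube scheme deadlock-free.]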
The source DCM can transmit data continuously into the network until the End of Message is reached or a not-ready indication is received over the status path. When RDY is again detected on the status path, message transmission resumes. When a DCM sees an End of Message, it frees the channel for other requests. There is no buffering of messages and no need for CPU cycles on the intermediate nodes. On the source and destination nodes, a DMA transfers the message in 32-bit words directly between the DCM and memory.

References:

(1) C. R. Lang, Jr., "The Extension of Object-Oriented Languages to a Homogeneous, Concurrent Architecture", Dept. of Computer Science, Calif. Inst. of Technology, Tech. Report 5014, May 1982.
(2) H. Sullivan & T. R. Bashkow, "A Large Scale Homogeneous Machine", Proc. 4th Annual Symposium on Computer Architecture, 1977, pp. 105-124.
(3) C. L. Seitz, et al., "The Hypercube Communications Chip", Dept. of Comp. Sci., Calif. Inst. of Technology, Display File 5182, March 1985.
(4) W. J. Dally, "A VLSI Architecture for Concurrent Data Structures", Ph.D. Thesis, Dept. of Comp. Sci., Calif. Inst. of Technology, Technical Report 5209, March 1986.

Joel Clark                      joel@intelisc.intel.COM
Intel Scientific Computers      {tektronix}!ogcvax!intelisc!joel
Beaverton, OR                   (503) 629-7732
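[Purely as an illustration of the flow-control loop described above, here is a toy sketch; status_path_ready(), send_word() and send_end_of_message() are hypothetical stand-ins for the DCM hardware, not real iPSC/2 interfaces.]

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Toy model of the source-side transmit loop: stream 32-bit words into the
 * open circuit, pausing whenever the status path reports not-ready.
 * The three helpers below are entirely hypothetical stubs. */
static bool status_path_ready(void) { return true; }            /* pretend RDY stays asserted */
static void send_word(unsigned long w) { printf("word %08lx\n", w); }
static void send_end_of_message(void) { printf("EOM\n"); }      /* frees the channels behind it */

static void transmit(const unsigned long *msg, size_t nwords) {
    for (size_t i = 0; i < nwords; i++) {
        while (!status_path_ready())
            ;                        /* stall until RDY reappears on the status path */
        send_word(msg[i]);
    }
    send_end_of_message();
}

int main(void) {
    /* First word carries the channel routing mask in its low 8 bits. */
    unsigned long msg[] = { 0x00000067, 0xdeadbeef, 0xcafef00d };
    transmit(msg, sizeof msg / sizeof msg[0]);
    return 0;
}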
grunwald@m.cs.uiuc.edu (11/02/88)
uh.. make that ``the wormhole defn. by *Dally* and Seitz.''
grunwald@m.cs.uiuc.edu (11/08/88)
Actually, the iPSC/2 doesn't use wormhole routing; it uses blocking ecube circuit switching.

Wormhole routing, in the definition by Athas & Seitz, consists of sending messages through the network as sequences of small flow-control units (flits). When your message acquires a link, it owns it until the last flit passes through. If you can't acquire a link, you sit there until you can. Routing and packet forwarding are all done by autonomous routing units; there is no impact on the computation processor. The Ametek 2010, Mosaic and Jellybean machines use this routing/switching method. Drawback: fixed-path routing (needed for deadlock-free communication), although this is being fixed by an adaptive variant.

Basically, the iPSC/2 uses a similar method, but it establishes an end-to-end circuit for the message, sends the message, and then tears down the circuit. Comparing this method with the former shows that it differs only in the circuit tear-down time, assuming that your message is sufficiently long that you can't hold the entire message in the network routing units. Drawbacks: again, fixed-path routing. Also, wormhole would work better for short messages or for networks that have more buffering space at each routing unit.

In a hypercube architecture, circuit switching messages using the standard ``ecube'' routing method leads to skewed resource utilization. E.g., you'll find 80% utilization of channel #0 but only ~10% utilization of channel #6 because of the routing algorithm.
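[As a back-of-the-envelope illustration -- my own model, not from the post -- of why the two schemes differ mainly in path set-up/tear-down overhead once messages are long, compare first-order latency formulas with contention ignored.]

#include <stdio.h>

/* First-order latency models, contention ignored.
 * H = hops, L = message length (bytes), B = bandwidth (bytes/time unit),
 * t_r = time to route/arbitrate one path segment. */

double wormhole_latency(double H, double L, double B, double t_r) {
    /* Header advances hop by hop; flits pipeline right behind it. */
    return H * t_r + L / B;
}

double circuit_latency(double H, double L, double B, double t_r) {
    /* Build the full circuit, wait for the acknowledgement to return,
     * then stream the message; tear-down follows the tail. */
    return 2.0 * H * t_r + L / B;
}

int main(void) {
    double H = 7, L = 4096, B = 1, t_r = 1;
    printf("wormhole : %g\n", wormhole_latency(H, L, B, t_r));
    printf("circuit  : %g\n", circuit_latency(H, L, B, t_r));
    return 0;
}

[For long messages the L/B term dominates both expressions, which is the point above; the fixed path-length-proportional difference only matters for short messages.]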
grunwald@m.cs.uiuc.edu (11/08/88)
Consider the following 4-node cube (nodes 0-3, links A-D):

    2 <- C -> 3
    ^         ^
    |         |
    D         B
    |         |
    v         v
    0 <- A -> 1

and have every node send messages to every other node. The following is the ``worst case'' analysis. Assume that it takes a single time unit to establish a connection over a single link. Consider the following messages (the same case applies to all other nodes by symmetry):

    Msgs     Route                            Holds
    ----------------------------------------------------------------------
    0 -> 1   uses link 0 at 0                 #0 for 1 unit
    0 -> 2   uses link 1                      #1 for 1
    0 -> 3   uses link 0 at 0, link 1 at 1    #0 for 2, #1 for 1

where ``holds'' means how many time steps the link is held when establishing a circuit (i.e. during routing). In general, with uniform routing, ~1/2 the messages start on link #0, ~1/4 on link #1, etc. The only other nodes using link zero are adjacent nodes (assuming we're using a left-to-right ordering for resolving bits in the ecube routing), and that's for traffic directed to node 0. However, two messages are queued at node 0 to use link 0. The utilization of link 0 is going to be higher than for the other links.

Admittedly, this is for the case where we're only routing messages, and not sending data. As you send data that takes X time units to send, the routing time is amortized and the difference in link utilization becomes meaningless.

In a worm-hole network, you have a similar problem if the route probe blocks along the way: you're holding all those resources for several units of time. If you model the probability of blocking at a link as the expected time to wait for that link acquisition, you'll find that link #0 will again be used more often than higher-numbered links. Of course, when you begin to send any reasonable amount of data, or if you restrict yourself to small networks, this isn't a problem.

However, it's silly for real hardware to use straight ecube anyway unless you need to have deadlock-free routing. You can do an exhaustive search (which, remarkably, is very effective) and get much better performance (balanced link utilization + avoidance of hot spots).
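[To make the hold-time bookkeeping concrete, here is a small tally -- my own sketch, under the post's assumptions of one time unit per link acquisition, route probes only, no data -- of how long each dimension's links are held while circuits are set up between all node pairs of a d-cube.]

#include <stdio.h>

/* Tally, over all ordered (src, dst) pairs in a d-cube, how many time
 * units the links of each dimension are held during circuit set-up.
 * Assumptions (from the post): one unit to acquire each link, lowest
 * dimension first (ecube), routing only.  A link is held from the moment
 * it is acquired until the last link of the path is acquired, i.e. for as
 * many units as there are links still left to acquire when it is grabbed. */
int main(void) {
    int d = 2;                        /* the 4-node square above; try larger d too */
    int n = 1 << d;
    long hold[32] = {0};

    for (int src = 0; src < n; src++)
        for (int dst = 0; dst < n; dst++) {
            int mask = src ^ dst;
            int remaining = 0;
            for (int k = 0; k < d; k++)        /* links this route must acquire */
                if (mask & (1 << k))
                    remaining++;
            for (int k = 0; k < d; k++)
                if (mask & (1 << k)) {
                    hold[k] += remaining;      /* held while the rest of  */
                    remaining--;               /* the path is established */
                }
        }

    for (int k = 0; k < d; k++)
        printf("dimension %d links: held %ld units in total\n", k, hold[k]);
    return 0;
}

[For d = 2 this gives 12 units on the dimension-0 links against 8 on dimension-1, which is exactly the skew described above; the gap fades once data transmission time dominates set-up time.]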