wangjw@cs.purdue.edu (Jingwen Wang) (04/23/91)
The iPSC/860 hypercube is noted for its very fast processor and circuit-switched
communications. The processor is around 9-10 times faster than the nCUBE/2, as we
recently measured. However, the communication speed has not improved at a comparable
rate. This disappoints people who move their iPSC/2 or nCUBE code to this machine,
because the speedup drops dramatically.

We were at first pleased to see that broadcast and multicast, as well as many other
global communication calls, are provided on the iPSC/860. But after experiments, we
found that most of the global communication calls are not as efficient as we had
expected.

The broadcast function is invoked through the csend or isend call with the destination
parameter specified as -1. We compared the speed of this call with the simplest
method -- a loop sending a separate message to every other node. The average elapsed
time from sending to receiving for the broadcast call is around 1/2 that of the looping
method for a message length of 500 bytes. This is quite good, since it really speeds up
the communication. It is also possible to broadcast to a subcube of nodes.

The multicast function of the i860 machine has exactly the same speed as the simplest
looping method (send a separate message to each destination). The only benefit is
notational convenience (and you have to prepare a destination list, which is at least
as complicated as a loop that sends the messages separately). This is difficult to
improve because the destinations are allowed to be arbitrary rather than regular, as
opposed to broadcast.

There are also global collection routines that gather a contribution of data from each
node; after the operation every node holds a copy of the full collection. Such an
operation gives a substantial reduction in communication time on store-and-forward
networks (it takes about the same time as a single broadcast). But on the iPSC/860 it
is even slightly slower than having each node send a broadcast message, which amounts
to an all-to-all exchange.

It seems that the only benefit of circuit-switched networks for global communication is
broadcast, which saves time by sending messages simultaneously over several channels.
Of course, for point-to-point communication they are certainly better than
store-and-forward message passing.

The above comments are only some negative points about circuit-switching networks.
Some experts are clearly over-optimistic about the performance of such networks. They
need to be improved, too.

Jingwen Wang
Dept. of Computer Science, Purdue University
wangjw@cs.purdue.edu
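For reference, Wang's comparison can be set up with a short NX program. The sketch
below is only an illustration, assuming the standard iPSC calls csend(), crecv(),
mynode(), numnodes(), and dclock(); the message type and the 500-byte length are
arbitrary, and it times only the sender side (the send-to-receive figures quoted above
also require a synchronized clock on the receiver).

    #include <stdio.h>

    #define MSGTYPE 10
    #define MSGLEN  500                      /* message length used above */

    extern double dclock();                  /* normally declared by the iPSC headers */
    extern long   mynode(), numnodes();

    static char buf[MSGLEN];

    static void bcast_builtin(void)          /* built-in broadcast: destination = -1 */
    {
        csend(MSGTYPE, buf, (long)MSGLEN, -1L, 0L);
    }

    static void bcast_loop(void)             /* naive loop: one send per node */
    {
        long me = mynode(), n = numnodes(), i;
        for (i = 0; i < n; i++)
            if (i != me)
                csend(MSGTYPE, buf, (long)MSGLEN, i, 0L);
    }

    int main(void)
    {
        double t0;
        if (mynode() == 0) {
            t0 = dclock();
            bcast_builtin();                 /* swap in bcast_loop() to compare */
            printf("sender elapsed: %f s\n", dclock() - t0);
        } else {
            crecv(MSGTYPE, buf, (long)MSGLEN);
        }
        return 0;
    }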
tve@sprite.berkeley.edu (Thorsten von Eicken) (04/23/91)
In article <1991Apr23.123808.10313@hubcap.clemson.edu> wangjw@cs.purdue.edu () writes:
> The iPSC/860 hypercube is noted for its very fast processor and
> circuit-switched communications. The processor is around 9-10 times
> faster than the nCUBE/2, as we recently measured. However, the
> communication speed has not improved at a comparable rate. This
> disappoints people who move their iPSC/2 or nCUBE code to this machine,
> because the speedup drops dramatically.

That's what I've been suspecting for a while. I would actually claim that the iPSC/860
communication is worse than that of the nCUBE/2. If you do it right, it takes 23
instructions to send and 26 to receive a message on the nCUBE. This includes both the
user and kernel code, and it turns into about 10us for a send and about 15us for a
receive. As you probably know, the standard OS takes 8x longer...

What's the peak bandwidth for the iPSC/860? The nCUBE has 2.2 Mb/s/link, and unless you
send large (>512 byte) messages you can keep between 1 and 2 channels busy (large
messages can keep more busy).

Thorsten von Eicken
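One way to put a number on that question is a one-directional bandwidth test between
two nodes. The following is only a sketch, assuming the iPSC NX calls csend(), crecv(),
mynode(), and dclock(); the 64 KB message size and repetition count are arbitrary, and
the trailing acknowledgement keeps the sender from stopping the clock early.

    #include <stdio.h>

    #define TYPE 30
    #define LEN  65536L                        /* 64 KB per message */
    #define REPS 100

    extern double dclock();                    /* normally declared by the iPSC headers */
    extern long   mynode();

    static char buf[LEN];

    int main(void)
    {
        double t0;
        long i;

        if (mynode() == 0) {                   /* sender */
            t0 = dclock();
            for (i = 0; i < REPS; i++)
                csend(TYPE, buf, LEN, 1L, 0L);
            crecv(TYPE, buf, 1L);              /* wait for receiver's ack */
            printf("%.2f Mbytes/s\n",
                   (double)LEN * REPS / (dclock() - t0) / 1e6);
        } else if (mynode() == 1) {            /* receiver */
            for (i = 0; i < REPS; i++)
                crecv(TYPE, buf, LEN);
            csend(TYPE, buf, 1L, 0L, 0L);      /* ack back to node 0 */
        }
        return 0;
    }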
jet@karazm.math.uh.edu ("J. Eric Townsend") (04/24/91)
In article <1991Apr23.123808.10313@hubcap.clemson.edu> wangjw@cs.purdue.edu () writes:
> every other node. The average elapsed time from sending to receiving for
> the broadcast call is around 1/2 that of the looping method for a message
> length of 500 bytes. This is quite good, since it really speeds up the
> communication.

The iPSC message-passing software has a magic size of 100 bytes. Anything <= 100 bytes
takes only one message send. Anything > 100 bytes takes *three* message sends, to
ensure that there is enough space in the receiving buffer. I would suggest that you do
timings of both messages <= 100 bytes and messages > 100 bytes.

Also, look closely at using FORCETYPE in your message type. If you *know* you have room
to receive a message, the sender can use FORCETYPE and the "is there room?" message
exchange does not happen.

--
J. Eric Townsend - jet@uh.edu - bitnet: jet@UHOU - vox: (713) 749-2120
Skate UNIX or bleed, boyo... (UNIX is a trademark of Unix Systems Laboratories).
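The suggested experiment is easy to set up as a round-trip timing at a few lengths
straddling the 100-byte threshold. A rough sketch, assuming the NX calls csend(),
crecv(), mynode(), and dclock(); the lengths and repetition count are illustrative, and
FORCETYPE itself is left out here (see the iPSC manuals for the force-type message
range).

    #include <stdio.h>

    #define TYPE 20
    #define REPS 500

    extern double dclock();                    /* normally declared by the iPSC headers */
    extern long   mynode();

    static char buf[1024];
    static long lens[] = { 64, 100, 101, 512, 1024 };

    int main(void)
    {
        double t0;
        long i, l, len;

        for (l = 0; l < 5; l++) {
            len = lens[l];
            if (mynode() == 0) {               /* timing node */
                t0 = dclock();
                for (i = 0; i < REPS; i++) {
                    csend(TYPE, buf, len, 1L, 0L);
                    crecv(TYPE, buf, len);
                }
                printf("%4ld bytes: %.1f us round trip\n",
                       len, (dclock() - t0) / REPS * 1e6);
            } else if (mynode() == 1) {        /* echo node */
                for (i = 0; i < REPS; i++) {
                    crecv(TYPE, buf, len);
                    csend(TYPE, buf, len, 0L, 0L);
                }
            }
        }
        return 0;
    }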
jet@karazm.math.uh.edu ("J. Eric Townsend") (04/25/91)
In article <1991Apr23.165213.10093@agate.berkeley.edu> tve@sprite.berkeley.edu (Thorsten von Eicken) writes:
> peak bandwidth for the ipsc/860? The ncube has 2.2 Mb/s/link and

2.8 Mb/s.

--
J. Eric Townsend - jet@uh.edu - bitnet: jet@UHOU - vox: (713) 749-2120
Skate UNIX or bleed, boyo... (UNIX is a trademark of Unix Systems Laboratories).
grunwald@foobar.colorado.edu (Dirk Grunwald) (05/02/91)
>>>>> On 22 Apr 91 20:14:05 GMT, wangjw@cs.purdue.edu (Jingwen Wang) said:
...
JW> The broadcast function is invoked through the csend or isend call with
JW> the destination parameter specified as -1. We compared the speed of this
JW> call with the simplest method -- a loop sending a separate message to
JW> every other node. The average elapsed time from sending to receiving for
JW> the broadcast call is around 1/2 that of the looping method for a message
JW> length of 500 bytes. This is quite good, since it really speeds up the
JW> communication. It is also possible to broadcast to a subcube of nodes.

The 'simplest method' is not to send to every node in the system; it's to use a
broadcast tree, making broadcast an O(log N) process, not O(N). That's what the Intel
O/S does. Your factor of 1/2 over the looping method would then hold only for N=4. For
N=8 a tree is about 2.6 times faster, for N=16 about 4 times faster, etc. From your
value of '2' I assume you used a 4-node system, or possibly 8 nodes with some
imprecision in your results.

And of course, circuit-switched networks have *no* benefit for broadcast trees, because
broadcast trees use single-hop communication. The only way to improve broadcast is to
add a communication processor to offload the store-and-replicate work.

JW> The multicast function of the i860 machine has exactly the same speed
JW> as the simplest looping method (send a separate message to each
JW> destination). The only benefit is notational convenience (and you have
JW> to prepare a destination list, which is at least as complicated as a
JW> loop that sends the messages separately). This is difficult to improve
JW> because the destinations are allowed to be arbitrary rather than
JW> regular, as opposed to broadcast.

Again, if you're sending to more than log N nodes, you're better off (modulo the fact
that you interrupt everyone) simply broadcasting to the entire network and having
uninterested nodes drop the message. You can obviously improve on this with any of a
variety of existing multicast tree algorithms.

Of course, wormhole networks 'suffer' from these same problems, because you're not
using the feature these methods try to make efficient - point-to-point communication.
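For concreteness, here is a minimal sketch of the binomial broadcast tree Grunwald
describes, written over ordinary point-to-point NX sends (csend(), crecv(), mynode(),
and numnodes() are assumed; the root is fixed at node 0, the node count is a power of
two, and TYPE_BCAST is an arbitrary message type). Each node receives the message once
from its parent and then forwards it down its subtree, so the whole broadcast finishes
in log2(N) single-hop steps.

    #define TYPE_BCAST 17

    extern long mynode(), numnodes();

    /* Broadcast len bytes in buf from node 0 to every node in O(log N) steps. */
    void tree_bcast(char *buf, long len)
    {
        long me = mynode();
        long n  = numnodes();
        long bit, start, dest;

        if (me != 0)                        /* everyone but the root waits for */
            crecv(TYPE_BCAST, buf, len);    /* the copy from its parent        */

        /* Node 0 starts forwarding at bit 0; any other node starts one bit
           above its own highest set bit, so each node is reached exactly once. */
        start = 1;
        while (start <= me)
            start <<= 1;

        for (bit = start; bit < n; bit <<= 1) {
            dest = me + bit;
            if (dest < n)
                csend(TYPE_BCAST, buf, len, dest, 0L);
        }
    }

Called on every node after node 0 has filled buf, this sends only along hypercube edges
(each child differs from its parent in a single address bit), which is why circuit
switching buys nothing extra for a broadcast tree.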