[comp.parallel] iPSC/860 Communication Performance

wangjw@cs.purdue.edu (Jingwen Wang) (04/23/91)

  The iPSC/860 hypercube is noted for its very fast processor and
circuit-switched communication. The processor is around 9-10 times
faster than the Ncube/2's, as we recently measured. However, the
communication speed has not been raised by a comparable factor. This
disappoints people who move their iPSC/2 or Ncube code to this machine,
because the speedup drops dramatically.

  We were at first pleased to see that broadcast and multicast, as well
as many other global communication calls, are provided on the iPSC/860.
But after some experiments, we found that most of these global
communication calls are not as efficient as we had expected.

  The broadcast function is invoked through the csend or isend call with
the destination parameter specified as -1. We compared the speed of this
call with the simplest method -- a loop sending a separate message to
every other node. The average elapsed time from sending to receiving for
the broadcast call is around half that of the looping method for a
message length of 500. This is quite good, since it genuinely speeds up
the communication. It is also possible to broadcast to a subcube of nodes.
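
  For readers unfamiliar with the calls involved, here is a minimal
sketch of the two methods we compared (timing only the send side, for
simplicity). The NX call names and signatures (csend, crecv, mynode,
numnodes, dclock) are written from memory; check the iPSC manual before
trusting them exactly.

/* Sketch: broadcast via csend(..., -1, ...) vs. an explicit loop. */
#include <stdio.h>

#define MSGLEN  500L
#define MSGTYPE 10L

char buf[MSGLEN];

main()
{
    long me = mynode(), nodes = numnodes(), i;
    double t0;

    if (me == 0) {
        /* Method 1: one broadcast call; destination -1 means all. */
        t0 = dclock();
        csend(MSGTYPE, buf, MSGLEN, -1L, 0L);
        printf("broadcast send: %f s\n", dclock() - t0);

        /* Method 2: explicit loop, one message per destination.   */
        t0 = dclock();
        for (i = 1; i < nodes; i++)
            csend(MSGTYPE, buf, MSGLEN, i, 0L);
        printf("looping send:   %f s\n", dclock() - t0);
    } else {
        crecv(MSGTYPE, buf, MSGLEN);    /* the broadcast            */
        crecv(MSGTYPE, buf, MSGLEN);    /* the looped message       */
    }
}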

  The multicast function of the i860 machine has exactly the same speed
as the simplest looping method (sending a separate message to each
destination). The only benefit is notational convenience (and even then
you have to prepare a destination list, which is at least as complicated
as a loop sending the messages separately). This is difficult to improve
upon, because the destinations are arbitrary rather than regular, unlike
broadcast.
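
  The multicast call in question is, if I remember the name right,
gsendx(); the signature below is from memory and should be verified.
A sketch of the two equivalent-speed alternatives:

/* Multicast call vs. the "simplest looping method". Both were
 * measured to run at the same speed on the iPSC/860.          */
#define MLEN  500L
#define MTYPE 20L

char mbuf[MLEN];

multicast_call(long *dest, long ndest)
{
    /* One call, but the destination list must be built first. */
    gsendx(MTYPE, mbuf, MLEN, dest, ndest);
}

multicast_loop(long *dest, long ndest)
{
    long i;

    for (i = 0; i < ndest; i++)
        csend(MTYPE, mbuf, MLEN, dest[i], 0L);
}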

  There are also global collection routines that gather a contribution
of data from each node; after the operation, every node holds a copy of
the complete collection. Such an operation substantially reduces
communication time on store-and-forward networks (it takes about the
same time as a single broadcast). But on the iPSC/860, it is actually
slightly slower than having each node send a broadcast message, which
effects a full exchange.
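
  The collection routine we used was (as best I recall) gcolx(); its
signature here, with per-node byte counts in xlens, is from memory, as
is infonode(), which I remember as reporting the sender of the last
received message. A sketch of the library call next to the
everyone-broadcasts version that turned out slightly faster:

/* Global collection vs. all-nodes-broadcast (a full exchange). */
#include <string.h>

#define CHUNK  500L
#define CTYPE  30L
#define MAXNOD 128

char mine[CHUNK];            /* this node's contribution        */
char all[CHUNK * MAXNOD];    /* every node ends up with this    */

collect_library()
{
    long i, xlens[MAXNOD];

    for (i = 0; i < numnodes(); i++)
        xlens[i] = CHUNK;
    gcolx(mine, xlens, all);
}

collect_by_broadcast()
{
    static char tmp[CHUNK];
    long i, n = numnodes();

    csend(CTYPE, mine, CHUNK, -1L, 0L);           /* to all others */
    memcpy(all + mynode() * CHUNK, mine, CHUNK);  /* keep own part */
    for (i = 0; i < n - 1; i++) {                 /* n-1 arrivals  */
        crecv(CTYPE, tmp, CHUNK);
        memcpy(all + infonode() * CHUNK, tmp, CHUNK);
    }
}
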
  It seems that the only benefit of circuit-switched networks for global
communication is broadcast, which saves time by sending messages
simultaneously over several channels. Of course, for point-to-point
communication, they are certainly better than store-and-forward message
passing.

  The above comments are only some negative points about circuit-switched
networks. Some experts are clearly over-optimistic about the performance
of such networks; they need improvement, too.

  Jingwen Wang
  Dept. CS. 
  Purdue University

  wangjw@cs.purdue.edu


-- 
=========================== MODERATOR ==============================
Steve Stevenson                            {steve,fpst}@hubcap.clemson.edu
Department of Computer Science,            comp.parallel
Clemson University, Clemson, SC 29634-1906 (803)656-5880.mabell

tve@sprite.berkeley.edu (Thorsten von Eicken) (04/23/91)

In article <1991Apr23.123808.10313@hubcap.clemson.edu> wangjw@cs.purdue.edu () writes:
>  The iPSC/860 hypercube is noted for its very fast processor and
>circuit-switched communication. The processor is around 9-10 times
>faster than the Ncube/2's, as we recently measured. However, the
>communication speed has not been raised by a comparable factor. This
>disappoints people who move their iPSC/2 or Ncube code to this machine,
>because the speedup drops dramatically.

That's what I've been suspecting for a while. I would actually claim
that the iPSC/860's communication is worse than the Ncube/2's. If you
do it right, it takes 23 instructions to send and 26 to receive a
message on the Ncube, including both the user and kernel code. That
turns into about 10us for a send and about 15us for a receive. As you
probably know, the standard OS takes 8x longer... What's the peak
bandwidth of the iPSC/860? The Ncube has 2.2Mb/s per link, and unless
you send large (>512 byte) messages, you can only keep between one and
two channels busy (large messages can keep more busy).
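
For anyone who wants to reproduce numbers like these, a round-trip
(ping-pong) test is the usual way to separate per-message latency from
per-byte cost. A sketch, with NX call names again from memory:

/* Ping-pong between nodes 0 and 1; halve the round-trip time for
 * the one-way message cost at each length.                       */
#include <stdio.h>

#define REPS 1000

ping_pong(long len)
{
    static char buf[65536];
    long i;
    double t0;

    if (mynode() == 0) {
        t0 = dclock();
        for (i = 0; i < REPS; i++) {
            csend(1L, buf, len, 1L, 0L);    /* to node 1   */
            crecv(1L, buf, len);            /* echoed back */
        }
        printf("len %5ld: one-way %f us\n", len,
               (dclock() - t0) * 1.0e6 / REPS / 2.0);
    } else if (mynode() == 1) {
        for (i = 0; i < REPS; i++) {
            crecv(1L, buf, len);
            csend(1L, buf, len, 0L, 0L);    /* echo        */
        }
    }
}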
	Thorsten von Eicken


jet@karazm.math.uh.edu ("J. Eric Townsend") (04/24/91)

In article <1991Apr23.123808.10313@hubcap.clemson.edu> wangjw@cs.purdue.edu () writes:
>every other node. The average elapsed time from sending to receiving for
>the broadcast call is around half that of the looping method for a
>message length of 500. This is quite good, since it genuinely speeds up

The iPSC message-passing software has a magic size of 100 bytes.  Anything
<=100 bytes takes only one message send.  Anything >100 bytes takes *three*
message sends, to ensure that there is enough space in the receiving buffer.

I would suggest that you do timings of both messages <=100 bytes and
messages >100 bytes.

Also, look closely at using FORCETYPE in your message type.  If you
*know*  you have room to receive a message, the sender can use
FORCETYPE and the "is there room?" message exchange does not happen.
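
A quick way to see the threshold is to time the same transfer just
below and just above 100 bytes. The sketch below uses the NX calls from
memory; to test the force-type path you would fold the FORCETYPE bit
into the message type, but I haven't verified the exact constant, so
only the ordinary path is timed here.

/* Time REPS one-way sends at a given length; node 1 acks once at
 * the end so node 0's clock covers all REPS messages.            */
#include <stdio.h>

#define REPS 1000

time_size(long len)
{
    static char buf[1024];
    long i;
    double t0;

    if (mynode() == 0) {
        t0 = dclock();
        for (i = 0; i < REPS; i++)
            csend(2L, buf, len, 1L, 0L);
        crecv(3L, buf, 0L);                     /* final ack     */
        printf("len %4ld: %f us/msg\n", len,
               (dclock() - t0) * 1.0e6 / REPS);
    } else if (mynode() == 1) {
        for (i = 0; i < REPS; i++)
            crecv(2L, buf, len);
        csend(3L, buf, 0L, 0L, 0L);             /* ack to node 0 */
    }
}

main()
{
    time_size(100L);   /* at or below the threshold: one send    */
    time_size(104L);   /* above it: the three-send handshake     */
    time_size(500L);
}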


--
J. Eric Townsend - jet@uh.edu - bitnet: jet@UHOU - vox: (713) 749-2120
Skate UNIX or bleed, boyo...
(UNIX is a trademark of Unix Systems Laboratories).


jet@karazm.math.uh.edu ("J. Eric Townsend") (04/25/91)

In article <1991Apr23.165213.10093@agate.berkeley.edu> tve@sprite.berkeley.edu (Thorsten von Eicken) writes:
>bandwidth of the iPSC/860? The Ncube has 2.2Mb/s per link, and unless

2.8Mb/s.

--
J. Eric Townsend - jet@uh.edu - bitnet: jet@UHOU - vox: (713) 749-2120
Skate UNIX or bleed, boyo...
(UNIX is a trademark of Unix Systems Laboratories).


grunwald@foobar.colorado.edu (Dirk Grunwald) (05/02/91)

>>>>> On 22 Apr 91 20:14:05 GMT, wangjw@cs.purdue.edu (Jingwen Wang) said:
	...
JW>   The broadcast function is invoked through the csend or isend call with
JW> the destination parameter specified as -1. We compared the speed of this
JW> call with the simplest method -- a loop sending a separate message to
JW> every other node. The average elapsed time from sending to receiving for
JW> the broadcast call is around half that of the looping method for a
JW> message length of 500. This is quite good, since it genuinely speeds up
JW> the communication. It is also possible to broadcast to a subcube of nodes.
--

The 'simplest method' is not to send to every node in the system; it's
to use a broadcast tree, making broadcast an O(lg N) process, not
O(N). That's what the Intel O/S does. Your factor-of-1/2 advantage over
the looping method would then hold only for N=4. For N=8, a tree is
about 2.6 times faster; for N=16, about 4 times faster; etc.
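
For concreteness, here is a sketch of such a tree broadcast rooted at
node 0 (bcast_tree is a hypothetical helper written around the NX
calls, not itself part of NX):

/* O(lg N) broadcast over a binomial spanning tree. At step k,
 * nodes 0..2^k-1 forward across dimension k, so the set of
 * informed nodes doubles each step.                           */
bcast_tree(char *buf, long len, long type)
{
    long me = mynode(), n = numnodes();
    long d, k;

    for (d = 0; (1L << d) < n; d++)   /* cube dimension */
        ;

    /* Every node but the root first receives from its parent
     * (its own address with the highest set bit cleared).     */
    if (me != 0)
        crecv(type, buf, len);

    for (k = 0; k < d; k++)
        if (me < (1L << k))
            csend(type, buf, len, me | (1L << k), 0L);
}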

From your value of '2' I assume you used a 4-node system, or possibly
8 nodes with imprecision in your results.

And of course, circuit-switched networks have *no* benefit for broadcast
trees, because bcast trees use only single-hop communication. The only
way to improve bcast is to add a communication processor to offload the
store-and-replicate work.

JW>   The multicast function of the i860 machine has exactly the same speed
JW> as the simplest looping method (sending a separate message to each
JW> destination). The only benefit is notational convenience (and even then
JW> you have to prepare a destination list, which is at least as complicated
JW> as a loop sending the messages separately). This is difficult to improve
JW> upon, because the destinations are arbitrary rather than regular, unlike
JW> broadcast.
--

Again, if you're sending to more than lg N nodes, you're better off
(modulo the fact that you interrupt everyone) simply broadcasting to
the entire network and having uninterested nodes drop the message. You
can obviously improve on this with any of the variety of multicast tree
algorithms that exist.
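
A sketch of that broadcast-and-filter multicast (the helpers are
hypothetical, not NX calls; infocount() is, as I recall, the NX query
for the length of the last received message):

/* Sender: prepend the destination list, then broadcast to all.  */
#include <string.h>

#define MCTYPE 40L

multicast_via_bcast(char *data, long len, long *dest, long ndest)
{
    static char pkt[8192];
    long hdr = (ndest + 1) * sizeof(long);

    memcpy(pkt, &ndest, sizeof(long));
    memcpy(pkt + sizeof(long), dest, ndest * sizeof(long));
    memcpy(pkt + hdr, data, len);
    csend(MCTYPE, pkt, hdr + len, -1L, 0L);
}

/* Receiver: keep the payload only if this node is on the list. */
long mcast_recv(char *data)
{
    static char pkt[8192];
    long i, ndest, *dest, hdr, got;

    crecv(MCTYPE, pkt, 8192L);
    got = infocount();                /* bytes actually received */
    memcpy(&ndest, pkt, sizeof(long));
    dest = (long *)(pkt + sizeof(long));
    hdr  = (ndest + 1) * sizeof(long);
    for (i = 0; i < ndest; i++)
        if (dest[i] == mynode()) {
            memcpy(data, pkt + hdr, got - hdr);
            return got - hdr;         /* message was for us      */
        }
    return -1;                        /* not for us: drop it     */
}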

Of course, worm hole networks 'suffer' from these same problems,
because you're not using the feature these methods attempt to make
efficient - point-to-point communication.
