[comp.parallel] SIMD

rjc@CS.UCLA.EDU (Robert Collins) (07/03/89)

First of all, I don't work for TMC, so this is totally unofficial.  I
simply have had *lots* of experience hacking on the CM2.  I don't
use *lisp, so I can't give details on *lisp function names...

I am posting rather than responding via e-mail so the CM2 doesn't
get a bum rap (and a bad reputation) from inexperienced users.

In article <5886@hubcap.clemson.edu> jps@cat.cmu.edu (James Salsman) writes:
>I really hate having to deal with indirect addressing on
>most SIMD machines.  I wish someone would build a SIMD array
>using PE's with address buffers.  Just one tiny address
>buffer per processor is all I want... nothing fancy.  As
>long as *ALL* the memory addresses have to come over the
>global instruction stream and thus are the *SAME* for each
>element, a lot of potential processing power is going to
>waste!
>
>For example, on the Connection Machine in *Lisp, indirect
>aref!!'s take FOR EVER.  This is SERIOUSLY slowing down the
>Production System that I wrote in *Lisp (regardless, it's
>faster than CMU/Soar's Production System Machine or any
>other implementation of a production system that I've heard
>about.)  TMC has added something called "sideways arrays" to
>help indirect addressing, but the *Lisp manual is totally
>obscure (so what else is new) and from what I can tell, it
>looks like "sideways" means "spread out over several
>physical processors."  Ack/Pft!
> [ ... ]

I'd tell you to RTFM, but I realize TFM can be hard to R (and
understand)! :-)

Yes, your basic CM array reference takes a long time.  Everyone gets
bitten by this one :-).  It runs in time O(n), where n is the
**number of elements in your array**.  This is *bad* for big arrays.
The aref!! function does not do any actual indirect addressing.  What
it does is loop through all indices in your array.  In each iteration,
it turns on every processor whose index equals the loop index,
does a move operation, and goes on to the next element of the array.
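
Roughly, the slow aref!! behaves like the following serial C sketch.
This is just my own illustration (the names and sizes are invented, and
the inner loop over processors is really done in parallel by the
hardware), not the actual *lisp or Paris implementation:

#define NPROCS 8192          /* number of (virtual) processors       */
#define NELEMS 100           /* elements in each per-processor array */

void slow_aref(int array[NPROCS][NELEMS],   /* per-processor arrays      */
               int index[NPROCS],           /* per-processor index value */
               int dest[NPROCS])            /* per-processor destination */
{
    /* One pass of the global instruction stream per possible index. */
    for (int k = 0; k < NELEMS; k++) {
        /* Turn on exactly the processors whose index equals k ...   */
        for (int p = 0; p < NPROCS; p++) {
            if (index[p] == k)
                dest[p] = array[p][k];      /* ... and do one move.  */
        }
    }
    /* Time grows with NELEMS (the array length), not with the number
       of processors: hence the O(n) behavior described above.       */
}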

With the release of version 5.0 of the CM system software, we now have
`fast' indirect addressing.  It is performed by the chip that does
serial to parallel conversion (for feeding the FPA chips).  Since
this hardware works on 32-bit chunks (and there is one for every
32 processing elements), you are stuck with fast arrays of
32-bit elements (or multiples of 32, I guess).  This is a small price
to pay for the speed.  Using the fast (sideways) arrays, an array
reference takes on the order of 2-3 times as long as a 32-bit in-processor
move.  They are sometimes called `sideways' arrays, because each
element in the per-processor arrays is spread across the memory of
32 processors (so they can be accessed in parallel).
If you have a lookup table that is the same for all processors (or at
least in groups of 32), you can minimize your memory requirements
by physically sharing the array.
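
As a rough illustration of that last point, here is a serial C sketch
of the shared-table idea (my own example; the group size of 32 is the
only thing taken from above, and the names and sizes are invented, not
TMC's interface):

#define GROUP    32      /* PEs sharing one 32-bit transposer chip */
#define NGROUPS  256
#define NPROCS   (GROUP * NGROUPS)
#define TABSIZE  1024

int shared_table[NGROUPS][TABSIZE];  /* one physical copy per group,
                                        not one per processor        */

void table_lookup(int index[NPROCS], int dest[NPROCS])
{
    for (int p = 0; p < NPROCS; p++) {
        int g = p / GROUP;                    /* the group p lives in */
        dest[p] = shared_table[g][index[p]];  /* indexed 32-bit read  */
    }
    /* Memory cost: NGROUPS * TABSIZE words instead of NPROCS * TABSIZE. */
}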

James, I am sure that TMC's customer support group will be able to
give you details about how to access both fast arrays and fast
shared arrays from *lisp.  Their email address is csg@think.com.

I completely agree that indirect addressing is an extremely important
aspect of a SIMD architecture.  In fact, I have yet to write a
program for the CM that didn't beg for indirect addressing.

Although TMC perhaps could have handled indirect addressing in a more
general way, I feel that their efforts in this direction have been
quite adequate.

I have done some pretty bizarre things with the CM2, and it has been
a great piece of hardware.  My first project was to study the 
performance issues involved in the emulation of MIMD-style computation
on the CM (a SIMD machine).  This type of emulation is critically
dependent on indirect addressing.  By the way, I realize that MIMD on
the CM is a really brain damaged thing to do, but it works *very* well.
I still haven't figured out many MIMD applications that can use a million
or more processors :-).  I am convinced that the reverse (emulation of a
SIMD machine on a MIMD machine) would be much more difficult and orders
of magnitude slower.  My latest project has been the simulation of
populations of artificial animals (yes, Artificial Life).  This is also
quite dependent on indirect addressing.  There is no way that an existing
MIMD computer can match the CM for this type of application!  When I first
started working on the CM, I had trouble thinking of ways to use all those
processors.  Now, I have trouble thinking of ways to make my problems
small enough to fit on the CM!  Oh well....

I hope this has helped to clear up the confusion about indirect
addressing on the CM.

Rob
-------------------------------------------------------------------------------
rjc@cs.ucla.edu	            C++/Paris on the CM2:  Object Oriented SIMD madness
-------------------------------------------------------------------------------

aklietz@ncsa.uiuc.edu (Alan Klietz) (07/05/89)

In article <5902@hubcap.clemson.edu> rjc@CS.UCLA.EDU (Robert Collins) writes:
>Yes, your basic CM array reference takes a long time.  Everyone gets
>bitten by this one :-).  It runs in time O(n), where n is the
>**number of elements in your array**.  This is *bad* for big arrays.

If "indirect addressing" means what I think it does -- each PE
grabs a value from an arbitrary address in an arbitrary PE -- then
there exists an algorithm in time O(k log n) to do the equivalent 
operation by repeatedly using the send operator.

Algorithm:

Given a set of PEs { P(i) | 0 <= i < n }, we want each PE to
fetch a word from the PE at f(P(i)).  Note that f need not be invertible.

        Input:  f(i), 0 <= i < n, indirection function
                x(i), set of values to dereference
                T(i), temporary index array

        Output: { y | y(i) = x(f(i)) }

Without loss of generality, we assume i = P(i) and that there
exists a value called nil not in the domain or range of any function.

        for all i
                y(i) := nil, T(i) := nil
        repeat
                more := false
                for all f(i) | f(i) is not nil, do
                        T(f(i)) := i     // send-with-overwrite
                for all T(i) | T(i) is not nil, do
                        more := true
                        y(T(i)) := x(i)  // send
                        T(i) := nil
                for all y(i) | y(i) is not nil, do
                        f(i) := nil
        until more = false

Note that x(f(i)) never appears on the RHS of any assignment; the
indirection is done entirely with sends.

Assuming the forall statements are O(log n), the total time required
is O(k log n) where

        k = max ||{ i | f(i) = j }|| + 1
             j

(k is the largest number of PEs requesting a word from the same PE,
plus one final pass to detect termination.)

The best case is when f is invertible, where it is O(log n).
The worst case is when f is constant, where it is O(n log n).
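
For concreteness, here is a serial C simulation of the algorithm (my
own sketch: each "for all" is of course a parallel send on the real
machine, and NIL just stands in for the nil value above):

#include <stdio.h>

#define N   8
#define NIL (-1)                  /* stands in for "nil" */

void gather(const int x[N], int f[N], int y[N])   /* note: consumes f */
{
    int T[N];
    int more;
    int i;

    for (i = 0; i < N; i++) { y[i] = NIL; T[i] = NIL; }

    do {
        more = 0;
        /* Every unsatisfied requester i sends its own index to its
           target f(i); colliding sends overwrite, so each target is
           left holding exactly one requester this round.             */
        for (i = 0; i < N; i++)
            if (f[i] != NIL)
                T[f[i]] = i;
        /* Each chosen target sends its value back to that requester. */
        for (i = 0; i < N; i++)
            if (T[i] != NIL) {
                more = 1;
                y[T[i]] = x[i];
                T[i] = NIL;
            }
        /* Requesters that got an answer drop out of later rounds.    */
        for (i = 0; i < N; i++)
            if (y[i] != NIL)
                f[i] = NIL;
    } while (more);
}

int main(void)
{
    int x[N] = { 10, 11, 12, 13, 14, 15, 16, 17 };
    int f[N] = {  3,  3,  3,  0,  1,  2,  7,  6 };  /* 3 collisions on PE 3 */
    int y[N];
    int i;

    gather(x, f, y);           /* afterwards y[i] == x[original f[i]] */
    for (i = 0; i < N; i++)
        printf("y[%d] = %d\n", i, y[i]);
    return 0;
}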

--
Alan E. Klietz
University of Illinois at Urbana-Champaign
National Center for Supercomputing Applications
152 Computing Applications Building
605 E. Springfield Avenue
Champaign, IL  61820
Ph: +1 217 244 8024            ARPA:  aklietz@ncsa.uiuc.edu

rjc@CS.UCLA.EDU (Robert Collins) (07/06/89)

In article <5918@hubcap.clemson.edu> aklietz@ncsa.uiuc.edu (Alan Klietz) writes:
>In article <5902@hubcap.clemson.edu> rjc@CS.UCLA.EDU (Robert Collins) writes:
>>Yes, your basic CM array reference takes a long time.  Everyone gets
>>bitten by this one :-).  It runs in time O(n), where n is the
>>**number of elements in your array**.  This is *bad* for big arrays.
>
>If "indirect addressing" means what I think it does -- each PE
>grabs a value from an arbitrary address in an arbitrary PE -- then
>there exists an algorithm in time O(k log n) to do the equivalent 
>operation by repeatedly using the send operator.
>
>Algorithm: [ deleted ]

On the CM2, you can do it a lot faster than Alan's algorithm, if you use
the "fast" arrays described in my article (from which the above quote is
extracted--the quote is about "slow" arrays).

The indirect addressing that I spoke of was an arbitrary index in an
array within the processor.  There are also send/get operations that
write to/read from an arbitrary index of such arrays on an arbitrary
processor.  These operations will allow the sort of indirect addressing
that is described above (well, actually only to a close approximation--but
probably good enough for all practical purposes).
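
In C-like terms, the get amounts to the following (a sketch of the
semantics only; these are not the actual Paris or *lisp names):

#define NPROCS 8192
#define NELEMS 100

void indirect_get(int array[NPROCS][NELEMS],  /* per-processor arrays     */
                  int target[NPROCS],         /* which PE to read from    */
                  int index[NPROCS],          /* which element on that PE */
                  int dest[NPROCS])
{
    /* Each processor p fetches element index[p] from processor
       target[p]; on the real machine this is one routed get, with
       collisions handled by the router rather than a serial loop.    */
    for (int p = 0; p < NPROCS; p++)
        dest[p] = array[target[p]][index[p]];
}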

I do not know the algorithm used by the router microcode, but it will be
faster than the (deleted) algorithm (one would hope).  If you have
really nasty collision patterns, things can slow down considerably.  It
seems like I usually end up with nasty collision patterns :-).

Rob

-------------------------------------------------------------------------------
rjc@cs.ucla.edu	            C++/Paris on the CM2:  Object Oriented SIMD madness
-------------------------------------------------------------------------------