[comp.parallel] Help! Connection Machine reference

raja@frith.egr.msu.edu (Narayan S Raja) (06/24/89)

I would appreciate any reference(s) you may
be aware of to ANY IMPLEMENTATION on the
Connection Machine (CM-1 or CM-2).  I am particularly
interested in papers which discuss the pros
and cons, the software/architectural problems
and issues encountered during the implementation,
and the actual speedups achieved over a sequential
version.  In particular, I remember a paper I
saw some months ago about a CM implementation
of "relaxation methods" (I think it was for
logic simulation) with actual timings.
Unfortunately I don't have the authors
or the Proceedings to track it down.  Could anyone
(possibly Eugene Miya) help?

Narayan Sriranga Raja.

landman@Sun.COM (Howard A. Landman) (07/06/89)

In article <5839@hubcap.clemson.edu> raja@frith.egr.msu.edu (Narayan S Raja) writes:
>I would appreciate any reference(s) you may
>be aware of to ANY IMPLEMENTATION on the
>Connection Machine (CM-1 or CM-2).  I am particularly
>interested in papers which discuss the pros
>and cons, the software/architectural problems
>and issues encountered during the implementation,
>and the actual speedups achieved over a sequential
>version.

I can give you a quick summary of what I've learned so far from porting
my cellular-automaton Go program to the CM-2.

The original program was written in C, so I chose C* as the CM language.
There are some interesting consequences of that choice.  First, large chunks
of the code needed no rewriting, or only minor rewriting, merely to function
correctly.  Unfortunately, an individual CM processor did not have enough
memory to hold my core algorithm, so I decided to change the CA core to use
one processor per cell.  This is really the "natural" embedding of a 2D grid
CA into the CM.
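
To give a feel for what that mapping replaces, here is roughly the shape of
the sequential sweep in plain C - a bare sketch with invented names (BOARD,
cell_rule) and a placeholder rule, not my actual code.  On the CM the two
spatial loops disappear: each (virtual) processor holds one cell and applies
the rule to its own state.

/* Sequential sketch of one CA generation.  BOARD, cell_rule() and the
 * state arrays are invented for illustration, the rule body is a
 * placeholder (not the real Go CA), and wraparound edges keep it short. */
#include <string.h>

#define BOARD 19

static unsigned char state[BOARD][BOARD];
static unsigned char next[BOARD][BOARD];

static unsigned char cell_rule(unsigned char self, unsigned char n,
                               unsigned char e, unsigned char w,
                               unsigned char s)
{
    return (unsigned char)(self | (n & e & w & s));   /* placeholder rule */
}

void ca_step(void)
{
    int i, j;

    for (i = 0; i < BOARD; i++) {
        for (j = 0; j < BOARD; j++) {
            unsigned char n = state[(i + BOARD - 1) % BOARD][j];   /* north */
            unsigned char s = state[(i + 1) % BOARD][j];           /* south */
            unsigned char e = state[i][(j + 1) % BOARD];           /* east  */
            unsigned char w = state[i][(j + BOARD - 1) % BOARD];   /* west  */

            next[i][j] = cell_rule(state[i][j], n, e, w, s);
        }
    }
    memcpy(state, next, sizeof state);
}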

When I did that, the nicest thing that happened was that some operations
which I previously implemented with procedures became simple logical
operations on (poly) variables.  But some not so nice things happened as
well.  For one thing, even though the CM is bit-addressable, C* assumes
that the machine is byte-addressable, in the sense that it is not possible
to declare a 1-bit or a 19-bit variable.  You only get the standard C types
like char, short, int & long, so I use 8 bits (a char) where 1 bit would do.
Also, some features of C* do not work - a simple switch statement made the
"cs" compiler crash with a segmentation fault.  In a sense I think C* is a second-class
citizen on the CM.  There are several types of CM operations which are
impossible to generate from pure C* code.  I don't think this is true with
StarLisp.  In Hillis' book "The Connection Machine" he defines the CM as
a machine designed to run CM-Lisp; but a machine designed to run C* would
look somewhat different (e.g., it would be byte-addressable).  C* is a kind
of compromise between compatibility with C and the CM architecture.  The
language design is very clean and the compatibility with standard C is great,
but some of the power of the CM is not expressed in the language.
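
To make the 8-bits-for-1 point concrete: on a serial machine ordinary C lets
you pack several flags into one char, while in C* every per-processor
variable has to be a standard C type.  (Plain C sketch, field names invented.)

/* On a serial machine several 1-bit flags pack into one char: */
#define BLACK    0x01
#define WHITE    0x02
#define LIBERTY  0x04

struct cell_packed {
    unsigned char flags;          /* three flags in 3 bits of one char */
};

/* In C* every per-processor variable is a standard C type, so each
 * 1-bit flag costs a whole char (8 bits) of a processor's memory: */
struct cell_unpacked {
    unsigned char black;          /* 8 bits holding 1 bit of information */
    unsigned char white;
    unsigned char liberty;
};

Testing a flag goes from (c.flags & LIBERTY) to just c.liberty, which is
pleasantly simple, but at eight times the storage per flag.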

Also, although I was doing largely N-E-W-S communication, for various
reasons I chose not to use the built-in grid communication package.  For
one thing, it is not entirely natural in C*, but is embodied in a function
library.  Another reason had to do with the exact size and layout of my CAs.
In retrospect this was probably a mistake.  The general communication
mechanism is very flexible, but also very slow compared to the grid-based one.
However, not all the communication overhead is due to the CA operations;
for example, even a seemingly innocuous if-then-else can apparently trigger
a global-OR computation.
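
That last point makes more sense if you think about how a SIMD machine has
to execute a data-parallel if-then-else.  Here is a plain-C emulation of the
idea - a conceptual sketch with invented names, not C* compiler output or a
description of what the CM actually emits:

#define NPROC 16384

static int context[NPROC];       /* 1 = processor active in current branch */
static int value[NPROC];

/* global-OR: is any processor's flag set? */
static int global_or(const int *flags, int n)
{
    int i;
    for (i = 0; i < n; i++)
        if (flags[i])
            return 1;
    return 0;
}

void parallel_if_nonzero(void)
{
    int i;

    /* every processor evaluates the condition into a context flag */
    for (i = 0; i < NPROC; i++)
        context[i] = (value[i] != 0);

    /* the front end may need a global OR over those flags to decide
     * whether it can skip the branch entirely - hence the hidden
     * communication even in a "local"-looking if-then-else */
    if (global_or(context, NPROC)) {
        for (i = 0; i < NPROC; i++)
            if (context[i])
                value[i] = 1;    /* "then" arm, executed under the mask */
    }
    /* the "else" arm would run under the complemented context */
}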

I'm going to give a performance measure now, but I first want to qualify it
in several ways.  First, I'm comparing just-barely-working and completely
unoptimized and not-using-the-grid and using-8-bits-for-1 C* code on a CM to
*highly* optimized and word-parallel and loop-unrolled C code on a Sun 4.
Second, I don't think this has any implications for the CM using CM-Lisp
or StarLisp or Fortran 8X, since they behave very differently.  Third,
taking performance measurements on a CM is tricky and I'm not extremely
confident of the numbers.  Fourth, one data point is not enough to draw
any meaningful conclusions.  Given the above caveats: I am currently seeing
about 3 times the performance of a 10-MIPS Sun 4 when I use 16K CM processors
configured as 128K virtual processors.  This represents only about 2% of the
theoretical power of those processors.  I believe I will be able to improve
this substantially as I understand the performance issues better and change
my coding style.  Almost all of the time seems to be spent in communication.
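
For what it's worth, here is the same measurement read back as a
back-of-the-envelope calculation (nothing here beyond the figures above,
and the same caveats apply):

	16K physical processors as 128K VPs   =>  VP ratio of 8
	3 x (a 10-MIPS Sun 4)                 ~=  30 MIPS-equivalent delivered
	30 / 0.02                             ~=  1500 MIPS-equivalent theoretical
	                                          peak for the 16K-processor section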

One last comment - the "cs" C* compiler is pretty slow, about an order of
magnitude slower than "cc" on the intermediate C code.  "Make" and a
measure of patience are necessary to use it well.  It helps if you keep your
source files small, both for compile speed and because (on a VAX host, but
not on a Sun host) I have sometimes overrun compiler limits with my largest file.

I would like to conclude by thanking both Thinking Machines Corp. and
Argonne National Laboratory's Advanced Computing Research Facility for generously
allowing me to use their facilities for these continuing investigations.
I hope to have more interesting things to report soon, like how strong a
massively parallel Go program can be!

	Howard A. Landman
	landman@sun.com