raja@frith.egr.msu.edu (Narayan S Raja) (06/24/89)
I would appreciate reference(s) you may be aware of about ANY IMPLEMENTATION on the Connection Machine (CM-1 or 2). I am particularly interested in papers which discuss the pros and cons and software/architectural problems and issues encountered during the implementation, and actual speedups achieved over a sequential version. In particular I remember a paper I saw some months ago about an implementation on CM (I think it was logic simulation) of "relaxation methods" with actual timings. Unfortunately I don't have the authors or Proceedings to track it down. Could anyone (possibly Eugene Miya) help? Narayan Sriranga Raja.
landman@Sun.COM (Howard A. Landman) (07/06/89)
In article <5839@hubcap.clemson.edu> raja@frith.egr.msu.edu (Narayan S Raja) writes: >I would appreciate reference(s) you may >be aware of about ANY IMPLEMENTATION on the >Connection Machine (CM-1 or 2). I am particularly >interested in papers which discuss the pros >and cons and software/architectural problems >and issues encountered during the implementation, >and actual speedups achieved over a sequential >version. I can give you a quick summary of what I've learned so far from porting my cellular-automaton Go program to the CM-2. The original program was written in C, so I chose C* for the CM language. There are some interesting consequences of that. First, large chunks of the code needed no, or minor, rewriting in order to merely function correctly. Unfortunately, the single CM processors did not have enough memory to implement my core algorithm on a single processor, so I decided to change the CA core to use one processor per cell. This is really the "natural" embedding of a 2D grid CA into the CM. When I did that, the nicest thing that happened was that some operations which I previously implemented with procedures became simple logical operations on (poly) variables. But some not so nice things happened as well. For one thing, even though the CM is bit-addressible, C* assumes that the machine is byte-addressible, in the sense that it is not possible to declare a 1-bit or a 19-bit variable. You only get the standard C types like char, short, int & long. So I use 8 bits (a char) where 1 bit would do. Also, some features of C* do not work - a simple switch statement caused the "cs" compiler to segmentation fault. In a sense I think C* is a second-class citizen on the CM. There are several types of CM operations which are impossible to generate from pure C* code. I don't think this is true with StarLisp. In Hillis' book "The Connection Machine" he defines the CM as a machine designed to run CM-Lisp; but a machine designed to run C* would look somewhat different (e.g. it would be byte addressible). C* is a kind of compromise between compatibility with C and the CM architecture. The language design is very clean and the compatibility with standard C is great, but some of the power of the CM is not expressed in the language. Also, although I was doing largely N-E-W-S communication, for various reasons I chose not to use the built-in grid communication package. For one thing it is not entirely natural to C*, but is embodied in a function library. Another reason had to do with the exact size and layout of my CAs. In retrospect this was probably a mistake. The general communication mechanism is very flexible, but also very slow compared to the grid-based one. However, not all the communication overhead is due to the CA operations; For example, even a seemingly innocuous if-then-else can apparently trigger global-OR computation. I'm going to give a performance measure now, but I first want to qualify it in several ways. First, I'm comparing just-barely-working and completely unoptimized and not-using-the-grid and using-8-bits-for-1 C* code on a CM to *highly* optimized and word-parallel and loop-unrolled C code on a Sun 4. Second, I don't think this has any implications for the CM using CM-Lisp or StarLisp or Fortran 8X, since they behave very differently. Third, taking performance measurements on a CM is tricky and I'm not extremely confident of the numbers. Fourth, one data point is not enough to draw any meaningful conclusions. Given the above caveats: I am currently seeing about 3 times the performance of a 10 MIP Sun 4 when I use 16K CM processors configured as 128K virtual processors. This represents only about 2% of the theoretical power of those processors. I believe I will be able to improve this substantially as I understand the performance issues better and change my coding style. Almost all the time seems to be spent in the communications. One last comment - the "cs" C* compiler is pretty slow, about an order of magnitude slower than "cc" on the intermediate C code. "Make" and a measure of patience are necessary to use it well. It helps if you keep your source files small, both for compile speed, and because (on a VAX, but not a Sun, host) I have sometimes overrun compiler limits with my largest file. I would like to conclude by thanking both Thinking Machines Corp. and the Argonne National Labs Advanced Computing Research Facility for generously allowing me to use their facilities for these continuing investigations. I hope to have more interesting things to report soon, like how strong a massively parallel Go program can be! Howard A. Landman landman@sun.com