[comp.parallel] Connection Machine Fortran

reinhard@rice.edu (Reinhard von Hanxleden) (05/30/91)

Dear netters,

I would be grateful for comments/references on two issues regarding the 
Connection Machine.

1. IRREGULAR PROBLEMS  (PIC codes, sparse matrices, _your favorite_, ...)
  - Which problems have been tackled? How did the implementation differ 
    from equivalent MIMD implementations?
  - Have "classic" (MIMD) load balancing paradigms (e.g. recursive bisection)
    been applicable?
  - Any experiences in using indirection vectors to distribute a problem 
    evenly?

2. GENERATING EFFICIENT CM-FORTRAN (CMF) FROM F77 / F90
  - Who has experiences in tackling this with standard vectorizers?
    (One report I am aware of is by J. Myczkowski and R. Shapiro, trying
    the KAP vectorizer on 4 small samples; J. Mellor-Crummey has used PFC
    on a larger weather code.)
  - Have any vectorizers been modified for this task yet?
  - Any attempts to generate Layouts / Alignments, Foralls, or other CMF-
    specialities automatically from either F77 or F90? Have any major
    problems been identified yet? (Now, THIS should cause an avalanche,
    right?)

Any piece of information will be appreciated, collected, and - if
sufficient interest arises - be posted. Thanks in advance,

Reinhard v. Hanxleden			

reinhard@rice.edu 
Department of Computer Science, Rice University
P.O. Box 1892, Houston, TX 77251, (713) 527-4834

-- 
=========================== MODERATOR ==============================
Steve Stevenson                            {steve,fpst}@hubcap.clemson.edu
Department of Computer Science,            comp.parallel
Clemson University, Clemson, SC 29634-1906 (803)656-5880.mabell

reinhard@rice.edu (Reinhard von Hanxleden) (06/07/91)

Here is a summary of the responses to my posting last week. Thanks to
everybody who answered!

Reinhard v. Hanxleden			


###########################################################################
From: ajclear@cs.sandia.gov (Andrew J. Cleary)

Re your posting to comp.parallel on CM Fortran: Steve Kratzer
at Supercomputing Research Center (kratzer@super.org) implemented
a sparse QR factorization algorithm a year ago in Paris that he
is almost done porting to CM Fortran.  Indeed, the algorithm uses
a very different methodology than MIMD sparse factorization algorithms:
it forgoes the usual coarse-grained parallelism used in MIMD
in favor of fine-grained parallelism, with the benefit that more
"regularity" is introduced.  Gilbert and Schreiber published a tech
report on sparse Cholesky that utilizes a more MIMD-type approach,
but got crappy results.
of designing and coding (yes, in CM Fortran) a sparse Cholesky code
based on his QR code but with some major differences. While it is
still pretty different from MIMD codes, with the new slicewise
compiler we are using the CM in a more coarse-grained fashion, so
that the new algorithm/code looks more like MIMD codes than
Steve's original code did.  It remains to be seen whether any of these
approaches will utilize the CM at anywhere near acceptable speeds
(so far, none has, with Steve's QR code coming in at about 25 Mflops).

There are various quirks associated with the slice-wise compiler.
Some are bugs that get fixed occasionally.  Some are a little more
basic to the compiler (especially those involving performance and
axes laid out serially), though rumor has it that the next version
of the compiler is due out in months and will be more clever with
serial axes.

I know nothing about attempts at automatic generation of CM Fortran
code from old code.

###########################################################################
From: Harry Berryman <berryman-harry@cs.yale.edu>

You should ask Seema Hiranindani (now Mirchindani) about the work by
Saltz, Berryman, etc.  I'm assuming that you're with Ken K.'s group.
What kind of stuff do you intend to do? We've done a fair amount
of irregular stuff on both the 'cube and the CM. 

###########################################################################
From: @duncan.cs.utk.edu:clay@cs.utk.edu (Clay P. Breshears)

Although I have had no direct experience using the methods that you
mention, I do know that Thinking Machines Corp. has a package that maps
irregular problems to the Connection Machine quite efficiently.  It is an
unofficial product from TMC called the Routing Compiler, written by Denny Dahl. 
It takes a communication graph of your problem and generates a mapping of the
problem to the CM architecture that will minimize the distance communications
must travel.
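The idea can be sketched quickly. The following is a toy illustration of the mapping problem, NOT TMC's actual Routing Compiler (all names and the greedy heuristic are my own assumptions): place the vertices of a communication graph onto hypercube nodes so that vertices that talk to each other end up a short Hamming distance apart.

```python
def hamming(a, b):
    # Hypercube hop count between node addresses a and b.
    return bin(a ^ b).count("1")

def greedy_map(edges, n_vertices, dim):
    # Assign each graph vertex a distinct node of a dim-dimensional
    # hypercube, minimizing distance to already-placed neighbors.
    # (Illustrative heuristic only, not Dahl's algorithm.)
    free = list(range(2 ** dim))
    placement = {}
    for v in range(n_vertices):
        nbrs = [placement[u] for (u, w) in edges if w == v and u in placement]
        nbrs += [placement[w] for (u, w) in edges if u == v and w in placement]
        best = min(free, key=lambda n: sum(hamming(n, p) for p in nbrs))
        placement[v] = best
        free.remove(best)
    return placement

def comm_cost(edges, placement):
    # Total hop count over all communicating pairs.
    return sum(hamming(placement[u], placement[v]) for u, v in edges)

# A 4-vertex ring embeds in a 2-cube with every neighbor one hop away.
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
pm = greedy_map(ring, 4, 2)
```

Even this naive greedy pass recovers the Gray-code embedding of a ring; the real product presumably does far better on irregular graphs.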

As of mid-January this year, the Routing Compiler was out of beta test and went
out with the 6.0 release of the system.  A CM Fortran interface is available.

If you want more information, I would suggest contacting TMC directly, since the
above is just about everything that I have on it.

###########################################################################
From: David B. Serafini <serafini@amelia.nas.nasa.gov>

Denny Dahl at Thinking Machines has been doing some work on how to allocate
points in a computational grid to virtual-processors based on analysis of
communication patterns.  This may be applicable to sparse matrices too.
It produces significant reductions in communication time by balancing the
communication bandwidth available and required.  

###########################################################################
From: Steve Hammond <hammonds@riacs.edu>

I have written two papers covering different aspects of working
with irregular or unstructured problems on the CM.

S. W. Hammond and T. J. Barth, "An Efficient Massively Parallel
Euler Solver for 2D Unstructured Grid Problems", to appear in
the AIAA Journal, also, AIAA paper 91-0441, AIAA Aerospace Sciences
Meeting, Reno, Nevada, Jan. 1991

and

S. W. Hammond and Robert Schreiber, "Mapping Unstructured Grid
Problems to the Connection Machine", RIACS Technical Report 90.22,
October, 1990.

The problem is that very little can be done by the system at
"compile" time since the irregularity is not known until 
run time.  Check the proceedings of the 1990 hypercube conference
for a paper by Denny Dahl of Thinking Machines.  He has done work
on a communication compiler that stores the communication pattern
at runtime and then uses it over and over again for irregular
communications.

If you want either of the two papers listed above, send me some
mail with your address.

>

Have you heard of the "PARTY" system being worked on
by Joel Saltz and some others at ICASE?
I don't know too much about it because it is on the fringe
of my interests but it is supposed to be a runtime system
to manage unstructured communications.

###########################################################################
From: hugo@griggs.dartmouth.edu (Peter Su)

I'm not too familiar with these kinds of problems, but I have been
doing a lot of CM work lately.  I think that much of this is doable.
In particular, load balancing and stuff is pretty easy.  There is a
paper in the upcoming SPAA conference that talks about an
implementation of a sorting algorithm that does a kind of distributed
load balancing.  There is also some code floating around that
implements quad trees, I believe, which is basically recursive
bisection, right?
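For concreteness, here is a minimal sketch of recursive coordinate bisection, the "classic" load-balancing scheme the original posting asked about (my own illustrative Python, not the quad-tree code mentioned above): split the point set in half along its widest coordinate, then recurse on each half.

```python
def bisect(points, depth):
    # Partition 2-D points into 2**depth equal-sized pieces by
    # recursively splitting along the coordinate with the widest extent.
    if depth == 0:
        return [points]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    axis = 0 if max(xs) - min(xs) >= max(ys) - min(ys) else 1
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return bisect(pts[:mid], depth - 1) + bisect(pts[mid:], depth - 1)

pts = [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1), (2, 1), (3, 1)]
parts = bisect(pts, 2)   # four pieces of two points each
```

Because each split is at the median, the pieces are balanced by construction; how well the cuts preserve locality is what varies from problem to problem.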

I don't know of any systems that do vectorization for the CM.  In
principle, it shouldn't be all that much different from vectorizing on
a Cray (say) because the architectures are actually pretty similar.
Just think of the CM as being a big, huge, wide vector register.
Rather than one or two piped floating point units, the CM has up to
2048!  This is the model that CM fortran uses to generate code...
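That "one big vector register" view is easy to picture with a whole-array statement, sketched here in NumPy as a stand-in for CM Fortran array syntax (my notation, not actual CMF output):

```python
import numpy as np

# One data-parallel statement, no explicit loop: every (virtual)
# processor applies the same multiply-add to its own element.
# In CMF this would be  C = A * B + 1.0  with A, B, C spread
# across the machine.
a = np.arange(8.0)
b = np.full(8, 2.0)
c = a * b + 1.0
```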

There are a couple of papers from Compass around that talk about
automatically generating alignments from F90.  See the 1990 frontiers
conference and 


%A Kathleen Knobe
%A Joan Lucas
%A Guy Steele
%T Data Optimization: Allocation of Arrays to Reduce Communication on
SIMD Machines
%D 1990
%J Journal Of Parallel And Distributed Computing 
%V 8
%P 102-118
%K steele90


I'm not sure if the current CMF does any of this automatically; the
docs don't mention it.  This is really the heart of the whole problem
of doing good compilers for machines where access to memory isn't
uniform. This includes machines like the CM, the Cray (uniform access,
like stride-1, is much faster), multiprocessors with cache, and NUMA
MIMD machines.  I think compilers can concentrate on alignment, but
layout is a tougher problem.

Let me know what you find out. 

>

I think the code I have in mind was part of a ray tracer.

Irregular distributions aren't as bad as you might think, because the
CM has facilities for per processor indirect addressing.  Thus, there
is a bit of flexibility with regard to how you access per processor
data.  Now, if the computations that you have to do are wildly
irregular (i.e. not SPMD) then you have trouble.
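The per-processor indirect addressing being described is essentially a vector-valued subscript, sketched here with NumPy fancy indexing standing in for the CM Fortran form (the data values are illustrative): each element slot names its own source element, so an irregular access stays a single data-parallel statement.

```python
import numpy as np

src = np.array([10.0, 20.0, 30.0, 40.0])
idx = np.array([2, 0, 3, 3])   # irregular, repeating indirection vector
dest = src[idx]                # gather: dest(i) = src(idx(i))
```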

###########################################################################
From: nr1886@euclid.math.colostate.edu (Nenad Rijavec)

Regarding your posting <1991May30.163634.17582@rice.edu>, my experience
might be somewhat relevant, even though I am using C* as opposed to CM
FORTRAN.

I am implementing a multidimensional assignment problem solver on a CM. The
problems are extremely sparse and the algorithms used are from combinatorial
and nonsmooth optimization fields. 

My conclusion is that the CM is not particularly well suited for solving
such problems, mainly because array indices pretty much *have*
to be serial. Sure, the CM does allow parallel indices, but they are actually
emulated in software and thus (at least in my case) usually prohibitively
expensive.
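The serial-versus-parallel-index distinction can be made concrete (NumPy again standing in for the CM; my own illustrative data): a scalar (serial) subscript touches the same position in every row, a regular access, while a vector-valued (parallel) subscript lets each row pick its own element, which on the CM turns into general router traffic emulated in software.

```python
import numpy as np

a = np.arange(12.0).reshape(4, 3)

serial = a[:, 1]                  # serial index: column 1 everywhere
k = np.array([0, 2, 1, 0])        # per-row (parallel) index
parallel = a[np.arange(4), k]     # irregular gather, the expensive case
```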

What I do depends heavily on load balancing. My inquiries did not unearth
anything useful, so I've had to work on a dynamic load balancing scheme
myself. I don't know how good it really is (I am a mathematician and not a
computer scientist), but I'll be glad to tell you about it if you are
interested.

