[comp.parallel] "Molecule: Language Construct for Development Parallel Programs"

cline@sunshine.ece.clarkson.edu (Marshall Cline) (05/25/89)

In article <5583@hubcap.clemson.edu> budd@mist.cs.orst.edu (Tim Budd) writes:
>...This allows us to generate
>fine grain SIMD (connection-machine style) code by introducing parallelism
>one way, and coarse grain, MIMD (sequent-style) code by introducing
>parallelism in another way.... not
>something the programmer needs to deal with....
>--tim budd, budd@cs.orst.edu

Did y'all see the article in the May IEEE Transactions on Software Engineering?
     "Molecule: A Language Construct for Development of Parallel Programs",
      by Z. Xu & K. Hwang, pages 587-599.

Personally, I'm interested in the concept of allowing _one_ parallel
program to compile on _many_ hardware platforms regardless of the
hardware's underlying mechanisms.  I'd like to discuss this issue if
others are also interested.

__________MOTIVATION:_HARDWARE_INDEPENDENCE__________________________
Up to now, we've had to rewrite code to port it from one vendor's
parallelization mechanisms to another's.  Issues such as:
        * SIMD vs MIMD
        * fine- vs coarse-grained
        * loosely- vs tightly-coupled
        * mechanisms for mutual exclusion
        * mechanisms for IPC
        etc, etc, etc
are directly visible in most parallel programs.  Although we typically
write in a "high level" language like C, we usually have to directly
control the hardware [I hate to say it, but it's analogous to writing
in assembly language].

"Wouldn't it be nice" if we could write code in a way that vendor
specific issues could be hidden from us?  The "molecule" technique is
a step in this direction.

__________WHAT_IS_A_"MOLECULE"?_______________________________________
OVERVIEW OF THE PAPER: A molecule is roughly like a procedure or function.
However, along with encapsulating "algorithms", "data types", and
"parameters", it also encapsulates "parallel constructs" such as the
above distinctions.  Apparently the authors felt that the (non-strict)
dataflow paradigm is the most flexible (able to be translated into any
of the other paradigms), so the highest level "moleculized" program is
specified using this form.  By "non-strict dataflow", they mean the
single-assignment rule applied to _atomic_ objects; ie: if "i" is an
integer, it can be assigned a value only once, but if "a" is an array of
integers, each element can _individually_ be assigned a value, but only
once.  No lazy evaluation is specified.
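
To make the rule concrete, here's a tiny C-flavored illustration (my
own notation, NOT the paper's -- the paper works in a dataflow language):

    int main(void)
    {
        int i;          /* atomic object: one assignment, ever          */
        int a[4];       /* each element is its own atomic object        */

        i = 1;          /* fine: first assignment to "i"                */
     /* i = 2; */       /* would break the rule: "i" is already defined */

        a[0] = 1;       /* fine: first assignment to element 0          */
        a[1] = 2;       /* fine: element 1 is a separate object         */
     /* a[0] = 3; */    /* would break the rule: element 0 is defined   */

        return 0;
    }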

__________THE_"MODE"_LAYER__________________________________________
They specify a number of other "molecule" paradigms, such as "SIMD",
"parallel", "task", and a bunch of others, all of which are roughly like
_pseudo_code_ for specifying parallel algorithms (ie: the language
doesn't have any mechanisms for allocating processors, that being
taken care of by the next pre-processing step).  The "dataflow" programs
are translated into this "mode" layer by a first pre-processing step.
Programs at this level _do_ have individual processes specified (ie:
decomposing the program into processes is accomplished by the "dataflow
--> mode" preprocessor).  A generic form of IPC is also visible at
this "mode" layer, which was hidden in the top-most layer.

__________THE_"VENDOR_SPECIFIC"_LAYER__________________________________
The next preprocessor ("mode --> vendor-specific") allocates processors,
provides mechanisms for mutual exclusion, and generally looks messy.
They give an example of what code looks like at this layer for inverting
a matrix (coded in the "C" extensions provided by Intel for use on their
iPSC hypercube).  The iPSC "extended-C" version took 250 lines; the
"mode"-layer version required 44 lines; the top layer, only 26.

__________MODIFIABILITY_AND_CODE_MAINTENANCE___________________________
Software Engineering concerns such as modifiability were addressed next.
Changing the granularity of the parallelism required a _complete_rewrite_
at the iPSC hypercube level, resulting in over 1000 lines of code being
changed (program went from 250 --> 1000 lines).  The "mode" layer required
changing two subprograms.  The topmost (dataflow) layer required changing
only two lines.

__________FEASIBILITY?_________________________________________________
QUESTION #1: The authors didn't say whether or not these pre-processors
have any hope of a feasible implementation.  That appears to be an open
question which might require _lots_ and _lots_ of research before an
answer can be found.  Clearly this is a _very_ hard problem, given that
reasoning about what an arbitrary piece of code _does_ is, in general,
undecidable.

I'm interested in _informed_ opinions (ie: _READ_THE_ARTICLE_FIRST_).
What do we think out there?  We've all heard the "know-it-alls" who say
"Can't be done" before the problem is even well studied.  I hope we're
all mature enough to realize that this article is a _first_ step.  Let's
not bash them for what they _didn't_ do; let's discuss what they _did_ do.

__________SOME_OTHER_DISCUSSION_QUESTIONS______________________________
They show a diagram (Fig 2, page 594) demonstrating how the "dataflow"
specification can be translated into {Multicomputers, Multiprocessors,
and Array processors}.  These include:
  * Multicomputers:   {iPSC hypercube, NCUBE, FPS T-series, Transputer/Occam}
  * Multiprocessors:  {Cray X-MP/FORTRAN, Alliant FX/8, Cyber/FORTRAN}
  * Array processors: {MPP/Pascal, Connection-Machine/C*}

QUESTION #2: Is this _broad_ range feasible?  Please discuss only after
understanding the "mode" layer pre-compilation phase -- ie: if it's not
feasible, why not?  Is the "dataflow --> mode" preprocessing step
infeasible?  Is the "task-type --> Cray-X-MP" preprocessing step infeasible?

QUESTION #3: Another discussion question involves the relative _efficiency_
of the generated code.  This is going to open up Pandora's Box, I'm sure,
but it needs to be (politely) discussed.  Code that's generated by a
machine is (probably) never going to be as "efficient" as code generated
by a (skillful) human.  This is the old high-level vs. assembly language
question, but applied to parallelization.  QUESTION: In 20 years, are
"they" going to look back at "us" and say what we say about the "machine
language hackers" of yesteryear?  Ie: Are they going to say "Why in the
world did they worry so much about wasting a few clock cycles?  Why did
they worry about squeezing the last possible ounce of parallelism out of
a piece of hardware?".  Or are they going to say the opposite: "Why did
they ever even _want_ a language specification which is independent of
the parallelization paradigm?"

QUESTION #4: Are parallel languages headed towards levels that are higher
than what we now call "high level" languages?  Ie: Given that we are/will-be
running our stuff on supercomputers, should we _use_ the resources of the
"super"-computer to generate code from "super"-high-level-language specs?
Granted: this is a far-sighted question, but what do you think?

QUESTION #5: Anyone know of any _other_ work being done in this area (ie:
any other methods for "parallelization-schema-independence")?

QUESTION #6: Is "modified-dataflow" the best choice for the highest level?
Is it flexible enough to be translated into the other forms?  Is "your 
favorite form" better in terms of translation into Multicomputers,
Multiprocessors and Array processor architectures?

OTHER QUESTIONS ANYONE?

___________OPINIONS_OPINIONS_OPINIONS_________________________________
OPINION #1: I think it's going to take a _LOT_ of work before the question
of feasibility is answerable.  I'd say "no it's not feasible _today_,
but _may_be_ in the future..."

OPINION #2: Both "preprocessing" steps seem _horribly_ difficult.  For
example, the "mode --> vendor-specific" preprocessing step involves
allocation of processors.  The authors justify their push toward higher
level language specifications based on the notion that "it is extremely
difficult to figure out a good allocation" (page 595, left column, 2nd
to bottom para).  Again, they say that process allocation is a "very
difficult task" which is "handled by a precompiler." (page 595, right
column, 2nd para).

Thus either extreme position undermines itself: those who say that
"the problem is too hard for a machine to do" are admitting that humans
have to forever be saddled with this problem -- and they're not willing
to look into relieving us of the burden.  On the other hand, those who
say "it's easy to do by machine" must then answer the question: "if it's
so easy, why bother? why not just let humans do it?"

OPINION #3: I think that in 20 years "they" will have machines do _LOTS_
more to help generate the code.  Assembly language will be _unheard_of_
in practice.  Writing code which will only run on a single parallel machine
will be considered anathema.  As always, this will present efficiency
problems.  Perhaps these will be solved in the usual "find and streamline
the bottlenecks" method.  Perhaps "streamlining" will involve hand-coding,
or perhaps it will involve "super-parallelizing" the code.  Perhaps we'll
("they'll") need profiling tools to "find" the bottlenecks.  Or perhaps the
machine will generate the code, run it, automatically look for bottlenecks,
automatically re-parallelize the offending section, run it again, etc.

OPINION #4: I don't know.  A "natural language" form (ie: accept English
commands and "execute" them) would certainly qualify as a "super"-high-
level-language.  Where would that leave programmers?  Would we all be out
of a job?  I don't think so.  _Someone_ is going to have to write the code
which translates English (or whatever) into something the machine can
handle!  Can you imagine the Software Maintenance problem that would
create?  Shiver!

Let the discussion begin!
Marshall

--
	________________________________________________________________
	Marshall P. Cline	ARPA:	cline@sun.soe.clarkson.edu
	ECE Department		UseNet:	uunet!sun.soe.clarkson.edu!cline
	Clarkson University	BitNet:	BH0W@CLUTX
	Potsdam, NY  13676	AT&T:	(315) 268-6591

bjornl@tds.kth.se (Björn Lisper) (05/25/89)

In article <5593@hubcap.clemson.edu> cline@sunshine.ece.clarkson.edu
(Marshall Cline) writes:
....
>Did y'all see the article in the May IEEE Transactions on Software Engineering?
>	"Molecule: A Language Construct for Development of Parallel Programs",
>	 by Z. Xu & K. Hwang, pages 587-599.
....
>__________WHAT_IS_A_"MOLECULE"?_______________________________________
>			.....  By "non-strict dataflow", they mean the
>single-assignment rule applied to _atomic_ objects; ie: if "i" is an
>integer, it can be assigned a value only once, but if "a" is an array of
>integers, each element can _individually_ be assigned a value, but only
>once.  No lazy evaluation is specified.

Very interesting. I've always considered the approach of treating whole
arrays as atomic entities in single-assignment languages a failure, since
it lends itself to potentially very inefficient implementations (like
copying whole arrays, or using costly I-structures). The problem, however,
is to verify that a given program with assignments of single array
elements really adheres to the single-assignment rule. In quite a few
cases this can be done by compile-time analysis; in others it's not
possible, as in the following example:

a(f(i)) = x
a(f(j)) = y

where f is an array whose entries are unknown at compile time, so that
possibly f(i) = f(j). The "whole array" approach takes care of this by
demanding that the two assignments assign to different arrays. In the
"array-element-as-atomic-entity" approach one has to provide some run-time
mechanism to check for single-assignment consistency.
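
For what it's worth, the run-time mechanism can be as cheap as one
"defined" bit per element.  A minimal C sketch (my own names, nobody's
actual implementation):

    #include <stdio.h>
    #include <stdlib.h>

    #define N 100

    static int  a[N];
    static char defined[N];     /* "written already?" flag per element */

    /* Write-once store: traps a second assignment to an element. */
    static void store(int idx, int val)
    {
        if (defined[idx]) {
            fprintf(stderr, "single-assignment violation at a(%d)\n", idx);
            exit(1);
        }
        defined[idx] = 1;
        a[idx] = val;
    }

    int main(void)
    {
        int f[2] = { 7, 7 };   /* f(i) == f(j), unknown until run time */
        store(f[0], 1);        /* a(f(i)) = x -- fine                  */
        store(f[1], 2);        /* a(f(j)) = y -- caught here           */
        return 0;
    }

The cost is a flag per element and a test per store, which seems a fair
price for keeping element-wise single assignment honest.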

I think, however, that this is the way to go.

Bjorn Lisper

John.Baugh@LINDENTHAL.CAE.RI.CMU.EDU (05/26/89)

In article <5593@hubcap.clemson.edu> cline@sunshine.ece.clarkson.edu 
(Marshall Cline) writes: [stuff deleted]
>taken care of by the next pre-processing step).  The "dataflow" programs
>are translated into this "mode" layer by a first pre-processing step.
>Programs at this level _do_ have individual processes specified (ie:
>decomposing the program into processes is accomplished by the "dataflow
>--> mode" preprocessor).  A generic form of IPC is also visible at
>this "mode" layer, which was hidden in the top-most layer.

On a related note, Robert Babb has apparently used FORTRAN + data
dependencies in a similar way to drive execution of programs (see
Computer, July '84).  He argues, I believe, that the resulting
programs can be mapped easily onto a variety of hardware 
architectures.

[more stuff deleted]

>QUESTION #4: Are parallel languages headed towards levels that are higher
>than what we now call "high level" languages?  Ie: Given that we are/will-be

The Arvind camp certainly promotes the non-strict dataflow approach
(e.g., I-structures), though at a fine grain of parallelism, because
of its inherent determinacy and parallelism.

A problem that must be dealt with by such languages (i.e., functional,
dataflow, ...) is the incremental update of arrays.  Otherwise, many
common numerical algorithms become excessively space inefficient.
Take a look at Arvind's "Future Scientific Programming of Parallel
Machines," MITLCS Computation Structures Group Memo 272, Feb. '88.
The algorithm presented for Gauss elimination takes advantage of
I-structures, but creates a new array at each elimination step, the
size of which begins at NxN and is successively reduced to 1x1.  An
alternative to this, which I describe in my thesis, is the combination
of array comprehensions and algorithms like Crout reduction and
Cholesky decomposition, which do not successively rewrite the values
of the resulting triangularized arrays.  This approach requires no
copying.
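
To put a rough number on "excessively space inefficient": if step k
allocates a fresh k-by-k array, k = N down to 1, and none of it is
reclaimed early, the total cells allocated are

    N^2 + (N-1)^2 + ... + 1^2  =  N(N+1)(2N+1)/6  ~  N^3/3

versus the N^2 cells an in-place (or no-copy) formulation touches -- a
factor of roughly N/3 more storage churned through.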

Of course there are other approaches to the incremental update
problem, and some of these are discussed in papers by Wadler and
Hudak in the Proceedings of the Workshop on Graph Reduction, LNCS 279,
'86.  They basically recommend a variety of monolithic array
constructors.  Other approaches include reference counting to
determine whether an array can be destructively modified, and Wise
presents a quadtree implementation of matrices in IPL, May '85.
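
The reference-counting trick is worth spelling out, since it's
conceptually the cheapest of the lot: an update may destructively reuse
its argument exactly when it holds the only live reference.  A toy C
sketch (hypothetical names, not anyone's actual runtime):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    struct arr {
        int  refs;      /* number of live references to this array */
        int  len;
        int *elts;
    };

    /* Functional update a[i] := v: in place iff we hold the sole */
    /* reference, otherwise copy-on-write.                        */
    static struct arr *update(struct arr *a, int i, int v)
    {
        struct arr *b;

        if (a->refs == 1) {         /* nobody else can observe it */
            a->elts[i] = v;
            return a;
        }
        b = malloc(sizeof *b);
        b->refs = 1;
        b->len  = a->len;
        b->elts = malloc(a->len * sizeof *b->elts);
        memcpy(b->elts, a->elts, a->len * sizeof *b->elts);
        b->elts[i] = v;
        a->refs--;                  /* we gave up our reference   */
        return b;
    }

    int main(void)
    {
        int data[3] = { 1, 2, 3 };
        struct arr a = { 2, 3, data };  /* pretend a 2nd reference exists */
        struct arr *b;

        b = update(&a, 0, 9);           /* refs > 1, so this copies       */
        printf("a[0]=%d  b[0]=%d\n", a.elts[0], b->elts[0]);  /* 1 and 9  */
        return 0;
    }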

The bottom line seems to be that one needs non-strict arrays, various
(monolithic) array constructors, array comprehensions, etc., to be
able to describe numerical programs in a reasonable(?) way (in a
dataflow language).  I'm not sure how elegant the resulting programs
would be.  In addition, non-strict arrays, for example, destroy
referential transparency (which I won't argue for or against).

[more deleted]

John Baugh

dinucci@ogccse.ogc.edu (David C. DiNucci) (05/26/89)

In article <5606@hubcap.clemson.edu> John.Baugh@LINDENTHAL.CAE.RI.CMU.EDU writes:
>On a related note, Robert Babb has apparently used FORTRAN + data
>dependencies in a similar way to drive execution of programs (see
>Computer, July '84).  He argues, I believe, that the resulting
>programs can be mapped easily onto a variety of hardware 
>architectures.

Wow!  Somebody has heard of Large-Grain Data Flow (LGDF)!

Actually, back then, LGDF could only be ported to shared-memory
multiprocessors.  Since then, however, a few things have changed.
My first goals when I came on board were to determine (1) what a
REAL (i.e. formal) description of LGDF would look like, and (2) what
features kept LGDF programs from being efficiently implemented on
distributed-memory machines.  Those goals led to an even deeper search:
could both the shared memory model and the message passing model be
considered special cases of one "more basic" model?

The result has been LGDF2, and I believe the search has been fruitful.
It comes down to this:  both models consist of "processes" sharing
information.  Upon "accessing" some information, a process dictates
the set of processes that are allowed to access it next.  If each
process knows what type of access it will make (reading, writing,
neither, both) to the information if it runs, parallelism can be
increased - for example, by allowing multiple readers to access
information concurrently.
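
To make sure I'm reading you right, here's how I'd caricature that in C
(purely my own toy, sequential, no relation to your implementation):
each datum carries a mask of the processes allowed to touch it next, and
every access hands the baton onward.  If accesses also declared
themselves read-only, the mask could name several readers at once:

    #include <assert.h>
    #include <stdio.h>

    struct datum {
        int      value;
        unsigned next;      /* bitmask of processes allowed next */
    };

    /* Access d as process p, then dictate who may access it next. */
    /* A real scheduler would block rather than assert.            */
    static int access_datum(struct datum *d, int p, int delta, unsigned next)
    {
        assert(d->next & (1u << p));      /* p must be permitted */
        d->value += delta;
        d->next = next;
        return d->value;
    }

    int main(void)
    {
        struct datum d = { 0, 1u << 0 };           /* process 0 goes first */
        access_datum(&d, 0, 10, 1u << 1);          /* 0 writes, hands to 1 */
        access_datum(&d, 1, 0, (1u << 0) | (1u << 2)); /* 1 permits 0 and 2 */
        printf("value = %d\n", access_datum(&d, 2, 5, 1u << 0));
        return 0;
    }

Is that a fair, if brutal, compression of the access-dictation idea?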

In LGDF2, the programmer explicitly specifies the granularity of
processes - we simply ensure that they share data in an efficient
and uniform way regardless of the (MIMD) architecture they are executed
on.  In addition, we allow the programmer to specify non-deterministic
computations.

The most recent stuff on LGDF2 has been published in
  1988 Proc. 21st Hawaii Int'l Conf. on System Sciences (DiNucci, Babb)
  Parallel Processing for Scientific Computing (Babb, DiNucci)
    (Actually 1987 Proc. 3rd SIAM Conf. on Par. Proc. for Sci. Computing)
  Digest of Papers, CompCon Spring '89 "Intellectual Leverage"
    (DiNucci, Babb)

The last paper gives some hints on what we're working on now (including
work on arrays which was sadly lacking in the above papers).

Incidentally, the "Molecule" article was written in 87, and some other
models have come to light since then.  Chandy and Misra's UNITY comes
to mind (see "Parallel Program Design - A Foundation", 1988, Addison-Wesley).

>[more stuff deleted]
>
>In (???) Marshall Cline writes:
>>QUESTION #4: Are parallel languages headed towards levels that are higher
>>than what we now call "high level" languages?  Ie: Given that we are/will-be

Well, ours is.  LGDF2 is now becoming a full-fledged language for building
parallel programs (networks) from processes, leaving the role of constructing
processes from statements to more standard language processors (though
these processes need not be sequential).  We are finding that
many "object-oriented" principles find a comfortable home at the higher
LGDF2 level (resulting in screams of outrage at the idea of object-
oriented Fortran programs).

In fact, isn't the restricted, controlled sharing of information among
computational agents one of the central tenets of software engineering,
object-oriented or otherwise?


-- 
David C. DiNucci                UUCP:  ..ucbvax!tektronix!ogccse!dinucci
Oregon Graduate Center          CSNET: dinucci@cse.ogc.edu
Beaverton, Oregon

len@uunet.UU.NET (Leonard Vanek) (05/29/89)

In article <5593@hubcap.clemson.edu> cline@sunshine.ece.clarkson.edu (Marshall Cline) writes:
>
>Did y'all see the article in the May IEEE Transactions on Software Engineering?
>     "Molecule: A Language Construct for Development of Parallel Programs",
>      by Z. Xu & K. Hwang, pages 587-599.
>
>Personally, I'm interested in the concept of allowing _one_ parallel
>program to compile on _many_ hardware platforms regardless of the
>hardware's underlying mechanisms.  I'd like to discuss this issue if
>others are also interested.
...
>
>QUESTION #5: Anyone know of any _other_ work being done in this area (ie:
>any other methods for "parallelization-schema-independence"?).

Check out "The Parallation Model: Architecture-Independent Parallel
Programming" by Gary Sabot. This book, published by the MIT Press and
based on the author's thesis, describes a model and implementation of
a simulator for this model. The prototyping work was all done in LISP
so LISP is used for most of the examples in the book. This tends to
support the use of dataflow notation in Molecule. However, Sabot
claims that the model is language-independent and to illustrate that
point does include some examples in C.

Len
--------------------------------------------------------------------
Leonard Vanek                  UUCP: ... uunet!attcan!lsuc!array!len
Array Systems Computing Inc.    or ... utzoo!dciem!array!len
5000 Dufferin St. Suite 200     or lsuc!array!len@ai.toronto.edu
Downsview, Ont. M3H 5T5        Phone: (416) 736-0900
Canada                         FAX:   (416) 736-4715

sc@vlsi.cs.cmu.edu (Siddhartha Chatterjee) (05/31/89)

In article <5633@hubcap.clemson.edu>, dciem!array!len@uunet.UU.NET (Leonard Vanek) writes:
> Check out "The Parallation Model: Architecture-Independent Parallel
> Programming" by Gary Sabot. This book, published by the MIT Press and
> based on the author's thesis, describes a model and implementation of
> a simulator for this model. The prototyping work was all done in LISP
> so LISP is used for most of the examples in the book. This tends to
> support the use of dataflow notation in Molecule. However, Sabot
> claims that the model is language-independent and to illustrate that
> point does include some examples in C.

The Paralation stuff is fundamentally different from Molecule in that it is
not process-oriented, but rather data-oriented.  It has strict
synchronization guarantees: barrier synchronization semantics are guaranteed
at the end of all elwise loops.  There are two things I don't like about the
language: the same-paralation restriction in elwise loops, and the
communication mechanism (match and move are a pain).  The nice thing about
the language is that it allows nested elwise's.  We used it here last
semester in an undergraduate course, and it went down quite well with the
students.

There is a compiler for a subset of the language for the Connection Machine,
built by Guy Blelloch (see his thesis for details), but it seems very
unlikely there will ever be a compiler for the full language.  The
interaction between the paralation constructs and the type mechanisms in
Common Lisp is sufficiently complex, for one.  I've seen the Paralation C
example in the book, but I don't think anything's coming out on that front.
It's not that you couldn't port the three constructs to C, just that I think
no one's doing it.
-- 
----
ARPA:	Siddhartha.Chatterjee@CS.CMU.EDU
UUCP:	{seismo,decvax,allegra}!rochester!cmu-cs-pt!vlsi!sc
----