[comp.parallel] Parallel Numerical Algorithms.

fpst@hubcap.UUCP (Steve Stevenson) (05/17/88)

[ I pulled this off sci.math.num-analysis.  Thought it more
  than relevant.
  Steve
]

I'm looking for sources (papers, books, etc) for parallel numerical
algorithms, i.e., algorithms in numerical analysis that work in
parallel.  I'm not only interested in the algorithms, but in
their analysis; when do they converge, what is the order of
convergence, etc.

Although I am mostly interested in true parallelism, I wouldn't mind
receiving algorithms based on vectorization, and such like.

Please e-mail me responses; our site's subscription to this newsgroup
is quite erratic.  I would also like to hear from anyone else who is
interested in this subject.

                        Thanks in advance,



|----------------------------------------------------------------------------|
|Dennis Grinberg, Math & CS Department, Bar-Ilan University, Ramat-Gan ISRAEL|
|----------------------------------------------------------------------------|
|BITNET:   grinberg@bimacs.bitnet                                            |
|CSNET:    grinberg%bimacs.bitnet%cunyvm.cuny.edu@csnet-relay                |
|ARPA:     grinberg%bimacs.bitnet@cunyvm.cuny.edu                            |
|UUCP:     uunet!mcvax!humus!bimacs!grinberg                                 |
|SNAILNET: Dennis Grinberg, 13 Hava Lutzki St., Rehovot, ISRAEL              |
|----------------------------------------------------------------------------|

-- 
Steve Stevenson                            fpst@hubcap.clemson.edu
(aka D. E. Stevenson),                     fpst@clemson.csnet
Department of Computer Science,            comp.parallel
Clemson University, Clemson, SC 29634-1906 (803)656-5880.mabell

eugene@pioneer.arc.nasa.gov (Eugene N. Miya) (05/23/88)

for comp.parallel:
>Although I am mostly interested in true parallelism, I wouldn't mind
>receiving algorithms based on vectorization, and such like.

The English language is marvelously vague.  I wish I knew what 'true'
parallelism is or was.  Ortega now has a new book and there are several
others (see my bibliography).

I really dislike the word "parallelism."  The problem is Whorfian, and an
example hit me in March.  I was attending a meeting in Oregon, and one
speaker (James Hack, an atmospheric physicist and meteorologist at NCAR) was
giving an overview of his community's computing problems [I used to work with
weather models myself, so the terrain is familiar].

We were humbled by this man's problems (O(n^3.5)), and he admitted he was not
a computer scientist [he was humbled by our problems].  Then he placed a view
graph of a 3-D plot of Amdahl's law, one axis "percent vectorization" and
the other domain axis "percent parallelism," done using the well-known
NCAR graphics package and displaying the spike at 100% vectorization and
parallelization (blah, what a word).  This graph disturbed me; there was
something wrong, and later (days later, unfortunately) my linear algebra
background led me to the conclusion that he was assuming these two things were
orthogonal.  That was his fallacy.  Until we refine our terminology and
get an appreciation for the problems created by side-effects, shared memory,
etc., we won't solve many fundamental problems.

If we are going to make a distinction between vector and parallel, it had
better have a better basis for making that distinction.

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,hao,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups.  If enough, I'll summarize."

fpst@hubcap.clemson.edu (Steve Stevenson-Moderator) (05/23/88)

[ Dennis is not receiving the group directly and is communicating
  directly through e-mail.  I'll try to edit and preserve meaning.
  This note is in reference to the request for papers on parallel
  numerical algorithms.

   Steve
]

Avi Lin of Temple University has a number of papers on various subjects,
such as _A_New_Parallel_Algorithm_for_LU_Decomposition_ (written with
He Zhang, a graduate student), _A_New_Parallel_Algorithm_For_Linear_
Triangular_Systems_, etc.  He visited Bar-Ilan University in Israel
a few weeks ago, and I picked up a few papers.  If there is any interest,
please let me know.

|----------------------------------------------------------------------------|
|Dennis Grinberg, Math & CS Department, Bar-Ilan University, Ramat-Gan ISRAEL|
|----------------------------------------------------------------------------|
|BITNET:   grinberg@bimacs.bitnet                                            |
|CSNET:    grinberg%bimacs.bitnet%cunyvm.cuny.edu@csnet-relay                |
|ARPA:     grinberg%bimacs.bitnet@cunyvm.cuny.edu                            |
|UUCP:     uunet!mcvax!humus!bimacs!grinberg                                 |
|SNAILNET: Dennis Grinberg, 13 Hava Lutzki St., Rehovot, ISRAEL              |
|----------------------------------------------------------------------------|

brooks@lll-crg.llnl.gov (Eugene D. Brooks III) (05/24/88)

In article <1684@hubcap.UUCP> eugene@pioneer.arc.nasa.gov (Eugene N. Miya) writes:
>a computer scientist [he was humbled by our problems].  Then he placed a view
>graph of a 3-D plot of Amdahl's law, one axis was "percent vectorization" and
>the other domain axis was "percent parallelism" done using the well known
>NCAR graphics package displaying the spike at 100% vectorization and
>parallelization (blah, what a word).  This graph disturbed me, there was
>...

His point was that on a parallel/vector architecture such as the Cray X/MP
or Cray 2 series you have to do well on both the MIMD parallel and Vector
fronts in order to approach the peak performance of the machine.  Although
the space of applications may not be evenly distributed on the x-y surface
of his plot, the notion it presented, that a rather small number of
applications would do well on both the MIMD and vector parallel fronts,
seems quite valid and is not disturbing at all.  A good example of
an application that fits the (MIMD, Vector) parallel distinction is
Gauss elimination.  One vectorizes the SAXPY operations to mask memory and
functional-unit latency, and one parallelizes across the rows of the matrix to
use more than one vector CPU in MIMD mode.  Of course, there is more than
one way to skin this cat, but the notion of simultaneously exploiting MIMD
parallelism and Vector (SIMD) parallelism in a given architecture or algorithm
is quite useful.
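[ A toy sketch in Python of the (MIMD, Vector) split on Gauss elimination
  described above; the names (saxpy, gauss_eliminate) are invented for this
  illustration and are not any poster's actual code.  The inner SAXPY loop
  marks the vectorizable axis; a thread pool farms the independent row
  updates out MIMD-style. ]

```python
# Sketch: the inner SAXPY is the "vector" axis; farming independent row
# updates out to worker threads is the "MIMD" axis.
from concurrent.futures import ThreadPoolExecutor

def saxpy(a, x, y):
    """y + a*x, elementwise: the vectorizable inner operation."""
    return [yi + a * xi for xi, yi in zip(x, y)]

def gauss_eliminate(A):
    """Reduce A (a list of row lists, no pivoting) to upper-triangular
    form.  For each pivot, the updates of the remaining rows are
    mutually independent, so a thread pool runs them MIMD-style."""
    n = len(A)
    with ThreadPoolExecutor() as pool:
        for j in range(n - 1):
            pivot = A[j]
            def update(i):
                A[i] = saxpy(-A[i][j] / pivot[j], pivot, A[i])
            # rows j+1..n-1 are independent for this pivot step
            list(pool.map(update, range(j + 1, n)))  # force completion
    return A
```

On a real Cray the inner loop would be a hardware vector operation; here
the list comprehension merely stands in for the vectorizable work.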

fpst@hubcap.clemson.edu (Steve Stevenson-Moderator) (05/24/88)

From article <1684@hubcap.UUCP>, by Eugene N. Miya <eugene@pioneer.arc.nasa.gov>:
> 
> If we are going to make a distinction between vector and parallel, it had
> better have better basis for making that distinction.
> 

I agree.  It seems to me that a tentative, first-cut distinction
is SIMD versus MIMD, but one has to be very careful.  Example: a possible
counter is that the NCUBE does a parallel algorithm (in the MIMD sense) because
it does not have vector coprocessors.  The FPS T, on the other hand,
could do both.

It seems to me that the essentials have to do with some reformulation
of linear algebra and some agreement that there is an equivalence.

Anyone care to try for a strict definition?

eugene@pioneer.arc.nasa.gov (Eugene N. Miya) (05/25/88)

In article <1691@hubcap.UUCP> you write:
>Anyone care to try for a strict definition?

I am unable to offer a better definition, but I hope I can offer some
enlightenment based on further discussions in private with others.  There is
confusion among subtle differences in parallel/vector/Flynn's
SIMD/AND/OR parallelism, what have you.  DO NOT CONFUSE PARALLEL with
SYNCHRONOUS/SYNCHRONY.  The problem that kills you is side-effects.
The early CMU people noted the major problems were consistency,
deadlock, starvation, and exception handling.

To understand some of the problem, I offer a variation of a technique
used by the early physicists for understanding the nature of light.
They jokingly (and half-heartedly) said light behaves like a particle on
MWF and like a wave on TTS (on Sunday, they rest).  So, DO work on parallel
algorithms or programming or machines for one week without using the
word PARALLEL.  You start to fall back on other words like CONCURRENT,
SIMULTANEOUS, and MULTIPROCESS, and in time you throw each of these words out.

Several different groups (I am aware of Marty Fouts, among others) have
independently used words like COOPERATIVE versus COMPETITIVE
computing (why not, we have GREEDY algorithms, right? ;-)

The problem of sustaining parallel computation is sort of like
sustaining controlled thermonuclear fusion (great for short periods,
but the start-up gets you).  I wish I could offer more, but I'm not smart
enough.

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,ncar,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups.  If enough, I'll summarize."

lisper-bjorn@YALE.ARPA (Bjorn Lisper) (05/26/88)

In article <407@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>In article <29851@yale-celray.yale.UUCP> I wrote, on "true parallelism" vs.
"degree of vectorization":
....
>=In my world it is very simple. parallelism is when you do several things at
>=once, at different places. Vector operations are usually executed in a
>=pipeline. When the pipeline is filled every stage in it will execute in
>=parallel. Thus vector operations provide a special form of parallelism.
....
>"Vector ops" are too high a level of abstraction to be of use in
>distinguishing between vectorization and parallelism.  I think we
>have to break down the vector ops into their constituent parts.

The point I wanted to make is that vectorization is a very special form of
parallelism.  The limits on "true parallelism" can, if I've got that concept
right, be expressed as data dependencies between "events," where every event
is an application of a "high-level" operator, say "+" in the vector example
above.  If we have a multiprocessor system where some processors can perform
"+", and if we have some connections between the processors, then a compiler
will have to find a schedule of the events that satisfies three criteria: 1.
communication requirements given by data dependencies must be met by
available connections; 2. "+" events are scheduled only on processors that
can perform "+"; 3. no two events are scheduled on the same processor at the
same time.  Furthermore, the schedule should be "good," which usually means
short completion time.

Vectorization can be described in exactly the same way; in order to do this,
"+" must be expressed in terms of suboperations matching the stages of the
pipeline.  Every event a(i)+b(i) is now a set of smaller events; let's call
them (i,k) for each vector index i and each stage k.  Since
(i,k+1) uses the result of (i,k) as input, there is now a data dependency
between each such pair.  This new set of events can now be scheduled *in
exactly the same way as above*, where our "multiprocessor system" now
consists of the stages in the pipeline, linearly connected stage k -> stage
k+1.  Criterion 2 transforms to "(i,k) must be scheduled on stage k," and
criterion 1 transforms to "if (i,k) is scheduled on stage k at time t, then
(i,k+1) must be scheduled on stage k+1 at time t+1" (assuming every pipeline
stage takes one time step to complete).
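[ An illustration of the scheduling described above, in notation invented
  here: the schedule is simply the function (i,k) -> i+k, and one can check
  mechanically that it satisfies both transformed criteria. ]

```python
# Schedule event (i, k) -- vector element i at pipeline stage k -- at
# time slot i + k.
def pipeline_schedule(n, stages):
    return {(i, k): i + k for i in range(n) for k in range(stages)}

s = pipeline_schedule(8, 4)
# Transformed criterion 1: (i, k+1) runs exactly one step after (i, k).
assert all(s[(i, k + 1)] == s[(i, k)] + 1
           for i in range(8) for k in range(3))
# Criterion 3: no two events ever occupy the same stage at the same time.
assert all(len({s[(i, k)] for i in range(8)}) == 8 for k in range(4))
```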

This is a 25 line description of what I tried to say with four lines in my
previous posting.

>Fine-grain parallelism is the principal diet of VLIWs, and we (well,
>I) define it to be one of a small set of things:  memory load or
>store, register-to-register integer or floating operation, and I/O.
>When you've broken down your program into ops at that level, you can
>then schedule them for execution in an optimal way.  Operations that
>are (at that time) unconstrained save for data dependencies exhibit
>what I think of as useful parallelism.
>
>Of course, as soon as we have a decent enough definition, somebody
>will try to apply it to a systolic or connection machine
>and we'll have to start over. ;-)

Yes, you will have to! :-) Have you heard of so-called space-time mapping
methods for synthesizing systolic arrays? These are essentially scheduling
methods that adhere to the three criteria I sketched above. The scheduling
is often concisely expressed as a function from the set of events to
space-time coordinates. So far I've seen no attempts to apply this way of
thinking to the Connection Machine but it is totally clear that it works in
theory. On such a machine the SIMD constraint will replace 2. (the same
operation in time instead of at the same processor) and the communication
requirement 1. will have to be a little more refined and take into account
such things as congestion etc.

Bjorn Lisper

lisper-bjorn@YALE-BULLDOG.ARPA (Bjorn Lisper) (05/27/88)

In article <1684@hubcap.UUCP> eugene@pioneer.arc.nasa.gov (Eugene N. Miya)
writes:
>for comp.parallel:
>>Although I am mostly interested in true parallelism, I wouldn't mind
>>receiving algorithms based on vectorization, and such like.
>
>The English language is marvelously vague.  I wish I knew what 'true'
>parallelism is or was.
....
>If we are going to make a distinction between vector and parallel, it had
>better have better basis for making that distinction.

In my world it is very simple.  Parallelism is when you do several things at
once, at different places.  Vector operations are usually executed in a
pipeline.  When the pipeline is filled, every stage in it executes in
parallel.  Thus vector operations provide a special form of parallelism.

Bjorn Lisper

colwell@mfci (Robert Colwell) (05/27/88)

In article <29851@yale-celray.yale.UUCP> lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) writes:
=In article <1684@hubcap.UUCP= eugene@pioneer.arc.nasa.gov (Eugene N. Miya)
=writes:
==for comp.parallel:
===Although I am mostly interested in true parallelism, I wouldn't mind
===receiving algorithms based on vectorization, and such like.
==
==The English language is marvelously vague.  I wish I knew what 'true'
==parallelism is or was.
=....
==If we are going to make a distinction between vector and parallel, it had
==better have better basis for making that distinction.
=
=In my world it is very simple. parallelism is when you do several things at
=once, at different places. Vector operations are usually executed in a
=pipeline. When the pipeline is filled every stage in it will execute in
=parallel. Thus vector operations provide a special form of parallelism.
=
=Bjorn Lisper

I think what Eugene's pointing out is that we don't really know what
the coin of the realm is here.  If your hammer says "vector machine"
on it, then you'll quantize your parallelism in units of "percent
vectorization".  But the code or application doesn't know you're
doing that, and it may possess fine-grain parallelism far beyond the
stretches that a vector machine can take advantage of.

"Vector ops" are too high a level of abstraction to be of use in
distinguishing between vectorization and parallelism.  I think we
have to break down the vector ops into their constituent parts.
Fine-grain parallelism is the principal diet of VLIWs, and we (well,
I) define it to be one of a small set of things:  memory load or
store, register-to-register integer or floating operation, and I/O.
When you've broken down your program into ops at that level, you can
then schedule them for execution in an optimal way.  Operations that
are (at that time) unconstrained save for data dependencies exhibit
what I think of as useful parallelism.

Of course, as soon as we have a decent enough definition, somebody
will try to apply it to a systolic or connection machine
and we'll have to start over. ;-)

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090

eugene@pioneer.arc.nasa.gov (Eugene N. Miya) (05/28/88)

Remember what Alan Perlis said: A LISP Programmer knows the value of
everything, but the cost of nothing. ;-)

>In my world it is very simple.

Certainly, because you probably have simple problems.  This I assume.

>parallelism is when you do several things at
>once, at different places. Vector operations are usually executed in a
>pipeline. When the pipeline is filled every stage in it will execute in
>parallel. Thus vector operations provide a special form of parallelism.

I will give you an analogy about parallelism.  Imagine n cars at a stop
light, waiting for it to change.  The light turns green.  One form of
parallelism is when you have a multilane road (too easy a case) [the
undergrad concept of parallelism].
	= = = = = >  #
	= = = = = >  #
	= = = = = >  #
	= = = = = >  #
We assume no one decides to change lanes.  That's why we run freeways at
99% efficiency at 65 MPH.
But it is also a form of parallelism when you have all the serially lined-up
drivers hit the gas SIMULTANEOUSLY.  You only trust that the car ahead
	= > = > = > = >  #
of you moves at the same rate.  What we have in REAL traffic
intersections is PIPELINING (some people call this temporal
	- > - > - > - >  #
parallelism, but it really isn't parallelism in some senses; you even call
it and vectorization "special cases").  Part of the problem comes
from these dependences (the car ahead of you).  If you are willing to
run into him (or through him, sacrificing consistency), fine, we have
chaotic relaxation algorithms.  Ah, but if life could be so easy.

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "Mailers?! HA!", "If my mail does not reach you, please accept my apology."
  {uunet,hplabs,ncar,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene
  "Send mail, avoid follow-ups.  If enough, I'll summarize."

nelan@HANDIES.UCAR.EDU (George Nelan) (05/30/88)

In article <1696@hubcap.UUCP>, eugene@pioneer.arc.nasa.gov (Eugene N. Miya)
writes:
> In article <1691@hubcap.UUCP> you write:
+ +Anyone care to try for a strict definition?
+
  [etc...]
+ SYNCHRONOUS/SYNCHRONY.  The problem which kills you are side-effects.
+ The early CMU people noted the major problems were: consistency,
+ deadlock, starvation, and exception handling.
  [etc...]

Perhaps, just perhaps, maybe someday, somewhere, someplace, and sometime,
someone will invent something like this:

An infinite, w.r.t. the universe of discourse defined by the data dependencies
of a particular program, MIMD ultra-fine grained side-effect free PARALLEL
machine for IMPLICITLY parallel programs.  I guess no side-effects => purely
functional programs, huh?  Also, it looks like deadlock & synchronization
(consistency) constraints => normal order (lazy) evaluation must be the
computation model of choice [I have some references why this is so --
sufficient interest => I'll post & discuss]; for computational power,
be sure to allow for higher-order functions too.

Oh, BTW, if ya want true scalability, limited (of course) by the definition of
infinity given above, forget about distributed heaps & GC [lambda/combinator
reduction? P'tuey!].  We don't need no steenkin' heaps...

Eductively yurs,
George Nelan, ERC 207, Arizona State University, Tempe, Arizona, USA, 85287
BELL : (602)-965-2791
UUCP : ...{{{ames,rutgers}!hao},{allegra,decvax,ihnp4}}!noao!asuvax!nelan
CSNET: nelan@asu

gould@onion.cs.reading.ac.uk (Adrian J. Gould) (05/30/88)

Hiya,
	I have a project proposal here at Reading for my final year.
	It has been accepted, and so I would like to glean a bit of
	info from your KNOWLEDGEABLE brains...


	I am going to compare parallel computing algorithms for a
	computational problem (the good old mandelbrot set). I already
	have some information (BYTE) and a couple of books, but what
	can you suggest?

	I will be working on simulations, but if any companies have
	any 'free' time that I could use on a computer (parallel),
	especially transputer time... let me know by mail.

	Ok that's the begging done,
		Thanks.

Adrian Gould


-- 
	"And God said, 'Let there be light.' And there was light."
	And so God gave his only Son to be the Light of the world.

Janet: gould@sage.cs.reading.ac.uk  -or-  gould@onion.cs.reading.ac.uk

gerald@umb.umb.edu (Gerald Ostheimer) (06/01/88)

In article <1772@hubcap.UUCP> noao!mcdsun!asuvax!asuvax!nelan@HANDIES.UCAR.EDU
(George Nelan) writes:

>Perhaps, just perhaps, maybe someday, somewhere, someplace, and sometime,
>someone will invent something like this:
>
>An infinite, w.r.t. the universe of discourse defined by the data dependencies
>of a particular program, MIMD ultra-fine grained side-effect free PARALLEL
>machine for IMPLICITLY parallel programs.  I guess no side-effects => purely
>functional programs, huh?  Also, it looks like deadlock & synchronization
>(consistency) constraints => normal order (lazy) evaluation must be the
>computation model of choice [I have some references why this is so --
>sufficient interest => I'll post & discuss]; for computational power,
>be sure to allow for higher-order functions too.

Someday may be closer than you think.
You should take a look at the work on the tagged-token dataflow machine of
MIT's Computation Structures Group under Arvind.  (Their work, for some reason
unbeknownst to me, has not yet received any attention in this newsgroup.)
Their combined language/architecture approach meets the following of your
criteria:
 o  implicit parallelism defined purely by data dependencies
 o  MIMD: full-powered non-von CPU's; instruction scheduling is nifty: no
    sequential program storage, but rather the result of one instruction (a
    token) 'enables' another instruction, possibly across CPU's; enabled
    instructions are maintained (in haphazard order) in a pipeline inside
    each CPU
 o  side-effect free: to a large degree--there is actually a limited form of
    side-effects which, however, does not affect determinacy: their language
    Id (pronounced 'Idd') offers something called I-structures.  I-structures
    are, intuitively, arrays whose fields can be initialized exactly once, but
    possibly by side-effect.  This deals quite nicely with some serious
    problems of pure functional languages (no exorbitant storage requirements,
    no extensive copying).  It also makes translation of those 'dusty' Fortran
    programs easier.
 o  normal order evaluation
 o  higher order functions
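[ A toy model of the I-structure behavior described above, with semantics
  inferred from this post rather than from the MIT implementation: slots
  are write-once, and a read of an empty slot blocks until the producer
  writes it, so results are determinate regardless of scheduling. ]

```python
# Sketch only: a write-once array with blocking reads.  (The check-then-set
# in write() is not itself atomic; a real implementation would lock.)
import threading

class IStructure:
    def __init__(self, n):
        self._vals = [None] * n
        self._written = [threading.Event() for _ in range(n)]

    def write(self, i, v):
        """Initialize slot i; a second write is an error."""
        if self._written[i].is_set():
            raise ValueError("I-structure slot %d written twice" % i)
        self._vals[i] = v
        self._written[i].set()

    def read(self, i):
        """Block until slot i has been written, then return its value."""
        self._written[i].wait()
        return self._vals[i]
```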

 One of the strengths of this architecture is that it can deal very gracefully
 with network latency and memory latency. As long as the processor pipelines
 are full, all CPU's keep working concurrently. (This is probably why 
 network topology seems not to be an overly important topic in their
 literature.)

 A (possibly) surprising problem that turns up is that there can be too
 much parallelism in a program, which can overflow the pipelines and choke the
 machine.

 There are of course more problems.
 I for one never quite understood how the program is distributed over the
 CPU's. This must happen dynamically (when calling functions or entering loops,
 for example), if parallelism is to be exploited.

 Unfortunately I can't give any references to widely available comprehensive
 publications on their work, probably because so far their work has been
 confined to simulations on a network of TI Explorers, and the development of
 a 1000-processor machine is just beginning.

 Of the papers I have available, two seem to give a reasonable overview:

 "Future Scientific Programming on Parallel Machines" by Arvind and Ekanadham
 (the latter of IBM, Yorktown Heights), Computation Structures Group Memo 272,
 and also to appear (appeared?) in Proc. of the Int. Conf. on Supercomputing,
 Athens, 1987.

 "Executing a Program on the MIT Tagged-Token Dataflow Architecture" by Arvind
 and Nikhil, CSG Memo 271, and also to appear (appeared?) Proc. Parallel
 Architectures and Languages, Europe, Eindhoven 1987

 Not being a member of their team, I take no responsibility for the accuracy
 of the above and gladly welcome any corrections/clarifications.

-- 
Gerald				"When I use a word, it means
<gerald@grn.umb.edu>		 just what I choose it to mean -
				 neither more nor less"
				 -Humpty Dumpty, Through the Looking Glass

lgy@pupthy2.princeton.edu (Larry Yaffe) (06/01/88)

In article <1776@hubcap.UUCP> gerald@umb.umb.edu (Gerald Ostheimer) writes:
-
-You should take a look at the work on the tagged-token dataflow machine of
-MIT's Computation Structures Group under Arvind. (Their work, for some reason
-unbeknownst to me, did not yet receive any attention in this newsgroup.)
-Their combined language/architecture approach meets the following of your
-criteria:
- o  side-effect free: to a large degree--there is actually a limited form of
-    side-effects, which does, however, not affect determinacy: their language
-    Id (speak 'Idd') offers something called I-structures. I-structures are,
-    intuitively, arrays whose fields can be initialized exactly once, but
-    possibly by side-effect. Deals quite nicely with some serious problems of
-    pure functional languages (no exorbitant storage requirements, no extensive
-    copying). Also makes translation of those 'dusty' Fortran programs easier.

    I'd like to hear more about how this language avoids excessive copying
& wasteful memory usage.  How do standard tasks like matrix multiplication, 
binary tree insertions, hash table management, etc. work?

-Gerald				"When I use a word, it means
-<gerald@grn.umb.edu>		 just what I choose it to mean -
-				 neither more nor less"
-				 -Humpty Dumpty, Through the Looking Glass

------------------------------------------------------------------------
Laurence G. Yaffe			Internet: lgy@pupthy.princeton.edu
Department of Physics			Bitnet:   lgy@pucc
Princeton University			UUCP:     ...!princeton!pupthy!lgy
PO Box 708, Princeton NJ 08544

lisper-bjorn@YALE.ARPA (Bjorn Lisper) (06/02/88)

In article <1784@hubcap.UUCP> Larry Yaffe writes:
>(Gerald Ostheimer writes about the MIT tagged-token dataflow machine and
>the language Id...)

>    I'd like to hear more about how this language avoids excessive copying
>& wasteful memory usage.  How do standard tasks like matrix multiplication, 
>binary tree insertions, hash table management, etc. work?

Id is a definitional language.  Thus it doesn't specify memory management
explicitly; its variables stand for values rather than memory locations, and
the statements stand for events, defining new values from previously defined
ones.  Unlike in an imperative language, the order between statements
does not imply a strict execution order; the only order required is the
partial one defined by the data dependencies between statements.  It is thus
very much up to the compiler to find a schedule of the events such that
memory is not unduly wasted.  (Remember that memory is merely a means of
communicating values between events.)
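[ A rough illustration of the idea, with definitions invented for this
  sketch: since only the data dependencies order execution, any topological
  order of the dependency graph is a valid schedule for evaluating the
  definitions. ]

```python
# Each entry names a value and the definitions it depends on; the
# "compiler" here just picks one valid (topological) schedule.
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

defs = {
    "a": (lambda env: 1,                   []),
    "b": (lambda env: env["a"] + 1,        ["a"]),
    "c": (lambda env: env["a"] * 2,        ["a"]),
    "d": (lambda env: env["b"] + env["c"], ["b", "c"]),
}

def evaluate(defs):
    deps = {name: set(d) for name, (_, d) in defs.items()}
    env = {}
    # any topological order respects every data dependency
    for name in TopologicalSorter(deps).static_order():
        env[name] = defs[name][0](env)
    return env
```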

Definitional languages have been proposed as programming languages for
dataflow machines but they are perfectly general. Nothing prevents them from
being used on other architectures as well (including the traditional von
Neumann one).

An imperative language lets (or demands that!) the programmer specify how
the memory management is done by explicitly telling which variable (= memory
location) is to store a value resulting from an assignment. So for a certain
architecture the programmer can be smart and write assignment statements in
a way that uses memory efficiently. For a definitional language the compiler
has to be as smart as the programmer to do an equally good job. My guess is
that these languages will not be widely successful until this is the case,
amd I'm not convinced that this point has been reached yet.  But when the
compiler techniques for them have been developed to this point (and I do
think it can be done) then they will offer additional benefits such as
portability between different architectures (both serial and parallel) and
ease of program verification.

So the answer to your question is: it's up to the compiler.

Bjorn Lisper

fpst@hubcap.UUCP (Steve Stevenson) (06/02/88)

In article <1772@hubcap.UUCP> George Nelan writes:
....
>Perhaps, just perhaps, maybe someday, somewhere, someplace, and sometime,
>someone will invent something like this:
>
>An infinite, w.r.t. the universe of discourse defined by the data dependencies
>of a particular program, MIMD ultra-fine grained side-effect free PARALLEL
>machine for IMPLICITLY parallel programs.  I guess no side-effects => purely
>functional programs, huh?  Also, it looks like deadlock & synchronization
>(consistency) constraints => normal order (lazy) evaluation must be the
>computation model of choice [I have some references why this is so --
>sufficient interest => I'll post & discuss]; for computational power,
>be sure to allow for higher-order functions too.

What you describe sounds pretty much like a dataflow machine to me.
Experimental ones have been built (at Manchester, UK, for instance), but they
don't seem to be a success.  For some reason there is always a lot of talk
about them and their principles of operation before they are built, but when
the hardware is there you never get to hear anything about the resulting
performance.  Maybe the statistics are too embarrassing, who knows?  Another
indication of the practical problems with dataflow machines is that, despite
the fact that the concept has been around for a long time (since the
seventies), there are still no commercial manufacturers out there who have
tried to build one and sell it for profit.

The main problem with dataflow machines is that they use an extremely
general scheme to handle fine-grain parallelism.  This causes a
gargantuan overhead.  Much of the fine-grain parallelism in a program
typically has a data-independent structure, which means that the dependencies
can be analyzed *at compile time* (as opposed to dataflow mechanisms, which
handle this at run time) and the instructions can be scheduled in advance in
a way that executes as efficiently as possible on the hardware at hand.  Unless
dataflow architects find ways of incorporating this, I don't think dataflow
architectures will ever be cost-effective.

IEEE Computer, February 1982, is a special issue on dataflow.  It is a good
introduction for those previously unfamiliar with the field.  It is
especially valuable since it contains not only articles by authors favorable
to dataflow but also an article that argues that dataflow is no good:
Gajski, Padua, Kuck, "A Second Opinion on Data Flow Machines and Languages."
I think their critique of dataflow architectures still holds water (although
I don't share their opinion on data flow *languages*, but that's another
issue).

Bjorn Lisper
-- 
Steve Stevenson                            fpst@hubcap.clemson.edu
(aka D. E. Stevenson),                     fpst@clemson.csnet
Department of Computer Science,            comp.parallel
Clemson University, Clemson, SC 29634-1906 (803)656-5880.mabell

cdshaw@UCBVAX.BERKELEY.EDU (Chris Shaw) (06/02/88)

>In article <1691@hubcap.UUCP> you write:
>>Anyone care to try for a strict definition?

I think that the best way to think of pipelined parallelism under the Flynn
taxonomy is to call it MISD. In other words, the "unused" classification
among SISD, MISD, SIMD, MIMD does actually contain examples of real machines.

I suppose the real difficulty is with definitions: What is an instruction?
What is data? What is a CPU?

If "instruction" in the Flynn taxonomy means "operate on a data item by passing
it from one register through some combinatorial circuit to another register",
then the above classification of pipelines as MISD machines makes sense.
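
Under that reading, a pipeline is a chain of distinct "instructions"
applied in sequence to a single data stream. The toy sketch below models
only the functional behaviour of such a stage chain (real hardware
overlaps stages on successive items); the stage functions are invented.

```python
# A pipeline viewed as "MISD": several different instructions
# (stages) act on one stream of data. Each item passes through
# every stage in order; at steady state real hardware would have
# all stages busy on different items simultaneously.

stages = [
    lambda x: x + 1,   # stage 1: one register-to-register step
    lambda x: x * 2,   # stage 2: another combinatorial circuit
    lambda x: x - 3,   # stage 3
]

def run_pipeline(stream, stages):
    for item in stream:
        for stage in stages:
            item = stage(item)
        yield item

print(list(run_pipeline([1, 2, 3], stages)))  # [1, 3, 5]
```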

Of course, assuming that the four Flynn classes are orthogonal is
probably a mistake. They are probably best thought of as nested
subsets, where SISD is a subset of MISD, which is a subset of SIMD,
which is a subset of MIMD. At the highest level, a Cray X-MP may be
MIMD, but each CPU is MISD, while the scalar part of the CPU is SISD.

But all this relies on well-defined notions of data, instruction, etc.
-- 
Chris Shaw    cdshaw@alberta.UUCP (via watmath, ihnp4 or ubc-vision)
University of Alberta
CatchPhrase: Bogus as HELL !

hjm@uunet.UU.NET (hjm) (06/02/88)

In article <1772@hubcap.UUCP> noao!mcdsun!asuvax!asuvax!nelan@HANDIES.UCAR.EDU (George Nelan) writes:
>
>I guess no side-effects => purely functional programs, huh?
>

There is one major problem with this: no side-effects => no I/O, and
I/O is a fairly useful side-effect of most computation.  This is not
strictly true, but for the sequential kind of I/O that we're all
accustomed to, state is transmitted between statements by the I/O
being performed.  For example, a function such as
read_next_input_value(stdin) will return different things on
different calls, and so is strictly not a function.  OK, a bodge can
be performed in which a state variable is added so that referential
transparency is preserved, but just try passing the entire filing
system of your machine as an argument to every function that does
anything to any file!  Do that and we'll all need a massively parallel
machine just for word processing.
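
The "bodge" can be made concrete: thread an explicit state value
through every I/O function, so each call takes the old state and
returns both a value and a new state. A minimal sketch, with invented
names (the state here is a toy input list, standing in for the entire
filing system the post complains about):

```python
# Referentially transparent "read": instead of consulting hidden
# mutable state, each call takes the current state and returns
# (value, new_state). Calling it twice with the same state always
# gives the same answer, so it really is a function.

def read_next_input_value(state):
    values, pos = state          # state = (input contents, cursor)
    return values[pos], (values, pos + 1)

s0 = ([10, 20, 30], 0)
v1, s1 = read_next_input_value(s0)
v2, s2 = read_next_input_value(s1)
v1_again, _ = read_next_input_value(s0)   # same state -> same result
print(v1, v2, v1_again)  # 10 20 10
```

The cost, as the post notes, is that the state must be threaded through
every function that touches it, however large that state is.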

Having said all that, I like functional languages, especially since they can be
implemented in parallel cleanly.  It's just a pity that the machine has to talk
to the outside world ...

	Hubert Matthews

ian@esl.UUCP (Ian Kaplan) (06/06/88)

In article <1776@hubcap.UUCP> gerald@umb.umb.edu (Gerald Ostheimer)
writes (in response to a note from George Nelan):
>
>You should take a look at the work on the tagged-token data flow machine of
>MIT's Computation Structures Group under Arvind. (Their work, for some reason
>unbeknownst to me, did not yet receive any attention in this newsgroup.)
[ much deleted ]
>
> A (possibly) surprising problem that turns up is that there can be too
> much parallelism in a program, which can overflow the pipelines and choke the
> machine.
[ text deleted ]
>
> There are of course more problems.
> I for one never quite understood how the program is distributed over the
> CPU's. This must happen dynamically (when calling functions or
> entering loops, for example), if parallelism is to be exploited.

  To make up for the deficit of data flow discussion in this news
group, I will submit this overly long note discussing some of Arvind's
work.  Data flow is complex, and I cannot provide an introduction in
the space of a note that I have time to write and you are likely to
read.  As a result, I will assume some familiarity with data flow on
the part of the reader.  If there is enough interest, I might be
persuaded to put together a brief bibliography.

  All programs must have a method of handling global state, which is
usually stored in global data structures.  As Mr. Ostheimer points out,
Arvind's I-structures handle this well.  Arvind's group has also
looked at reducing the overhead normally associated with handling
these data structures.  Some of these methods are algorithmic and
others are architectural.  For example, Steve Heller in Arvind's lab
did some work on a hardware implementation of an I-structure store,
which is a sort of "smart memory".
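
An I-structure behaves like write-once memory with deferred reads: a
read of an empty cell is queued rather than failing, and the single
permitted write releases every queued reader. The sketch below shows
only that behaviour in software; it is not Heller's hardware design,
and the class name is invented.

```python
# I-structure cell: write-once storage with deferred reads.
# Reads of an empty cell are queued as continuations; the one
# permitted write delivers the value to every queued reader.

class ICell:
    EMPTY = object()

    def __init__(self):
        self.value = ICell.EMPTY
        self.waiting = []            # deferred read continuations

    def read(self, consumer):
        if self.value is ICell.EMPTY:
            self.waiting.append(consumer)   # defer until written
        else:
            consumer(self.value)

    def write(self, value):
        assert self.value is ICell.EMPTY, "I-structures are write-once"
        self.value = value
        for consumer in self.waiting:       # release deferred reads
            consumer(value)
        self.waiting = []

cell = ICell()
results = []
cell.read(results.append)    # read arrives before the write: deferred
cell.write(42)               # the write releases the deferred read
cell.read(results.append)    # read after write: served immediately
print(results)  # [42, 42]
```

The "smart memory" idea is essentially this queueing logic moved into
the memory module itself, so producers and consumers of a data
structure need no other synchronization.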

  Arvind's model of data flow is referred to as dynamic data flow (this
is in contrast to Prof. Dennis' model of static data flow).  In
dynamic data flow several instances of a loop can be active at the
same time.  As each loop instance is instantiated, a loop field in the
data flow tag is incremented.  This allows separate data flow matching
on the tokens of each loop instance.  While this technique generates a
lot of parallelism for loops, as Mr. Ostheimer points out, it can also
generate so much parallelism that the data flow matching store
overflows.  One way to solve this problem is to allow the loops to
only expand so far.  For example, a given loop would only be allowed
to have five instantiations.  This would be decided in advance by the
compiler.  How the compiler decides this, I have not seen explained.
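
The bookkeeping behind such a bound can be sketched as follows, using
the five-instance limit from the example above. This is only a toy
model of bounded loop unfolding, not Arvind's actual mechanism.

```python
# Bounded loop unfolding in dynamic data flow: each iteration gets
# its own tag (so its tokens match separately), but at most K
# iterations may be active at once, keeping the matching store
# from overflowing. K is fixed in advance "by the compiler".

K = 5

def run_bounded_loop(n_iterations, k=K):
    active, done, peak, next_tag = [], 0, 0, 0
    while done < n_iterations:
        # Admit new loop instances only while under the bound.
        while next_tag < n_iterations and len(active) < k:
            active.append(next_tag)    # new instance, next tag value
            next_tag += 1
        peak = max(peak, len(active))  # matching-store pressure
        active.pop(0)                  # oldest iteration completes
        done += 1
    return peak

print(run_bounded_loop(20))  # 5: never more than K instances in flight
```

With the bound in place the matching store needs room for only K
iterations' worth of tokens, however long the loop runs.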

  Arvind's implementation of dynamic data flow assumes that as
constructs like loops parallelize at run time, new loop instances run
on additional processors.  This means that the code for the loop must
be resident on the processor.  One easy way to handle this was used by
the Cal. Tech. Cosmic Cube group: a complete copy of the program is
loaded onto every processor.  Of course this is a very inefficient way
to use memory, especially in a small grain parallel computer.  Like
Mr. Ostheimer, I have never seen an explanation of how code is allocated
to processors in a dynamic data flow system.  On Arvind's data flow
machine I don't think that assignment is dynamic (it is on the
Manchester data flow machine).

  The data flow model is an asynchronous model.  A data producer sends
data to a data consumer whenever it wants, and the consumer uses
the data when it is ready.  There are no rendezvous in data flow.
This has the advantage of allowing pipelining, which is an important
source of parallelism.  It has the disadvantage of consuming large,
potentially unbounded, amounts of buffer space.  If a data producer
(perhaps a data flow sub-graph) produces data faster than a consumer
can use it, the consumer's buffers will overflow.  Arvind wrote two
papers on "demand driven data flow", which allows the consumer to tell
the producer that it is ready for more data.  I do not know if the
current Id compiler that is being used by Arvind's group uses demand
driven data flow or not.  Without something like demand driven
data flow (or reduction) an asynchronous data flow machine has the
potential of consuming all of its buffer memory and crashing.
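
One way to picture demand-driven flow is as a credit scheme: the
consumer grants the producer a fixed number of credits, the producer
spends one credit per token, and each consumed token returns a credit.
Buffer use then stays bounded instead of growing without limit. This
is only an illustrative model, not the scheme in Arvind's papers.

```python
# Demand-driven producer/consumer sketched as a credit scheme.
# The producer may run ahead "whenever it wants", but only while
# it holds credits, so the buffer can never exceed the allowance.

from collections import deque

def run(n_items, credits=4):
    buffer, produced, consumed, peak = deque(), 0, 0, 0
    while consumed < n_items:
        # Producer emits tokens while it has credit.
        while produced < n_items and credits > 0:
            buffer.append(produced)
            produced += 1
            credits -= 1
        peak = max(peak, len(buffer))
        buffer.popleft()               # consumer uses one token...
        consumed += 1
        credits += 1                   # ...and returns its credit
    return peak

print(run(100))  # 4: buffer depth never exceeds the credit allowance
```

Without some such back-pressure, a fast producer sub-graph fills
arbitrary amounts of buffer memory, which is exactly the crash scenario
described above.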
      
  Arvind and Dennis propose data flow supercomputers.  Supercomputers
are being used primarily for numeric computation (as opposed to symbol
computation).  This means that numeric programs must run well on the
system being proposed.  As many people have noted, numeric programs
spend most of their time in loops.  While existing programs might be
better restated for a data flow machine (in Id for example), they will
still have a heavy loop content.  Arvind's data flow model takes
advantage of the parallelism in loops and produces a great deal of
parallelism while the loop is executing.  Many loops must synchronize,
either during the loop execution (if the loop is pipelining) or at the
end of the loop.  If one looks at a performance graph for a numeric
program executing on a data flow machine, there will be large peaks,
with a lot of parallelism, and troughs, where synchronization must
take place.  The synchronization is taking place on a relatively
small number of processors.  At this point, the entire computation is
limited by the speed of the processors.  If the computation is highly
parallel, these synchronization points will come to dominate the
computation time.  The data flow model assumes that many (i.e., 10^4
or 10^6) relatively inexpensive, relatively low performance,
processors are used.  This is fine for the "peaks" of parallelism, but
these slow processors become the limiting factor for synchronization.
This is one of the reasons that Prof. Kuck at the Univ. of Ill.
proposes a few very powerful processors for his Cedar parallel
processor.  Since these processors are fast, they will execute the
synchronization sections rapidly.
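
The argument can be put in Amdahl's-law terms: if a fraction s of the
work serializes onto a single processor of relative speed r, the
synchronization troughs bound the speedup no matter how many nodes
join. The numbers below are invented purely for illustration.

```python
# Amdahl-style view of the synchronization troughs: serial
# fraction s runs on one processor of relative speed r, the rest
# runs on n such processors. Speedup is measured against a single
# full-speed processor. Figures are illustrative only.

def speedup(n_procs, serial_fraction, rel_speed=1.0):
    t = (serial_fraction / rel_speed
         + (1 - serial_fraction) / (n_procs * rel_speed))
    return 1.0 / t

# 10,000 cheap processors at 1/10th speed vs 4 fast processors,
# with 5% of the work spent synchronizing:
print(round(speedup(10_000, 0.05, rel_speed=0.1), 1))  # ~2.0
print(round(speedup(4, 0.05, rel_speed=1.0), 1))       # ~3.5
```

With these (invented) figures the sea of slow processors loses to a
handful of fast ones, which is the intuition behind the Cedar design
choice mentioned above.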

Conclusion

  Arvind's data flow model has a reasonable solution for handling
global state (global data structures).  The dynamic data flow model
has the potential of producing so much loop parallelism that the data
flow matching stores overflow.  This can be managed by choking loop
parallelism.  Arvind also has a solution for handling the
"producer-consumer" problem, but it is very theoretical and needs work
before it can be implemented.  Using a combination of reduction and
data flow may provide an elegant solution, but I have not seen Arvind
propose this (although there are people in his lab who know reduction
well).  Also, it is not clear, at least to me, that small grain data
flow is a good model for numeric computation.  Symbolic computation
might be a better application, since less synchronization may be
required.  Finally, I have heard a lot of grumbling at conferences
along the lines of "Why is there no real hardware implementation of a
data flow machine."  One person even suggested that no machine had
been built "because Arvind knew that it would not work" (a view that I
don't subscribe to).  Although some of the criticism is unfair, it
still remains true that there is no "iron".  Compared to a couple of
years ago, data flow work has lost momentum if the ACM Computer
Architecture Conference proceedings and the International Conference
on Parallel Processing proceedings are any indication.  Of the data
flow papers I saw presented at the last ICPP, none of them addressed
the hard problems in data flow.  All in all, data flow in America does
not appear as lively as it once was.  However, the Japanese just
finished a fairly large data flow parallel processor named Sigma, so
there is still work going on.  I have not seen any information on the
latest Sigma work.

  Well, I hope that this generates some light, in addition to heat.
Sorry it was so long.  Any views expressed here are personal, and do
not necessarily reflect those of ESL or my department.

           Ian L. Kaplan
           ESL Inc.
           Advanced Technology Systems
           ian@esl.COM
           ames!esl!ian

stekas@att.att.com (06/06/88)

Bjorn -

   AT&T makes a signal processor for the military, called the EMSP, and it
is a data flow machine which can deliver up to 600 MFLOPS.  The data flow
architecture does cost something but the benefits are ease of programming
(the machine takes care of resource assignments) and the ability to add
more processors without having to modify the software.

   So there is at least one real-world data flow machine out there.

Jim

vandys@hpindda.HP.COM (Andy Valencia) (06/06/88)

/ hpindda:comp.parallel / gerald@umb.umb.edu (Gerald Ostheimer) / 10:26 am  May 31, 1988 /
>Someday may be closer than you think.
>You should take a look at the work on the tagged-token dataflow machine of
>MIT's Computation Structures Group under Arvind.

    Is this part of the Mind Meld work I heard about?  Where can we
get more information?  Stanford had no tech. reports on this.

			Thanks...
			Andy Valencia
			vandys%hpindda.UUCP@hplabs.hp.com