fpst@hubcap.UUCP (Steve Stevenson) (05/17/88)
[ I pulled this off sci.math.num-analysis. Thought it more than relevant. Steve ] I'm looking for sources (papers, books, etc.) for parallel numerical algorithms, i.e., algorithms in numerical analysis that work in parallel. I'm not only interested in the algorithms, but in their analysis; when do they converge, what is the order of convergence, etc. Although I am mostly interested in true parallelism, I wouldn't mind receiving algorithms based on vectorization, and such like. Please e-mail me responses; our site's subscription to this newsgroup is quite erratic. I would also like to hear from anyone else who is interested in this subject. Thanks in advance, |----------------------------------------------------------------------------| |Dennis Grinberg, Math & CS Department, Bar-Ilan University, Ramat-Gan ISRAEL| |----------------------------------------------------------------------------| |BITNET: grinberg@bimacs.bitnet | |CSNET: grinberg%bimacs.bitnet%cunyvm.cuny.edu@csnet-relay | |ARPA: grinberg%bimacs.bitnet@cunyvm.cuny.edu | |UUCP: uunet!mcvax!humus!bimacs!grinberg | |SNAILNET: Dennis Grinberg, 13 Hava Lutzki St., Rehovot, ISRAEL | |----------------------------------------------------------------------------| -- Steve Stevenson fpst@hubcap.clemson.edu (aka D. E. Stevenson), fpst@clemson.csnet Department of Computer Science, comp.parallel Clemson University, Clemson, SC 29634-1906 (803)656-5880.mabell
eugene@pioneer.arc.nasa.gov (Eugene N. Miya) (05/23/88)
for comp.parallel: >Although I am mostly interested in true parallelism, I wouldn't mind >receiving algorithms based on vectorization, and such like. The English language is marvelously vague. I wish I knew what 'true' parallelism is or was. Ortega now has a new book and there are several others (see my bibliography). I really dislike the word "parallelism." The problem is Whorfian, and an example hit me in March. I was attending a meeting in Oregon and one speaker (James Hack, an atmospheric physicist, a meteorologist at NCAR) was giving an overview of his community's computing problems [I used to work with weather models myself, so the terrain is familiar]. We were humbled by this man's problems (O(n^3.5)) and he admitted he was not a computer scientist [he was humbled by our problems]. Then he put up a view graph of a 3-D plot of Amdahl's law: one domain axis was "percent vectorization" and the other was "percent parallelism," done using the well-known NCAR graphics package, displaying the spike at 100% vectorization and parallelization (blah, what a word). This graph disturbed me; there was something wrong, and later (days later, unfortunately) my linear algebra background led me to the conclusion that he was assuming these two things were orthogonal. That was his fallacy. Now, until we refine our terminology and get an appreciation for the problems created by side-effects, shared memory, etc., we won't solve many fundamental problems. If we are going to make a distinction between vector and parallel, it had better have a better basis. Another gross generalization from --eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov resident cynic at the Rock of Ages Home for Retired Hackers: "Mailers?! HA!", "If my mail does not reach you, please accept my apology." {uunet,hplabs,hao,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene "Send mail, avoid follow-ups. If enough, I'll summarize."
fpst@hubcap.clemson.edu (Steve Stevenson-Moderator) (05/23/88)
[ Dennis is not receiving the group directly and is communicating directly through e-mail. I'll try to edit and preserve meaning. This note is in reference to the request for papers on parallel numerical algorithms. Steve ] Avi Lin of Temple University has a number of papers on various subjects such as _A_New_Parallel_Algorithm_for_LU_Decomposition_ (written with He Zhang, a graduate student), _A_New_Parallel_Algorithm_For_Linear_ Triangular_Systems_, etc. He visited Bar-Ilan University in Israel a few weeks ago, and I picked up a few papers. If there is any interest, please let me know. |----------------------------------------------------------------------------| |Dennis Grinberg, Math & CS Department, Bar-Ilan University, Ramat-Gan ISRAEL| |----------------------------------------------------------------------------| |BITNET: grinberg@bimacs.bitnet | |CSNET: grinberg%bimacs.bitnet%cunyvm.cuny.edu@csnet-relay | |ARPA: grinberg%bimacs.bitnet@cunyvm.cuny.edu | |UUCP: uunet!mcvax!humus!bimacs!grinberg | |SNAILNET: Dennis Grinberg, 13 Hava Lutzki St., Rehovot, ISRAEL | |----------------------------------------------------------------------------|
brooks@lll-crg.llnl.gov (Eugene D. Brooks III) (05/24/88)
In article <1684@hubcap.UUCP> eugene@pioneer.arc.nasa.gov (Eugene N. Miya) writes: >a computer scientist [he was humbled by our problems]. Then he placed a view >graph of a 3-D plot of Amdahl's law, one axis was "percent vectorization" and >the other domain axis was "percent parallelism" done using the well known >NCAR graphics package displaying the spike at 100% vectorization and >parallelization (blah, what a word). This graph disturbed me, there was >... His point was that on a parallel/vector architecture such as the Cray X-MP or Cray-2 series you have to do well on both the MIMD parallel and vector fronts in order to approach the peak performance of the machine. Although the space of applications may not be evenly distributed on the x-y surface of his plot, the notion it presented (that a rather small number of applications will do well on both the MIMD and vector parallel fronts) seems quite valid and is not disturbing at all. A good example of an application that fits the (MIMD, vector) parallel distinction is Gaussian elimination. One vectorizes the SAXPY operations to mask memory and functional-unit latency, and one parallelizes across the rows of the matrix to use more than one vector CPU in MIMD mode. Of course, there is more than one way to skin this cat, but the notion of simultaneously exploiting MIMD parallelism and vector (SIMD) parallelism in a given architecture or algorithm is quite useful.
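The two-level split Brooks describes can be annotated on the textbook algorithm. This is an illustrative sketch only (loop structure without pivoting, in modern Python rather than anything Cray-specific): the inner j-loop is the SAXPY a vector pipeline would execute, and the i-loop iterations are mutually independent, so a MIMD machine could hand them to different CPUs.

```python
def gauss_eliminate(a):
    """Reduce a square matrix (list of lists of floats) to upper
    triangular form in place, showing the two parallelism levels."""
    n = len(a)
    for k in range(n):                      # outer steps are sequential
        pivot = a[k][k]
        # MIMD level: rows k+1..n-1 are independent of one another,
        # so each could be updated by a different vector CPU.
        for i in range(k + 1, n):
            m = a[i][k] / pivot
            # Vector (SAXPY) level: a[i][j] -= m * a[k][j] over all j
            # is one vector operation; the pipeline masks latency.
            for j in range(k, n):
                a[i][j] -= m * a[k][j]
    return a
```

The two loop levels are exactly the "percent vectorization" and "percent parallelism" axes of the Amdahl plot under discussion: a code can exploit either, both, or neither.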
fpst@hubcap.clemson.edu (Steve Stevenson-Moderator) (05/24/88)
From article <1684@hubcap.UUCP>, by Eugene N. Miya <eugene@pioneer.arc.nasa.gov>: > > If we are going to make a distinction between vector and parallel, it had > better have better basis for making that distinction. > I agree. It seems to me that a tentative, first-cut distinction is SIMD vs. MIMD, but one has to be very careful. Example: a possible counter is that an NCUBE does a parallel algorithm (in the MIMD sense) because it does not have vector coprocessors. The FPS T, on the other hand, could do both. It seems to me that the essentials have to do with some reformulation of linear algebra and some agreement that there is an equivalence. Anyone care to try for a strict definition?
eugene@pioneer.arc.nasa.gov (Eugene N. Miya) (05/25/88)
In article <1691@hubcap.UUCP> you write: >Anyone care to try for a strict definition? I am unable to offer a better definition, but I hope I can offer some enlightenment based on further discussions in private with others. There is confusion among subtle differences: parallel, vector, Flynn's SIMD, AND/OR parallelism, what have you. DO NOT CONFUSE PARALLEL with SYNCHRONOUS/SYNCHRONY. The problem which kills you is side-effects. The early CMU people noted that the major problems were: consistency, deadlock, starvation, and exception handling. To understand some of the problem, I offer a variation of a technique used by the early physicists for understanding the nature of light. They jokingly (and half-heartedly) said light behaves like a particle on MWF and like a wave on TTS (on Sunday, they rest). So, DO work on parallel algorithms or programming or machines for one week without using the word PARALLEL. You start to fall back on other words like CONCURRENT, SIMULTANEOUS, and MULTIPROCESS, and in time you throw each of these words out. Several different groups (I am aware of Marty Fouts and others, independently) are using words like COOPERATIVE versus COMPETITIVE computing (why not, we have GREEDY algorithms, right? ;-) The problem of sustaining parallel computation is sort of like sustaining controlled thermonuclear fusion (great for short periods, but the start-up gets you). I wish I could offer more, but I'm not smart enough. Another gross generalization from --eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov resident cynic at the Rock of Ages Home for Retired Hackers: "Mailers?! HA!", "If my mail does not reach you, please accept my apology." {uunet,hplabs,ncar,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene "Send mail, avoid follow-ups. If enough, I'll summarize."
lisper-bjorn@YALE.ARPA (Bjorn Lisper) (05/26/88)
In article <407@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes: >In article <29851@yale-celray.yale.UUCP> I wrote, on "true parallelism" vs. "degree of vectorization": .... >=In my world it is very simple. parallelism is when you do several things at >=once, at different places. Vector operations are usually executed in a >=pipeline. When the pipeline is filled every stage in it will execute in >=parallel. Thus vector operations provide a special form of parallelism. .... >"Vector ops" are too high a level of abstraction to be of use in >distinguishing between vectorization and parallelism. I think we >have to break down the vector ops into their constituent parts. The point I wanted to make is that vectorization is a very special form of parallelism. The limits to "true parallelism" can, if I've got that concept right, be expressed as data dependencies between "events," where every event is an application of a "high-level" operator, say "+" in the vector example above. If we have a multiprocessor system where some processors can perform "+", and if we have some connections between the processors, then a compiler will have to find a schedule of the events that satisfies three criteria: 1. communication requirements given by data dependencies must be met by the available connections, 2. "+" events are scheduled only on processors that can perform "+", 3. no two events are scheduled on the same processor at the same time. Furthermore, the schedule should be "good," which usually means short completion time. Vectorization can be described in exactly the same way; in order to do this, "+" must be expressed in terms of suboperations matching the stages of the pipeline. Every event a(i)+b(i) is now a set of smaller events; let's call them (i,k) for each vector index i and each stage k. Since (i,k+1) uses the result from (i,k) as input, there is now a data dependency between each such pair. 
This new set of events can now be scheduled *in exactly the same way as above*, where our "multiprocessor system" now consists of the stages in the pipeline linearly connected stage k -> stage k+1. 2. transforms to "(i,k) must be scheduled on k" and 1. transforms to "if (i,k) is scheduled at k for time t, then (i,k+1) must be scheduled at k+1 for time t+1" (assuming every pipeline stage takes one time step to complete). This is a 25 line description of what I tried to say with four lines in my previous posting. >Fine-grain parallelism is the principal diet of VLIWs, and we (well, >I) define it to be one of a small set of things: memory load or >store, register-to-register integer or floating operation, and I/O. >When you've broken down your program into ops at that level, you can >then schedule them for execution in an optimal way. Operations that >are (at that time) unconstrained save for data dependencies exhibit >what I think of as useful parallelism. > >Of course, as soon as we have a decent enough definition, somebody >will try to apply it to a systolic or connection machine >and we'll have to start over. ;-) Yes, you will have to! :-) Have you heard of so-called space-time mapping methods for synthesizing systolic arrays? These are essentially scheduling methods that adhere to the three criteria I sketched above. The scheduling is often concisely expressed as a function from the set of events to space-time coordinates. So far I've seen no attempts to apply this way of thinking to the Connection Machine but it is totally clear that it works in theory. On such a machine the SIMD constraint will replace 2. (the same operation in time instead of at the same processor) and the communication requirement 1. will have to be a little more refined and take into account such things as congestion etc. Bjorn Lisper
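Lisper's three criteria can be rendered as a toy greedy list scheduler. This is a sketch in modern Python, not anything from the posting: the function and argument names are illustrative, and the interconnection half of criterion 1 is simplified to "all inputs already computed" (i.e., a fully connected machine is assumed, and an event takes one time step).

```python
def schedule(events, deps, can_run, n_procs):
    """events: list of (name, op) pairs; deps: {name: set of prerequisite
    names}; can_run: {processor index: set of ops it performs}.
    Returns {name: (processor, time)} honoring the three criteria."""
    placed, done, t = {}, set(), 0
    while len(placed) < len(events):
        busy, finished_now = set(), []
        for name, op in events:
            if name in placed:
                continue
            if not deps.get(name, set()) <= done:
                continue                     # criterion 1: inputs ready
            for p in range(n_procs):
                # criterion 2: only on a processor that performs op
                # criterion 3: one event per processor per time step
                if op in can_run[p] and p not in busy:
                    placed[name] = (p, t)
                    busy.add(p)
                    finished_now.append(name)
                    break
        done.update(finished_now)
        t += 1
    return placed
```

With two processors where only the second performs "*", an event depending on two "+" events lands on processor 1 at time 1; a "good" schedule in Lisper's sense would then be one minimizing the final t.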
lisper-bjorn@YALE-BULLDOG.ARPA (Bjorn Lisper) (05/27/88)
In article <1684@hubcap.UUCP> eugene@pioneer.arc.nasa.gov (Eugene N. Miya) writes: >for comp.parallel: >>Although I am mostly interested in true parallelism, I wouldn't mind >>receiving algorithms based on vectorization, and such like. > >The English language is marvelously vague. I wish I knew what 'true' >parallelism is or was. .... >If we are going to make a distinction between vector and parallel, it had >better have better basis for making that distinction. In my world it is very simple. Parallelism is when you do several things at once, at different places. Vector operations are usually executed in a pipeline. When the pipeline is filled, every stage in it will execute in parallel. Thus vector operations provide a special form of parallelism. Bjorn Lisper
colwell@mfci (Robert Colwell) (05/27/88)
In article <29851@yale-celray.yale.UUCP> lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) writes: =In article <1684@hubcap.UUCP= eugene@pioneer.arc.nasa.gov (Eugene N. Miya) =writes: ==for comp.parallel: ===Although I am mostly interested in true parallelism, I wouldn't mind ===receiving algorithms based on vectorization, and such like. == ==The English language is marvelously vague. I wish I knew what 'true' ==parallelism is or was. =.... ==If we are going to make a distinction between vector and parallel, it had ==better have better basis for making that distinction. = =In my world it is very simple. parallelism is when you do several things at =once, at different places. Vector operations are usually executed in a =pipeline. When the pipeline is filled every stage in it will execute in =parallel. Thus vector operations provide a special form of parallelism. = =Bjorn Lisper I think what Eugene's pointing out is that we don't really know what the coin of the realm is here. If your hammer says "vector machine" on it, then you'll quantize your parallelism in units of "percent vectorization". But the code or application doesn't know you're doing that, and it may possess fine-grain parallelism far beyond the stretches that a vector machine can take advantage of. "Vector ops" are too high a level of abstraction to be of use in distinguishing between vectorization and parallelism. I think we have to break down the vector ops into their constituent parts. Fine-grain parallelism is the principal diet of VLIWs, and we (well, I) define it to be one of a small set of things: memory load or store, register-to-register integer or floating operation, and I/O. When you've broken down your program into ops at that level, you can then schedule them for execution in an optimal way. Operations that are (at that time) unconstrained save for data dependencies exhibit what I think of as useful parallelism. 
Of course, as soon as we have a decent enough definition, somebody will try to apply it to a systolic or connection machine and we'll have to start over. ;-) Bob Colwell mfci!colwell@uunet.uucp Multiflow Computer 175 N. Main St. Branford, CT 06405 203-488-6090
eugene@pioneer.arc.nasa.gov (Eugene N. Miya) (05/28/88)
Remember what Alan Perlis said: A LISP programmer knows the value of everything, but the cost of nothing. ;-) >In my world it is very simple. Certainly, because you probably have simple problems. This I assume. >parallelism is when you do several things at >once, at different places. Vector operations are usually executed in a >pipeline. When the pipeline is filled every stage in it will execute in >parallel. Thus vector operations provide a special form of parallelism. I will give you an analogy about parallelism. Imagine n cars at a stop light waiting for it to change. The light turns green. One form of parallelism is when you have a multilane road (the too-easy case) [the undergrad concept of parallelism]. = = = = = > # = = = = = > # = = = = = > # = = = = = > # We assume no one decides to change lanes. That's why we can run freeways at 99% efficiency at 65 MPH. But it is also a form of parallelism when you have all the serially lined-up drivers hit the gas SIMULTANEOUSLY. You only trust that the car ahead = > = > = > = > # of you moves at the same rate. What we have in REAL traffic intersections is PIPELINING (some people call this temporal - > - > - > - > # parallelism, but it really isn't parallelism in some senses; you yourself call it and vectorization "special cases"). Part of the problem comes from these dependences (the car ahead of you). If you are willing to run into him (or through him, to sacrifice consistency), fine, we have chaotic relaxation algorithms. Ah, but if life could be so easy. Another gross generalization from --eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov resident cynic at the Rock of Ages Home for Retired Hackers: "Mailers?! HA!", "If my mail does not reach you, please accept my apology." {uunet,hplabs,ncar,ihnp4,decwrl,allegra,tektronix}!ames!aurora!eugene "Send mail, avoid follow-ups. If enough, I'll summarize."
nelan@HANDIES.UCAR.EDU (George Nelan) (05/30/88)
In article <1696@hubcap.UUCP>, eugene@pioneer.arc.nasa.gov (Eugene N. Miya) writes: > In article <1691@hubcap.UUCP> you write: + +Anyone care to try for a strict definition? + [etc...] + SYNCHRONOUS/SYNCHRONY. The problem which kills you are side-effects. + The early CMU people noted the major problems were: consistency, + deadlock, starvation, and exception handling. [etc...] Perhaps, just perhaps, maybe someday, somewhere, someplace, and sometime, someone will invent something like this: An infinite, w.r.t. the universe of discourse defined by the data dependencies of a particular program, MIMD ultra-fine grained side-effect free PARALLEL machine for IMPLICITLY parallel programs. I guess no side-effects => purely functional programs, huh? Also, it looks like deadlock & synchronization (consistency) constraints => normal order (lazy) evaluation must be the computation model of choice [I have some references why this is so -- sufficient interest => I'll post & discuss]; for computational power, be sure to allow for higher-order functions too. Oh, BTW, if ya want true scalability, limited (of course) by the definition of infinity given above, forget about distributed heaps & GC [lambda/combinator reduction? P'tuey!]. We don't need no steenkin' heaps... Eductively yurs, George Nelan, ERC 207, Arizona State University, Tempe, Arizona, USA, 85287 BELL : (602)-965-2791 UUCP : ...{{{ames,rutgers}!hao},{allegra,decvax,ihnp4}}!noao!asuvax!nelan CSNET: nelan@asu
gould@onion.cs.reading.ac.uk (Adrian J. Gould) (05/30/88)
Hiya, I have a project proposal here at Reading for my final year. It has been accepted, and so I would like to glean a bit of info from your KNOWLEDGEABLE brains... I am going to compare parallel computing algorithms for a computational problem (the good old Mandelbrot set). I already have some information (BYTE) and a couple of books, but what can you suggest? I will be working on simulations, but if any companies have any 'free' time that I could use on a (parallel) computer, especially transputer time... let me know by mail. OK, that's the begging done. Thanks. Adrian Gould -- "And God said, 'Let there be light.' And there was light." And so God gave his only Son to be the Light of the world. Janet: gould@sage.cs.reading.ac.uk -or- gould@onion.cs.reading.ac.uk
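The Mandelbrot set is a popular test problem for exactly the reason Adrian's project exploits: every point's iteration is independent of every other's, so the work splits trivially into units (scan lines, tiles) that can be farmed out to transputers or CPUs. A minimal sketch (parameter values and function names are arbitrary, in modern Python rather than occam):

```python
def mandel(c, max_iter=50):
    """Iterate z -> z*z + c; return the escape iteration, or
    max_iter if the point appears to lie inside the set."""
    z = 0j
    for n in range(max_iter):
        z = z * z + c
        if abs(z) > 2.0:
            return n            # escaped: c is outside the set
    return max_iter             # assumed inside the set

def scan_line(y, width=8, x0=-2.0, x1=1.0):
    """One independent unit of work: a full scan line at height y.
    Lines can be computed on different processors in any order."""
    return [mandel(complex(x0 + (x1 - x0) * i / (width - 1), y))
            for i in range(width)]
```

Since no scan line depends on any other, the only parallel-programming questions left are load balancing (interior points cost max_iter iterations, exterior ones far fewer) and collecting the results, which is why the problem is a gentle introduction to comparing parallel algorithms.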
gerald@umb.umb.edu (Gerald Ostheimer) (06/01/88)
In article <1772@hubcap.UUCP> noao!mcdsun!asuvax!asuvax!nelan@HANDIES.UCAR.EDU (George Nelan) writes: >Perhaps, just perhaps, maybe someday, somewhere, someplace, and sometime, >someone will invent something like this: > >An infinite, w.r.t. the universe of discourse defined by the data dependencies >of a particular program, MIMD ultra-fine grained side-effect free PARALLEL >machine for IMPLICITLY parallel programs. I guess no side-effects => purely >functional programs, huh? Also, it looks like deadlock & synchronization >(consistency) constraints => normal order (lazy) evaluation must be the >computation model of choice [I have some references why this is so -- >sufficient interest => I'll post & discuss]; for computational power, >be sure to allow for higher-order functions too. Someday may be closer than you think. You should take a look at the work on the tagged-token dataflow machine of MIT's Computation Structures Group under Arvind. (Their work has, for some reason unbeknownst to me, not yet received any attention in this newsgroup.) Their combined language/architecture approach meets the following of your criteria: o implicit parallelism defined purely by data dependencies o MIMD: full-powered non-von CPUs; instruction scheduling is nifty: there is no sequential program storage; rather, the result of one instruction (a token) 'enables' another instruction, possibly across CPUs; enabled instructions are maintained (in haphazard order) in a pipeline inside each CPU o side-effect free: to a large degree--there is actually a limited form of side-effects which, however, does not affect determinacy: their language Id (speak 'Idd') offers something called I-structures. I-structures are, intuitively, arrays whose fields can be initialized exactly once, but possibly by side-effect. This deals quite nicely with some serious problems of pure functional languages (no exorbitant storage requirements, no extensive copying). 
It also makes translation of those 'dusty' Fortran programs easier. o normal order evaluation o higher-order functions One of the strengths of this architecture is that it can deal very gracefully with network latency and memory latency. As long as the processor pipelines are full, all CPUs keep working concurrently. (This is probably why network topology does not seem to be an overly important topic in their literature.) A (possibly) surprising problem that turns up is that there can be too much parallelism in a program, which can overflow the pipelines and choke the machine. There are of course more problems. I for one never quite understood how the program is distributed over the CPUs. This must happen dynamically (when calling functions or entering loops, for example) if parallelism is to be exploited. Unfortunately I can't give any references to widely available comprehensive publications on their work, probably because so far their work has been confined to simulations on a network of TI Explorers, and the development of a 1000-processor machine is just beginning. Of the papers I have available, two seem to give a reasonable overview: "Future Scientific Programming on Parallel Machines" by Arvind and Ekanadham (the latter of IBM, Yorktown Heights), Computation Structures Group Memo 272, and also to appear (appeared?) in Proc. of the Int. Conf. on Supercomputing, Athens, 1987. "Executing a Program on the MIT Tagged-Token Dataflow Architecture" by Arvind and Nikhil, CSG Memo 271, and also to appear (appeared?) in Proc. Parallel Architectures and Languages, Europe, Eindhoven, 1987. Not being a member of their team, I take no responsibility for the accuracy of the above and gladly welcome any corrections/clarifications. -- Gerald "When I use a word, it means <gerald@grn.umb.edu> just what I choose it to mean - neither more nor less" -Humpty Dumpty, Through the Looking Glass
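The I-structure idea sketched above (write-once slots, with early reads deferred until the write arrives) can be illustrated in miniature. This is my own sketch with invented names, not MIT's interface; a real I-structure store does this in hardware with presence bits and deferred-read lists, and here `None` is (ab)used as the "empty" marker.

```python
class IStructure:
    """Write-once array: store() to a slot at most once; a fetch()
    that arrives before the store is deferred, not an error."""

    def __init__(self, n):
        self._slots = [None] * n                  # None = empty slot
        self._deferred = [[] for _ in range(n)]   # waiting readers

    def store(self, i, value):
        if self._slots[i] is not None:
            raise ValueError("I-structure slot written twice")
        self._slots[i] = value
        for callback in self._deferred[i]:        # wake deferred reads
            callback(value)
        self._deferred[i].clear()

    def fetch(self, i, callback):
        if self._slots[i] is None:
            self._deferred[i].append(callback)    # defer until stored
        else:
            callback(self._slots[i])
```

The determinacy claim in the post falls out of the discipline: since a slot changes at most once, a read sees the same value no matter when it runs, yet the array is filled in by side-effect rather than copied, which is what spares a pure functional program the exorbitant storage the post mentions.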
lgy@pupthy2.princeton.edu (Larry Yaffe) (06/01/88)
In article <1776@hubcap.UUCP> gerald@umb.umb.edu (Gerald Ostheimer) writes:
-
-You should take a look at the work on the tagged-token dataflow machine of
-MIT's Computation Structures Group under Arvind. (Their work, for some reason
-unbeknownst to me, did not yet receive any attention in this newsgroup.)
-Their combined language/architecure approach meets the following of your
-criteria:
- o side-effect free: to a large degree--there is actually a limited form of
- side-effects, which does, however, not affect determinacy: their language
- Id (speak 'Idd') offers something called I-structures. I-structures are,
- intuitively, arrays whose fields can be initialized exactly once, but
- possibly by side-effect. Deals quite nicely with some serious problems of
- pure functional languages (no exorbitant storage requirements, no extensive
- copying). Also makes translation of those 'dusty' Fortran programs easier.
I'd like to hear more about how this language avoids excessive copying
& wasteful memory usage. How do standard tasks like matrix multiplication,
binary tree insertions, hash table management, etc. work?
-Gerald "When I use a word, it means
-<gerald@grn.umb.edu> just what I choose it to mean -
- neither more nor less"
- -Humpty Dumpty, Through the Looking Glass
------------------------------------------------------------------------
Laurence G. Yaffe Internet: lgy@pupthy.princeton.edu
Department of Physics Bitnet: lgy@pucc
Princeton University UUCP: ...!princeton!pupthy!lgy
PO Box 708, Princeton NJ 08544
lisper-bjorn@YALE.ARPA (Bjorn Lisper) (06/02/88)
In article <1784@hubcap.UUCP> Larry Yaffe writes: >(Gerald Ostheimer writes about the MIT tagged-token dataflow machine and >the language Id...) > I'd like to hear more about how this language avoids excessive copying >& wasteful memory usage. How do standard tasks like matrix multiplication, >binary tree insertions, hash table management, etc. work? Id is a definitional language. Thus it doesn't specify memory management explicitly; its variables stand for values rather than memory locations, and its statements stand for events, defining new values from previously defined ones. Unlike in an imperative language, the order between statements does not imply a strict execution order; the only order required is the partial one defined by the data dependencies between statements. It is thus very much up to the compiler to find a schedule of the events such that memory is not unduly wasted. (Remember that memory is merely a means of communicating values between events.) Definitional languages have been proposed as programming languages for dataflow machines, but they are perfectly general. Nothing prevents them from being used on other architectures as well (including the traditional von Neumann one). An imperative language lets (or demands that!) the programmer specify how the memory management is done, by explicitly telling which variable (= memory location) is to store a value resulting from an assignment. So for a certain architecture the programmer can be smart and write assignment statements in a way that uses memory efficiently. For a definitional language the compiler has to be as smart as the programmer to do an equally good job. My guess is that these languages will not be widely successful until this is the case, and I'm not convinced that this point has been reached yet. 
But when the compiler techniques for them have been developed to this point (and I do think it can be done) then they will offer additional benefits such as portability between different architectures (both serial and parallel) and ease of program verification. So the answer to your question is: it's up to the compiler. Bjorn Lisper
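The definitional idea (statement order is irrelevant; only data dependencies fix the execution order) can be shown with a toy demand-driven evaluator. This is purely illustrative: Id is nothing like this Python, the names are mine, and cycle detection is omitted.

```python
def evaluate(defs):
    """defs: {name: (function, [names of its arguments])}, in any
    textual order. Returns {name: value}, computing each value once,
    in an order dictated solely by the data dependencies."""
    values = {}

    def value_of(name):
        if name not in values:
            fn, args = defs[name]
            # Demand-driven: arguments are computed (recursively)
            # only when this definition is first needed.
            values[name] = fn(*[value_of(a) for a in args])
        return values[name]

    for name in defs:
        value_of(name)
    return values
```

Note that "c" below is written before its inputs, yet evaluates correctly; a compiler for a definitional language plays the role of `value_of`, choosing a schedule (and a memory reuse policy) that the programmer never spelled out.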
fpst@hubcap.UUCP (Steve Stevenson) (06/02/88)
In article <1772@hubcap.UUCP> George Nelan writes: .... >Perhaps, just perhaps, maybe someday, somewhere, someplace, and sometime, >someone will invent something like this: > >An infinite, w.r.t. the universe of discourse defined by the data dependencies >of a particular program, MIMD ultra-fine grained side-effect free PARALLEL >machine for IMPLICITLY parallel programs. I guess no side-effects => purely >functional programs, huh? Also, it looks like deadlock & synchronization >(consistency) constraints => normal order (lazy) evaluation must be the >computation model of choice [I have some references why this is so -- >sufficient interest => I'll post & discuss]; for computational power, >be sure to allow for higher-order functions too. What you describe sounds pretty much like a dataflow machine to me. Experimental ones have been built (at Manchester, UK, for instance) but they don't seem to have been a success. For some reason there is always a lot of talk about them and their principles of operation before they are built, but once the hardware is there you never get to hear anything about the resulting performance. Maybe the statistics are too embarrassing, who knows? Another indication of the practical problems with dataflow machines is that despite the fact that the concept has been around for a long time (since the seventies) there are still no commercial manufacturers out there who have tried to build one and sell it for profit. The main problem with dataflow machines is that they use an extremely general scheme to handle fine-grain parallelism. This causes gargantuan overhead. Much of the fine-grain parallelism in a program typically has a data-independent structure, which means that the dependencies can be analyzed *at compile time* (as opposed to dataflow mechanisms that handle this at run time) and the instructions can be scheduled in advance in a way that executes as efficiently as possible on the hardware at hand. 
Unless dataflow architects find ways of incorporating this, I don't think dataflow architectures will ever be cost-effective. IEEE Computer, February 1982, is a special issue on dataflow. It is a good introduction for anyone previously unfamiliar with the field. It is especially valuable since it contains not only articles by authors positive toward dataflow but also an article that argues that dataflow is no good: Gajski, Padua, Kuck, "A Second Opinion on Data Flow Machines and Languages". I think their critique of dataflow architectures still holds water (although I don't share their opinion on data flow *languages*, but that's another issue). Bjorn Lisper -- Steve Stevenson fpst@hubcap.clemson.edu (aka D. E. Stevenson), fpst@clemson.csnet Department of Computer Science, comp.parallel Clemson University, Clemson, SC 29634-1906 (803)656-5880.mabell
cdshaw@UCBVAX.BERKELEY.EDU (Chris Shaw) (06/02/88)
>In article <1691@hubcap.UUCP> you write: >>Anyone care to try for a strict definition? I think that the best way to think of pipelined parallelism under the Flynn taxonomy is to call it MISD. In other words, the "unused" classification among SISD, MISD, SIMD, MIMD does actually contain examples of real machines. I suppose the real difficulty is with definitions: What is an instruction? What is data? What is a CPU? If "instruction" in the Flynn taxonomy means "operate on a data item by passing it from one register through some combinatorial circuit to another register", then the above classification of pipelines as MISD machines makes sense. Of course, assuming that the four Flynn classes are orthogonal is probably a mistake. They can probably best be thought of as subsets, where SISD is a subset of MISD, which is a subset of SIMD, which is a subset of MIMD. At the highest level, a Cray X-MP may be MIMD, but each CPU is MISD, while the scalar part of the CPU is SISD. But all this relies on well-defined notions of data, instruction, etc. -- Chris Shaw cdshaw@alberta.UUCP (via watmath, ihnp4 or ubc-vision) University of Alberta CatchPhrase: Bogus as HELL !
hjm@uunet.UU.NET (hjm) (06/02/88)
In article <1772@hubcap.UUCP> noao!mcdsun!asuvax!asuvax!nelan@HANDIES.UCAR.EDU (George Nelan) writes: > >I guess no side-effects => purely functional programs, huh? > There is one major problem with this: no side-effects => no I/O, which is a fairly useful side-effect of most computation. This is not strictly true, but for the sequential type of I/O that we're all accustomed to doing, there is state transmitted between statements in terms of the I/O being performed. For example, a function such as read_next_input_value(stdin) will return different things on different calls, and so is strictly not a function. OK, a bodge can be performed so that a state variable is added and referential transparency is preserved, but just try passing the entire filing system of your machine as an argument to every function that does anything to any file! Do that and we'll all need a massively parallel machine just for word processing. Having said all that, I like functional languages, especially since they can be implemented in parallel cleanly. It's just a pity that the machine has to talk to the outside world ... Hubert Matthews
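The "bodge" Matthews mentions can be shown in miniature: thread an explicit state value through the reader so that each call is a genuine function of its arguments. This is an illustrative sketch (the "filing system" here shrinks to a list of strings, and the names are invented):

```python
def read_next_input_value(state):
    """A pure version of the impure reader: given the same state,
    it always returns the same (value, new_state) pair."""
    contents, pos = state
    return contents[pos], (contents, pos + 1)

# Each call consumes one state and produces the next one explicitly.
s0 = (["10", "20", "30"], 0)
v1, s1 = read_next_input_value(s0)
v2, s2 = read_next_input_value(s1)
```

Referential transparency is preserved because nothing is hidden: calling `read_next_input_value(s0)` a second time yields "10" again. The cost is exactly the one Matthews complains about: every function touching I/O must accept and return the state, which for a real machine means the whole filing system.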
ian@esl.UUCP (Ian Kaplan) (06/06/88)
In article <1776@hubcap.UUCP> gerald@umb.umb.edu (Gerald Ostheimer)
writes (in response to a note from George Nelan):
>
>You should take a look at the work on the tagged-token data flow machine of
>MIT's Computation Structures Group under Arvind. (Their work, for some reason
>unbeknownst to me, did not yet receive any attention in this newsgroup.)

[ much deleted ]

> A (possibly) surprising problem that turns up is that there can be too
> much parallelism in a program, which can overflow the pipelines and choke the
> machine.

[ text deleted ]

> There are of course more problems.
> I for one never quite understood how the program is distributed over the
> CPU's. This must happen dynamically (when calling functions or
> entering loops, for example), if parallelism is to be exploited.

To make up for the deficit of data flow discussion in this newsgroup, I
will submit this overly long note discussing some of Arvind's work. Data
flow is complex, and I cannot provide an introduction in the space of a
note that I have time to write and you are likely to read. As a result,
I will assume some familiarity with data flow on the part of the reader.
If there is enough interest, I might be persuaded to put together a
brief bibliography.

All programs must have a method of handling global state, which is
usually stored in global data structures. As Mr. Ostheimer points out,
Arvind's I-structures handle this well. Arvind's group has also looked
at reducing the overhead normally associated with handling these data
structures. Some of these methods are algorithmic and others are
architectural. For example, Steve Heller in Arvind's lab did some work
on a hardware implementation of an I-structure store, which is a sort of
"smart memory".

Arvind's model of data flow is referred to as dynamic data flow (in
contrast to Prof. Dennis' model of static data flow). In dynamic data
flow, several instances of a loop can be active at the same time.
As each loop instance is instantiated, a loop field in the data flow tag
is incremented. This allows separate data flow matching on the tokens of
each loop instance. While this technique generates a lot of parallelism
for loops, as Mr. Ostheimer points out, it can also generate so much
parallelism that the data flow matching store overflows. One way to
solve this problem is to allow the loops to expand only so far. For
example, a given loop would be allowed only five instantiations. This
bound would be decided in advance by the compiler; how the compiler
decides it, I have not seen explained.

Arvind's implementation of dynamic data flow assumes that as constructs
like loops parallelize at run time, new loop instances run on additional
processors. This means that the code for the loop must be resident on
the processor. One easy way to handle this was used by the Cal Tech
Cosmic Cube group: a complete copy of the program is loaded onto every
processor. Of course, this is a very inefficient way to use memory,
especially in a small-grain parallel computer. Like Mr. Ostheimer, I
have never seen an explanation of how code is allocated to processors in
a dynamic data flow system. On Arvind's data flow machine I don't think
that assignment is dynamic (it is on the Manchester data flow machine).

The data flow model is an asynchronous model. A data producer sends data
to a data consumer whenever it wants, and the consumer uses the data
when it is ready. There are no rendezvous in data flow. This has the
advantage of allowing pipelining, which is an important source of
parallelism. It has the disadvantage of consuming large, potentially
unbounded, amounts of buffer space. If a data producer (perhaps a data
flow sub-graph) produces data faster than a consumer can use it, the
consumer's buffers will overflow. Arvind wrote two papers on "demand
driven data flow", which allows the consumer to tell the producer that
it is ready for more data.
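[The tag-matching and loop-bounding mechanisms described above can be
sketched in a few lines. Everything here is an invented simplification
for illustration: real tagged-token tags carry more context than an
(instruction, iteration) pair, and the bound of five mirrors the example
in the note, not any published design.]

```python
class MatchingStore:
    """Toy tagged-token matching store. A tag is (instruction_id,
    loop_iteration); a two-operand instruction fires once both tokens
    carrying the same tag have arrived. Loop unfolding is capped at a
    compiler-chosen bound k."""

    def __init__(self, k=5):
        self.k = k          # maximum concurrent loop instantiations
        self.waiting = {}   # tag -> (port, value) of the first arrival

    def arrive(self, tag, port, value):
        instr_id, iteration = tag
        if iteration >= self.k:
            raise RuntimeError("loop expanded beyond its k-bound")
        partner = self.waiting.pop(tag, None)
        if partner is None:
            self.waiting[tag] = (port, value)  # wait for the other operand
            return None
        left = partner[1] if partner[0] == 0 else value
        right = value if partner[0] == 0 else partner[1]
        return (left, right)  # both operands present: the instruction fires

store = MatchingStore(k=5)
assert store.arrive(("add", 0), 0, 7) is None    # first token waits
assert store.arrive(("add", 0), 1, 8) == (7, 8)  # partner arrives: fire
```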
I do not know whether the current Id compiler being used by Arvind's
group uses demand driven data flow. Without something like demand driven
data flow (or reduction), an asynchronous data flow machine has the
potential of consuming all of its buffer memory and crashing.

Arvind and Dennis propose data flow supercomputers. Supercomputers are
used primarily for numeric computation (as opposed to symbolic
computation), so numeric programs must run well on the system being
proposed. As many people have noted, numeric programs spend most of
their time in loops. While existing programs might be better restated
for a data flow machine (in Id, for example), they will still have a
heavy loop content. Arvind's data flow model takes advantage of the
parallelism in loops and produces a great deal of parallelism while a
loop is executing. But many loops must synchronize, either during loop
execution (if the loop is pipelining) or at the end of the loop. If one
looks at a performance graph for a numeric program executing on a data
flow machine, there will be large peaks, with a lot of parallelism, and
troughs, where synchronization must take place. The synchronization is
taking place on a relatively small number of processors, and at these
points the entire computation is limited by the speed of the processors.
If the computation is highly parallel, these synchronization points will
come to dominate the computation time.

The data flow model assumes that many (i.e., 10^4 to 10^6) relatively
inexpensive, relatively low performance processors are used. This is
fine for the "peaks" of parallelism, but these slow processors become
the limiting factor for synchronization. This is one of the reasons that
Prof. Kuck at the Univ. of Illinois proposes a few very powerful
processors for his Cedar parallel processor: since these processors are
fast, they will execute the synchronization sections rapidly.
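[The producer-consumer hazard described above, a producer that outruns
its consumer until buffer memory is exhausted, has a conventional
software analogue of the demand-driven fix: a bounded buffer, where the
producer blocks instead of overflowing. A minimal sketch; the buffer
size of 4 and the None sentinel are arbitrary choices for illustration.]

```python
import queue
import threading

buf = queue.Queue(maxsize=4)   # bounded buffer: implicit backpressure
results = []

def producer():
    for i in range(20):
        buf.put(i)             # blocks when the consumer falls behind
    buf.put(None)              # sentinel: no more data

def consumer():
    while True:
        item = buf.get()
        if item is None:
            break
        results.append(item * item)

threads = [threading.Thread(target=producer),
           threading.Thread(target=consumer)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert results == [i * i for i in range(20)]
```

The blocking put plays the role of the consumer telling the producer
when it is ready for more data, without unbounded buffer growth.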
Conclusion

Arvind's data flow model has a reasonable solution for handling global
state (global data structures). The dynamic data flow model has the
potential of producing so much loop parallelism that the data flow
matching stores overflow; this can be managed by choking loop
parallelism. Arvind also has a solution for handling the
producer-consumer problem, but it is very theoretical and needs work
before it can be implemented. Using a combination of reduction and data
flow may provide an elegant solution, but I have not seen Arvind propose
this (although there are people in his lab who know reduction well).
Also, it is not clear, at least to me, that small-grain data flow is a
good model for numeric computation. Symbolic computation might be a
better application, since less synchronization may be required.

Finally, I have heard a lot of grumbling at conferences along the lines
of "Why is there no real hardware implementation of a data flow
machine?" One person even suggested that no machine had been built
"because Arvind knew that it would not work" (a view to which I don't
subscribe). Although some of the criticism is unfair, it remains true
that there is no "iron". Compared to a couple of years ago, data flow
work has lost momentum, if the ACM Computer Architecture Conference
proceedings and the International Conference on Parallel Processing
proceedings are any indication. Of the data flow papers I saw presented
at the last ICPP, none addressed the hard problems in data flow. All in
all, data flow in America does not appear as lively as it once was.
However, the Japanese just finished a fairly large data flow parallel
processor named Sigma, so there is still work going on. I have not seen
any information on the latest Sigma work.

Well, I hope that this generates some light, in addition to heat. Sorry
it was so long. Any views expressed here are personal, and do not
necessarily reflect those of ESL or my department.

Ian L. Kaplan
ESL Inc.
Advanced Technology Systems
ian@esl.COM
ames!esl!ian
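[The I-structure idea discussed in the note above, a single-assignment
memory in which reads that arrive before the write are deferred until
the value exists, can be sketched in software. The class below is an
invented simplification for illustration, not Heller's hardware design.]

```python
import threading

class IStructure:
    """Toy single-assignment array: fetches block until the cell is
    written, and a second store into the same cell is an error."""

    def __init__(self, size):
        self._cells = [None] * size
        self._full = [threading.Event() for _ in range(size)]

    def store(self, i, value):
        if self._full[i].is_set():
            raise RuntimeError("I-structure cells are write-once")
        self._cells[i] = value
        self._full[i].set()       # wake any deferred readers

    def fetch(self, i):
        self._full[i].wait()      # deferred read: wait for the producer
        return self._cells[i]

xs = IStructure(4)
got = []
reader = threading.Thread(target=lambda: got.append(xs.fetch(2)))
reader.start()                    # read issued before the write: it defers
xs.store(2, 42)
reader.join()
assert got == [42]
```

This is what lets a producer and a consumer of a shared data structure
run concurrently without a rendezvous: the synchronization lives in the
memory cell itself.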
stekas@att.att.com (06/06/88)
Bjorn -

AT&T makes a signal processor for the military, called the EMSP, and it
is a data flow machine which can deliver up to 600 MFLOPS. The data flow
architecture does cost something, but the benefits are ease of
programming (the machine takes care of resource assignments) and the
ability to add more processors without having to modify the software.

So there is at least one real-world data flow machine out there.

Jim
vandys@hpindda.HP.COM (Andy Valencia) (06/06/88)
/ hpindda:comp.parallel / gerald@umb.umb.edu (Gerald Ostheimer) / 10:26 am  May 31, 1988 /
>Someday may be closer than you think.
>You should take a look at the work on the tagged-token dataflow machine of
>MIT's Computation Structures Group under Arvind.

Is this part of the Mind Meld work I heard about? Where can we get more
information? Stanford had no tech reports on this.

Thanks...

Andy Valencia
vandys%hpindda.UUCP@hplabs.hp.com