jashley@silver.ucs.indiana.edu (J. Michael Ashley) (07/17/90)
A little while ago I asked why array comprehensions are not parallel.  In
particular, why does an array comprehension like this:

    { array (1, n) | [i] = f i || i <- 1 to n }

not evaluate the (f i)'s concurrently?  Well, it turns out that the i's get
generated sequentially as a list (a list in the data structure sense, mind
you).

But there's no reason why we have to confine ourselves to this "1 to n"
construct to generate the indices.  In fact, we can use any function we
like to generate the indices.  For example, we can rewrite the above
expression as

    { array (1, n) | [i] = f i || i <- gen 1 n }

where gen is some generator of our own creation.  But what are we going to
create?  We'd like to write something better than the O(n) generator that
"1 to n" represents.  Unfortunately, I haven't been able to uncover an O(1)
or even O(log n) algorithm.  The difficulty is that the append function is
O(m), where m is the length of its first argument, so the best I've been
able to do is O(n).  This is no good.

Thorsten von Eicken suggested the following trick.  Instead of using one
generator to create the indices, use two!  I'm ignoring the bounds problem,
but this is the idea:

    { array (1, n) | [i] = f i || j <- 1 to n by 8 & i <- j to j+7 }

The best you can hope for with this method is O(sqrt n), where n is the
length of the array.  You can use more generators to get O(cubert n), etc.
This is a nice solution, and I don't think you can do much better without
resorting to non-functional constructs (the "make_1d_i_array" function in
Id).

Though this is a nice solution, it is still a trick, and I'm not very fond
of tricks.  It seems that the root of the problem is that the generator
produces a *list* of indices.  I say this is the problem because building a
list is an O(n) job and referencing the list is an O(n) job.  A list is the
wrong data structure.

Can anybody else suggest a better definition for array comprehensions?
What about a "set" as a new data structure?

By the way, array comprehensions are a feature of Haskell, which is
supposed to be the new standard.  So this question is very relevant!

mike ashley
jashley@lanl.gov
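Concretely, the divide-and-conquer generator that runs into this append
wall might look like the following Haskell sketch (my own illustration; the
names gen/lo/hi come from no actual compiler).  The two recursive calls can
proceed in parallel, but the (++) that joins their results costs time
proportional to the length of its first argument, which is where the O(n)
critical path comes from:

    gen :: Int -> Int -> [Int]
    gen lo hi
      | lo >  hi  = []
      | lo == hi  = [lo]
      | otherwise = gen lo mid ++ gen (mid + 1) hi
      where mid = (lo + hi) `div` 2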
wg@tubopal.UUCP (Wolfgang Grieskamp) (07/17/90)
In article <51123@iuvax.cs.indiana.edu> jashley@silver.ucs.indiana.edu
(J. Michael Ashley) writes:
>A little while ago I asked why array comprehensions are not parallel.  In
>particular, why does an array comprehension like this:
>
>    { array (1, n) | [i] = f i || i <- 1 to n }
>
>not evaluate the (f i)'s concurrently?  Well, it turns out that the i's
>get generated sequentially as a list (a list in the data structure sense,
>mind you).

This index list is intermediate, and an optimizing compiler can eliminate
it using loop fusion.  Assume you have

    FUN .. : nat ** nat -> seq[nat]
    DEF i..n == IF i>n  THEN <>
                IF i<=n THEN i::((i+1)..n) FI

and

    FUN init: array[alpha] ** (nat -> alpha) ** seq[nat] -> array[alpha]
    DEF init(A,f,S) ==
        IF <>?(S) THEN A
        IF ::?(S) THEN init(update(A,hd(S),f(hd(S))),f,tl(S)) FI

and somewhere the expression

    ... init(array(1,n),f,1..n) ...

as a low-level representation of the array comprehension.  Now the compiler
may derive the new function

    FUN init1: array[alpha] ** (nat -> alpha) ** nat ** nat -> array[alpha]
    DEF init1(A,f,i,n) ==
        IF <>?(i..n) THEN A
        IF ::?(i..n) THEN init(update(A,hd(i..n),f(hd(i..n))),f,tl(i..n)) FI

and the rule:

    LAW init(A,f,i..n) == init1(A,f,i,n)

Unfolding ".." once in init1 and simplifying yields:

    DEF init1(A,f,i,n) ==
        IF i>n  THEN A
        IF i<=n THEN init(update(A,hd(i::((i+1)..n)),
                                 f(hd(i::((i+1)..n)))),
                          f,tl(i::((i+1)..n))) FI
    ...
        IF i<=n THEN init(update(A,i,f(i)),f,(i+1)..n)

Using the above rule you finally get:

    DEF init1(A,f,i,n) == IF i>n  THEN A
                          IF i<=n THEN init1(update(A,i,f(i)),i+1,n) FI

(A compiler will probably not use this fold/unfold method, because it
depends too much on "eureka" steps.  Furthermore, the view is based on the
assumption that update is O(1).  Such an update can be established by a
sharing analysis for the given example.)

Regarding concurrent initialization, assume simplistically that init1 is
not strict in its first argument.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Wolfgang Grieskamp
wg@opal.cs.tu-berlin.de      tub!tubopal!wg      wg%opal@DB0TUI11.BITNET
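For readers who don't know OPAL, the fused result corresponds roughly to
this Haskell sketch (an illustration only; Data.Array's (//) copies the
array, so it does not literally meet the O(1)-update assumption discussed
above):

    import Data.Array

    init1 :: Array Int a -> (Int -> a) -> Int -> Int -> Array Int a
    init1 a f i n
      | i > n     = a                                 -- termination case
      | otherwise = init1 (a // [(i, f i)]) f (i + 1) n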
jashley@silver.ucs.indiana.edu (J. Michael Ashley) (07/17/90)
Wolfgang Grieskamp writes:
>In article <51123@iuvax.cs.indiana.edu> jashley@silver.ucs.indiana.edu
>(J. Michael Ashley) writes:
>>A little while ago I asked why array comprehensions are not parallel.  In
>>particular, why does an array comprehension like this:
>>
>>    { array (1, n) | [i] = f i || i <- 1 to n }
>>
>>not evaluate the (f i)'s concurrently?  Well, it turns out that the i's
>>get generated sequentially as a list (a list in the data structure sense,
>>mind you).
>
>This index list is intermediate, and an optimizing compiler can eliminate
>it using loop fusion.

[ omitting the pleasantly detailed derivation of the optimization ]

>Using the above rule you finally get:
>
>    DEF init1(A,f,i,n) == IF i>n  THEN A
>                          IF i<=n THEN init1(update(A,i,f(i)),i+1,n) FI
>
>(A compiler will probably not use this fold/unfold method, because it
>depends too much on "eureka" steps.  Furthermore, the view is based on the
>assumption that update is O(1).  Such an update can be established by a
>sharing analysis for the given example.)
>
>Regarding concurrent initialization, assume simplistically that init1 is
>not strict in its first argument.

Actually, this is precisely the optimization that the current Id compiler
does right now; if it sees the construct "1 to n", then the compiler will
transform it into a loop.  Since we are using I-structures, the update is
O(1) (without sharing analysis), and init1 is not strict in its first
argument.

This doesn't solve the problem, however.  Since we are using a dataflow
architecture, the function f will undergo partial evaluation to the point
where it must have its argument.  What this means is that if we have lots
and lots of processors, 90% of the work of creating the array is done by
the first iteration or two of init1.  The execution is now bottlenecked by
this sequential generation of the indices.

A slightly better compiler will generate the indices in O(log n) time
without any trouble.  But this doesn't solve the problem completely.  In
the case of accumulator comprehensions, the programmer will probably not be
using this easy-to-optimize "1 to n" construct.  Instead, there will be
some arbitrary function (probably complicated) that *must return a list*.
Even the most trivial numerical codes will be more than a match for any
compiler optimization.  So we are again stuck with lots of parallelism when
generating the elements of the list but bounded by an O(n) append to put
the results back together to form the final list.

Thomas Breuel suggested a couple of ways ("difference lists and/or pointer
jumping") to overcome the O(n) append problem.  This is kind of a hack, and
Thomas admits that.  I still hold the position that lists are not the right
data structure for the indices.

mike ashley
jashley@lanl.gov
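For reference, the difference-list idea Thomas Breuel mentions looks
roughly like this in Haskell (a sketch under invented names; note that it
makes the appends O(1), but flattening back to an ordinary list for the
consumer is still sequential, which is part of why it remains a hack):

    type DList a = [a] -> [a]     -- a list, represented as "prepend me"

    empty :: DList a
    empty = id

    single :: a -> DList a
    single x = (x :)

    append :: DList a -> DList a -> DList a
    append = (.)                  -- O(1): just function composition

    toList :: DList a -> [a]
    toList d = d []

    genD :: Int -> Int -> DList Int
    genD lo hi
      | lo >  hi  = empty
      | lo == hi  = single lo
      | otherwise = genD lo mid `append` genD (mid + 1) hi
      where mid = (lo + hi) `div` 2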
wg@tubopal.UUCP (Wolfgang Grieskamp) (07/17/90)
In article <1661@opal.tubopal.UUCP> wg@tubopal.UUCP (Wolfgang Grieskamp)
writes:
> ...
>Regarding concurrent initialization, assume simplistically that init1 is
>not strict in its first argument.
 ^^^^^^^^^^
NONSENSE.  I wanted to say "lazy".
wg@tubopal.UUCP (Wolfgang Grieskamp) (07/18/90)
In article <51173@iuvax.cs.indiana.edu> jashley@silver.ucs.indiana.edu
(J. Michael Ashley) writes:
>Wolfgang Grieskamp writes:
>
>>Using the above rule you finally get:
>>
>>    DEF init1(A,f,i,n) == IF i>n  THEN A
>>                          IF i<=n THEN init1(update(A,i,f(i)),i+1,n) FI
>>
>> [...]
>>Regarding concurrent initialization, assume simplistically that init1 is
>>[[not strict]] lazy in its first argument.
>
>This doesn't solve the problem, however.  Since we are using a dataflow
>architecture, the function f will undergo partial evaluation to the point
>where it must have its argument.

If I get it right, this is a specific problem of dataflow architectures:
init1 fires only when *all* arguments are available, so the next recursion
level may only be reached when update(..) is available (it must be
something like this; I'm not so firm on dataflow architectures).

Anyway, array initialization is inherently parallel not only in FORTRAN but
also in functional languages, and that should be pointed out.  You might
imagine that the evaluation of init1 causes the creation of (n-i)+1 update
"processes".  The result of update is not required to bound the recursion,
so init1 may terminate before any f(i) has been evaluated.  The bottleneck
appears if some f(i) is ready and update requires the updated array of the
i+1 recursion level (this is an inherent problem (also in FORTRAN ...) that
you can only solve with some complicated analysis which ensures the
commutativity of all the performed updates).

>In the case of accumulator comprehensions, the programmer will probably
>not be using this easy-to-optimize "1 to n" construct.  Instead, there
>will be some arbitrary function (probably complicated) that *must return
>a list*.

Fusion in the case of the discussed array comprehensions works for *any*
list producer with an arbitrary number of recursion and termination cases,
not only for "1 to n".  However, the (conceptual) fold step can only be
applied to cases which have an outermost list constructor (so if you
construct the index list with a concatenation of two sublists, it will fail
for this case).  Another problem is the compiler's runtime ... so we
actually need parallel machines to write nice compilers!

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Wolfgang Grieskamp
wg@opal.cs.tu-berlin.de      tub!tubopal!wg      wg%opal@DB0TUI11.BITNET
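To illustrate "any list producer", here is a toy example in Haskell rather
than OPAL (my own, not from the derivation above): a producer with a
different step still fuses, because each recursion case exposes an
outermost list constructor for the fold step to consume.

    import Data.Array

    evens :: Int -> Int -> [Int]          -- producer: i, i+2, i+4, ...
    evens i n | i > n     = []
              | otherwise = i : evens (i + 2) n

    -- what fusion derives: the intermediate list never exists
    initEvens :: Array Int a -> (Int -> a) -> Int -> Int -> Array Int a
    initEvens a f i n
      | i > n     = a
      | otherwise = initEvens (a // [(i, f i)]) f (i + 2) n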
tom@stl.stc.co.uk (Tom Thomson) (07/19/90)
In article <51123@iuvax.cs.indiana.edu> jashley@silver.ucs.indiana.edu
(J. Michael Ashley) writes:
>Can anybody else suggest a better definition for array comprehensions?
>What about a "set" as a new data structure?

What's "new" about "set"?  Sets have been in Hope+ for a long time.
Perhaps the most interesting set of "set" facilities is in the
Hope+Unification variant of Hope+.

Tom Thomson
jashley@silver.ucs.indiana.edu (J. Michael Ashley) (07/19/90)
Wolfgang writes:
>>>Using the above rule you finally get:
>>>
>>>    DEF init1(A,f,i,n) == IF i>n  THEN A
>>>                          IF i<=n THEN init1(update(A,i,f(i)),i+1,n) FI
>>>
>>>Regarding concurrent initialization, assume simplistically that init1 is
>>>[[not strict]] lazy in its first argument.
>
>If I get it right, this is a specific problem of dataflow architectures:
>init1 fires only when *all* arguments are available, so the next recursion
>level may only be reached when update(..) is available [...]
>You might imagine that the evaluation of init1 causes the creation of
>(n-i)+1 update "processes".  The result of update is not required to bound
>the recursion, so init1 may terminate before any f(i) has been evaluated.
>The bottleneck appears if some f(i) is ready and update requires the
>updated array of the i+1 recursion level (this is an inherent problem
>(also in FORTRAN ...) that you can only solve with some complicated
>analysis which ensures the commutativity of all the performed updates).

Put aside dataflow for a minute.  Just assume we have a lazy language.
Then the question now is not whether init1 is strict in its first argument,
but whether or not update is strict in its first argument.  I think it is,
so this simple version of init1 is sequential.  Using dataflow, it is
certainly sequential!

It's an easy optimization to make init1 run in O(log n) using a
divide-and-conquer algorithm.  This can be done since I-structures are
being used to represent arrays; no analysis is necessary.

>However, the (conceptual) fold step can only be applied to cases which
>have an outermost list constructor (so if you construct the index list
>with a concatenation of two sublists, it will fail for this case).

Constructing the index list by concatenating sublists is usually what
happens, so the analysis won't work.  At any rate, even if it could
theoretically work, I bet it would take forever to do the analysis on real
codes.

mike ashley
jashley@lanl.gov
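A sketch of that divide-and-conquer initialization, with GHC-style IOArray
and forkIO standing in for I-structure writes (illustrative only; a
dataflow machine would spark these implicitly): each half of the index
range is filled by its own thread, there is no index list and no append,
and the spawning depth is O(log n).

    import Data.Array.IO
    import Control.Concurrent

    fillDC :: IOArray Int a -> (Int -> a) -> Int -> Int -> IO ()
    fillDC arr f lo hi
      | lo > hi   = return ()
      | lo == hi  = writeArray arr lo (f lo)   -- one "I-structure write"
      | otherwise = do
          let mid = (lo + hi) `div` 2
          done <- newEmptyMVar
          _ <- forkIO (fillDC arr f lo mid >> putMVar done ())
          fillDC arr f (mid + 1) hi
          takeMVar done                        -- join the left half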
jgk@osc.COM (Joe Keane) (07/19/90)
I don't think array initialization such as the one described is necessarily
linear.  If we think of an array as a monolithic object, then yes, the
updates must happen sequentially.  However, if the array is somehow divided
into pieces, it is clear that we know which updates could affect a given
piece, so the pieces can be constructed independently, although it takes
time to put them together.

There are simple algorithms to do insertion and deletion in arrays, but
they become slow with large arrays.  To avoid linear time in such
operations you have to use more sophisticated data structures, using a
divide-and-conquer approach.  The array is hierarchically broken down into
smaller pieces which are better handled with simple algorithms.  This sort
of indexing is done by B-trees, 2-3 trees, splay trees, etc.

It is interesting that this same change also solves the parallelism
problem.  The initialization of the array will follow the shape of the data
structure, because the different parts at each node in the hierarchy can be
done independently.  This does it in logarithmic time, which is the best we
can do, assuming that in a given amount of time a computation can only fork
off a limited number of parallel sub-computations.
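A toy Haskell rendering of this hierarchical idea (mine, not Joe's code):
an "array" as a balanced tree whose two halves can be built independently,
giving O(log n) construction depth at the price of O(log n) indexing.

    data Tree a = Leaf a
                | Node Int (Tree a) (Tree a)  -- Int = size of left subtree

    build :: (Int -> a) -> Int -> Int -> Tree a   -- assumes lo <= hi
    build f lo hi
      | lo == hi  = Leaf (f lo)
      | otherwise = Node (mid - lo + 1)
                         (build f lo mid)         -- independent of ...
                         (build f (mid + 1) hi)   -- ... this half
      where mid = (lo + hi) `div` 2

    index :: Tree a -> Int -> a                   -- 0-based offset
    index (Leaf x)      _ = x
    index (Node sz l r) i
      | i < sz    = index l i
      | otherwise = index r (i - sz)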
fs@rex.cs.tulane.edu (Frank Silbermann) (07/19/90)
J. Michael Ashley:
>>Can anybody else suggest a better definition for array comprehensions?
>>What about a "set" as a new data structure?

Tom Thomson:
>What's "new" about "set"?
>Sets have been in Hope+ for a long time.

It was my understanding that Hope's "set" abstractions violate the axioms
of set theory.  The elements of Hope's "sets" have an implied ordering
accessible to the programmer, and are therefore more accurately described
as _list_ comprehensions, i.e. list processing with implicit backtracking.

	Frank Silbermann	fs@rex.cs.tulane.edu
	Tulane University	New Orleans, LA  70118
wg@opal.cs.tu-berlin.de (Wolfgang Grieskamp) (07/21/90)
jashley@silver.ucs.indiana.edu (J. Michael Ashley) writes:

>Put aside dataflow for a minute.  Just assume we have a lazy language.
>Then the question now is not whether init1 is strict in its first
>argument, but whether or not update is strict in its first argument.  I
>think it is, so this simple version of init1 is sequential.

Yes, that's right.  init1 is strict in its first argument if update is
(sigh).  But if we assume that the first argument of init1 is evaluated
lazily (what I wanted to say), then there IS no reason why init1 should
block on the calculation of update or one of its arguments.  In general,
sub-calculations may be parallelized over recursion levels if the induction
variable(s) do not depend on them.

>Using dataflow, it is certainly sequential!

Maybe.  You probably need a hybrid.  The pure dataflow model is just (nice)
theory.

>It's an easy optimization to make init1 run in O(log n) using a
>divide-and-conquer algorithm.  This can be done since I-structures are
>being used to represent arrays; no analysis is necessary.

O(log n) isn't really good.  It's mandatory to exploit (at least partially
available) linear memory.  The key to efficiency in imperative languages is
arrays with O(1) access and O(1) update, and efficiency (BTW, imperative
languages too) makes the world go around.  You can have O(1) access in
functional languages, provided you have built-in or hand-coded array
structures, and O(1) update in the lucky exclusive case, if you
additionally have sharing analysis or reference counting (I heard some
people have observed that the lucky case appears >60% of the time on
standard problems).

The main problem is not whether O(xyz) or inherent parallelism, but the
underlying machine structure.  Given ($$$) MIMDs with shared memory, it
would be a good idea to parallelize the initialization maximally (unless
you know the initializer is trivial).  Given loosely coupled MIMDs, you
must carefully weigh the communication effort against the initializer
effort.  I guess on l.c. MIMDs parallel array initialization is an exotic
case, and pipelining of streams is the case of interest.

>Constructing the index list by concatenating sublists is usually what
>happens, so the analysis won't work.  At any rate, even if it could
>theoretically work, I bet it would take forever to do the analysis on
>real codes.

Don't be so pessimistic, Michael!  It's not a question of forever but, say,
10 user minutes against 0.5 for a medium module.  For example, full
unification is much more expensive.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Wolfgang Grieskamp
-fullOpt => *compiler* thinks: "lazy or not lazy, this is here the question!"
wg@opal.cs.tu-berlin.de      tub!tubopal!wg      wg%opal@DB0TUI11.BITNET
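For what it's worth, the "lucky exclusive case" can also be guaranteed
statically rather than detected by analysis; a sketch using GHC's ST arrays
(my substitution, not something proposed above): inside runSTArray the
array is provably unshared, so every writeArray is a genuine in-place O(1)
store, although this sequentializes the initialization again.

    import Data.Array (Array)
    import Data.Array.ST (runSTArray, newArray_, writeArray)
    import Control.Monad (forM_)

    initArr :: (Int -> a) -> Int -> Int -> Array Int a
    initArr f lo hi = runSTArray $ do
      arr <- newArray_ (lo, hi)            -- allocate, uninitialized
      forM_ [lo .. hi] $ \i ->
        writeArray arr i (f i)             -- in-place, O(1) each
      return arr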