[comp.lang.functional] Id array comprehensions revisited

jashley@silver.ucs.indiana.edu (J. Michael Ashley) (07/17/90)

A little while ago I asked why array comprehensions are not parallel.  In
particular, why does an array comprehension like this:

	{ array (1, n) | [i] = f i || i <- 1 to n }

not evaluate the (f i)'s concurrently?  Well, it turns out that the i's get
generated sequentially as a list (a list in the data structure sense, mind
you).
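
For concreteness, here is a rough Haskell rendering of the same
comprehension (my own sketch, not Id syntax); the index list [1..n] is a
genuine cons list, produced one cell at a time:

  import Data.Array

  -- build n f corresponds to { array (1, n) | [i] = f i || i <- 1 to n };
  -- the generator [1 .. n] is an ordinary cons list
  build :: Int -> (Int -> a) -> Array Int a
  build n f = array (1, n) [ (i, f i) | i <- [1 .. n] ]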

But there's no reason why we have to confine ourselves to using this "1 to n"
construct to generate the indices.  In fact, we can use any function we like
to generate the indices.  For example, we can rewrite the above expression as

	{ array (1, n) | [i] = f i || i <- gen 1 n }

where gen is some generator of our own creation.  But what are we going to
create?  We'd like to write something better than the O(n) generator that the
"1 to n" represents.  Unfortunately, I haven't been able to uncover a
O(1) or even O(log n) algorithm.  The difficulty is that the append
function is O(m), where m is the length of the first argument, so the best
I've been able to do is O(m).  This is no good.
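
To make the difficulty concrete, here is the obvious divide-and-conquer
attempt, sketched in Haskell (gen is my own toy function).  The two
recursive calls are independent, but the appends must put the pieces back
together, and the outermost append alone copies n/2 cells sequentially:

  -- divide-and-conquer index generator: the halves can be produced in
  -- parallel, but each level of the recursion pays an O(m) append, so
  -- the critical path stays O(n)
  gen :: Int -> Int -> [Int]
  gen lo hi
    | lo >  hi  = []
    | lo == hi  = [lo]
    | otherwise = gen lo mid ++ gen (mid + 1) hi
    where mid = (lo + hi) `div` 2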

Thorsten von Eicken suggested the following trick.  Instead of using one
generator to create the indices, use two!  I'm ignoring the bounds problem,
but this is the idea:

	{ array (1, n) | [i] = f i || j <- 1 to n by 8 &
                                      i <- j to j+7      }

The best you can hope for with this method is O(sqrt n), where n is the
length of the array: with step size s, the outer generator takes O(n/s)
steps and each inner generator takes O(s), and n/s + s is minimized at
s = sqrt n.  You can use more generators to get O(n^(1/3)), etc...
This is a nice solution, and I don't think you can do much better without
resorting to non-functional constructs (the "make_1d_i_array" function in Id).

Though this is a nice solution, it is still a trick, and I'm not very fond
of tricks.  It seems that the root of the problem is that the generator
produces a *list* of indices.  I say this is the problem because building
a list is an O(n) job and referencing the list is an O(n) job.  A list
is the wrong data structure.

Can anybody else suggest a better definition for array comprehensions?  What
about a "set" as a new data structure?

By the way, array comprehensions are a feature of Haskell, which is
supposed to be the new standard.  So this question is very relevant!

mike ashley 
jashley@lanl.gov

wg@tubopal.UUCP (Wolfgang Grieskamp) (07/17/90)

In article <51123@iuvax.cs.indiana.edu> jashley@silver.ucs.indiana.edu (J. Michael Ashley) writes:
>A little while ago I asked why array comprehensions are not parallel.  In
>particular, why does an array comprehension like this:
>
>	{ array (1, n) | [i] = f i || i <- 1 to n }
>
>not evaluate the (f i)'s concurrently?  Well, it turns out that the i's get
>generated sequentially as a list (a list in the data structure sense, mind
>you).

This index list is intermediate and an optimizing compiler can eliminate it 
using loop fusion.

Assume you have

  FUN .. : nat ** nat -> seq[nat] 
  DEF i..n == IF i>n  THEN <>
	      IF i<=n THEN i::((i+1)..n) FI

and

  FUN init: array[alpha] ** (nat -> alpha) ** seq[nat] -> array[alpha] 
  DEF init(A,f,S) == IF <>?(S) THEN A 
		     IF ::?(S) THEN init(update(A,hd(S),f(hd(S))),f,tl(S)) FI

and somewhere the expression

  ...  init(array(1,n),f,1..n) ...

as a low level representation of the array comprehension. 

Now the compiler may derive the new function:

  FUN init1: array[alpha] ** (nat -> alpha) ** nat ** nat -> array[alpha]

  DEF init1(A,f,i,n) == IF <>?(i..n) THEN A
			IF ::?(i..n) THEN init(update(A,hd(i..n),f(hd(i..n))),
					       f,tl(i..n)) FI

and the rule:

  LAW init(A,f,i..n) == init1(A,f,i,n)


Unfolding .. once in init1 and simplifying yields:

  DEF init1(A,f,i,n) == IF i>n  THEN A
			IF i<=n THEN init(update(A,hd(i::((i+1)..n)),
						   f(hd(i::((i+1)..n)))),
					  f,tl(i::((i+1)..n))) FI

			...
			IF i<=n THEN init(update(A,i,f(i)),f,(i+1)..n) 

Using the above rule you finally get:

  DEF init1(A,f,i,n) == IF i>n  THEN A
			IF i<=n THEN init1(update(A,i,f(i)),f,i+1,n) FI

(A compiler will probably not use this fold/unfold method,
because it depends too much on "eureka" steps.  Furthermore, the view
is based on the assumption that update is O(1).  Such an update
can be established by a sharing analysis for the given example.)

Regarding concurrent initialization, assume simplistically that init1
is not strict in its first argument.
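
For readers who prefer Haskell notation, the same before/after looks
roughly like this (my transliteration; Data.Array's (//) stands in for
update, although (//) copies the array instead of updating in O(1)):

  import Data.Array

  -- before fusion: fold the update over an explicit index list
  initA :: Array Int a -> (Int -> a) -> [Int] -> Array Int a
  initA a f []     = a
  initA a f (i:is) = initA (a // [(i, f i)]) f is

  -- after fusion: the intermediate index list is gone
  init1 :: Array Int a -> (Int -> a) -> Int -> Int -> Array Int a
  init1 a f i n
    | i > n     = a
    | otherwise = init1 (a // [(i, f i)]) f (i + 1) n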

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Wolfgang Grieskamp       
wg@opal.cs.tu-berlin.de        tub!tubopal!wg         wg%opal@DB0TUI11.BITNET

jashley@silver.ucs.indiana.edu (J. Michael Ashley) (07/17/90)

Wolfgang Grieskamp writes:
>In article <51123@iuvax.cs.indiana.edu> jashley@silver.ucs.indiana.edu (J. Michael Ashley) writes:
>>A little while ago I asked why array comprehensions are not parallel.  In
>>particular, why does an array comprehension like this:
>>
>>	{ array (1, n) | [i] = f i || i <- 1 to n }
>>
>>not evaluate the (f i)'s concurrently?  Well, it turns out that the i's get
>>generated sequentially as a list (a list in the data structure sense, mind
>>you).
>
>This index list is intermediate and an optimizing compiler can eliminate it 
>using loop fusion.

[ omitting the pleasantly detailed derivation of optimization ]

>Using the above rule you finally get:
>
>  DEF init1(A,f,i,n) == IF i>n  THEN A
>			IF i<=n THEN init1(update(A,i,f(i)),f,i+1,n) FI
>
>(A compiler will probably not use this fold/unfold method,
>because it depends too much on "eureka" steps.  Furthermore, the view
>is based on the assumption that update is O(1).  Such an update
>can be established by a sharing analysis for the given example.)
>
>Regarding concurrent initialization, assume simplistically that init1
>is not strict in its first argument.

Actually, this is precisely the optimization that the current Id compiler
performs: if it sees the construct "1 to n", then the compiler will
transform it into a loop.

Since we are using I-structures, the update is O(1) (without sharing analysis),
and init1 is not strict in its first argument.

This doesn't solve the problem, however.  Since we are using a dataflow
architecture, the function f will undergo partial evaluation to the point
where it must have its argument.  What this means is that if we have lots
and lots of processors, 90% of the work of creating the array is done by
the first iteration or two of init1.  The execution is now bottlenecked by
this sequential generation of the indices.

A slightly better compiler will generate the indices in O(log n) time
without any trouble.

But this doesn't solve the problem completely.  In the case of 
accumulator comprehensions, the programmer will probably not be using this
easy-to-optimize "1 to n" construct.  Instead, there will be some arbitrary
function (probably complicated) that *must return a list*.  Even the most
trivial numerical codes will be more than a match for any compiler 
optimization.  So we are again stuck with lots of parallelism when 
generating the elements of the list but bounded by an O(n) append to put
these results back together to form the final list.

Thomas Breuel suggested a couple of ways ("difference lists and/or pointer
jumping") to overcome the O(n) append problem.  This is kind of a hack, and
Thomas admits that.  I still hold to the position that lists are not the
right data structure for the indices.

mike ashley
jashley@lanl.gov

wg@tubopal.UUCP (Wolfgang Grieskamp) (07/17/90)

In article <1661@opal.tubopal.UUCP> wg@tubopal.UUCP (Wolfgang Grieskamp) writes:
> ...
>Regarding to concurrent initialization assume simplisticaly that init1
>is not strict in its first argument.
	 ^^^^
	 NONSENSE.  I wanted to say "lazy".



wg@tubopal.UUCP (Wolfgang Grieskamp) (07/18/90)

In article <51173@iuvax.cs.indiana.edu> jashley@silver.ucs.indiana.edu (J. Michael Ashley) writes:
>Wolfgang Grieskamp writes:
>
>>Using the above rule you finally get:
>>
>>  DEF init1(A,f,i,n) == IF i>n  THEN A
>>			  IF i<=n THEN init1(update(A,i,f(i)),f,i+1,n) FI
>>
>> [...]

>>Regarding to concurrent initialization assume simplisticaly that init1
>>is [[not strict]] lazy in its first argument.
>
>
>This doesn't solve the problem, however.  Since we are using a dataflow
>architecture, the function f will undergo partial evaluation to the point
>where it must have it's argument.  

If I get it right, this is a specific problem of dataflow architectures.  init1
fires only when *all* arguments are available, so the next recursion level
may only be reached when update(..) is available (it must be something like
this; I'm not so familiar with dataflow architectures).

Anyway, array initialization is inherently parallel not only in FORTRAN
but also in functional languages, and that should be pointed out.
You might imagine that the evaluation of init1 causes the creation of (n-i)+1
update "processes".  The result of update is not required to bound the
recursion, so init1 may terminate before any f(i) has been evaluated.  The
bottleneck appears if some f(i) is ready and update requires the updated
array of the i+1 recursion level (this is an inherent problem (also in
FORTRAN ...) which you can only solve with some complicated analysis
that ensures the commutativity of all performed updates).
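
Haskell makes this point easy to see; in the following sketch (names
mine) listArray builds the complete array spine at once, while every
element stays an unevaluated thunk, so "initialization" terminates
before any f(i) has run:

  import Data.Array

  -- the array exists and is indexable before any f i is evaluated;
  -- (lazyInit f 100) ! 3 forces only the thunk for f 3
  lazyInit :: (Int -> a) -> Int -> Array Int a
  lazyInit f n = listArray (1, n) [ f i | i <- [1 .. n] ]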


>In the case of 
>accumulator comprehensions, the programmer will probably not be using this
>easy-to-optimize "1 to n" construct.  Instead, there will be some arbitrary
>function (probably complicated) that *must return a list*.  

Fusion in the case of the discussed array comprehensions works for *any*
list producer with an arbitrary number of recursion and termination cases, not
only for "1 to n".  However, the (conceptual) fold step can only be applied
to cases which have an outermost list constructor (so if you construct
the index list by concatenating two sublists, it will fail for this
case).  Another problem is the compiler's runtime ... so we actually need
parallel machines to write nice compilers!

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
			     Wolfgang Grieskamp 
wg@opal.cs.tu-berlin.de        tub!tubopal!wg         wg%opal@DB0TUI11.BITNET

tom@stl.stc.co.uk (Tom Thomson) (07/19/90)

In article <51123@iuvax.cs.indiana.edu> jashley@silver.ucs.indiana.edu (J. Michael Ashley) writes:
>Can anybody else suggest a better definition for array comprehensions?  What
>about a "set" as a new data structure?
>
what's "new" about "set"?
Sets have been in Hope+ for a long time.
Perhaps the most interesting set of "set" facilities is in the
Hope+Unification variant of Hope+.

Tom Thomson

jashley@silver.ucs.indiana.edu (J. Michael Ashley) (07/19/90)

Wolfgang writes:
>>>Using the above rule you finally get:
>>>
>>>  DEF init1(A,f,i,n) == IF i>n  THEN A
>>>			  IF i<=n THEN init1(update(A,i,f(i)),f,i+1,n) FI
>>>Regarding to concurrent initialization assume simplisticaly that init1
>>>is [[not strict]] lazy in its first argument.

>If I get it right, this is a specific problem of dataflow architectures.  init1
>fires only when *all* arguments are available, so the next recursion level
>may only be reached when update(..) is available
[...]
>You might imagine that the evaluation of init1 causes the creation of (n-i)+1
>update "processes".  The result of update is not required to bound the
>recursion, so init1 may terminate before any f(i) has been evaluated.  The
>bottleneck appears if some f(i) is ready and update requires the updated
>array of the i+1 recursion level (this is an inherent problem (also in
>FORTRAN ...) which you can only solve with some complicated analysis
>that ensures the commutativity of all performed updates).

Put aside dataflow for a minute.  Just assume we have a lazy language.  Then
the question is not whether init1 is strict in its first argument, but
whether or not update is strict in its first argument.  I think it is, so
this simple version of init1 is sequential.

Using dataflow, it is certainly sequential!

It's an easy optimization to make init1 run in O(log n) time using a divide
and conquer algorithm.  This can be done since I-structures are being used to
represent arrays.  No analysis is necessary.
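
A sketch of the shape I mean, in Haskell notation (names mine).  With
I-structures each (i, f i) pair would be written directly into the array
and the two recursive calls could fire concurrently; the list and append
below are only stand-ins for those direct writes:

  import Data.Array

  -- divide-and-conquer initialization: the two recursive calls are
  -- independent, so the f i's are spawned along an O(log n) deep tree
  initDC :: (Int -> a) -> Int -> Int -> [(Int, a)]
  initDC f lo hi
    | lo >  hi  = []
    | lo == hi  = [(lo, f lo)]
    | otherwise = initDC f lo mid ++ initDC f (mid + 1) hi
    where mid = (lo + hi) `div` 2

  buildDC :: (Int -> a) -> Int -> Array Int a
  buildDC f n = array (1, n) (initDC f 1 n)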

>However, the (conceptual) fold step can only be applied
>to cases which have an outermost list constructor (so if you construct
>the index list by concatenating two sublists, it will fail for this
>case).

Constructing the index list by concatenating sublists is usually what happens,
so the analysis won't work.  At any rate, even if it could theoretically work,
I bet it would take forever to do the analysis on real codes.

mike ashley
jashley@lanl.gov

jgk@osc.COM (Joe Keane) (07/19/90)

I don't think array initialization such as the one described is necessarily
linear.  If we think of an array as a monolithic object, then yes the updates
must happen sequentially.  However if the array is somehow divided into
pieces, it is clear that we know which updates could affect a given piece, so
the pieces can be constructed independently, although it takes time to put
them together.

There are simple algorithms to do insertion and deletion in arrays, but they
become slow with large arrays.  To avoid linear time in such operations you
have to use more sophisticated data structures, using a divide-and-conquer
approach.  The array is hierarchically broken down into smaller pieces which
are better handled with simple algorithms.  This sort of indexing is done by
B-trees, 2-3 trees, splay trees, etc.

It is interesting that this same change also solves the parallelism problem.
The initialization of the array will follow the structure of the data
structure, because the different parts at each node in the hierarchy can be
done independently.

This does it in logarithmic time, which is the best we can do assuming that in
a given amount of time a computation can only fork off a limited number of
parallel sub-computations.
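
A toy version of the idea in Haskell (mine, and much simpler than a real
B-tree): a balanced binary tree with O(log n) indexing, whose two
subtrees can be constructed independently at every node.

  -- a tree-shaped "array"; the Int at each node is the size of the
  -- left subtree, so indexing is a single O(log n) descent
  data Tree a = Leaf a | Node Int (Tree a) (Tree a)

  buildT :: (Int -> a) -> Int -> Int -> Tree a
  buildT f lo hi
    | lo == hi  = Leaf (f lo)
    | otherwise = Node (mid - lo + 1) (buildT f lo mid) (buildT f (mid + 1) hi)
    where mid = (lo + hi) `div` 2

  indexT :: Tree a -> Int -> a
  indexT (Leaf x)      _ = x
  indexT (Node sz l r) i
    | i <= sz   = indexT l i
    | otherwise = indexT r (i - sz)

  -- indexT (buildT f 1 n) i == f i, for 1 <= i <= n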

fs@rex.cs.tulane.edu (Frank Silbermann) (07/19/90)

J. Michael Ashley:
>>		Can anybody else suggest a better definition
>>		for array comprehensions?  What about a
>>	"set" as a new data structure?

Tom Thomson:
>	what's "new" about "set"?
>	Sets have been in Hope+ for a long time.

It was my understanding that Hope's "set" abstractions
violate the axioms of set theory.
The elements of Hope's "sets" have an implied ordering
accessible to the programmer, and are therefore
more accurately described as _list_ comprehensions,
i.e. list processing with implicit backtracking.

	Frank Silbermann	fs@rex.cs.tulane.edu
	Tulane University	New Orleans, LA  70118

wg@opal.cs.tu-berlin.de (Wolfgang Grieskamp) (07/21/90)

jashley@silver.ucs.indiana.edu (J. Michael Ashley) writes:

>Put aside dataflow for a minute.  Just assume we have a lazy language.  Then
>the question now is not whether init1 is strict in its first argument, but
>whether or not update is strict in its first argument.  I think it is, so
>this simple version of init1 is sequential.

Yes, that's right: init1 is strict in its first argument if update
is (sigh).  But if we assume that the first argument of init1 is
evaluated lazily (what I wanted to say), then there IS no
reason why init1 should block on the calculation of update or
one of its arguments.  In general, sub-calculations may be parallelized
over recursion levels if the induction variable(s) do not depend on
them.

>using dataflow, it is certainly sequential!

Maybe.  You probably need a hybrid.  The pure dataflow model is just
(nice) theory.

>it's an easy optimization to make init1 run O(log n) using a divide and
>conquer algorithm.  this can be done since i-structures are being used to
>represent arrays.  no analysis is necessary.

O(log n) isn't really good.  It's mandatory to exploit (at least partially
available) linear memory.  The key to efficiency in imperative languages is
arrays with O(1) access and O(1) update, and efficiency (BTW, in imperative
languages too) makes the world go around.  You can have O(1) access in
functional languages, provided you have builtin or handcoded array structures,
and O(1) update in the lucky exclusive case, if you additionally have sharing
analysis or reference counting (I have heard that some people have observed
the lucky case in >60% of standard problems).

The main problem is not whether it is O(xyz) or inherently parallel, but the
underlying machine structure.  Given ($$$) MIMDs with shared memory,
it would be a good idea to parallelize the initialization maximally
(unless you know the initializer is trivial).  Given loosely coupled
MIMDs, you must carefully weigh the communication effort against the
initializer effort.  I guess that on l.c. MIMDs parallel array
initialization is an exotic case, and pipelining of streams is the
case of interest.

>constructing the index list by concatenating sublists is usually what happens,
>so the analysis won't work.  at any rate, even if it could theoretically work,
>i bet it would take forever to do the analysis on real codes.

Don't be so pessimistic, Michael!  It's not a question of forever, but, say,
10 user minutes against 0.5 for a medium-sized module.  For example, full
unification is much more expensive.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Wolfgang Grieskamp 		-fullOpt    => *compiler* thinks:
				"lazy or not lazy, this is here the question!"
wg@opal.cs.tu-berlin.de      tub!tubopal!wg            wg%opal@DB0TUI11.BITNET