pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/11/91)
On 10 Jan 91 16:23:01 GMT, pcg@cs.aber.ac.uk (myself) said:

[ ... usual verbose treatise with possibly useful & interesting but not
at all new things on value data flow patterns and implications for
architectures and their implementations ... ]

Rereading this article I have realized that there is a problem in the
regular rerunning of the 'register count' and 'aggressive optimization'
debates: I know something that, as far as I knew up to a couple of years
ago, nobody else knows, and this gives me, I suppose, some insight that
permeates the tone of my arguments without ever being made explicit.

I have an unpublished[1] paper on how to implement multiple parallel
assignment in the general case, which is something that apparently is
still not generally known or published. Multiple assignment is, I have
come to see, a very important thing, with far-reaching consequences in
at least two fields; I will discuss those for architecture[2]. I will
first present the problem, then a sketch of the solution, and then the
implications for system architecture.

Multiple parallel assignment can be written as something like

    L1, L2, L3 := R1, R2, R3

where the L's are lvalue-producing expressions, and the R's are
value-producing expressions. The previous values of all the L's are
replaced simultaneously and in parallel (at least conceptually) with the
corresponding R's. Its formalization is the trivial extension of the
single assignment one. Examples:

    x, y := w, z;           CO x := w; y := z; CO
    a, b, c := b, c, d;     CO slide a row of values CO
    a, b := b, a;           CO swap the values of a and b CO
    a, b, c := b, c, a;     CO rotate a row of values CO
    a[i], i := i, a[i];     CO quite a bit more difficult CO

As you can see there are four levels of difficulty: no dataflow
dependencies, L-to-R dataflow dependencies, L-to-L dataflow
dependencies, and both.
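For concreteness, the difficulty levels above can be exercised in a
language that happens to have a native (if not perfectly parallel)
version of the construct. A small demonstration in Python, whose tuple
assignment evaluates all the R's first and then stores the targets --
note that it stores targets left to right rather than truly in parallel,
so the last example works in this target order but is not
order-independent:

```python
w, z = 1, 2
x, y = w, z              # no dataflow dependencies
assert (x, y) == (1, 2)

a, b, c, d = 1, 2, 3, 4
a, b, c = b, c, d        # slide a row of values (L-to-R dependencies)
assert (a, b, c) == (2, 3, 4)

a, b = 1, 2
a, b = b, a              # swap (an L-to-R cycle)
assert (a, b) == (2, 1)

a, b, c = 1, 2, 3
a, b, c = b, c, a        # rotate (a longer L-to-R cycle)
assert (a, b, c) == (2, 3, 1)

a, i = [5, 6], 0
a[i], i = i, a[i]        # L-to-L and L-to-R dependencies together:
assert a == [0, 6]       # a[i] is stored while i still holds 0,
assert i == 5            # then i receives the old a[i]
```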
I have devised an obvious but complicated algorithm for expanding an
_arbitrary_ multiple assignment into an equivalent sequence (or multiple
sequences) of single assignments, with the use of the *minimum* number
of temporaries[3] necessary, or any other quantity desired, depending on
the desired degree of parallelism.

So that nobody asks me for the draft of the paper, here is an outline of
the algorithm: build two boolean matrixes, LR for L-to-R(-to-L)
dependencies, and LL for L-to-L dependencies; an element is true if
there is a dependency[4] between the two expressions. In other words, if
N is the arity of the multiple assignment:

    for i,j in 1..N,1..N
        LL[i,j] := (i=j | 0 | Li <influences the value of> Lj)
        LR[i,j] := (i=j | 0 | Li <influences the value of> Rj)

You now have the characteristic matrixes of two directed graphs. Apply
Dijkstra's cycle finding algorithm[5] to the matrixes, repeatedly, by
doing the row sums and emitting one of the assignments that have zero
row sum. If there aren't any, there is a cycle. If there are several row
sums with the same value, select one at random (they are all part of
cycles of the same degree), introduce a temporary, emit the relative
assignment, and restart. Whenever an assignment to Li is emitted, delete
the i'th row and column of LL and LR. That's all, in outline[6].

It is easy to prove that this generates the assignments in the right
order, and introduces the minimum number of temporaries. Note that the
order in which single assignments are emitted if more than one row has
zero row sum, or in which cycles are resolved if they have the same row
sum, is a matter for scheduling policy. Also, some single assignments
that have higher row sums than others can be emitted before those if
temporaries are introduced to reduce those row sums[7]. The temporaries
can be seen as staging places that uncouple two strands of
microparallelism that would otherwise get entangled.
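The outline above can be sketched as running code under a strong
simplifying assumption: every L and R is a plain variable name, so the
LL matrix is empty and "Li influences Rj" degenerates to name equality
(a legitimately weak dependency predicate, in the spirit of note [4]).
The cycle is cut by staging the old value of one left-hand side in a
temporary, which is one way of realizing the "introduce a temporary,
emit the relative assignment, and restart" step; the function name
`expand` is mine, not the paper's.

```python
def expand(lhs, rhs):
    """Expand "lhs := rhs" (simple names only) into an ordered list
    of single assignments, with the minimum number of temporaries."""
    lhs, rhs = list(lhs), list(rhs)
    pending = list(range(len(lhs)))
    out, ntmp = [], 0
    while pending:
        # Row sum of i: how many pending R's still read Li's old value.
        rowsum = {i: sum(lhs[i] == rhs[j] for j in pending if j != i)
                  for i in pending}
        zeros = [i for i in pending if rowsum[i] == 0]
        if zeros:
            # Nobody still reads Li: its assignment can be emitted.
            i = zeros[0]
            if lhs[i] != rhs[i]:            # drop identity assignments
                out.append(f"{lhs[i]} := {rhs[i]}")
            pending.remove(i)
        else:
            # Every pending Li is still read somewhere: a cycle.
            # Cut it by staging one old value in a temporary (the
            # paper picks among equal row sums; any member works here).
            i = pending[0]
            t = f"T{ntmp}"; ntmp += 1
            out.append(f"{t} := {lhs[i]}")
            rhs = [t if r == lhs[i] else r for r in rhs]
    return out
```

On the examples above, the slide needs no temporaries and the rotation
needs exactly one:

```python
assert expand(["a", "b", "c"], ["b", "c", "d"]) == \
    ["a := b", "b := c", "c := d"]
assert expand(["a", "b", "c"], ["b", "c", "a"]) == \
    ["T0 := a", "a := b", "b := c", "c := T0"]
```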
Now the implications for system architecture, that is, microparallelism.

I claim that a very large percentage of existing code can be recoded
with multiple assignments, with a large benefit in clarity. Anybody can
think of examples[8]. I also claim that in all these important cases
programmers are forced to hand-expand the underlying multiple assignment
into a series of single assignments, and that a large part, or even all,
of the dataflow analysis done by optimizers is about reversing the
process and rebuilding the underlying multiple assignment, and possibly
reexpanding it in a different form.

If you look at the world like this, you find that instruction
scheduling, the maximum degree of microparallelism, caching in registers
across statements, common subexpression elimination, and many other
issues are just a consequence of expanding multiple parallel assignment
into single assignments, to be executed either serially or in parallel.

It is obvious, at least to me, that a lot (if not all!) of so-called
aggressive optimization and clever register usage, as opposed to
competent code generation, is the regrettable result of multiple
assignment not being in the language and being haphazardly reinvented
every time by aggressive optimizers, and other funnily misapplied things
like graph coloring. With the double hazard that the programmer is
writing a very simple and easily understood multiple assignment as its
*implementation*, a sequence of single assignments, and the compiler is
trying to reverse engineer that.
It is also obvious to me, having seen the trivially easy but complicated
"proof" that the above described algorithm works, that in many, many
cases both the programmer and the optimizer must "prove" that the
sequence of single assignments is indeed equivalent to the underlying
multiple assignment, which is tedious and complicated, when it could be
done in a reliable way as part of code generation if languages were
designed with multiple assignment built in[9], and with this algorithm
in a "tunable" version as to the scheduling of the single assignments,
and as to the number of temporaries to use and when. Especially on
multiple stack superscalar machines[10], whether the stacks are
uselessly implemented as portions of a flat register file or not.

I had to get this off my chest, sooner or later. Most probably others
have found the same algorithm, but if so I have never seen any
indication that they attributed the same importance to it as I did when
I considered all the implications[11]. If you think that they are less
important than I think, reconsider; for example, read Suzuki's paper, or
try to rewrite your favourite algorithms and inner loops using multiple
assignments, and wonder how the algorithm could expand them on different
architectures.

================

[1] Submitted for publication almost five years ago to the CJ, accepted
with revisions, revisions never done :-(. I have not been in the right
environment for doing or even polishing research.

[2] The other one is program formalization and correctness, as in
stateless or stateful languages.

[3] The general form of multiple assignment can always be implemented by
calculating each of the L's and each of the R's into a temporary, and
then performing the single assignments between the temporaries in any
order. This is what binding based, functional languages do.

[4] The usual wise guys will remark that this is undecidable.
In practice any weaker predicate will do, and will generate, for the
simple expressions likely to be found in practice, good results.

[5] The referee for my paper pointed out that I was really using
topological sorting, thus missing the point entirely. Topological
sorting and Dijkstra's cycle finding algorithm are equivalent (they both
essentially deal with the transitive closure of a partial order
operator), modulo the representations of the graph, but one has as its
goal finding the cycles, while the other fails when there are any. Very
different flavour.

[6] There are additional obvious complications in selecting the optimal
cycle reduction order if there are cycles in both the LR and LL
matrixes, and other scheduling details, and handling cases like

    a, a := b, b    CO well defined meaning CO
    a, a := b, c    CO nondeterministic or illegal? CO

I did a sketchy, minimal and dumb implementation for Johnson's PCC, many
years ago, in about one day. I think that adding it to any compiler
would take about as much time. I have lost PCC's patches, and have not
had time to add it to GNU cc or other similar technology. Sorry.

[7] For example, consider "a, b := b, c". The LL matrix is all 0, and
the LR matrix is

    0 0
    1 0

We can immediately write "L1 := R1", and then the matrix collapses, and
we can write "L2 := R2", so we get the sequence

    a := b;
    b := c;

If we do not like this scheduling, we can extend the matrix with the row
and column for a temporary (note that a temporary never influences any L
or R expression), and thus we can generate instead either of the
sequences

    T1 := b; a := T1; b := c;   CO quite silly really CO
    T1 := b; b := c; a := T1;   CO scheduling inversion CO

[8] Two particular forms of multiple assignment, sliding and rotation,
are illustrated in an article in CACM, middle seventies, by Suzuki.
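To make note [7] concrete, the three schedulings can be checked
mechanically against a reference semantics; here Python's native tuple
assignment plays the role of the parallel form, and all three leave
(a, b) in the same final state:

```python
a, b, c = 1, 2, 3
a, b = b, c              # the multiple assignment itself
assert (a, b) == (2, 3)

a, b, c = 1, 2, 3
a = b; b = c             # the zero-temporary expansion "a := b; b := c"
assert (a, b) == (2, 3)

a, b, c = 1, 2, 3
T1 = b; b = c; a = T1    # the "scheduling inversion" with a temporary
assert (a, b) == (2, 3)
```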
[9] Please, not as in functional languages or languages with futures and
other things that imply that you always use the maximum number of
temporaries every time, as in function bindings passed by value.

[10] If you think of most applications rewritten with multiple
assignments, and of their dependency matrixes, it is easy to see that
these multiple assignments would capture essentially all interesting
microparallelism, and that they would still not be terribly complicated;
rarely would they be more than four way and more than four deep each
way. Even when grouped together in multiple assignments, the average
expressions would still be pretty simple. Except for things like
operations on a whole vector/matrix (or other simple and regular
structure), but then SIMD/MIMD is the right answer, not
microparallelism. Massive, massive parallelism, which has its own, quite
different, set of problems.

[11] I stumbled upon it; I just wanted to add it to a language I was
doing an implementation for, and was irked that all the references I
could find at the time (for example Gries, still in his recent books)
said that the general problem, with arrays and both LL and LR
dependencies, was intractable. On examining the algorithm when it was
done I started to realize its implications for program writing and code
generation.
--
Piercarlo Grandi | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/12/91)
On 10 Jan 91 19:44:01 GMT, pcg@cs.aber.ac.uk (myself) said:

[ ... how to reduce a multiple assignment to one or more sequences of
single assignments ... ]

pcg> So that nobody asks me for the draft of the paper, here is an
pcg> outline of the algorithm: build two boolean matrixes, LR for
pcg> L-to-R(-to-L) dependencies, and LL for L-to-L dependencies; an
pcg> element is true if there is a dependency[4] between the two
pcg> expressions.

pcg> [4] The usual wise guys will remark that this is undecidable. In
pcg> practice any weaker predicate will do, and will generate, for the
pcg> simple expressions likely to be found in practice, good results.

Some of the usual wise guys have written to me to remark that aliasing
may pose a problem. I had clearly hinted that this need not be so, but
it seems that discussing the aliasing issue throws up more interesting
language and code generation issues, so here is a followup posting.
Followups to this are redirected to comp.lang.misc, because the system
architecture content is not that great here.

There are *two* types of aliasing we might be interested in. Let's see a
few examples:

    a, a := 1, 1;   CO semantics well defined CO
    a, a := 2, 3;   CO nondeterministic or illegal? CO
    a, a := b, c;   CO either of the above CO

This illustrates *value* aliasing. The first example is
noncontroversial; the second is not, because we have a choice of either
faulting it as ambiguous, or of accepting a nondeterministic result (my
preference, but the compiler should issue a warning); the third is
equivalent to either, depending on whether "b = c".

    a[i], a[j] := a[j], a[i];
    a[i], a[j] := a[m], a[n];

This illustrates *lvalue* or reference aliasing. The semantics of the
first line are obviously well defined, for when "i = j" it is an
identity assignment, and otherwise it is a swap. The second line is
harder, as it involves *both* value aliasing, as maybe "i = j" but
"a[m] /= a[n]", and lvalue aliasing (if "j = m" or "i = n", ...).
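For what it is worth, a language that did choose the nondeterministic
reading of value aliasing can be observed directly. Python stores
assignment targets left to right, so the repeated target quietly keeps
the last value, and the well-defined lvalue aliasing case behaves as
described above:

```python
a = 0
a, a = 2, 3              # value aliasing: last store wins here
assert a == 3

b = c = 7
a, a = b, c              # "a, a := b, c" with b = c: well defined
assert a == 7

a = [7, 8]
i = j = 0
a[i], a[j] = a[j], a[i]  # lvalue aliasing: i = j makes it an identity
assert a == [7, 8]
```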
Three "solutions" are possible:

1) Using a suitably weak definition of the dependency predicate
guarantees that all these cases are handled correctly, at the price of
generating some temporaries that are unnecessary when aliasing does not
occur dynamically. For example, the examples above can be implemented
as:

    a, a := 1, 1;               a := 1;

    a, a := 2, 3;               a := 3;

    a, a := b, c;               a := b;

    a[i], a[j] := a[j], a[i];   R0 := a[j];
                                a[j] := a[i];
                                a[i] := R0;

    a[i], a[j] := a[m], a[n];   R0 := a[m];
                                R1 := a[n];
                                a[i] := R0;
                                a[j] := R1;

2) If the programmer knows something that cannot possibly be deduced by
the code that implements the dependency predicate, such as aliasing
information in languages like C, there is always the option of manually
expanding the multiple assignment as single assignments. For example the
programmer may _know_ that "j /= m" and thus rewrite:

    a[i], a[j] := a[m], a[n];   a[j] := a[n];
                                a[i] := a[m];

If the programmer knows that "i /= n" the other rewriting becomes
possible:

    a[i], a[j] := a[m], a[n];   a[i] := a[m];
                                a[j] := a[n];

3) A far more interesting solution is to add some cleverness to the
compiler. I mean, real cleverness, not like current "optimizers". The
idea is that compilers are symbolic reduction engines in the domain of
code sequences -- let's exploit that and have the dependency predicate
return not truth values but formulas with free variables. The dependency
matrixes would then imply a family of actual dependency matrixes, and
code could be generated for each. This we could call "Schroedinger's cat
code generation".
For example, the (pessimistic) dependency matrixes for

    a[i], a[j] := a[m], a[n];

would be

    LL  0 0     LR  0 1
        0 0         1 0

while the symbolic/feline ones would be

    LL  0 0     LR  0       (i = n)
        0 0         (j = m) 0

By exploding the symbolic LR matrix for the possible values of the
formulas in it, we could have the code generator generate code for:

    a[i], a[j] := a[m], a[n];

as in:

    IF i = n THEN
        CO LR matrix collapses to: ((0,1),(j=m,0)) CO
        IF j = m THEN
            CO equivalent to: a[i], a[j] := a[j], a[i];
               LR matrix is: ((0,1),(1,0)) CO
            R0 := a[j]; a[j] := a[i]; a[i] := R0
        ELSE
            CO equivalent to: a[i], a[j] := a[m], a[i];
               LR matrix is: ((0,1),(0,0)) CO
            a[j] := a[i]; a[i] := a[m]
        FI
    ELSE
        CO as above, mutatis mutandis CO
    FI

Note that by going really far, we could even turn "a := b" into
something like "(a /= b | a := b)", which may even (improbably) make
sense for some implementations of some architectures.

Now some comments.

Alternative 1) above is IMNHO perfectly adequate. I do expect that the
vast majority of cases will involve no potential aliasing, so no
possibly excessive temporaries will be used. And even when they are
used, they will probably make little difference to runtime.

Alternative 2) above will be used when speed is absolutely critical and
the programmer wants to invest the mental effort of second guessing the
code generator. Hardly ever necessary, I hope. Note that the frequent
advice to C programmers to offer greater opportunities to the
optimizers, in the (fortunate) absence of C ways to inform the optimizer
of aliasing hazards, by using explicit temporaries, is equivalent to
this alternative.

Alternative 3) is intriguing, because it implies dynamic conditional
compilation. For example, the code sequences relative to aliasing, if it
is expected to be rare, could be offlined to some remote code section,
to improve locality. More than a hint of trace scheduling here, of
course.
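Returning to solution 3): the branch structure above can be written out
by hand and checked, again using Python as a stand-in for the emitted
code. The function and variable names are illustrative, not from the
original posting; each branch runs the cheapest expansion that is valid
for its aliasing pattern:

```python
def multi_assign(a, i, j, m, n):
    """Run-time dispatched code for "a[i], a[j] := a[m], a[n]"."""
    if i == n:
        if j == m:
            # LR = ((0,1),(1,0)): the swap, one temporary needed.
            R0 = a[n]; a[i] = a[m]; a[j] = R0
        else:
            # LR = ((0,1),(0,0)): store a[j] first, no temporary.
            a[j] = a[n]; a[i] = a[m]
    else:
        # LR = ((0,0),(j=m,0)): storing a[i] first is always safe,
        # since a[m] is read before the a[j] store can clobber it.
        a[i] = a[m]; a[j] = a[n]

v = [1, 2]
multi_assign(v, 0, 1, 1, 0)      # the classic swap: i = n and j = m
assert v == [2, 1]

v = [1, 2, 3, 4]
multi_assign(v, 0, 1, 2, 3)      # no lvalue aliasing at all
assert v == [3, 4, 3, 4]
```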
Interesting avenues of research, only a few of which have already been
explored (trace scheduling, lazy or speculative execution, on the fly
code generation for BITBLT, the concept of a supercompiler, ...), would
open up if we considered interpretation and compiling as an experimental
science, and execution and compilation as performing or setting up
experiments to collapse the state vector. Deep thinking, man.

================================================================

Now a disclaimer that I feel required to make, after some comments on
the "Summary:" line of the previous posting:

All the research underpinning this and the previous postings has no
relationship whatsoever to the research of the University College of
Wales. The research that I discuss here is *entirely* the result of my
own work, using my own funds, on my own time; its contents are my
responsibility alone, and should not reflect on any assessment of the
research or reputation of the University College of Wales. In particular
the investigation of multiple assignment was done years before I had
even heard of the University College of Wales' existence.

If the above disclaimer were not true, I would have had to seek written
permission from my Head of Department before posting the multiple
assignment technology, and it could have been (probably, IMNHO) refused,
as it would have been my contractual duty to point out that it was
patentable subject matter susceptible of commercial exploitation.

I wish to thank the University College of Wales for their generous
permission to use without payment their News connection to discuss in
this forum my own personal research (if any :->).
--
Piercarlo Grandi | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
preston@ariel.rice.edu (Preston Briggs) (01/14/91)
pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

>    a[i], a[j] := a[m], a[n];
>
>it involves *both* value aliasing, as maybe "i = j" but "a[m] /= a[n]",
>and lvalue aliasing (if "j = m" or "i = n", ...).
>
>Three "solutions" are possible:
>
>1) Conservative implementation.
>
>2) Non-deterministic implementation; force the programmer to be
>specific.

I'd say this is similar to C.

>3) A far more interesting solution is to add some cleverness to the
>compiler. I mean, real cleverness, not like current "optimizers".

and shows a solution that involves generating multiple versions of the
assignment, with tests of i, j, m, and n to distinguish between the
various versions.

I don't like this for several reasons.

a) Where's the profit? We save a register at best over the conservative
solution in (1). The cost, though, is 1 or 2 tests at run-time. A branch
or 2 per register saved is no deal.

b) Tools already exist to help disambiguate array references, e.g.,
dependence analysis. While dependence analysis is not always precise, it
is conservative (safe) and the technology is improving all the time.
Further, if you've taken the time to build a dependence analyzer, you
can use it to support more extensive and more profitable
transformations.

and finally

c) I don't like so much special-purpose machinery (compiler code)
devoted to one programming language construct. I much prefer
general-purpose optimizations that help over the entire language.

Preston Briggs
pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/14/91)
On 13 Jan 91 20:20:53 GMT, preston@ariel.rice.edu (Preston Briggs) said:
preston> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
pcg> a[i], a[j] := a[m], a[n];
pcg> it involves *both* value, as maybe "i = j" but "a[m] /= a[n]", and
pcg> lvalue (if "j = m" or "i = n", ...) aliasing.
pcg> Three "solutions" are possible:
pcg> 1) [ ... assume aliasing will occur and use temporaries ... ]
pcg> 2) [ ... like 1), but encourage the programmer to use specific
pcg> knowledge, if available ... ]
pcg> 3) A far more interesting solution is to add some cleverness to the
pcg> compiler. I mean, real cleverness, not like current "optimizers".
preston> and shows a solution that involves generating multiple versions
preston> of the assignment, with tests of i, j, m, and n to distinguish
preston> between the various versions.
preston> I don't like this for several reasons.
I don't like it either :-). I prefer 1) or 2), as I had written fairly
clearly. Yet even 3) is defensible (and I will defend it ex officio),
and I added it for the sake of completeness. It is all the more
defensible if the compiler is already implemented as a symbolic
reduction engine.
Some of your objections that follow look like objections to multiple
assignment in itself, not to compilers as symbolic reduction. The two
issues are totally independent; I was just illustrating that aliasing
resolution by generating multiple code sequences, a fairly fashionable
thing, can be done simply if the multiple assignment expander is
embedded in a symbolic reduction compiler.
preston> a) Where's the profit? We save a register at best over the
preston> conservative solution in (1). The cost though, is 1 or 2 tests
preston> at run-time. A branch or 2 per register saved is no deal.
In this contrived example the benefit is next to nil and the cost,
especially in code space, is large.
But I would argue that with (much rarer!) more complex assignments and
expressions the ability to generate different code sequences and have
aliasing detected at run time is interesting, especially as the code
sequences compiled for the case where there is no aliasing can be
dramatically more efficient on heavily pipelined machines (non general
purpose processors), and usually aliasing does not occur.
But this is an argument that applies to any flavour of runtime aliasing
resolution, whether produced by multiple assignment, by symbolic
reduction, or by both.
Moreover I am not persuaded that aliasing is really as bad as it is
described, for general purpose machines, but 3) offers a clean, clear
framework for dealing with it, where it matters (assignments), if you
think it is important.
preston> b) Tools already exist to help disambiguate array references,
preston> e.g., dependence analysis. While dependence analysis is not
preston> always precise, it is conservative (safe)
Here indeed we get into the meat of the argument. The technology in 3)
is vastly simpler, and as a consequence more reliable, than any of the
analyzers that you mention, because it is strictly local (aliasing
detection is done at run time), and relies on the very clean and simple
semantics of multiple assignment.
The dependency analyzers that you mention not only have to detect
aliasing at compile time with a global view of the program, an often
impossible task; they also have to do double work: they have, in essence, to
take a sequence of single assignments, reconstruct the multiple
assignment that is equivalent to the sequences, do the global analysis,
and then reexpand everything again. Compare:
A) the programmer writes a multiple assignment without caring about
order of execution or aliasing or data dependencies; the code generator
expands the multiple assignment once and for all deferring alias
detection at run time and uses a small algorithm that can be proven to
work for all possible cases to generate the various alternatives.
B) the programmer writes a sequence of single assignments that implement
the multiple assignments, making sure that the sequence is one of those
equivalent to the underlying multiple assignment, and taking care of
resolving manually the dependency and aliasing issues for that particular
sequence. The "optimizer" undoes all this, reconstructs the original
multiple assignment and its characteristic graphs, analyzes the whole
program hoping to resolve aliasing statically, and then reexpands the
multiple assignment using ad-hoc technology that needs constant
debugging.
preston> and the technology is improving all the time.
Why bother improving a hacked-up technology when you have a ready-made,
simple, dependable alternative? Dusty decks, that's the only possible
answer.
preston> Further, if you've taken the time to build a dependence
preston> analyzer,
I'd rather spend one tenth of the time on a fast and furious
implementation of a simple technology that has important theoretical and
practical underpinnings than to hack up an "optimizer" and dependency
analyzer and spend the next few years debugging it :-).
preston> you can use it to support more extensive and more profitable
preston> transformations.
No, because here we are comparing (if I read you correctly) a global
analyzer that tries to resolve aliasing statically using global
analysis, and generates pessimistic code if it cannot, with one that
does only a local analysis and leaves alias resolution to run time and
generates both optimistic and pessimistic code.
Please note however that I think that local static analysis and
pessimistic code generation are usually better than either, because they
are simpler and nearly as good in a vast majority of cases.
preston> and finally
preston> c) I don't like so much special-purpose machinery (compiler code)
preston> devoted to one programming language construct.
Hey, this is one of my lines; I hate big compilers.
As to the "one programming language construct", almost all time is spent
in assignments and expressions, and it is my contention that multiple
assignment captures essentially all the interesting cases. I cannot give
you much "proof" for this contention, yet; I appeal to your experience.
preston> I much prefer general-purpose optimizations that help over
preston> the entire language.
As in the above lines, there are reasons to prefer local dynamic
aliasing resolution to global static analysis, in simplicity and
efficiency.
However I am myself skeptical that symbolic, multiway expansion of
multiple assignment is a win for general purpose computers, but I
concede that there are good arguments that it can be important in some
cases (I am thinking of inlining and similar things, and improving
locality and pipelinability by offlining complex and rarely executed
cases); I am just pointing out that it can be done elegantly and cheaply
by embedding the multiple assignment expander in a symbolic reduction
compiler engine.
So we agree in that I would not choose 3) over 1) or 2) except in
special circumstances; we disagree in that I don't think that ad-hoc,
global, static data dependency and "optimization" is a cost effective
alternative, except for "dusty decks".
If one designs a language appropriately, a lot of "optimization" is
simply not needed. The one example I have advanced is designed to show
that with multiple assignment one can do a lot of scheduling and
aliasing resolution without requiring complex analysis either by the
programmer or by an "optimizer" that second guesses the programmer. Such
an "optimizer" is not needed for optimal alias resolution either, if one
goes the dynamic route, which has its attractions.
--
Piercarlo Grandi | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk