[comp.lang.misc] Register Count

pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/11/91)

On 10 Jan 91 16:23:01 GMT, pcg@cs.aber.ac.uk (myself) said:

    [ ... usual verbose treatise with possibly useful & interesting
    but not at all new things on value data flow patterns and
    implications for architectures and their implementations ... ]

Rereading this article I have realized that there is a problem in the
regular rerunning of the 'register count' and 'aggressive optimization'
debates: I know something that, as far as I knew up to a couple of years
ago, nobody else knows, and this gives me, I suppose, some insight that
permeates the tone of my arguments without ever being made explicit.

I have an unpublished[1] paper on how to implement multiple parallel
assignment in the general case, which is something that apparently is still
not generally known or published. Multiple assignment is, I have come to
see, a very important thing, with far-reaching consequences in at least two
fields; here I will discuss the consequences for architecture[2]. I will
first present the problem, then a sketch of the solution, and then the
implications for system architecture.

Multiple parallel assignment can be written as something like

	L1, L2, L3 := R1, R2, R3

Where the L's are lvalue producing expressions, and the R's are value
producing expressions. The previous values of all the L's are replaced
simultaneously and in parallel (at least conceptually) with the
corresponding R's. Its formalization is the trivial extension of the
single assignment one.

Examples:

	x, y := w, z;		CO x := w; y := z; 		CO
	a, b, c := b, c, d;	CO slide a row of values	CO
	a, b := b, a;		CO swap the values of a and b 	CO
	a, b, c := b, c, a;	CO rotate a row of values	CO
	a[i], i := i, a[i];	CO quite a bit more difficult 	CO

As you can see there are four levels of difficulty: no dataflow dependencies,
L-to-R dataflow dependencies, L-to-L dataflow dependencies, and both.
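
Languages with built-in tuple assignment give a cheap way to experiment with
these semantics. The following lines are my addition (in Python rather than
the notation above); Python evaluates all the right hand sides before any
store and then assigns the targets left to right, which for these particular
examples coincides with the parallel reading ("last one wins" would resolve
conflicting targets):

	a, b, c, d = 1, 2, 3, 4
	a, b, c = b, c, d        # slide a row of values  -> a, b, c = 2, 3, 4
	a, b = b, a              # swap                   -> a, b = 3, 2
	i, arr = 0, [10, 20]
	arr[i], i = i, arr[i]    # the difficult case     -> arr = [0, 20], i = 10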

I have devised an obvious but complicated algorithm for expanding an
_arbitrary_ multiple assignment into an equivalent sequence (or multiple
sequences) of single assignments, with the use of the *minimum* number of
temporaries[3] necessary, or any other quantity desired, depending on the
desired degree of parallelism.

So that nobody asks me for the draft of the paper, here is an outline of the
algorithm: build two boolean matrixes, LR for L-to-R(-to-L) dependencies,
and LL for L-to-L dependencies; an element is true if there is a
dependency[4] between the two expressions. In other words, if N is the arity
of the multiple assignment:

    for i,j in 1..N,1..N
	LL[i,j] := (i=j | 0 | Li <influences the value of> Lj)
	LR[i,j] := (i=j | 0 | Li <influences the value of> Rj)

You now have the characteristic matrixes of two directed graphs.
Apply Dijkstra's cycle finding algorithm[5] to the matrixes, repeatedly,
by doing the row sums and emitting one of the assignments that have a
zero row sum. If there aren't any, there is a cycle; if several rows
have the same row sum, select one at random (they are all part of
cycles of the same degree), introduce a temporary, emit the corresponding
assignment, and restart. Whenever an assignment to Li is emitted,
delete the i'th row and column of LL and LR. That's all, in outline[6].

It is easy to prove that this generates the assignments in the right
order, and introduces the minimum number of temporaries.

Note that the order in which single assignments are emitted if more than
one row has zero row sum, or in which cycles are resolved if they have
the same row sum, is a matter for scheduling policy. Also, some single
assignments that have higher row sums than others can be emitted before
those if temporaries are introduced to reduce those row sums[7].  The
temporaries can be seen as staging places that uncouple two strands of
microparallelism that would otherwise get entangled.
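
For the curious, here is a rough executable rendition of the outline above.
It is a Python sketch of my own, not the code from the paper: it takes the
dependency matrix as given, handles only the value (L-to-R) dependencies,
and makes the naive scheduling and cycle-breaking choices.

	# Expand L1,...,Ln := R1,...,Rn into single assignments, given LR,
	# where LR[i][j] is true when storing into lvals[i] would change the
	# value of rvals[j] (diagonal already forced to 0, as in the outline).
	def expand(lvals, rvals, LR):
	    n = len(lvals)
	    live = set(range(n))              # assignments not yet emitted
	    rhs = list(rvals)                 # R texts, possibly replaced by temps
	    lr = [list(row) for row in LR]    # working copy of the matrix
	    out, ntemp = [], 0
	    while live:
	        # rows whose L no longer threatens any remaining R are safe to emit
	        ready = [i for i in sorted(live) if not any(lr[i][j] for j in live)]
	        if ready:
	            i = ready[0]                          # scheduling policy goes here
	            out.append(f"{lvals[i]} := {rhs[i]}")
	            live.remove(i)
	        else:
	            # every remaining row is on a cycle: buffer one R in a temporary
	            i = min(sorted(live), key=lambda k: sum(1 for j in live if lr[k][j]))
	            t = f"T{ntemp}"
	            ntemp += 1
	            out.append(f"{t} := {rhs[i]}")
	            rhs[i] = t                            # nothing influences a fresh temp
	            for k in range(n):
	                lr[k][i] = 0                      # clear column i
	    return out

	# a, b, c := b, c, a  (rotate a row of values): one temporary suffices
	print(expand(["a", "b", "c"], ["b", "c", "a"],
	             [[0, 0, 1], [1, 0, 0], [0, 1, 0]]))
	# -> ['T0 := b', 'b := c', 'c := a', 'a := T0']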

Now the implications for system architecture, that is microparallelism.

I can claim that a very large percentage of existing codes can be
recoded with multiple assignments with large benefit in clarity. Anybody
can think of examples[8]. I also claim that in all these important cases
programmers are forced to hand expand the underlying multiple assignment
into a series of single assignments, and that a large part, or even all, of
the dataflow analysis of optimizers is about reversing the process and
rebuilding the underlying multiple assignment, and possibly reexpanding
it in a different form.

If you look at the world like this, you find that instruction scheduling,
the maximum degree of microparallelism, caching in registers across
statements, common subexpression elimination, and many other issues are
just a consequence of expanding multiple parallel assignment into single
assignments, to be executed either serially or in parallel.

It is obvious, at least to me, that a lot (if not all!) of so called
aggressive optimization and clever register usage, as opposed to competent
code generation, is the regrettable result of multiple assignment not being
in the language and being haphazardly reinvented every time by aggressive
optimizers, and other funnily misapplied things like graph coloring. With
the double hazard that the programmer is writing a very simple and easily
understood multiple assignment as its *implementation*, a sequence of
single assignments, and the compiler is trying to reverse engineer that.

It is also obvious to me, having seen the trivially easy but complicated
"proof" that the above described algorithm works, that in many many cases
both the programmer and the optimizer must "prove" that the sequence of
single assignments is indeed equivalent to the underlying multiple
assignment, which is tedious and complicated. It could be done in a reliable
way as part of code generation if languages were designed with multiple
assignment built in[9] and with this algorithm in a "tunable" version, both
as to the scheduling of the single assignments and as to the number of
temporaries to use and when.

Especially on multiple stack superscalar machines[10], whether the stacks are
uselessly implemented as portions of a flat register file or not.

I had to get this off my chest, sooner or later. Most probably others have
found the same algorithm, but if so I have never seen any indication that
they attributed the same importance to it as I did when I considered all the
implications[11]. If you think that they are less important than I think,
reconsider; for example, read Suzuki's paper, or try to rewrite your
favourite algorithms and inner loops using multiple assignments, and wonder
how the algorithm could expand them on different architectures.

================

[1] Submitted for publication almost five years ago to the CJ, accepted with
revisions, revisions never done :-(. I have not been in the right environment
for doing or even polishing research.

[2] The other one is program formalization and correctness, as in stateless
or stateful languages.

[3] The general form of multiple assignment can always be implemented by
calculating each of the L's and each of the R's into a temporary, and
then performing the single assignments between the temporaries in any
order. This is what binding based, functional languages do.
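
A two-line rendition of this always-safe expansion, as a Python sketch of my
own; for brevity it buffers only the R values, whereas the footnote would
also buffer the computed lvalues, which matters when the L expressions
depend on one another:

	# Maximum-temporaries expansion: evaluate every R first, then store.
	def expand_all_temps(lvals, rvals):
	    temps = [f"T{k}" for k in range(len(rvals))]
	    return ([f"{t} := {r}" for t, r in zip(temps, rvals)] +
	            [f"{l} := {t}" for l, t in zip(lvals, temps)])

	print(expand_all_temps(["a", "b"], ["b", "a"]))
	# -> ['T0 := b', 'T1 := a', 'a := T0', 'b := T1']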

[4] The usual wise guys will remark that this is undecidable. In
practice any weaker predicate will do, and will generate, for the simple
expressions likely to be found in practice, good results.

[5] The referee for my paper pointed out that I was really using topological
sorting, thus missing the point entirely. Topological sorting and Dijkstra's
cycle finding algorithm are equivalent (they both essentially deal with the
transitive closure of a partial order operator), modulo the representations
of the graph, but one has as goal finding the cycles, the other fails when
there are any. Very different flavour.

[6] There are additional obvious complications in selecting the optimal
cycle reduction order if there are cycles in both the LR and LL
matrixes, and other scheduling details, and handling cases like

	a, a := b, b		CO well defined meaning		CO
	a, a := b, c		CO nondeterministic or illegal?	CO

I did a sketchy, minimal and dumb, implementation for Johnson's PCC,
many years ago, in about one day. I think that adding it to any compiler
would take about as much time. I have lost PCC's patches, and have not
had time to add it to GNU cc or other similar technology. Sorry.

[7] For example, consider "a, b := b, c". The LL matrix is all 0, and
the LR matrix is

	0 0
	1 0

We can immediately write "L0 := R0", and then the matrix
collapses, and we can write "L1 := R1", so we get the sequence

	a := b; b := c;

If we do not like this scheduling, we can extend the matrix with the row and
column for a temporary (note that a temporary never influences any L or R
expression), and thus we can generate instead either of the sequences

	T1 := b; a := T1; b := c;	CO quite silly really	CO
	T1 := b; b := c; a := T1;	CO scheduling inversion	CO
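
In terms of the Python sketch given earlier (again my own rendition, not the
paper's), this example is just:

	# LR[i][j]: does storing into lvals[i] change rvals[j]?  Here only b -> b.
	print(expand(["a", "b"], ["b", "c"], [[0, 0], [1, 0]]))
	# -> ['a := b', 'b := c']

Forcing a temporary at this point, which the sketch only does when it finds a
cycle, would yield the two alternative schedules shown above.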

[8] Two particular forms of multiple assignment, sliding and rotation,
are illustrated in an article in CACM, middle seventies, by Suzuki.

[9] Please, not as in functional languages or languages with futures and
other things that imply that you always use the maximum number of
temporaries every time, as in function bindings passed by value.

[10] If you think of most applications rewritten with multiple
assignments, and their dependency matrixes, it is easy to see that these
multiple assignments would capture essentially all interesting
microparallelism, and that they would still not be terribly complicated;
rarely would they be more than four way and more than four deep each way.
Even when grouped together in multiple assignments, the average expressions
would still be pretty simple. Except for things like operations on whole
vectors/matrices (or other simple and regular structures), but then SIMD/MIMD
is the right answer, not microparallelism. Massive, massive parallelism,
which has its own, quite different, set of problems.

[11] I stumbled upon it; I just wanted to add it to a language I was doing
an implementation for, and was irked that all the references I could find at
the time (for example Gries, still in his recent books) said that the
general problem, with arrays and both LL and LR dependencies, was
intractable. On examining the algorithm when it was done I started to
realize its implications for program writing and code generation.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/12/91)

On 10 Jan 91 19:44:01 GMT, pcg@cs.aber.ac.uk (myself) said:

   [ ... how to reduce a multiple assignment to one or more sequences
   of single assignments ... ]

pcg> So that nobody asks me for the draft of the paper, here is an
pcg> outline of the algorithm: build two boolean matrixes, LR for
pcg> L-to-R(-to-L) dependencies, and LL for L-to-L dependencies, an
pcg> element is true if there is a dependency[4] betwen the two
pcg> expressions.

pcg> [4] The usual wise guys will remark that this is undecidable. In
pcg> practice any weaker predicate will do, and will generate, for the
pcg> simple expressions likely to be found in practice, good results.

Some of the usual wise guys have written to me to remark that aliasing
may pose a problem. I had clearly hinted that this need not be so, but it
seems that discussing the aliasing issue throws up more interesting
language and code generation issues, so here is a followup posting.
Followups to this are redirected to comp.lang.misc, because the system
architecture content is not that great here.

There are *two* types of aliasing we might be interested in. Let's see a
few examples:

    a, a := 1, 1;		CO semantics well defined	CO
    a, a := 2, 3;		CO nondeterministic or illegal?	CO
    a, a := b, c;		CO either of the above		CO

This illustrates *value* aliasing. The first example is
noncontroversial; the second is controversial, because we have a choice
of either faulting it as ambiguous or of accepting a nondeterministic
result (my preference, but the compiler should issue a warning); the
third is equivalent to either, depending on whether "b = c".

    a[i], a[j] := a[j], a[i];
    a[i], a[j] := a[m], a[n];

This illustrates *lvalue* or reference aliasing. The semantics of
the first line are obviously well defined: when "i = j" it is an
identity assignment, and otherwise it is a swap. The second line is
harder, as it involves *both* value aliasing (maybe "i = j" but
"a[m] /= a[n]") and lvalue aliasing (if "j = m" or "i = n", ...).

Three "solutions" are possible:


1) Using a suitably weak definition of the dependency predicate
guarantees that all these cases are handled correctly, at the price of
generating some temporaries that are unnecessary when aliasing does not
occur dynamically. For instance, the examples above can be implemented as

    a, a := 1, 1;		a := 1;
    a, a := 2, 3;		a := 3;		
    a, a := b, c;		a := b;
    a[i], a[j] := a[j], a[i];	R0 := a[j]; a[j] := a[i]; a[i] := R0;
    a[i], a[j] := a[m], a[n];	R0 := a[m]; R1 := a[n];
				a[i] := R0; a[j] := R1;
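
As an illustration of what "suitably weak" might mean, here is a toy
predicate of my own (not from the posting): it decides whether storing into
an lvalue might change the value of another expression, for plain names and
single subscripts, and it is what would fill the off-diagonal entries of the
LR matrix (the diagonal stays 0, as in the original outline). It assumes
that two distinct array names never overlap; in C, without aliasing
information, that case would also have to answer true.

	# Toy conservative "may storing into lval change rexpr?" predicate.
	# An expression is either a name ("a") or a pair (array, subscript).
	def may_influence(lval, rexpr):
	    if isinstance(lval, str) and isinstance(rexpr, str):
	        return lval == rexpr               # same scalar variable
	    if isinstance(lval, tuple) and isinstance(rexpr, tuple):
	        (bl, sl), (br, sr) = lval, rexpr
	        if bl != br:
	            return False                   # distinct arrays assumed disjoint
	        if isinstance(sl, int) and isinstance(sr, int):
	            return sl == sr                # constant subscripts can be compared
	        return True                        # otherwise: maybe aliased
	    if isinstance(lval, str) and isinstance(rexpr, tuple):
	        return lval in rexpr               # lval might be the array or subscript
	    return False                           # storing into a[...] leaves scalars alone

	# a[i], a[j] := a[m], a[n]: the pessimistic off-diagonal entries are all true
	print(may_influence(("a", "i"), ("a", "n")))   # True  (maybe i = n)
	print(may_influence(("a", "i"), ("b", "n")))   # False (different arrays)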
	

2) If the programmer knows something that cannot be possibly deduced by
the code that implements the dependency predicate, such as aliasing
information in languages like C, there is always the option of manually
expanding the multiple assignment as single assignments. For example the
programmer may _know_ that "j /= m" and thus rewrite:

    a[i], a[j] := a[m], a[n];	a[j] := a[n]; a[i] := a[m];

If the programmer knows that "i /= n" the other rewriting becomes
possible:

    a[i], a[j] := a[m], a[n];	a[i] := a[m]; a[j] := a[n];


3) A far more interesting solution is to add some cleverness to the
compiler. I mean, real cleverness, not like current "optimizers". The
idea is that compilers are symbolic reduction engines in the domain of
code sequences -- let's exploit that and have the dependency predicate
return not truth values but formulas with free variables. The dependency
matrixes would then imply a family of actual dependency matrixes, and
code could be generated for each. This we could call "Schroedinger's cat
code generation". For example, the (pessimistic) dependency matrixes for

    a[i], a[j] := a[m], a[n];

would be

    LL	0 0	LR	0 1
	0 0		1 0

while the symbolic/feline ones would be

    LL	0 0	LR	0		(i = n)
	0 0		(j = m)		0

By exploding the symbolic LR matrix for the possible values of the
formulas in it, we could have the code generator generate code for:

    a[i], a[j] := a[m], a[n];

as in:

    IF i = n
    THEN
	CO LR matrix collapses to: ((0,1),(j=m,0)) CO

	IF j = m
	THEN
	    CO
		equivalent to:	a[i], a[j] := a[j], a[i];
		LR matrix is:	((0,1),(1,0))
	    CO
	    R0 := a[j]; a[j] := a[i]; a[i] := R0
	ELSE
	    CO
		equivalent to:	a[i], a[j] := a[m], a[i];
		LR matrix is:	((0,1),(0,0)) 
	    CO
	    a[j] := a[i]; a[i] := a[m];
	FI
    ELSE
	CO as above, mutatis mutandis CO
    FI

Note that by going really far, we could even turn "a := b" into
something like "(a /= b | a := b)", which may even (improbably) make
sense for some implementations of some architectures.
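
A toy rendition of this multi-world expansion, again a sketch of my own that
reuses the expand() function from the earlier posting: the symbolic LR
entries are simply the texts of the runtime tests, and one expansion is
produced per combination of their outcomes (a real code generator would then
nest the corresponding IFs as above):

	from itertools import product

	def explode(lvals, rvals, sym_LR):
	    # collect the distinct runtime tests appearing in the matrix
	    tests = sorted({e for row in sym_LR for e in row if isinstance(e, str)})
	    for outcome in product([False, True], repeat=len(tests)):
	        env = dict(zip(tests, outcome))
	        LR = [[env[e] if isinstance(e, str) else bool(e) for e in row]
	              for row in sym_LR]
	        guard = " and ".join(t if env[t] else f"not ({t})" for t in tests)
	        yield guard, expand(lvals, rvals, LR)   # expand() as sketched earlier

	# a[i], a[j] := a[m], a[n] with the symbolic LR matrix shown above
	for guard, seq in explode(["a[i]", "a[j]"], ["a[m]", "a[n]"],
	                          [[0, "i = n"], ["j = m", 0]]):
	    print(guard, "->", "; ".join(seq))

The "i = n and j = m" world comes out as the three-assignment swap through a
temporary, and the other three worlds need no temporary at all.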


Now some comments.

Alternative 1) above is IMNHO perfectly adequate. I do expect that the
vast majority of cases will involve no potential aliasing, so no
possibly excessive temporaries will be used. And even when they are
used, they will probably make little difference to runtime.

Alternative 2) above will be used when speed is absolutely critical and
the programmer wants to invest the mental effort of second guessing the
code generator. Hardly ever necessary, I hope. Note that the frequent
advice to C programmers to use explicit temporaries, so as to offer
greater opportunities to the optimizer in the (fortunate) absence of C
ways to inform it of aliasing hazards, is equivalent to this alternative.

Alternative 3) is intriguing, because it implies dynamic conditional
compilation. For example, the code sequences for the aliasing cases, if
aliasing is expected to be rare, could be offlined to some remote code
section to improve locality. More than a hint of trace scheduling here, of
course. Interesting avenues of research, only a few already explored
(trace scheduling, lazy or speculative execution, on the fly code
generation for BITBLT, the concept of a supercompiler, ...), would open
up if we considered interpretation and compilation as an experimental
science, and execution and compilation as performing or setting up
experiments to collapse the state vector. Deep thinking, man.

================================================================

Now a disclaimer that I feel required to make, after some comments on
the "Summary:" line for the previous posting:

  All the research underpinning this and the previous postings has no
  relationship whatsoever to the research of the University College of
  Wales.

  The research that I discuss here is *entirely* the result of my own
  work, using my own funds, on my own time; its contents is my
  responsibility alone, and should not reflect on any assessment of the
  research or reputation of the University College of Wales.

  In particular the investigation of multiple assignment was done years
before I had even heard of the University College of Wales' existence.

If the above disclaimer were not true, I would have had to seek written
permission from my Head of Department before posting the multiple
assignment technology, and it could have been (probably IMNHO) refused,
as it would have been my contractual duty to point out that it was
patentable subject matter susceptible of commercial exploitation.

I wish to thank the University College of Wales for their generous
permission to use without payment their News connection to discuss in
this forum my own personal research (if any :->).
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

preston@ariel.rice.edu (Preston Briggs) (01/14/91)

pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

>    a[i], a[j] := a[m], a[n];

> it involves *both* value aliasing (maybe "i = j" but "a[m] /=
>a[n]") and lvalue aliasing (if "j = m" or "i = n", ...).
>
>Three "solutions" are possible:

>1)

Conservative implementation.

>2)

Non-deterministic implementation; force the programmer to be specific.
I'd say this is similar to C.

>3) A far more interesting solution is to add some cleverness to the
>compiler. I mean, real cleverness, not like current "optimizers".

and shows a solution that involves generating multiple versions
of the assignment, with tests of i, j, m, and n to distinguish
between the various versions.

I don't like this for several reasons.

a) Where's the profit?  We save a register at best over the conservative
   solution in (1).  The cost, though, is 1 or 2 tests at run-time.
   A branch or 2 per register saved is no deal.

b) Tools already exist to help disambiguate array references,
   e.g., dependence analysis.  While dependence analysis is not always
   precise, it is conservative (safe) and the technology is improving
   all the time.  Further, if you've taken the time to build a dependence
   analyzer, you can use it to support more extensive and more profitable
   transformations.

and finally

c) I don't like so much special-purpose machinery (compiler code)
   devoted to one programming language construct.  I much prefer
   general-purpose optimizations that help over the entire language.

Preston Briggs

pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/14/91)

On 13 Jan 91 20:20:53 GMT, preston@ariel.rice.edu (Preston Briggs) said:

preston> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

pcg> a[i], a[j] := a[m], a[n];

pcg> it involves *both* value aliasing (maybe "i = j" but "a[m] /= a[n]") and
pcg> lvalue aliasing (if "j = m" or "i = n", ...).

pcg> Three "solutions" are possible:

pcg> 1) [ ... assume aliasing will occur and use temporaries ... ]

pcg> 2) [ ... like 1), but encourage the programmer to use specific
pcg>    knowledge, if available ... ]

pcg> 3) A far more interesting solution is to add some cleverness to the
pcg> compiler. I mean, real cleverness, not like current "optimizers".

preston> and shows a solution that involves generating multiple versions
preston> of the assignment, with tests of i, j, m, and n to distinguish
preston> between the various versions.

preston> I don't like this for several reasons.

I don't like it either :-). I prefer 1) or 2), as I had written fairly
clearly. Yet even 3) is defensible (and I will defend it ex officio),
and I added it for the sake of completeness. It is all the more defensible
if the compiler is already implemented as a symbolic reduction engine.

Some of your objections that follow look like objections to multiple
assignment in itself, not to compilers as symbolic reduction. The two
issues are totally independent; I was just illustrating that aliasing
resolution by generating multiple code sequences, a fairly fashionable
thing, can be done simply if the multiple assignment expander is
embedded in a symbolic reduction compiler.

preston> a) Where's the profit?  We save a register at best over the
preston> conservative solution in (1).  The cost though, is 1 or 2 tests
preston> at run-time.  A branch or 2 per register saved is no deal.

In this contrived example the benefit is next to nil and the cost,
especially in code space, is large.

But I reckon that with (much rarer!) more complex assignments and
expressions the ability to generate different code sequences and have
aliasing detected at run time is interesting, especially as the code
sequences compiled for the case where there is no aliasing can be
dramatically more efficient on heavily pipelined machines (non general
purpose processors), and usually aliasing does not occur.

But this is an argument that applies to any flavour of runtime aliasing
resolution, whether produced by multiple assignment, by symbolic
reduction, or by both.

Moreover I am not persuaded that aliasing is really as bad as it is
described, for general purpose machines, but 3) offers a clean, clear
framework for dealing with it, where it matters (assignments), if you
think it is important.

preston> b) Tools already exist to help disambiguate array references,
preston>    e.g., dependence analysis.  While dependence analysis is not
preston>    always precise, it is conservative (safe)

Here indeed we get into the meat of the argument. The technology in 3)
is vastly simpler, and as a consequence more reliable, than any of the
analyzers that you mention, because it is strictly local (aliasing
detection is done at run time), and relies on the very clean and simple
semantics of multiple assignment.

The dependency analyzers that you mention not only have to detect
aliasing at compile time with a global view of the program, an often
impossible task, they also have to do double work; they have, in essence,
to take a sequence of single assignments, reconstruct the multiple
assignment that is equivalent to the sequence, do the global analysis,
and then reexpand everything again. Compare:

A) the programmer writes a multiple assignment without caring about
order of execution or aliasing or data dependencies; the code generator
expands the multiple assignment once and for all deferring alias
detection at run time and uses a small algorithm that can be proven to
work for all possible cases to generate the various alternatives.

B) the programmer writes a sequence of single assignments that implement
the multiple assignment, making sure that the sequence is one of those
equivalent to the underlying multiple assignment, and taking care of
manually resolving dependency and aliasing issues for that particular
sequence. The "optimizer" undoes all this, reconstructs the original
multiple assignment and its characteristic graphs, analyzes the whole
program hoping to resolve aliasing statically, and then reexpands the
multiple assignment using ad-hoc technology that needs constant
debugging.

preston>     and the technology is improving all the time.

Why bother improving a hacked-up technology when you have a ready-made,
simple, dependable alternative? Dusty decks, that's the only possible
answer.

preston>    Further, if you've taken the time to build a dependence
preston>    analyzer,

I'd rather spend one tenth of the time on a fast and furious
implementation of a simple technology that has important theoretical and
practical underpinnings than hack up an "optimizer" and dependency
analyzer and spend the next few years debugging it :-).

preston>    you can use it to support more extensive and more profitable
preston>    transformations.

No, because here we are comparing (if I read you correctly) a global
analyzer that tries to resolve aliasing statically using global
analysis, and generates pessimistic code if it cannot, with one that
does only a local analysis and leaves alias resolution to run time and
generates both optimistic and pessimistic code.

Please note however that I think that local static analysis and
pessimistic code generation are usually better than either, because they
are simpler and nearly as good in a vast majority of cases.

preston> and finally

preston> c) I don't like so much special-purpose machinery (compiler code)
preston>    devoted to one programming language construct.

Hey, this is one of my lines; I hate big compilers.

As to the "one programming language construct", almost all time is spent
in assignments and expressions, and it is my contention that multiple
assignment captures essentially all the interesting cases. I cannot give
you much "proof" for this contention, yet; I appeal to your experience.

preston>    I much prefer general-purpose optimizations that help over
preston>    the entire language.

As in the above lines, there are reasons to prefer local dynamic
aliasing resolution to global static analysis, in simplicity and
efficiency.

I am myself skeptical that symbolic, multiway expansion of multiple
assignment is a win for general purpose computers, but I concede that
there are good arguments that it can be important in some cases (I am
thinking of inlining and similar things, and of improving locality and
pipelinability by offlining complex and rarely executed cases); I am
just pointing out that it can be done elegantly and cheaply by embedding
the multiple assignment expander in a symbolic reduction compiler
engine.

So we agree in that I would not choose 3) over 1) or 2) except in
special circumstances; we disagree in that I don't think that ad-hoc, global,
static data dependency analysis and "optimization" is a cost effective
alternative, except for "dusty decks".

If one designs a language appropriately, a lot of "optimization" is
simply not needed. The one example I have advanced is designed to show
that with multiple assignment one can do a lot of scheduling and
aliasing resolution without requiring complex analysis by the programmer
and by an "optimizer" that second guesses the programmer. An "optimizer"
that second guesses the programmer is also not needed for optimal alias
resolution, if one goes the dynamic route, which has its attractions.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk