[comp.arch] Register Count

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (01/09/91)

In article <PCG.91Jan8175401@odin.cs.aber.ac.uk> 
	pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>Statistics exist that show that almost any
>expression and a majority of code sequences can, on a single threaded
>CPU (one with one ALU), be compiled without spills with FOUR spare
>registers[1].  To add abundant inter-expression caching we need another
>four.

This directly contradicts recent experience with optimizing
compilers. You appear to have counted floating point values (and
assumed no unrolling or software pipelining) and taken that as the
contents of the register bank.

Aside from the generated-in address computations, there are uses such
as scope uplinks, subroutine parameter lists, loop control, and the
like. And those are just the conventional uses, not the aggressive
ones.

>My favourite solution would be to have multiple stacks. How many? FOUR
>is the answer, because it is exceedingly rare that an expression
>contains as many independent threads of computation. A superscalar
>machine may attach independent ALUs to each stack, or even specialize
>them[4].
>[4] Indeed in practice superscalar implementations find it exceedingly
>hard to find degrees of microparallelism greater than two or three,

That's two or three things _per-clock_, on a _pipelined_ machine. You
are planning to operate more than one stack per clock? In a pipelined
fashion, so that each stack can be re-referenced while an outstanding
computation hasn't yet written back its result? How are the multiple
consequences resolved?


-- 
Don		D.C.Lindsay .. temporarily at Carnegie Mellon Robotics

pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/11/91)

On 9 Jan 91 05:45:14 GMT, lindsay@gandalf.cs.cmu.edu (Donald Lindsay)
said:

lindsay> In article <PCG.91Jan8175401@odin.cs.aber.ac.uk>
lindsay> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

pcg> Statistics exist that show that almost any expression and a
pcg> majority of code sequences can, on a single threaded CPU (one with
pcg> one ALU), be compiled without spills with FOUR spare registers[1].
pcg> To add abundant inter-expression caching we need another four.

lindsay> This directly contradicts recent experience with optimizing
lindsay> compilers.

This is directly supported by all experience with optimizing compilers.
It is an old controversy, but let me repeat here: 90% of what passes for
"optimizing" compilers are compilers that optimize for *space*, not
time, by caching into registers the things that are frequently
referenced in the program text, minimizing static spills, not the things
frequently used during program execution. Since on most machines,
especially RISC machines, referring to an operand in main memory takes
much more space than if it were in a register, these compilers stash
everything in sight into registers, and thus they do achieve code space
compaction.

Since most of these compilers make no attempt to estimate the dynamic
frequency of use of values, they achieve good run time performance only
if *all* values are put in registers; as far as time is concerned,
whether most values are in a register or spilled to memory is
irrelevant.  Fortunately, since it is extremely rare[1] for more than a
couple of dozen values to be live in any one section of code, caching
all values can usually be ensured by having a 32-strong register bank.

When you run a real _time_ optimizing compiler[2] the knee of the curve,
for single threaded non float computations, lies at four spare registers
and four global[3] registers.

lindsay> You appear to have counted floating point values (and assumed
lindsay> no unrolling or software pipelining) and taken that as the
lindsay> contents of the register bank.

I have said *nonfloat* and *single threaded*. In my view of the world,
each functional unit needs its own set of registers. Most of the
currently available implementations are nearly single threaded, or in
any case need only a few registers[3].  Loop unrolling only wins if you
have some degree of multiple functional units, otherwise you only buy
reduced overhead, which is not much, especially if the implementation
has some degree of pipelining.

lindsay> Aside from the generated-in address computations, there are
lindsay> uses such as scope uplinks, subroutine parameter lists, loop
lindsay> control, and the like. And those are just the conventional
lindsay> uses, not the aggressive ones.

I am getting tired of repeating that quantitative analysis of computer
architecture, which existed well before Hennessy & Patterson's book, is
about knees in the two curves, resource/benefit and resource/cost[4].

My current understanding, based on vast amounts of reading[5] and
analysis, is that, for general purpose applications, the number of
registers is an increasing function of the number of functional units,
that the knee of the curves is at four spare registers (plus possibly
four global ones) for each one, and that the knee of the curve in
functional units is at four as well, as repeated here:

pcg> My favourite solution would be to have multiple stacks. How many? FOUR
pcg> is the answer, because it is exceedingly rare that an expression
pcg> contains as many independent threads of computation. A superscalar
pcg> machine may attach independent ALUs to each stack, or even specialize
pcg> them[4].
pcg> [4] Indeed in practice superscalar implementations find it exceedingly
pcg> hard to find degrees of microparallelism greater than two or three,

I have had comments here and in the past that demonstrate a fundamental
difference in point of view between me and some other people. I base my
numbers on the computation patterns of general purpose applications --
let's say on the data flows of values in them. Characterizing general
purpose applications is far from a static field, but by now we know
quite a lot.

How to map these onto an architecture is the architect's problem. Some
people hold that the machine should be organized in the most dynamic
fashion to mimic the dataflow of values in any application, and come up
with superscalar dataflow machines. Based on my armchair evidence I
believe that most general purpose applications data flows can be
modelled best with four multiple stacks (quite shallow, by the way, say
four deep), because such dataflows have the shape of a tree with a
relatively low branching factor, and relatively small number of levels
in each independent branch.

If this is true, adding more functional units is not going to be of
benefit, because interference between the dataflows will be high enough
to cause too much trouble; and having deeper stack windows is not going
to be terribly useful either, because the stack depth will only rarely
be exceeded, so overflows will be infrequent.

lindsay> That's two or three things _per-clock_, on a _pipelined_
lindsay> machine. You are planning to operate more than one stack per
lindsay> clock?  In a pipelined fashion, so that each stack can be
lindsay> re-referenced while an outstanding computation hasn't yet
lindsay> written back its result?

Why not? Once you have multiple stacks, if the compiler is not demented,
and the application does have a sufficient intrinsic degree of
microparallelism, dataflows on the different stacks will typically be
independent. For example "(a+b) * (c+d)" can easily be mapped to two
stacks to be run in parallel. Independent branches of the dataflow
tree/graph are prime candidates. Just look at programs and data as
though they were written in Lisp :->.
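
To make the two-stack picture concrete, here is a toy Python simulation
(my own illustration, with made-up values; no real ISA is implied) of
"(a+b) * (c+d)" split across two stacklets:

    # The two additions live on separate stacklets and could issue in
    # the same cycle; the multiply then joins the two stack tops.
    a, b, c, d = 2, 3, 4, 5                 # made-up operand values
    s0, s1 = [a, b], [c, d]                 # independent pushes
    s0.append(s0.pop() + s0.pop())          # a+b on stacklet 0 ...
    s1.append(s1.pop() + s1.pop())          # ... c+d on stacklet 1
    s0.append(s0.pop() * s1.pop())          # join: (a+b) * (c+d)
    print(s0.pop())                         # prints 45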

lindsay> How are the multiple consequences resolved?

These are problems that all superscalar architectures must solve.

Architectures with a flat register file model, I believe, end up
simulating multiple stacks in the flat register file, whether they are
superscalar or not, because that is how application data flows are.

Now, _if_ we accept my belief (but I am ready to be converted!) that
general purpose application dataflows can be characterized as above (a
few, short, independent "nests" of activity), we have the following
alternatives:

	Scalar, register file, load/store

Multiple stacklets are simulated in the register file. Switching the
focus among the stacklets is instantaneous, because action can happen
between any two registers, and any register can be "stack top".
Processing of each thread, that is of each stacklet, is serial. It is
important that the register file be able to cache entire stacklets,
because referring to operands in memory is so awkward. Poor instruction
density, because it does not exploit the expected dataflow patterns,
and is designed for the worst case[7].

	Scalar, register file, traditional

Multiple stacklets are simulated on a memory stack: the stacklet tops
(and as many further down as will fit) are kept in CPU registers, the
rest (the "spills") are multiplexed on the memory stack. Switching the
focus among the stacklets is nearly instantaneous, because most of the
action happens anyhow near the stack tops, which are in registers.
Accessing the bottom elements of each stacklet is not too expensive, in
a register/memory architecture.  Processing of each thread is serial.
Fair instruction density, because it is fairly close to the multiple
nested value dataflow patterns.

	Scalar, single stack

Multiple stacklets are nested on a single CPU stack. Switching the focus
among the stacklets requires pushing/popping entire substacks because
action can happen only at the stack top. Independent threads of action
cause problems because the real stack top must be multiplexed
unpleasantly.  Good instruction density, because it exploits very well
the nested nature of most computation flows[8]. Unfortunately encoding
density is not perfect, because to switch the focus from one stacklet to
another takes a number of push and pop operations. Fortunately many non
numeric applications do not have more than one thread of computation.

	Scalar, multiple stack

There is a small number of CPU stacks, of limited depth, with
overflow/underflow spilling/refilling, either hardware or software
assisted. Each stacklet is mapped to a different hardware stack, and
carries an independent thread of the computation. Typically only two CPU
stacks will be in frequent use[9], even if four will usually be
provided. Four CPU stacks will enable mapping each stacklet for most
applications. A depth of four for each CPU stack will result in no or
rare overflows most of the time.  Switching the focus between stacklets
is instantaneous.  Encoding density is excellent, because it maps
exactly on the expected value dataflow patterns.
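
As a toy model of the overflow/underflow machinery, here is a Python
sketch of one four-deep CPU stack; this is purely my own illustration,
assuming software-assisted spilling as described above, and every name
in it is invented:

    # One 4-deep hardware stacklet: pushing onto a full stacklet spills
    # its bottom element to memory, popping an empty one refills from it.
    class Stacklet:
        DEPTH = 4
        def __init__(self):
            self.slots = []              # on-chip stack cells
            self.memory = []             # spill area in main memory
        def push(self, value):
            if len(self.slots) == self.DEPTH:
                self.memory.append(self.slots.pop(0))    # overflow: spill
            self.slots.append(value)
        def pop(self):
            if not self.slots:
                self.slots.append(self.memory.pop())     # underflow: refill
            return self.slots.pop()

With four-deep dataflow nests, as argued above, the spill and refill
paths would be exercised only rarely.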

	Superscalar, registers, load/store

There are multiple focus points, because each stacklet contained in the
register file can be served by a different functional unit. If[10] the
register file is multiported, access to the independent stacklets is
fast, but a lot of interconnect is required, because the stacklets may
be spread all around and intermingle arbitrarily in the register file,
again because there is too much generality and too many worst case
assumptions. If there are more functional units than there is space in
the register file for as many stacklets, trouble ensues, unless
esoterica like register renaming are used[11].

	Superscalar, registers, traditional
	Superscalar, single stack

They can be made to work, at the price of converting on the fly the
instruction stream to that for another type of architecture[12]; the
difficulty is that the multiple focuses due to the multiple functional
units make for trouble, as both have to multiplex multiple
stacklets onto a single stack. Since "registers, traditional" only uses
the memory stack for multiplexing the bottoms of the stacklets, it can
be adapted to superscalar operations better than "single stack"[13].

	Superscalar, multiple stack

There are multiple CPU stacks, and each computation stacklet is mapped
onto one of them.  Each CPU stack need not be multiported; it can even
be dedicated to a single functional unit, even if I wouldn't. Only the
stack tops can be transferred to/from other stacks. Computations in each
CPU stack can proceed independently, because this is how application
data flows are structured. The push and pop operations could have two
forms, a short one for addressing directly the on chip cache (which
would become a statically managed cache, something similar to a register
file, but not quite); or the cache could be a more conventional one, and
push and pop would have only full address forms. More functional units
than there are hardware stacks is a bad idea[11].

	Superscalar, dataflow

In this organization the CPU is structured internally as a network of
functional units, and values (and operators) flow in this network.
Unless the functional unit mix required by the application is very
different from that provided in the network, this will dynamically adapt
to any value dataflow patterns exhibited by the application, as well as
to any number and type of functional units, without recompilation.
Unfortunately this requires fairly hefty dynamic scheduling in hardware,
and all this generality and flexibility is rarely needed[14], which
makes dataflow machines useful only for a few very wild applications.


Well, I hope you have followed me down to here. I have been trapped into
spilling the beans on what I think is a correct model of computational
activity and its implications for the structure of an architecture
(register based, either load/store or traditional, and stack based,
either single or multiple), and its implementation (scalar or
superscalar). I hope it has been useful to you as well, especially in
debunking certain myths about optimizers, registers, superscalars, ...

================

[1] But I received by e-mail a report that in some cases there are more
than 128!

[2] One that uses profiling, or even better one that believes 'register'
or uses heuristics -- I think Hank Dietz's group was into this sort of
thing -- to estimate the dynamic frequency of use of values.

[3] Spare means registers mostly for computing expressions, global means
registers for caching values across expressions, though not across
procedures. Interprocedural caching is of little benefit except in
particular cases.

[4] For example, I have received some interesting mail that shows that
in order to keep the pipeline and functional units of a MIPS R3000/R3010
CPU going full tilt, you only need six floating point registers.

[5] We should not be interested in how many registers we can make use
of, but in how many registers we can make productive use of, what this
is a function of, and how much it costs.

[6] But see David Wall's classic paper about global optimization to see
that in his extreme examples the knee of the curve is at 8 registers if
dynamic frequency is taken into account, versus three times as many if
only static frequency is used.

[7] Any value can interact with any other value, even though nesting is
so prevalent in actual value dataflow patterns.

[8] Note that a pure stack architecture is load/store too: only push and
pop address memory. It does away with registers, because the operands
are always implicit, so all instructions except push and pop have only
the opcode (a "syllable", in Burroughs terminology, or a "word" in
Forth).

[9] One for addressing calculations, the other for nonfloat calculations;
a third CPU stack will be in use if there are float calculations.

[10] Well, we could remove the "if" and substitute it with an "as" here!

[11] But then if the application value data flow patterns show more than
four independently exploitable nests of computation, it is no longer a
general purpose application as commonly intended. Use a Cray, a DAP, a
Teradata, a Connection Machine or any other SIMD/MIMD thingie.

[12] I seem to remember that this is how modern 370 architectures are
implemented; the 370 interior decor is emulated by a more or less loose
collection of functional units connected by dataflow paths. IBM have
always seen the 370 (or other) interior decor as little more than a
declarative language, to be interpreted on the fly. In the S/38 the
interior decor is actually translated, more or less statically, to
different microcodes on different models.

[13] However for both it is probably better to go the multiple CPU,
rather than the multiple functional unit, route.  Note that,
intriguingly, for most workloads, four is also the knee of the curve for
macroparallelism. Maybe R. A. Wilson's Law of Fives should be a Law of
Fours :-).

[14] Because general purpose application dataflow patterns (a mantra by
now!)  have fairly predictable shapes, as above, and full dataflow
machines do not exploit this advance knowledge. Now that I think of it,
they probably could, though, by special casing. Could be interesting.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/11/91)

On 10 Jan 91 16:23:01 GMT, pcg@cs.aber.ac.uk (myself) said:

    [ ... usual verbose treatise with possibly useful & interesting
    but not at all new things on value data flow patterns and
    implications for architectures and their implementations ... ]

Rereading this article I have realized that there is a problem in the
regular rerunning of the 'register count' and 'aggressive optimization'
debates: I know something that, as far as I knew up to a couple of years
ago, nobody else knew, and this gives me, I suppose, some insight that
permeates the tone of my arguments without ever being made explicit.

I have an unpublished[1] paper on how to implement multiple parallel
assignment in the general case, which is something that apparently is
still not generally known or published. Multiple assignment is, I have
come to see, a very important thing, with far reaching consequences in
at least two fields; I will discuss those for architecture[2]. I will
first present the problem, then a sketch of the solution, and then the
implications for system architecture.

Multiple parallel assignment can be written as something like

	L1, L2, L3 := R1, R2, R3

Where the L's are lvalue producing expressions, and the R's are value
producing expressions. The previous values of all the L's are replaced
simultaneously and in parallel (at least conceptually) with the
corresponding R's. Its formalization is the trivial extension of the
single assignment one.

Examples:

	x, y := w, z;		CO x := w; y := z; 		CO
	a, b, c := b, c, d;	CO slide a row of values	CO
	a, b := b, a;		CO swap the values of a and b 	CO
	a, b, c := b, c, a;	CO rotate a row of values	CO
	a[i], i := i, a[i];	CO quite a bit more difficult 	CO

As you can see there are four levels of difficulty: no dataflow dependencies,
L-to-R dataflow dependencies, L-to-L dataflow dependencies, and both.

I have devised an obvious but complicated algorithm for expanding an
_arbitrary_ multiple assignment into an equivalent sequence (or multiple
sequences) of single assignments, with the use of the *minimum* number of
temporaries[3] necessary, or any other quantity desired, depending on the
desired degree of parallelism.

So that nobody asks me for the draft of the paper, here is an outline of the
algorithm: build two boolean matrixes, LR for L-to-R(-to-L) dependencies,
and LL for L-to-L dependencies; an element is true if there is a
dependency[4] between the two expressions. In other words, if N is the arity
of the multiple assignment:

    for i,j in 1..N,1..N
	LL[i,j] := (i=j | 0 | Li <influences the value of> Lj)
	LR[i,j] := (i=j | 0 | Li <influences the value of> Rj)

You now have got the characteristic matrixes of two directed graphs.
Apply Dijkstra's cycle finding algorithm[5] to the matrixes, repeatedly,
by doing the row sums and emitting one of the assignments that have zero
row sum. If there aren't any, there is a cycle. If there are several row
sums with the same value, select one at random (they are all part of
cycles of the same degree), introduce a temporary, emit the corresponding
assignment into the temporary, and restart.  Whenever an assignment to Li
is emitted, delete the i'th row and column of LL and LR. That's all, in
outline[6].
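
For the curious, here is a minimal executable sketch of the above, in
Python purely for brevity. It handles only value (L-to-R) dependencies
(L-to-L ones are handled the same way, with the temporary capturing an
address rather than a value), it takes whatever conservative
"influences" predicate you care to supply, and every name in it is
mine, not the paper's:

    def expand(lhs, rhs, influences):
        """lhs, rhs: lists of expression strings, of equal arity N.
        influences(l, r) is true if assigning to l may change the
        value of r (any conservatively weak predicate will do)."""
        n = len(lhs)
        pending = set(range(n))
        src = list(rhs)             # current source of each assignment
        LR = [[i != j and influences(lhs[i], rhs[j]) for j in range(n)]
              for i in range(n)]
        out, ntemp = [], 0
        while pending:
            # a zero row sum means writing this lvalue can no longer
            # disturb any pending right-hand side: safe to emit now
            ready = [i for i in pending
                     if not any(LR[i][j] for j in pending)]
            if ready:
                i = min(ready)      # scheduling policy plugs in here
                out.append(lhs[i] + " := " + src[i])
                pending.discard(i)
            else:
                # every pending assignment is on a cycle: break one by
                # copying a right-hand side into a fresh temporary
                j = min(pending)    # "select one at random" would also do
                t = "T%d" % ntemp; ntemp += 1
                out.append(t + " := " + src[j])
                src[j] = t
                for i in pending:   # temporaries are never influenced
                    LR[i][j] = False
        return out

    naive = lambda l, r: l == r     # toy predicate for simple names
    print(expand(["a", "b"], ["b", "c"], naive))
    # -> ['a := b', 'b := c']
    print(expand(["a", "b"], ["b", "a"], naive))
    # -> ['T0 := b', 'b := a', 'a := T0']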

It is easy to prove that this generates the assignments in the right
order, and introduces the minimum number of temporaries.

Note that the order in which single assignments are emitted if more than
one row has zero row sum, or in which cycles are resolved if they have
the same row sum, is a matter for scheduling policy. Also, some single
assignments that have higher row sums than others can be emitted before
those if temporaries are introduced to reduce those row sums[7].  The
temporaries can be seen as staging places that uncouple two strands of
microparallelism that would otherwise get entangled.

Now the implications for system architecture, that is microparallelism.

I can claim that a very large percentage of existing codes can be
recoded with multiple assignments with large benefit in clarity. Anybody
can think of examples[8]. I also claim that in all these important cases
programmers are forced to hand expand the underlying multiple assignment
into a series of single assignments, and that a large part, or even all,
of the dataflow analysis of optimizers is about reversing the process and
rebuilding the underlying multiple assignment, and possibly reexpanding
it in a different form.

If you look at the world like this, you find that instruction scheduling,
the maximum degree of microparallelism, caching in registers across
statements, common subexpression elimination, and many other issues are
just a consequence of expanding multiple parallel assignments into single
assignments, to be executed either serially or in parallel.

It is obvious, at least to me, that a lot (if not all!) of so called
aggressive optimization and clever register usage, as opposed to competent
code generation, is the regrettable result of multiple assignment not being
in the language and being haphazardly reinvented every time by aggressive
optimizers, and other funnily misapplied things like graph coloring. With
the double hazard that the programmer is writing a very simple and easily
understood multiple assignment *as its implementation*, a sequence of
single assignments, and the compiler is trying to reverse engineer that.

It is also obvious to me, having seen the trivially easy but complicated
"proof" that the above described algorithm works, that in many many cases
both the programmer and the optimizer must "prove" that the sequence of
single assignments is indeed equivalent to the underlying multiple
assignment. This is tedious and complicated, when it could be done in a
reliable way as part of code generation, if languages were designed with
multiple assignment built in[9] and with this algorithm in a "tunable"
version, as to the scheduling of the single assignments and as to the
number of temporaries to use and when.

Especially on multiple stack superscalar machines[10], whether the stacks are
uselessly implemented as portions of a flat register file or not.

I had to get this off my chest, sooner or later. Most probably others have
found the same algorithm, but if so I have never seen any indication that
they attributed the same importance to it as I did when I considered all the
implications[11]. If you think that they are less important than I think,
reconsider; for example, read Suzuki's paper, or try to rewrite your
favourite algorithms and inner loops using multiple assignments, and wonder
how the algorithm could expand them on different architectures.

================

[1] Submitted for publication almost five years ago to the CJ, accepted with
revisions, revisions never done :-(. I have not been in the right environment
for doing or even polishing research.

[2] The other one is program formalization and correctness, as in stateless
or stateful languages.

[3] The general form of multiple assignment can always be implemented by
calculating each of the L's and each of the R's into a temporary, and
then performing the single assignments between the temporaries in any
order. This is what binding based, functional languages do.

[4] The usual wise guys will remark that this is undecidable. In
practice any weaker predicate will do, and will generate, for the simple
expressions likely to be found in practice, good results.

[5] The referee for my paper pointed out that I was really using topological
sorting, thus missing the point entirely. Topological sorting and Dijkstra's
cycle finding algorithm are equivalent (they both essentially deal with the
transitive closure of a partial order operator), modulo the representations
of the graph, but one has as goal finding the cycles, the other fails when
there are any. Very different flavour.

[6] There are additional obvious complications in selecting the optimal
cycle reduction order if there are cycles in both the LR and LL
matrixes, and other scheduling details, and handling cases like

	a, a := b, b		CO well defined meaning		CO
	a, a := b, c		CO nondeterministic or illegal?	CO

I did a sketchy (minimal and dumb) implementation for Johnson's PCC,
many years ago, in about one day. I think that adding it to any compiler
would take about as much time. I have lost PCC's patches, and have not
had time to add it to GNU cc or other similar technology. Sorry.

[7] For example, consider "a, b := b, c". The LL matrix is all 0, and
the LR matrix is

	0 0
	1 0

We can immediately write "L0 := R0", and then the matrix
collapses, and we can write "L1 := R1", so we get the sequence

	a := b; b := c;

If we do not like this scheduling, we can extend the matrix with the row and
column for a temporary (note that a temporary never influences any L or R
expression), and thus we can generate instead either of the sequences

	T1 := b; a := T1; b := c;	CO quite silly really	CO
	T1 := b; b := c; a := T1;	CO scheduling inversion	CO

[8] Two particular forms of multiple assignment, sliding and rotation,
are illustrated in an article in CACM, middle seventies, by Suzuki.

[9] Please, not as in functional languages or languages with futures and
other things that imply that you always use the maximum number of
temporaries every time, as in function bindings passed by value.

[10] If you think of most applications rewritten with multiple
assignments, and of their dependency matrixes, it is easy to see that these
multiple assignments would capture essentially all interesting
microparallelism, and that they would still not be terribly complicated;
rarely would they be more than four way and more than four deep each way.
Even when grouped together in multiple assignments, the average expressions
would still be pretty simple. Except for things like operations on whole
vectors/matrices (or other simple and regular structures), but then SIMD/MIMD
is the right answer, not microparallelism. Massive, massive parallelism, which
has its own, quite different, set of problems.

[11] I stumbled upon it; I just wanted to add it to a language I was doing
an implementation for, and was irked that all the references I could find at
the time (for example Gries, still in his recent books) said that the
general problem, with arrays and both LL and LR dependencies, was
intractable. On examining the algorithm when it was done I started to
realize its implications for program writing and code generation.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/12/91)

On 10 Jan 91 19:44:01 GMT, pcg@cs.aber.ac.uk (myself) said:

   [ ... how to reduce a multiple assignment to one or more sequences
   of single assignments ... ]

pcg> So that nobody asks me for the draft of the paper, here is an
pcg> outline of the algorithm: build two boolean matrixes, LR for
pcg> L-to-R(-to-L) dependencies, and LL for L-to-L dependencies, an
pcg> element is true if there is a dependency[4] betwen the two
pcg> expressions.

pcg> [4] The usual wise guys will remark that this is undecidable. In
pcg> practice any weaker predicate will do, and will generate, for the
pcg> simple expressions likely to be found in practice, good results.

Some of the usual wise guys have written to me to remark that aliasing
may pose a problem. I had clearly hinted that this need not be so, but it
seems that discussing the aliasing issue throws up more interesting
language and code generation issues, so here is a followup posting.
Followups to this are redirected to comp.lang.misc, because the system
architecture content is not that great here.

There are *two* types of aliasing we might be interested in. Let's see a
few examples:

    a, a := 1, 1;		CO semantics well defined	CO
    a, a := 2, 3;		CO nondeterministic or illegal?	CO
    a, a := b, c;		CO either of the above		CO

This illustrates *value* aliasing. The first example is
noncontroversial; the second is controversial, because we have a choice
of either faulting it as ambiguous, or of accepting a nondeterministic
result (my preference, but the compiler should issue a warning); the
third is equivalent to either, depending on whether "b = c".

    a[i], a[j] := a[j], a[i];
    a[i], a[j] := a[m], a[n];

This illustrates *lvalue* or reference aliasing.  But the semantics of
the first line are obviously well defined, for when "i = j" it is an
identity assignment, and otherwise it is a swap. The second line is
harder, as it involves *both* value aliasing (maybe "i = j" but "a[m] /=
a[n]") and lvalue aliasing (if "j = m" or "i = n", ...).

Three "solutions" are possible:


1) Using a suitably weak definition of the dependency predicate
guarantees that all these cases are handled correctly, at the price of
generating some temporaries that are unnecessary when aliasing does not
occur dynamically. For example the examples above can be implemented as

    a, a := 1, 1;		a := 1;
    a, a := 2, 3;		a := 3;		
    a, a := b, c;		a := b;
    a[i], a[j] := a[j], a[i];	R0 := a[j]; a[j] := a[i]; a[i] := R0;
    a[i], a[j] := a[m], a[n];	R0 := a[m]; R1 := a[n];
				a[i] := R0; a[j] := R1;
	

2) If the programmer knows something that cannot possibly be deduced by
the code that implements the dependency predicate, such as aliasing
information in languages like C, there is always the option of manually
expanding the multiple assignment as single assignments. For example the
programmer may _know_ that "j /= m" and thus rewrite:

    a[i], a[j] := a[m], a[n];	a[j] := a[n]; a[i] := a[m];

If the programmer knows that "i /= n" the other rewriting becomes
possible:

    a[i], a[j] := a[m], a[n];	a[i] := a[m]; a[j] := a[n];


3) A far more interesting solution is to add some cleverness to the
compiler. I mean, real cleverness, not like current "optimizers". The
idea is that compilers are symbolic reduction engines in the domain of
code sequences -- let's exploit that and have the dependency predicate
return not truth values but formulas with free variables. The dependency
matrixes would then imply a family of actual dependency matrixes, and
code could be generated for each. This we could call "Schroedinger's cat
code generation". For example, the (pessimistic) dependency matrixes for

    a[i], a[j] := a[m], a[n];

would be

    LL	0 0	LR	0 1
	0 0		1 0

while the symbolic/feline ones would be

    LL	0 0	LR	0		(i = n)
	0 0		(j = m)		0

By exploding the symbolic LR matrix for the possible values of the
formulas in it, we could have the code generator generate code for:

    a[i], a[j] := a[m], a[n];

as in:

    IF i = n
    THEN
	CO LR matrix collapses to: ((0,1),(j=m,0)) CO

	IF j = m
	THEN
	    CO
		equivalent to:	a[i], a[j] := a[j], a[i];
		LR matrix is:	((0,1),(1,0))
	    CO
	    R0 := a[j]; a[j] := a[i]; a[i] := R0
	ELSE
	    CO
		equivalent to:	a[i], a[j] := a[m], a[i];
		LR matrix is:	((0,1),(0,0)) 
	    CO
	    a[j] := a[i]; a[i] := a[m];
	FI
    ELSE
	CO as above, mutatis mutandis CO
    FI
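
To make the exploding step concrete, here is a tiny sketch of it, in
Python for brevity; everything here (the names, and the representation
of the formulas as plain label strings) is my own invention for
illustration:

    from itertools import product

    # Each symbolic matrix entry is True, False, or a condition label
    # like "i = n".  Explode the matrix into its concrete instances,
    # one per combination of label truth values; each instance can then
    # be fed to the expansion algorithm of the previous posting, under
    # the guard of the corresponding IF.
    def explode(symbolic):
        labels = sorted(set(e for row in symbolic for e in row
                            if isinstance(e, str)))
        for values in product([True, False], repeat=len(labels)):
            env = dict(zip(labels, values))
            guard = " AND ".join(l if env[l] else "NOT (" + l + ")"
                                 for l in labels)
            yield guard, [[env.get(e, e) for e in row] for row in symbolic]

    # the symbolic LR matrix for "a[i], a[j] := a[m], a[n]":
    for guard, LR in explode([[False, "i = n"], ["j = m", False]]):
        print(guard, "->", LR)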

Note that by going really far, we could even turn "a := b" into
something like "(a /= b | a := b)", which may even (improbably) make
sense for some implementations of some architectures.


Now some comments.

Alternative 1) above is IMNHO perfectly adequate. I do expect that the
vast majority of cases will involve no potential aliasing, so no
possibly excessive temporaries will be used. And even when they are
used, they will probably make little difference to runtime.

Alternative 2) above will be used when speed is absolutely critical and
the programmer wants to invest the mental effort of second guessing the
code generator. Hardly ever necessary, I hope. Note that the frequent
advice to C programmers, in the (fortunate) absence of C ways to inform
the optimizer of aliasing hazards, to offer greater opportunities to the
optimizer by using explicit temporaries, is equivalent to this
alternative.

Alternative 3) is intriguing, because it implies dynamic conditional
compilation. For example the code sequences relating to aliasing, if it
is expected to be rare, could be offlined to some remote code section,
to improve locality. More than a hint of trace scheduling here, of
course. Interesting avenues of research, only a few already explored
(trace scheduling, lazy or speculative execution, on the fly code
generation for BITBLT, the concept of a supercompiler, ...), would open
up if we considered interpretation and compiling as an experimental
science, and execution and compilation as performing or setting up
experiments to collapse the state vector. Deep thinking, man.

================================================================

Now a disclaimer that I feel required to make, after some comments on
the "Summary:" line for the previous posting:

  All the research underpinning this and the previous postings has no
  relationship whatsoever to the research of the University College of
  Wales.

  The research that I discuss here is *entirely* the result of my own
  work, using my own funds, on my own time; its content is my
  responsibility alone, and should not reflect on any assessment of the
  research or reputation of the University College of Wales.

  In particular the investigation of multiple assignment was done years
  before I had even heard of the University College of Wales' existence.

If the above disclaimer were not true, I would have had to seek written
permission from my Head of Department before posting the multiple
assignment technology, and it could have been (probably, IMNHO) refused,
as it would have been my contractual duty to point out that it was
patentable subject matter susceptible of commercial exploitation.

I wish to thank the University College of Wales for their generous
permission to use without payment their News connection to discuss in
this forum my own personal research (if any :->).
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (01/13/91)

In article <PCG.91Jan10162301@odin.cs.aber.ac.uk> 
	pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>Loop unrolling only wins if you
>have some degree of multiple functional units, otherwise you only buy
>reduced overhead, which is not much, especially if the implementation
>has some degree of pipelining.

Unrolling (to reduce overhead) is the basic trick of hand-coded
memcpy routines, on such "multiple functional unit" classics as the
8080 and 68010. Pipelining makes unrolling more useful, not less.

>It is an old controversy, but let me repeat here: 90% of what passes for
>"optimizing" compilers are compilers that optimize for *space*, not
>time, ...

The optimizing compilers that I worked on, were trying for speed.  We
tuned the things by endlessly recompiling and running our (ever-
expanding) code-quality suite. We didn't bother to measure how small
the generated code was: we just timed it. 

>Based on my armchair evidence I
>believe that most general purpose applications data flows can be
>modelled best with four multiple stacks (quite shallow, by the way, say
>four deep), because such dataflows have the shape of a tree with a
>relatively low branching factor, and relatively small number of levels
>in each independent branch.

I have before me a dataflow diagram that isn't tree-like. Yes, it's a
real-world Fortran subroutine.  Further, you are referring to
floating point calculation. In many programs,

	i := i + 1

is the most common calculation: addressing and control are dominant.
If you want to convince me about your "most" and "best", you are
going to need numbers. It is just as easy to believe that the modal
stack depth would be 1 (one), hence, the multiple stacks would be
that many (expensive) registers.

I could go deeper into this: I could demand to know how your design
does (say) context switches. But, on the whole, stack architectures
are dead, and dealing with a proposed one doesn't seem like a good
use of my time.
-- 
Don		D.C.Lindsay .. temporarily at Carnegie Mellon Robotics

sef@kithrup.COM (Sean Eric Fagan) (01/13/91)

In article <11566@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>In article <PCG.91Jan10162301@odin.cs.aber.ac.uk> 
>	pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>>It is an old controversy, but let me repeat here: 90% of what passes for
>>"optimizing" compilers are compilers that optimize for *space*, not
>>time, ...

Uhm... what makes you think they're so different?

The MSC compiler optimizes for space, not execution time.  However, the *86
architecture is such that, in most cases, optimizing for space also ends up
with more efficient code.  (In particular, CSE and other code removal
techniques *do* result in faster code, as well as smaller.)  On most RISC
chips, the smallest code is quite often the fastest code.

-- 
Sean Eric Fagan  | "I made the universe, but please don't blame me for it;
sef@kithrup.COM  |  I had a bellyache at the time."
-----------------+           -- The Turtle (Stephen King, _It_)
Any opinions expressed are my own, and generally unpopular with others.

pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/14/91)

On 12 Jan 91 20:59:26 GMT, lindsay@gandalf.cs.cmu.edu (Donald Lindsay) said:

lindsay> In article <PCG.91Jan10162301@odin.cs.aber.ac.uk> 
lindsay> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

pcg> Loop unrolling only wins if you have some degree of multiple
pcg> functional units, otherwise you only buy reduced overhead, which is
pcg> not much, especially if the implementation has some degree of
pcg> pipelining.

lindsay> Unrolling (to reduce overhead) is the basic trick of hand-coded
lindsay> memcpy routines, on such "multiple functional unit" classics as
lindsay> the 8080 and 68010. Pipelining makes unrolling more useful, not
lindsay> less.

My fault -- I was not discussing 8080s and 68010s, but contemporary
machines. If there is *no* suitable pipelining, like for those,
unrolling buys you reduced overhead, as I wrote. Not a lot, and you
don't need a lot of unrolling to reduce overhead suitably, and since you
have no pipelining you can reuse the registers in each stage of the
unrolling. I remember having done some experiments with copying core on
a PDP-11/70; unrolling beyond three or four was mostly pointless; with
loops with a large body the advantage was even smaller.

If you have *some limited* degree of pipelining, as in contemporary
implementations, such as the classic three-four stage pipeline that
overlaps some computation with some control, and especially if this
pipeline is exposed with things like delayed branches, then unrolling
buys you nothing at all in time, and loses code space.

Maybe you could get up to date on the "benefits" of loop unrolling for
contemporary, shallow pipelined, implementations if you got my
CoreCopy() function from a comp.arch archive, and tried it on several
contemporary machines with different values for its several options.
Just to inspire your curiosity, I have found that for example on an
implementation with a shallow pipeline like the 68020 unrolling buys you
nothing at all. This is not untypical.

I have been sent an example (thanks!) that shows that on a MIPS R30x0,
which has a shallow implementation, one can keep the floating point
pipeline busy with the use of 6-7 registers and *no* unrolling. This is
not untypical either.

pcg> It is an old controversy, but let me repeat here: 90% of what
pcg> passes for "optimizing" compilers are compilers that optimize for
pcg> *space*, not time, ...

lindsay> The optimizing compilers that I worked on, were trying for
lindsay> speed.  We tuned the things by endlessly recompiling and
lindsay> running our (ever-expanding) code-quality suite. We didn't
lindsay> bother to measure how small the generated code was: we just
lindsay> timed it.

This seems (to me at least) a funny way of proceeding, and demonstrates a
trial-and-error mentality that may be there because of lack of insight.

Well, maybe that's the case. If you were aiming at one goal with
technologies suited to achieving another goal, too bad for you.

It can happen, even to the best; a lot of serious and important people
that publish papers on "optimizing" compilers expound the funny view
that things like graph coloring optimize speed and not space. I don't
know who's the joke on.

The joke is that graph coloring minimizes the *number* of spills, not
their *cost* in time. Most spills have negligible cost in time, because
they are not in critical paths, and are not executed frequently.

If an "optimizer" does not attempt to minimize the *dynamic* cost of
spills, by estimating the frequency of reuse of values either by
feedback profiling or by heuristics, it will only achieve optimal timing
by pure chance, or by having so many registers to play with that the
cost of spills is zero because there aren't any (virtually all the so
called "optimizers" for RISC machines).

I had hoped that more people understood what David Wall and Hank Dietz
can say on this subject (not to mention more esoteric technologies).

pcg> Based on my armchair evidence I believe that most general purpose
pcg> applications data flows can be modelled best with four multiple
pcg> stacks (quite shallow, by the way, say four deep), because such
pcg> dataflows have the shape of a tree with a relatively low branching
pcg> factor, and relatively small number of levels in each independent
pcg> branch.

lindsay> Further, you are referring to floating point calculation.

No, to nonfloat ones, explicitly. Yet I think that for float ones things
are similar, for many, but maybe not most, applications. Several
important numerical applications that (unfortunately) use floating point
have FIFO dataflow reference patterns, and of course a SIMD/MIMD machine
is far better suited to them than one based on multiple stacks, which is
designed for (mythical :->) general purpose applications.

I think that it is in order to point to the classic "The MU5 Computer
System", by Ibbett & Morris, Macmillan, for a clear description of how
different architectures must be for different types of expected dataflow
reference patterns. The MU5 was designed for FIFO, numerical codes,
while it ended up running general purpose applications. Baaad vibes,
man.

lindsay> It is just as easy to believe that the modal stack depth would
lindsay> be 1 (one), hence, the multiple stacks would be that many
lindsay> (expensive) registers.

Hey, you are stealing one of my lines here! That you need very few
registers because most dynamically important expressions are terribly
simple is *my* mantra.

I introduced the idea of multiple stacks as a concession to the
important, even if not the modal, case where the dataflow patterns do
have some (limited) complexity, like multiple non nested nests of
computation. Multiple stacks are like a flat register file, except that
they take advantage of the nested nature of each of the strands to allow
for separate stacks. Maybe I should repeat the argument more clearly
(hopefully). Suppose you have determined that you need *up to* 16 spare
registers to cache optimally your dataflow reference patterns, which
exhibit typically *up to* 4 independent strands of computation, each of
them involving maybe *up to* 4 levels of spill.

How should you organize the 16 registers? If you choose a flat register
file you need 4+4(+4) bits to encode the operand numbers and you need,
for an N<=4-way superscalar implementation, an N+k multiported file. If
you use four register stacks with four levels each, you need 2 bits per
operand number and each separate stack can be single ported.

Naturally the flat register file can cope much better with data
reference patterns that are not *up to* four way independent and *up to*
four way deep each, but I surmise that these are relatively rare, for
general purpose applications.

Naturally if you think that you can cache the locality of a code with
fewer than 16 registers, please use fewer, or if you think that
the data reference patterns exhibit no exploitable nesting, then use a
flat register file with its generality and associated costs.

More research needs to be done on this subject; I do have anecdotal
evidence (and have read some papers that seem to imply) that 4x4 is a
good trade off between the number of registers and the profile of most
codes and their exploitable parallelism. I could be persuaded easily
that *most* codes do not need that many, because that's my own mantra,
but then the same argument would apply to the size of a flat register
file.

It is sad that people like you spend lots of time in trial-and-error
attempts to get an "optimizer" going using possibly irrelevant
technology, instead of investigating important issues like the shape of
dataflow reference patterns for various classes of applications. Or
maybe too many products/careers have been built on large register banks
and "optimizers" for this.

My argument, more than being centered on the number four (which I still
believe to be a reasonable estimate), was centered on the idea that a
flat register file does not reflect the expected reference patterns of
general purpose codes as well as the same number of registers arranged
as multiple stacks does.

For certain types of codes, neither arrangement may be best -- vector
and vector oriented superscalars (e.g. the ZS-1) have register *queues*
instead of register stacks or flat files, as the expected dataflow
patterns are FIFO, with no value reuse, but with deep pipelining.

lindsay> But, on the whole, stack architectures are dead,

Just as quantitative computer architecture is, if serious people can
delude themselves that classic graph coloring optimizes run time and not
code space, and there is scant research on studying the actual
reference patterns of classes of code, instead of building
products/careers on questionable _assumptions_, which seem to work only
because they are in fact never tested.

lindsay> and dealing with a proposed one doesn't seem like a good use of
lindsay> my time.

Yes, probably reading the relevant literature is a better use of your
time. Reread David Wall and company a million times... Or maybe the
David Wall in my world is a pop singer in yours :-).
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

preston@ariel.rice.edu (Preston Briggs) (01/14/91)

pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

>    a[i], a[j] := a[m], a[n];

> it involves *both* value aliasing (maybe "i = j" but "a[m] /=
>a[n]") and lvalue aliasing (if "j = m" or "i = n", ...).
>
>Three "solutions" are possible:

>1)

Conservative implementation.

>2)

Non-deterministic implementation; force the programmer to be specific.
I'd say this is similar to C.

>3) A far more interesting solution is to add some cleverness to the
>compiler. I mean, real cleverness, not like current "optimizers".

and shows a solution that involves generating multiple versions
of the assignment, with tests of i, j, m, and n to distinguish
between the various versions.

I don't like this for several reasons.

a) Where's the profit?  We save a register at best over the conservative
   solution in (1).  The cost though, is 1 or 2 tests at run-time.
   A branch or 2 per register saved is no deal.

b) Tools already exist to help disambiguate array references,
   e.g., dependence analysis.  While dependence analysis is not always
   precise, it is conservative (safe) and the technology is improving
   all the time.  Further, if you've taken the time to build a dependence
   analyzer, you can use it to support more extensive and more profitable
   transformations.

and finally

c) I don't like so much special-purpose machinery (compiler code)
   devoted to one programming language construct.  I much prefer
   general-purpose optimizations that help over the entire language.

   In the case of multiple-assignment, I'd go for an instruction
   schedule that attempts to minimize register usage (similar
   to the work of Steve Johnson).  That way, we'd gain the benefits
   on all sorts of expressions.

Preston Briggs

preston@ariel.rice.edu (Preston Briggs) (01/15/91)

pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

>It can happen, even to the best; a lot of serious and important people
>that publish papers on "optimizing" compilers expound the funny view
>that things like graph coloring optimize speed and not space. I don't
>know who's the joke on.
>
>The joke is that graph coloring minimizes the *number* of spills, not
>their *cost* in time. Most spills have negligible cost in time, because
>they are not in critical paths, and are not executed frequently.

The idea of managing registers/memory via graph coloring is an old one
(at least back to '61).  The basic idea is to minimize the register
requirements (as in "find a minimal coloring").  Spilling isn't a part of
graph coloring; it's a part of register allocation.

The advantage of graph coloring is that it provides a simple abstraction
from the complexities of register allocation on various interesting
machines.  The reason it works well (and it _does_) is that many
machines have enough registers to run most programs.

Lots of people complain about the difficulty of implementation and
suggest some simpler alternative.  My experience is that a graph
coloring allocator is not that hard to build, the simpler alternatives
are not trivial to build, and that a global register allocator
(like graph coloring) smokes a local register allocator.

>If an "optimizer" does not attempt to minimize the *dynamic* cost of
>spills, by estimating the frequency of reuse of values either by
>feedback profiling or by heuristics, it will only achieve optimal timing
>by pure chance, or by having so many registers to play with that the
>cost of spills is zero because there aren't any (virtually all the so
>called "optimizers" for RISC machines).

Optimality sounds like a fine reason to have lots of registers,
especially since optimal choices by the compiler are undecidable.

Further, every register allocator based on graph coloring that I've
read about (and I read aggressively in this area) uses heuristics to
estimate the dynamic cost of spills.  In addition, some make use of
profiling info.  Spill choices are always based on these dynamic
estimates.
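
For readers who haven't watched one run, here is a compressed sketch of
a Chaitin/Briggs-style allocator in Python -- my own toy, with invented
names and made-up cost numbers, not any production implementation --
showing where the dynamic estimate enters the spill choice:

    # Simplify nodes of degree < k; when stuck, take the node with the
    # lowest estimated-cost/degree ratio as a spill candidate; colour
    # optimistically on the way back out.
    def allocate(graph, k, cost):
        """graph: {name: set of interfering names},
        k: number of machine registers,
        cost: {name: estimated dynamic cost of spilling it}."""
        work = {n: set(s) for n, s in graph.items()}
        order = []
        while work:
            easy = [n for n in work if len(work[n]) < k]
            n = easy[0] if easy else \
                min(work, key=lambda v: cost[v] / max(len(work[v]), 1))
            order.append(n)
            for m in work.pop(n):
                work[m].discard(n)
        colours, spills = {}, []
        for n in reversed(order):
            used = set(colours[m] for m in graph[n] if m in colours)
            free = [c for c in range(k) if c not in used]
            if free:
                colours[n] = free[0]
            else:
                spills.append(n)    # cheap, high-degree nodes land here
        return colours, spills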

>I had hoped that more people understood what David Wall and Hank Dietz
...
>Yes, probably reading the relevant literature is a better use of your
>time. Reread David Wall and company a million times... Or maybe the
>David Wall in my world is a pop singer in yours :-).

I'm not sure exactly which Wall or Dietz papers you're referring to;
they both publish extensively.  I'll talk some about Wall.

Wall has papers in Sigplan 86 and Sigplan 88 discussing global
(in this case meaning interprocedural) register allocation at link 
time (the 88 paper compares with various forms of register windows).  
The papers are interesting for many reasons; however, Grandi's
emphasis is (probably) the small number of registers required
for reasonable performance and the small benefits achieved
with a large number of registers (the 86 paper reports results
for 8, 32, and 52 registers).  It's important to be a little
sceptical though, and look for holes (reasons why his measurements
might not apply to particular situations).

For his 86 paper, I wonder about:
1) The balance of his machine (cost of add versus cost of load).
   In the 5 years since the 86 paper, spills have gotten more
   expensive.

   Consider also various forms of low-level parallelism (pipelines
   and such).  These aren't recent inventions, but they change
   the shape of the code.

2) His optimizer.  It's not mentioned beyond "The compiler generates
   good code that does not load a variable twice in the same basic
   block..."  Sounds like value numbering to me. Optimization needs
   registers.  More powerful optimization needs more registers (and
   the payoff is faster programs).

   Even if he used a good optimizer, improvements in the last 5 years
   tend to make his numbers conservative (e.g., global value numbering,
   fancier pipeline scheduling, dependence analysis and loop restructuring).

3) His benchmarks.  Do they resemble anything you're interested in?

4) He doesn't talk about allocating addressing temporaries to registers.
   Again, I assume it's because of minimal optimization.  But temporaries
   are an important consumer of registers (and their reuse is an
   important source of improvement).

5) He doesn't do subsumption (coalescing) when coloring.  This saves
   copy instructions (cycles!) but sometimes requires additional
   registers.  Therefore, his results may understate the improvement
   and the registers requirements.

Reading Wall's work is valuable.  His numbers are interesting and
informative.  But I think they should be interpreted as showing trends,
not exact register requirements.

Preston Briggs

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (01/15/91)

In article <PCG.91Jan13174042@odin.cs.aber.ac.uk> 
	pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>pcg> It is an old controversy, but let me repeat here: 90% of what
>pcg> passes for "optimizing" compilers are compilers that optimize for
>pcg> *space*, not time, ...
>lindsay> The optimizing compilers that I worked on, were trying for
>lindsay> speed.  We tuned the things by endlessly recompiling and
>lindsay> running our (ever-expanding) code-quality suite. We didn't
>lindsay> bother to measure how small the generated code was: we just
>lindsay> timed it.
>
>This seems (to me at least) a funny way of proceeding, and demonstrates a
>trial-and-error mentality that may be there because of lack of insight.

>It is sad that people like you spend lots of time in trial-and-error
>attempts to get an "optimizer" going using possibly irrelevant
>technology, instead of investigating important issues ...
>Reread David Wall and company a million times... Or maybe the
>David Wall in my world is a pop singer in yours :-).

Hmmm. Some of my coworkers from that effort have since been
coworkers of David Wall. The idea that we lacked insight is amusing.

The insight which you have missed is that advanced optimization is
heavily influenced by the statistical properties of the test suite.
It is entirely possible to persuade one's colleagues to make some
improvement, promising them wonders:  and then discover that the code
quality suite now runs slightly SLOWER. (Or - more usual and  more
agonizing - some programs are faster, but others are slower.) If you
do not understand why this should be, then you have a non-quantitative
understanding of the subject area.

However, that was not the point that I was trying to make. I was
pointing out, unambiguously, that our compilers were _clearly_ aiming
for speed, not space.

>lindsay> But, on the whole, stack architectures are dead,
>Just as quantitative computer architecture is, if serious people can
>delude themselves ...
>I introduced the idea of multiple stacks ...
>Maybe I should repeat the argument more clearly (hopefully).

Clarity is not the issue: I am simply unconvinced that your proposal
would be competitive if built. I encourage you to obtain quantitative
supporting data.




-- 
Don		D.C.Lindsay .. temporarily at Carnegie Mellon Robotics

yodaiken@chelm.cs.umass.edu (victor yodaiken) (01/15/91)

In article <1991Jan14.191057.14242@rice.edu> preston@ariel.rice.edu (Preston Briggs) writes:
>I wrote:
>>>
>>>Optimality sounds like a fine reason to have lots of registers,
>>>especially since optimal choices by the compiler are undecidable.
>
>and yodaiken@chelm.cs.umass.edu (victor yodaiken) writes:
>>How's that? The compiler is allocating a finite set of resources. How
>>can this problem be undecidable? Difficult, yes; intractable, maybe; but
>>I don't see where undecidability would come from. Did I miss something here?
>
>Because the path actually taken through the code can't, in general,
>be known at compile-time.
>
>Preston Briggs

Still don't see it.  The system state, i.e. the contents of store and
registers, appears to determine the path taken through a piece of code.
If we have k registers of n bits each, memory addresses in the range
0 to 2^l for some l, where each memory word holds m bits plus some
other stuff, then we get approximately

	x = 2^(n*k) * 2^((2^l - 1)*m) + C

possible states (C represents things like carry bits etc).
Thus, in principle, we can check each possible code path.
Where is the undecidability?  In practice, we won't really have to
check all paths.  See H. Massalin, "Superoptimizer: A Look at the
Smallest Program" (ASPLOS II, 1987) for a nice example of this
principle in practice.

foo@erfordia.rice.edu (Mark Hall) (01/15/91)

In article <25106@dime.cs.umass.edu> yodaiken@chelm.cs.umass.edu (victor yodaiken) writes:
)In article <1991Jan14.191057.14242@rice.edu> preston@ariel.rice.edu (Preston Briggs) writes:
)>I wrote:
)>>>
)>>>Optimality sounds like a fine reason to have lots of registers,
)>>>especially since optimal choices by the compiler are undecidable.
)>
)>and yodaiken@chelm.cs.umass.edu (victor yodaiken) writes:
)>>How's that? The compiler is allocating a finite set of resources. How
)>>can this problem be undecidable? Difficult, yes; intractable, maybe; but
)>>I don't see where undecidability would come from. Did I miss something here?
)>
)>Because the path actually taken through the code can't, in general,
)>be known at compile-time.
)>
)>Preston Briggs
)
)Still don't see it.  The system state, i.e. contents of store and registers,
)appears to determine the path taken through a piece of code. 

   I think that by "in general" preston is referring to programs which
 read input.

   Can you predict the state of the registers for all possible
 (possibly infinite) input streams?

 - mark 
   former office-mate of preston   (how's THAT for name-dropping?)
 

yodaiken@chelm.cs.umass.edu (victor yodaiken) (01/15/91)

In article <1991Jan14.233249.21161@rice.edu> foo@erfordia.rice.edu (Mark Hall) writes:
>In article <25106@dime.cs.umass.edu> yodaiken@chelm.cs.umass.edu (victor yodaiken) writes:
>)>Because the path actually taken through the code can't, in general,
>)>be known at compile-time.
>)>
>)>Preston Briggs
>)
>)Still don't see it.  The system state, i.e. contents of store and registers,
>)appears to determine the path taken through a piece of code. 
>
>   I think that by "in general" preston is referring to programs which
> read input.
>
>   Can you predict the state of the registers for all possible
> (possibly infinite) input streams?
>

Not personally, but in principle it can be done.  The input streams
comprise a regular language.  Thus we have a finite representation, and
there is some k such that any input stream of length greater than k
repeats a state.

przemek@liszt.helios.nd.edu (Przemek Klosowski) (01/16/91)

In article <25106@dime.cs.umass.edu> yodaiken@chelm.cs.umass.edu (victor yodaiken) writes:
>In article <1991Jan14.191057.14242@rice.edu> preston@ariel.rice.edu (Preston Briggs) writes:
>>>>especially since optimal choices by the compiler are undecidable.
>>
>>and yodaiken@chelm.cs.umass.edu (victor yodaiken) writes:
>>>How's that? The compiler is allocating a finite set of resources. How
>
>Still don't see it.  The system state, i.e. contents of store and registers,
>appears to determine the path taken through a piece of code. 

Normally you compile the program in order to run it. If you use the exhaustive
strategy you propose, there is no need to run the program at all---the compiler
would actually run the program, and instead of producing the executable it 
might as well give the program's output...
	
--
			przemek klosowski (przemek@ndcvx.cc.nd.edu)
			Physics Dept
			University of Notre Dame IN 46556

preston@ariel.rice.edu (Preston Briggs) (01/16/91)

I wrote, regarding optimal spill choice being undecidable:

>>)>Because the path actually taken through the code can't, in general,
>>)>be known at compile-time.

And yodaiken@chelm.cs.umass.edu (victor yodaiken) writes:

>>)Still don't see it.  The system state, i.e. contents of store and registers,
>>)appears to determine the path taken through a piece of code. 

And Mark Hall (my former officemate) expanded

>>   I think that by "in general" preston is referring to programs which
>> read input.
>>
>>   Can you predict the state of the registers for all possible
>> (possibly infinite) input streams?

and Yodaiken continued

>Not personally, but in principle, it can be done. The input streams 
>comprise a regular language. Thus, we have a finite representation, and 
>we have some k, so that any input stream of length greater than k repeats
>state.  

We seem to be edging toward a little argument over trivia.

I was speculating about compiling a single subroutine,
where we have no knowledge (or very little knowledge)
about arguments, global memory, and input.
This seems to be a pretty typical starting point
for building (or thinking about) a compiler or optimizer.

So, if we see a stupid routine that looks like

	routine dumbcode(arg)
	  A
	  if arg = 0 then
	     B
	  else
	     C
	  endif
	  D
	  return
        end

and for some reason, there's enough register pressure that we must
choose to improve one of B or C at the expense of the other.
How can we choose?  I don't see how we can (with all the assumptions).
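
To make the pressure concrete, here is an invented C rendition (mine;
the loads and arithmetic mean nothing in themselves).  Four values are
live across the branch, and each arm wants temporaries of its own:

	/* t1..t4 are live from A through D, and each arm of the branch
	   wants registers for its own temporaries.  With, say, six
	   allocatable registers, something in B or C must spill -- and
	   which arm deserves the fast code depends on how often arg is
	   zero, which the compiler cannot know here. */
	int dumbcode(int arg, const int *m)
	{
	    int t1 = m[0], t2 = m[1], t3 = m[2], t4 = m[3];  /* A */
	    int r;
	    if (arg == 0) {
	        int u1 = t1 * t2, u2 = t3 * t4;              /* B */
	        r = u1 + u2;
	    } else {
	        int v1 = t1 - t3, v2 = t2 - t4;              /* C */
	        r = v1 * v2;
	    }
	    return r + t1 + t2 + t3 + t4;                    /* D */
	}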

But if I've got enough registers, the question doesn't arise.
This allows a significant simplification in optimization:

    "Always assume you've got enough registers."

Probably due to John Cocke, this assumption allows us to avoid many hard
(and sometimes undecidable) choices.

Machines with enough registers make this a valid assumption.

Graph coloring allocators (that attempt to minimize the number of required
registers) make the assumption valid on more machines and more code.
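
For the curious, the core of such an allocator really is small.  Here
is a toy rendition (mine, with invented names; real allocators add
spill cost estimates, coalescing, and live ranges computed by dataflow
analysis):

	#include <string.h>

	#define MAXN 64

	/* Chaitin-style coloring: n virtual registers, K machine
	   registers, intf[i][j] nonzero when i and j are live at the
	   same time (diagonal zero).  Fills color[i] and returns 0, or
	   returns -1 when every remaining node has degree >= K, i.e.
	   when something would have to spill. */
	int color_graph(int n, int K, char intf[MAXN][MAXN], int color[MAXN])
	{
	    int stack[MAXN], top = 0;
	    char gone[MAXN];
	    memset(gone, 0, sizeof gone);

	    /* Simplify: repeatedly remove a node with < K neighbors;
	       it can be colored no matter what the rest does. */
	    while (top < n) {
	        int i, j, found = -1;
	        for (i = 0; i < n && found < 0; i++) {
	            int deg = 0;
	            if (gone[i])
	                continue;
	            for (j = 0; j < n; j++)
	                if (!gone[j] && intf[i][j])
	                    deg++;
	            if (deg < K)
	                found = i;
	        }
	        if (found < 0)
	            return -1;                 /* spill decision goes here */
	        gone[found] = 1;
	        stack[top++] = found;
	    }

	    /* Select: pop nodes; each sees < K colored neighbors, so a
	       free color always exists. */
	    while (top > 0) {
	        int i = stack[--top], c, j, used;
	        gone[i] = 0;
	        for (c = 0; c < K; c++) {
	            for (used = 0, j = 0; j < n; j++)
	                if (!gone[j] && intf[i][j] && color[j] == c)
	                    used = 1;
	            if (!used)
	                break;
	        }
	        color[i] = c;
	    }
	    return 0;
	}

The coloring is the easy part; the hard decisions hide behind the
"return -1", where something must be chosen for spilling.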

Preston Briggs

yodaiken@chelm.cs.umass.edu (victor yodaiken) (01/16/91)

In article <1991Jan15.165940.2545@news.nd.edu> przemek@liszt.helios.nd.edu (Przemek Klosowski) writes:
>In article <25106@dime.cs.umass.edu> yodaiken@chelm.cs.umass.edu (victor yodaiken) writes:
>>In article <1991Jan14.191057.14242@rice.edu> preston@ariel.rice.edu (Preston Briggs) writes:
>>>>>especially since optimal choices by the compiler are undecidable.
>>>
>>>and yodaiken@chelm.cs.umass.edu (victor yodaiken) writes:
>>>>How's that? The compiler is allocating a finite set of resources. How
>>
>>Still don't see it.  The system state, i.e. contents of store and registers,
>>appears to determine the path taken through a piece of code. 
>
>Normally you compile the program in order to run it. If you use the exhaustive
>strategy you propose, there is no need to run the program at all---the compiler
>would actually run the program, and instead of producing the executable it 
>might as well give the program's output...
>	
>--
>			przemek klosowski (przemek@ndcvx.cc.nd.edu)
>			Physics Dept
>			University of Notre Dame IN 46556

I did not propose the exhaustive strategy as a practical engineering
device.  Instead, I wanted to argue that Turing undecidability has
little to do with compilation.

yodaiken@chelm.cs.umass.edu (victor yodaiken) (01/16/91)

In article <1991Jan15.172335.12695@rice.edu> preston@ariel.rice.edu (Preston Briggs) writes:
>I was speculating about compiling a single subroutine,
>where we have no knowledge (or very little knowledge)
>about arguments, global memory, and input.
>This seems to be a pretty typical starting point
>for building (or thinking about) a compiler or optimizer.
>
>So, if we see a stupid routine that looks like
>
>	routine dumbcode(arg)
>	  A
>	  if arg = 0 then
>	     B
>	  else
>	     C
>	  endif
>	  D
>	  return
>        end
>
>and for some reason, there's enough register pressure that we must
>choose to improve one of B or C at the expense of the other.
>How can we choose?  I don't see how we can (with all the assumptions).
>

If we don't know how many registers we have, or how much program
memory is addressable, then certainly we cannot hope to optimize
register use.  But if we are taking this code and generating machine
code that will run on machine X, with some limit of addressable memory
per process and some known number of registers, then we can in principle
minimize register spilling.  This is simply because machine X is a finite
state system, and finite state machines can be minimized effectively.
This does not mean that we can necessarily find a practical algorithm
which will solve the register spilling problem for some class of
interesting computers and some programming language.  But there is no
theoretical result which says that such algorithms do not exist.  If we
were building a compiler for a Turing machine, then we would have a
problem, but generally we work with much more limited machines.
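
To back up "minimized effectively": here is a sketch of Moore-style
partition refinement (a toy rendition of the textbook algorithm, with
an invented representation):

	#include <string.h>

	#define MAXSTATES 64
	#define NSYMS     2

	/* Moore's algorithm: start by splitting accepting from
	   non-accepting states, then keep splitting classes until no
	   input symbol sends two same-class states to different
	   classes.  cls[s] gets the block number of state s; the return
	   value is the state count of the minimal machine. */
	int minimise(int n, int delta[][NSYMS], const char accept[], int cls[])
	{
	    int newcls[MAXSTATES];
	    int nclasses = 0, nnew, s, t, c;

	    for (s = 0; s < n; s++)
	        cls[s] = accept[s] ? 1 : 0;
	    do {
	        nnew = 0;
	        for (s = 0; s < n; s++) {
	            int found = -1;
	            for (t = 0; t < s && found < 0; t++) {
	                int same = (cls[t] == cls[s]);
	                for (c = 0; same && c < NSYMS; c++)
	                    if (cls[delta[t][c]] != cls[delta[s][c]])
	                        same = 0;
	                if (same)
	                    found = t;
	            }
	            newcls[s] = (found >= 0) ? newcls[found] : nnew++;
	        }
	        memcpy(cls, newcls, n * sizeof *cls);
	        if (nnew == nclasses)
	            break;                     /* fixed point reached */
	        nclasses = nnew;
	    } while (1);
	    return nclasses;
	}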

pcg@cs.aber.ac.uk (Piercarlo Grandi) (01/16/91)

On 14 Jan 91 16:37:39 GMT, preston@ariel.rice.edu (Preston Briggs) said:

preston> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:

	[ ... register coloring has the wrong goal; it works only
	because there are often so many registers that *any* register
	allocation policy would win, by not being used ... ]

preston> The idea of managing registers/memory via graph coloring is an
preston> old one (at least back to '61).  The basic idea is to minimize
preston> the register requirements (as in "find a minimal coloring").
preston> Spilling isn't a part of graph coloring; it's a part of
preston> register allocation.

Well, this is a bit misleading -- graph coloring and spilling may be
different modules of a compiler, but the *goal* of graph coloring is to
minimize the number of spills.

preston> The advantage of graph coloring is that it provides a simple
preston> abstraction from the complexities of register allocation on
preston> various interesting machines.  The reason it works well (and it
preston> _does_) is that many machines have enough registers to run most
preston> programs.

In other words, several modern architectures have many more
registers than they would otherwise need, just to make an inappropriate
compiler technology work by never de facto exercising it.

Ironically, many machines as a result have enough registers that
running graph coloring itself is pointless, because there is no need to
minimise register usage.  The problem is that registers are not free.

It reminds me of the System V swapper/pager -- its policies are so bad
that the only way to make it work well is to have enough memory (ten
times the working set) that swapping/paging never occurs :-).

preston> Lots of people complain about the difficulty of implementation
preston> and suggest some simpler alternative.  My experience is that a
preston> graph coloring allocator is not that hard to build,

I agree, the graph coloring allocator is not difficult to build -- it
can be done in a few dozen lines of code. Complaints dismissed :-). But
the tedious and dangerous part is to do the global analysis that builds
the graph.
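
For concreteness: the analysis in question is ordinarily global
live-variable analysis, iterated to a fixed point.  A compressed
sketch (invented structure, one bit per variable):

	#include <string.h>

	typedef unsigned long bits;        /* one bit per variable, <= 32 */

	struct block {
	    bits use, def;                 /* upward-exposed uses; definitions */
	    int  succ[2], nsucc;           /* successor block indices */
	};

	/* Live-variable analysis, iterated to a fixed point:
	       in[b]  = use[b] | (out[b] & ~def[b])
	       out[b] = union of in[s] over successors s of b
	   Two values interfere when one is defined while the other is
	   live; those edges are the coloring graph. */
	void liveness(const struct block *b, int n, bits in[], bits out[])
	{
	    int changed = 1, i, s;
	    memset(in,  0, n * sizeof *in);
	    memset(out, 0, n * sizeof *out);
	    while (changed) {
	        changed = 0;
	        for (i = n - 1; i >= 0; i--) {      /* backward problem */
	            bits newout = 0, newin;
	            for (s = 0; s < b[i].nsucc; s++)
	                newout |= in[b[i].succ[s]];
	            newin = b[i].use | (newout & ~b[i].def);
	            if (newin != in[i] || newout != out[i])
	                changed = 1;
	            in[i]  = newin;
	            out[i] = newout;
	        }
	    }
	}

The iteration itself is easy; the tedium and the hazards are in
getting use/def right in the presence of aliasing, procedure calls,
and the rest.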

preston> Further, every register allocator based on graph coloring that
preston> I've read about (and I read aggressively in this area) uses
preston> heuristics to estimate the dynamic cost of spills.  In
preston> addition, some use of profiling info.  Spill choices are always
preston> based on these dynamic estimates.

Are we sure we are not thinking of the same papers, only that I read
them as saying "if you have heuristics/profiling, graph coloring is
virtually useless" while you read them as saying "graph coloring
allocators do make use of heuristics/profiling"? :-)

	[ ... on Wall's papers ... ]

preston> It's important to be a little sceptical though, and look for
preston> holes (reasons why his measurements might not apply to
preston> particular situations).

My reckoning is that his papers are essentially valid for architectures
like that of his test machine, where the internal degree of parallelism
is small, and the applications are general purpose.  More research is
needed to see how the knee of the curve shifts along two dimensions:
the type of application, and the internal degree of parallelism of
the machine.

Not along the degree of "optimization" dimension: either the code
generator does a good job of matching the application dataflow patterns
to those of the CPU, and then it is simply doing competent code
generation, or it is badly designed; and if the machine and application
dataflow patterns are not well matched, no additional "optimization"
will buy anything.

But first I think more research is needed into application and machine
characterization in themselves... The only serious amount of work in
this field has been done by the parallel computer people, for medium and
coarse level parallelism. Microparallelism has not been as widely
studied. Some work, at the language level, has been done by the
functional language people (not surprisingly, they often are the same
ones that like dataflow and graph reduction architectures), but that's
hardly mainstream.

preston> For his 86 paper, I wonder about:
preston> 1) The balance of his machine (cost of add versus cost of load).
preston>    In the 5 years since the 86 paper, spills have gotten more
preston>    expensive.

Yes. But the question is: is a bigger flat register file the best way to
provide extra buffering against increased memory latency? *Assuming* so
sounds like CISC designers assuming that ever more sophisticated
instructions are the key to greater speed.

preston>    Consider also various forms of low-level parallelism
preston>    (pipelines and such).  These aren't recent inventions, but
preston>    they change the shape of the code.

They are important issues -- and I seem to have written that the number
of useful registers is a more or less linear function of the exploitable
degree of parallelism, for spare registers, and of the frequency of
reuse of values, for cache registers.
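
An invented C illustration of that linearity (not from any benchmark):
unroll a reduction by four so that four adds can be in flight, and the
register demand for the partial sums grows with the unroll factor:

	/* One accumulator: one live sum, minimal registers, but every
	   add waits on the previous one. */
	double sum1(const double *a, int n)
	{
	    double s = 0.0;
	    int i;
	    for (i = 0; i < n; i++)
	        s += a[i];
	    return s;
	}

	/* Four accumulators: four independent adds can proceed in
	   parallel, but four sums (plus pointer and count) must now
	   stay in registers -- need grows with exposed parallelism. */
	double sum4(const double *a, int n)
	{
	    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
	    int i;
	    for (i = 0; i + 3 < n; i += 4) {
	        s0 += a[i];
	        s1 += a[i + 1];
	        s2 += a[i + 2];
	        s3 += a[i + 3];
	    }
	    for (; i < n; i++)                 /* leftover elements */
	        s0 += a[i];
	    return (s0 + s1) + (s2 + s3);
	}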

For contemporary microcomputer implementations, with low limits on both,
this means that far fewer registers are needed than "optimizer"
mythology assumes.  For future implementations, which are supposed to
have much greater internal degrees of parallelism, more registers will
be needed.

But high degrees of parallelism imply well predictable and independent
dataflow patterns -- separate register sets seem the best bet here, just
as SIMD/MIMD architectures are for even higher degrees of parallelism.

NOTE: I have *never* written that there is an *absolute* upper limit to
the number of useful registers for *all* possible applications or
machine architectures.  I have no complaint about the size of the
huge vector register banks of a CRAY.

preston> 2) His optimizer.  It's not mentioned beyond "The compiler generates
preston>    good code that does not load a variable twice in the same basic
preston>    block..."  Sounds like value numbering to me. Optimization needs
preston>    registers.  More powerful optimization needs more registers (and
preston>    the payoff is faster programs).

This is an *assumption*, and a bizarre one.  "Optimization" does not
create the need for more registers -- the need for registers is strictly
a function of the intrinsic microparallelism of the application and the
machine, as evidenced by their dataflow patterns.  These, for general
purpose applications on contemporary general purpose microprocessors,
imply fairly low register requirements.  "Optimization" cannot get
around this.

I have detected, also in e-mail discussions, some misunderstanding of
the intention of my postings:

My postings sneer at the "more registers!" and "more optimization!"
fashions, by pointing out that:

* the optimal number of registers is a function of machine and
application dataflow patterns, not of "optimization";

* there is some evidence leading one to believe that this optimal
number is surprisingly low for contemporary processors with limited
internal parallelism, and that not much greater levels of parallelism
are exploitable for general purpose applications;

* for non general purpose applications SIMD/MIMD machines and their
special rules are the best bet;

* much of global analysis that makes "optimization" hard and hazardous
can be obviated by a little bit more thoughtfulness at the language
level, such as using language paradigms of appropriate expressive power,
instead of reverse engineering in the "optimizer" poor choices made at
higher layers;

* just to make my criticism constructive, I hazard showing that credible
(in principle) specific alternatives are possible, so that my
call for more research seems more alluring;

* perversely, a lot of research effort is spent instead on "more
registers!" and "more optimization!", comparatively little (as
published) on characterization of low level dataflow patterns and on
careful language design.  Dusty decks (or ideas) rule?

The objections I see to my articles seem to be totally off the mark with
respect to the above points.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

larus@primost.cs.wisc.edu (James Larus) (01/16/91)

I believe that Preston is correct in claiming that register allocation
is an undecidable problem.  It can be shown with a simple variant of
Turing's halting argument (see disclaimers below):

Consider a function F that contains two parts: an arbitrary computation
C and another (register-intensive) computation C'

	procedure F ()
	  C;
	  C';

If C does not halt, then a register allocator should allocate all
registers to C, since C' will never be executed and this will
`minimize' the computation time.  However, if C halts, then the
registers should be divided more equitably between C and C'.  A
register allocator that could make this choice for an arbitrary
computation C solves the Turing halting problem.

The big problem with this argument is that the behavior of register
allocators on non-terminating programs is (always?) left unspecified.
Also, as someone noted, real computers are not Turing machines with
infinite tapes, so in theory the number of states that a real computer
can reach is finite, and all we have to do is look for a repetition in
the sequence of states to detect an infinite loop.  Of course, in
practice, a machine with a non-trivial amount of memory has too many
states to make this argument sensible.

/Jim