[comp.compilers] Compilers taking advantage of architectural enhancements

aglew@crhc.uiuc.edu (Andy Glew) (10/12/90)

[Copied from comp.arch. -John]

>[gillies@m.cs.uiuc.edu]
>One of the problems in new CPU designs is that designers don't realize
>which architecture enhancements are pointless, because we don't and
>may never have the optimization technology to take advantage of them.

Ah, a potentially interesting and useful topic.  Perhaps we can start
a discussion that will lead to a list of possible hardware
architectural enhancements that a compiler can/cannot take advantage
of?  Maybe the real compiler guys (preston, David C.) will help us
out?

By the way, it is useful in such discussions to distinguish between
run-of-the-mill optimizations that can pretty much be added to any
existing compiler by a fairly good CS grad, and optimizations that
require so much background that only the "experts" in the field can
reasonably be expected to implement them.  For example, many of the
parallelizing FORTRAN loop optimizations can reasonably only be
expected to be done by people from Rice or UIUC CSRD (note that I'm
UIUC, but CRHC, not CSRD), so unless you are willing to pay for these
guys (or their assorted front companies) you aren't likely to get them
onto your machine.
    Cost of compiler development can be significant.  Sometimes a
company might put a hardware feature in even though they know a
compiler approach would be better, because they can't afford the
compiler guys (or the compiler guys already have exclusive contracts
with a competitor).


Let me list a few things to start off the discussion.  I hope and
expect to be shot down on a few of them.  It's easy to list a few of
the hardware enhancements that we already know compilers can take
advantage of.


Branch Delay Slots - small number
    One or two branch delay slots can be filled quite easily.
    DIFFICULTY: a reasonably good CS grad can do it.

Branch Delay slots - large number
    Most groups report difficulty filling a large number of branch delay
    slots, although my group's compiler guy (UIUC IMPACT, Prof. Wen-mei Hwu)
    has been able to get good results for up to 8 delay slots, on slightly
    different hardware (forward semantic, with profiling feedback).
    DIFFICULTY: given the slightly different hardware, and exposure
    	to the necessary ideas, a reasonably good CS grad can do it.

Register file - moderate sized (up to 32 registers)
    Most compilers can use these efficiently. 
    DIFFICULTY:	a reasonably good CS grad can do it.
    	Avoiding IBM's patents is a slight problem.

Register file - large (around 128 registers, or more)
    Most compilers do not get enough benefit from these to justify
    the extra hardware, or the slowed down register access.

Multiple, hierarchical register sets
    Like Cray's fast S registers and slower T registers.
    It took a very long time for compilers to take advantage 	
    of these multiple register sets.
    DIFFICULTY: complex.
    

Separate floating point register file
    Most compilers can take advantage of this.
    DIFFICULTY: easy

Heterogenous register file
    Few compilers have been developed that can take advantage of a truly
    heterogenous register file, one in which, for example, the divide
    unit writes to registers D1..D2, the add unit writes to registers
    A1..A16, and the shift unit writes to registers S1..S4 -- even though
    such hardware conceivably has a cycle time advantage over homogenous
    registers, even on VLIW machines where data can easily be moved to
    generic registers when necessary.
    DIFFICULTY: hard. 

Instruction cache
    Many groups have reported compilers that can take advantage of small
    instruction caches by rearranging code to minimize I-cache misses,
    e.g. HP and UIUC IMPACT.
    DIFFICULTY: moderately complex - a good CS grad can do it, but it
    requires tools and background that take a while to develop and are
    not widely available.  One of the complicating factors is that this
    sort of optimization is most useful when it is done in cooperation
    with the linker/loader, which is something the average compiler
    hacker doesn't have a background in (although it isn't too hard).
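
As a rough illustration only -- this is not HP's or IMPACT's actual
algorithm -- the flavor of the optimization: take a profile of branch
frequencies and chain the hottest pairs of blocks together so that the
frequent path sits contiguously in the I-cache.  A toy sketch in C,
with made-up profile data:

/* Toy profile-guided code layout: link blocks that branch to each other
 * most often into chains, then emit each chain contiguously.  Naive
 * greedy heuristic, invented numbers. */
#include <stdio.h>
#include <stdlib.h>

#define NBLOCKS 6

struct edge { int from, to, count; };     /* profile: taken-branch counts */

static int next_in_chain[NBLOCKS];        /* -1 = chain ends here */
static int has_pred[NBLOCKS];             /* block already placed after one */

static int hotter(const void *a, const void *b)
{
    return ((const struct edge *)b)->count - ((const struct edge *)a)->count;
}

int main(void)
{
    struct edge prof[] = {                /* made-up profile data */
        {0,1,900}, {1,2,850}, {2,0,40}, {1,3,50}, {3,4,45}, {4,5,45}
    };
    int nedges = (int)(sizeof prof / sizeof prof[0]);
    int i, b;

    for (i = 0; i < NBLOCKS; i++) { next_in_chain[i] = -1; has_pred[i] = 0; }
    qsort(prof, nedges, sizeof prof[0], hotter);

    /* Hottest edges first: append block t to block f if f has no
     * successor yet, t has no predecessor yet, and no cycle results. */
    for (i = 0; i < nedges; i++) {
        int f = prof[i].from, t = prof[i].to, p;
        if (next_in_chain[f] != -1 || has_pred[t])
            continue;
        for (p = t; p != -1 && p != f; p = next_in_chain[p])
            ;                             /* would the chain become a loop? */
        if (p == f)
            continue;
        next_in_chain[f] = t;
        has_pred[t] = 1;
    }

    /* Emit the layout: every chain head followed by its successors. */
    printf("proposed layout:");
    for (b = 0; b < NBLOCKS; b++)
        if (!has_pred[b])
            for (i = b; i != -1; i = next_in_chain[i])
                printf(" B%d", i);
    printf("\n");
    return 0;
}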

Data cache - software managed consistency
    This reportedly has been done, but certainly isn't run-of-the-mill.
    DIFFICULTY: needs a skilled compiler expert.

Micro-scheduling parallelism (like CONVEX's ASAP)
    Apparently isn't too hard to develop compiler support for this.
    Hardware is rather expensive, though.
    DIFFICULTY: moderately complex

Vectorizing
    Done in lots of places, but high performance vectorizing compilers
    still seem to be a problem for a lot of companies.
    DIFFICULTY: complex.  The easy things are easy, but milking the
    last few FLOPs out of your vector unit requires compiler experts.
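
To illustrate the easy end of the spectrum (in C rather than Fortran,
and illustration only): loop (a) below has no loop-carried dependence
and maps straight onto vector operations, while the recurrence in (b)
does not, at least not without considerably more cleverness.

/* (a) is vectorizable: iterations are independent.
 * (b) carries a[i-1] into iteration i, so a straightforward vectorizer
 * must leave it scalar. */
#include <stdio.h>

#define N 8

int main(void)
{
    double a[N + 1], b[N], c[N];
    int i;

    for (i = 0; i < N; i++) { b[i] = i; c[i] = 2.0; }
    for (i = 0; i <= N; i++) a[i] = 0.0;

    /* (a) vectorizable */
    for (i = 0; i < N; i++)
        a[i] = b[i] + c[i];

    /* (b) recurrence -- not vectorizable as written */
    for (i = 1; i <= N; i++)
        a[i] = a[i - 1] + b[i - 1];

    for (i = 0; i <= N; i++)
        printf("%g ", a[i]);
    printf("\n");
    return 0;
}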

Parallelizing - fine grain, small numbers of processors
    (like Alliant's concurrent hardware)
    Done, e.g. by Alliant and Cray
    DIFFICULTY: to be competitive with this sort of hardware, you
    	need real compiler experts on your team.  Harder than vectorizing.

Parallelizing, fine grain, large numbers of processors.
    What Tera is trying to do.
    DIFFICULTY: very complex. Leading edge.
    	Good luck, Tera.

Multiple functional units - heterogenous - VLIW or superscalar
    DIFFICULTY: complex.
    	Complexity of the packing algorithms grows rapidly as
        you try to go beyond basic block parallelism.

Multiple functional units - homogenous - VLIW or superscalar
    DIFFICULTY: moderately complex
    	Easier than the heterogenous case, and the packing algorithms
    	are considerably easier.

Special hardware instructions - scalar
    Taking advantage of simple instructions like abs(), conditional exchange, etc.
    DIFFICULTY:
    	(1) When treated not as a compiler problem, but as a problem of simply
    	    writing libraries to inline optimized machine code, EASY
    	    Requires inlining support.
    	(2) When treated as a compiler problem, eg. compiler should recognize
    	    IF a < 0 THEN a = -a
    	    and automatically convert it to a = ABS(a)
    	    MODERATELY COMPLEX.
    I expect Herman Rubin to comment on this.
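
A toy sketch of case (2), with an invented statement representation
(not any real compiler's IR), just to show that the recognition itself
is a small pattern match once the statement shape is exposed:

/* Recognize the idiom "if (x < 0) x = -x;" in a toy statement form and
 * rewrite it to a single ABS operation that code generation can then
 * map onto the hardware instruction. */
#include <stdio.h>
#include <string.h>

enum op { OP_IF_NEG_THEN_NEGATE, OP_ABS, OP_OTHER };

struct stmt {
    enum op kind;
    char cond_var[16];      /* variable tested against zero         */
    char dest[16];          /* variable assigned in the THEN branch */
    char src[16];           /* operand of the unary minus           */
};

/* "if (v < 0) v = -v" becomes "v = ABS(v)" when all three names match. */
static void recognize_abs(struct stmt *s)
{
    if (s->kind == OP_IF_NEG_THEN_NEGATE &&
        strcmp(s->cond_var, s->dest) == 0 &&
        strcmp(s->dest, s->src) == 0)
        s->kind = OP_ABS;
}

int main(void)
{
    struct stmt s = { OP_IF_NEG_THEN_NEGATE, "a", "a", "a" };
    recognize_abs(&s);
    printf(s.kind == OP_ABS ? "a = ABS(a)\n" : "no ABS idiom found\n");
    return 0;
}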

--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

preston@titan.rice.edu (Preston Briggs) (10/12/90)

[Copied from comp.arch. -John]

In article <AGLEW.90Oct11144920@dwarfs.crhc.uiuc.edu> aglew@crhc.uiuc.edu (Andy Glew) writes:

>Perhaps we can start
>a discussion that will lead to a list of possible hardware
>architectural enhancements that a compiler can/cannot take advantage of?

I'll comment on a couple of features from the list

>Register file - large (around 128 registers, or more)
>    Most compilers do not get enough benefit from these to justify
>    the extra hardware, or the slowed down register access.

In the proceedings of Sigplan 90, there's a paper about how to chew
lots of registers.

	Improving Register Allocation for Subscripted Variables
	Callahan, Carr, Kennedy

I suggested the subtitle "How to use all of those FP registers"
but nobody was impressed.  Also, there's a limit to how many registers
you need, at least for scientific fortran.  It depends on the speed
of memory and cache, speed of the FPU, and the actual applications.
The idea is that once the FPU is running at full speed,
more registers are wasted.
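
To give the flavor of the transformation (an illustration only, not the
paper's algorithm): the reuse of a[i+1] as the next iteration's a[i] is
carried in a scalar temporary, which a coloring allocator can then keep
in one of those FP registers.

/* Scalar replacement of a subscripted variable, by hand. */
#include <stdio.h>

#define N 10

int main(void)
{
    double a[N + 1], b[N];
    int i;

    for (i = 0; i <= N; i++) a[i] = i;

    /* Original: a[i] and a[i+1] both come from memory every iteration. */
    for (i = 0; i < N; i++)
        b[i] = a[i] + a[i + 1];

    /* After scalar replacement: each element is loaded once; the value
     * flows from t1 to t0 instead of being re-fetched. */
    {
        double t0 = a[0], t1;
        for (i = 0; i < N; i++) {
            t1 = a[i + 1];
            b[i] = t0 + t1;
            t0 = t1;
        }
    }

    printf("b[0]=%g b[%d]=%g\n", b[0], N - 1, b[N - 1]);
    return 0;
}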

>Heterogenous register file
>    Few compilers have been developed that can take advantage of a
>    truly heterogenous register file, one in which, for example, the
>    divide unit writes to registers D1..D2, the add unit writes to
>    registers A1..A16, the shift unit writes to registers S1..S4 -- 
>    even though such hardware conceivably has a cycle time advantage
>    over homogenous registers, even on VLIW machines where data can
>    easily be moved to generic registers when necessary.
>    DIFFICULTY: hard. 

At first glance, the problem seems susceptible to coloring.
Perhaps I'm missing something.
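
To sketch what I mean (a toy, not a production allocator): give each
virtual register a mask of the physical registers its defining unit may
write, and let coloring choose only from that mask.

/* Greedy coloring with register classes.  A1..A4 belong to the adder,
 * S1..S2 to the shifter; the heterogeneity only shrinks each node's
 * candidate set of colors.  Interference graph and classes are made up. */
#include <stdio.h>

#define NVREG 5
#define NPREG 6                          /* physical: A1 A2 A3 A4 S1 S2 */

static const char *pname[NPREG] = { "A1", "A2", "A3", "A4", "S1", "S2" };

int main(void)
{
    /* class mask: which physical registers each virtual register may use */
    unsigned allowed[NVREG] = {
        0x0F, 0x0F, 0x0F,                /* v0..v2: adder class (A1..A4)   */
        0x30, 0x30                       /* v3..v4: shifter class (S1..S2) */
    };
    /* interference[i][j] != 0: vi and vj are live at the same time */
    int interference[NVREG][NVREG] = {
        {0,1,1,1,0}, {1,0,1,0,1}, {1,1,0,0,0}, {1,0,0,0,1}, {0,1,0,1,0}
    };
    int color[NVREG], v, p, j;

    for (v = 0; v < NVREG; v++) {
        unsigned busy = 0;
        for (j = 0; j < v; j++)          /* colors already used by neighbors */
            if (interference[v][j] && color[j] >= 0)
                busy |= 1u << color[j];
        color[v] = -1;
        for (p = 0; p < NPREG; p++)      /* first legal, free color */
            if ((allowed[v] & (1u << p)) && !(busy & (1u << p))) {
                color[v] = p;
                break;
            }
        printf("v%d -> %s\n", v, color[v] >= 0 ? pname[color[v]] : "spill");
    }
    return 0;
}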


>Data cache - software managed consistency
>    This reportedly has been done, but certainly isn't run-of-the-mill.
>    DIFFICULTY: needs a skilled compiler expert.

At Rice (and other places), people are considering the perhaps
related problems of trying to manage cache usage for a single
processor.  I'm personally turned on by the topic because of the
big performance gains possible and the possible impact on
architecture.  Questions like: Can we get away with no D-cache?
Perhaps we don't need cache for FP only?
Can we get away with only direct mapped cache?  What does a compiler
do with set associativity?  How can we do prefetches to cache?

Porterfield did a thesis here that talks some about these questions.
Additionally, Callahan and Porterfield (both at Tera) have a paper
in Supercomputing 90 on (perhaps) similar topics.
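
As a small illustration of the prefetch question -- PREFETCH below is a
placeholder macro, not a real primitive on any particular machine -- a
compiler might rewrite a loop to touch data a few iterations ahead:

/* Compiler-inserted prefetching, shown at the source level.  The macro
 * expands to nothing here; a real compiler would emit whatever cache-touch
 * instruction the target provides. */
#include <stdio.h>

#define N     1024
#define AHEAD 8                          /* tune to memory latency */
#define PREFETCH(addr) ((void)(addr))    /* placeholder, does nothing */

int main(void)
{
    static double a[N + AHEAD], b[N + AHEAD];
    double sum = 0.0;
    int i;

    for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

    for (i = 0; i < N; i++) {
        PREFETCH(&a[i + AHEAD]);         /* request data AHEAD iterations early */
        PREFETCH(&b[i + AHEAD]);
        sum += a[i] * b[i];
    }
    printf("sum = %g\n", sum);
    return 0;
}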


>Multiple functional units - heterogenous - VLIW or superscalar
>    DIFFICULTY: complex.
>Multiple functional units - homogenous - VLIW or superscalar
>    DIFFICULTY: moderately complex
>    	Easier than the heterogenous case, and the packing algorithms
>    	are considerably easier.

I had never thought to distinguish the two cases and
I'm not sure why the scheduling algorithms should be much different.


>Special hardware instructions - scalar
>    Taking advantage of simple instructions like abs(), conditional 
>    exchange, etc.
>    DIFFICULTY:
>    	(1) When treated not as a compiler problem, but as a problem of simply
>    	    writing libraries to inline optimized machine code, EASY
>    	    Requires inlining support.

For intrinsics, I follow the PL.8 example.
That is, have intermediate language instructions
for ABS etc. so the optimizer can try and hoist them or perhaps strength
reduce them (e.g. SIN).  Then expand to a simple form (perhaps with branches
and so forth), and let the optimizer get at the guts of each operation.
Some like ABS might be available as basic instructions and so need not
be expanded to a lower level form.  This seems to require that the 
front-end recognize certain calls as intrinsics.  Naturally, this
works fine with Fortran, but compilers for other languages could
easily adopt the same approach; they probably already have for C.

This isn't wonderfully extensible, but people have worked on
variations that might be worth exploring.  In particular,
the Experimental Compiler System (ECS) project at IBM hoped to
achieve the same effect in a more extensible fashion.
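
To make the two levels concrete -- invented names and structure, not
PL.8's actual IR -- a lowering step for ABS might look roughly like this:

/* ABS exists as an IR operation so the optimizer can hoist it; at lowering
 * time it either maps straight onto a hardware abs instruction or is
 * expanded into compare/negate/select form the optimizer can still chew on. */
#include <stdio.h>

enum irop { IR_ABS, IR_CMPLT0, IR_NEG, IR_SELECT };

struct ir { enum irop op; int dst, a, b; };

#define HAVE_ABS_INSTRUCTION 0           /* pretend the target lacks abs */

static struct ir code[8];
static int ncode;

static void emit(enum irop op, int dst, int a, int b)
{
    code[ncode].op = op;  code[ncode].dst = dst;
    code[ncode].a  = a;   code[ncode].b   = b;
    ncode++;
}

/* Lower "rDST = ABS(rSRC)"; registers 0 and 1 serve as scratch here. */
static void lower_abs(int dst, int src)
{
    if (HAVE_ABS_INSTRUCTION) {
        emit(IR_ABS, dst, src, 0);
    } else {
        emit(IR_CMPLT0, 0, src, 0);      /* r0   = (rSRC < 0)     */
        emit(IR_NEG,    1, src, 0);      /* r1   = -rSRC          */
        emit(IR_SELECT, dst, 1, src);    /* rDST = r0 ? r1 : rSRC */
    }
}

int main(void)
{
    static const char *name[] = { "ABS", "CMPLT0", "NEG", "SELECT" };
    int i;

    lower_abs(2, 3);                     /* r2 = ABS(r3) */
    for (i = 0; i < ncode; i++)          /* raw fields; meaning depends on op */
        printf("%-7s %d, %d, %d\n",
               name[code[i].op], code[i].dst, code[i].a, code[i].b);
    return 0;
}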

-- 
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

aglew@crhc.uiuc.edu (Andy Glew) (10/12/90)

[Copied from comp.arch -John]

>>Perhaps we can start
>>a discussion that will lead to a list of possible hardware
>>architectural enhancements that a compiler can/cannot take advantage of?
>
>I'll comment on a couple of features from the list

Thanks. Do you have any features of your own to add?


>>Register file - large (around 128 registers, or more)
>>    Most compilers do not get enough benefit from these to justify
>>    the extra hardware, or the slowed down register access.
>
>In the proceedings of Sigplan 90, there's a paper about how to chew
>lots of registers.
>
>	Improving Register Allocation for Subscripted Variables
>	Callahan, Carr, Kennedy
>
>I suggested the subtitle "How to use all of those FP registers"
>but nobody was impressed.  Also, there's a limit to how many registers
>you need, at least for scientific fortran.  It depends on the speed
>of memory and cache, speed of the FPU, and the actual applications.
>The idea is that once the FPU is running at full speed,
>more registers are wasted.

I'm pretty sure that being able to put elements of aggregates into
registers is the next big step - subscripted variables and structure
fields.

However, if it's only just now being published, that probably means
it's leading edge, so a company should not design for this unless it
has guys in its compiler group actually working on it.
    Any comment on this way of looking at it?


>>Heterogenous register file
>>    Few compilers have been developed that can take advantage of a
>>    truly heterogenous register file, one in which, for example, the
>>    divide unit writes to registers D1..D2, the add unit writes to
>>    registers A1..A16, the shift unit writes to registers S1..S4 -- 
>>    even though such hardware conceivably has a cycle time advantage
>>    over homogenous registers, even on VLIW machines where data can
>>    easily be moved to generic registers when necessary.
>>    DIFFICULTY: hard. 
>
>At first glance, the problem seems susceptible to coloring.
>Perhaps I'm missing something.

I agree with you --- I really don't understand why heterogenous
register files are so hard to handle.  But homogenous register files
are one of the things compiler people have rhapsodized about with
respect to RISC.

Here's one example: the Intel 80x86 is basically a heterogenous
register file machine. Specific registers were tied to the outputs and
inputs of specific functional units in the original hardware.
Compiler people hated targeting this architecture, and there are very
few compilers that can produce machine code comparable to hand-coded
assembly on this architecture.

But heterogenous register files are much easier to make fast.  I think
that two things can be done to make them easier for the compilers to
use:
    (1) Provide more than one register at the output of any specific 
    	functional unit. This avoids the need to immediately move 
    	the result away.
    	(if my spelling is bad it's because my keyboard just went flakey)
    (2) Make the register file heterogenous for writes, but homogenous
    	for reads (it's easier to provide read multiporting than
    	write multiporting).




>>Data cache - software managed consistency
>>    This reportedly has been done, but certainly isn't run-of-the-mill.
>>    DIFFICULTY: needs a skilled compiler expert.
>
>At Rice (and other places), people are considering the perhaps
>related problems of trying to manage cache usage for a single
>processor.  I'm personally turned on by the topic because of
>big performance gains possible and the possible impact on 
>architecture.  Questions like: Can we get away with no D-cache?
>Perhaps we don't need cache for FP only?
>Can we get away with only direct mapped cache?  What does a compiler
>do with set associativity?  How can we do prefetches to cache?
>
>Porterfield did a thesis here that talks some about these questions.
>Additionally, Callahan and Porterfield (both at Tera) have a paper
>in Supercomputing 90 on (perhaps) similar topics.

Again - it's current research, so a company should not "bet the farm"
on this sort of thing, unless it has people actively working in the field.
A company should not assume that it can just go out and buy this sort
of compiler technology, or that it can hire a generic CS grad to 
implement this sort of optimization within a year.


>>Multiple functional units - heterogenous - VLIW or superscalar
>>    DIFFICULTY: complex.
>>Multiple functional units - homogenous - VLIW or superscalar
>>    DIFFICULTY: moderately complex
>>    	Easier than the heterogenous case, and the packing algorithms
>>    	are considerably easier.
>
>I had never thought to distinguish the two cases and
>I'm not sure why the scheduling algorithms should be much different.

Neither am I. I just put this up because it seemed to be what Gillies was 
referring to.



>>Special hardware instructions - scalar
>>    Taking advantage of simple instructions like abs(), conditional 
>>    exchange, etc.
>>    DIFFICULTY:
>>    	(1) When treated not as a compiler problem, but as a problem of simply
>>    	    writing libraries to inline optimized machine code, EASY
>>    	    Requires inlining support.
>
>For intrinsics, I follow the PL.8 example.
>That is, have intermediate language instructions
>for ABS etc. so the optimizer can try and hoist them or perhaps strength
>reduce them (e.g. SIN).  Then expand to a simple form (perhaps with branches
>and so forth), and let the optimizer get at the guts of each operation.
>Some like ABS might be available as basic instructions and so need not
>be expanded to a lower level form.  This seems to require that the 
>front-end recognize certain calls as intrinsics.  Naturally, this
>works fine with Fortran, but compilers for other languages could
>easily adopt the same approach; they probably already have for C.
>
>This isn't wonderfully extensible, but people have worked on
>variations that might be worth exploring.  In particular,
>the Experimental Compiler System (ECS) project at IBM hoped to
>achieve the same effect in a more extensible fashion.

I'm a fan of GNU EMACS style inlining.  I don't think that your
intermediate language should even attempt to represent all of the
special hardware instructions that are likely to be useful - and it is
very frustrating when you need an operation that nobody thought of.
PS. comp.compilers has been talking about this recently.

--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

spot@TR4.GP.CS.CMU.EDU (Scott Draves) (10/12/90)

[Copied from comp.arch -John]

|> >>Register file - large (around 128 registers, or more)
|> >>    Most compilers do not get enough benefit from these to justify
|> >>    the extra hardware, or the slowed down register access.
|> >
|> >In the proceedings of Sigplan 90, there's a paper about how to chew
|> >lots of registers.
|> >
|> >	Improving Register Allocation for Subscripted Variables
|> >	Callahan, Carr, Kennedy
|> >
|> >I suggested the subtitle "How to use all of those FP registers"
|> >but nobody was impressed.  Also, there's a limit to how many registers
|> >you need, at least for scientific fortran.  It depends on the speed
|> >of memory and cache, speed of the FPU, and the actual applications.
|> >The idea is that once the FPU is running at full speed,
|> >more registers are wasted.
|> 

Shouldn't loop unrolling burn lots of registers also?  When unrolling,
which ceiling will you hit first, the number of registers or the size
of the I-cache?
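
For concreteness: unrolling a reduction by four keeps four partial sums,
plus the loaded operands, live at once -- roughly four times the
registers the rolled loop needed.  (Made-up example.)

/* Dot product unrolled by 4 with independent partial sums; s0..s3 and the
 * operands of the four multiplies are all live inside the loop body. */
#include <stdio.h>

#define N 1024                           /* assume N is a multiple of 4 */

int main(void)
{
    static double a[N], b[N];
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int i;

    for (i = 0; i < N; i++) { a[i] = 1.0; b[i] = 0.5; }

    for (i = 0; i < N; i += 4) {         /* unrolled by 4 */
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    printf("dot = %g\n", s0 + s1 + s2 + s3);
    return 0;
}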

Scott Draves
spot@cs.cmu.edu
[A friend at Multiflow told me that the large register file was crucial to
get serious unrolling speedups.  There was some rather tricky work balancing
unrolling vs. avoiding register saves by using otherwise unused registers in
leaf routines. -John]
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

golds@fjcnet.GOV (Rich Goldschmidt) (10/13/90)

[Copied from comp.arch -John]

Maybe this is naive or too futuristic, but is anyone working towards
methods for automatically generating a compiler based on the architecture
design?  It would seem that even before there is silicon, there is
enough info about a new CPU in the CAD system used for layout that the
features of special interest for generating a compiler are known.  To
the extent that generating a compiler is rule based (those tasks a
good CS grad can do?), why hasn't it been automated?  Or are there people
working on this now?  Chip designers might even take compilers into 
consideration in their designs :-).
-- 
Rich Goldschmidt: uunet!fjcp60!golds or golds@fjc.gov
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

ctl8588@rigel.tamu.edu (LAUGHLIN, CHET) (10/15/90)

In article <1990Oct12.230424.930@esegue.segue.boston.ma.us>, golds@fjcnet.GOV (Rich Goldschmidt) writes...
>Maybe this is naive or too futuristic, but is anyone working towards
>methods for automatically generating a compiler based on the architecture
>design?  It would seem that even before there is silicon, there is

I would suggest "Computer Architecture: A Quantitative Approach" by John L.
Hennessy and David A. Patterson, Morgan Kaufmann Publishers, 1990.  They
speak at several points about how HLL compilers affect system performance.
In addition, they mention some common pitfalls designers tend not to think
about until after the fact, like KISS for instruction sets, because
compiler writers work from a simple principle: "make the common case fast
and the rest correct."  (Of course, they are biased
by background...but the book has many good points)

Chet Laughlin                  CTL8588@RIGEL.TAMU.EDU (128.194.4.4)
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

jourdan@minos.inria.fr (Martin Jourdan) (10/15/90)

In article <1990Oct12.230424.930@esegue.segue.boston.ma.us>,
golds@fjcnet.GOV (Rich Goldschmidt) writes:
=> Maybe this is naive or too futuristic, but is anyone working towards
=> methods for automatically generating a compiler based on the architecture
=> design?

There already has been a HUGE amount of work, and work is still going on, on
``automatic'' generation of *parts of* compilers from architectural
descriptions.  The most obvious example is the work on code generation
(instruction selection and register allocation), for which there now exist
usable systems, e.g. TWIG, BEG and Pagode (non-exhaustive list).  Pagode for
instance (I know it quite well because it is being developed by colleagues
of mine at INRIA) takes as input a grammar describing the input IR trees and
a description of the machine which can be very easily deduced from the
``programmer's manual'' provided by the CPU manufacturer.  The basic
entities in a machine description are storage bases (registers, memory,
etc.), storage classes (groups thereof), access modes, access classes and
instructions.  The semantics of the instructions is given in terms of the IR
grammar. From this Pagode generates an instruction selector and a register
allocator.
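
To give a much simplified, invented flavor of what such a generated back
end works from -- real systems like Pagode, TWIG and BEG use nonterminals,
costs and dynamic programming rather than exact matching -- think of a
table of rules mapping IR tree patterns to instruction templates:

/* A rule table derived (by hand, here) from a machine description, plus a
 * trivial matcher over linearized IR trees.  Toy code, not Pagode. */
#include <stdio.h>
#include <string.h>

struct rule {
    const char *pattern;                 /* linearized IR tree      */
    const char *instruction;             /* target instruction text */
};

static const struct rule rules[] = {
    { "(add reg reg)",           "add  rd, rs1, rs2" },
    { "(add reg const)",         "addi rd, rs1, #k"  },
    { "(load (add reg const))",  "ld   rd, k(rs1)"   },
};

static const char *select_insn(const char *ir)
{
    size_t i;
    for (i = 0; i < sizeof rules / sizeof rules[0]; i++)
        if (strcmp(ir, rules[i].pattern) == 0)
            return rules[i].instruction;
    return "<no rule matches>";
}

int main(void)
{
    printf("%s\n", select_insn("(load (add reg const))"));
    printf("%s\n", select_insn("(add reg reg)"));
    return 0;
}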

Work is also going on in other areas of compiler construction.  For
instance, Francois Bodin showed in his thesis (``Optimisation de microcode
pour une architecture horizontale et synchrone: etude et mise en oeuvre d'un
compilateur'', U. Rennes, France, June 1989) that it was possible to use a
finer description of the machine, exhibiting timing and other ``internal''
information, to automatically generate instruction schedulers.  Davidson and
Fraser are well known for their work on the automatic construction of
peephole optimizers from machine descriptions.

Automatic generation of other compiler phases is probably also possible but
you then have to take into account source-language issues in addition to
target-machine ones.

=> [...]  Chip designers might even take compilers into 
=> consideration in their designs :-).

This is obviously desirable, and I believe this was the main motivation
to start this discussion thread.

Martin Jourdan <jourdan@minos.inria.fr>, INRIA, Rocquencourt, France.
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

anders@dit.lth.se (Anders Ardo) (10/16/90)

> In article <1990Oct12.230424.930@esegue.segue.boston.ma.us>,
> golds@fjcnet.GOV (Rich Goldschmidt) writes:
> => Maybe this is naive or too futuristic, but is anyone working towards
> => methods for automatically generating a compiler based on the architecture
> => design?
> 
> There already has been a HUGE amount of work, [including]
> TWIG, BEG and Pagode (non-exhaustive list).  ...

Could you please give some references to these systems.

> => [...]  Chip designers might even take compilers into
> => consideration in their designs :-).

I fully agree.
We have just embarked on a project which aims at integrating a VLSI design
environment with compiler generation tools.  This will enable us to rapidly
generate and test complete systems (including applications) on either
simulated or real hardware, and thus to efficiently evaluate different
design tradeoffs between hardware and software.
-- 
 Anders Ardo                       Tel: int+46 46 107522
 Dept. of Computer Engineering     fax: int+46 46 104714
 University of Lund, P.O. Box 118  Internet: anders@dit.lth.se
 S-221 00  Lund, Sweden               or     anders%dit.lth.se@uunet.uu.net
           

-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

hankd@dynamo.ecn.purdue.edu (Hank Dietz) (10/16/90)

In article <1990Oct12.024252.8361@esegue.segue.boston.ma.us> aglew@crhc.uiuc.edu (Andy Glew) writes:
>>[gillies@m.cs.uiuc.edu]
>>One of the problems in new CPU designs is that designers don't realize
>>which architecture enhancements are pointless, because we don't and
>>may never have the optimization technology to take advantage of them.
>
>Ah, a potentially interesting and useful topic.  Perhaps we can start
>a discussion that will lead to a list of possible hardware
>architectural enhancements that a compiler can/cannot take advantage
>of?  Maybe the real compiler guys (preston, David C.) will help us out?
...
>For example, many of the parallelizing FORTRAN loop optimizations can
>reasonably only be expected to be done by people from Rice or UIUC CSRD (note
>that I'm UIUC, but CRHC, not CSRD), so unless you are willing to pay for
>these guys (or their assorted front companies) you aren't likely to get them
>onto your machine.

While it is true that the group of researchers in automatic parallelization
is small, it certainly isn't limited to UIUC CSRD and Rice; there are also
substantial efforts at IBM, Intel, UC Irvine, CMU, etc.  For example, Purdue
faculty in this field include Jose Fortes and me -- both of us have been
publishing in this field for more than five years (well over 50
publications) and we have implemented working parallelizers for subsets of
the C language targeted to several different architectures.  The point is
that, although compiler experts are in demand, it simply isn't true that
there are only one or two places that know how to do things.

Further, at Purdue EE, I teach a graduate course on compiler code
generation, optimization, and parallelization.  In the course, *EVERY*
student implements an optimizing, parallelizing, compiler for a small
language and targets it to a simple parallel abstract machine -- usually a
VLIW.  I'm not saying that one course makes them experts, but the students
from that course are virtually all compiler-literate to the point where at
least relatively mundane things like traditional dependence analysis and
vectorization are well within their grasp.  Students complete that course at
a rate of about 15/year.

>Cost of compiler development can be significant.  Sometimes a company might
>put a hardware feature in even though they know a compiler approach would be
>better, because they can't afford the compiler guys (or the compiler guys
>already have exclusive contracts with a competitor).

In my view, this is 99% false.  Companies tend to put the money into
hardware because it is more concrete and they also are used to putting money
into hardware.  For example, one of my former students works at Motorola as
a compiler person -- but he's one of a very few compared to *MANY*
architecture/hardware folk.  In fact, he also has an architecture background
and without it he probably wouldn't have been given the job.  Companies have
to learn that creating a compiler is comparable in difficulty to creating
an architecture; the tendency is to give it less weight, resulting in
overworked compiler people and delays in completing the compilers.  A
secondary issue is that designing one without deeply considering the other
just plain doesn't work, and there are few people who are experts in BOTH
compiler and architecture to act as the interface between the two groups.

In contrast, consider a company like Burton Smith's Tera.  Burton knows what
he's doing -- he has tried very hard to make his company have a balance of
compiler, architecture/hardware, and OS people.  Did he have trouble getting
these people?  Perhaps a bit -- good OS folk are particularly hard to find
in these "well, let's just port unix" times -- but generally I'd say he had
less trouble than most companies would have because it is clear that he
values these people at least as much as he values architecture/hardware
types.

>Let me list a few things to start off the discussion.  I hope and expect to
>be shot down on a few of them.  It's easy to list a few of the hardware
>enhancements that we already know compilers can take advantage of.

Wrong viewpoint or, as a certain public figure used to say, "well, there you
go again."  You're trying to give architecture/hardware people a list of
"you can feel safe doing this without consulting a compiler person" things
-- the trick is to involve compiler people throughout rather than letting
the machine be built and then calling in the compiler folk (and giving them
H*ll because the compiler hasn't been able to achieve the machine's peak
performance and wasn't delivered on time).

For some years now, I've had a research group (about 2-3 faculty and
10-20 students) called CARP:  Compiler-oriented Architecture Research
at Purdue.  A one paragraph version of our manifesto:

"Research in compiler optimization/parallelization and hardware architecture
is, and should be, tightly interwoven.  CARP, the Compiler-oriented
Architecture Research group at Purdue, centers on the innovative use of the
interaction of compiler and architecture to increase system performance.  In
general, this is accomplished by blending STATIC (compile-time,
assemble-time, or link-time) and DYNAMIC (runtime hardware, firmware, or
operating system) analysis so that each computational atom is processed in
the most efficient and reliable way.  Statically, it is possible to
understand/transform the entire program, yet only probabilistic knowledge is
available (e.g., one can know branching probabilities, but not which way the
branch goes this time).  Dynamically, understanding/transformability is
limited to a few instructions around the current program counter, but
perfect knowledge within that range is common.  Very few problems can be
solved equally well using either kind of information -- the trick is simply
to solve each problem in the right place."


>Branch Delay Slots - small number
>Branch Delay slots - large number
>Register file - moderate sized (up to 32 registers)
>Register file - large (around 128 registers, or more)
>Separate floating point register file
>Heterogenous register file
>Instruction cache
>Micro-scheduling parallelism (like CONVEX's ASAP)
>Vectorizing
>Multiple functional units - heterogenous - VLIW or superscalar
>Multiple functional units - homogenous - VLIW or superscalar

All old ideas with multiple viable approaches in the literature.  This
is not to say they are done perfectly, but that's not the issue in
making a product....  Unfortunately, a few are not easy to automate in
"generic" code generators (e.g., heterogeneous register file).

>Multiple, hierarchical, registers sets
>Data cache - software managed consistency
>Parallelizing - fine grain, small numbers of processors
>Parallelizing, fine grain, large numbers of processors.
>Special hardware instructions - scalar

These are problems with no readily available "cookbook" solutions.
That doesn't necessarily mean they'd be hard for a compiler to deal
with, just that it will take a bit of head scratching....

Of course, I still contend that the above list is headed the wrong way -- we
should be looking for new ideas being synthesized by viewing both compiler
and architecture/hardware.  For example, the Barrier MIMD work (see papers
in ICPP 90) could only have come from such a holistic view.

						-hankd@ecn.purdue.edu
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

preston@titan.rice.edu (Preston Briggs) (10/17/90)

In article <1990Oct16.085249.13511@lth.se> anders@dit.lth.se (Anders Ardo) writes:

>> There already has been a HUGE amount of work, [including]
>> TWIG, BEG and Pagode (non-exhaustive list).  ...

>Could you please give some references to these systems.

The basis for TWIG is described in

	Efficient Tree Pattern Matching:
		an Aid to Code Generation
	A. Aho and M. Ganapathi
	POPL 12, 1985

The TWIG reference manual (I've never seen this one) is

	Twig reference manual
	S. Tjiang
	AT&T Bell Labs, internal memorandum
	January 1986

An interesting application is

	Anatomy of a Hardware Compiler
	K. Keutzer and W. Wolf
	Sigplan 88 Conference on Programming Language
	Design and Implementation
	in Sigplan Notices, July 1988

BEG is described in

	BEG - A generator for efficient back ends
	H. Emmelmann, F. Schröer, and R. Landwehr
	Sigplan 89 Conference on Programming Language Design &c.
	Sigplan Notices, July 1989

-- 
Preston Briggs
preston@titan.rice.edu
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

anders@dit.lth.se (Anders Ardo) (10/17/90)

Has anyone put together a bibliography of papers on code generation
issues and methods like those mentioned here, e.g.:

>Branch Delay Slots ...
>Register file ...
>Multiple functional units ...
>[etc.]

and/or papers on the automation of these tasks in generated code generators?

           Anders


-- 
 Anders Ardo                       Tel: int+46 46 107522
 Dept. of Computer Engineering     fax: int+46 46 104714
 University of Lund, P.O. Box 118  Internet: anders@dit.lth.se
 S-221 00  Lund, Sweden               or     anders%dit.lth.se@uunet.uu.net
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

larus@primost.cs.wisc.edu (James Larus) (10/17/90)

Having implemented a few `state of the art' compiler algorithms (and a few
beyond the state of the art), I can say that most competent graduate
students could code them, but most companies wouldn't implement them.  The
key problem is that most papers present enough details, but don't present
enough evidence that a technique is effective.  A rational project manager
would not commit the resources to try out these algorithms.  There are
exceptions, of course, but compiler papers frequently use Towers of Hanoi
or puzzle-sized benchmarks.

/Jim
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

hankd@ecn.purdue.edu (Hank Dietz) (10/18/90)

In article <1990Oct12.231125.1275@esegue.segue.boston.ma.us> aglew@crhc.uiuc.edu (Andy Glew) writes:
>>In the proceedings of Sigplan 90, there's a paper about how to chew
>>lots of registers.
>>
>>	Improving Register Allocation for Subscripted Variables
>>	Callahan, Carr, Kennedy
>>
>>I suggested the subtitle "How to use all of those FP registers" but nobody
>>was impressed.  Also, there's a limit to how many registers you need, at
>>least for scientific fortran.  It depends on the speed of memory and cache,
>>speed of the FPU, and the actual applications.  The idea is that once the
>>FPU is running at full speed, more registers are wasted.
>
>I'm pretty sure that being able to put elements of aggregates into
>registers is the next big step - subscripted variables and structure fields.

It isn't hard to use lots of registers...  the trick is to benefit
from doing so.  Fetching from outside the processor is harmless
provided that it never slows the system down, and that point often can
be reached with a surprisingly small number of registers.

Unrolling and better alias analysis are two good approaches to making
more registers useful -- there is also the hardware fix called CRegs
(see the paper in Supercomputing 1988), which allows aliased values to
be kept in registers (really CRegs).

...
>>>Heterogenous register file
...
>>At first glance, the problem seems susceptible to coloring.
>>Perhaps I'm missing something.
>
>I agree with you --- I really don't understand why heterogenous
>register files are so hard to handle.  But homogenous register files
>are one thing that compiler people have gone into rhapsodies wrt. RISC
>about.
>
>Here's one example: the Intel 80x86 is basically a heterogenous
>register file machine.

Get a copy of the Compiler Writer's Guide.  I have one for the 286.
It presents a very reasonable scheme for using coloring with multiple
classes of registers (as in a 286).  In fact, I think it is one of the
most understandable descriptions of the coloring technique in general.

						-hankd@ecn.purdue.edu
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

jourdan@minos.inria.fr (10/19/90)

In addition to the references on TWIG given by Preston Briggs, let me add:
	Alfred V. Aho, Mahadevan Ganapathi and Steven W. K. Tjiang
	Code Generation Using Tree Matching and Dynamic Programming
	ACM Trans. on Progr. Lang. and Systems (TOPLAS)
	Vol. 11, No. 4 (Oct. 1989), pp. 491-516

The reference on BEG given by Preston is, as far as I know (and I know
them quite well), the only published one.

References on Pagode:
	Annie Despland, Monique Mazaud and Raymond Rakotozafy
	Using Rewriting Techniques to Produce Code Generators
		and Proving them Correct
	to appear quite soon in Science of Computer Programming
	(report RR-1046, INRIA, Rocquencourt, June 1989)

	Annie Despland, Monique Mazaud and Raymond Rakotozafy
	PAGODE: A back end generator using attribute abstract syntaxes 
		and term rewritings
	to appear in the proceedings of Compiler Compilers '90,
	to be held next week in Schwerin, Germany

	Annie Despland, Monique Mazaud and Raymond Rakotozafy
	Code generator generation based on template-driven target 
		term rewriting
	Proceedings of Rewriting Techniques and Applications,
	Bordeaux, France, May 1987, LNCS 256, pp 105-120.

	Annie Despland, Monique Mazaud and Raymond Rakotozafy
	An implementation of retargetable code generators in Prolog
	Proceedings of the International Workshop on Programming Language
		Implementation and Logic Programming (PLILP)
	Orleans, France, May 1988, LNCS 348, pp 81-104.

Martin Jourdan <jourdan@minos.inria.fr>, INRIA, Rocquencourt, France.
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.

worley@compass.com (Dale Worley) (10/24/90)

   Ah, a potentially interesting and useful topic.  Perhaps we can start
   a discussion that will lead to a list of possible hardware
   architectural enhancements that a compiler can/cannot take advantage
   of?

Part of the trouble with this idea is that "what hardware can be taken
advantage of" changes over time.  For instance, I believe that the Cray-1 was
out for a number of years before people had developed good vectorizing
compilers -- since vector hardware hadn't existed, there was no incentive to
figure out how to build a compiler that took advantage of it!  This leads to
a chicken-and-egg problem -- the compiler doesn't exist because no hardware
needs it, and vice-versa.  The correct solution was mentioned by somebody --
develop both in parallel and tune the combination of the two for best price &
performance.  Of course, it means you have to fund both a hardware effort and
a software effort!

Dale Worley		Compass, Inc.			worley@compass.com
-- 
Send compilers articles to compilers@esegue.segue.boston.ma.us
{ima | spdcc | world}!esegue.  Meta-mail to compilers-request@esegue.