[comp.arch] Instruction Scheduling

bron@olympus.SGI.COM (Bron C. Nelson) (03/11/88)

In article <477@imagine.PAWL.RPI.EDU>, jesup@pawl23.pawl.rpi.edu (Randell E. Jesup) writes:
> I think these issues have been beaten to death.  Anyone want a new subject,
> like reorganizing algorithms? (my specialty :-)  Is there anyone else out
> there that does linking before reorganization?  Or given thought to doing
> optimal (instead of near-optimal) reorganization in small code blocks?
> 

I think this is a fine idea.  Instruction scheduling/reorganization
is a neat field that still has lots of room for good ideas.  Also,
it is destined to become an increasingly "hot" topic as memory
latencies increase and asynchronous functional units become more
and more common even in the micro world (the MIPS R2000 for example
has autonomous integer multiply and divide hardware that can
run overlapped with other "standard" ALU ops, and of course the
floating point co-processor runs autonomously as well).

This isn't really an architectural issue, so probably doesn't
really belong in comp.arch.  However, it does have interesting
architectural impact.  (For example, a machine similar to the
Multiflow VLIW (tm I think) machine could probably have been
BUILT years ago, but without something like a trace-scheduler,
the machine would be worthless.)  Also, there doesn't really
seem to be a good place to discuss this except perhaps comp.misc
(or comp.compilers, but I'd rather avoid moderation :-) at least
to start).  Ergo, I will post my first few articles on
this topic to comp.arch, cross-posted to comp.misc.  If enough
comp.arch people complain, we'll use comp.misc exclusively.

Bron Nelson  bron@sgi.com
Silicon Graphics Inc.
Don't blame my employers for my opinions

bron@olympus.SGI.COM (Bron C. Nelson) (03/11/88)

Recently jesup suggested we start talking about something else
and proposed reorganization algorithms as a topic.  I think this
is a fine idea.  Instruction scheduling/reorganization is a neat
field that still has lots of room for good ideas.  Also, it
is destined to become an increasingly "hot" topic as memory
latencies increase and asynchronous functional units become more
and more common even in the micro world (the MIPS R2000 for example
has autonomous integer multiply and divide hardware that can
run overlapped with other "standard" ALU ops, and of course the
floating point co-processor runs autonomously as well).

This isn't really an architectural issue, so probably doesn't
really belong in comp.arch.  However, it does have interesting
architectural impact.  (For example, a machine similar to the
Multiflow VLIW (tm I think) machine could probably have been
BUILT years ago, but without something like a trace-scheduler,
the machine would be worthless.)  Also, there doesn't really
seem to be a good place to discuss this except perhaps comp.misc
(or comp.compilers, but I'd rather avoid moderation :-) at least
to start).  Ergo, I will post my first few articles on
this topic to comp.misc, cross-posted to comp.arch.  If enough
comp.arch people complain, I'll cease cross-posting.

Bron Nelson  bron@sgi.com
Silicon Graphics Inc.
Don't blame my employers for my opinions

bron@olympus.SGI.COM (Bron C. Nelson) (03/11/88)

(Hummm... I see I somehow managed to stupidly post BOTH versions
of my introduction article.)


In an attempt to deflect the endless bickering over the RPM40
from comp.arch, I am taking a cue from Randell Jesup and am
trying to start a discussion of Instruction Scheduling.  Please
contribute your own experiences.

Let's start things off with the best of my scanty bibliography.
I strongly encourage people to suggest other relevant papers
(e-mail to me and I'll summarize, or post directly if you're so
inclined).



Thomas Gross
"Code Optimization of Pipeline Constraints"
Stanford University
Computer Systems Laboratory  Technical Report No. 83-255
December 1983

Based on the author's PhD dissertation.  Describes an instruction
scheduler that (I believe) formed the basis of the one that is
used by MIPS Co.  Scheduling is done after register selection
(post-pass scheduling).



Phillip Gibbons and Steven Muchnick
"Efficient Instruction Scheduling for a Pipelined Architecture"
Proceedings of the 1986 SIGPLAN Compiler Construction Conference
pp. 11 - 16

Instruction scheduler for the HP Precision Architecture.  Seems
to be strongly based on Gross's work.



Wei-Chung Hsu
"Register Allocation and Code Scheduling for Load/Store Architectures"
1987 Ph.D thesis
University of Wisconsin, Madison

Discusses pros and cons of scheduling before or after register
allocation (pre-pass vs post-pass methods).  Gives examples and
good algorithms for both.  Talks about the desirability of doing
integrated scheduling and register allocation and gives simple
methods of keeping the one in mind while doing the other to help
improve both.



John Ellis
"Bulldog: A Compiler for VLIW Architectures"
1985 Ph.D thesis
Yale University
Also published in the 
ACM Doctoral Dissertation Award Series by MIT press.

Trace scheduling explained.  A lot of reading, but well worth it.
The idea that makes VLIW machines work.



Rajiv Gupta and Mary Soffa
"Region Scheduling"
Proceedings of the 2nd International Conference on Supercomputing
International Supercomputing Institute, Inc. 1987
Vol III, pp 141 - 148

Describes work-in-progress without any hard numbers, but this looks
like a very clever idea.  The authors claim superiority over trace
scheduling on several grounds.  The ideas seem more solidly based
than trace scheduling (which always struck me as a good ad-hoc idea
carried to extremes) as well as easier to understand.  If this lives
up to its promises, I would rate it as something to follow closely.
BTW - anyone know where I can contact the authors/get more info?



Well, that's it!  (except for old and/or scummy papers that aren't
worth reading).  Surely some of you Out There can suggest more.



Bron Nelson  bron@sgi.com
Don't blame my employers for my opinions

bron@olympus.SGI.COM (Bron C. Nelson) (03/12/88)

What you need to know in order to do decent instruction scheduling
is the idea of "dependence;" you need to know which instructions
MUST precede which other instructions, and which can be done
independently.  For ALU ops, the dependencies can be seen in the
register usage.  The real problem is loads/stores.  Without
some kind of analysis, you cannot move a load past a store, and
you cannot move a store past a load OR a store.  This can severely
inhibit the amount of re-ordering that can be done, particularly
if you only schedule a basic block at a time.
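
To make the constraint concrete, here is a minimal C fragment (my own
illustration, not from any particular compiler):

	int f(int *p, int *q, int x)
	{
	    *p = x;          /* store through p                    */
	    return *q + 1;   /* load through q: may not move above
	                        the store unless p != q is known   */
	}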

The question I want the answer to is: how much and what kind of
analysis is actually done by the various compilers Out There?

To do a really good job, you need the kind of data dependence
analysis that a vectorizing compiler does.  Cray Research, Multiflow,
and Ardent Computer are the only firms I know of for sure that generate
this kind of information and then use it when doing instruction
scheduling.  (I'd bet that IBM, CDC, NEC, etc. do it too, but I don't
know for sure).  One problem is that data dependence analysis is done
at a fairly high level, while instruction scheduling is a very low
level optimization.  Transmitting the information through the compiler
can be hard.  Question:  How detailed is the information passed
to the instruction scheduler?  Anyone at Cray/Multiflow/Ardent etc.
care to say?

A lot of things that occur in practice can be seen to be independent
without resorting to full data dependence analysis (disjoint global
arrays, scalars from the stack, etc.).  How much of this kind of analysis
is done by various compilers?  What sort of disambiguation has been
found most useful for scheduling?  Anyone at MIPSco, AMD, GE, Mot.,
Sun, etc.  care to enlighten us?
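
For instance, two cases that are easy to disambiguate without any deep
analysis (a sketch; the identifiers are mine):

	extern int a[100], b[100];  /* distinct named globals can
	                               never overlap               */

	int g(int i, int j)
	{
	    int t;              /* stack scalar whose address is never
	                           taken: no store through a pointer
	                           can touch it                       */
	    t = a[i];           /* the load from a[] and the store to
	                           b[] below are provably independent,
	                           so they may be freely reordered    */
	    b[j] = 0;
	    return t;
	}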

Studies done several years ago (sorry, don't have the references, but
I believe they are cited in Ellis's "Bulldog" thesis bibliography)
showed that in typical Fortran programs, a basic block only has about
a factor of 2 of parallelism in the operations.  For cpu's with only a
few 2 cycle instructions (e.g. load/branch like the original Berkeley
RISC and Stanford MIPS machines) this is probably enough; you could
typically find enough "local" parallelism to fill the slots most of
the time.  If many of your instructions take multiple cycles (e.g.
a Cray X-MP where even doing an address-length (24-bit) integer add
takes 2 cycles, and a load takes 14), you need to do more, or you'll
wait a lot.  For example, a measurement done at LLNL in July of
1985 showed that nearly **60%** of the cycles on their Cray X-MP48
were spent interlocked waiting for results of previous computations.
(This does NOT count time spent waiting after a branch, which is
probably about another 5-10%.)  Clearly, scheduling only within a
basic block was not enough; the recent Cray compilers will push
some operations (i.e. loads) across block boundaries.  VLIW machines
have the same sort of problem; the machine is capable of issuing
instructions faster than the (data dependent) computations can deliver
results.  VLIW machines also break block boundaries to find more
instructions to issue concurrently.

The whole point of the last paragraph is to say that the amount of
worry you put into this optimization is very dependent on the payoff
your particular hardware can get out of it (no surprise), and that
the payoff varies dramatically from machine to machine.  SO, the
question (finally!) is:  for YOUR particular architecture, how much
time is spent interlocked (or executing NO-OPs for those machines
without interlocks)?  Aggregate numbers and integer-only numbers are
fine, but particularly interesting would be numbers involving multi-
cycle instructions (e.g. floating point).

An aside: do the current "no hardware interlocks" cpus REALLY have
no interlocks, or are they just talking about the integer ALU ops?
Since I know something about the MIPSco chip(s), I'll use that as
an example: when you do an integer divide, do you really put in 35
no-ops, or is this "special cased"?  Does the f.p. co-processor
have interlocks?

This is already too long; I'll beat the subject of memory latency to
death sometime in the future.

---------------------------------------------------------------------
Bron Nelson  bron@sgi.com
Don't blame my employers for my opinions

jesup@pawl14.pawl.rpi.edu (Randell E. Jesup) (03/12/88)

In article <12678@sgi.SGI.COM> bron@olympus.SGI.COM (Bron C. Nelson) writes:
>  The real problem is loads/stores.  Without
>some kind of analysis, you cannot move a load past a store, and
>you cannot move a store past a load OR a store.  This can severely
>inhibit the amount of re-ordering that can be done, particularly
>if you only schedule a basic block at a time.

	Yup!  There are some simple cases, such as where you're loading/storing
with the same base register, and the register wasn't changed in between (in
particular, this helps local variable references.)  Maybe there should be
assembler instructions (not real ones) that contain some of this information
that the compiler/programmer has found out.  Example:

	(addr is offset(base))
	uldw reg,addr	- unaliased loadword (no references via other
			  registers/absolute will occur)
	ustw addr,reg	- unaliased storeword
	aldw reg,addr,addr[,addr...] - aliased loadword - references via later
			  addr are aliases of one being used.
	etc.

This is all in all pretty kludgey.  I like this better:

	.alias addr,addr,addr - informs back end (assembler/reorganizer) that
			  these addresses are aliased, otherwise it can assume
			  it's safe to move load/stores around (within usual
			  bounds.)

This will only be valid until any one of the address registers used is
changed, after which the reorg must assume full aliasing again until
told otherwise.  The net effect is lots of these near beginnings of
blocks.  But it does do the job (if the compiler can figure it out,
that is!)
			
>  Clearly, scheduling only within a
>basic block was not enough; the recent Cray compilers will push
>some operations (i.e. loads) across block boundaries.

	Global optimization is the wave of the future in high-speed pipelined
processors (IMHO).
	Actually, some pretty old compilers do this: it's called removing
loop invariants.  It is, of course, a special-case optimization, but it
does move things across block boundaries.  (And boy is it a pain when it
gets it wrong.  I had a Fortran subroutine with dummy(3) = dummy(3) in
it so it wouldn't pull it out of the loop (it was equivalenced.))
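
The textbook picture, in C rather than Fortran (my example):

	/* before: x*y is recomputed on every trip */
	for (i = 0; i < n; i++)
	    a[i] = x * y + i;

	/* after the invariant is hoisted across the loop boundary */
	t = x * y;
	for (i = 0; i < n; i++)
	    a[i] = t + i;

The hoist is only safe if nothing inside the loop can change x or y -
which is exactly what the equivalenced dummy argument above defeated.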

>The whole point of the last paragraph is to say that the amount of
>worry you put into this optimization is very dependent on the payoff
>your particular hardware can get out of it (no surprise), and that
>the payoff varies dramatically from machine to machine.

	The potential payoff is almost directly proportional to the length
of your load delays.  The faster CPUs get, given current memory, board, and
IC packaging technology, the longer these delays will get.

>An aside: do the current "no hardware interlocks" cpus REALLY have
>no interlocks, or are they just talking about the integer ALU ops?

	The rpm-40 has NO interlocks, nor would I envision any XP for it
having any either.  However, that doesn't mean SOME interlocks aren't a
good thing, especially in long load delay machines, IFF you can do it
without affecting cycle time (i.e. doesn't hit critical path).

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

root@mfci.UUCP (SuperUser) (03/13/88)

In article <12678@sgi.SGI.COM> bron@olympus.SGI.COM (Bron C. Nelson) writes:
]Question:  How detailed is the information passed
]to the instruction scheduler?  Anyone at Cray/Multiflow/Ardent etc.
]care to say?
I'll defer to one of our compiler people to answer this one (left this
in so you wouldn't think I was avoiding it, even though I am...)
]
].... Clearly, scheduling only within a
]basic block was not enough; the recent Cray compilers will push
]some operations (i.e. loads) across block boundaries.  VLIW machines
]have the same sort of problem; the machine is capable of issuing
]instructions faster than the (data dependent) computations can deliver
]results.  VLIW machines also break block boundaries to find more
]instructions to issue concurrently.
A minor nit:  it's the compiler, in our case a Trace Scheduling (tm)
compacting compiler (hope that keeps our legal guys happy) that breaks
the block boundaries;  the VLIW part makes it worthwhile.  And we're
usually careful to distinguish between instructions, which are the
aggregate total of bits which all come flying out of the icache at 
once, vs. packets, which are the individual control fields for the
various functional units.  I usually look at the machine as a parallel
collection of pipelines, each of different length, most
of which (especially the floating pipes) have a latency of at least
one instruction.
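
In C terms you can picture one of these instructions as a struct of
per-unit control fields (purely my sketch, not Multiflow's actual
format):

	/* one "instruction" fetched per cycle; each member is a
	   "packet" steering one functional unit */
	struct packet { unsigned op : 8, dst : 6, src1 : 6, src2 : 6; };
	struct vliw_insn {
	    struct packet ialu0, ialu1;   /* integer pipes   */
	    struct packet fadd, fmul;     /* floating pipes  */
	    struct packet mem;            /* load/store unit */
	    struct packet br;             /* branch unit     */
	};
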
]
]The whole point of the last paragraph is to say that the amount of
]worry you put into this optimization is very dependent on the payoff
]your particular hardware can get out of it (no surprise), and that
]the payoff varies dramatically from machine to machine.  SO, the
]question (finally!) is:  for YOUR particular architecture, how much
]time is spent interlocked (or executing NO-OPs for those machines
]without interlocks)?  Aggregate numbers and integer only numbers are
]fine, but particularly interesting would be numbers involving multi-
]cycle instructions (e.g. floating point).
I know you want numbers here, but I think a few more points should
be included.  First, the answers will vary greatly with the code you're
discussing.  I think you know you have a balanced machine if a wide
range of code causes the machine to limit in different places.  For
instance, one program may (in a hypothetical VLIW) cause the register
read ports to become the bottleneck, another may run out of memory
bandwidth, a third might require more floating point units,  a
fourth may do a lot of non-pipelined operations (divide), and a fifth
might do chained memory operations that can't be done in parallel 
(which means your cpu waits around for the memory latency a lot -- this
hypothetical VLIW has no data cache.)  Depending
on which limit is the one currently blocking performance, your answer
on the NOPs will vary a lot.  Perhaps if you pick a benchmark and
ask, we'll be able to compare results better.
]
]An aside: do the current "no hardware interlocks" cpus REALLY have
]no interlocks, or are they just talking about the integer ALU ops?
]Since I know something about the MIPSco chip(s), I'll use that as
]an example: when you do an integer divide, do you really put in 35
]no-ops, or is this "special cased"?  Does the f.p. co-processor
]have interlocks?
We have no hardware interlocks except for a bank stall resolver built
into the memory controllers.

]Bron Nelson  bron@sgi.com
I hope to have some reportable numbers on this stuff soon.

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090

ram%shukra@Sun.COM (Renu Raman, Taco-Bell Microsystems (A Bell Operating Co.)) (03/14/88)

In article <12512@sgi.SGI.COM> bron@olympus.SGI.COM (Bron C. Nelson) writes:
>This isn't really an architectural issue, so probably doesn't
>really belong in comp.arch.  However, it does have interesting

     [This IS an architectural issue: Expanding the realm of processors that
     you have mentioned, include vector processors and multiple functional
     unit processors.]

>architectural impact.  (For example, a machine similar to the
>Multiflow VLIW (tm I think) machine could probably have been
>BUILT years ago, but without something like a trace-scheduler,
>
>Bron Nelson  bron@sgi.com
>Don't blame my employers for my opinions

 neither mine.

 [Hi Bron] 

     Great!  I have been wanting to write about scoreboarding and
     along comes something.  OK.  Ardent's new machine, as well as Motorola's
     (probably announced) and Apollo's pre-announced machines, all have some HW
     dedicated to scoreboarding (in Ardent's case 1/3 of the vector control
     gate count is dedicated to scoreboarding.  Don't have any
     info on Apollo's & Motorola's scoreboarding yet).

     In the case of Ardent their scoreboarding is dedicated to the vector
     ops (correct me - the info that I have is very hazy) but in the case of
     Apollo & Moto, it is for multiple scalar FUs.

     HW support for dynamic scheduling of code first appeared in the 
     CDC 6600 (Thornton) and later on in the IBM 360/?? - specifically for the
     FP section.  (I think Tomasulo was involved or wrote about it).

     What is amazing is that in the Cray-1, which had multiple functional
     units, the issue scheme was at most one instruction per cycle (the
     gravest source of imbalance).  Weiss & Smith [1] (Astronautics Corp.)
     showed that by tagging the register file on the Cray to do the
     dependency checks, the performance could be improved by as much as
     43% (bare bones, without any compiler optimizations thru static
     re-organization) [no estimates on the likely increase in silicon
     area].  In any case Cray avoided any sort of
     scoreboarding!!!

     Some of the  + & - in dynamic scheduling are:

     (1/2)+ Compiler can be unburdened.

	  It does not make much sense, as in general you compile once and run
     the program a zillion times.  One can always turn off optimizations to
     go thru the edit-compile-debug cycle faster.

     (1/2)+ Performance is not tied to quality of compiler:

          Again not a real plus.  Maybe true in the PC world, where
     the difference between a $40 compiler and a $200 one matters.

     +/- Handling of dynamic dependencies [DS] that a compiler cannot: Valid case.
     For procedural languages (where most of the processing is done today),
     what % of overall data dependencies are dynamic and what % statically
     determinable?  Anybody have any nos. or references to prior work?
     One possibility is that interpreted & OO languages could benefit more
     than compiled languages.  Also, how much value is added in DS with good
     static scheduling techniques?  [DS is a good case for the memory
     bank conflict problem].

     ? Handling of Changes in processor configurations: ??? (turning on & off
     functional units?)

     - Context switches & Interrupts (embedded controllers?):
     It seems that exception handling and context switches
     (esp. for embedded controller applications,
     where some of the RISCs find their way) are likely candidates for
     PIA (Pain In ....).

     + Multiple Instruction streams & scheduling in multiprocessor systems.
       Again a good candidate but too early to predict if that will
       be a real advantage.

     - Increase in chip area.  True, but by how much?  A more critical issue
     is the time (and silicon) spent in bookkeeping the dependencies in the
     instruction queue and the overheads in the handling of interrupts &
     context switches.

     A much later reference is Torng et al. [TOC Sep 86].  [Nos. ranging
     from 80-120% in improvement are reported by using a dispatch
     stack and associated book-keeping HW.  Their reference base is not
     clear to me.]

     Is there any justification to dedicate a  lot of HW for scoreboarding
     (it is probably too early to talk about it, and many may not even
     talk about it) or can we trust the compiler to do it all?  It seems
     scoreboarding can be done efficiently by the compiler, except in the
     case where there are multiple instruction streams threading thru
     different processors (if somebody has built that, then they have
     a data flow machine) or when there are lots of autonomously running
     functional units making the code re-organization  a hair raising
     task.  Cydrome has addressed some of these tradeoffs with scheduling
     in their Cydra (see "Cydra 5 Directed Dataflow Architecture" - Compcon 86).

     "There is a trend to move software problems to HW and code
     scheduling seems to be a candidate"!!! - Weiss & Smith (Will it be?)
---------------------
   Renukanthan Raman				ARPA:ram@sun.com
   Sun Microsystems			UUCP:{ucbvax,seismo,hplabs}!sun!ram
   M/S 18-41, 2500 Garcia Avenue,
   Mt. View,  CA 94043

firth@sei.cmu.edu (Robert Firth) (03/14/88)

Thanks to Randell Jesup, Bron Nelson, and others for prompting
this.  The question they discuss is, how might one do some
really good code reorganising for a machine with pipeline
constraints, e.g. a RISC machine.

First point: when we looked at the MIPS M/500, we found some
less than good code reorganisation.  Details are in our report,
but one place where the reorganiser was VERY conservative was
in rearranging loads and stores.

The reason for this was that the reorganiser input was at
Assembler level, and the algorithms generally made worst-case
assumptions about aliasing.

Second point: To get really good code reorganisation, much the
best place to do it is in the compiler back end.  That is because
you there have some really vital information, for instance

. conceptual type
. conceptual address space

At the assembler level, variables X and Y may seem to be just 16-bit
integers, but if they are of conceptually different types, they can't
alias.  Similarly, X and Y may both be machine addresses, but if they
are reference variables of different types, they can't alias.  Nor
can, say, a by-reference parameter alias a pointer to an allocated
object.

Of course, this assumes you are working with a true high-level language,
such as Modula-2 or Ada.  If the programmer can perform arbitrary casts
and puns, all bets are off.

Third point: a good compiler must carry this information right through
to final code selection and ordering.  In principle, every fragment
that the compiler ever generates must be linked back to the attributed
parse tree, so that no attribute is ever lost or rendered inaccessible.

Hence, when you are debating whether to rearrange

	store r2, xxx(r12)
	load  r1, yyy(r11)

you can look back and see, for instance, that the load is of an INTEGER
and the store of a DURATION, or the load is from module MATHLIB (with
r11 bound as base pointer) and the store into an allocated object of
type DATE (designated by a pointer in r12). Or whatever.

Final point: not many compilers do this.  The reason is that this design
largely abandons the traditional "front end / back end" separation.  If
the back end has to interrogate the APT, then it probably has to know
which language the source program came from.  Unless, of course, you
first design a common APT across your set of languages - and then you
throw away your automatic parser generator (actually, not a bad idea,
given some of the ones I've seen!).

I've rambled on enough.  More later, maybe, since I did take this idea
down to a detailed design and it may yet become a Modula compiler.

bcase@Apple.COM (Brian Case) (03/16/88)

In article <4553@aw.sei.cmu.edu> firth@bd.sei.cmu.edu.UUCP (Robert Firth) writes:
    [about keeping information around until the last minute to help
     reorganization and such.]

Although it doesn't reflect quite the same level of aggressiveness that
Robert talks about, the paper

    David W. Wall. Global register allocation at link-time.  _Proceedings
    of the SIGPLAN '86 Symposium on Compiler Construction._  _SIGPLAN
    Notices 21_ (7): 264-275 (July 1986).

talks about keeping information around until the very last.  I think
Robert's comments about very good compilers needing to abandon the now-
popular front-end back-end separation (so that "all you have to do is
write a new front end for a new language") will hold true.  It seems like
there is much to be gained.

grunwald@uiucdcsm.cs.uiuc.edu (03/17/88)

The paper by Wall keeps the front-end/back-end strategy; it just moves
the point of actual code generation to a later phase of the binding.
The linkers all use the IR generated by the front end. The last stage
is a translation of the linked IR to the machine code. 

The last stage (IR -> Assembler) performs global register allocation. The
advantages are that you have better information about other subroutines
at link time, and you can pass subroutine parameters in registers.

If you look at the improvements listed in tables 2, 4 and 5, you see that
the run-time information contibutes about as much improvement as
register coloring, but the two combined rarely perform significantly better
than one or the other.

This leads one to question the advantages of
	(1) Doing both
	(2) Bothering with recording the run-time data

Any comments?

suneel@hpclskj.HP.COM (Suneel Jain x75763) (03/17/88)

> From: firth@sei.cmu.edu (Robert Firth)

> Thanks to Randell Jesup, Bron Nelson, and others for prompting
> this.  The question they discuss is, how might one do some
> really good code reorganising for a machine with pipeline
> constraints, eg a RISC machine.

The Instruction Scheduler for the HP Precision Architecture uses 
aliasing information to determine if two memory references could
interfere with each other. This allows the scheduler to rearrange
loads and stores much better than if worst case assumptions are 
used. This has been especially useful in scheduling floating 
point operations.

The aliasing information is computed by the various front ends and
passed down to the level where the scheduler is run. This process is
described in the following paper:

    Deborah S. Coutant, "Retargetable High-Level Alias Analysis",
    Proc. ACM Symposium on Principles of Programming Languages, 
    January 1986, pp 110-118. 

Suneel Jain
Hewlett-Packard
suneel@hpda.HP.COM

jesup@pawl16.pawl.rpi.edu (Randell E. Jesup) (03/20/88)

In article <3300024@uiucdcsm> grunwald@uiucdcsm.cs.uiuc.edu writes:
>
>The paper by Wall keeps the Front-end/Back-end strategy, it just moves
>the point of actual code generation to a later phase of the binding.
>The linkers all use the IR generated by the front end. The last stage
>is a translation of the linked IR to the machine code. 

	This is a good way to do things (if I understand you), as linking
BEFORE final optimization allows true global optimization.  I'm not familiar
with that paper; where was it published?  (I'm no longer at GE, so I don't
see these things unless I look.)

>The last stage (IR -> Assembler) performs global register allocation. The
>advantages are that you have better information about other subroutines
>at link time, and you can pass subroutine parameters in registers.

	You can pass stuff in registers without this; it's just less efficient.
The added information DOES allow some really nice optimizations, especially
on something like the RPM-40, where branches, etc are variable length.

>If you look at the improvements listed in tables 2, 4 and 5, you see that
>the run-time information contributes about as much improvement as
>register coloring, but the two combined rarely perform significantly better
>than one or the other.
>
>This leads one to question the advantages of
>	(1) Doing both
>	(2) Bothering with recording the run-time data

	Without having seen the paper, it's hard to tell, but it sounds like
Wall is just doing register allocation late, not other optimizations.  Also,
was he modeling a CISC or a RISC?  It can make a lot of difference (papers
about one are often not totally relevant to the other, a point that is
easy to overlook, since redoing the statistical analysis, etc., is a major
project).

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

jordan@aero.org (Larry M. Jordan) (04/09/91)

Where and when does instruction scheduling/reorganizing occur?  Is
this function performed by the HOL compilers?  (If so, the symbolic
assembly I'm reading obscures this fact.  If not, then during
assembly?)  

In a word or two, could someone summarize the pros/cons for either
approach (HOL or independent lang. independent pass)?

--Larry 

dclark@b11.ingr.com (Dave Clark) (04/10/91)

jordan@aero.org (Larry M. Jordan) writes:

>Where and when does instruction scheduling/reorganizing occur?  Is
>this function performed by the HOL compilers?  (If so, the symbolic
>assembly I'm reading obscures this fact.  If not, then during
>assembly?)  

Yes and no.  Most HOL compilers do some sort of optimizing on the
back-end.  Although I haven't (yet) seen an assembler that does
optimizations, I have seen utilities that can optimize assembly
code (part of an HOL compiler back-end).

One of the compilers I'm using actually does instruction reordering
within the linker!

>In a word or two, could someone summarize the pros/cons for either
>approach (HOL or independent lang. independent pass)?

HOL or high-level intermediate representation: Control structures are
more easily recognizable.  Restructuring is straightforward.  Machine-
specific optimizations are often omitted from this phase so that the
bulk of the compiler can be easily ported.

Assembly level: This is a good place to put machine-specific
optimizations.  Instruction timing and reordering is a simple search
problem, especially on the advanced RISC chips.
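
To show how simple the search can be, here is a toy greedy list
scheduler in C (entirely a sketch of mine: single-issue machine, a
precomputed dependence DAG, longest-latency-first priority; a real
scheduler adds register and functional-unit constraints):

	#include <stdio.h>

	#define MAXSUCC 4

	struct node {
	    int latency;        /* cycles until the result is ready   */
	    int npred;          /* dependence predecessors not issued */
	    int ready;          /* earliest cycle this insn may issue */
	    int nsucc;
	    int succ[MAXSUCC];  /* indices of dependent instructions  */
	};

	void list_schedule(struct node g[], int n)
	{
	    int cycle = 0, issued = 0, i;

	    while (issued < n) {
	        int best = -1;
	        for (i = 0; i < n; i++)   /* pick a ready insn,     */
	            if (g[i].npred == 0 && g[i].ready <= cycle &&
	                (best < 0 || g[i].latency > g[best].latency))
	                best = i;         /* longest latency first  */
	        if (best < 0) {           /* nothing ready: a NOP   */
	            printf("cycle %d: NOP\n", cycle++);
	            continue;
	        }
	        printf("cycle %d: issue insn %d\n", cycle, best);
	        g[best].npred = -1;       /* retire from ready pool */
	        for (i = 0; i < g[best].nsucc; i++) {
	            struct node *s = &g[g[best].succ[i]];
	            s->npred--;
	            if (cycle + g[best].latency > s->ready)
	                s->ready = cycle + g[best].latency;
	        }
	        issued++;
	        cycle++;
	    }
	}

	int main(void)
	{
	    /* insn 0 is a 3-cycle load feeding insn 2; insn 1 is an
	       independent 1-cycle op that hides part of the latency */
	    struct node g[3] = {
	        { 3, 0, 0, 1, { 2 } },
	        { 1, 0, 0, 0, { 0 } },
	        { 1, 1, 0, 0, { 0 } },
	    };
	    list_schedule(g, 3);
	    return 0;
	}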

>--Larry 


----------------------------------------------------------------------------
   Dave Clark                     |
   System Development             |  I among men bear the same wounded hand,
   Intergraph Corp., CR1102       |  suffer the same reddened cup
   Huntsville, AL 35894-0001      |  and live an identical rage.
   UUCP: uunet!ingr!b11!dclark    |                         -- Pablo Neruda
   Internet: dclark@b11.ingr.com  |
----------------------------------------------------------------------------

dahl@matterhorn.ee.umn.edu (Peter Dahl) (04/11/91)

In article <1991Apr10.131341.26357@b11.ingr.com> dclark@b11.ingr.com (Dave Clark) writes:
>jordan@aero.org (Larry M. Jordan) writes:
>
>>Where and when does instruction scheduling/reorganizing occur?  Is
>>this function performed by the HOL compilers?  (If so, the symbolic
>>assembly I'm reading obscures this fact.  If not, then during
>>assembly?)  
>
>One of the compilers I'm using actually does instruction reordering
>within the linker!
 ^^^^^^^^^^^^^^^^^
Register allocation can be also done by the linker. See "Register 
Windows vs. Register Allocation", David Wall, Proceedings of the
SIGPLAN '88 Conference on Programming Language Design and Implement-
ation, Atlanta, GA, June 22-24, 1988.

This paper describes why the author thinks that link time code 
modification is effective. He uses profiling of usage frequencies
to make decisions.  However, he only claims results as good as
compiler-based schemes and nearly as good as register windows.  The
part that is nice about this scheme is that it could set you up for
other global optimizations, depending on the types of information
the compiler tucks away for you to work with.
--

peter dahl@src.honeywell.com
I had fun once, and it wasn't anything like this

meissner@osf.org (Michael Meissner) (04/11/91)

In article <1991Apr10.131341.26357@b11.ingr.com> dclark@b11.ingr.com (Dave Clark) writes:

| jordan@aero.org (Larry M. Jordan) writes:
| 
| >Where and when does instruction scheduling/reorganizing occur?  Is
| >this function performed by the HOL compilers?  (If so, the symbolic
| >assembly I'm reading obscures this fact.  If not, then during
| >assembly?)  
| 
| Yes and no.  Most HOL compilers do some sort of optimizing on the
| back-end.  Although I haven't (yet) seen an assembler that does
| optimizations, I have seen utilities that can optimize assembly
| code (part of an HOL compiler back-end).

The MIPS assemblers do instruction scheduling.  It sometimes is
annoying to have to disassemble the object file to find exactly what
sequences of instructions are used.

| Assembly level: This is a good place to put machine-specific optimi-
| zations.  Instruction timing and reordering is a simple search prob-
| lem, especially on the advanced RISC chips.

However, at the assembly level it can be too late if the latency is
high enough.  For example, if the compiler reuses register temps quickly
enough, the assembler doesn't have a chance to do aggressive code
movement, since the next instruction reuses the register that you are
waiting to be reloaded.  For example, if the code sequence is:

	a = b;
	c = d;

a compiler on a load/store machine might rewrite this as:

	reg = b;
	a = reg;
	reg = d;
	c = reg;

whereas if you have a latency, and you have plenty of registers, you
really want to rewrite it as:

	reg1 = b;
	reg2 = d;
	a = reg1;
	c = reg2;

--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?

paulb@ssd.csd.harris.com (Paul Beusterien) (04/12/91)

In article  <MEISSNER.91Apr11114114@curley.osf.org> meissner@osf.org (Michael Meissner) writes :
} 
} However, at the assembly level it can be too late if the latency is
} high enough.  For example, if the compiler reuses register temps quickly
} enough, the assembler doesn't have a chance to do aggressive code
} movement, since the next instruction reuses the register that you are
} waiting to be reloaded.  For example, if the code sequence is:
} 
} 	a = b;
} 	c = d;
} 
} a compiler on a load/store machine might rewrite this as:
} 
} 	reg = b;
} 	a = reg;
} 	reg = d;
} 	c = reg;
} 
} whereas if you have a latency, and you have plenty of registers, you
} really want to rewrite it as:
} 
} 	reg1 = b;
} 	reg2 = d;
} 	a = reg1;
}    	c = reg2;


This example shows the value of a register reallocator that happens
as a pre-pass to instruction scheduling.  Doing the analysis to detect the
above case is much simpler than generic register allocation and it can be done
with basically an O(n) traversal over the assembly code.
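
A rough sketch of such a renaming pre-pass in C (my reading of the
idea, not Harris's code; a real pass must leave values that are live
out of the block alone and fall back when it runs out of registers):

	#define NREGS 32          /* architected registers        */
	#define NPOOL 64          /* names available for renaming */

	struct insn {
	    int dest;             /* register written, or -1      */
	    int src1, src2;       /* registers read, or -1        */
	};

	/* One O(n) forward walk over a basic block: each redefinition
	   of a register gets a fresh name, so the scheduler no longer
	   sees an anti-dependence between the old value's last use
	   and the new definition. */
	void rename_block(struct insn *code, int n)
	{
	    int map[NREGS];       /* current name of each orig reg */
	    int next = NREGS;     /* next unused name              */
	    int i;

	    for (i = 0; i < NREGS; i++)
	        map[i] = i;

	    for (i = 0; i < n; i++) {
	        if (code[i].src1 >= 0) code[i].src1 = map[code[i].src1];
	        if (code[i].src2 >= 0) code[i].src2 = map[code[i].src2];
	        if (code[i].dest >= 0) {
	            if (next < NPOOL)        /* fresh name available */
	                map[code[i].dest] = next++;
	            code[i].dest = map[code[i].dest];
	        }
	    }
	}

Run over Michael's example, the second "reg = d" gets a fresh name,
which is exactly the reg1/reg2 version he wants.
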
--
Paul Beusterien                          paulb@ssd.csd.harris.com
Harris Computer Systems        {uunet,mit-eddie,novavax}!hcx1!paulb
Ft. Lauderdale, FL                     voice: (305) 973 5270

jesup@cbmvax.commodore.com (Randell Jesup) (04/15/91)

In article <1991Apr10.131341.26357@b11.ingr.com> dclark@b11.ingr.com (Dave Clark) writes:
> Although I haven't (yet) seen an assembler that does
>optimizations, I have seen utilities that can optimize assembly
>code (part of an HOL compiler back-end).

	Actually, (somewhat) optimizing assemblers are by no means uncommon.
For example, the 680x0 assembler I use converts my lsl.l #1,Dn instruction
into add.l Dn,Dn for me (there are ones that do far more wide-ranging
optimizations, used as back-ends for compilers).

>One of the compilers I'm using actually does instruction reordering
>within the linker!

	That's the right place to do it!  If you have deep pipelines, being
forced to make worst-case assumptions at routine entry can be quite limiting.
By doing reorganization in (or after) the linker, you know all the pre-
conditions a routine will have to deal with (unless someone takes its
address, of course - then you go back to worst-case, unless you have a VERY
smart and lucky linker).

	The RPM-40 backend team (which I was part of) made use of this
trick, since we had long pipelines and load-delays (plus load/store/bra
interlocks to worry about).  Libraries were supposed to be effectively in
source format.

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com  BIX: rjesup  
Disclaimer: Nothing I say is anything other than my personal opinion.
Thus spake the Master Ninjei: "To program a million-line operating system
is easy, to change a man's temperament is more difficult."
(From "The Zen of Programming")  ;-)

sweany@berlioz.cs.colostate.EDU (Phil Sweany) (04/16/91)

In article  <MEISSNER.91Apr11114114@curley.osf.org> 
   meissner@osf.org (Michael Meissner) writes :

} However, at the assembly level it can be too late if the latency is
} high enough.

(and continues with an example where reusing a hard register limits
 instruction scheduling by creating an anti-dependency on that register).


In article <PAULB.91Apr11130041@rcx1.ssd.csd.harris.com> 
  paulb@ssd.csd.harris.com  (Paul Beusterien) suggests

  > This example shows the value of a register reallocator that happens
  > as a pre-pass to instruction scheduling.  


Another (better?) solution is available if graph coloring register
assignment is used.  Since the problem is caused by anti-dependencies
on the hard registers which limit scheduling's ability to overlap
operations, we can circumvent it by delaying the assignment
of symbolic registers until after instruction scheduling has taken place.
Steve Beaty and I described such a scheme in: 

@inproceedings
{
        SwBe90,
        topic =         confp,
        author =        "Sweany, P. and Beaty, S.",
        title =         "Post-Compaction Register Assignment in a
				Retargetable Compiler",
        booktitle =     "Proceedings of the 23rd Microprogramming
				Workshop (MICRO-23)",
        address =       "Orlando, FL",
        month =         "November",
        year =          "1990"
}

---
Phil Sweany			sweany@cs.colostate.edu

meissner@osf.org (Michael Meissner) (04/16/91)

In article <20632@cbmvax.commodore.com> jesup@cbmvax.commodore.com
(Randell Jesup) writes:

| >One of the compilers I'm using actually does instruction reordering
| >within the linker!
| 
| 	That's the right place to do it!  If you have deep pipelines, being
| forced to make worst-case assumptions at routine entry can be quite limiting.
| By doing reorganization in (or after) the linker, you know all the pre-
| conditions a routine will have to deal with (unless someone takes it's
| address, of course - then you go back to worst-case, unless you have a VERY
| smart and lucky linker).

Except of course if you are using object oriented techniques, and are
calling through a function pointer.  In which case, the linker has to
fall back to the normal case of using fixed defaults.
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?

firth@sei.cmu.edu (Robert Firth) (04/17/91)

In article <MEISSNER.91Apr16121331@curley.osf.org> meissner@osf.org (Michael Meissner) writes:

[linker optimisation of function calls]

>Except of course if you are using object oriented techniques, and are
>calling through a function pointer.  In which case, the linker has to
>fall back to the normal case of using fixed defaults.

I don't understand the logic here.  Object oriented techniques are ways
of designing code, not of implementing it.  I know of no evidence that
OOD leads to code with greater dynamic functional variability than other
techniques.

Even if, for some reason, the prototyping tools use function pointers
to increase development efficiency, the production code will probably
use constant pointers (or write-once pointers, as the case may be).
Propagating the constant value to the call sites is one of the simplest
global optimisations a linker can do, and of course the door is then
opened to all the other neat tricks.

cs450a03@uc780.umd.edu (04/17/91)

Curiosity question:

Are there any architectures with setbranch/commit semantics?  (where
setbranch sets up a branch target, and allows for pipeline stuffing,
and commit actually changes flow of control).

Or are pipeline stages generally too expensive for this kind of
duplication?

Inquiring minds want to know...

Raul Rockwell

preston@ariel.rice.edu (Preston Briggs) (04/17/91)

cs450a03@uc780.umd.edu writes:

>Are there any architectures with setbranch/commit semantics?  (where
>setbranch sets up a branch target, and allows for pipeline stuffing,
>and commit actually changes flow of control).

The proposed Tera machine has 8 _target registers_ that can serve
as the target of branches.  I don't know what they'll actually do when
the machine is built.  In addition to preloading some of the pipeline
in each direction, others have proposed prefetching and locking the
appropriate I-cache and TLB lines.

Preston Briggs

meissner@osf.org (Michael Meissner) (04/17/91)

In article <24130@as0c.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:

| In article <MEISSNER.91Apr16121331@curley.osf.org> meissner@osf.org (Michael Meissner) writes:
| 
| [linker optimisation of function calls]
| 
| >Except of course if you are using object oriented techniques, and are
| >calling through a function pointer.  In which case, the linker has to
| >fall back to the normal case of using fixed defaults.
| 
| I don't understand the logic here.  Object oriented techniques are ways
| of designing code, not of implementing it.  I know of no evidence that
| OOD leads to code with greater dynamic functional variability than other
| techniques.

I'm thinking of virtual functions in C++, sending generalized messages
in Objective C, and callbacks in X.  At least in perusing the object
oriented literature, great play is made of the ability to subclass
an object and replace its handler.

| Even if, for some reason, the prototyping tools use function pointers
| to increase development efficiency, the production code will probably
| use constant pointers (or write-once pointers, as the case may be).
| Propagating the constant value to the call sites is one of the simplest
| global optimisations a linker can do, and of course the door is then
| opened to all the other neat tricks.

Not in a lot of code that I see.  Calling shared libraries and
dynamically loaded executables is another way of calling through
pointers.

Take off those Fortran colored glasses :-)
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?

jbuck@janus.Berkeley.EDU (Joe Buck) (04/18/91)

In article <24130@as0c.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
>In article <MEISSNER.91Apr16121331@curley.osf.org> meissner@osf.org (Michael Meissner) writes:
>
>[linker optimisation of function calls]
>
>>Except of course if you are using object oriented techniques, and are
>>calling through a function pointer.  In which case, the linker has to
>>fall back to the normal case of using fixed defaults.
>
>I don't understand the logic here.  Object oriented techniques are ways
>of designing code, not of implementing it.  I know of no evidence that
>OOD leads to code with greater dynamic functional variability than other
>techniques.

While it is true that some virtual function calls (to use C++ terminology)
can be eliminated, the use of function pointers is really an inevitable
part of true object oriented design, and there's no real loss of efficiency
by doing so: the extra overhead of the function pointer is traded off
against the extra overhead of a switch statement (we write far fewer
switch statements and less control logic in OO programs than the equivalent
non-OO programs).

For example the C function:

doSomethingTo (objPointer)
struct object * objPointer;
{
	switch (objPointer->type) {
		case TYPE_A: return doA(objPointer);
		case TYPE_B: return doB(objPointer);
		...
		default: ;	/* error */
	}
}

is replaced in C++ by making "doSomethingTo" a virtual function, and the
switch statement is eliminated.  Existing code no longer needs to be
modified if a new type of object is added.
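
In C terms the virtual call amounts to a function pointer stored with
(or reachable from) the object; a sketch with my own names:

	struct object {
	    int type;
	    int (*doSomething)(struct object *);  /* per-type handler,
	                                             set at creation  */
	};

	/* the switch disappears; adding a type changes nothing here */
	int doSomethingTo(struct object *objPointer)
	{
	    return (*objPointer->doSomething)(objPointer);
	}

And it is exactly this indirect call that a link-time optimizer cannot
resolve statically, which is the point at issue.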

Since OO techniques also make incremental linking much easier, we really
do want the ability to add new code without modifying even the machine
code.  So yes, we do want function pointers, and no, they are not going
away; anyone designing a new architecture that doesn't handle
indirect function calls efficiently is really going to suffer in the
next few years, since indirect function calls are going to become more
common.

--
Joe Buck
jbuck@janus.berkeley.edu	 {uunet,ucbvax}!janus.berkeley.edu!jbuck