bron@olympus.SGI.COM (Bron C. Nelson) (03/11/88)
In article <477@imagine.PAWL.RPI.EDU>, jesup@pawl23.pawl.rpi.edu (Randell E. Jesup) writes:
> I think these issues have been beaten to death. Anyone want a new subject,
> like reorganizing algorithms? (my specialty :-) Is there anyone else out
> there that does linking before reorganization? Or given thought to doing
> optimal (instead of near-optimal) reorganization in small code blocks?

I think this is a fine idea.  Instruction scheduling/reorganization is a neat field that still has lots of room for good ideas.  Also, it is destined to become an increasingly "hot" topic as memory latencies increase and asynchronous functional units become more and more common even in the micro world (the MIPS R2000 for example has autonomous integer multiply and divide hardware that can run overlapped with other "standard" ALU ops, and of course the floating point co-processor runs autonomously as well).

This isn't really an architectural issue, so it probably doesn't really belong in comp.arch.  However, it does have interesting architectural impact.  (For example, a machine similar to the Multiflow VLIW (tm I think) machine could probably have been BUILT years ago, but without something like a trace scheduler, the machine would be worthless.)  Also, there doesn't really seem to be a good place to discuss this except perhaps comp.misc (or comp.compilers, but I'd rather avoid moderation :-) at least to start).  Ergo, I will post my first few articles on this topic to comp.arch, cross-posted to comp.misc.  If enough comp.arch people complain, we'll use comp.misc exclusively.

Bron Nelson     bron@sgi.com
Silicon Graphics Inc.
Don't blame my employers for my opinions
bron@olympus.SGI.COM (Bron C. Nelson) (03/11/88)
Recently jesup suggested we start talking about something else and proposed reorganization algorithms as a topic.  I think this is a fine idea.  Instruction scheduling/reorganization is a neat field that still has lots of room for good ideas.  Also, it is destined to become an increasingly "hot" topic as memory latencies increase and asynchronous functional units become more and more common even in the micro world (the MIPS R2000 for example has autonomous integer multiply and divide hardware that can run overlapped with other "standard" ALU ops, and of course the floating point co-processor runs autonomously as well).

This isn't really an architectural issue, so it probably doesn't really belong in comp.arch.  However, it does have interesting architectural impact.  (For example, a machine similar to the Multiflow VLIW (tm I think) machine could probably have been BUILT years ago, but without something like a trace scheduler, the machine would be worthless.)  Also, there doesn't really seem to be a good place to discuss this except perhaps comp.misc (or comp.compilers, but I'd rather avoid moderation :-) at least to start).  Ergo, I will post my first few articles on this topic to comp.misc, cross-posted to comp.arch.  If enough comp.arch people complain, I'll cease cross-posting.

Bron Nelson     bron@sgi.com
Silicon Graphics Inc.
Don't blame my employers for my opinions
bron@olympus.SGI.COM (Bron C. Nelson) (03/11/88)
(Hummm... I see I somehow managed to stupidly post BOTH versions of my introduction article.)

In an attempt to deflect the endless bickering over the RPM40 from comp.arch, I am taking a cue from Randell Jesup and am trying to start a discussion of Instruction Scheduling.  Please contribute your own experiences.  Let's start things off with the best of my scanty bibliography.  I strongly encourage people to suggest other relevant papers (e-mail to me and I'll summarize, or post directly if you're so inclined).

Thomas Gross, "Code Optimization of Pipeline Constraints",
Stanford University Computer Systems Laboratory,
Technical Report No. 83-255, December 1983.
    Based on the author's Ph.D. dissertation.  Describes an instruction
    scheduler that (I believe) formed the basis of the one that is used
    by MIPS Co.  Scheduling is done after register selection (post-pass
    scheduling).

Phillip Gibbons and Steven Muchnick, "Efficient Instruction Scheduling
for a Pipelined Architecture", Proceedings of the 1986 SIGPLAN Compiler
Construction Conference, pp. 11-16.
    Instruction scheduler for the HP Precision Architecture.  Seems to
    be strongly based on Gross's work.

Wei-Chung Hsu, "Register Allocation and Code Scheduling for Load/Store
Architectures", Ph.D. thesis, University of Wisconsin, Madison, 1987.
    Discusses pros and cons of scheduling before or after register
    allocation (pre-pass vs post-pass methods).  Gives examples and good
    algorithms for both.  Talks about the desirability of doing
    integrated scheduling and register allocation and gives simple
    methods of keeping the one in mind while doing the other to help
    improve both.

John Ellis, "Bulldog: A Compiler for VLIW Architectures", Ph.D. thesis,
Yale University, 1985.  Also published in the ACM Doctoral Dissertation
Award Series by MIT Press.
    Trace scheduling explained.  A lot of reading, but well worth it.
    The idea that makes VLIW machines work.
Rajiv Gupta and Mary Soffa, "Region Scheduling", Proceedings of the
2nd International Conference on Supercomputing, International
Supercomputing Institute, Inc., 1987, Vol. III, pp. 141-148.
    Describes work-in-progress without any hard numbers, but this looks
    like a very clever idea.  The authors claim superiority over trace
    scheduling on several grounds.  The ideas seem more solidly based
    than trace scheduling (which always struck me as a good ad-hoc idea
    carried to extremes) as well as easier to understand.  If this lives
    up to its promises, I would rate it as something to follow closely.
    BTW - anyone know where I can contact the authors/get more info?

Well, that's it!  (except for old and/or scummy papers that aren't worth reading).  Surely some of you Out There can suggest more.

Bron Nelson     bron@sgi.com
Don't blame my employers for my opinions
bron@olympus.SGI.COM (Bron C. Nelson) (03/12/88)
What you need to know in order to do decent instruction scheduling is the idea of "dependence"; you need to know which instructions MUST precede which other instructions, and which can be done independently.  For ALU ops, the dependencies can be seen in the register usage.  The real problem is loads/stores.  Without some kind of analysis, you cannot move a load past a store, and you cannot move a store past a load OR a store.  This can severely inhibit the amount of re-ordering that can be done, particularly if you only schedule a basic block at a time.

The question I want the answer to is: how much and what kind of analysis is actually done by the various compilers Out There?  To do a really good job, you need the kind of data dependence analysis that a vectorizing compiler does.  Cray Research, Multiflow, and Ardent Computer are the only firms I know of for sure that generate this kind of information and then use it when doing instruction scheduling.  (I'd bet that IBM, CDC, NEC, etc. do it too, but I don't know for sure.)  One problem is that data dependence analysis is done at a fairly high level, while instruction scheduling is a very low level optimization.  Transmitting the information through the compiler can be hard.  Question: How detailed is the information passed to the instruction scheduler?  Anyone at Cray/Multiflow/Ardent etc. care to say?

A lot of things that occur in practice can be seen to be independent without resorting to full data dependence analysis (disjoint global arrays, scalars from the stack, etc.)  How much of this kind of analysis is done by various compilers?  What sort of disambiguation has been found most useful for scheduling?  Anyone at MIPSco, AMD, GE, Mot., Sun, etc. care to enlighten us?
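[The conservative rule above - never move a load above a store, never move a store above a load or a store - can be made concrete with a small sketch.  This is only an illustration, not any real compiler's IR: the three-address instruction encoding is invented.  It builds the dependence edges for one basic block, handling register true/anti/output dependences exactly and memory worst-case.]

```python
# Sketch: build the dependence DAG for one basic block.  Instructions
# are (op, dst, srcs) tuples; 'load' and 'store' touch memory.  With no
# alias analysis, memory ordering is worst-case, as described above.

def build_dag(block):
    edges = set()                       # (i, j): instruction i must precede j
    last_def = {}                       # register -> index of latest writer
    readers = {}                        # register -> readers since last write
    for j, (op, dst, srcs) in enumerate(block):
        for r in srcs:                  # true dependence (read-after-write)
            if r in last_def:
                edges.add((last_def[r], j))
        if dst is not None:
            for i in readers.get(dst, []):   # anti-dependence (write-after-read)
                edges.add((i, j))
            if dst in last_def:              # output dependence (write-after-write)
                edges.add((last_def[dst], j))
            last_def[dst] = j
            readers[dst] = []
        for r in srcs:
            readers.setdefault(r, []).append(j)
        # conservative memory ordering: a load never moves above a store,
        # a store never moves above a load or a store
        if op == 'load':
            edges |= {(i, j) for i in range(j) if block[i][0] == 'store'}
        elif op == 'store':
            edges |= {(i, j) for i in range(j)
                      if block[i][0] in ('load', 'store')}
    return edges

block = [('load',  'r1', ['sp']),        # r1 = mem[sp]
         ('add',   'r2', ['r1', 'r1']),  # r2 = r1 + r1
         ('store', None, ['r2', 'sp']),  # mem[sp] = r2
         ('load',  'r3', ['gp'])]        # must stay below the store
print(sorted(build_dag(block)))          # -> [(0, 1), (0, 2), (1, 2), (2, 3)]
```

Any pair of instructions with no path between them in this DAG is free to be reordered; the final load is pinned behind the store only by the worst-case memory rule, which is exactly what alias analysis would relax.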
Studies done several years ago (sorry, don't have the references, but I believe they are cited in the bibliography of Ellis's "Bulldog" thesis) showed that in typical Fortran programs, a basic block only has about a factor of 2 of parallelism in the operations.  For cpu's with only a few 2-cycle instructions (e.g. load/branch, like the original Berkeley RISC and Stanford MIPS machines) this is probably enough; you could typically find enough "local" parallelism to fill the slots most of the time.  If many of your instructions take multiple cycles (e.g. a Cray X-MP, where even an address-length (24-bit) integer add takes 2 cycles, and a load takes 14), you need to do more, or you'll wait a lot.  For example, a measurement done at LLNL in July of 1985 showed that nearly **60%** of the cycles on their Cray X-MP48 were spent interlocked waiting for results of previous computations.  (This does NOT count time spent waiting after a branch, which is probably about another 5-10%.)  Clearly, scheduling only within a basic block was not enough; the recent Cray compilers will push some operations (e.g. loads) across block boundaries.  VLIW machines have the same sort of problem; the machine is capable of issuing instructions faster than the (data dependent) computations can deliver results.  VLIW machines also break block boundaries to find more instructions to issue concurrently.

The whole point of the last paragraph is to say that the amount of worry you put into this optimization is very dependent on the payoff your particular hardware can get out of it (no surprise), and that the payoff varies dramatically from machine to machine.  SO, the question (finally!) is: for YOUR particular architecture, how much time is spent interlocked (or executing NO-OPs, for those machines without interlocks)?  Aggregate numbers and integer-only numbers are fine, but particularly interesting would be numbers involving multi-cycle instructions (e.g. floating point).
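[As a toy illustration of how ordering trades against interlocks, here is a stall counter for a hypothetical single-issue interlocked machine.  The dependence sets and latencies are invented for the example - they are not LLNL's or Cray's numbers.]

```python
# Toy model of interlock cycles on a single-issue machine.  deps maps an
# instruction to the instructions whose results it needs; latency gives
# cycles from issue until the result is available.

def stall_cycles(order, deps, latency):
    ready = {}                          # instr -> cycle its result is ready
    cycle = stalls = 0
    for j in order:
        start = cycle                   # earliest possible issue slot
        for i in deps.get(j, []):
            start = max(start, ready[i])    # interlock: wait for operands
        stalls += start - cycle
        cycle = start + 1               # one instruction issues per cycle
        ready[j] = start + latency[j]
    return stalls

# Two loads (2-cycle latency) feeding two adds (1-cycle latency).
deps = {2: [0], 3: [1]}
latency = {0: 2, 1: 2, 2: 1, 3: 1}
print(stall_cycles([0, 2, 1, 3], deps, latency))  # naive order: 2 stalls
print(stall_cycles([0, 1, 2, 3], deps, latency))  # scheduled:   0 stalls
```

Even this tiny example shows the payoff scaling with latency: make the loads 14 cycles, as on the X-MP, and the naive order stalls for 26 cycles while the scheduled order stalls for 24 - local scheduling alone can no longer cover the latency, which is exactly why the Cray compilers push loads across block boundaries.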
An aside: do the current "no hardware interlocks" cpus REALLY have no interlocks, or are they just talking about the integer ALU ops?  Since I know something about the MIPSco chip(s), I'll use that as an example: when you do an integer divide, do you really put in 35 no-ops, or is this "special cased"?  Does the f.p. co-processor have interlocks?

This is already too long; I'll beat the subject of memory latency to death sometime in the future.

---------------------------------------------------------------------
Bron Nelson     bron@sgi.com
Don't blame my employers for my opinions
jesup@pawl14.pawl.rpi.edu (Randell E. Jesup) (03/12/88)
In article <12678@sgi.SGI.COM> bron@olympus.SGI.COM (Bron C. Nelson) writes:
>                            The real problem is loads/stores. Without
>some kind of analysis, you cannot move a load past a store, and
>you cannot move a store past a load OR a store. This can severely
>inhibit the amount of re-ordering that can be done, particularly
>if you only schedule a basic block at a time.

Yup!  There are some simple cases, such as where you're loading/storing with the same base register, and the register wasn't changed in between (in particular, this helps local variable references.)  Maybe there should be assembler instructions (not real ones) that contain some of this information that the compiler/programmer has found out.  Example: (addr is offset(base))

	uldw reg,addr			- unaliased load word (no references
					  via other registers/absolute will
					  occur)
	ustw addr,reg			- unaliased store word
	aldw reg,addr,addr[,addr...]	- aliased load word - references via
					  the later addrs are aliases of the
					  one being used.
	etc.

This is all in all pretty kludgey.  I like this better:

	.alias addr,addr,addr	- informs the back end (assembler/reorganizer)
				  that these addresses are aliased; otherwise
				  it can assume it's safe to move loads/stores
				  around (within the usual bounds.)

This will only be valid until any one of the address registers used is changed, after which the reorganizer must assume full aliasing again until told otherwise.  The net effect is lots of these near the beginnings of blocks.  But it does do the job (if the compiler can figure it out, that is!)

>                                         Clearly, scheduling only within a
>basic block was not enough; the recent Cray compilers will push
>some operations (i.e. loads) across block boundaries.

Global optimization is the wave of the future in high-speed pipelined processors (IMHO).  Actually, some pretty old compilers do this: it's called removing loop invariants.  It is, of course, a special-case optimization, but it does move things across block boundaries.  (And boy is it a pain when it gets it wrong.
I had a fortran subroutine with dummy[3] = dummy[3] in it so it wouldn't pull it out of the loop (it was equivalenced.))

>The whole point of the last paragraph is to say that the amount of
>worry you put into this optimization is very dependent on the payoff
>your particular hardware can get out of it (no surprise), and that
>the payoff varies dramatically from machine to machine.

The potential payoff is almost directly proportional to the length of your load delays.  The faster CPUs get, given current memory, board, and IC packaging technology, the longer these delays will get.

>An aside: do the current "no hardware interlocks" cpus REALLY have
>no interlocks, or are they just talking about the integer ALU ops?

The RPM40 has NO interlocks, nor would I envision any XP for it having any either.  However, that doesn't mean SOME interlocks aren't a good thing, especially in long-load-delay machines, IFF you can do it without affecting cycle time (i.e. it doesn't hit the critical path).

//	Randell Jesup			Lunge Software Development
//	Dedicated Amiga Programmer	13 Frear Ave, Troy, NY 12180
\\//	beowulf!lunge!jesup@steinmetz.UUCP	(518) 272-2942
 \/	(uunet!steinmetz!beowulf!lunge!jesup)	BIX: rjesup
(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)
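[One way a reorganizer might consume Jesup's hypothetical ".alias" directive.  Everything here - the directive, its semantics, the offset(base) address syntax - follows his sketch and is illustrative only.]

```python
# Sketch: with ".alias" semantics, two addresses are assumed to conflict
# only if they are identical or some directive named both; absent any
# declared aliasing, the reorganizer is free to move a load past a store.

def may_conflict(addr_a, addr_b, alias_sets):
    if addr_a == addr_b:
        return True
    return any(addr_a in s and addr_b in s for s in alias_sets)

# ".alias 0(r4),8(r5)" emitted by the compiler for this block:
alias_sets = [{'0(r4)', '8(r5)'}]

print(may_conflict('0(r4)', '8(r5)', alias_sets))  # True:  keep ordering
print(may_conflict('0(r4)', '4(r6)', alias_sets))  # False: free to reorder
```

As Jesup notes, the alias_sets would have to be invalidated the moment any base register in them is redefined, so in practice the directives cluster near block entries.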
root@mfci.UUCP (SuperUser) (03/13/88)
In article <12678@sgi.SGI.COM> bron@olympus.SGI.COM (Bron C. Nelson) writes:
]Question: How detailed is the information passed
]to the instruction scheduler? Anyone at Cray/Multiflow/Ardent etc.
]care to say?
I'll defer to one of our compiler people to answer this one (left this
in so you wouldn't think I was avoiding it, even though I am...)
]
].... Clearly, sceduling only within a
]basic block was not enough; the recent Cray compilers will push
]some operations (i.e. loads) across block boundries. VLIW machines
]have the same sort of problem; the machine is capable of issuing
]instructions faster than the (data dependent) computations can deliver
]results. VLIW machines also break block boundries to find more
]instructions to issue concurrently.
A minor nit: it's the compiler, in our case a Trace Scheduling (tm)
compacting compiler (hope that keeps our legal guys happy) that breaks
the block boundaries; the VLIW part makes it worthwhile. And we're
usually careful to distinguish between instructions, which are the
aggregate total of bits which all come flying out of the icache at
once, vs. packets, which are the individual control fields for the
various functional units. I usually look at the machine as a parallel
collection of pipelines, each of different length, most
of which (especially the floating pipes) have a latency of at least
one instruction.
]
]The whole point of the last paragraph is to say that the amount of
]worry you put into this optimization is very dependent on the payoff
]your particular hardware can get out of it (no surprise), and that
]the payoff varies dramatically from machine to machine. SO, the
]question (finally!) is: for YOUR particular architecture, how much
]time is spent interlocked (or executing NO-OPs for those machines
]without interlocks)? Aggregate numbers and integer only numbers are
]fine, but particularly interesting would be numbers involving multi-
]cycle instructions (e.g. floating point).
I know you want numbers here, but I think a few more points should
be included. First, the answers will vary greatly with the code you're
discussing. I think you know you have a balanced machine if a wide
range of code causes the machine to limit in different places. For
instance, one program may (in a hypothetical VLIW) cause the register
read ports to become the bottleneck, another may run out of memory
bandwidth, a third might require more floating point units, a
fourth may do a lot of non-pipelined operations (divide), and a fifth
might do chained memory operations that can't be done in parallel
(which means your cpu waits around for the memory latency a lot -- this
hypothetical VLIW has no data cache.) Depending
on which limit is the one currently blocking performance, your answer
on the NOPs will vary a lot. Perhaps if you pick a benchmark and
ask we'll be able to compare results better.
]
]An aside: do the current "no hardware interlocks" cpus REALLY have
]no interlocks, or are they just talking about the integer ALU ops?
]Since I know something about the MIPSco chip(s), I'll use that as
]an example: when you do an integer divide, do you really put in 35
]no-ops, or is this "special cased"? Does the f.p. co-processor
]have interlocks?
We have no hardware interlocks except for a bank stall resolver built
into the memory controllers.
]Bron Nelson bron@sgi.com
I hope to have some reportable numbers on this stuff soon.
Bob Colwell mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405 203-488-6090
ram%shukra@Sun.COM (Renu Raman, Taco-Bell Microsystems (A Bell Operating Co.)) (03/14/88)
In article <12512@sgi.SGI.COM> bron@olympus.SGI.COM (Bron C. Nelson) writes:
>This isn't really an architectural issue, so probably doesn't
>really belong in comp.arch. However, it does have interesting

[This IS an architectural issue: expanding the realm of processors that you have mentioned, include vector processors and multiple functional unit processors.]

>architectural impact. (For example, a machine similar to the
>Multi-FLow VLIW (tm I think) machine could probably have been
>BUILT years ago, but without something like a trace-scheduler,
>
>Bron Nelson     bron@sgi.com
>Don't blame my employers for my opinions

Neither mine.  [Hi Bron]

Great!  I have been wanting to write about scoreboarding, and here comes along something.

OK.  Ardent's new machine, as well as Motorola's (probably announced) and Apollo's pre-announced machines, all have some HW dedicated to scoreboarding (in Ardent's case 1/3 of the vector control gate count is dedicated to scoreboarding; I don't have any info on Apollo's & Motorola's scoreboarding yet).  In the case of Ardent, the scoreboarding is dedicated to the vector ops (correct me - the info that I have is very hazy), but in the case of Apollo & Moto it is for multiple scalar FUs.

HW support for dynamic scheduling of code first appeared in the CDC 6600 (Thornton) and later on in the IBM 360/?? - specifically for the FP section.  (I think Tomasulo was involved or wrote about it.)  What is amazing is that in the Cray-1, which had multiple functional units, the issue scheme was at most 1/cycle (gravest source of imbalance).  Weiss & Smith [1] (Astronautics Corp.) showed that by tagging the register file on the Cray to do the dependency checks, the performance could be improved by as much as 43% (bare bones, without any compiler optimizations through static re-organization) [no estimates on the likely increase in silicon area].  In any case Cray avoided any sort of scoreboarding!!!

Some of the + & - in dynamic scheduling are:

(1/2)+	Compiler can be unburdened.  Not a strong argument, as in general you compile once and run the program a zillion times.  One can always turn off optimizations to go through the edit-compile-debug cycle faster.

(1/2)+	Performance is not tied to quality of compiler.  Again not a real plus.  Maybe true of the PC world, where a compiler at $40 vs. $200 makes a difference.

+/-	Handling of dynamic dependencies [DS] that a compiler cannot.  Valid case.  For procedural languages (where most of the processing is done today), what % of overall data dependencies are dynamic and what % statically determinable?  Anybody have any numbers or references to prior work?  One possibility is that interpreted & OO languages could benefit more than compiled languages.  Also, how much value is added in DS with good static scheduling techniques?  [DS is a good case for the memory bank conflict problem.]

?	Handling of changes in processor configurations: ??? (turning on & off functional units?)

-	Context switches & interrupts (embedded controllers?).  It seems exception handling and context switches (esp. for embedded controller applications, where some of the RISCs find their way) are likely candidates for PIA (Pain In ....).

+	Multiple instruction streams & scheduling in multiprocessor systems.  Again a good candidate, but too early to predict if that will be a real advantage.

-	Increase in chip area.  True, but by how much?  A more critical issue is the time (and silicon) spent in bookkeeping the dependencies in the instruction queue and the overheads in handling interrupts & context switches.

A much later reference is Torng et al. [TOC Sep 86].  [Improvements ranging from 80-120% are reported by using a dispatch stack and associated bookkeeping HW.  Their reference base is not clear to me.]

Is there any justification to dedicate a lot of HW to scoreboarding (it is probably too early to talk about it, and many may not even talk about it), or can we trust the compiler to do it all?  It seems scoreboarding can be done efficiently by the compiler, except in the case where there are multiple instruction streams threading through different processors (if somebody has built that, then they have a data flow machine), or when there are lots of autonomously running functional units making the code re-organization a hair-raising task.  Cydrome has addressed some of the tradeoffs with scheduling in their Cydra (see "Cydra 5 Directed Dataflow Architecture" - Compcon 86).

"There is a trend to move software problems to HW and code scheduling seems to be a candidate"!!! - Weiss & Smith  (Will it be?)

---------------------
Renukanthan Raman		ARPA: ram@sun.com
Sun Microsystems		UUCP: {ucbvax,seismo,hplabs}!sun!ram
M/S 18-41, 2500 Garcia Avenue, Mt. View, CA 94043
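[For readers who haven't met scoreboarding: a drastically simplified model of the issue interlock a scoreboard enforces - in-order issue, stalling until the source and destination registers are free - fits in a few lines.  The program and latencies are invented; a real CDC-6600-style scoreboard also tracks functional units and allows out-of-order completion, which this sketch ignores.]

```python
# Bare-bones model of scoreboard issue interlocks: issue in order,
# stalling until every source register (RAW) and the destination
# register (WAW) are free.

def issue_cycles(program, latency):
    ready_at = {}                   # register -> cycle its pending result lands
    cycle = 0                       # next free issue slot
    for op, dst, srcs in program:
        need = max([ready_at.get(r, 0) for r in srcs + [dst]] + [cycle])
        cycle = need + 1            # the instruction issues at cycle 'need'
        ready_at[dst] = need + latency[op]
    return cycle                    # cycles consumed through the last issue

latency = {'load': 3, 'fmul': 5}
prog = [('load', 'r1', ['r2']),
        ('fmul', 'r3', ['r1', 'r1']),   # stalls until r1 arrives
        ('load', 'r4', ['r2'])]
print(issue_cycles(prog, latency))                        # -> 5

# Static reordering hides the same stall with no hardware at all:
print(issue_cycles([prog[0], prog[2], prog[1]], latency)) # -> 4
```

The second call is the compiler's side of Raman's question: for straight-line code with statically known dependencies, the reordered stream buys the same cycles that the interlock hardware would otherwise spend stalled.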
firth@sei.cmu.edu (Robert Firth) (03/14/88)
Thanks to Randell Jesup, Bron Nelson, and others for prompting this.  The question they discuss is, how might one do some really good code reorganising for a machine with pipeline constraints, eg a RISC machine.

First point: when we looked at the MIPS M/500, we found some less than good code reorganisation.  Details are in our report, but one place where the reorganiser was VERY conservative was in rearranging loads and stores.  The reason for this was that the reorganiser input was at Assembler level, and the algorithms generally made worst-case assumptions about aliasing.

Second point: to get really good code reorganisation, much the best place to do it is in the compiler back end.  That is because you there have some really vital information, for instance

	. conceptual type
	. conceptual address space

At the assembler level, variables X and Y may seem to be just 16-bit integers, but if they are of conceptually different types, they can't alias.  Similarly, X and Y may both be machine addresses, but if they are reference variables of different types, they can't alias.  Nor can, say, a by-reference parameter alias a pointer to an allocated object.  Of course, this assumes you are working with a true high-level language, such as Modula-2 or Ada.  If the programmer can perform arbitrary casts and puns, all bets are off.

Third point: a good compiler must carry this information right through to final code selection and ordering.  In principle, every fragment that the compiler ever generates must be linked back to the attributed parse tree, so that no attribute is ever lost or rendered inaccessible.  Hence, when you are debating whether to rearrange

	store	r2, xxx(r12)
	load	r1, yyy(r11)

you can look back and see, for instance, that the load is of an INTEGER and the store of a DURATION, or the load is from module MATHLIB (with r11 bound as base pointer) and the store into an allocated object of type DATE (designated by a pointer in r12).  Or whatever.
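[Firth's third point reduces to an attribute check at reordering time.  A minimal sketch, assuming the attributed parse tree hands down a conceptual type and an address space per memory reference - the attribute names and values here are invented placeholders.]

```python
# Sketch of attribute-based disambiguation: two memory references may
# alias only if both their conceptual types and their address spaces
# match (conservatively assume aliasing when the attributes agree).

def may_alias(ref_a, ref_b):
    return (ref_a['type'] == ref_b['type'] and
            ref_a['space'] == ref_b['space'])

store = {'type': 'DURATION', 'space': 'heap object via r12'}
load  = {'type': 'INTEGER',  'space': 'module MATHLIB via r11'}

# Different conceptual types: the load may be hoisted above the store.
print(may_alias(store, load))   # -> False
```

The check is only sound under Firth's proviso: a strongly typed source language with no arbitrary casts or puns, so that distinct conceptual types really cannot refer to the same storage.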
Final point: not many compilers do this. The reason is that this design largely abandons the traditional "front end / back end" separation. If the back end has to interrogate the APT, then it probably has to know which language the source program came from. Unless, of course, you first design a common APT across your set of languages - and then you throw away your automatic parser generator (actually, not a bad idea, given some of the ones I've seen!). I've rambled on enough. More later, maybe, since I did take this idea down to a detailed design and it may yet become a Modula compiler.
bcase@Apple.COM (Brian Case) (03/16/88)
In article <4553@aw.sei.cmu.edu> firth@bd.sei.cmu.edu.UUCP (Robert Firth) writes:
[about keeping information around until the last minute to help
reorganization and such.]
Although it doesn't reflect quite the same level of aggressiveness that
Robert talks about, the paper
David W. Wall. Global register allocation at link-time. _Proceedings
of the SIGPLAN '86 Symposium on Compiler Construction._ _SIGPLAN
Notices 21_ (7): 264-275 (July 1986).
talks about keeping information around until the very last. I think
Robert's comments about very good compilers needing to abandon the now-
popular front-end back-end separation (so that "all you have to do is
write a new front end for a new language") will hold true. It seems like
there is much to be gained.
grunwald@uiucdcsm.cs.uiuc.edu (03/17/88)
The paper by Wall keeps the front-end/back-end strategy; it just moves the point of actual code generation to a later phase of the binding.  The linkers all use the IR generated by the front end.  The last stage is a translation of the linked IR to machine code.  The last stage (IR -> Assembler) performs global register allocation.  The advantages are that you have better information about other subroutines at link time, and you can pass subroutine parameters in registers.

If you look at the improvements listed in tables 2, 4 and 5, you see that the run-time information contributes about as much improvement as register coloring, but the two combined rarely perform significantly better than one or the other.  This leads one to question the advantages of
	(1) Doing both
	(2) Bothering with recording the run-time data

Any comments?
suneel@hpclskj.HP.COM (Suneel Jain x75763) (03/17/88)
> From: firth@sei.cmu.edu (Robert Firth)
> Thanks to Randell Jesup, Bron Nelson, and others for prompting
> this. The question they discuss is, how might one do some
> really good code reorganising for a machine with pipeline
> constraints, eg a RISC machine.

The Instruction Scheduler for the HP Precision Architecture uses aliasing information to determine if two memory references could interfere with each other.  This allows the scheduler to rearrange loads and stores much better than if worst-case assumptions are used.  This has been especially useful in scheduling floating point operations.  The aliasing information is computed by the various front ends and passed down to the level where the scheduler is run.  This process is described in the following paper:

Deborah S. Coutant, "Retargetable High-Level Alias Analysis",
Proc. ACM Symposium on Principles of Programming Languages,
January 1986, pp. 110-118.

Suneel Jain
Hewlett-Packard
suneel@hpda.HP.COM
jesup@pawl16.pawl.rpi.edu (Randell E. Jesup) (03/20/88)
In article <3300024@uiucdcsm> grunwald@uiucdcsm.cs.uiuc.edu writes:
>The paper by Wall keeps the Front-end/Back-end strategy, it just moves
>the point of actual code generation to a later phase of the binding.
>The linkers all use the IR generated by the front end. The last stage
>is a translation of the linked IR to the machine code.

This is a good way to do things (if I understand you), as linking BEFORE final optimization allows true global optimization.  I'm not familiar with that paper; where was it published?  (I'm no longer at GE, so I don't see these things unless I look.)

>The last stage (IR -> Assembler) performs global register allocation. The
>advantages are that you have better information about other subroutines
>at link time, and you can pass subroutine parameters in registers.

You can pass stuff in registers without this, it's just less efficient.  The added information DOES allow some really nice optimizations, especially on something like the RPM-40, where branches, etc. are variable length.

>If you look at the improvements listed in tables 2, 4 and 5, you see that
>the run-time information contributes about as much improvement as
>register coloring, but the two combined rarely perform significantly better
>than one or the other.
>
>This leads one to question the advantages of
>	(1) Doing both
>	(2) Bothering with recording the run-time data

Without having seen the paper, it's hard to tell, but it sounds like Wall is just doing register allocation late, not other types of optimization.  Also, was he modeling a CISC or a RISC?  It can make a lot of difference (note that many papers for one are not totally relevant to the other, a point that is easy to overlook, since doing the statistical analysis, etc., is a major project).
// Randell Jesup Lunge Software Development // Dedicated Amiga Programmer 13 Frear Ave, Troy, NY 12180 \\// beowulf!lunge!jesup@steinmetz.UUCP (518) 272-2942 \/ (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup (-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)
jordan@aero.org (Larry M. Jordan) (04/09/91)
Where and when does instruction scheduling/reorganizing occur?  Is this function performed by the HOL compilers?  (If so, the symbolic assembly I'm reading obscures this fact.  If not, then during assembly?)

In a word or two, could someone summarize the pros/cons of either approach (HOL compiler vs. an independent, language-independent pass)?

--Larry
mike@vlsivie.tuwien.ac.at (Michael K. Gschwind) (04/10/91)
In article <1991Apr8.224717.14402@aero.org> jordan@aero.org (Larry M. Jordan) writes:
>Where and when does instruction scheduling/reorganizing occur? Is
>this function performed by the HOL compilers? (If so, the symbolic
>assembly I'm reading obscures this fact. If not, then during
>assembly?)
>
>In a word or two, could someone summarize the pros/cons for either
>approach (HOL or independent lang. independent pass)?
>
>--Larry
dclark@b11.ingr.com (Dave Clark) (04/10/91)
jordan@aero.org (Larry M. Jordan) writes:
>Where and when does instruction scheduling/reorganizing occur? Is
>this function performed by the HOL compilers? (If so, the symbolic
>assembly I'm reading obscures this fact. If not, then during
>assembly?)

Yes and no.  Most HOL compilers do some sort of optimizing on the back end.  Although I haven't (yet) seen an assembler that does optimizations, I have seen utilities that can optimize assembly code (part of an HOL compiler back end).  One of the compilers I'm using actually does instruction reordering within the linker!

>In a word or two, could someone summarize the pros/cons for either
>approach (HOL or independent lang. independent pass)?

HOL or high-level intermediate representation: control structures are more easily recognizable.  Restructuring is straightforward.  Machine-specific optimizations are often omitted from this phase so that the bulk of the compiler can be easily ported.

Assembly level: this is a good place to put machine-specific optimizations.  Instruction timing and reordering is a simple search problem, especially on the advanced RISC chips.

----------------------------------------------------------------------------
Dave Clark                    | System Development
Intergraph Corp., CR1102      | I among men bear the same wounded hand,
Huntsville, AL 35894-0001     | suffer the same reddened cup
UUCP: uunet!ingr!b11!dclark   | and live an identical rage.
Internet: dclark@b11.ingr.com |                      -- Pablo Neruda
----------------------------------------------------------------------------
dahl@matterhorn.ee.umn.edu (Peter Dahl) (04/11/91)
In article <1991Apr10.131341.26357@b11.ingr.com> dclark@b11.ingr.com (Dave Clark) writes:
>One of the compilers I'm using actually does instruction reordering
>within the linker!
 ^^^^^^^^^^^^^^^^^

Register allocation can also be done by the linker.  See "Register Windows vs. Register Allocation", David Wall, Proceedings of the SIGPLAN '88 Conference on Programming Language Design and Implementation, Atlanta, GA, June 22-24, 1988.  This paper describes why the author thinks that link-time code modification is effective.  He uses profiling of usage frequencies to make decisions.  However, he only claims results as good as compiler-based schemes and nearly as good as register windows.  The part that is nice about this scheme is that it could set you up for other global optimizations, depending on the types of information the compiler tucks away for you to work with.
--
peter		dahl@src.honeywell.com
I had fun once, and it wasn't anything like this
meissner@osf.org (Michael Meissner) (04/11/91)
In article <1991Apr10.131341.26357@b11.ingr.com> dclark@b11.ingr.com (Dave Clark) writes: | jordan@aero.org (Larry M. Jordan) writes: | | >Where and when does instruction scheduling/reorganizing occur? Is | >this function performed by the HOL compilers? (If so, the symbolic | >assembly I'm reading obscures this fact. If not, then during | >assembly?) | | Yes and no. Most HOL compilers do some sort of optimizing on the | back-end. Although I haven't (yet) seen an assembler that does | optimizations, I have seen utilities that can optimize assembly | code (part of an HOL compiler back-end). The MIPS assemblers do instruction scheduling. It is sometimes annoying to have to disassemble the object file to find exactly what sequences of instructions are used. | Assembly level: This is a good place to put machine-specific | optimizations. Instruction timing and reordering is a simple search | problem, especially on the advanced RISC chips. However, the assembly level can be too late if the latency is high enough. For example, if the compiler reuses register temps quickly enough, the assembler doesn't have a chance to do aggressive code movement, since the next instruction reuses the register that is still waiting to be reloaded. For example, if the code sequence is: a = b; c = d; a compiler on a load/store machine might rewrite this as: reg = b; a = reg; reg = d; c = reg; whereas if you have a latency, and you have plenty of registers, you really want to rewrite it as: reg1 = b; reg2 = d; a = reg1; c = reg2; -- Michael Meissner email: meissner@osf.org phone: 617-621-8861 Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142 Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?
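Meissner's point about latency is easy to quantify with a toy model. Below is a hedged C sketch (the instruction encoding, the 2-cycle load delay, and the stall rule are all invented for illustration, not a real ISA) that counts issue cycles on an in-order load/store machine; under those assumptions the interleaved schedule above finishes in 5 cycles versus 8 for the register-reusing one.

```c
/* Toy cycle counter for an in-order load/store machine with a load
 * delay.  Everything here (encoding, latency, stall rule) is an
 * illustrative assumption, not a real ISA. */
#include <assert.h>
#include <string.h>

struct insn { char op; int dest; int src; };   /* 'L' = load, 'S' = store */

/* Cycle on which the last instruction finishes issuing: one issue per
 * cycle, but a store stalls until the register it reads is loaded. */
int cycles(const struct insn *code, int n, int load_latency)
{
    int ready[32];              /* cycle when each register becomes valid */
    int clock = 0, i;
    memset(ready, 0, sizeof ready);
    for (i = 0; i < n; i++) {
        if (code[i].op == 'S' && ready[code[i].src] > clock)
            clock = ready[code[i].src];        /* stall for the load */
        clock++;                               /* issue the instruction */
        if (code[i].op == 'L')
            ready[code[i].dest] = clock + load_latency;
    }
    return clock;
}
```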
paulb@ssd.csd.harris.com (Paul Beusterien) (04/12/91)
In article <MEISSNER.91Apr11114114@curley.osf.org> meissner@osf.org (Michael Meissner) writes : } } However, the assembly level can be too late if the latency is high } enough. For example, if the compiler reuses register temps quickly } enough, the assembler doesn't have a chance to do aggressive code } movement, since the next instruction reuses the register that is } still waiting to be reloaded. For example, if the code sequence is: } } a = b; } c = d; } } a compiler on a load/store machine might rewrite this as: } } reg = b; } a = reg; } reg = d; } c = reg; } } whereas if you have a latency, and you have plenty of registers, you } really want to rewrite it as: } } reg1 = b; } reg2 = d; } a = reg1; } c = reg2; This example shows the value of a register reallocator that runs as a pre-pass to instruction scheduling. Detecting the above case is much simpler than generic register allocation, and it can be done with basically an O(n) traversal over the assembly code. -- Paul Beusterien paulb@ssd.csd.harris.com Harris Computer Systems {uunet,mit-eddie,novavax}!hcx1!paulb Ft. Lauderdale, FL voice: (305) 973 5270
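A minimal sketch, in C, of the kind of linear renaming pre-pass Beusterien describes; the three-address instruction format, the register numbering, and the pool of fresh names are all assumptions made up for illustration:

```c
/* Linear-time renaming pre-pass (sketch): each redefinition of a
 * register inside a basic block gets a fresh name, and later uses are
 * patched to match, so the scheduler sees no anti- or output
 * dependences.  The instruction format and limits are invented. */
#include <assert.h>

struct tinsn { int dest; int src; };   /* -1 means "no register" */

void rename_temps(struct tinsn *code, int n, int first_free)
{
    int map[64];                        /* current name of each register */
    int i, r;
    for (r = 0; r < 64; r++) map[r] = r;
    for (i = 0; i < n; i++) {           /* one O(n) pass, as claimed */
        if (code[i].src >= 0) code[i].src = map[code[i].src];
        if (code[i].dest >= 0) {
            map[code[i].dest] = first_free++;   /* fresh name */
            code[i].dest = map[code[i].dest];
        }
    }
}
```

Run over Meissner's reg = b; a = reg; reg = d; c = reg sequence, the two definitions of reg come out with distinct names, so the two loads can be moved up together.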
jesup@cbmvax.commodore.com (Randell Jesup) (04/15/91)
In article <1991Apr10.131341.26357@b11.ingr.com> dclark@b11.ingr.com (Dave Clark) writes: > Although I haven't (yet) seen an assembler that does >optimizations, I have seen utilities that can optimize assembly >code (part of an HOL compiler back-end). Actually, (somewhat) optimizing assemblers are by no means uncommon. For example, the 680x0 assembler I use converts my lsl.l #1,Dn instruction into add.l Dn,Dn for me (there are ones that do far more wide-ranging optimizations, used as back-ends for compilers). >One of the compilers I'm using actually does instruction reordering >within the linker! That's the right place to do it! If you have deep pipelines, being forced to make worst-case assumptions at routine entry can be quite limiting. By doing reorganization in (or after) the linker, you know all the preconditions a routine will have to deal with (unless someone takes its address, of course - then you go back to worst-case, unless you have a VERY smart and lucky linker). The RPM-40 backend team (which I was part of) made use of this trick, since we had long pipelines and load-delays (plus load/store/bra interlocks to worry about). Libraries were supposed to be in what was effectively source format. -- Randell Jesup, Keeper of AmigaDos, Commodore Engineering. {uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com BIX: rjesup Disclaimer: Nothing I say is anything other than my personal opinion. Thus spake the Master Ninjei: "To program a million-line operating system is easy, to change a man's temperament is more difficult." (From "The Zen of Programming") ;-)
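The lsl.l-to-add.l rewrite Jesup mentions is the classic peephole case. A hedged C sketch of such a pattern follows; the purely textual matching is far cruder than what a real assembler does and is only meant to show the shape of the transformation:

```c
/* Peephole sketch: rewrite "lsl.l #1,Dn" as the cheaper "add.l Dn,Dn"
 * (the 680x0 trick mentioned above).  Real assemblers match on a
 * decoded internal form, not on text; this is illustration only. */
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Rewrites `line` in place and returns 1 if the pattern fired. */
int peephole(char *line)
{
    int n;
    if (sscanf(line, "lsl.l #1,D%d", &n) == 1 && n >= 0 && n <= 7) {
        sprintf(line, "add.l D%d,D%d", n, n);
        return 1;
    }
    return 0;
}
```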
sweany@berlioz.cs.colostate.EDU (Phil Sweany) (04/16/91)
In article <MEISSNER.91Apr11114114@curley.osf.org> meissner@osf.org (Michael Meissner) writes : } However, the assembly level can be too late if the latency is high } enough. (and continues with an example where reusing a hard register limits instruction scheduling by introducing an anti-dependency on that register). In article <PAULB.91Apr11130041@rcx1.ssd.csd.harris.com> paulb@ssd.csd.harris.com (Paul Beusterien) suggests > This example shows the value of a register reallocator that happens > as a pre-pass to instruction scheduling. Another (better?) solution is available if graph coloring register assignment is used. Since the problem is caused by anti-dependencies on the hard registers, which limit scheduling's ability to overlap operations, we can circumvent it by delaying the assignment of symbolic registers until after instruction scheduling has taken place. Steve Beaty and I described such a scheme in: @inproceedings { SwBe90, topic = confp, author = "Sweany, P. and Beaty, S.", title = "Post-Compaction Register Assignment in a Retargetable Compiler", booktitle = "Proceedings of the 23rd Microprogramming Workshop (MICRO-23)", address = "Orlando, FL", month = "November", year = "1990" } --- Phil Sweany sweany@cs.colostate.edu
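The delayed-assignment scheme Sweany describes ends with a coloring pass over symbolic registers once the schedule is fixed. A toy greedy coloring in C (an illustrative heuristic only, not the algorithm from the cited paper; the interference matrix and size limit are assumptions):

```c
/* Greedy coloring sketch: after scheduling, symbolic registers that
 * are live simultaneously (interfere) must get different hard
 * registers.  This simple heuristic is illustrative only, not the
 * cited paper's algorithm. */
#include <assert.h>

#define NSYM 16

/* color[i] receives symbolic register i's hard register; returns the
 * number of hard registers used. */
int color_regs(int nsym, int interferes[NSYM][NSYM], int color[NSYM])
{
    int used = 0, i, j, c;
    for (i = 0; i < nsym; i++) {
        for (c = 0; ; c++) {            /* lowest color free of neighbors */
            for (j = 0; j < i; j++)
                if (interferes[i][j] && color[j] == c) break;
            if (j == i) break;          /* no earlier neighbor uses c */
        }
        color[i] = c;
        if (c + 1 > used) used = c + 1;
    }
    return used;
}
```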
meissner@osf.org (Michael Meissner) (04/16/91)
In article <20632@cbmvax.commodore.com> jesup@cbmvax.commodore.com (Randell Jesup) writes: | >One of the compilers I'm using actually does instruction reordering | >within the linker! | | That's the right place to do it! If you have deep pipelines, being | forced to make worst-case assumptions at routine entry can be quite limiting. | By doing reorganization in (or after) the linker, you know all the pre- | conditions a routine will have to deal with (unless someone takes it's | address, of course - then you go back to worst-case, unless you have a VERY | smart and lucky linker). Except of course if you are using object oriented techniques, and are calling through a function pointer. In which case, the linker has to fall back to the normal case of using fixed defaults. -- Michael Meissner email: meissner@osf.org phone: 617-621-8861 Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142 Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?
firth@sei.cmu.edu (Robert Firth) (04/17/91)
In article <MEISSNER.91Apr16121331@curley.osf.org> meissner@osf.org (Michael Meissner) writes: [linker optimisation of function calls] >Except of course if you are using object oriented techniques, and are >calling through a function pointer. In which case, the linker has to >fall back to the normal case of using fixed defaults. I don't understand the logic here. Object oriented techniques are ways of designing code, not of implementing it. I know of no evidence that OOD leads to code with greater dynamic functional variability than other techniques. Even if, for some reason, the prototyping tools use function pointers to increase development efficiency, the production code will probably use constant pointers (or write-once pointers, as the case may be). Propagating the constant value to the call sites is one of the simplest global optimisations a linker can do, and of course the door is then opened to all the other neat tricks.
cs450a03@uc780.umd.edu (04/17/91)
Curiosity question: Are there any architectures with setbranch/commit semantics? (where setbranch sets up a branch target, and allows for pipeline stuffing, and commit actually changes flow of control). Or are pipeline stages generally too expensive for this kind of duplication? Inquiring minds want to know... Raul Rockwell
preston@ariel.rice.edu (Preston Briggs) (04/17/91)
cs450a03@uc780.umd.edu writes: >Are there any architectures with setbranch/commit semantics? (where >setbranch sets up a branch target, and allows for pipeline stuffing, >and commit actually changes flow of control). The proposed Tera machine has 8 _target registers_ that can serve as the target of branches. I don't know what they'll actually do when the machine is built. In addition to preloading some of the pipeline in each direction, others have proposed prefetching and locking the appropriate I-cache and TLB lines. Preston Briggs
meissner@osf.org (Michael Meissner) (04/17/91)
In article <24130@as0c.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes: | In article <MEISSNER.91Apr16121331@curley.osf.org> meissner@osf.org (Michael Meissner) writes: | | [linker optimisation of function calls] | | >Except of course if you are using object oriented techniques, and are | >calling through a function pointer. In which case, the linker has to | >fall back to the normal case of using fixed defaults. | | I don't understand the logic here. Object oriented techniques are ways | of designing code, not of implementing it. I know of no evidence that | OOD leads to code with greater dynamic functional variability than other | techniques. I'm thinking of virtual functions in C++, sending generalized messages in Objective C, and callbacks in X. At least in perusing the object oriented literature, great play is made of the ability to subclass an object and replace its handler. | Even if, for some reason, the prototyping tools use function pointers | to increase development efficiency, the production code will probably | use constant pointers (or write-once pointers, as the case may be). | Propagating the constant value to the call sites is one of the simplest | global optimisations a linker can do, and of course the door is then | opened to all the other neat tricks. Not in a lot of code that I see. Calling shared libraries and dynamically loaded executables is another way of calling through pointers. Take off those Fortran-colored glasses :-) -- Michael Meissner email: meissner@osf.org phone: 617-621-8861 Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142 Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?
jbuck@janus.Berkeley.EDU (Joe Buck) (04/18/91)
In article <24130@as0c.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes: >In article <MEISSNER.91Apr16121331@curley.osf.org> meissner@osf.org (Michael Meissner) writes: > >[linker optimisation of function calls] > >>Except of course if you are using object oriented techniques, and are >>calling through a function pointer. In which case, the linker has to >>fall back to the normal case of using fixed defaults. > >I don't understand the logic here. Object oriented techniques are ways >of designing code, not of implementing it. I know of no evidence that >OOD leads to code with greater dynamic functional variability than other >techniques. While it is true that some virtual function calls (to use C++ terminology) can be eliminated, the use of function pointers is really an inevitable part of true object oriented design, and there's no real loss of efficiency in using them: the extra overhead of the function pointer is traded off against the extra overhead of a switch statement (we write far fewer switch statements and less control logic in OO programs than in the equivalent non-OO programs). For example, the C function: doSomethingTo (objPointer) struct object * objPointer; { switch (objPointer->type) { case TYPE_A: return doA(objPointer); case TYPE_B: return doB(objPointer); ... default: /* error */ } } is replaced in C++ by making "doSomethingTo" a virtual function, and the switch statement is eliminated. Existing code no longer needs to be modified if a new type of object is added. Since OO techniques also make incremental linking much easier, we really do want the ability to add new code without modifying even the machine code, so yes, we do want function pointers, and no, they are not going away, and anyone designing a new architecture that doesn't handle indirect function calls efficiently is really going to suffer in the next few years, since indirect function calls are going to become more common. -- Joe Buck jbuck@janus.berkeley.edu {uunet,ucbvax}!janus.berkeley.edu!jbuck
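Buck's replacement can be shown entirely in C: the dispatch a C++ virtual call roughly compiles down to is a function pointer stored with the object. A hedged sketch, with all names, fields, and behaviors invented for illustration:

```c
/* Function-pointer dispatch (roughly what a virtual call compiles
 * to): the per-type handler lives in the object, so call sites never
 * switch on a type tag.  All names and behaviors here are invented. */
#include <assert.h>

struct object;
typedef int (*method)(struct object *);

struct object {
    method doSomething;     /* set once when the object is created */
    int data;
};

static int doA(struct object *p) { return p->data + 1; }   /* TYPE_A's handler */
static int doB(struct object *p) { return p->data * 2; }   /* TYPE_B's handler */

/* No switch: adding a new type means supplying a new handler, with no
 * change to this call site or its machine code. */
int doSomethingTo(struct object *p)
{
    return p->doSomething(p);           /* one indirect call */
}
```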