aglew@crhc.uiuc.edu (Andy Glew) (10/12/90)
[Copied from comp.arch. -John]

>[gillies@m.cs.uiuc.edu]
>One of the problems in new CPU designs is that designers don't realize
>which architecture enhancements are pointless, because we don't and
>may never have the optimization technology to take advantage of them.

Ah, a potentially interesting and useful topic. Perhaps we can start a
discussion that will lead to a list of possible hardware architectural
enhancements that a compiler can/cannot take advantage of? Maybe the real
compiler guys (preston, David C.) will help us out?

By the way, it is useful in such discussions to distinguish between
run-of-the-mill optimizations that can pretty much be added to any existing
compiler by a fairly good CS grad, and optimizations that require so much
background that only the "experts" in the field can reasonably be expected
to do them. For example, many of the parallelizing FORTRAN loop
optimizations can reasonably be expected to be done only by people from Rice
or UIUC CSRD (note that I'm UIUC, but CRHC, not CSRD), so unless you are
willing to pay for these guys (or their assorted front companies) you aren't
likely to get them onto your machine.

Cost of compiler development can be significant. Sometimes a company might
put a hardware feature in even though they know a compiler approach would be
better, because they can't afford the compiler guys (or the compiler guys
already have exclusive contracts with a competitor).

Let me list a few things to start off the discussion. I hope and expect to
be shot down on a few of them. It's easy to list a few of the hardware
enhancements that we already know compilers can take advantage of.

Branch delay slots - small number
    One or two branch delay slots can be filled quite easily.
    DIFFICULTY: a reasonably good CS grad can do it.

Branch delay slots - large number
    Most groups report difficulty filling a large number of branch delay
    slots, although my group's compiler guy (UIUC IMPACT, Prof. Wen-mei Hwu)
    has been able to get good results for up to 8 delay slots, on slightly
    different hardware (forward semantics, with profiling feedback).
    DIFFICULTY: given the slightly different hardware, and exposure to the
    necessary ideas, a reasonably good CS grad can do it.

Register file - moderate sized (up to 32 registers)
    Most compilers can use these efficiently.
    DIFFICULTY: a reasonably good CS grad can do it. Avoiding IBM's patents
    is a slight problem.

Register file - large (around 128 registers, or more)
    Most compilers do not get enough benefit from these to justify the
    extra hardware, or the slowed down register access.

Multiple, hierarchical register sets
    Like Cray's fast S registers and slower T registers. It took a very
    long time for compilers to take advantage of these multiple register
    sets.
    DIFFICULTY: complex.

Separate floating point register file
    Most compilers can take advantage of this.
    DIFFICULTY: easy.

Heterogeneous register file
    Few compilers have been developed that can take advantage of a truly
    heterogeneous register file, one in which, for example, the divide unit
    writes to registers D1..D2, the add unit writes to registers A1..A16,
    and the shift unit writes to registers S1..S4 -- even though such
    hardware conceivably has a cycle time advantage over homogeneous
    registers, even on VLIW machines where data can easily be moved to
    generic registers when necessary.
    DIFFICULTY: hard.

Instruction cache
    Many groups have reported compilers that can take advantage of small
    instruction caches by rearranging code to minimize I-cache misses,
    e.g. HP, UIUC IMPACT.
    DIFFICULTY: moderately complex - a good CS grad can do it, but it
    requires tools and background that take a while to develop and are not
    widely available. One of the complicating factors is that this sort of
    optimization is most useful when it is done in cooperation with the
    linker/loader, which is something that your average compiler hacker
    doesn't have background in (although it isn't too hard).

Data cache - software managed consistency
    This reportedly has been done, but certainly isn't run-of-the-mill.
    DIFFICULTY: needs a skilled compiler expert.

Micro-scheduling parallelism (like CONVEX's ASAP)
    Apparently it isn't too hard to develop compiler support for this. The
    hardware is rather expensive, though.
    DIFFICULTY: moderately complex.

Vectorizing
    Done in lots of places, but high performance vectorizing compilers
    still seem to be a problem for a lot of companies.
    DIFFICULTY: complex. The easy things are easy, but milking the last few
    FLOPs out of your vector unit requires compiler experts.

Parallelizing - fine grain, small numbers of processors (like Alliant's
concurrent hardware)
    Done, e.g. by Alliant and Cray.
    DIFFICULTY: to be competitive with this sort of hardware, you need real
    compiler experts on your team. Harder than vectorizing.

Parallelizing - fine grain, large numbers of processors
    What Tera is trying to do.
    DIFFICULTY: very complex. Leading edge. Good luck, Tera.

Multiple functional units - heterogeneous - VLIW or superscalar
    DIFFICULTY: complex. Complexity of the packing algorithms grows rapidly
    as you try to go beyond basic block parallelism.

Multiple functional units - homogeneous - VLIW or superscalar
    DIFFICULTY: moderately complex. Easier than the heterogeneous case, and
    the packing algorithms are considerably easier.

Special hardware instructions - scalar
    Taking advantage of simple instructions like abs(), conditional
    exchange, etc.
    DIFFICULTY:
    (1) When treated not as a compiler problem, but as a problem of simply
        writing libraries to inline optimized machine code: EASY. Requires
        inlining support.
    (2) When treated as a compiler problem, e.g. the compiler should
        recognize IF a < 0 THEN a = -a and automatically convert it to
        a = ABS(a): MODERATELY COMPLEX.
    I expect Herman Rubin to comment on this.

--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]
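To make the last item in Glew's list concrete: recognizing IF a < 0 THEN
a = -a and turning it into a = ABS(a) is a small pattern match over the
intermediate code. Here is a minimal sketch in C over an invented
three-address IR; the opcode and field names are illustrative assumptions,
not taken from any of the compilers mentioned above.

/* Minimal sketch: recognize "if (a < 0) a = -a;" in a toy IR and rewrite
 * it as a single ABS instruction.  The IR below is invented for
 * illustration; it is not any particular compiler's representation. */

#include <stddef.h>

enum op { OP_BGEZ, OP_NEG, OP_ABS, OP_OTHER };

struct insn {
    enum op      op;
    int          dst, src;     /* virtual register numbers   */
    struct insn *target;       /* branch target, for OP_BGEZ */
    struct insn *next;
};

/* Rewrite   bgez a, L ; a = neg a ; L: ...   into   a = abs a. */
void fold_abs(struct insn *head)
{
    struct insn *i, *neg;

    for (i = head; i != NULL; i = i->next) {
        neg = i->next;
        if (i->op == OP_BGEZ && neg != NULL && neg->op == OP_NEG &&
            neg->dst == i->src && neg->src == i->src &&
            i->target == neg->next) {       /* branch skips only the negate */
            i->op   = OP_ABS;               /* turn the branch into the abs */
            i->dst  = i->src;
            i->next = neg->next;            /* ... and unlink the negate    */
        }
    }
}

The local rewrite itself is simple; the difficulty Glew points at is doing
this reliably after earlier phases have rearranged the code.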
preston@titan.rice.edu (Preston Briggs) (10/12/90)
[Copied from comp.arch. -John]

In article <AGLEW.90Oct11144920@dwarfs.crhc.uiuc.edu> aglew@crhc.uiuc.edu (Andy Glew) writes:
>Perhaps we can start
>a discussion that will lead to a list of possible hardware
>architectural enhancements that a compiler can/cannot take advantage of?

I'll comment on a couple of features from the list.

>Register file - large (around 128 registers, or more)
>    Most compilers do not get enough benefit from these to justify
>    the extra hardware, or the slowed down register access.

In the proceedings of Sigplan 90, there's a paper about how to chew lots of
registers:

    Improving Register Allocation for Subscripted Variables
    Callahan, Carr, Kennedy

I suggested the subtitle "How to use all of those FP registers" but nobody
was impressed. Also, there's a limit to how many registers you need, at
least for scientific Fortran. It depends on the speed of memory and cache,
the speed of the FPU, and the actual applications. The idea is that once
the FPU is running at full speed, more registers are wasted.

>Heterogeneous register file
>    Few compilers have been developed that can take advantage of a
>    truly heterogeneous register file, one in which, for example, the
>    divide unit writes to registers D1..D2, the add unit writes to
>    registers A1..A16, and the shift unit writes to registers S1..S4 --
>    even though such hardware conceivably has a cycle time advantage
>    over homogeneous registers, even on VLIW machines where data can
>    easily be moved to generic registers when necessary.
>    DIFFICULTY: hard.

At first glance, the problem seems susceptible to coloring. Perhaps I'm
missing something.

>Data cache - software managed consistency
>    This reportedly has been done, but certainly isn't run-of-the-mill.
>    DIFFICULTY: needs a skilled compiler expert.

At Rice (and other places), people are considering the perhaps related
problem of trying to manage cache usage for a single processor. I'm
personally turned on by the topic because of the big performance gains
possible and the possible impact on architecture. Questions like: Can we
get away with no D-cache? Perhaps we don't need cache for FP only? Can we
get away with only a direct mapped cache? What does a compiler do with set
associativity? How can we do prefetches to cache?

Porterfield did a thesis here that talks some about these questions.
Additionally, Callahan and Porterfield (both at Tera) have a paper in
Supercomputing 90 on (perhaps) similar topics.

>Multiple functional units - heterogeneous - VLIW or superscalar
>    DIFFICULTY: complex.
>Multiple functional units - homogeneous - VLIW or superscalar
>    DIFFICULTY: moderately complex. Easier than the heterogeneous case,
>    and the packing algorithms are considerably easier.

I had never thought to distinguish the two cases, and I'm not sure why the
scheduling algorithms should be much different.

>Special hardware instructions - scalar
>    Taking advantage of simple instructions like abs(), conditional
>    exchange, etc.
>    DIFFICULTY:
>    (1) When treated not as a compiler problem, but as a problem of simply
>        writing libraries to inline optimized machine code: EASY. Requires
>        inlining support.

For intrinsics, I follow the PL.8 example. That is, have intermediate
language instructions for ABS etc. so the optimizer can try to hoist them
or perhaps strength-reduce them (e.g. SIN). Then expand to a simple form
(perhaps with branches and so forth), and let the optimizer get at the guts
of each operation. Some, like ABS, might be available as basic instructions
and so need not be expanded to a lower level form.
This seems to require that the front end recognize certain calls as
intrinsics. Naturally, this works fine with Fortran, but compilers for
other languages could easily adopt the same approach; they probably have
for C.

This isn't wonderfully extensible, but people have worked on variations
that might be worth exploring. In particular, the Experimental Compiler
System (ECS) project at IBM hoped to achieve the same effect in a more
extensible fashion.

--
Preston Briggs                             looking for the great leap forward
preston@titan.rice.edu
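A minimal sketch of the two-level treatment Briggs describes, using an
invented textual IR and invented instruction names (this shows the flavor
of the PL.8-style approach, not its actual implementation): keep ABS as a
single intermediate-language operation while the optimizer hoists it, then
either emit it as one instruction or expand it into compare/branch/negate
form that later passes can also work on.

#include <stdio.h>

static int target_has_abs = 0;   /* would come from the machine description */
static int next_label = 0;

/* Lower "dst = ABS(src)".  If the target has an abs instruction, emit it
 * directly; otherwise expand it to compare/branch/negate so later passes
 * can get at the guts of the operation. */
static void lower_abs(const char *dst, const char *src)
{
    int l;

    if (target_has_abs) {
        printf("\tabs\t%s, %s\n", dst, src);   /* single instruction        */
        return;
    }
    l = next_label++;
    printf("\tmov\t%s, %s\n", dst, src);
    printf("\tbgez\t%s, L%d\n", src, l);       /* skip negate when src >= 0 */
    printf("\tneg\t%s, %s\n", dst, src);
    printf("L%d:\n", l);
}

int main(void)
{
    lower_abs("r1", "r2");       /* expanded form          */
    target_has_abs = 1;
    lower_abs("r3", "r4");       /* single abs instruction */
    return 0;
}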
aglew@crhc.uiuc.edu (Andy Glew) (10/12/90)
[Copied from comp.arch -John]

>>Perhaps we can start
>>a discussion that will lead to a list of possible hardware
>>architectural enhancements that a compiler can/cannot take advantage of?
>
>I'll comment on a couple of features from the list.

Thanks. Do you have any features of your own to add?

>>Register file - large (around 128 registers, or more)
>>    Most compilers do not get enough benefit from these to justify
>>    the extra hardware, or the slowed down register access.
>
>In the proceedings of Sigplan 90, there's a paper about how to chew
>lots of registers:
>
>    Improving Register Allocation for Subscripted Variables
>    Callahan, Carr, Kennedy
>
>I suggested the subtitle "How to use all of those FP registers"
>but nobody was impressed. Also, there's a limit to how many registers
>you need, at least for scientific Fortran. It depends on the speed
>of memory and cache, the speed of the FPU, and the actual applications.
>The idea is that once the FPU is running at full speed,
>more registers are wasted.

I'm pretty sure that being able to put elements of aggregates into
registers is the next big step - subscripted variables and structure
fields. However, if it's only just now being published, that probably means
it's leading edge, so a company should not design for this unless it has
guys in its compiler group actually working on it. Any comment on this way
of looking at it?

>>Heterogeneous register file
>>    Few compilers have been developed that can take advantage of a
>>    truly heterogeneous register file, one in which, for example, the
>>    divide unit writes to registers D1..D2, the add unit writes to
>>    registers A1..A16, and the shift unit writes to registers S1..S4 --
>>    even though such hardware conceivably has a cycle time advantage
>>    over homogeneous registers, even on VLIW machines where data can
>>    easily be moved to generic registers when necessary.
>>    DIFFICULTY: hard.
>
>At first glance, the problem seems susceptible to coloring.
>Perhaps I'm missing something.

I agree with you --- I really don't understand why heterogeneous register
files are supposed to be so hard to handle. But homogeneous register files
are one of the things compiler people have gone into rhapsodies about with
respect to RISC.

Here's one example: the Intel 80x86 is basically a heterogeneous register
file machine. Specific registers were tied to the outputs and inputs of
specific functional units in the original hardware. Compiler people hated
targeting this architecture, and there are very few compilers that can
produce machine code comparable to hand-coded assembly on it.

But heterogeneous register files are much easier to make fast. I think that
two things can be done to make them easier for the compilers to use:

(1) Provide more than one register at the output of any specific functional
    unit. This avoids the need to immediately move the result away.
    (If my spelling is bad it's because my keyboard just went flakey.)

(2) Make the register file heterogeneous for writes, but homogeneous for
    reads (it's easier to provide read multiporting than write
    multiporting).

>>Data cache - software managed consistency
>>    This reportedly has been done, but certainly isn't run-of-the-mill.
>>    DIFFICULTY: needs a skilled compiler expert.
>
>At Rice (and other places), people are considering the perhaps
>related problem of trying to manage cache usage for a single
>processor. I'm personally turned on by the topic because of
>the big performance gains possible and the possible impact on
>architecture. Questions like: Can we get away with no D-cache?
>Perhaps we don't need cache for FP only?
>Can we get away with only a direct mapped cache? What does a compiler
>do with set associativity? How can we do prefetches to cache?
>
>Porterfield did a thesis here that talks some about these questions.
>Additionally, Callahan and Porterfield (both at Tera) have a paper
>in Supercomputing 90 on (perhaps) similar topics.

Again - it's current research, so a company should not "bet the farm" on
this sort of thing unless it has people actively working in the field. A
company should not assume that it can just go out and buy this sort of
compiler technology, or that it can hire a generic CS grad to implement
this sort of optimization within a year.

>>Multiple functional units - heterogeneous - VLIW or superscalar
>>    DIFFICULTY: complex.
>>Multiple functional units - homogeneous - VLIW or superscalar
>>    DIFFICULTY: moderately complex. Easier than the heterogeneous case,
>>    and the packing algorithms are considerably easier.
>
>I had never thought to distinguish the two cases, and
>I'm not sure why the scheduling algorithms should be much different.

Neither am I. I just put this up because it seemed to be what Gillies was
referring to.

>>Special hardware instructions - scalar
>>    Taking advantage of simple instructions like abs(), conditional
>>    exchange, etc.
>>    DIFFICULTY:
>>    (1) When treated not as a compiler problem, but as a problem of simply
>>        writing libraries to inline optimized machine code: EASY. Requires
>>        inlining support.
>
>For intrinsics, I follow the PL.8 example. That is, have intermediate
>language instructions for ABS etc. so the optimizer can try to hoist them
>or perhaps strength-reduce them (e.g. SIN). Then expand to a simple form
>(perhaps with branches and so forth), and let the optimizer get at the
>guts of each operation. Some, like ABS, might be available as basic
>instructions and so need not be expanded to a lower level form. This seems
>to require that the front end recognize certain calls as intrinsics.
>Naturally, this works fine with Fortran, but compilers for other languages
>could easily adopt the same approach; they probably have for C.
>
>This isn't wonderfully extensible, but people have worked on
>variations that might be worth exploring. In particular,
>the Experimental Compiler System (ECS) project at IBM hoped to
>achieve the same effect in a more extensible fashion.

I'm a fan of GNU EMACS style inlining. I don't think that your intermediate
language should even attempt to represent all of the special hardware
instructions that are likely to be useful - and it is very frustrating when
the hardware has an operation that nobody thought of.

PS. comp.compilers has been talking about this recently.

--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]
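For the "library of inlined machine code" approach Glew prefers, here is a
minimal sketch in GNU C. It assumes a hypothetical target with an abs
instruction and uses GCC's extended-asm syntax; the instruction, the macro
name, and the function name are all illustrative assumptions.

/* Sketch only: abs() provided by an inlinable library routine rather than
 * by compiler idiom recognition.  The "abs" instruction and the target
 * macro are hypothetical; the extended-asm syntax is GCC's. */

static __inline__ int lib_iabs(int x)
{
#ifdef HYPOTHETICAL_TARGET_WITH_ABS
    int r;
    __asm__ ("abs %0, %1" : "=r" (r) : "r" (x));   /* one machine instruction */
    return r;
#else
    return x < 0 ? -x : x;                          /* portable fallback */
#endif
}

The attraction is exactly what Glew says: a new special instruction needs
only a new library routine, not a new intermediate-language operator. The
cost is that the optimizer sees an opaque asm rather than something it can
hoist or strength-reduce.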
spot@TR4.GP.CS.CMU.EDU (Scott Draves) (10/12/90)
[Copied from comp.arch -John]

|> >>Register file - large (around 128 registers, or more)
|> >>    Most compilers do not get enough benefit from these to justify
|> >>    the extra hardware, or the slowed down register access.
|> >
|> >In the proceedings of Sigplan 90, there's a paper about how to chew
|> >lots of registers:
|> >
|> >    Improving Register Allocation for Subscripted Variables
|> >    Callahan, Carr, Kennedy
|> >
|> >I suggested the subtitle "How to use all of those FP registers"
|> >but nobody was impressed. Also, there's a limit to how many registers
|> >you need, at least for scientific Fortran. It depends on the speed
|> >of memory and cache, the speed of the FPU, and the actual applications.
|> >The idea is that once the FPU is running at full speed,
|> >more registers are wasted.

Shouldn't loop unrolling burn lots of registers also? When unrolling, which
ceiling will you hit first, the number of registers or the size of the
I-cache?

Scott Draves
spot@cs.cmu.edu

[A friend at Multiflow told me that the large register file was crucial to
get serious unrolling speedups. There was some rather tricky work balancing
unrolling vs. avoiding register saves by using otherwise unused registers
in leaf routines. -John]
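A small C example of the pressure Draves asks about: unrolling by 4 with
separate accumulators multiplies the number of simultaneously live
temporaries, while the extra copies of the body also grow the code against
the I-cache. The function below is only illustrative; it is not from any of
the compilers or machines discussed in this thread.

/* Why unrolling burns registers: each copy of the body wants its own loads,
 * product, and (here) accumulator, so register demand grows roughly with
 * the unroll factor. */
double dot(const double *x, const double *y, int n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;   /* 4 accumulators */
    int i;

    for (i = 0; i + 3 < n; i += 4) {                 /* unrolled by 4  */
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (; i < n; i++)                               /* leftover iterations */
        s0 += x[i] * y[i];
    return (s0 + s1) + (s2 + s3);
}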
golds@fjcnet.GOV (Rich Goldschmidt) (10/13/90)
[Copied from comp.arch -John]

Maybe this is naive or too futuristic, but is anyone working towards
methods for automatically generating a compiler based on the architecture
design? It would seem that even before there is silicon, there is enough
info about a new CPU in the CAD system used for layout that the features of
special interest for generating a compiler are known.

To the extent that generating a compiler is rule based (those tasks a good
CS grad can do?), why hasn't it been automated? Or are there people working
on this now? Chip designers might even take compilers into consideration in
their designs :-).

--
Rich Goldschmidt: uunet!fjcp60!golds or golds@fjc.gov
ctl8588@rigel.tamu.edu (LAUGHLIN, CHET) (10/15/90)
In article <1990Oct12.230424.930@esegue.segue.boston.ma.us>, golds@fjcnet.GOV (Rich Goldschmidt) writes...
>Maybe this is naive or too futuristic, but is anyone working towards
>methods for automatically generating a compiler based on the architecture
>design? It would seem that even before there is silicon, there is

I would suggest "Computer Architecture: A Quantitative Approach" by John L.
Hennessy and David A. Patterson, Morgan Kaufmann Publishers, c1990. They
speak at several different points about how HLL compilers will affect
system performance. In addition, they mention some common pitfalls
designers tend not to think about till after the fact... like KISS for
instruction sets, because compiler writers deal with a simple principle of
"make the common case fast and the rest correct." (Of course, they are
biased by their backgrounds... but the book has many good points.)

Chet Laughlin
CTL8588@RIGEL.TAMU.EDU (128.194.4.4)
jourdan@minos.inria.fr (Martin Jourdan) (10/15/90)
In article <1990Oct12.230424.930@esegue.segue.boston.ma.us>, golds@fjcnet.GOV (Rich Goldschmidt) writes:
=> Maybe this is naive or too futuristic, but is anyone working towards
=> methods for automatically generating a compiler based on the architecture
=> design?

There already has been a HUGE amount of work, and work is still going on,
on ``automatic'' generation of *parts of* compilers from architectural
descriptions.

The most obvious example is the work on code generation (instruction
selection and register allocation), for which there now exist usable
systems, e.g. TWIG, BEG and Pagode (a non-exhaustive list). Pagode, for
instance (I know it quite well because it is being developed by colleagues
of mine at INRIA), takes as input a grammar describing the input IR trees
and a description of the machine which can be very easily deduced from the
``programmer's manual'' provided by the CPU manufacturer. The basic
entities in a machine description are storage bases (registers, memory,
etc.), storage classes (groups thereof), access modes, access classes and
instructions. The semantics of the instructions is given in terms of the IR
grammar. From this Pagode generates an instruction selector and a register
allocator.

Work is also going on in other areas of compiler construction. For
instance, Francois Bodin showed in his thesis (``Optimisation de microcode
pour une architecture horizontale et synchrone: etude et mise en oeuvre
d'un compilateur'' [microcode optimization for a horizontal, synchronous
architecture: study and implementation of a compiler], U. Rennes, France,
June 1989) that it is possible to use a finer description of the machine,
exhibiting timing and other ``internal'' information, to automatically
generate instruction schedulers. Davidson and Fraser are well known for
their work on the automatic construction of peephole optimizers from
machine descriptions.

Automatic generation of other compiler phases is probably also possible,
but you then have to take into account source-language issues in addition
to target-machine ones.

=> [...] Chip designers might even take compilers into
=> consideration in their designs :-).

This is obviously desirable, and I believe this was the main motivation for
starting this discussion thread.

Martin Jourdan <jourdan@minos.inria.fr>, INRIA, Rocquencourt, France.
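To give a feel (only a feel) for what such generators produce, here is a
hand-written C sketch of tree-pattern instruction selection over an
invented IR. It is a greedy maximal-munch matcher; TWIG, BEG and Pagode
produce something considerably more sophisticated (dynamic programming or
term rewriting over patterns derived from the machine description), but the
pattern/action structure is similar in spirit. All names, patterns and the
pseudo-assembly output are made up for illustration.

#include <stdio.h>

enum irop { IR_PLUS, IR_CONST, IR_REG, IR_MEM };

struct tree {
    enum irop    op;
    struct tree *kid[2];
    int          value;        /* constant value or register number */
};

/* Greedy maximal-munch selection: match the largest pattern at the root,
 * emit its instruction, recurse on the remaining subtrees. */
static void select_insn(struct tree *t)
{
    if (t->op == IR_PLUS && t->kid[1]->op == IR_CONST) {
        select_insn(t->kid[0]);
        printf("\taddi\t#%d\n", t->kid[1]->value);  /* pattern: PLUS(x, CONST) */
    } else if (t->op == IR_PLUS) {
        select_insn(t->kid[0]);
        select_insn(t->kid[1]);
        printf("\tadd\n");                          /* pattern: PLUS(x, y)     */
    } else if (t->op == IR_MEM) {
        select_insn(t->kid[0]);
        printf("\tload\n");                         /* pattern: MEM(addr)      */
    } else if (t->op == IR_CONST) {
        printf("\tli\t#%d\n", t->value);
    } else {
        printf("\tuse\tr%d\n", t->value);           /* already in a register   */
    }
}

int main(void)
{
    /* r1 + (mem[r2] + 4) */
    struct tree r1   = { IR_REG,   { 0, 0 },        1 };
    struct tree r2   = { IR_REG,   { 0, 0 },        2 };
    struct tree mem  = { IR_MEM,   { &r2, 0 },      0 };
    struct tree four = { IR_CONST, { 0, 0 },        4 };
    struct tree add1 = { IR_PLUS,  { &mem, &four }, 0 };
    struct tree root = { IR_PLUS,  { &r1, &add1 },  0 };

    select_insn(&root);
    return 0;
}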
anders@dit.lth.se (Anders Ardo) (10/16/90)
> In article <1990Oct12.230424.930@esegue.segue.boston.ma.us>,
> golds@fjcnet.GOV (Rich Goldschmidt) writes:
> => Maybe this is naive or too futuristic, but is anyone working towards
> => methods for automatically generating a compiler based on the architecture
> => design?
>
> There already has been a HUGE amount of work, [including]
> TWIG, BEG and Pagode (non-exhaustive list). ...

Could you please give some references to these systems?

> => [...] Chip designers might even take compilers into
> => consideration in their designs :-).

I fully agree. We have just embarked on a project which aims at integrating
a VLSI design environment with compiler generation tools. This will enable
us to rapidly generate and test complete systems (including applications),
either simulated or on real hardware, and thus to efficiently evaluate
different design tradeoffs between hardware and software.

--
Anders Ardo                            Tel: int+46 46 107522
Dept. of Computer Engineering          fax: int+46 46 104714
University of Lund, P.O. Box 118       Internet: anders@dit.lth.se
S-221 00 Lund, Sweden                  or anders%dit.lth.se@uunet.uu.net
hankd@dynamo.ecn.purdue.edu (Hank Dietz) (10/16/90)
In article <1990Oct12.024252.8361@esegue.segue.boston.ma.us> aglew@crhc.uiuc.edu (Andy Glew) writes:
>>[gillies@m.cs.uiuc.edu]
>>One of the problems in new CPU designs is that designers don't realize
>>which architecture enhancements are pointless, because we don't and
>>may never have the optimization technology to take advantage of them.
>
>Ah, a potentially interesting and useful topic. Perhaps we can start
>a discussion that will lead to a list of possible hardware
>architectural enhancements that a compiler can/cannot take advantage
>of? Maybe the real compiler guys (preston, David C.) will help us out?
...
>For example, many of the parallelizing FORTRAN loop optimizations can
>reasonably be expected to be done only by people from Rice or UIUC CSRD
>(note that I'm UIUC, but CRHC, not CSRD), so unless you are willing to pay
>for these guys (or their assorted front companies) you aren't likely to get
>them onto your machine.

While it is true that the group of researchers in automatic parallelization
is small, it certainly isn't limited to UIUC CSRD and Rice; there are also
substantial efforts at IBM, Intel, UC Irvine, CMU, etc. For example, Purdue
faculty in this field include Jose Fortes and me -- both of us have been
publishing in this field for more than five years (well over 50
publications) and we have implemented working parallelizers for subsets of
the C language targeted to several different architectures. The point is
that, although compiler experts are in demand, it simply isn't true that
there are only one or two places that know how to do things.

Further, at Purdue EE, I teach a graduate course on compiler code
generation, optimization, and parallelization. In the course, *EVERY*
student implements an optimizing, parallelizing compiler for a small
language and targets it to a simple parallel abstract machine -- usually a
VLIW. I'm not saying that one course makes them experts, but the students
from that course are virtually all compiler-literate to the point where at
least relatively mundane things like traditional dependence analysis and
vectorization are well within their grasp. Students complete that course at
a rate of about 15/year.

>Cost of compiler development can be significant. Sometimes a company might
>put a hardware feature in even though they know a compiler approach would
>be better, because they can't afford the compiler guys (or the compiler
>guys already have exclusive contracts with a competitor).

In my view, this is 99% false. Companies tend to put the money into
hardware because it is more concrete and they also are used to putting
money into hardware. For example, one of my former students works at
Motorola as a compiler person -- but he's one of a very few compared to
*MANY* architecture/hardware folk. In fact, he also has an architecture
background, and without it he probably wouldn't have been given the job.

Companies have to learn that creating a compiler is comparably difficult to
creating an architecture; the tendency is to give it less weight, resulting
in overworked compiler people and delays in completing the compilers. A
secondary issue is that designing one without deeply considering the other
just plain doesn't work, and there are few people who are experts in BOTH
compilers and architecture to act as the interface between the two groups.

In contrast, consider a company like Burton Smith's Tera. Burton knows what
he's doing -- he has tried very hard to make his company have a balance of
compiler, architecture/hardware, and OS people.
Did he have trouble getting these people? Perhaps a bit -- good OS folk are
particularly hard to find in these "well, let's just port unix" times --
but generally I'd say he had less trouble than most companies would have,
because it is clear that he values these people at least as much as he
values architecture/hardware types.

>Let me list a few things to start off the discussion. I hope and expect to
>be shot down on a few of them. It's easy to list a few of the hardware
>enhancements that we already know compilers can take advantage of.

Wrong viewpoint or, as a certain public figure used to say, "well, there
you go again." You're trying to give architecture/hardware people a list of
"you can feel safe doing this without consulting a compiler person" things
-- the trick is to involve compiler people throughout, rather than letting
the machine be built and then calling in the compiler folk (and giving them
H*ll because the compiler hasn't been able to achieve the machine's peak
performance and wasn't delivered on time).

For some years now, I've had a research group (about 2-3 faculty and 10-20
students) called CARP: Compiler-oriented Architecture Research at Purdue. A
one-paragraph version of our manifesto:

    "Research in compiler optimization/parallelization and hardware
    architecture is, and should be, tightly interwoven. CARP, the
    Compiler-oriented Architecture Research group at Purdue, centers on the
    innovative use of the interaction of compiler and architecture to
    increase system performance. In general, this is accomplished by
    blending STATIC (compile-time, assemble-time, or link-time) and DYNAMIC
    (runtime hardware, firmware, or operating system) analysis so that each
    computational atom is processed in the most efficient and reliable way.
    Statically, it is possible to understand/transform the entire program,
    yet only probabilistic knowledge is available (e.g., one can know
    branching probabilities, but not which way the branch goes this time).
    Dynamically, understanding/transformability is limited to a few
    instructions around the current program counter, but perfect knowledge
    within that range is common. Very few problems can be solved equally
    well using either kind of information -- the trick is simply to solve
    each problem in the right place."

>Branch delay slots - small number
>Branch delay slots - large number
>Register file - moderate sized (up to 32 registers)
>Register file - large (around 128 registers, or more)
>Separate floating point register file
>Heterogeneous register file
>Instruction cache
>Micro-scheduling parallelism (like CONVEX's ASAP)
>Vectorizing
>Multiple functional units - heterogeneous - VLIW or superscalar
>Multiple functional units - homogeneous - VLIW or superscalar

All old ideas with multiple viable approaches in the literature. This is
not to say they are done perfectly, but that's not the issue in making a
product.... Unfortunately, a few are not easy to automate in "generic" code
generators (e.g., the heterogeneous register file).

>Multiple, hierarchical register sets
>Data cache - software managed consistency
>Parallelizing - fine grain, small numbers of processors
>Parallelizing - fine grain, large numbers of processors
>Special hardware instructions - scalar

These are problems with no readily available "cookbook" solutions. That
doesn't necessarily mean they'd be hard for a compiler to deal with, just
that it will take a bit of head scratching....
Of course, I still contend that the above list is headed the wrong way --
we should be looking for new ideas synthesized by viewing both compiler and
architecture/hardware together. For example, the Barrier MIMD work (see the
papers in ICPP 90) could only have come from such a holistic view.

-hankd@ecn.purdue.edu
preston@titan.rice.edu (Preston Briggs) (10/17/90)
In article <1990Oct16.085249.13511@lth.se> anders@dit.lth.se (Anders Ardo) writes:
>> There already has been a HUGE amount of work, [including]
>> TWIG, BEG and Pagode (non-exhaustive list). ...
>Could you please give some references to these systems?

The basis for TWIG is described in

    Efficient Tree Pattern Matching: an Aid to Code Generation
    A. Aho and M. Ganapathi
    POPL 12, 1985

The TWIG reference manual (I've never seen this one) is

    Twig Reference Manual
    S. Tjiang
    AT&T Bell Labs, internal memorandum, January 1986

An interesting application is

    Anatomy of a Hardware Compiler
    K. Keutzer and W. Wolf
    Sigplan 88 Conference on Programming Language Design and Implementation
    in Sigplan Notices, July 1988

BEG is described in

    BEG - A Generator for Efficient Back Ends
    H. Emmelmann, F. Schröer, and R. Landwehr
    Sigplan 89 Conference on Programming Language Design and Implementation
    in Sigplan Notices, July 1989

--
Preston Briggs
preston@titan.rice.edu
anders@dit.lth.se (Anders Ardo) (10/17/90)
Has anyone put together a bibliography of papers on code generation issues
and methods like those mentioned here, e.g.:

>Branch Delay Slots ...
>Register file ...
>Multiple functional units ...
>[etc.]

and/or papers on the automation of these tasks in generated code
generators?

Anders
--
Anders Ardo                            Tel: int+46 46 107522
Dept. of Computer Engineering          fax: int+46 46 104714
University of Lund, P.O. Box 118       Internet: anders@dit.lth.se
S-221 00 Lund, Sweden                  or anders%dit.lth.se@uunet.uu.net
larus@primost.cs.wisc.edu (James Larus) (10/17/90)
Having implemented a few `state of the art' compiler algorithms (and a few
beyond the state of the art), I can say that most competent graduate
students could code them, but most companies wouldn't implement them. The
key problem is that most papers present enough details, but don't present
enough evidence that a technique is effective. A rational project manager
would not commit the resources to try out these algorithms. There are
exceptions, of course, but compiler papers frequently use Towers of Hanoi
or puzzle-sized benchmarks.

/Jim
hankd@ecn.purdue.edu (Hank Dietz) (10/18/90)
In article <1990Oct12.231125.1275@esegue.segue.boston.ma.us> aglew@crhc.uiuc.edu (Andy Glew) writes:
>>In the proceedings of Sigplan 90, there's a paper about how to chew
>>lots of registers:
>>
>>    Improving Register Allocation for Subscripted Variables
>>    Callahan, Carr, Kennedy
>>
>>I suggested the subtitle "How to use all of those FP registers" but nobody
>>was impressed. Also, there's a limit to how many registers you need, at
>>least for scientific Fortran. It depends on the speed of memory and cache,
>>the speed of the FPU, and the actual applications. The idea is that once
>>the FPU is running at full speed, more registers are wasted.
>
>I'm pretty sure that being able to put elements of aggregates into
>registers is the next big step - subscripted variables and structure fields.

It isn't hard to use lots of registers... the trick is to benefit from
doing so. Fetching from outside the processor is harmless provided that it
never slows the system down, and that point often can be reached with a
surprisingly small number of registers. Unrolling and better alias analysis
are two good approaches to making more registers useful -- there is also
the hardware fix called CRegs (see the paper in Supercomputing 1988), which
allows aliased values to be kept in registers (really CRegs).

...

>>>Heterogeneous register file
...
>>At first glance, the problem seems susceptible to coloring.
>>Perhaps I'm missing something.
>
>I agree with you --- I really don't understand why heterogeneous
>register files are supposed to be so hard to handle. But homogeneous
>register files are one of the things compiler people have gone into
>rhapsodies about with respect to RISC.
>
>Here's one example: the Intel 80x86 is basically a heterogeneous
>register file machine.

Get a copy of the Compiler Writer's Guide. I have one for the 286. It
presents a very reasonable scheme for using coloring with multiple classes
of registers (as in a 286). In fact, I think it is one of the most
understandable descriptions of the coloring technique in general.

-hankd@ecn.purdue.edu
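Since coloring with register classes has come up a couple of times, here is
a greedy C illustration of the basic constraint: each virtual register may
only receive a physical register from its own class, and must differ from
every interfering neighbor. This is a toy, not the scheme in the 286
Compiler Writer's Guide (which, like real allocators, uses proper graph
coloring with spilling heuristics); the classes and the interference graph
below are made up.

#include <stdio.h>

#define NVREG 6
#define NPREG 8

/* class_mask[v]: bitmask of physical registers virtual register v may use
 * (think of an "A-register" class and a "D-register" class).
 * interfere[a][b]: 1 if a and b are simultaneously live. */
static unsigned class_mask[NVREG] = { 0x0F, 0x0F, 0xF0, 0xF0, 0x0F, 0xFF };
static int interfere[NVREG][NVREG] = {
    {0,1,0,0,1,0}, {1,0,1,0,0,0}, {0,1,0,1,0,1},
    {0,0,1,0,0,0}, {1,0,0,0,0,1}, {0,0,1,0,1,0},
};
static int color[NVREG];

int main(void)
{
    int v, u, p;

    for (v = 0; v < NVREG; v++) {
        unsigned avail = class_mask[v];             /* start from v's class     */
        for (u = 0; u < v; u++)
            if (interfere[v][u] && color[u] >= 0)
                avail &= ~(1u << color[u]);         /* remove neighbors' colors */
        color[v] = -1;                              /* -1 means spill           */
        for (p = 0; p < NPREG; p++)
            if (avail & (1u << p)) { color[v] = p; break; }
        if (color[v] < 0)
            printf("v%d -> spill\n", v);
        else
            printf("v%d -> r%d\n", v, color[v]);
    }
    return 0;
}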
jourdan@minos.inria.fr (10/19/90)
In addition to the references on TWIG given by Preston Briggs, let me add:

    Alfred V. Aho, Mahadevan Ganapathi and Steven W. K. Tjiang
    Code Generation Using Tree Matching and Dynamic Programming
    ACM Trans. on Progr. Lang. and Systems (TOPLAS)
    Vol. 11, No. 4 (Oct. 1989), pp. 491-516

The reference on BEG given by Preston is, as far as I know (and I know them
quite well), the only published one.

References on Pagode:

    Annie Despland, Monique Mazaud and Raymond Rakotozafy
    Using Rewriting Techniques to Produce Code Generators and Proving them Correct
    to appear quite soon in Science of Computer Programming
    (report RR-1046, INRIA, Rocquencourt, June 1989)

    Annie Despland, Monique Mazaud and Raymond Rakotozafy
    PAGODE: A back end generator using attribute abstract syntaxes and term rewritings
    to appear in the proceedings of Compiler Compilers '90, to be held next
    week in Schwerin, Germany

    Annie Despland, Monique Mazaud and Raymond Rakotozafy
    Code generator generation based on template-driven target term rewriting
    Proceedings of Rewriting Techniques and Applications, Bordeaux, France,
    May 1987, LNCS 256, pp. 105-120

    Annie Despland, Monique Mazaud and Raymond Rakotozafy
    An implementation of retargetable code generators in Prolog
    Proceedings of the International Workshop on Programming Language
    Implementation and Logic Programming (PLILP), Orleans, France, May 1988,
    LNCS 348, pp. 81-104

Martin Jourdan <jourdan@minos.inria.fr>, INRIA, Rocquencourt, France.
worley@compass.com (Dale Worley) (10/24/90)
    Ah, a potentially interesting and useful topic. Perhaps we can start
    a discussion that will lead to a list of possible hardware
    architectural enhancements that a compiler can/cannot take advantage
    of?

Part of the trouble with this idea is that "what hardware can be taken
advantage of" changes over time. For instance, I believe that the Cray-1
was out for a number of years before people had developed good vectorizing
compilers -- since vector hardware hadn't existed before, there was no
incentive to figure out how to build a compiler that took advantage of it!

This leads to a chicken-and-egg problem -- the compiler doesn't exist
because no hardware needs it, and vice versa. The correct solution was
mentioned by somebody -- develop both in parallel and tune the combination
of the two for best price and performance. Of course, it means you have to
fund both a hardware effort and a software effort!

Dale Worley
Compass, Inc.
worley@compass.com