preston@ariel.rice.edu (Preston Briggs) (05/03/91)
carter@cs.wisc.edu (Gregory Carter) writes:
[wondering about inlining the world]

Don't forget to be careful of recursion!

and the moderator notes

>[GCC and some other C compilers permit you to declare a procedure "inline"
>so that it is expanded wherever it is called. That gives pretty much the
>same effect, faster but larger code. -John]

Locally, we're pretty down on naive inlining. I think there are some cases
where it pays, but they are much rarer than imagined. So, here's a list of
reasons not to inline:

Code growth
Replicating procedure bodies in many places can have bad effects on paging,
the TLB, and the I-cache (however, it may sometimes improve locality). It
also hurts compile time, since optimizers are pretty sensitive to code size.

Register pressure
Call sites are natural places to spill lots of registers. Register
allocators are rarely able to achieve the same efficiency.

Massive recompilation
When you edit an inlined routine, you've got to recompile everyone that
calls the routine. On big programs (say an OS, or perhaps a reservoir
simulator), this is a Bad Thing.

People sometimes argue that the inlining is only done at the last second,
after code has been debugged. I don't think this is a good argument. Code
is always being debugged, especially the big things, like OS's. Further,
big programs take a lot of performance tuning, usually done with full
optimization. Surely everyone knows not to do initial development with the
optimizer turned on?

Inlining isn't as good as interprocedural analysis
Interprocedural analysis can track across the entire program, including
loops in the call graph (recursion). Inlining only gets effects on
procedures actually inlined. Imagine a huge program, with many constant
parameters set by the main routine. Interprocedural constant propagation
will get these constants all through the code; inlining won't get anything
unless it's inlined into the main routine.
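[A minimal C sketch of that last point, with invented names. A constant
flag enters at one call site; interprocedural constant propagation can
prove debug == 0 in every callee, while inlining helps only if the whole
chain is inlined up into the caller. -ed]

```c
#include <assert.h>

/* A deep call chain: the caller passes a constant "debug" flag down.
   Interprocedural constant propagation can prove debug == 0 in every
   callee and delete the dead branch; inlining gets the same effect only
   if the entire chain is inlined into run().  Hypothetical example. */

static int leaf(int debug, int x) {
    if (debug)               /* provably dead once debug == 0 propagates */
        x += 1000;
    return x * 2;
}

static int middle(int debug, int x) {
    return leaf(debug, x) + 1;
}

static int top(int debug, int x) {
    return middle(debug, x) + 1;
}

int run(int x) {
    return top(0, x);        /* the constant enters here, as in main() */
}
```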
Inlining is ineffective
We've tried lots of experiments, with a variety of code on a variety of
machines (scalar, vector, parallel, RISC, and CISC). Sometimes we manage a
20% improvement. Sometimes we degrade by up to 20%. Usually, there's only a
fractional change one way or another. It's very difficult to predict when
it will pay off.

We believe that many people know these things, often from experimentation.
However, nobody likes to publish negative results. Hence, the knowledge has
been Suppressed.

------------------

Obviously, inlining is not completely Evil. However, I think it has been
over-sold and over-used. This is a little attempt to even the balance.

Preston Briggs
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
ressler@cs.cornell.edu (Gene Ressler) (05/03/91)
These are very good points. My own experiments: I recently did an
event-driven simulator in C. Compiling the inner loop module with
gcc -finline-functions -O got about a 12% improvement over plain -O on
both a SPARCstation and a MIPS, and about the same with c89 -Q -O vs.
c89 -O on an RS/6000 320.

In Lucid Common Lisp code for image processing, however, I've gotten 50%
on a SPARC. But you'd expect that, given the heavy weight of Common Lisp's
function calls.

Gene Ressler
daniel@quilty.Stanford.EDU (Daniel Weise) (05/03/91)
>Locally, we're pretty down on naive inlining. I think there are some
>cases where it pays, but they are much rarer than imagined. So, here's a
>list of reasons not to inline:

What is naive inlining? Inlining without measuring the gain? Doing any
program transform on the thought that it might help is known to be a bad
strategy. Why pick on inlining?

On a more general note, I would guess that part of the reason you are down
on inlining is that your local coding style has the programmer providing
the inlining by hand, instead of writing in a more procedural and abstract
form. That is why you see plus or minus 20% performance changes. In both
CLU-style and Lisp-style programming, inlining of small functions yields
large performance gains.

Your other arguments are not so clear cut.

>CODE GROWTH: Replicating procedure bodies in many places. Can have bad
>effects on paging, TLB, I-Cache. (however, may improve locality
>sometimes.) Hurts compile-time, since optimizers are pretty sensitive to
>code size.

For small procedures, code growth is minimal. Inlining decreases data
shuffling and other overhead. As I-caches get bigger, the slightly
increased code size becomes less and less important.

>REGISTER PRESSURE: Call sites are natural places to spill lots of
>registers. Register allocators are rarely able to achieve the same
>efficiency.

This merely means we have to make better register allocators.

>MASSIVE RECOMPILATION: When you edit an inlined routine, you've got to
>recompile everyone that calls the routine. On big programs (say an OS or
>perhaps a reservoir simulator), this is a Bad Thing.

But you have to recompile a program of any size when a structure
declaration changes. Why persecute functions and not structures? You have
to draw the line somewhere, but you should be aware that you are drawing a
line in a place that someone else may not agree with.

>INLINING ISN'T AS GOOD AS INTERPROCEDURAL ANALYSIS: ...

Inlining + interprocedural analysis > just interprocedural analysis.
Also, if one isn't doing interprocedural analysis, inlining is the next
best thing.

>Inlining is ineffective

As I state above, this really is a function of how programs are written.
For your local writing style and compiler this may be true.

>Obviously, inlining is not completely Evil. However, I think it has been
>over-sold and over-used. This is a little attempt to even the balance.

Your argument boils down to: inlining doesn't help our code, and makes our
tools less useful. For other people's code and tools, inlining can be
vital. For example, CLU code would run much slower if the compiler didn't
inline small functions. I know production C++ programmers who turn on
inlining in the last stages of development (I am thinking in particular of
developers at a computer company working on a large experimental OS). I
don't remember the speedups they get, but they are substantial enough to
warrant using inlining even though it hurts debugging.

I suspect the reason there is disagreement on this issue is that neither
of us is quantifying what is being inlined. I would buy your arguments for
larger procedures, and you would probably buy mine for smaller procedures.

Daniel Weise
boehm@parc.xerox.com (Hans Boehm) (05/04/91)
It seems to me that all such evaluations are incredibly dependent on the
source language and on the programming style. I would expect to see no
improvement for Fortran code that was written by people who believed that
procedure calls are very expensive and should be avoided like the plague.
They presumably already did the profitable inlining themselves (and
perhaps a little more). I wouldn't expect a lot of improvement in an
average Pascal program either.

In my experience, the place where inlining pays off a lot is for languages
and coding styles that encourage operations on abstract data types to be
expressed as separate procedures. In a well-written Modula-3 or Oberon
program, I would expect to see a reasonable number of procedures that can
be inlined to 2 or 3 instructions, with perhaps no additional register
usage.

A similar effect also occurs in the presence of generic functions, though
that requires more work. If I invoke qsort with integer comparison, I
would expect to gain a fair amount by cloning qsort and inlining the
comparison (which may effectively have a zero instruction count, since I
save the test on the function result).

My experience with inlining in Russell is very different from Preston's.
But then both of the above issues apply (though I don't create more than
one clone of a procedure). Furthermore, all of the "primitives" like
integer addition are conceptually procedures (that may be passed as
parameters, or assigned, or reused by user-defined types). Russell
programs compiled without inlining are intolerably slow. (I don't recall
exact numbers. A factor of 3-5 slower is a guess.) This is largely due to
what would be primitives in other languages. But since it's common to
slightly modify an existing type like "Float" to derive new types, the
notion of a "primitive" is unclear, and a more general mechanism is
clearly needed.
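[The qsort point can be sketched in C. The generic form pays an indirect
call per comparison; the "cloned" form has the comparison inlined into the
sort body, shown here as a simple insertion sort for brevity. Invented
code, not from the post. -ed]

```c
#include <stdlib.h>

/* Generic form: every comparison is an indirect call through qsort. */
static int int_cmp(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

void sort_generic(int *v, size_t n) {
    qsort(v, n, sizeof *v, int_cmp);
}

/* "Cloned" form: the integer comparison is inlined into the sort body,
   so the call, the indirect jump, and the test on the function result
   all disappear.  (Insertion sort used only to keep the sketch short.) */
void sort_cloned(int *v, size_t n) {
    for (size_t i = 1; i < n; i++) {
        int key = v[i];
        size_t j = i;
        while (j > 0 && v[j - 1] > key) {   /* comparison inlined */
            v[j] = v[j - 1];
            j--;
        }
        v[j] = key;
    }
}
```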
I also like the idea of supplying most language "primitives" in predefined
modules, and then letting inlining remove the performance distinction.

None of this is astonishing or inconsistent with what Preston said. But
it's important to only apply these results to their proper context.

Hans
(boehm@xerox.com)
gateley@rice.edu (John Gateley) (05/04/91)
In article <1991May2.180508.17100@rice.edu> preston@ariel.rice.edu
(Preston Briggs) writes:

>carter@cs.wisc.edu (Gregory Carter) writes:
>[wondering about inlining the world]
>
>Don't forget to be careful of recursion!

Homework assignment: figure out how to inline with recursion. (Yes, I did
it. Turned out to be pretty neat :^)

>and the moderator notes
>>[GCC and some other C compilers permit you to declare a procedure "inline"
>>so that it is expanded wherever it is called. That gives pretty much the
>>same effect, faster but larger code. -John]
>
>Locally, we're pretty down on naive inlining. I think there are some
>cases where it pays, but they are much rarer than imagined. So, here's a
>list of reasons not to inline:

I have found several places where inlining is very useful, but they all
seem to be in the Lisp world. In the compiler (for CL/Scheme) I worked on,
we used inlining quite heavily to optimize the language runtimes. However,
we were careful about it: we would only inline a procedure if it didn't
cause code explosion, or if there was some reason the extra code would get
folded away. For example, if we saw an "aref" call (CL's general-purpose
array reference) and we could determine that the argument was a simple
vector, we would fold it away, producing a simple vector ref. We didn't
just blindly inline the whole mess (which was 50-100 lines of code) and
hope that it would go away.

John
gateley@rice.edu
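[A C caricature of that aref trick, with invented names. The general
routine dispatches on a runtime tag; once the compiler proves the tag, it
can substitute the specialized reference instead of inlining the whole
dispatch. -ed]

```c
/* Hypothetical sketch of type-directed inlining: a general array
   reference that dispatches on a runtime tag, and the specialized form
   a compiler could emit once it proves the tag is SIMPLE_VECTOR. */

enum tag { SIMPLE_VECTOR, OTHER_ARRAY };

struct array {
    enum tag tag;
    int *data;
};

/* General aref: a stand-in for the 50-100 line runtime routine. */
int aref_general(struct array *a, int i) {
    if (a->tag == SIMPLE_VECTOR)
        return a->data[i];          /* fast path */
    /* ... displaced arrays, fill pointers, bounds checks ... */
    return a->data[i];
}

/* Specialized form: what gets inlined when the tag is known. */
static inline int svref(struct array *a, int i) {
    return a->data[i];
}
```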
mac@eleazar.dartmouth.edu (Alex Colvin) (05/04/91)
As others have pointed out, an environment (like C++) that supports
inlining encourages the use of procedures that the programmer would
manually inline in the absence of such support. Trivial inlined procedures
give you access control to various module data, e.g. read- or
increment-only access.

Another thing inlining gives you is better flow analysis, with a way to
fold constants into procedures. This makes optimizations such as

	{ if (false) S1 else S2 }  ->  S2

more useful. In this case, you expect the body to expand to different code
in each instance.
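[Both points together in a small C sketch, with invented names: trivial
inlined accessors guard module-private data, and once inlined, a guard on
a compile-time constant folds exactly as { if (false) S1 else S2 } -> S2.
-ed]

```c
/* Module-private data reachable only through trivial accessors: the
   rest of the program gets read- or increment-only access.  After
   inlining, the constant guard below folds away entirely. */

#define BOUNDS_CHECKS 0      /* compile-time constant */

static int counter;          /* module-private */

static inline int counter_read(void) { return counter; }

static inline void counter_incr(void) {
    if (BOUNDS_CHECKS) {
        /* S1: expensive checked path, deleted by constant folding */
    } else {
        counter++;           /* S2: what actually remains */
    }
}
```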
pardo@june.cs.washington.edu (David Keppel) (05/04/91)
>Preston Briggs writes:
>>[Some reasons why you might not want to inline]
>
> REGISTER PRESSURE:
> Call sites are natural places to spill lots of registers.
> Register allocators are rarely able to achieve the same efficiency.

daniel@quilty.Stanford.EDU (Daniel Weise) writes:
>[So we have to make better register allocators.]

Stronger than that: if the call site is the right place to save/restore
registers, then save and restore registers at the call site, around the
inlined function. Then delete any redundant (useless) saves and restores.
On the basis of register pressure, you couldn't possibly do worse than a
function call. You might still suffer all the other side effects (code
growth, etc.).

;-D on ( It's all caching ) Pardo
chris@crackers.clearpoint.com (Chris Clark) (05/04/91)
Although I generally agree with Hans' comments, I do want to make a minor
correction to his first statement. I no longer have the precise data on
it; however, the assumption that old-style FORTRAN programs do not profit
from inlining is incorrect. I was part of a team that did inlining as part
of optimization enhancements for FORTRAN, PL/I, and related compilers. Our
results suggested that inlining paid off more often than expected (but
possibly not by as high a percentage), even when inlining large
procedures. I also believe that you'll find that the MIPS people do fairly
extensive inlining as part of their optimizations, and they are targeting
"traditional" languages. (They had (have?) something called pmerge or
umerge which does it. They may even wait until link time to do it now, to
"solve" the separate compilation problem.)

1 PARAGRAPH DIGRESSION: The basic problem is that I think we are not as
smart about when and what to optimize by hand as we think. Also, many
applications are written primarily for correctness, portability, or
maintainability in any case. Though, I will admit a lot of FORTRAN is well
tuned, since computers used to be so slow!

Truly large procedures do not generally cause the problem of code
expansion, because they are quite often only called from one site. One
hard part of the trade-off is the dynamic/static balance. It's almost
universally a win to inline a procedure if it has only one call site. To
avoid the locality-of-reference problem that may occur if the procedure is
conditionally called, put the inlined procedure at the top or bottom of
the code and jump to and from it. (One of the jumps can be the condition
which triggers the call.) You may have to tweak other parts of your
optimizer to keep it from re-linearizing the flow of control and
rearranging the new code back into the middle.
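[That layout trick, expressed at the source level with gotos (invented
example): the single-call-site procedure is inlined, but its body is
parked after the hot path so it doesn't pollute locality, and one of the
jumps doubles as the conditional call. -ed]

```c
/* Sketch: rare_handler() has been inlined into its only call site, but
   its body is placed out of line, after the hot path.  The conditional
   branch that used to trigger the call is now the jump to the body. */
int process(int x) {
    int result;
    if (x < 0) goto handle_negative;   /* the "call" */
resume:
    result = x * 2;                    /* hot path, falls straight through */
    return result;

handle_negative:                       /* inlined body, out of line */
    x = -x;
    goto resume;                       /* the "return" */
}
```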
Statistically, having the code in a different part of the same module
should not have a higher probability of a cache conflict than having the
code in a separate module. However, in specific instances that will not be
the case. (And in benchmarking, specific instances are all that count.
Unfortunately, I think that's true in general--every execution is always a
specific instance.) Thus, user control is important, for those cases when
the user is actually smart about exactly what they want done--i.e. they
ran the profiler and analyzed the results.

The hard analysis comes with the "wide part of the onion", the middle
layer of abstraction. Here the functions tend to be called from several
sites and to be of moderate length. This is often the meat of the
application: it performs those parts which are specific to the task at
hand, as opposed to the general driver code at the top and the standard
canned primitives, like adding to a list, at the bottom. (I think I just
recently read that this is where 80% of the coding errors occur also.)
Anyway, here it is usually only profitable to inline if the call occurs
within the context of a loop, and even then with some trepidation.

CAVEATS: Although inlining was profitable, I think we were shooting for an
average 10-15% gain in total execution time, including all of the
optimizations we did at that rev. I believe strength reduction and
improved register analysis improved performance more than inlining did in
many applications. I also don't remember whether we did the recursion
unrolling inline substitution or not. I do know we did special code to
help us with up-level variable references. The code was also controlled by
command-line parameters, to give users some knobs to tweak and hopefully
prevent explosive growth, which would have been fatal since the
architecture was segmented and there was a fixed upper bound on the size
of any one procedure.
It is possible that our results were colored by the truly expensive
procedure-call overhead or the size of the I-cache on the target machines.
A typical CPU was 3 to 5 boards, with about half a board dedicated to
cache, if I recall. The results were also colored by the fact that the
compiler did no interprocedural analysis at that rev.

The nature of the I-cache may have also affected the results. I believe
the probability of a cache miss from two addresses within a segment
conflicting was lower (possibly zero on the high-end machine) than from
conflicts between separate segments, and thus between separate procedures.
(Multiple procedures could exist in a segment, but the linker always
packed segments from the same end.)

All of this is only my recollection. The true data, which was gathered on
"real live applications", has probably been lost for years and was
probably marked as proprietary anyway. I also didn't do the
implementation. I think Rich Ford (now at Encore) or Dan Laska (Compass?)
did, but it also may have been Karen Berman-Mulligan (Private), Ruth
Halpern (LTX), or Suresh Rao (Intel).

I hope this helps,
- Chris (Clark)
carter@cs.wisc.edu (Gregory Carter) (05/05/91)
I suppose I should have outlined the process of combining procedural and
inlined code replacement, so here it goes.

What I initially wanted was for the source code to remain UNCHANGED. All
transformations would be done without actually modifying the source, only
at the code generation stage. Therefore the original source text would
remain the same: no extended compile times, no debugging problems. Only at
the code generation stages would the inline effects be seen.

On another note, some of you have mentioned that recursion would die this
way. Since I have never been a big fan of recursion, and avoid it where
possible, and very seldom do we think recursively (BE HONEST NOW), I
thought this option would be common enough to actually be feasible. (Yes,
recursion has its place; yes, it's useful; but it's a specialty item!)
However, I would think there would be some money out there for you hot
shots to design a paradigm which converts between recursive and iterative
code styles..AUTOMATICALLY. :) (Bring on the JOLT COLA) :) (Maybe some
weekend when I have time, or when I find a competent CS536 prof) :)

I was amazed to find that most of you skipped over the implications of
inlining code blocks....JMPs (GOTOs)!!! (Isn't that a NO NO? Or is it?) I
personally love GOTOs, which is part of the reason I want to trash the
stack model (on the machine level; on the abstract level it's very
useful). (Please note above, I don't condone gotos. Although, they sure
are QUICK.) By the time you fix your stack frame, you're already executing
your work code, inline that is.

Now for a radical idea: hardware with global code support, instead of
wasting that extra silicon real estate on a stack frame register? Any
comments?

Everyone secretly knows anyway that gotos are better, nicer, and a more
efficient programming construct for designing loops of all kinds. (Of
course, I will contend that too much of a good thing is bad..or is it?)
:)

Gregory (Have fun with that last one, and remember BE HONEST).

[To each his own, I suppose. I find that the most natural way to express a
nontrivial routine is more often than not recursive. Converting recursive
programs to non-recursive ones mechanically is not hard. It must be, since
no computer I know of supports recursion in hardware -- they all implement
recursion with arrays of activation records. Also, the arguments about
gotoless programming usually refer to programs written by humans, not
programs written by machine, though on a deeply pipelined RISC, any sort
of branch tends to be expensive. -John]
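[The mechanical conversion the moderator describes, sketched in C with an
invented example: a non-tail-recursive function rewritten with an explicit
array of "activation records". A real compiler does this on the machine
stack; this just makes the records visible. -ed]

```c
/* Naively recursive Fibonacci. */
int fib_rec(int n) {
    return n < 2 ? n : fib_rec(n - 1) + fib_rec(n - 2);
}

/* Same function, recursion replaced by an explicit array of pending
   "activation records" (here each record is just the argument n). */
int fib_iter(int n) {
    int frames[64];                /* the array of activation records */
    int top = 0, acc = 0;
    frames[top++] = n;             /* the initial call */
    while (top > 0) {
        int m = frames[--top];     /* pop a pending call */
        if (m < 2)
            acc += m;              /* base case: accumulate result */
        else {                     /* push the two recursive calls */
            frames[top++] = m - 1;
            frames[top++] = m - 2;
        }
    }
    return acc;
}
```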
pardo@june.cs.washington.edu (David Keppel) (05/06/91)
Ooops. I wrote something very unclear:

>Preston Briggs writes:
>>[Register allocators aren't as smart as bind save/restore at
>> the call site.]

pardo@june.cs.washington.edu (David Keppel) writes:
>[Technique...] On the basis of register pressure, you couldn't possibly
>do worse than a function call. You might still suffer all the other side
>effects (code growth, etc.).

Gee, it sounds like I'm saying that a function call is the worst of all
possible worlds, when I was talking about avoiding the function call. What
I *meant* to say (I really did re-read it, but it made sense at the time)
was that if you:

 * Do register saves and restores in the caller as if you were doing a
   function call
 * Do register allocation in the callee as if it were a real function
 * Inline the function

then you can:

 * Use the same register allocation strategy and get no worse register
   allocation
 * Remove redundant register saves and restores
 * Remove redundant branches

Hopefully it makes sense this time.

;-D on ( Confusion reigns supreme ) Pardo
ea08+@andrew.cmu.edu (Eric A. Anderson) (05/06/91)
carter@cs.wisc.edu (Gregory Carter) writes:
> Everyone secretly knows anyway that goto's are better, nicer and more
> efficient programming context for designing loops of all kinds anyway.
> (Of course, I will contend that too much of a good thing is bad..or is
> it?)

The last time I used gotos was when a program needed to get turned in in
about 30 seconds, and I had forgotten one of my loop-exiting conditions. I
wrote an if statement which just jumped me out of the loop without having
to add another boolean termination variable. That is what I see as the
primary value of gotos: eliminating the

	if (multi-level exit test) then terminate := true;

needed to get you out of two or three levels of loops.

> [To each his own, I suppose. I find that the most natural way to express
> a nontrivial routine is more often than not recursive. Converting
> recursive programs to non-recursive mechanically is not hard. It must
> be, since no computer I know of supports recursion in hardware -- they
> all implement recursion with arrays of activation records. Also, the
> arguments about gotoless programming usually refer to programs written
> by humans, not programs written by machine, though on a deeply pipelined
> RISC, any sort of branch tends to be expensive. -John]

I believe that in SML of New Jersey, recursion is supported not by using
arrays of activation records, but by jumping around from place to place
via continuation passing (please correct me if I am wrong; I have just
started to learn functional programming, and some of the concepts seem
pretty nebulous/like magic to me). Functional language compilers tend to
spend a lot of time making recursion efficient, because that is the only
way to do things. As a result, in the article on continuation-passing
style, they note that properly tail-recursive functions will optimize to a
loop.
> On another note, some of you have mentioned that recursion would die
> this way, and since I have never been a big fan of recursion, and avoid
> it where possible, and very seldom do we think recursively (BE HONEST
> NOW) I thought this option would be common enough to actually be
> feasible. (Yes, recursion has its place, yes it's useful, but it's a
> specialty item!)

I'd say that I often think recursively. How do you do a quicksort?
Partition, then recursively sort each side. How do you do a binary search?
Check the middle, then recursively search the side the target is on. I
think it is merely a matter of personal preference whether you like the
recursive vs. non-recursive style of programming, as you can convert one
to the other.

-Eric

[Good point about continuation passing, which involves a heap rather than
an array of activation records. -John]
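[The tail-recursion point above, in a small C sketch with invented names:
a properly tail-recursive function and the loop it optimizes to, where the
"call" becomes an assignment to the parameters and a jump back to the top.
-ed]

```c
/* Properly tail-recursive: the recursive call is the last thing done,
   so no activation record needs to survive it. */
int sum_tail(int n, int acc) {
    if (n == 0) return acc;
    return sum_tail(n - 1, acc + n);   /* tail call */
}

/* What a tail-call-optimizing compiler effectively emits: the call
   becomes parameter updates plus a jump. */
int sum_loop(int n, int acc) {
    for (;;) {
        if (n == 0) return acc;
        acc += n;                      /* acc + n */
        n -= 1;                        /* n - 1   */
    }
}
```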
mcg@ichips.intel.com (Steven McGeady) (05/07/91)
I've just read the thread on inlining (through 5 May 91), and have a few
comments to add as an implementor:

- Respondents don't seem to be making a distinction between inlining as a
programmatic, user-specified extension, and inlining as a transparent,
compiler-implemented optimization. While closely related, I feel these two
types of inlining must be addressed separately:

- User-specified inlining is as good as the user's understanding of his or
her program. In situations where the user has a deep understanding of the
performance behaviour of the program under study, user-directed inlining
can be a powerful tool. When I wrote 'inline', a stand-alone C-to-C
inliner, I carefully studied several algorithms, including 'compress'.
Careful profiling followed by inlining resulted in a 10% performance
improvement, even in this carefully-optimized program.

- Heuristic inlining is only as good as the heuristic (duh). Our research
is pointing out that we haven't found a good heuristic yet without using
profiling feedback. We've tried to synthesize a heuristic from call-graph,
register-pressure, and size information, without repeatable success (i.e.
over a broad selection of programs). Heuristics that include profiling
input (a weighted dynamic call tree) can repeatably produce improvements
in most programs without causing serious regressions. (Our compiler does
global (inter-module) inlining with a two-pass model.)

Unfortunately, users often think they know more about their programs than
they actually do, and many don't have the tools, or are too lazy, to
measure their programs. Many inlining decisions users make are just plain
wrong. Heuristic inliners like gcc's make the user's task easier: try it
both ways, and pick the fastest. This doesn't validate the practice of
inlining; it merely provides commentary on the effectiveness of gcc's
heuristic (which is: not particularly).
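[The general shape of such a profile-weighted heuristic, sketched in C.
The inputs named match the ones McGeady lists (dynamic call counts, callee
size, register pressure), but the weights, cutoff, and names are entirely
invented for illustration. -ed]

```c
/* Hypothetical sketch of a profile-driven inlining heuristic: score a
   call site from its dynamic call count (profile feedback), the
   callee's size (code-growth cost), and register pressure at the site.
   All weights and the cutoff are invented, not any real compiler's. */

struct call_site {
    long dyn_calls;      /* from the weighted dynamic call tree */
    int  callee_insns;   /* code-growth cost of inlining */
    int  reg_pressure;   /* live registers at the call site */
};

int should_inline(const struct call_site *cs) {
    long benefit = cs->dyn_calls;                 /* saved call overhead */
    long cost = (long)cs->callee_insns * (1 + cs->reg_pressure / 8);
    return benefit > cost * 10;                   /* invented cutoff */
}
```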
- Several respondents have noted that good interprocedural dataflow
analysis can yield better results. In theory, I agree (on processors where
calls are relatively cheap); however, true REF/DEF dataflow information
can quickly become intractable (or at least very difficult) in a large
program when attempted across the entire program (for C, when tracking all
points-to information). So if global DFA is limited to a procedure,
inlining frequently-traversed arcs of the call graph can dramatically
improve the overall effectiveness of DFA-based optimizations.

- Along the same lines as the last point, inlining can also expose many
other worthwhile optimizations that can't profitably be done on an
intermodule basis.

In particular, until call tailoring becomes a reality (including debug
support!), I consider the utility of some classes of inlining to be high
when modified with profile information.

Summary:
- 'inlining' means two different things
- user inlining is effective only for sophisticated users
- compiler heuristic inlining is currently hampered by poor heuristics
- profile information considered essential for inlining heuristics
- intermodule global DFA considered difficult to intractable
- intelligent profile-driven inlining is a Good Thing

S. McGeady
i960 Software Architecture Group
Intel Corp.