preston@ariel.rice.edu (Preston Briggs) (05/03/91)
carter@cs.wisc.edu (Gregory Carter) writes:
[wondering about inlining the world]

Don't forget to be careful of recursion!

and the moderator notes

>[GCC and some other C compilers permit you to declare a procedure "inline"
>so that it is expanded wherever it is called. That gives pretty much the
>same effect, faster but larger code. -John]

Locally, we're pretty down on naive inlining. I think there are some cases
where it pays, but they are much rarer than imagined. So, here's a list of
reasons not to inline:

Code growth
Replicating procedure bodies in many places can have bad effects on paging,
the TLB, and the I-cache (however, it may sometimes improve locality). It
also hurts compile time, since optimizers are pretty sensitive to code size.

Register pressure
Call sites are natural places to spill lots of registers. Register
allocators are rarely able to achieve the same efficiency.

Massive recompilation
When you edit an inlined routine, you've got to recompile everyone that
calls the routine. On big programs (say an OS, or perhaps a reservoir
simulator), this is a Bad Thing.

People sometimes argue that the inlining is only done at the last second,
after code has been debugged. I don't think this is a good argument. Code
is always being debugged, especially the big things, like OS's. Further,
big programs take a lot of performance tuning, usually done with full
optimization. Surely everyone knows not to do initial development with the
optimizer turned on?

Inlining isn't as good as interprocedural analysis
Interprocedural analysis can track across the entire program, including
loops in the call graph (recursion). Inlining only gets effects on
procedures actually inlined. Imagine a huge program, with many constant
parameters set by the main routine. Interprocedural constant propagation
will get these constants all through the code; inlining won't get anything
unless it's inlined into the main routine.
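[A minimal C sketch of that last point, with invented names. A constant
flag enters at one call site; interprocedural constant propagation can
prove debug == 0 in every callee, while inlining helps only if the whole
chain is inlined up into the caller. -ed]

```c
#include <assert.h>

/* A deep call chain: the caller passes a constant "debug" flag down.
   Interprocedural constant propagation can prove debug == 0 in every
   callee and delete the dead branch; inlining gets the same effect only
   if the entire chain is inlined into run().  Hypothetical example. */

static int leaf(int debug, int x) {
    if (debug)               /* provably dead once debug == 0 propagates */
        x += 1000;
    return x * 2;
}

static int middle(int debug, int x) {
    return leaf(debug, x) + 1;
}

static int top(int debug, int x) {
    return middle(debug, x) + 1;
}

int run(int x) {
    return top(0, x);        /* the constant enters here, as in main() */
}
```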
Inlining is ineffective
We've tried lots of experiments, with a variety of code on a variety of
machines (scalar, vector, parallel, RISC, and CISC). Sometimes we manage a
20% improvement. Sometimes we degrade by up to 20%. Usually, there's only a
fractional change one way or another. It's very difficult to predict when
it will pay off.

We believe that many people know these things, often from experimentation.
However, nobody likes to publish negative results. Hence, the knowledge has
been Suppressed.

------------------

Obviously, inlining is not completely Evil. However, I think it has been
over-sold and over-used. This is a little attempt to even the balance.

Preston Briggs
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers. Meta-mail to compilers-request.
ressler@cs.cornell.edu (Gene Ressler) (05/03/91)
These are very good points. My own experiments: I recently did an
event-driven simulator in C. Compiling the inner loop module with
gcc -finline-functions -O got about a 12% improvement over plain -O on
both a SPARCstation and a MIPS, and about the same with c89 -Q -O vs.
c89 -O on an RS/6000 320.

In Lucid Common Lisp code for image processing, however, I've gotten 50%
on a SPARC. But you'd expect that, given the heavy weight of Common Lisp's
function calls.

Gene Ressler
daniel@quilty.Stanford.EDU (Daniel Weise) (05/03/91)
>Locally, we're pretty down on naive inlining. I think there are some
>cases where it pays, but they are much rarer than imagined. So, here's a
>list of reasons not to inline:

What is naive inlining? Inlining without measuring the gain? Doing any
program transform on the thought that it might help is known to be a bad
strategy. Why pick on inlining?

On a more general note, I would guess that part of the reason you are down
on inlining is that your local coding style has the programmer providing
the inlining by hand, instead of writing in a more procedural and abstract
form. That is why you see plus or minus 20% performance changes. In both
CLU-style and Lisp-style programming, inlining of small functions yields
large performance gains.

Your other arguments are not so clear cut.

>CODE GROWTH: Replicating procedure bodies in many places. Can have bad
>effects on paging, TLB, I-Cache. (however, may improve locality
>sometimes.) Hurts compile-time, since optimizers are pretty sensitive to
>code size.

For small procedures, code growth is minimal. Inlining decreases data
shuffling and other overhead. As I-caches get bigger, the slightly
increased code size becomes less and less important.

>REGISTER PRESSURE: Call sites are natural places to spill lots of
>registers. Register allocators are rarely able to achieve the same
>efficiency.

This merely means we have to make better register allocators.

>MASSIVE RECOMPILATION: When you edit an inlined routine, you've got to
>recompile everyone that calls the routine. On big programs (say an OS or
>perhaps a reservoir simulator), this is a Bad Thing.

But you have to recompile a program of any size when a structure
declaration changes. Why persecute functions and not structures? You have
to draw the line somewhere, but you should be aware that you are drawing a
line in a place that someone else may not agree with.

>INLINING ISN'T AS GOOD AS INTERPROCEDURAL ANALYSIS: ...

Inlining + interprocedural analysis > just interprocedural analysis.
Also, if one isn't doing interprocedural analysis, inlining is the next
best thing.

>Inlining is ineffective

As I state above, this really is a function of how programs are written.
For your local writing style and compiler this may be true.

>Obviously, inlining is not completely Evil. However, I think it has been
>over-sold and over-used. This is a little attempt to even the balance.

Your argument boils down to: inlining doesn't help our code, and makes our
tools less useful. For other people's code and tools, inlining can be
vital. For example, CLU code would run much slower if the compiler didn't
inline small functions. I know production C++ programmers who turn on
inlining in the last stages of development (I am thinking in particular of
developers at a computer company working on a large experimental OS). I
don't remember the speedups they get, but they are substantial enough to
warrant using inlining even though it hurts debugging.

I suspect the reason there is disagreement on this issue is that neither
of us is quantifying what is being inlined. I would buy your arguments for
larger procedures, and you would probably buy mine for smaller procedures.

Daniel Weise
boehm@parc.xerox.com (Hans Boehm) (05/04/91)
It seems to me that all such evaluations are incredibly dependent on the
source language and on the programming style. I would expect to see no
improvement for Fortran code that was written by people who believed that
procedure calls are very expensive and should be avoided like the plague.
They presumably already did the profitable inlining themselves (and
perhaps a little more). I wouldn't expect a lot of improvement in an
average Pascal program either.

In my experience, the place where inlining pays off a lot is for languages
and coding styles that encourage operations on abstract data types to be
expressed as separate procedures. In a well-written Modula-3 or Oberon
program, I would expect to see a reasonable number of procedures that can
be inlined to 2 or 3 instructions, with perhaps no additional register
usage.

A similar effect also occurs in the presence of generic functions, though
that requires more work. If I invoke qsort with integer comparison, I
would expect to gain a fair amount by cloning qsort and inlining the
comparison (which may effectively have a zero instruction count, since I
save the test on the function result).

My experience with inlining in Russell is very different from Preston's.
But then both of the above issues apply (though I don't create more than
one clone of a procedure). Furthermore, all of the "primitives" like
integer addition are conceptually procedures (that may be passed as
parameters, or assigned, or reused by user-defined types). Russell
programs compiled without inlining are intolerably slow. (I don't recall
exact numbers. A factor of 3-5 slower is a guess.) This is largely due to
what would be primitives in other languages. But since it's common to
slightly modify an existing type like "Float" to derive new types, the
notion of a "primitive" is unclear, and a more general mechanism is
clearly needed.
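[The qsort point can be sketched in C. The generic form pays an indirect
call per comparison; the "cloned" form has the comparison inlined into the
sort body, shown here as a simple insertion sort for brevity. Invented
code, not from the post. -ed]

```c
#include <stdlib.h>

/* Generic form: every comparison is an indirect call through qsort. */
static int int_cmp(const void *a, const void *b) {
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

void sort_generic(int *v, size_t n) {
    qsort(v, n, sizeof *v, int_cmp);
}

/* "Cloned" form: the integer comparison is inlined into the sort body,
   so the call, the indirect jump, and the test on the function result
   all disappear.  (Insertion sort used only to keep the sketch short.) */
void sort_cloned(int *v, size_t n) {
    for (size_t i = 1; i < n; i++) {
        int key = v[i];
        size_t j = i;
        while (j > 0 && v[j - 1] > key) {   /* comparison inlined */
            v[j] = v[j - 1];
            j--;
        }
        v[j] = key;
    }
}
```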
I also like the idea of supplying most language "primitives" in predefined
modules, and then letting inlining remove the performance distinction.

None of this is astonishing or inconsistent with what Preston said. But
it's important to only apply these results to their proper context.

Hans
(boehm@xerox.com)
gateley@rice.edu (John Gateley) (05/04/91)
In article <1991May2.180508.17100@rice.edu> preston@ariel.rice.edu
(Preston Briggs) writes:

>carter@cs.wisc.edu (Gregory Carter) writes:
>[wondering about inlining the world]
>
>Don't forget to be careful of recursion!

Homework assignment: figure out how to inline with recursion. (Yes, I did
it. Turned out to be pretty neat :^)

>and the moderator notes
>>[GCC and some other C compilers permit you to declare a procedure "inline"
>>so that it is expanded wherever it is called. That gives pretty much the
>>same effect, faster but larger code. -John]
>
>Locally, we're pretty down on naive inlining. I think there are some
>cases where it pays, but they are much rarer than imagined. So, here's a
>list of reasons not to inline:

I have found several places where inlining is very useful, but they all
seem to be in the Lisp world. In the compiler (for CL/Scheme) I worked on,
we used inlining quite heavily to optimize the language runtimes. However,
we were careful about it: we would only inline a procedure if it didn't
cause code explosion, or if there was some reason the extra code would get
folded away. For example, if we saw an "aref" call (CL's general-purpose
array reference) and we could determine that the argument was a simple
vector, we would fold it away, producing a simple vector ref. We didn't
just blindly inline the whole mess (which was 50-100 lines of code) and
hope that it would go away.

John
gateley@rice.edu
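[A C caricature of that aref trick, with invented names. The general
routine dispatches on a runtime tag; once the compiler proves the tag, it
can substitute the specialized reference instead of inlining the whole
dispatch. -ed]

```c
/* Hypothetical sketch of type-directed inlining: a general array
   reference that dispatches on a runtime tag, and the specialized form
   a compiler could emit once it proves the tag is SIMPLE_VECTOR. */

enum tag { SIMPLE_VECTOR, OTHER_ARRAY };

struct array {
    enum tag tag;
    int *data;
};

/* General aref: a stand-in for the 50-100 line runtime routine. */
int aref_general(struct array *a, int i) {
    if (a->tag == SIMPLE_VECTOR)
        return a->data[i];          /* fast path */
    /* ... displaced arrays, fill pointers, bounds checks ... */
    return a->data[i];
}

/* Specialized form: what gets inlined when the tag is known. */
static inline int svref(struct array *a, int i) {
    return a->data[i];
}
```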
mac@eleazar.dartmouth.edu (Alex Colvin) (05/04/91)
As others have pointed out, an environment (like C++) that supports
inlining encourages the use of procedures that the programmer would
manually inline in the absence of such support. Trivial inlined procedures
give you access control to various module data, e.g. read- or
increment-only access.

Another thing inlining gives you is better flow analysis, with a way to
fold constants into procedures. This makes optimizations such as

	{ if (false) S1 else S2 }  ->  S2

more useful. In this case, you expect the body to expand to different code
in each instance.
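[Both points together in a small C sketch, with invented names: trivial
inlined accessors guard module-private data, and once inlined, a guard on
a compile-time constant folds exactly as { if (false) S1 else S2 } -> S2.
-ed]

```c
/* Module-private data reachable only through trivial accessors: the
   rest of the program gets read- or increment-only access.  After
   inlining, the constant guard below folds away entirely. */

#define BOUNDS_CHECKS 0      /* compile-time constant */

static int counter;          /* module-private */

static inline int counter_read(void) { return counter; }

static inline void counter_incr(void) {
    if (BOUNDS_CHECKS) {
        /* S1: expensive checked path, deleted by constant folding */
    } else {
        counter++;           /* S2: what actually remains */
    }
}
```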
pardo@june.cs.washington.edu (David Keppel) (05/04/91)
>Preston Briggs writes:
>>[Some reasons why you might not want to inline]
>
> REGISTER PRESSURE:
> Call sites are natural places to spill lots of registers.
> Register allocators are rarely able to achieve the same efficiency.

daniel@quilty.Stanford.EDU (Daniel Weise) writes:
>[So we have to make better register allocators.]

Stronger than that: if the call site is the right place to save/restore
registers, then save and restore registers at the call site, around the
inlined function. Then delete any redundant (useless) saves and restores.
On the basis of register pressure, you couldn't possibly do worse than a
function call. You might still suffer all the other side effects (code
growth, etc.).

;-D on ( It's all caching ) Pardo
chris@crackers.clearpoint.com (Chris Clark) (05/04/91)
Although I generally agree with Hans' comments, I do want to make a minor
correction to his first statement. I no longer have the precise data on
it; however, the assumption that old-style FORTRAN programs do not profit
from inlining is incorrect. I was part of a team that did inlining as part
of optimization enhancements for FORTRAN, PL/I, and related compilers. Our
results suggested that inlining paid off more often than expected (but
possibly not by as high a percentage), even when inlining large
procedures. I also believe that you'll find that the MIPS people do fairly
extensive inlining as part of their optimizations, and they are targeting
"traditional" languages. (They had (have?) something called pmerge or
umerge which does it. They may even wait until link time to do it now, to
"solve" the separate compilation problem.)

1 PARAGRAPH DIGRESSION: The basic problem is that I think we are not as
smart about when and what to optimize by hand as we think. Also, many
applications are written primarily for correctness, portability, or
maintainability in any case. Though, I will admit a lot of FORTRAN is well
tuned, since computers used to be so slow!

Truly large procedures do not generally cause the problem of code
expansion, because they are quite often only called from one site. One
hard part of the trade-off is the dynamic/static balance. It's almost
universally a win to inline a procedure if it has only one call site. To
avoid the locality-of-reference problem that may occur if the procedure is
conditionally called, put the inlined procedure at the top or bottom of
the code and jump to and from it. (One of the jumps can be the condition
which triggers the call.) You may have to tweak other parts of your
optimizer to keep it from re-linearizing the flow of control and
rearranging the new code back into the middle.
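[That layout trick, expressed at the source level with gotos (invented
example): the single-call-site procedure is inlined, but its body is
parked after the hot path so it doesn't pollute locality, and one of the
jumps doubles as the conditional call. -ed]

```c
/* Sketch: rare_handler() has been inlined into its only call site, but
   its body is placed out of line, after the hot path.  The conditional
   branch that used to trigger the call is now the jump to the body. */
int process(int x) {
    int result;
    if (x < 0) goto handle_negative;   /* the "call" */
resume:
    result = x * 2;                    /* hot path, falls straight through */
    return result;

handle_negative:                       /* inlined body, out of line */
    x = -x;
    goto resume;                       /* the "return" */
}
```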
Statistically, having the code in a different part of the same module
should not have a higher probability of a cache conflict than having the
code in a separate module. However, in specific instances that will not be
the case. (And in benchmarking, specific instances are all that count.
Unfortunately, I think that's true in general--every execution is always a
specific instance.) Thus, user control is important, for those cases when
the user is actually smart about exactly what they want done--i.e. they
ran the profiler and analyzed the results.

The hard analysis comes with the "wide part of the onion", the middle
layer of abstraction. Here the functions tend to be called from several
sites and to be of moderate length. This is often the meat of the
application: it performs those parts which are specific to the task at
hand, as opposed to the general driver code at the top and the standard
canned primitives, like adding to a list, at the bottom. (I think I just
recently read that this is where 80% of the coding errors occur also.)
Anyway, here it is usually only profitable to inline if the call occurs
within the context of a loop, and even then with some trepidation.

CAVEATS: Although inlining was profitable, I think we were shooting for an
average 10-15% gain in total execution time, including all of the
optimizations we did at that rev. I believe strength reduction and
improved register analysis improved performance more than inlining did in
many applications. I also don't remember whether we did the recursion
unrolling inline substitution or not. I do know we did special code to
help us with up-level variable references. The code was also controlled by
command-line parameters, to give users some knobs to tweak and hopefully
prevent explosive growth, which would have been fatal since the
architecture was segmented and there was a fixed upper bound on the size
of any one procedure.
It is possible that our results were colored by the truly expensive
procedure-call overhead or the size of the I-cache on the target machines.
A typical CPU was 3 to 5 boards, with about half a board dedicated to
cache, if I recall. The results were also colored by the fact that the
compiler did no interprocedural analysis at that rev.

The nature of the I-cache may have also affected the results. I believe
the probability of a cache miss from two addresses within a segment
conflicting was lower (possibly zero on the high-end machine) than from
conflicts between separate segments, and thus between separate procedures.
(Multiple procedures could exist in a segment, but the linker always
packed segments from the same end.)

All of this is only my recollection. The true data, which was gathered on
"real live applications", has probably been lost for years and was
probably marked as proprietary anyway. I also didn't do the
implementation. I think Rich Ford (now at Encore) or Dan Laska (Compass?)
did, but it also may have been Karen Berman-Mulligan (Private), Ruth
Halpern (LTX), or Suresh Rao (Intel).

I hope this helps,
- Chris (Clark)
carter@cs.wisc.edu (Gregory Carter) (05/05/91)
I suppose I should have outlined the process of combining procedural and
inlined code replacement, so here it goes.

What I initially wanted was for the source code to remain UNCHANGED. All
transformations would be done without actually modifying the source, only
at the code generation stage. Therefore the original source text would
remain the same: no extended compile times, no debugging problems. Only at
the code generation stages would the inline effects be seen.

On another note, some of you have mentioned that recursion would die this
way. Since I have never been a big fan of recursion, and avoid it where
possible, and very seldom do we think recursively (BE HONEST NOW), I
thought this option would be common enough to actually be feasible. (Yes,
recursion has its place; yes, it's useful; but it's a specialty item!)
However, I would think there would be some money out there for you hot
shots to design a paradigm which converts between recursive and iterative
code styles..AUTOMATICALLY. :) (Bring on the JOLT COLA) :) (Maybe some
weekend when I have time, or when I find a competent CS536 prof) :)

I was amazed to find that most of you skipped over the implications of
inlining code blocks....JMPs (GOTOs)!!! (Isn't that a NO NO? Or is it?) I
personally love GOTOs, which is part of the reason I want to trash the
stack model (on the machine level; on the abstract level it's very
useful). (Please note above, I don't condone gotos. Although, they sure
are QUICK.) By the time you fix your stack frame, you're already executing
your work code, inline that is.

Now for a radical idea: hardware with global code support, instead of
wasting that extra silicon real estate on a stack frame register? Any
comments?

Everyone secretly knows anyway that gotos are better, nicer, and a more
efficient programming construct for designing loops of all kinds. (Of
course, I will contend that too much of a good thing is bad..or is it?)
:)

Gregory (Have fun with that last one, and remember BE HONEST).

[To each his own, I suppose. I find that the most natural way to express a
nontrivial routine is more often than not recursive. Converting recursive
programs to non-recursive ones mechanically is not hard. It must be, since
no computer I know of supports recursion in hardware -- they all implement
recursion with arrays of activation records. Also, the arguments about
gotoless programming usually refer to programs written by humans, not
programs written by machine, though on a deeply pipelined RISC, any sort
of branch tends to be expensive. -John]
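[The mechanical conversion the moderator describes, sketched in C with an
invented example: a non-tail-recursive function rewritten with an explicit
array of "activation records". A real compiler does this on the machine
stack; this just makes the records visible. -ed]

```c
/* Naively recursive Fibonacci. */
int fib_rec(int n) {
    return n < 2 ? n : fib_rec(n - 1) + fib_rec(n - 2);
}

/* Same function, recursion replaced by an explicit array of pending
   "activation records" (here each record is just the argument n). */
int fib_iter(int n) {
    int frames[64];                /* the array of activation records */
    int top = 0, acc = 0;
    frames[top++] = n;             /* the initial call */
    while (top > 0) {
        int m = frames[--top];     /* pop a pending call */
        if (m < 2)
            acc += m;              /* base case: accumulate result */
        else {                     /* push the two recursive calls */
            frames[top++] = m - 1;
            frames[top++] = m - 2;
        }
    }
    return acc;
}
```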
pardo@june.cs.washington.edu (David Keppel) (05/06/91)
Ooops. I wrote something very unclear:

>Preston Briggs writes:
>>[Register allocators aren't as smart as bind save/restore at
>> the call site.]

pardo@june.cs.washington.edu (David Keppel) writes:
>[Technique...] On the basis of register pressure, you couldn't possibly
>do worse than a function call. You might still suffer all the other side
>effects (code growth, etc.).

Gee, it sounds like I'm saying that a function call is the worst of all
possible worlds, when I was talking about avoiding the function call. What
I *meant* to say (I really did re-read it, but it made sense at the time)
was that if you:

 * Do register saves and restores in the caller as if you were doing a
   function call
 * Do register allocation in the callee as if it were a real function
 * Inline the function

then you can:

 * Use the same register allocation strategy and get no worse register
   allocation
 * Remove redundant register saves and restores
 * Remove redundant branches

Hopefully it makes sense this time.

;-D on ( Confusion reigns supreme ) Pardo
ea08+@andrew.cmu.edu (Eric A. Anderson) (05/06/91)
carter@cs.wisc.edu (Gregory Carter) writes:
> Everyone secretly knows anyway that goto's are better, nicer and more
> efficient programming context for designing loops of all kinds anyway.
> (Of course, I will contend that too much of a good thing is bad..or is
> it?)

The last time I used gotos was when a program needed to get turned in in
about 30 seconds, and I had forgotten one of my loop-exiting conditions. I
wrote an if statement which just jumped me out of the loop without having
to add another boolean termination variable. That is what I see as the
primary value of gotos: eliminating the

	if (multi-level exit test) then terminate := true;

needed to get you out of two or three levels of loops.

> [To each his own, I suppose. I find that the most natural way to express
> a nontrivial routine is more often than not recursive. Converting
> recursive programs to non-recursive mechanically is not hard. It must
> be, since no computer I know of supports recursion in hardware -- they
> all implement recursion with arrays of activation records. Also, the
> arguments about gotoless programming usually refer to programs written
> by humans, not programs written by machine, though on a deeply pipelined
> RISC, any sort of branch tends to be expensive. -John]

I believe that in SML of New Jersey, recursion is supported not by using
arrays of activation records, but by jumping around from place to place
via continuation passing (please correct me if I am wrong; I have just
started to learn functional programming, and some of the concepts seem
pretty nebulous/like magic to me). Functional language compilers tend to
spend a lot of time making recursion efficient, because that is the only
way to do things. As a result, in the article on continuation-passing
style, they note that properly tail-recursive functions will optimize to a
loop.
> On another note, some of you have mentioned that recursion would die
> this way, and since I have never been a big fan of recursion, and avoid
> it where possible, and very seldom do we think recursively (BE HONEST
> NOW) I thought this option would be common enough to actually be
> feasible. (Yes, recursion has its place, yes it's useful, but it's a
> specialty item!)

I'd say that I often think recursively. How do you do a quicksort?
Partition, then recursively sort each side. How do you do a binary search?
Check the middle, then recursively search the side the target is on. I
think it is merely a matter of personal preference whether you like the
recursive vs. non-recursive style of programming, as you can convert one
to the other.

-Eric

[Good point about continuation passing, which involves a heap rather than
an array of activation records. -John]
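[The tail-recursion point above, in a small C sketch with invented names:
a properly tail-recursive function and the loop it optimizes to, where the
"call" becomes an assignment to the parameters and a jump back to the top.
-ed]

```c
/* Properly tail-recursive: the recursive call is the last thing done,
   so no activation record needs to survive it. */
int sum_tail(int n, int acc) {
    if (n == 0) return acc;
    return sum_tail(n - 1, acc + n);   /* tail call */
}

/* What a tail-call-optimizing compiler effectively emits: the call
   becomes parameter updates plus a jump. */
int sum_loop(int n, int acc) {
    for (;;) {
        if (n == 0) return acc;
        acc += n;                      /* acc + n */
        n -= 1;                        /* n - 1   */
    }
}
```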
mcg@ichips.intel.com (Steven McGeady) (05/07/91)
I've just read the thread on inlining (through 5 May 91), and have a few
comments to add as an implementor:

- Respondents don't seem to be making a distinction between inlining as a
programmatic, user-specified extension, and inlining as a transparent,
compiler-implemented optimization. While closely related, I feel these two
types of inlining must be addressed separately:

- User-specified inlining is as good as the user's understanding of his or
her program. In situations where the user has a deep understanding of the
performance behaviour of the program under study, user-directed inlining
can be a powerful tool. When I wrote 'inline', a stand-alone C-to-C
inliner, I carefully studied several algorithms, including 'compress'.
Careful profiling followed by inlining resulted in a 10% performance
improvement, even in this carefully-optimized program.

- Heuristic inlining is only as good as the heuristic (duh). Our research
is pointing out that we haven't found a good heuristic yet without using
profiling feedback. We've tried to synthesize a heuristic from call-graph,
register-pressure, and size information, without repeatable success (i.e.
over a broad selection of programs). Heuristics that include profiling
input (a weighted dynamic call tree) can repeatably produce improvements
in most programs without causing serious regressions. (Our compiler does
global (inter-module) inlining with a two-pass model.)

Unfortunately, users often think they know more about their programs than
they actually do, and many don't have the tools, or are too lazy, to
measure their programs. Many inlining decisions users make are just plain
wrong. Heuristic inliners like gcc's make the user's task easier: try it
both ways, and pick the fastest. This doesn't validate the practice of
inlining; it merely provides commentary on the effectiveness of gcc's
heuristic (which is: not particularly).
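[The general shape of such a profile-weighted heuristic, sketched in C.
The inputs named match the ones McGeady lists (dynamic call counts, callee
size, register pressure), but the weights, cutoff, and names are entirely
invented for illustration. -ed]

```c
/* Hypothetical sketch of a profile-driven inlining heuristic: score a
   call site from its dynamic call count (profile feedback), the
   callee's size (code-growth cost), and register pressure at the site.
   All weights and the cutoff are invented, not any real compiler's. */

struct call_site {
    long dyn_calls;      /* from the weighted dynamic call tree */
    int  callee_insns;   /* code-growth cost of inlining */
    int  reg_pressure;   /* live registers at the call site */
};

int should_inline(const struct call_site *cs) {
    long benefit = cs->dyn_calls;                 /* saved call overhead */
    long cost = (long)cs->callee_insns * (1 + cs->reg_pressure / 8);
    return benefit > cost * 10;                   /* invented cutoff */
}
```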
- Several respondents have noted that good interprocedural dataflow
analysis can yield better results. In theory, I agree (on processors where
calls are relatively cheap); however, true REF/DEF dataflow information
can quickly become intractable (or at least very difficult) in a large
program when attempted across the entire program (for C, when tracking all
points-to information). So if global DFA is limited to a procedure,
inlining frequently-traversed arcs of the call graph can dramatically
improve the overall effectiveness of DFA-based optimizations.

- Along the same lines as the last point, inlining can also expose many
other worthwhile optimizations that can't profitably be done on an
intermodule basis.

In particular, until call tailoring becomes a reality (including debug
support!), I consider the utility of some classes of inlining to be high
when modified with profile information.

Summary:
- 'inlining' means two different things
- user inlining is effective only for sophisticated users
- compiler heuristic inlining is currently hampered by poor heuristics
- profile information considered essential for inlining heuristics
- intermodule global DFA considered difficult to intractable
- intelligent profile-driven inlining is a Good Thing

S. McGeady
i960 Software Architecture Group
Intel Corp.