[comp.arch] Inlining subroutines at link time

riordanmr@clvax1.cl.msu.edu (Mark Riordan) (07/04/90)

I just came across a CDC NOS/VE manual describing a facility to 
inline FORTRAN routines after compilation (essentially during linking).
I no longer have access to a NOS/VE system, so I don't know how well
this works, but it seems like a fairly unique feature to me.  Have
you folks ever heard of a similar system?

I quote from the April 88 revision of the Control Data NOS/VE Object
Code Management guide:

"The AFTERBURN_OBJECT_TEXT command improves the execution time of
FORTRAN Version 1 and Version 2 programs by inlining subprograms.
Inlining a subprogram places the statements of the subprogram in the
code where the routine is called.  Inlining subprograms is an optimization
process that trades space for execution time.  Modules with inlined
subprograms are larger, yet they execute faster because they avoid 
call and return instructions, which increase execution time.

The afterburner works on bound modules and can be thought of as
an extension to the binding process.  A bound module is the usual input
to the afterburner, but if an object file is used, all modules in the file
are bound into one module named NEW.  The output from the afterburner
is a new module, with subprograms, or modules, expanded inline.

The afterburner requires no source code changes.  Existing programs
or subroutines can have subprograms expanded inline without changing
the program or subroutine.  For example, a FORTRAN program can
expand a math library routine inline without recompiling.  Since
afterburning occurs as a separate step after compilation, compilation
time is not affected.

The afterburner performs a cost/benefit analysis to determine if a
specific routine should be brought inline.  If a subroutine is called in
two places, it might be profitable to inline the subroutine in one
instance and not the other."
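
To make the idea concrete, here is a source-level sketch in C of what
the transformation amounts to (names invented; the afterburner itself
works on object code, not source):

	/* Before: every loop iteration pays for a call and return. */
	double scale(double x, double k) { return x * k; }

	double sum_scaled(double *a, int n, double k)
	{
	    double s = 0.0;
	    int i;
	    for (i = 0; i < n; i++)
	        s += scale(a[i], k);
	    return s;
	}

	/* After afterburning, the object code behaves as if it read: */
	double sum_scaled_burned(double *a, int n, double k)
	{
	    double s = 0.0;
	    int i;
	    for (i = 0; i < n; i++)
	        s += a[i] * k;    /* body substituted for the call */
	    return s;
	}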

preston@titan.rice.edu (Preston Briggs) (07/04/90)

In article <1990Jul3.194348.21178@msuinfo.cl.msu.edu> riordanmr@clvax1.cl.msu.edu (Mark Riordan) writes:
>I just came across a CDC NOS/VE manual describing a facility to 
>inline FORTRAN routines after compilation (essentially during linking).
>I no longer have access to a NOS/VE system, so I don't know how well
>this works, but it seems like a fairly unique feature to me.  Have
>you folks ever heard of a similar system?

There's a rumour that some Unix C compilers do similar things.
Generally though, it seems a waste, particularly for Fortran.
Consider the time spent in a subroutine versus a savings of two
instructions -- hardly worth the time to specify an extra compile flag.

Inlining before optimization has the potential for being a lot
more interesting.  The hard part is choosing (automatically)
routines to inline.  It's fairly easy to eliminate most of the
calls in a program, but is the code any faster?  Often not.
On many systems, inlining actually produces slower code,
despite all the obvious intuitions.  It's particularly frustrating
in view of the hefty time penalties paid to optimize the
resulting humongous routines.
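
For instance (a C sketch, invented names): if a call site passes
constants, inlining before optimization lets the optimizer specialize
the body, which a plain link-time substitution never can:

	int poly(int x, int a, int b) { return a * x + b; }

	int f(int x)
	{
	    return poly(x, 1, 0);    /* inlined: 1 * x + 0 */
	}
	/* Inline before optimization, and constant folding reduces f
	   to "return x;" -- the multiply and add vanish along with
	   the call.  Substitute at link time and you keep them all. */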

--
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

dhinds@portia.Stanford.EDU (David Hinds) (07/04/90)

In article <9595@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>In article <1990Jul3.194348.21178@msuinfo.cl.msu.edu> riordanmr@clvax1.cl.msu.edu (Mark Riordan) writes:
>>I just came across a CDC NOS/VE manual describing a facility to 
>>inline FORTRAN routines after compilation (essentially during linking).
>
>There's a rumour that some Unix C compilers do similar things.
>Generally though, it seems a waste, particularly for Fortran.
                                     ^^^^^^^^^^^^^^^^^^^^^^^^
>Consider the time spent in a subroutine versus a savings of two
>instructions -- hardly worth the time to specify an extra compile flag.

    Geez, is this a cheap shot, or what?  FORTRAN has its weaknesses,
but inefficiency is not one of them.  I know of no other language that
can be as effectively optimized.  The time savings might be significant
if paging/caching overhead is included in the calculation.

>Inlining before optimization has the potential for being a lot
>more interesting.  The hard part is choosing (automatically)
>routines to inline.  It's fairly easy to eliminate most of the
>calls in a program, but is the code any faster?  Often not.
>On many systems, inlining actually produces slower code,
>despite all the obvious intuitions.  It's particularly frustrating
>in view of the hefty time penalties paid to optimize the
>resulting humongous routines.
>
    The MIPS compiler front-ends can all generate intermediate "ucode"
object files.  These are linked, then passed to a procedure merge
program to inline things.  This then goes through the global optimizer,
and on to the code generator.  The inliner can use feedback from the
profiler to decide where it is most useful.

 -David Hinds
  dhinds@popserver.stanford.edu

lm@snafu.Sun.COM (Larry McVoy) (07/04/90)

In article <1990Jul3.215128.23026@portia.Stanford.EDU> dhinds@portia.Stanford.EDU (David Hinds) writes:
>    Geez, is this a cheap shot, or what?  FORTRAN has its weaknesses,
>but inefficiency is not one of them.  I know of no other language that
>can be as effectively optimized.  

Yeah, I used to like fortran pretty well.  I have before me "Programmer's Guide
to Fortran 90" which contains things, such as pointers, that make me wonder
if this is going to make fortran an uninteresting language.
---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

cik@l.cc.purdue.edu (Herman Rubin) (07/04/90)

In article <9595@brazos.Rice.edu>, preston@titan.rice.edu (Preston Briggs) writes:
> In article <1990Jul3.194348.21178@msuinfo.cl.msu.edu> riordanmr@clvax1.cl.msu.edu (Mark Riordan) writes:

			............................

> There's a rumour that some Unix C compilers do similar things.
> Generally though, it seems a waste, particularly for Fortran.
> Consider the time spent in a subroutine versus a savings of two
> instructions -- hardly worth the time to specify an extra compile flag.

I have difficulty seeing how a linker can inline, as register conflicts
must be avoided.  If it can change the register allocations, it may,
however, do very well.

As for saving only two instructions, these instructions are likely to
be very time-consuming, usually at least partially blocking internal
parallelism.  I know of far too many machines on which a transfer
should be considered a catastrophic instruction.

> Inlining before optimization has the potential for being a lot
> more interesting.  The hard part is choosing (automatically)
> routines to inline.  It's fairly easy to eliminate most of the
> calls in a program, but is the code any faster?  Often not.

I agree completely.  But in the next millennium, any intelligent
programmer will do better without working hard at these decisions
than an automaton.   Why are people in the computer field so 
determined to get in the way of using brainpower?
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)	{purdue,pur-ee}!l.cc!cik(UUCP)

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (07/04/90)

In article <1990Jul3.194348.21178@msuinfo.cl.msu.edu> 
	riordanmr@clvax1.cl.msu.edu (Mark Riordan) writes:
>Modules with inlined
>subprograms are larger, yet they execute faster because they avoid 
>call and return instructions, which increase execution time.

Actually, the win (when there is a win) comes from the direct binding
of the arguments, which avoids the call site having to gather
together copies/pointers.  A result can be computed directly into its
target, avoiding a copyback.  Also, the inlined routine's locals are
merged into the caller's locals, saving the stack manipulation.
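
A C sketch of the argument-binding win (invented names):

	double sq_norm(double a, double b) { return a * a + b * b; }

	double with_call(double x, double y)
	{
	    /* x and y are copied into argument registers (or onto
	       the stack), and the result comes back in a return
	       register before reaching the caller's target. */
	    return sq_norm(x, y);
	}

	double when_inlined(double x, double y)
	{
	    /* a and b bind directly to x and y, the result is
	       computed straight into its target, and sq_norm's
	       locals merge into the caller's frame. */
	    return x * x + y * y;
	}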

The architectural tie-in is that inlining disturbs the statistical
properties of programs, increasing the number of registers used,
enlarging the code, and affecting its locality.
-- 
Don		D.C.Lindsay

njk@diku.dk (Niels J|rgen Kruse) (07/04/90)

cik@l.cc.purdue.edu (Herman Rubin) writes:

}I agree completely.  But in the next millennium, any intelligent
}programmer will do better without working hard at these decisions
}than an automaton.   Why are people in the computer field so
}determined to get in the way of using brainpower?
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Basically, isn't that what the computer field is all about?

Should we wonder why some people try to carry it too far?
-- 
Niels J|rgen Kruse 	DIKU Graduate 	njk@diku.dk

amull@Morgan.COM (Andrew P. Mullhaupt) (07/05/90)

In article <9804@pt.cs.cmu.edu>, lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
> The architectural tie-in is that inlining disturbs the statistical
> properties of programs, increasing the number of registers used,
> enlarging the code, and affecting its locality.

There is also the possibility that I-cache behavior is affected
by inlining: Inline code is somewhat more sequential. You can really
lose on this one if you experience page faults from the larger
code size, however. 

In C, inlining can (depending on where it happens) allow for
more optimization of pointer accesses than if function calls
intervene.
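
A C sketch of the pointer point (invented names; imagine bump()
compiled into a separate object file, so the call is opaque to the
compiler):

	void bump(int *p) { (*p)++; }

	void with_calls(int *p, int n)
	{
	    int i;
	    /* Opaque call: *p must be reloaded and stored on
	       every iteration. */
	    for (i = 0; i < n; i++)
	        bump(p);
	}

	void when_inlined(int *p, int n)
	{
	    int i;
	    int t = *p;    /* body visible: nothing else touches it, */
	    for (i = 0; i < n; i++)
	        t++;       /* so it stays in a register...           */
	    *p = t;        /* ...and is stored once at the end.      */
	}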

The best known (or depending on your point of view most infamous)
case of inlining is when the Dhrystone benchmark is compiled with
some memory movement routines inline. 

Later,
Andrew Mullhaupt

preston@titan.rice.edu (Preston Briggs) (07/05/90)

Replying to many people at once...

Riordan asked opinions on inlining at link time.
Since it's at link time, I assume (!) the linker simply substitutes
the procedure body (minus the return) for the call instruction.
Hence, 2 instructions saved.  Arguments would still be passed
in registers or on the stack and registers would be saved and restored
as usual.  More elaborate schemes would (possibly) be too complex and
expensive for link-time.  At any rate, that's all that was promised
by his manual.

I suggested it was a small savings, especially for Fortran.  I didn't
mean it as a dig at Fortran.  I meant that since a Fortran routine
often represents a fair amount of FP computation, 2 branches aren't
going to count.  For languages with many smaller routines,
I'd expect more significant improvements.

Paging effects and I-cache effects are difficult to determine in
advance.  You get better locality with inlining, but you also get
bulkier code.  Consider also the effect of inlining a very small routine
(a favorite choice).  With the non-inlined version, the routine might
remain in cache (or paged in) for extended periods, perhaps being
called from many different call sites.  In the inlined case,
each call site would have its own copy and require more cache
lines or pages.

Best seems to be experimentation with your own <system,compiler,application>.

--
Preston Briggs				looking for the great leap forward
preston@titan.rice.edu

amull@Morgan.COM (Andrew P. Mullhaupt) (07/05/90)

In article <9629@brazos.Rice.edu>, preston@titan.rice.edu (Preston Briggs) writes:
> Replying to many people at once...
> 
> Riordan asked opinions on inlining at link time.
> Since it's at link time, I assume (!) the linker simply substitutes
> the procedure body (minus the return) for the call instruction.
> Hence, 2 instructions saved.  Arguments would still be passed
> in registers or on the stack and registers would be saved and restored
> as usual.  More elaborate schemes would (possibly) be too complex and
> expensive for link-time.  At any rate, that's all that was promised
> by his manual.

Well, if you're not going to save the argument passing, then the thing
is less useful; but I don't see why, in FORTRAN, where you know how
your arguments look on the stack, the linker couldn't easily do a lot
more for you than just save the call and return.

> 
> Paging effects and I-cache effects are difficult to determine in
> advance.  You get better locality with inlining, but you also get
> bulkier code.  Consider also the effect of inlining a very small routine
> (a favorite choice).  With the non-inlined version, the routine might
> remain in cache (or paged in) for extended periods, perhaps being
> called from many different call sites.  In the inlined case,
> each call site would have it's own copy and require more cache
> lines or pages.

It isn't this simple at all.  If the program is running on a
multitasking system, your routine's cache lines might get dirtied by
part of the OS, in which case the very small routine always gets
pulled in as a bunch of cache misses.  Contrast this with the inlined
version, where the routine occurs at many different addresses, but
almost always at the addresses just after the ones you are executing.
In this case, the part of the OS which stomps on the out-of-line
routine can only sometimes get you -- so you see better overall
performance.  Now ask yourself: what's the one behavior for programs
which I-cache designers _must_ expect?  Sure, it's the one where you
do an instruction, and then the next one, and then the one after
that... and so your best hope of good performance across more than
one machine is to inline small routines.  Now, many inlining
techniques actually make more informed decisions about what to inline
and what not, but in the zeroth-order case (inline all calls to 'X'
or none of 'em), the simple answer is: inline 'X' if it is smaller
than some size and not if it's bigger.
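
In code, the zeroth-order rule amounts to nothing fancier than this
(a sketch; the threshold and the names are invented):

	#define INLINE_THRESHOLD 16    /* instructions; machine-dependent */

	struct routine {
	    char *name;
	    int   size_in_insns;
	};

	int should_inline(struct routine *r)
	{
	    return r->size_in_insns <= INLINE_THRESHOLD;
	}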

Later,
Andrew Mullhaupt

meissner@osf.org (Michael Meissner) (07/05/90)

In article <1184@s8.Morgan.COM> amull@Morgan.COM (Andrew P. Mullhaupt)
writes:

| The best known (or depending on your point of view most infamous)
| case of inlining is when the Dhrystone benchmark is compiled with
| some memory movement routines inline. 

Actually, the ANSI C standard explicitly allows this, since strcpy is
a standard function.  Also, it makes the benchmark comparable with the
original Ada benchmark, since Ada has string support built in.
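
A sketch of the sort of substitution the standard permits
(hypothetical; what any real compiler emits will vary):

	#include <string.h>

	char buf[32];

	void as_written(void)
	{
	    strcpy(buf, "HELLO");    /* a library call */
	}

	void as_compiled(void)
	{
	    /* The compiler knows strcpy's semantics exactly, so it
	       may emit a fixed 6-byte copy (5 chars + NUL) instead
	       -- a couple of inline word stores, no call at all. */
	    memcpy(buf, "HELLO", 6);
	}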

I would say that the 'most infamous' case, in terms of inlining and
expansion, was one of the early -stone benchmarks: it was completely
inlined by the compiler (Perkin-Elmer, according to my source), and
the compiler noticed there was no I/O and completely eliminated the
code.
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Do apple growers tell their kids money doesn't grow on bushes?

3ksnn64@cadlab1.ecn.purdue.edu (Joe Cychosz) (07/06/90)

In article <1990Jul3.194348.21178@msuinfo.cl.msu.edu> riordanmr@clvax1.cl.msu.edu (Mark Riordan) writes:
>I just came across a CDC NOS/VE manual describing a facility to 
>inline FORTRAN routines after compilation (essentially during linking).
>I no longer have access to a NOS/VE system, so I don't know how well
>this works, but it seems like a fairly unique feature to me.

On the CYBER 990/995 it is almost essential.  The call-return
instruction takes so damn long (3 microseconds, if my memory is
correct).  I spent quite a bit of time vectorizing codes only to have
the subroutine call eat up most of my efforts.  Thus CDC had HRC
develop the afterburner, which examined the unbound object and
inserted appropriate code segments inline to avoid subroutine calls.
I believe on the CYBER 2000 the performance of call-return has been
greatly improved.

>Have you folks ever heard of a similar system?

Someone in a later note pointed out MIPS and their UCODE approach.
Digital Productions used a virtual register scheme for much of their
code, and their optimizer then scheduled actual register usage for
the Cray X-MP.


>The afterburner performs a cost/benefit analysis to determine if a
>specific routine should be brought inline.  If a subroutine is called in
>two places, it might be profitable to inline the subroutine in one
>instance and not the other."

I wonder how it makes this decision, since profitability depends on the
number of times it is called.
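
My guess is something of this shape, driven by loop-depth estimates
or profile counts where available (pure speculation; the names and
weights are invented):

	struct call_site {
	    long est_calls;    /* profile count or loop-depth guess */
	    int  body_size;    /* instructions in the callee        */
	    int  call_cost;    /* cycles for call+return            */
	};

	int worth_inlining(struct call_site *s)
	{
	    long saved = s->est_calls * (long) s->call_cost;
	    long grown = s->body_size;         /* space penalty  */
	    return saved > 100 * grown;        /* invented ratio */
	}

Two sites calling the same routine would then differ only in
est_calls, which is how one site could profit from inlining while the
other doesn't.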

One final note: The afterburner is not limited to FORTRAN.  At that
point it is only looking at the object code.

3ksnn64@cadlab1.ecn.purdue.edu (Joe Cychosz) (07/06/90)

In article <2305@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>In article <9595@brazos.Rice.edu>, preston@titan.rice.edu (Preston Briggs) writes:
>> In article <1990Jul3.194348.21178@msuinfo.cl.msu.edu> riordanmr@clvax1.cl.msu.edu (Mark Riordan) writes:
>
>			............................
>
>> There's a rumour that some Unix C compilers do similar things.
>> Generally though, it seems a waste, particularly for Fortran.
>> Consider the time spent in a subroutine versus a savings of two
>> instructions -- hardly worth the time to specify an extra compile flag.
>
>I have difficulty seeing how a linker can inline, as register conflicts
>must be avoided.  If it can change the register allocations, it may,
>however, do very well.

I believe it does make register reassignments during inlining.  My memory
is a bit fuzzy on this since the afterburner became a formal product after
I stopped working on the 995 and began work on the ETA10s.

>> Inlining before optimization has the potential for being a lot
>> more interesting.  The hard part is choosing (automatically)
>> routines to inline.  It's fairly easy to eliminate most of the
>> calls in a program, but is the code any faster?  Often not.

Yes, especially on vector processors and scalar machines with fast
loop control instructions.  On most machines it would not be worth
the effort; on the 995, however, it is essential.

>I agree completely.  But in the next millennium, any intelligent
>programmer will do better without working hard at these decisions
>than an automaton.   Why are people in the computer field so 
>determined to get in the way of using brainpower?

Maybe so we can use that brainpower for something else.  It is fairly
difficult to beat today's optimizers.  You may find a better way, but
more than likely you'll realize less than 2% savings.  Of course there
are always exceptions; after all, an optimizer is a program and is
thus subject to errors, and reorganizing the algorithm or process is
outside the scope of an optimizer's function.

ron@sco.COM (Ron Irvine) (07/06/90)

The CDC post-loader (called the afterburner) was developed to
optimize the executable.  It can inline functions, reallocate
registers across function calls and turn far calls into near
calls. It can be a big win on the CYBER since subroutine
calls are expensive.

In general a post-loader can be a powerful tool in "optimizing"
the executable; it can:

Inline/Outline:
	- inline functions: faster execution, larger code size.
	- outline code (make repeated code into a function); this can
		reduce code size and be profitable on some machines
		(smaller code may fit into cache).  A source-level
		sketch of outlining follows this list.
		
Do global register allocation:
	- this can be of great benefit for processors with
		global registers.

Assign registers across function calls:
	- a post-loader has full program scope (including libraries)
		so it can create register usage information and
		eliminate some register save/restores across function
		calls.

Use short addresses:
	- data can be grouped to allow short memory references.
		This could be 16-bit offsets instead of 32-bit.
	- or fast access through a data register + small
		offset.
	- far calls can become near calls

Change instructions:
	- Instruction sequences can be tuned for a specific
		machine (without a recompile).
	- Patch around hardware/software bugs without changing
		the compiler.
	- Emulate new hardware/software.
	- Yet another place to reposition load/store and delay
		branch code.

Relocate data and functions:
	- This can be done to improve cache hits based on
		static or dynamic information (with feedback
		from a program run).
	- Data can be grouped so that often used data resides
		in a common page in memory.
	- functions can be grouped so that paging can be minimized
		and startup time reduced.
	- get rid of functions and data that are never used.

Performance monitor:
	- Add code to record the performance of a program.  This
		can be done to the "final" executable without
		the need to recompile and link with special
		libraries.

Misc:
	- add code to memory references to find a specific bug
	- modify the scope of symbols in an object file (for
		example this can allow more than one yacc/lex
		parser in a program)
	- allow a function (or data) to be replaced with a new one.
	- rearrange a production binary so that the function
		positions (addresses) are unique for each copy
		(and thus can encode a serial number for identification).
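
As promised above, a source-level sketch of outlining (invented
names; a post-loader would do this on object code):

	/* Before: the same clamp sequence is duplicated at two sites. */
	int f(int x) { if (x < 0) x = 0; if (x > 255) x = 255; return x + 1; }
	int g(int x) { if (x < 0) x = 0; if (x > 255) x = 255; return x - 1; }

	/* After: one copy of the sequence and two cheap near calls
	   -- smaller code, and more of it fits in the cache. */
	static int clamp(int x)
	{
	    if (x < 0)   x = 0;
	    if (x > 255) x = 255;
	    return x;
	}
	int f2(int x) { return clamp(x) + 1; }
	int g2(int x) { return clamp(x) - 1; }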

See, "Postloading for Fun and Profit", S.C.Johnson, UNENIX, Winter 90.


Ron Irvine, uunet!scocan!ron

mark@sco.COM (Mark Chojnacki) (07/06/90)

In article <1990Jul3.194348.21178@msuinfo.cl.msu.edu> riordanmr@clvax1.cl.msu.edu (Mark Riordan) writes:

>I just came across a CDC NOS/VE manual describing a facility to 
>inline FORTRAN routines after compilation (essentially during linking).

Actually, it's quite distinct from linking. As of NOS/VE 1.5.1 (released
in December 1989), the Afterburner (as the utility is called) can handle
FORTRAN Version 1, FORTRAN Version 2, CYBIL (CDC's OS implementation
language), COBOL and C (only C/VE initially, CV2 when it becomes available).

>I no longer have access to a NOS/VE system, so I don't know how well
>this works, but it seems like a fairly unique feature to me.  Have
>you folks ever heard of a similar system?

It works very well on CYBERs. There is a limit to how much you can
afterburn a bound module, however...

In article <9595@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:

>There's a rumour that some Unix C compilers do similar things.
>Generally though, it seems a waste, particularly for Fortran.
>Consider the time spent in a subroutine versus a savings of two
>instructions -- hardly worth the time to specify an extra compile flag.

The CALLREL, CALLSEG and RETURN instructions on many CYBER 180 machines
are expensive in terms of machine cycles. For call-intensive programs in
any language on these machines, the Afterburner can be very beneficial.
I know: I've seen it make very significant performance improvements to
high-profile application programs. These were programs which derived very
little benefit from conventional compile-time inlining.

In article <2305@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:

>I have difficulty seeing how a linker can inline, as register conflicts
>must be avoided.  If it can change the register allocations, it may,
>however, do very well.

The NOS/VE Afterburner (which is not a linker -- it is an object management
tool which takes bound modules as input and produces bound modules as output)
does, in fact, change register allocations to improve performance.

>As for saving only two instructions, these instructions are likely to
>be very time-consuming, usually at least partially blocking internal
>parallelism.  I know of far too many machines on which a transfer
>should be considered a catastrophic instruction.

You've hit the nail on the head. This is exactly the case with many of the
existing CYBER 180 machines. Apparently, the CYBER 2000 has addressed
this deficiency to some degree with faster CALL/RETURN instructions.
I do not know to what degree they have been improved.

In article <9629@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:

>Since it's at link time, I assume (!) the linker simply substitutes
>the procedure body (minus the return) for the call instruction.
>Hence, 2 instructions saved.  Arguments would still be passed
>in registers or on the stack and registers would be saved and restored
>as usual.  More elaborate schemes would (possibly) be too complex and
>expensive for link-time.  At any rate, that's all that was promised
>by his manual.

Nope. See above. It also removes register saves and restores.

>Paging effects and I-cache effects are difficult to determine in
>advance.  You get better locality with inlining, but you also get
>bulkier code.  Consider also the effect of inlining a very small routine
>(a favorite choice).  With the non-inlined version, the routine might
>remain in cache (or paged in) for extended periods, perhaps being
>called from many different call sites.  In the inlined case,
>each call site would have its own copy and require more cache
>lines or pages.

Yes, it is clearly a user-mode CPU time vs. memory space tradeoff...

>Best seems to be experimentation with your own <system,compiler,application>.

I second that. The problem is that many environments do not offer a tool
like the Afterburner with which to experiment...

Disclaimer: the above are my opinions and not necessarily my employer's...

Mark Chojnacki, Network Applications, SCO Canada, Inc. (416) 922-1937
mark@sco.com                          (formerly HCR Corporation)
or ..!uunet!scocan!mark               Toronto, Ontario, Canada

fouts@bozeman.ingr.com (Martin Fouts) (07/18/90)

In article <138349@sun.Eng.Sun.COM> lm@snafu.Sun.COM (Larry McVoy) writes:

   In article <1990Jul3.215128.23026@portia.Stanford.EDU> dhinds@portia.Stanford.EDU (David Hinds) writes:
   >    Geez, is this a cheap shot, or what?  FORTRAN has its weaknesses,
   >but inefficiency is not one of them.  I know of no other language that
   >can be as effectively optimized.  

   Yeah, I used to like fortran pretty well.  I have before me "Programmer's Guide
   to Fortran 90" which contains things, such as pointers, that make me wonder
   if this is going to make fortran an uninteresting language.


A few years ago I translated the Livermore Loops from Fortran to
semantically equivalent C.  On every machine I ran both versions on,
the C version produced correct results faster than the Fortran
version.  The machines included: Cray 2, X-MP and Y-MP, Amdahl
5880, Sun 3, SGI Iris 4D, a VAX running BSD, Convex, and Alliant.
That was a few years ago, and I am sure some improvements have been
made in Fortran compilers since then. (;-)

There are parts of Fortran which are very easy to optimize.  There are
parts which are harder to optimize.  The same is true for other
languages.  The trick with Fortran is that, being a smaller language,
it has fewer of the latter.  I can write very efficient code in C, and
I can write horrible code in Fortran; so, in another sense, the old saw
"You can write Fortran in any language" is as true about efficiency as
about anything else.

Marty
--
Martin Fouts

 UUCP:  ...!pyramid!garth!fouts  ARPA:  apd!fouts@ingr.com
PHONE:  (415) 852-2310            FAX:  (415) 856-9224
 MAIL:  2400 Geng Road, Palo Alto, CA, 94303

If you can find an opinion in my posting, please let me know.
I don't have opinions, only misconceptions.