[gnu.g++.lib.bug] lack of inline functions in libg++

othar@ernie.Berkeley.EDU (Othar Hansson) (10/25/89)

Before I rant, my thanks to the people who wrote the classes in
libg++.  You made my program work!  And reduced its size to a tenth of
that of the previous implementation.  Now for the whining...

I'm finding that I pay a real price in speed when I use some of the
libg++ library-derived classes.  For example, SplaySets are used
extensively in my code, and in a particularly crucial loop, elements
are retrieved from two SplaySets and multiplied -- my goal is to get
around that loop as many times as possible.

Unfortunately, functions like __SplaySet::next() cannot be inlined, as
they are not in the class header file -- they end up in another object
file.  This function call (and other accesses of different classes in
the same loop) costs me a significant constant in the speed around
that loop.  A lot of simple functions (even basic accesses like
class::operator[] and class::operator()) suffer from this same
problem, and they are precisely the functions that are called
thousands/millions of times in typical code.

While I wouldn't like to read an enormous header file for each
compilation, when I actually set my program running for 6 days, I
wouldn't mind the cost.  Can anyone think of a solution to this
problem?  

My initial modest proposal is that it would be nice if there were a
way to optionally GENCLASS a header with lots of inlines.
Alternately, there could be a compiler directive used in file X.cc, to
fetch a function from Y.cc for inlining in the compilation of X.s and
thence X.o.  The technology for doing this would probably enable lots
of other neat options as well...

        Tossing this out and hoping someone smarter than I will solve it,
        
                                                    Othar Hansson
                                                    CS Division
                                                    UC Berkeley
                                                    othar@ernie.berkeley.edu

dl@G.OSWEGO.EDU (Doug Lea) (10/25/89)

> I'm finding that I pay a real price in speed when I use some of the
> libg++ library-derived classes.  For example, SplaySets are used
> extensively in my code, and in a particularly crucial loop, elements
> are retrieved from two SplaySets and multiplied -- my goal is to get
> around that loop as many times as possible.
> 
> Unfortunately, functions like __SplaySet::next() cannot be inlined, as
> they are not in the class header file -- they end up in another object
> file.  This function call (and other accesses of different classes in
> the same loop) costs me a significant constant in the speed around
> that loop.  A lot of simple functions (even basic accesses like
> class::operator[] and class::operator()) suffer from this same
> problem, and they are precisely the functions that are called
> thousands/millions of times in typical code.
> 

Decisions about what to inline in a library are hard, and sometimes
arbitrary. I usually have to guess about which things are both short
enough and well-used enough to be worth declaring as inline. These
decisions never please everyone -- I get just about as many complaints
that there are too many inlines in libg++, making compiles unbearable
slow, as complaints that there are not enough inlines, hurting
performance.

Michael and I are toying with employing a numerical INLINE_LEVEL
compiler switch, with levels that correspond to more or less inlining
(thus, faster or slower compilation). A bunch of compilation and
linkage details have to be worked out before we can implement this.

The specific case at hand (SplaySets) is tougher. The succ() function
called by next() is inherently a loop that is not very amenable to
inlining -- inlining it might only give you a couple of percent
speedup, depending, in part, on the kind of machine you are using. In
your case, it might make sense to use SplaySets for the dynamic
aspects of your application, and then to transfer things to something
like an OSLSet, which supports a faster next() method, for traversal.
You might even transfer things (or, more likely, the corresponding
pointers) into a simple local array and step through it.

About your previous note requesting libg++ Graph classes. These are on
my (much too long) list of things to do. If anyone would like to
donate something before then, or collaborate on them, please do!

-Doug

rjc@maui.cs.ucla.edu (Robert Collins) (10/26/89)

In article <8910251101.AA29989@g.oswego.edu> dl@oswego.oswego.edu writes:
>
>> I'm finding that I pay a real price in speed when I use some of the
>> libg++ library-derived classes.
>
>Decisions about what to inline in a library are hard, and sometimes
>arbitrary. I usually have to guess about which things are both short
>enough and well-used enough to be worth declaring as inline. These
>decisions never please everyone -- I get just about as many complaints
>that there are too many inlines in libg++, making compiles unbearable
>slow, as complaints that there are not enough inlines, hurting
>performance.

I encountered this very early in the process of writing my Connection
Machine library.

>Michael and I are toying with employing a numerical INLINE_LEVEL
>compiler switch, with levels that correspond to more or less inlining
>(thus, faster or slower compilation). A bunch of compilation and
>linkage details have to be worked out before we can implement this.

In a sense, this was my solution.  If -O is specified, I do lots of
inlining, without it, I do little or none.  Each class in my library has
two files, a .h and a .cc.  They look something like this:

foo.cc:
	#define _FOO_CC_
	#include "foo.h"
	<code>

foo.h:
	class foo {
	};

	#if defined(_FOO_CC_)
	#define inline
	#endif

	#if defined(__OPTIMIZE__) || defined(_FOO_CC_)
	<the inline function definitions>
	#endif

	#if defined(_FOO_CC_)
	#define inline inline
	#endif

When compiling the library, a non-inline version gets put in foo.o, but
the parts of the library that use class foo can use inline versions (if
the library is compiled with -O).  When the user of the library is compiling,
they can compile some files with and others without -O.

When I get in a sitution where I want hundreds, or in some cases thousands
of inline functions, I will put them in a separate .h file.

foo.cc:
	#define _FOO_CC_
	#include "foo.h"
	<code>

foo.h:
	class foo {
	};

	#if defined(_FOO_CC_)
	#define inline
	#endif

	#if defined(__OPTIMIZE__) || defined(_FOO_CC_)
	#include "foo2.h"
	#endif

	#if defined(_FOO_CC_)
	#define inline inline
	#endif

foo2.h:
	<define oodles of inline functions>

This way, the huge number of function definitions is not scanned unless
it is necessary.

This simple scheme has helped a whole lot.  I still have a big problem
deciding which functions to inline.  In a number of cases, g++ tells
me that a function is too large to inline.  Retorical question:  if
I say inline, why doesn't it inline?  Who knows better, me or the compiler?
(I know, the answer is "compiler", but humor me, ok?).  But seriously,
it would be nice to be able to force the inlining of what the compiler
thinks are very large functions.  At least in my code, there are lots
of places where the compiler can evaluate a conditional at compile time,
eliminating lots of code as unreachable (if the function is expanded
inline).

The problem with all these inline functions (besides compile time) is that
I suffer from a serious case of "executable bloat".  I mean I get these
HUGE multi-meg executables.  This problem can be alieviated somewhat by better
optimization in the compiler and by better choice in which functions to 
expand inline.

More on my optimization complaints coming soon to a mailing list near
you...

rob collins
rjc@cs.ucla.edu

PS  For those who are interested in the C++/CM interface, it is about 75%
done.  It has stalled, due to other commitments and massive computer
system changes/problems (latest weirdness: I left town for a couple of days,
and when I came back this morning, all of the machine names had been
changed--strange).

-------------------------------------------------------------------------------
rjc@cs.ucla.edu	                    C++/Paris on the CM2:  The only way to fly.
-------------------------------------------------------------------------------

tiemann@LURCH.STANFORD.EDU (Michael Tiemann) (10/26/89)

Another solution to coding around the cost of function calls is to use
a RISC workstantion where that cost is negligible.

Michael

dl@G.OSWEGO.EDU (Doug Lea) (10/26/89)

Michael said...

> Another solution to coding around the cost of function calls is to use
> a RISC workstantion where that cost is negligible

But this is only part of the story. The *primary* goal of inlining on
RISC machine like a SPARC is not really saving a cycle or two for the
function call, but rather `procedural integration'. If inlines reveal
to the compiler some important constraints, constants, etc., that make
further simplifications and optimizations possible, it can be a HUGE
win. Try this:


#include <builtin.h>
#include <stream.h>

class Vec1000000
{
  float    s[1000000];
public:
           Vec1000000() {}
          ~Vec1000000() {}

  int      size();
  float&   sub(int i);

  int      ni_size();    // non-inline versions
  float&   ni_sub(int i);

  virtual int      v_size();  // virtual non-inline versions
  virtual float&   v_sub(int i);
 
};

inline int Vec1000000::size()    { return 1000000; }
       int Vec1000000::ni_size() { return 1000000; }
       int Vec1000000::v_size()  { return 1000000; }

inline float& Vec1000000::sub(int i)    { return s[i]; }
       float& Vec1000000::ni_sub(int i) { return s[i]; }
       float& Vec1000000::v_sub(int i)  { return s[i]; }


void fill(Vec1000000& v, float x)
{
  for (int i = 0; i < v.size(); ++i) v.sub(i) = x;
}

void ni_fill(Vec1000000& v, float x)
{
  for (int i = 0; i < v.ni_size(); ++i) v.ni_sub(i) = x;
}

void v_fill(Vec1000000& v, float x)
{
  for (int i = 0; i < v.v_size(); ++i) v.v_sub(i) = x;
}

main()
{
  Vec1000000 v;

  start_timer();  fill(v, 1.0);
  cout << "fill time = " << return_elapsed_time(0.0) << "\n";
  start_timer();  ni_fill(v, 1.0);
  cout << "ni_fill time = " << return_elapsed_time(0.0) << "\n";
  start_timer();  v_fill(v, 1.0);
  cout << "v_fill time = " << return_elapsed_time(0.0) << "\n";
}


g++ -g -O -fstrength-reduce -fdelayed-branch t1026.cc -lg++

Compilation finished at Thu Oct 26 05:51:39

(on a Sun4/110)

g.oswego.edu% a.out
fill time = 0.71
ni_fill time = 1.93
v_fill time = 4.41
g.oswego.edu% !!
a.out
fill time = 0.63
ni_fill time = 1.9
v_fill time = 3.58
g.oswego.edu% !!
a.out
fill time = 0.58
ni_fill time = 1.89
v_fill time = 4.41
g.oswego.edu% 


So inlining gave a > 2.5X speedup over noninlines, and > 5X over
virtuals on the Sparc. It's not too hard to make examples where it's
closer to 10X. 

The code generated for inline fill() is  about the best you could get
anywhere:

_fill__FR10Vec1000000f:
	!#PROLOGUE# 0
	save %sp,-112,%sp
	!#PROLOGUE# 1
	mov %i1,%o2
	sethi %hi(1000000),%o0
	or %lo(1000000),%o0,%o0
	mov %i0,%o1
	sll %o0,2,%o0
	b L672
	add %o0,%i0,%o0
L677:
	st %o2,[%o1]
	add %o1,4,%o1
L672:
	cmp %o1,%o0
	bl L677
	nop
	ret
	restore



-Doug

grunwald@foobar.colorado.edu (Dirk Grunwald) (10/26/89)

I think that a lot of the effort spent on ``cleaning up inlining'' &
the like would be better spent on global optimization, although that
process is much harder.

I'm thinking of things along the lines of the compilation to symbolic
intermediate (rtl?), and later ``linking'' of the intermediate form
followed by a global procedure integration & register allocation pass.

Examples of this include the MIPS compilers (using ucode as
intermediate) and the work at DEC-WRL for the titan (mahler
intermediate code).

This allows much better procedure integration and you'll improve
register allocation in leaf procedures. Also, it would make G++ much
nicer because your header files would only include class declerations
- there would be not need to include inline procedure bodies in the
header files.

tiemann@LURCH.STANFORD.EDU (Michael Tiemann) (10/26/89)

    The code generated for inline fill() is  about the best you could get
    anywhere:

    _fill__FR10Vec1000000f:
	    !#PROLOGUE# 0
	    save %sp,-112,%sp
	    !#PROLOGUE# 1
	    mov %i1,%o2
	    sethi %hi(1000000),%o0
	    or %lo(1000000),%o0,%o0
	    mov %i0,%o1
	    sll %o0,2,%o0
	    b L672
	    add %o0,%i0,%o0
    L677:
	    st %o2,[%o1]
	    add %o1,4,%o1
    L672:
	    cmp %o1,%o0
	    bl L677
	    nop
	    ret
	    restore



    -Doug

No, that code is bad.  GCC with my scheduling extensions would make
code which looked like this (honest!):

    _fill__FR10Vec1000000f:
	    !#PROLOGUE# 0
	    !#PROLOGUE# 1
	    sethi %hi(1000000),%o2
	    or %lo(1000000),%o2,%o2
	    sll %o2,2,%o2
	    add %o2,%o0,%o2
	    st %o1,[%o0]
    L677:
	    add %o0,4,%o0
    L672:
	    cmp %o0,%o2
	    bl L678
	    st %o1,[%o0]
	    ret
	    nop

or something like it.  There's lots of fun you can have with an
optimizer.

Michael

beshers@brcsun2.uconn.edu (George Beshers) (10/28/89)

I agree with Dirk's sentiment and would add another very important
reason: templates (e.g., generics).  I recently have been programming
a great deal in NJML which is basically SML.  The experience has been
*very* rewarding except for the size/time requirements of the system.
Basically, I have yet to write a piece of code twice because of
some change in typing---a major accomplishment in terms of code
reuse!!

			George Beshers
			beshers@uconn.edu