othar@ernie.Berkeley.EDU (Othar Hansson) (10/25/89)
Before I rant, my thanks to the people who wrote the classes in libg++. You made my program work! And reduced its size to a tenth of that of the previous implementation. Now for the whining... I'm finding that I pay a real price in speed when I use some of the libg++ library-derived classes. For example, SplaySets are used extensively in my code, and in a particularly crucial loop, elements are retrieved from two SplaySets and multiplied -- my goal is to get around that loop as many times as possible. Unfortunately, functions like __SplaySet::next() cannot be inlined, as they are not in the class header file -- they end up in another object file. This function call (and other accesses of different classes in the same loop) costs me a significant constant in the speed around that loop. A lot of simple functions (even basic accesses like class::operator[] and class::operator()) suffer from this same problem, and they are precisely the functions that are called thousands/millions of times in typical code. While I wouldn't like to read an enormous header file for each compilation, when I actually set my program running for 6 days, I wouldn't mind the cost. Can anyone think of a solution to this problem? My initial modest proposal is that it would be nice if there were a way to optionally GENCLASS a header with lots of inlines. Alternately, there could be a compiler directive used in file X.cc, to fetch a function from Y.cc for inlining in the compilation of X.s and thence X.o. The technology for doing this would probably enable lots of other neat options as well... Tossing this out and hoping someone smarter than I will solve it, Othar Hansson CS Division UC Berkeley othar@ernie.berkeley.edu
dl@G.OSWEGO.EDU (Doug Lea) (10/25/89)
> I'm finding that I pay a real price in speed when I use some of the > libg++ library-derived classes. For example, SplaySets are used > extensively in my code, and in a particularly crucial loop, elements > are retrieved from two SplaySets and multiplied -- my goal is to get > around that loop as many times as possible. > > Unfortunately, functions like __SplaySet::next() cannot be inlined, as > they are not in the class header file -- they end up in another object > file. This function call (and other accesses of different classes in > the same loop) costs me a significant constant in the speed around > that loop. A lot of simple functions (even basic accesses like > class::operator[] and class::operator()) suffer from this same > problem, and they are precisely the functions that are called > thousands/millions of times in typical code. > Decisions about what to inline in a library are hard, and sometimes arbitrary. I usually have to guess about which things are both short enough and well-used enough to be worth declaring as inline. These decisions never please everyone -- I get just about as many complaints that there are too many inlines in libg++, making compiles unbearable slow, as complaints that there are not enough inlines, hurting performance. Michael and I are toying with employing a numerical INLINE_LEVEL compiler switch, with levels that correspond to more or less inlining (thus, faster or slower compilation). A bunch of compilation and linkage details have to be worked out before we can implement this. The specific case at hand (SplaySets) is tougher. The succ() function called by next() is inherently a loop that is not very amenable to inlining -- inlining it might only give you a couple of percent speedup, depending, in part, on the kind of machine you are using. In your case, it might make sense to use SplaySets for the dynamic aspects of your application, and then to transfer things to something like an OSLSet, which supports a faster next() method, for traversal. You might even transfer things (or, more likely, the corresponding pointers) into a simple local array and step through it. About your previous note requesting libg++ Graph classes. These are on my (much too long) list of things to do. If anyone would like to donate something before then, or collaborate on them, please do! -Doug
rjc@maui.cs.ucla.edu (Robert Collins) (10/26/89)
In article <8910251101.AA29989@g.oswego.edu> dl@oswego.oswego.edu writes: > >> I'm finding that I pay a real price in speed when I use some of the >> libg++ library-derived classes. > >Decisions about what to inline in a library are hard, and sometimes >arbitrary. I usually have to guess about which things are both short >enough and well-used enough to be worth declaring as inline. These >decisions never please everyone -- I get just about as many complaints >that there are too many inlines in libg++, making compiles unbearable >slow, as complaints that there are not enough inlines, hurting >performance. I encountered this very early in the process of writing my Connection Machine library. >Michael and I are toying with employing a numerical INLINE_LEVEL >compiler switch, with levels that correspond to more or less inlining >(thus, faster or slower compilation). A bunch of compilation and >linkage details have to be worked out before we can implement this. In a sense, this was my solution. If -O is specified, I do lots of inlining, without it, I do little or none. Each class in my library has two files, a .h and a .cc. They look something like this: foo.cc: #define _FOO_CC_ #include "foo.h" <code> foo.h: class foo { }; #if defined(_FOO_CC_) #define inline #endif #if defined(__OPTIMIZE__) || defined(_FOO_CC_) <the inline function definitions> #endif #if defined(_FOO_CC_) #define inline inline #endif When compiling the library, a non-inline version gets put in foo.o, but the parts of the library that use class foo can use inline versions (if the library is compiled with -O). When the user of the library is compiling, they can compile some files with and others without -O. When I get in a sitution where I want hundreds, or in some cases thousands of inline functions, I will put them in a separate .h file. foo.cc: #define _FOO_CC_ #include "foo.h" <code> foo.h: class foo { }; #if defined(_FOO_CC_) #define inline #endif #if defined(__OPTIMIZE__) || defined(_FOO_CC_) #include "foo2.h" #endif #if defined(_FOO_CC_) #define inline inline #endif foo2.h: <define oodles of inline functions> This way, the huge number of function definitions is not scanned unless it is necessary. This simple scheme has helped a whole lot. I still have a big problem deciding which functions to inline. In a number of cases, g++ tells me that a function is too large to inline. Retorical question: if I say inline, why doesn't it inline? Who knows better, me or the compiler? (I know, the answer is "compiler", but humor me, ok?). But seriously, it would be nice to be able to force the inlining of what the compiler thinks are very large functions. At least in my code, there are lots of places where the compiler can evaluate a conditional at compile time, eliminating lots of code as unreachable (if the function is expanded inline). The problem with all these inline functions (besides compile time) is that I suffer from a serious case of "executable bloat". I mean I get these HUGE multi-meg executables. This problem can be alieviated somewhat by better optimization in the compiler and by better choice in which functions to expand inline. More on my optimization complaints coming soon to a mailing list near you... rob collins rjc@cs.ucla.edu PS For those who are interested in the C++/CM interface, it is about 75% done. It has stalled, due to other commitments and massive computer system changes/problems (latest weirdness: I left town for a couple of days, and when I came back this morning, all of the machine names had been changed--strange). ------------------------------------------------------------------------------- rjc@cs.ucla.edu C++/Paris on the CM2: The only way to fly. -------------------------------------------------------------------------------
tiemann@LURCH.STANFORD.EDU (Michael Tiemann) (10/26/89)
Another solution to coding around the cost of function calls is to use a RISC workstantion where that cost is negligible. Michael
dl@G.OSWEGO.EDU (Doug Lea) (10/26/89)
Michael said... > Another solution to coding around the cost of function calls is to use > a RISC workstantion where that cost is negligible But this is only part of the story. The *primary* goal of inlining on RISC machine like a SPARC is not really saving a cycle or two for the function call, but rather `procedural integration'. If inlines reveal to the compiler some important constraints, constants, etc., that make further simplifications and optimizations possible, it can be a HUGE win. Try this: #include <builtin.h> #include <stream.h> class Vec1000000 { float s[1000000]; public: Vec1000000() {} ~Vec1000000() {} int size(); float& sub(int i); int ni_size(); // non-inline versions float& ni_sub(int i); virtual int v_size(); // virtual non-inline versions virtual float& v_sub(int i); }; inline int Vec1000000::size() { return 1000000; } int Vec1000000::ni_size() { return 1000000; } int Vec1000000::v_size() { return 1000000; } inline float& Vec1000000::sub(int i) { return s[i]; } float& Vec1000000::ni_sub(int i) { return s[i]; } float& Vec1000000::v_sub(int i) { return s[i]; } void fill(Vec1000000& v, float x) { for (int i = 0; i < v.size(); ++i) v.sub(i) = x; } void ni_fill(Vec1000000& v, float x) { for (int i = 0; i < v.ni_size(); ++i) v.ni_sub(i) = x; } void v_fill(Vec1000000& v, float x) { for (int i = 0; i < v.v_size(); ++i) v.v_sub(i) = x; } main() { Vec1000000 v; start_timer(); fill(v, 1.0); cout << "fill time = " << return_elapsed_time(0.0) << "\n"; start_timer(); ni_fill(v, 1.0); cout << "ni_fill time = " << return_elapsed_time(0.0) << "\n"; start_timer(); v_fill(v, 1.0); cout << "v_fill time = " << return_elapsed_time(0.0) << "\n"; } g++ -g -O -fstrength-reduce -fdelayed-branch t1026.cc -lg++ Compilation finished at Thu Oct 26 05:51:39 (on a Sun4/110) g.oswego.edu% a.out fill time = 0.71 ni_fill time = 1.93 v_fill time = 4.41 g.oswego.edu% !! a.out fill time = 0.63 ni_fill time = 1.9 v_fill time = 3.58 g.oswego.edu% !! a.out fill time = 0.58 ni_fill time = 1.89 v_fill time = 4.41 g.oswego.edu% So inlining gave a > 2.5X speedup over noninlines, and > 5X over virtuals on the Sparc. It's not too hard to make examples where it's closer to 10X. The code generated for inline fill() is about the best you could get anywhere: _fill__FR10Vec1000000f: !#PROLOGUE# 0 save %sp,-112,%sp !#PROLOGUE# 1 mov %i1,%o2 sethi %hi(1000000),%o0 or %lo(1000000),%o0,%o0 mov %i0,%o1 sll %o0,2,%o0 b L672 add %o0,%i0,%o0 L677: st %o2,[%o1] add %o1,4,%o1 L672: cmp %o1,%o0 bl L677 nop ret restore -Doug
grunwald@foobar.colorado.edu (Dirk Grunwald) (10/26/89)
I think that a lot of the effort spent on ``cleaning up inlining'' & the like would be better spent on global optimization, although that process is much harder. I'm thinking of things along the lines of the compilation to symbolic intermediate (rtl?), and later ``linking'' of the intermediate form followed by a global procedure integration & register allocation pass. Examples of this include the MIPS compilers (using ucode as intermediate) and the work at DEC-WRL for the titan (mahler intermediate code). This allows much better procedure integration and you'll improve register allocation in leaf procedures. Also, it would make G++ much nicer because your header files would only include class declerations - there would be not need to include inline procedure bodies in the header files.
tiemann@LURCH.STANFORD.EDU (Michael Tiemann) (10/26/89)
The code generated for inline fill() is about the best you could get anywhere: _fill__FR10Vec1000000f: !#PROLOGUE# 0 save %sp,-112,%sp !#PROLOGUE# 1 mov %i1,%o2 sethi %hi(1000000),%o0 or %lo(1000000),%o0,%o0 mov %i0,%o1 sll %o0,2,%o0 b L672 add %o0,%i0,%o0 L677: st %o2,[%o1] add %o1,4,%o1 L672: cmp %o1,%o0 bl L677 nop ret restore -Doug No, that code is bad. GCC with my scheduling extensions would make code which looked like this (honest!): _fill__FR10Vec1000000f: !#PROLOGUE# 0 !#PROLOGUE# 1 sethi %hi(1000000),%o2 or %lo(1000000),%o2,%o2 sll %o2,2,%o2 add %o2,%o0,%o2 st %o1,[%o0] L677: add %o0,4,%o0 L672: cmp %o0,%o2 bl L678 st %o1,[%o0] ret nop or something like it. There's lots of fun you can have with an optimizer. Michael
beshers@brcsun2.uconn.edu (George Beshers) (10/28/89)
I agree with Dirk's sentiment and would add another very important reason: templates (e.g., generics). I recently have been programming a great deal in NJML which is basically SML. The experience has been *very* rewarding except for the size/time requirements of the system. Basically, I have yet to write a piece of code twice because of some change in typing---a major accomplishment in terms of code reuse!! George Beshers beshers@uconn.edu