shirley@iuvax.cs.indiana.edu (peter shirley) (01/09/91)
I have a C++ program that takes about 170% as long as a similar C program.
I've run it on a vax and a sun with and without big-O.
I've inlined everything I can, and I don't understand the slowdown.
Could someone enlighten me? (yes, I looked at the C code generated
by CC, and am too dumb to understand it).
The C, then C++ code follow.
Thanks,
Pete Shirley
shirley@cs.indiana.edu
**************** begin C code *********************
#include "stdio.h"
main()
{
int i;
double a[3], b[3];
a[0] = a[1] = a[2] = 0.0;
b[0] = 1.0;
b[1] = 2.0;
b[2] = 4.0;
for ( i = 0; i < 1000000; i++)
{
a[0] = a[0] + b[0];
a[1] = a[1] + b[1];
a[2] = a[2] + b[2];
}
fprintf(stderr, "%lf %lf %lf\n", a[0], a[1], a[2]);
}
************* end C code *******************
************** begin C++ code *********************
#include <math.h>
#include <stream.h>
class vector {
protected:
double data[3];
public:
vector();
vector(double, double, double);
void set(double, double, double);
void operator=(vector&);
friend ostream& operator<<(ostream&, vector&);
friend vector operator+(vector&, vector&);
};
inline vector::vector()
{
}
inline vector::vector(double a, double b, double c)
{
data[0] = a; data[1] = b; data[2] = c;
}
inline void vector::set(double a, double b, double c)
{
data[0] = a; data[1] = b; data[2] = c;
}
inline void vector::operator=(vector& v)
{
data[0] = v.data[0];
data[1] = v.data[1];
data[2] = v.data[2];
}
inline vector operator+(vector& u, vector& v)
{
vector temp;
temp.data[0] = u.data[0] + v.data[0];
temp.data[1] = u.data[1] + v.data[1];
temp.data[2] = u.data[2] + v.data[2];
return temp;
}
inline ostream& operator<<(ostream& s, vector& t)
{
return s << "(" << t.data[0] << ", "<< t.data[1] <<", " << t.data[2] <<")";
}
main()
{
vector a(0.0, 0.0, 0.0);
vector b(1.0, 2.0, 4.0);
for (int i = 0; i < 1000000; i++)
a = a + b;
cerr << a << "\n";
}
************** end C++ code ***********************shirley@iuvax.cs.indiana.edu (peter shirley) (01/09/91)
Several people on this group and by mail suggested I needed a += operation for vectors. I tried this, and on our vax it got the C++ code to be as fast as the C code. On a sparc, it was as fast with "big-O" turned off, but the C code was much better with the optimizer on. Maybe cc was smart enough to change something like: a = 0; for (i=0; i < 1000000; i++) a = a + 1; to: a = 1000000; Anyway, things are working great now. Thank you all for the quick help! pete shirley
jak@cs.brown.edu (Jak Kirman) (01/09/91)
In article <1991Jan9.002244.23398@news.cs.indiana.edu> shirley@iuvax.cs.indiana.edu (peter shirley) writes: I have a C++ program that takes about 170% as long as a similar C program. I've run it on a vax and a sun with and without big-O. I've inlined everything I can, and I don't understand the slowdown. Could someone enlighten me? (yes, I looked at the C code generated by CC, and am too dumb to understand it). [code using a vector class, adding vectors with operator+] Part of the problem at least is that you are using a function which returns an object. Typically what happens in this case (at least under cfront) is that the function is called, the return value copied into a temporary, and then the assignment function is called to copy the temporary into the destination. So with vector a, b, and a + operator which returns a vector, a = a + b; will result in a call to the + operator, which returns a vector. This vector is copied into the temporary by C, and then the assignment operator is called. So you are doing nearly twice the work. With a += operator, changing that line to a += b the time was about the same without optimization on a Sparc with AT&T 2.0. The time was still substantially faster in C with optimization; I am not sure what optimizations are being performed in C but not C++. Ellis & Stroustrup ARM 12.1.1c provides an explanation of why the optimizations necessary to avoid the extra copying would be very difficult for most compilers, and points out that in the case of initialization it often can be optimized. Below is a simple example of C++ code and a cleaned-up version of what is produced by cfront. I have changed the names of the temporaries and unmangled the names, and performed a few trivial "simplifications" of the code, such as changing (a,b) to a;b where the result of (a,b) is not used. In the example, when the result of foo is assigned to an existing X object, a temporary is created. When the result of foo is used to initialize an X object, no temporary is created. Note that things would be different were there a copy constructor for X; the function foo would be changed to take a pointer to the location where the result should be created. struct X { int a; int b; void operator= (const X& o) { a = o.a; b = o.b; } }; X foo () { X myx; return myx; } main () { X x1; x1 = foo (); X x2 = foo (); } struct X { int a ; int b ; }; struct X foo () { struct X myx ; return myx ; /* note that C returns a structure */ } int main () { _main(); { struct X x1 ; struct X x2 ; struct X *p ; { struct X tmp ; tmp = foo (); /* structure copied into tmp */ p = &tmp; (& x1 )-> a = *p.a ; /* operator= used to copy tmp into x1 */ (& x1 )-> b = *p.b ; x2 = foo ( ) ; /* no copy necessary } } }
ggs@ulysses.att.com (Griff Smith) (01/09/91)
In article <1991Jan9.002244.23398@news.cs.indiana.edu>, shirley@iuvax.cs.indiana.edu (peter shirley) writes: > I have a C++ program that takes about 170% as long as a similar C program. > I've run it on a vax and a sun with and without big-O. > I've inlined everything I can, and I don't understand the slowdown. The C program is written in terms of basic operations that the optimizer can easily improve; the C++ version isn't. For example: > a[0] = a[0] + b[0]; This is better written as a[0] += b[0]; though modern compilers can usually make the two cases equivalent. The C++ version is not so fortunate. You write > a = a + b; Since `a' is a structure, your `+' operator first has to store the sum in your `temp', then your `=' operator must copy it into `a'. This takes a string move at best, and a slow function call at worst. I tried redefining your `+' operator to be a `vector& +=' operator and my execution time on a tahoe became 6.0 seconds instead of 14.5. That's still not quite as good as 4.5 for the C version, but it explains most of the difference. > Could someone enlighten me? (yes, I looked at the C code generated > by CC, and am too dumb to understand it). C++ output isn't intelligible to most mortals; this mortal frequently has trouble with C++ input. I often find it easier to use adb to look at the assembly language. -- Griff Smith AT&T (Bell Laboratories), Murray Hill Phone: 1-201-582-7736 UUCP: {most AT&T sites}!ulysses!ggs Internet: ggs@ulysses.att.com
steve@Pkg.Mcc.COM (Steve Madere) (01/10/91)
In article <1991Jan9.002244.23398@news.cs.indiana.edu>, shirley@iuvax.cs.indiana.edu (peter shirley) writes: | I have a C++ program that takes about 170% as long as a similar C program. | I've run it on a vax and a sun with and without big-O. | I've inlined everything I can, and I don't understand the slowdown. | Could someone enlighten me? (yes, I looked at the C code generated | by CC, and am too dumb to understand it). | <Stuff deleted> <C program segment folows> | for ( i = 0; i < 1000000; i++) | { | a[0] = a[0] + b[0]; | a[1] = a[1] + b[1]; | a[2] = a[2] + b[2]; | } <C++ program segment follows> | inline void vector::operator=(vector& v) | { | | data[0] = v.data[0]; | data[1] = v.data[1]; | data[2] = v.data[2]; | | } | | | | inline vector operator+(vector& u, vector& v) | { | vector temp; | | temp.data[0] = u.data[0] + v.data[0]; | temp.data[1] = u.data[1] + v.data[1]; | temp.data[2] = u.data[2] + v.data[2]; | return temp; | } <Stuff deleted> | for (int i = 0; i < 1000000; i++) | a = a + b; | | cerr << a << "\n"; | } At this point the cause of the speed problems should be obvious. Your C++ program is not quite equivalent to your C program. To make them equivalent, you should have the following C program segment substituted for the one above. for (i=0;i<1000000;i++) { temp = calloc(3,sizeof(*temp)); temp[0] = a[0] + b[0]; temp[1] = a[1] + b[1]; temp[2] = a[2] + b[2]; a[0]=temp[0]; a[1]=temp[1]; a[2]=temp[2]; free(temp); } Once this is done, I think your two programs will probably run at the same speed :-(. This is a problem with C++ that I have known about for some time. The lure of simplified notation leads one to try to implement vector and matrix operations with special classes that will hide the complexity. Unfortunately, all of those temporary variables that must be created to realize any mathematical operations really waste CPU time. It takes a finite amount of time to allocate space, copy data, and free up the space again. In many cases it is a significant percentage of the total run time of the program. A friend of mine at UCSD has worked on this problem for some time and concluded that what one needs is some kind of matrix awareness in the compiler to recognize when operations can be somehow combined. Clearly this is not the kind of thing that C++ was designed to do. It is appropriate to add matrix awareness to a computational language like FORTRAN but C++ is not a scientific computation language. (no flames please, I just mean that its PRIMARY purpose is not scientific computation). Maybe what one needs is a preprocessed language that takes matrix operations, turns them into optimized C code which you can link with your C++ program. Any suggestions from the peanut gallery? Steve Madere steve@pkg.mcc.com
steve@taumet.com (Stephen Clamage) (01/10/91)
shirley@iuvax.cs.indiana.edu (peter shirley) writes: >I have a C++ program that takes about 170% as long as a similar C program. >I've run it on a vax and a sun with and without big-O. >I've inlined everything I can, and I don't understand the slowdown. >inline vector operator+(vector& u, vector& v) >{ > vector temp; > temp.data[0] = u.data[0] + v.data[0]; > temp.data[1] = u.data[1] + v.data[1]; > temp.data[2] = u.data[2] + v.data[2]; > return temp; >} This is your problem. You create a local object, then return it by value. This must then be copied into the target variable. The equivalent C code looks like this: typedef struct { double data[3]; } vector; vector plus( vector* this, vector* u, vector* v) { ...the body is the same as above... } main() { vector u, v, w; ... w = plus(&w, &u, &v); ... } Compare w = plus(&w, &u, &v); to w.data[0] = u.data[0] + v.data[0]; w.data[1] = u.data[1] + v.data[1]; w.data[2] = u.data[2] + v.data[2]; There is an extra structure copy using plus() instead of direct code, and the copy is in the inner loop. Try rewriting operator+ as: inline vector operator+(vector& u, vector& v) { return vector(u.data[0] + v.data[0], u.data[1] + v.data[1], u.data[2] + v.data[2]); } Many C++ compilers are smart enough not to create a temporary object, but to construct the new vector directly in the target variable. -- Steve Clamage, TauMetric Corp, steve@taumet.com
vaughan@mcc.com (Paul Vaughan) (01/10/91)
From: shirley@iuvax.cs.indiana.edu (peter shirley)
Newsgroups: comp.lang.c++
Date: 9 Jan 91 05:22:13 GMT
Organization: Computer Science, Indiana University
Lines: 115
I have a C++ program that takes about 170% as long as a similar C program.
I've run it on a vax and a sun with and without big-O.
I've inlined everything I can, and I don't understand the slowdown.
Could someone enlighten me? (yes, I looked at the C code generated
by CC, and am too dumb to understand it).
The answer is that the C++ version has to construct and copy a
temporary vector. Here are some timings I made that show this effect
by noting the difference between using your version and using a
vector::operator+= that avoids the copy.
These timings were all on a
SUN4. Note the user time.
Using a = a + b
[1.95]sunspot) g++-1.90.04 -O -o trash trash.cc -lg++
trash.cc: In function class vector operator + (class vector &, class vector &):
trash.cc:62: warning: bitwise copy: `vector' defines operator=()
[1.96]sunspot) /bin/time trash
(1e+06, 2e+06, 4e+06)
32.1 real 10.0 user 0.2 sys
(Ignore the warning from g++, it's inappropriate)
Using a += b;
[1.93]sunspot) g++-1.90.04 -O -o trash trash.cc -lg++
trash.cc: In function class vector operator + (class vector &, class vector &):
trash.cc:62: warning: bitwise copy: `vector' defines operator=()
[1.94]sunspot) /bin/time trash
(1e+06, 2e+06, 4e+06)
13.3 real 5.2 user 0.0 sys
Using a = a + b;
[1.97]sunspot) CC -O -o trash trash.cc
CC trash.cc:
cc -o /d2/prism/prism/src/sim/src/trash -I/net/sunspot/usr/cadsw/CC/sun4/incl -O trash.c -L/net/sunspot/usr/cadsw/CC/sun4/ -lC
[1.98]sunspot) /bin/time trash
(1e+06, 2e+06, 4e+06)
23.0 real 8.2 user 0.1 sys
Using a += b;
[1.99]sunspot) CC -O -o trash trash.cc
CC trash.cc:
cc -o /d2/prism/prism/src/sim/src/trash -I/net/sunspot/usr/cadsw/CC/sun4/incl -O trash.c -L/net/sunspot/usr/cadsw/CC/sun4/ -lC
[1.100]sunspot) /bin/time trash
(1e+06, 2e+06, 4e+06)
8.8 real 3.4 user 0.1 sys
Here is the vector::operator+= that I added:
inline vector& vector::operator+=(vector& v)
{
data[0] += v.data[0];
data[1] += v.data[1];
data[2] += v.data[2];
return *this;
}
The concept of a named return value that a compiler could use to
optimize away this sort of temporary copy has been proposed. I think
it was Doug Lea who came up with that, but I'm not sure. I don't know
the status of that suggestion.
Finally, here is the original code.
The C, then C++ code follow.
Thanks,
Pete Shirley
shirley@cs.indiana.edu
**************** begin C code *********************
#include "stdio.h"
main()
{
int i;
double a[3], b[3];
a[0] = a[1] = a[2] = 0.0;
b[0] = 1.0;
b[1] = 2.0;
b[2] = 4.0;
for ( i = 0; i < 1000000; i++)
{
a[0] = a[0] + b[0];
a[1] = a[1] + b[1];
a[2] = a[2] + b[2];
}
fprintf(stderr, "%lf %lf %lf\n", a[0], a[1], a[2]);
}
************* end C code *******************
************** begin C++ code *********************
#include <math.h>
#include <stream.h>
class vector {
protected:
double data[3];
public:
vector();
vector(double, double, double);
void set(double, double, double);
void operator=(vector&);
friend ostream& operator<<(ostream&, vector&);
friend vector operator+(vector&, vector&);
};
inline vector::vector()
{
}
inline vector::vector(double a, double b, double c)
{
data[0] = a; data[1] = b; data[2] = c;
}
inline void vector::set(double a, double b, double c)
{
data[0] = a; data[1] = b; data[2] = c;
}
inline void vector::operator=(vector& v)
{
data[0] = v.data[0];
data[1] = v.data[1];
data[2] = v.data[2];
}
inline vector operator+(vector& u, vector& v)
{
vector temp;
temp.data[0] = u.data[0] + v.data[0];
temp.data[1] = u.data[1] + v.data[1];
temp.data[2] = u.data[2] + v.data[2];
return temp;
}
inline ostream& operator<<(ostream& s, vector& t)
{
return s << "(" << t.data[0] << ", "<< t.data[1] <<", " << t.data[2] <<")";
}
main()
{
vector a(0.0, 0.0, 0.0);
vector b(1.0, 2.0, 4.0);
for (int i = 0; i < 1000000; i++)
a = a + b;
cerr << a << "\n";
}
************** end C++ code ***********************
--
Paul Vaughan, MCC CAD Program | ARPA: vaughan@mcc.com | Phone: [512] 338-3639
Box 200195, Austin, TX 78720 | UUCP: ...!cs.utexas.edu!milano!cadillac!vaughan
----------------------------------------------------------------------------------
I spent from $3 to $10 today to pay interest on the national debt. How about you?
----------------------------------------------------------------------------------
Paul Vaughan, MCC CAD Program | ARPA: vaughan@mcc.com | Phone: [512] 338-3639
Box 200195, Austin, TX 78720 | UUCP: ...!cs.utexas.edu!milano!cadillac!vaughan
----------------------------------------------------------------------------------jimad@microsoft.UUCP (Jim ADCOCK) (01/11/91)
In article <1991Jan9.101212@Pkg.Mcc.COM> steve@Pkg.Mcc.COM (Steve Madere) writes |This is a problem with C++ that I have known about for some time. |The lure of simplified notation leads one to try to implement |vector and matrix operations with special classes that will hide |the complexity. Unfortunately, all of those temporary variables |that must be created to realize any mathematical operations really |waste CPU time. It takes a finite amount of time to allocate space, |copy data, and free up the space again. In many cases it is a |significant percentage of the total run time of the program. | |A friend of mine at UCSD has worked on this problem for some time |and concluded that what one needs is some kind of matrix awareness |in the compiler to recognize when operations can be somehow |combined. Clearly this is not the kind of thing that C++ was designed |to do. It is appropriate to add matrix awareness to a computational |language like FORTRAN but C++ is not a scientific computation |language. (no flames please, I just mean that its PRIMARY purpose |is not scientific computation). | |Any suggestions from the peanut gallery? My claim would be that if one is serious about working with large objects -- say large matrices and/or large vectors, then one can afford to "do the job right" in C++ -- which means probably using reference counting on the internal storage of the vectors and/or matrices, while maintaining an external polymorphic wrapper over those guts which represents the exact form of the vector or matrix: is this matrix a "plain old matrix", tridiagonal, identity, .... [In fact, probably some vendors already provide such libraries] If you have a parallel computer, and a highly parallelizing fortran compiler, you might want to write some of the guts of the internal representation of these matrices in fortran -- but better yet, you'll probably find that your vendor already provides highly optimized routines for performing those guts -- linpack-type stuff. There's no reason why you can't have both the nice syntax of C++, and the speed of fortran -- if you're willing to do a little serious work. On the other hand, if you don't care, well, then you don't care.
Bruce.Hoult@bbs.actrix.gen.nz (01/11/91)
Peter Shirley writes: >I have a C++ program that takes about 170% as long as a similar C program. >I've run it on a vax and a sun with and without big-O. >I've inlined everything I can, and I don't understand the slowdown. >Could someone enlighten me? (yes, I looked at the C code generated >by CC, and am too dumb to understand it). I get the same thing on my Mac IIcx: the C program takes 31 seconds, and the C++ one takes 54. The problem is that the compiler isn't all that smart and is (on my system, and probably on yours) copying the result of operator+ twice on the way to its final destination -- the first time using a bitwise copy (the default copy constructor), and the second time using the operator= that you defined. One way to work around the problem is to define an operator+=. This will drop the execution time down to virtually the same as for the C program. Another way is to put the output of CFront through a smarter C compiler. I suspect that GCC would do a rather better job of optomising the mess that CFront leaves behind. -- Bruce.Hoult@bbs.actrix.gen.nz Twisted pair: +64 4 772 116 BIX: brucehoult Last Resort: PO Box 4145 Wellington, NZ
mat@mole-end.UUCP (Mark A Terribile) (01/11/91)
> > I have a C++ program that takes about 170% as long as a similar C program. > > ... I don't understand the slowdown. > The C program is written in terms of basic operations that the optimizer > can easily improve; the C++ version isn't. For example: > > a[0] = a[0] + b[0]; > This is better written as > a[0] += b[0]; > though modern compilers can usually make the two cases equivalent. The > C++ version is not so fortunate. You write > > a = a + b; > Since `a' is a structure, your `+' operator first has to store the sum in > your `temp', then your `=' operator must copy it into `a'. This takes a > string move at best, and a slow function call at worst. I tried redefining > your `+' operator to be a `vector& +=' operator and my execution time on > a tahoe became 6.0 seconds instead of 14.5. That's still not quite as > good as 4.5 for the C version, but it explains most of the difference. There's also some genuinely bad use of C++ initialization rules around returns and parameters. Fixing it won't make the problem go away, but it will probably shave about half off the extra cost. In the operator+() , you force the loading of a temporary vector, then a copy from the temporary to the return value. Sometimes this is unavoidable; here it is not. Instead of writing inline vector operator+(vector& u, vector& v) { vector temp; temp.data[0] = u.data[0] + v.data[0]; temp.data[1] = u.data[1] + v.data[1]; temp.data[2] = u.data[2] + v.data[2]; return temp; } write inline vector operator+(vector& u, vector& v) { return vector( u.data[0] + v.data[0], u.data[1] + v.data[1], u.data[2] + v.data[2] ); } There is no absolute guarantee that the compiler will not build a temporary and do the copy, but at least some, under at least some circumstances, will build the return value in the place in which it is to finally rest. Class/struct return is usually implemented, and especially in cfront, by passing a pointer to a place wherein the return value is to be built, or by providing extra stack space for the return value. In either case, writing the `value builder' constructor in the return statement gives the compiler permission to optimize away the copy constructor invocation by building the result in place. > > Could someone enlighten me? (yes, I looked at the C code generated > > by CC, and am too dumb to understand it). It's not that bad. Use a good editor to match up the parens and split the lines to exhibit the parentheses nesting, then remove truly redundant parens. Remember that `;' becomes `,' and if/then/else is implemented with ?: . A `0' is used to represent a null-else. Most of the rest of the opacity comes from name mangling and the nested- variable prefixes. Those you will get the hang of in time. -- (This man's opinions are his own.) From mole-end Mark Terribile
jbuck@galileo.berkeley.edu (Joe Buck) (01/16/91)
In article <1877@ruunsa.fys.ruu.nl>, muts@fysaj.fys.ruu.nl (Peter Mutsaers /1000000) writes: |> It is a pity that the c++ optimizer does not (yet) automatically |> change a = a + b into a+=b if operator+= is defined too. It would |> reduce however the freedom to use operators like += *= etc. I got this trick from Andy Koenig. Whenever I write a class with arithmetic operators, I always write +=, *=, etc, as member functions and then write (let's say the class is Foo) inline Foo operator+ (const Foo& a, const Foo& b) { Foo tmp(a); tmp += b; return tmp; } Notice that this function does not need to be a friend (so when people ask me whether arithmetic operators should be members or friends, I reply, neither). With this kind of structure, most optimizers can convert func (Foo& x, Foo& y) { ... Foo c = x + y; ... } into Foo c(x); c += y; with little trouble. -- Joe Buck jbuck@galileo.berkeley.edu {uunet,ucbvax}!galileo.berkeley.edu!jbuck