[comp.lang.c++] why is this program slow?

shirley@iuvax.cs.indiana.edu (peter shirley) (01/09/91)

I have a C++ program that takes about 170% as long as a similar C program.
I've run it on a vax and a sun with and without big-O.
I've inlined everything I can, and I don't understand the slowdown.
Could someone enlighten me?  (yes, I looked at the C code generated
by CC, and am too dumb to understand it).

The C, then C++ code follow.

Thanks,

Pete Shirley
shirley@cs.indiana.edu

**************** begin C code *********************

#include "stdio.h"


main()
{
    int i;
    double a[3], b[3]; 

    a[0] = a[1] = a[2] = 0.0;
    b[0] = 1.0;
    b[1] = 2.0;
    b[2] = 4.0;

    for ( i = 0; i < 1000000; i++)
    {
        a[0] = a[0] + b[0];
        a[1] = a[1] + b[1];
        a[2] = a[2] + b[2];
    }

    fprintf(stderr, "%lf %lf %lf\n", a[0], a[1], a[2]);
}


************* end C code *******************

************** begin C++ code *********************

#include <math.h>
#include <stream.h>


class vector {
protected:
    double data[3];
public:
    vector();
    vector(double, double, double);
    void set(double, double, double);
    void operator=(vector&);
    friend ostream& operator<<(ostream&, vector&);
    friend vector operator+(vector&, vector&);
};


inline vector::vector()
{
}


inline vector::vector(double a, double b, double c)
{
   data[0] = a;  data[1] = b;  data[2] = c;
}


inline void vector::set(double a, double b, double c)
{
   data[0] = a;  data[1] = b;  data[2] = c;
}

inline void vector::operator=(vector& v)
{

   data[0] = v.data[0];
   data[1] = v.data[1];
   data[2] = v.data[2];

}



inline vector operator+(vector& u, vector& v)
{
    vector temp;

    temp.data[0] = u.data[0] + v.data[0];
    temp.data[1] = u.data[1] + v.data[1];
    temp.data[2] = u.data[2] + v.data[2];
    return temp;
}

inline ostream& operator<<(ostream& s, vector& t)
{
   return s << "(" << t.data[0] << ", "<< t.data[1] <<", " << t.data[2] <<")";
}


main()
{
    vector a(0.0, 0.0, 0.0);
    vector b(1.0, 2.0, 4.0);

    for (int i = 0; i < 1000000; i++)
        a = a + b;

    cerr << a << "\n";
}

************** end C++ code ***********************

shirley@iuvax.cs.indiana.edu (peter shirley) (01/09/91)

Several people on this group and by mail suggested I needed a +=
operation for vectors.  I tried this, and on our vax it got the C++
code to be as fast as the C code.  On a sparc, it was as fast with "big-O"
turned off, but the C code was much better with the optimizer on.  Maybe
cc was smart enough to change something like:

	a = 0;
	for (i=0; i < 1000000; i++)
	   a = a + 1;

to:

	a = 1000000;


Anyway, things are working great now.  Thank you all for the quick help!

pete shirley

jak@cs.brown.edu (Jak Kirman) (01/09/91)

In article <1991Jan9.002244.23398@news.cs.indiana.edu> shirley@iuvax.cs.indiana.edu (peter shirley) writes:


   I have a C++ program that takes about 170% as long as a similar C program.
   I've run it on a vax and a sun with and without big-O.
   I've inlined everything I can, and I don't understand the slowdown.
   Could someone enlighten me?  (yes, I looked at the C code generated
   by CC, and am too dumb to understand it).

   [code using a vector class, adding vectors with operator+]

Part of the problem at least is that you are using a function which
returns an object.  Typically what happens in this case (at least under
cfront) is that the function is called, the return value copied into a
temporary, and then the assignment function is called to copy the
temporary into the destination.  So with vector a, b, and a + operator
which returns a vector,

        a = a + b;

will result in a call to the + operator, which returns a vector.  This
vector is copied into the temporary by C, and then the assignment
operator is called.  So you are doing nearly twice the work.

With a += operator, changing that line to

        a += b

the time was about the same without optimization on a Sparc with AT&T
2.0.  The time was still substantially faster in C with optimization; I
am not sure what optimizations are being performed in C but not C++.

Ellis & Stroustrup ARM 12.1.1c provides an explanation of why the
optimizations necessary to avoid the extra copying would be very
difficult for most compilers, and points out that in the case of
initialization it often can be optimized.

Below is a simple example of C++ code and a cleaned-up version of what is
produced by cfront.  I have changed the names of the temporaries and
unmangled the names, and performed a few trivial "simplifications" of
the code, such as changing (a,b) to a;b where the result of (a,b) is not
used.  In the example, when the result of foo is assigned to an existing
X object, a temporary is created.  When the result of foo is used to
initialize an X object, no temporary is created.

Note that things would be different were there a copy constructor for X;
the function foo would be changed to take a pointer to the location
where the result should be created.

struct X {
  int a;
  int b;
  void operator= (const X& o) { a = o.a; b = o.b; }
};


X foo ()
{
  X myx;
  return myx;
}

main ()
{
  X x1;

  x1 = foo ();
  X x2 = foo ();
}


struct X {
int a ;
int b ;
};

struct X foo ()
{ 
  struct X myx ;
  
  return myx ;          /* note that C returns a structure */
}


int main ()
{
  _main(); 
  { 
    struct X x1 ;
    struct X x2 ;
    struct X *p ;
    { 
      struct X tmp ;
      tmp = foo ();             /* structure copied into tmp */
      p = &tmp;
      
      (& x1 )-> a = *p.a ;      /* operator= used to copy tmp into x1 */
      (& x1 )-> b = *p.b ;
      
      x2 = foo ( ) ;            /* no copy necessary
    }
  }
}

ggs@ulysses.att.com (Griff Smith) (01/09/91)

In article <1991Jan9.002244.23398@news.cs.indiana.edu>, shirley@iuvax.cs.indiana.edu (peter shirley) writes:
> I have a C++ program that takes about 170% as long as a similar C program.
> I've run it on a vax and a sun with and without big-O.
> I've inlined everything I can, and I don't understand the slowdown.

The C program is written in terms of basic operations that the optimizer
can easily improve; the C++ version isn't.  For example:

>         a[0] = a[0] + b[0];

This is better written as

	  a[0] += b[0];

though modern compilers can usually make the two cases equivalent.  The
C++ version is not so fortunate.  You write

>         a = a + b;

Since `a' is a structure, your `+' operator first has to store the sum in
your `temp', then your `=' operator must copy it into `a'.  This takes a
string move at best, and a slow function call at worst.  I tried redefining
your `+' operator to be a `vector& +=' operator and my execution time on
a tahoe became 6.0 seconds instead of 14.5.  That's still not quite as
good as 4.5 for the C version, but it explains most of the difference.

> Could someone enlighten me?  (yes, I looked at the C code generated
> by CC, and am too dumb to understand it).

C++ output isn't intelligible to most mortals; this mortal frequently
has trouble with C++ input.  I often find it easier to use adb to look
at the assembly language.
-- 
Griff Smith	AT&T (Bell Laboratories), Murray Hill
Phone:		1-201-582-7736
UUCP:		{most AT&T sites}!ulysses!ggs
Internet:	ggs@ulysses.att.com

steve@Pkg.Mcc.COM (Steve Madere) (01/10/91)

In article <1991Jan9.002244.23398@news.cs.indiana.edu>,
shirley@iuvax.cs.indiana.edu (peter shirley) writes:
| I have a C++ program that takes about 170% as long as a similar C
program.
| I've run it on a vax and a sun with and without big-O.
| I've inlined everything I can, and I don't understand the slowdown.
| Could someone enlighten me?  (yes, I looked at the C code generated
| by CC, and am too dumb to understand it).
| 

<Stuff deleted>
<C program segment folows>
|     for ( i = 0; i < 1000000; i++)
|     {
|         a[0] = a[0] + b[0];
|         a[1] = a[1] + b[1];
|         a[2] = a[2] + b[2];
|     }

<C++ program segment follows>
| inline void vector::operator=(vector& v)
| {
| 
|    data[0] = v.data[0];
|    data[1] = v.data[1];
|    data[2] = v.data[2];
| 
| }
| 
| 
| 
| inline vector operator+(vector& u, vector& v)
| {
|     vector temp;
| 
|     temp.data[0] = u.data[0] + v.data[0];
|     temp.data[1] = u.data[1] + v.data[1];
|     temp.data[2] = u.data[2] + v.data[2];
|     return temp;
| }

<Stuff deleted>
|     for (int i = 0; i < 1000000; i++)
|         a = a + b;
| 
|     cerr << a << "\n";
| }

At this point the cause of the speed problems should be obvious.
Your C++ program is not quite equivalent to your C program.
To make them equivalent, you should have the following C program
segment substituted for the one above.

for (i=0;i<1000000;i++)
{
  temp = calloc(3,sizeof(*temp));
  temp[0] = a[0] + b[0];
  temp[1] = a[1] + b[1];
  temp[2] = a[2] + b[2];
  a[0]=temp[0];
  a[1]=temp[1];
  a[2]=temp[2];
  free(temp);
}

Once this is done, I think your two programs will probably run
at the same speed :-(.

This is a problem with C++ that I have known about for some time.
The lure of simplified notation leads one to try to implement
vector and matrix operations with special classes that will hide
the complexity.  Unfortunately, all of those temporary variables
that must be created to realize any mathematical operations really
waste CPU time.  It takes a finite amount of time to allocate space,
copy data, and free up the space again.  In many cases it is a
significant percentage of the total run time of the program.

A friend of mine at UCSD has worked on this problem for some time
and concluded that what one needs is some kind of matrix awareness
in the compiler to recognize when operations can be somehow 
combined.  Clearly this is not the kind of thing that C++ was designed
to do.  It is appropriate to add matrix awareness to a computational
language like FORTRAN but C++ is not a scientific computation
language.  (no flames please, I just mean that its PRIMARY purpose
is not scientific computation).

Maybe what one needs is a preprocessed language that takes matrix
operations, turns them into optimized C code which you can 
link with your C++ program.  

Any suggestions from the peanut gallery?

Steve Madere
steve@pkg.mcc.com

steve@taumet.com (Stephen Clamage) (01/10/91)

shirley@iuvax.cs.indiana.edu (peter shirley) writes:

>I have a C++ program that takes about 170% as long as a similar C program.
>I've run it on a vax and a sun with and without big-O.
>I've inlined everything I can, and I don't understand the slowdown.

>inline vector operator+(vector& u, vector& v)
>{
>    vector temp;
>    temp.data[0] = u.data[0] + v.data[0];
>    temp.data[1] = u.data[1] + v.data[1];
>    temp.data[2] = u.data[2] + v.data[2];
>    return temp;
>}

This is your problem.  You create a local object, then return it by
value.  This must then be copied into the target variable.  The
equivalent C code looks like this:

typedef struct { double data[3]; } vector;

vector plus( vector* this, vector* u, vector* v)
{
    ...the body is the same as above...
}

main()
{
    vector u, v, w;
    ...
    w = plus(&w, &u, &v);
    ...
}

Compare
    w = plus(&w, &u, &v);
to
    w.data[0] = u.data[0] + v.data[0];
    w.data[1] = u.data[1] + v.data[1];
    w.data[2] = u.data[2] + v.data[2];
There is an extra structure copy using plus() instead of direct code,
and the copy is in the inner loop.

Try rewriting operator+ as:

inline vector operator+(vector& u, vector& v)
{
    return vector(u.data[0] + v.data[0],
		  u.data[1] + v.data[1],
		  u.data[2] + v.data[2]);
}

Many C++ compilers are smart enough not to create a temporary object,
but to construct the new vector directly in the target variable.
-- 

Steve Clamage, TauMetric Corp, steve@taumet.com

vaughan@mcc.com (Paul Vaughan) (01/10/91)

   From: shirley@iuvax.cs.indiana.edu (peter shirley)
   Newsgroups: comp.lang.c++
   Date: 9 Jan 91 05:22:13 GMT
   Organization: Computer Science, Indiana University
   Lines: 115

   I have a C++ program that takes about 170% as long as a similar C program.
   I've run it on a vax and a sun with and without big-O.
   I've inlined everything I can, and I don't understand the slowdown.
   Could someone enlighten me?  (yes, I looked at the C code generated
   by CC, and am too dumb to understand it).

The answer is that the C++ version has to construct and copy a
temporary vector.  Here are some timings I made that show this effect
by noting the difference between using your version and using a
vector::operator+= that avoids the copy.  

These timings were all on a
SUN4.  Note the user time.

Using a = a + b
[1.95]sunspot) g++-1.90.04 -O -o trash trash.cc -lg++
trash.cc: In function class vector operator + (class vector &, class vector &):
trash.cc:62: warning: bitwise copy: `vector' defines operator=() 
[1.96]sunspot) /bin/time trash
(1e+06, 2e+06, 4e+06)
       32.1 real        10.0 user         0.2 sys 

(Ignore the warning from g++, it's inappropriate)

Using a += b;
[1.93]sunspot) g++-1.90.04 -O -o trash trash.cc -lg++
trash.cc: In function class vector operator + (class vector &, class vector &):
trash.cc:62: warning: bitwise copy: `vector' defines operator=()
[1.94]sunspot) /bin/time trash
(1e+06, 2e+06, 4e+06)
       13.3 real         5.2 user         0.0 sys  


Using a = a + b;
[1.97]sunspot) CC -O -o trash trash.cc
CC  trash.cc:
cc   -o /d2/prism/prism/src/sim/src/trash -I/net/sunspot/usr/cadsw/CC/sun4/incl  -O trash.c  -L/net/sunspot/usr/cadsw/CC/sun4/ -lC
[1.98]sunspot) /bin/time trash
(1e+06, 2e+06, 4e+06)
       23.0 real         8.2 user         0.1 sys  

Using a += b;
[1.99]sunspot) CC -O -o trash trash.cc
CC  trash.cc:
cc   -o /d2/prism/prism/src/sim/src/trash -I/net/sunspot/usr/cadsw/CC/sun4/incl  -O trash.c  -L/net/sunspot/usr/cadsw/CC/sun4/ -lC
[1.100]sunspot) /bin/time trash
(1e+06, 2e+06, 4e+06)
        8.8 real         3.4 user         0.1 sys  

Here is the vector::operator+= that I added:


inline vector& vector::operator+=(vector& v)
{

   data[0] += v.data[0];
   data[1] += v.data[1];
   data[2] += v.data[2];
   return *this;
}

The concept of a named return value that a compiler could use to
optimize away this sort of temporary copy has been proposed.  I think
it was Doug Lea who came up with that, but I'm not sure.  I don't know
the status of that suggestion.

Finally, here is the original code.


   The C, then C++ code follow.

   Thanks,

   Pete Shirley
   shirley@cs.indiana.edu

   **************** begin C code *********************

   #include "stdio.h"


   main()
   {
       int i;
       double a[3], b[3]; 

       a[0] = a[1] = a[2] = 0.0;
       b[0] = 1.0;
       b[1] = 2.0;
       b[2] = 4.0;

       for ( i = 0; i < 1000000; i++)
       {
	   a[0] = a[0] + b[0];
	   a[1] = a[1] + b[1];
	   a[2] = a[2] + b[2];
       }

       fprintf(stderr, "%lf %lf %lf\n", a[0], a[1], a[2]);
   }


   ************* end C code *******************

   ************** begin C++ code *********************

   #include <math.h>
   #include <stream.h>


   class vector {
   protected:
       double data[3];
   public:
       vector();
       vector(double, double, double);
       void set(double, double, double);
       void operator=(vector&);
       friend ostream& operator<<(ostream&, vector&);
       friend vector operator+(vector&, vector&);
   };


   inline vector::vector()
   {
   }


   inline vector::vector(double a, double b, double c)
   {
      data[0] = a;  data[1] = b;  data[2] = c;
   }


   inline void vector::set(double a, double b, double c)
   {
      data[0] = a;  data[1] = b;  data[2] = c;
   }

   inline void vector::operator=(vector& v)
   {

      data[0] = v.data[0];
      data[1] = v.data[1];
      data[2] = v.data[2];

   }



   inline vector operator+(vector& u, vector& v)
   {
       vector temp;

       temp.data[0] = u.data[0] + v.data[0];
       temp.data[1] = u.data[1] + v.data[1];
       temp.data[2] = u.data[2] + v.data[2];
       return temp;
   }

   inline ostream& operator<<(ostream& s, vector& t)
   {
      return s << "(" << t.data[0] << ", "<< t.data[1] <<", " << t.data[2] <<")";
   }


   main()
   {
       vector a(0.0, 0.0, 0.0);
       vector b(1.0, 2.0, 4.0);

       for (int i = 0; i < 1000000; i++)
	   a = a + b;

       cerr << a << "\n";
   }

   ************** end C++ code ***********************
-- 
 Paul Vaughan, MCC CAD Program | ARPA: vaughan@mcc.com | Phone: [512] 338-3639
 Box 200195, Austin, TX 78720  | UUCP: ...!cs.utexas.edu!milano!cadillac!vaughan
----------------------------------------------------------------------------------
I spent from $3 to $10 today to pay interest on the national debt.  How about you?
----------------------------------------------------------------------------------


 Paul Vaughan, MCC CAD Program | ARPA: vaughan@mcc.com | Phone: [512] 338-3639
 Box 200195, Austin, TX 78720  | UUCP: ...!cs.utexas.edu!milano!cadillac!vaughan
----------------------------------------------------------------------------------

jimad@microsoft.UUCP (Jim ADCOCK) (01/11/91)

In article <1991Jan9.101212@Pkg.Mcc.COM> steve@Pkg.Mcc.COM (Steve Madere) writes
|This is a problem with C++ that I have known about for some time.
|The lure of simplified notation leads one to try to implement
|vector and matrix operations with special classes that will hide
|the complexity.  Unfortunately, all of those temporary variables
|that must be created to realize any mathematical operations really
|waste CPU time.  It takes a finite amount of time to allocate space,
|copy data, and free up the space again.  In many cases it is a
|significant percentage of the total run time of the program.
|
|A friend of mine at UCSD has worked on this problem for some time
|and concluded that what one needs is some kind of matrix awareness
|in the compiler to recognize when operations can be somehow 
|combined.  Clearly this is not the kind of thing that C++ was designed
|to do.  It is appropriate to add matrix awareness to a computational
|language like FORTRAN but C++ is not a scientific computation
|language.  (no flames please, I just mean that its PRIMARY purpose
|is not scientific computation).
|
|Any suggestions from the peanut gallery?

My claim would be that if one is serious about working with large objects --
say large matrices and/or large vectors, then one can afford to 
"do the job right" in C++ -- which means probably using reference counting
on the internal storage of the vectors and/or matrices, while maintaining
an external polymorphic wrapper over those guts which represents the 
exact form of the vector or matrix: is this matrix a "plain old matrix",
tridiagonal, identity, .... 

[In fact, probably some vendors already provide such libraries]

If you have a parallel computer, and a highly parallelizing fortran compiler,
you might want to write some of the guts of the internal representation of
these matrices in fortran -- but better yet, you'll probably find that 
your vendor already provides highly optimized routines for performing
those guts -- linpack-type stuff.

There's no reason why you can't have both the nice syntax of C++, and
the speed of fortran -- if you're willing to do a little serious work.
On the other hand, if you don't care, well, then you don't care.

Bruce.Hoult@bbs.actrix.gen.nz (01/11/91)

Peter Shirley writes:
>I have a C++ program that takes about 170% as long as a similar C program.
>I've run it on a vax and a sun with and without big-O.
>I've inlined everything I can, and I don't understand the slowdown.
>Could someone enlighten me?  (yes, I looked at the C code generated
>by CC, and am too dumb to understand it).

I get the same thing on my Mac IIcx: the C program takes 31 seconds, and
the C++ one takes 54.

The problem is that the compiler isn't all that smart and is (on my
system, and probably on yours) copying the result of operator+ twice
on the way to its final destination -- the first time using a bitwise
copy (the default copy constructor), and the second time using the
operator= that you defined.

One way to work around the problem is to define an operator+=.  This will
drop the execution time down to virtually the same as for the C program.

Another way is to put the output of CFront through a smarter C compiler. I
suspect that GCC would do a rather better job of optomising the mess that
CFront leaves behind.
-- 
Bruce.Hoult@bbs.actrix.gen.nz   Twisted pair: +64 4 772 116
BIX: brucehoult                 Last Resort:  PO Box 4145 Wellington, NZ

mat@mole-end.UUCP (Mark A Terribile) (01/11/91)

> > I have a C++ program that takes about 170% as long as a similar C program.
> > ...  I don't understand the slowdown.

> The C program is written in terms of basic operations that the optimizer
> can easily improve; the C++ version isn't.  For example:
 
> >         a[0] = a[0] + b[0];
> This is better written as
> 	  a[0] += b[0];
 
> though modern compilers can usually make the two cases equivalent.  The
> C++ version is not so fortunate.  You write
 
> >         a = a + b;
 
> Since `a' is a structure, your `+' operator first has to store the sum in
> your `temp', then your `=' operator must copy it into `a'.  This takes a
> string move at best, and a slow function call at worst.  I tried redefining
> your `+' operator to be a `vector& +=' operator and my execution time on
> a tahoe became 6.0 seconds instead of 14.5.  That's still not quite as
> good as 4.5 for the C version, but it explains most of the difference.
 
There's also some genuinely bad use of C++ initialization rules around
returns and parameters.  Fixing it won't make the problem go away, but it
will probably shave about half off the extra cost.  In the operator+() , you
force the loading of a temporary vector, then a copy from the temporary to
the return value.  Sometimes this is unavoidable; here it is not.
Instead of writing

	inline vector operator+(vector& u, vector& v)
	{
	    vector temp;

	    temp.data[0] = u.data[0] + v.data[0];
	    temp.data[1] = u.data[1] + v.data[1];
	    temp.data[2] = u.data[2] + v.data[2];
	    return temp;
	}

write

	inline vector operator+(vector& u, vector& v)
	{
	    return vector( u.data[0] + v.data[0], u.data[1] + v.data[1],
						    u.data[2] + v.data[2] );
	}

There is no absolute guarantee that the compiler will not build a temporary
and do the copy, but at least some, under at least some circumstances,
will build the return value in the place in which it is to finally rest.
Class/struct return is usually implemented, and especially in cfront, by
passing a pointer to a place wherein the return value is to be built, or by
providing extra stack space for the return value.  In either case, writing
the `value builder' constructor in the return statement gives the compiler
permission to optimize away the copy constructor invocation by building the
result in place.


> > Could someone enlighten me?  (yes, I looked at the C code generated
> > by CC, and am too dumb to understand it).

It's not that bad.  Use a good editor to match up the parens and split the
lines to exhibit the parentheses nesting, then remove truly redundant parens.
Remember that `;' becomes `,' and if/then/else is implemented with  ?: .  A
`0' is used to represent a null-else.

Most of the rest of the opacity comes from name mangling and the nested-
variable prefixes.  Those you will get the hang of in time.
-- 

 (This man's opinions are his own.)
 From mole-end				Mark Terribile

jbuck@galileo.berkeley.edu (Joe Buck) (01/16/91)

In article <1877@ruunsa.fys.ruu.nl>, muts@fysaj.fys.ruu.nl (Peter Mutsaers /1000000) writes:
|> It is a pity that the c++ optimizer does not (yet) automatically
|> change a = a + b into a+=b if operator+= is defined too. It would
|> reduce however the freedom to use operators like += *= etc.

I got this trick from Andy Koenig.

Whenever I write a class with arithmetic operators, I always write
+=, *=, etc, as member functions and then write (let's say the class
is Foo)

inline Foo operator+ (const Foo& a, const Foo& b) {
	Foo tmp(a);
	tmp += b;
	return tmp;
}

Notice that this function does not need to be a friend (so when
people ask me whether arithmetic operators should be members or
friends, I reply, neither).  With this kind of structure, most
optimizers can convert

func (Foo& x, Foo& y) {
	...

	Foo c = x + y;
	...
}

into
	Foo c(x); c += y;

with little trouble.
--
Joe Buck
jbuck@galileo.berkeley.edu	 {uunet,ucbvax}!galileo.berkeley.edu!jbuck