[comp.lang.c++] Why is this program slow? / C++ versus C Performance

timh@.mentorg.com (Tim Harvey @ APD x1381) (01/12/91)

Peter Shirley posted the article (# 10640) "why is this program slow?"
I found the topic he brought up (C++ versus C performance) and the discussion
that followed useful to some work I'm doing and I wrote up the following
summary.  I'm posting it for the benefit of any one else who might be
interested this topic.  Of course, this summary isn't the "final word",
so I hope people will post corrections, clarifications, and comments.

Credit is due to those people upon whose contributions this summary is
based:  Peter Shirley, Griff Smith, Steve Simmons, Stephen Clamage,
Steve Madere, Paul Vaughan, and Bruce Hoult.



THE ISSUE
---------

C++ is often charged with being an inherently slower language than C.

A simple case was developed where a C++ program took almost twice as
long (about 170%) to run as a similar C program.  It illustrates one
situation where C++ appears slower than C.

The example involves nearly "equivalent" matrix addition C and C++
programs.  Matrix B is repeatly added to matrix A in a FOR loop.



EXAMPLE
-------

   /******************************************/
   /*              C Example                 */
   /******************************************/
   
   #include <stdio.h>
   
   main() {
     unsigned long   i ;
              double a[ 3 ],
                     b[ 3 ] ; 
   
     a[ 0 ] = a[ 1 ] = a[ 2 ] = 0.0 ;
     b[ 0 ] = 1.0 ;
     b[ 1 ] = 2.0 ;
     b[ 2 ] = 3.0 ;
   
     for ( i = 0 ; i < 1000000 ; i++ ) {
       a[ 0 ] = a[ 0 ] + b[ 0 ] ;
       a[ 1 ] = a[ 1 ] + b[ 1 ] ;
       a[ 2 ] = a[ 2 ] + b[ 2 ] ;
     }
   
     fprintf( stderr, "%lf %lf %lf\n", a[ 0 ], a[ 1 ], a[ 2 ] ) ;
   }
   
   
   /******************************************/
   /*            C++ Example                 */
   /******************************************/
   
   #include <math.h>
   #include <stream.h>
   
   class Vector {
     friend ostream& operator<<( ostream&, Vector& ) ;
     friend Vector   operator+ ( Vector&,  Vector& ) ;
   protected:
     double data[ 3 ] ;
   public:
            Vector() ;
            Vector( double, double, double ) ;
       void operator=( Vector& ) ;
   };
   
   inline Vector::Vector() { }
   
   inline Vector::Vector( double a, double b, double c ) {
     data[ 0 ] = a ;
     data[ 1 ] = b ;
     data[ 2 ] = c ;
   }
   
   inline void Vector::operator=( Vector& v ) {
     data[ 0 ] = v.data[ 0 ] ;
     data[ 1 ] = v.data[ 1 ] ;
     data[ 2 ] = v.data[ 2 ] ;
   }
   
   inline Vector& operator+( Vector& u, Vector& v ) {
     static Vector temp ;
     temp.data[ 0 ] = u.data[ 0 ] + v.data[ 0 ] ;
     temp.data[ 1 ] = u.data[ 1 ] + v.data[ 1 ] ;
     temp.data[ 2 ] = u.data[ 2 ] + v.data[ 2 ] ;
     return temp ;
   }
   
   ostream& operator<<( ostream& s, vector& t ) {
     return s << form( "%f  %f  %f", t.data[ 0 ], t.data[ 1 ], t.data[ 2 ] ) ;
   }
   
   
   main() {
     Vector a( 0.0, 0.0, 0.0 ) ;
     Vector b( 1.0, 2.0, 3.0 ) ;
   
     for ( int i = 0 ; i < 1000000 ; i++ ) {
       a = a + b ;
     }
   
     cout << a << endl ;
   }



REASON FOR SLOWER PERFORMANCE
-----------------------------

The C program is written in terms of basic operations that the compiler
can easily optimize while the C++ version isn't.  The C++ version ends up
having to construct and copy a temporary class Vector object.  This extra
work accounts for the difference in run time.

When the form

  a = a + b ;

is used, since `a' is a structure, the `+' operator first has to store the
sum in hidden (contained in the emitted cfront C code), temporary matrix
storage, then the `=' operator must copy the contents of the temporary
storage into `a'.  This takes a string move at best, and a slow function
call at worst.

Another way of looking at it is to make the C program truely equivalent to
the C++ version.  The C program segment would then look something like the
following to do what the C++ version actually does.

   for ( i = 0 ; i < 1000000 ; i++ ) {
     temp      = calloc( 3, sizeof( *temp ) ) ;
     temp[ 0 ] = a[ 0 ] + b[ 0 ] ;
     temp[ 1 ] = a[ 1 ] + b[ 1 ] ;
     temp[ 2 ] = a[ 2 ] + b[ 2 ] ;
     a[ 0 ]    = temp[ 0 ] ;
     a[ 1 ]    = temp[ 1 ] ;
     a[ 2 ]    = temp[ 2 ] ;
     free( temp ) ;
   }



DISCUSSION
----------

The problem is that cfront isn't smart enough to construct the vector sum
directly in the target variable 'a' instead of creating an intermediary
temporary object.

Given the two addition/assignment forms

  1.   a[ 0 ] = a[ 0 ] + b[ 0 ] ;
  2.   a[ 0 ] += b[ 0 ] ;

modern compilers should identify the two cases as equivalent and generate the
most efficient code, but cfront is not so fortunate.  It treats both forms
as different.  In the first case, cfront will generates code that copies the
result of the '+' operator twice on the way to its final destination.  The
first time a bitwise copy (the default copy constructor) is used, and the
second time the defined '=' operator is used.

The problem goes beyond cfront not being able to recognize equivalent forms.
The sort of C code cfront generates can actually interfer with the
downstream C compiler's ability of perform optimization.  For example,
something like

   int a = 0 ;
   for ( int i = 0 ; i < 1000000 ; i++ ) {
     a = a + 1 ;
   }

when written in C is easily optimized by most C compilers to:

   a = 1000000 ;

But the same operation can be written in C++ so that C code generated by cfront
will "hide" the optimization opportunity and leave the downstream C compiler
with no other choice but to generate unoptimized literal code for a lengthy
loop construction.

Eventually, the code optimization restrictions and problems of cfront will be
resolved by one or more of the following:

   o  cfront is modified to generate optimized C code,

   o  cfront is changed to generate C code in a form that can be
      easily optimized by existing C compilers,

   o  C compilers are developed that are better tuned to the the
      C code that cfront presently generates.

   o  reliable, fully optimized C++ native code comilers are
      developed.



WORKAROUND
----------
 
One way to work around the problem is to define a '+=' operator for the
Vector class.  There are many ways to do this.  Perhaps the simplest is to
define and use the '+=' operator as a member in the following way:

   class Vector {
   ...
   public:
   ...
     void operator+=( Vector& ) ;
   };

   void Vector::operator+=( Vector& v ) {
     data[ 0 ] += v.data[ 0 ] ;
     data[ 1 ] += v.data[ 1 ] ;
     data[ 2 ] += v.data[ 2 ] ;
   }  


   main() {
     ...
     for ( int i = 0 ; i < 1000000 ; i++ ) {
       a += b ;
     ...
   }

keffert@jacobs.CS.ORST.EDU (Thomas Keffer) (01/13/91)

In article <1991Jan12.040453.6887@mentor.com> timh@.mentorg.com (Tim Harvey @ APD x1381) writes:
>Peter Shirley posted the article (# 10640) "why is this program slow?"
>
>C++ is often charged with being an inherently slower language than C.
>
>A simple case was developed where a C++ program took almost twice as
>long (about 170%) to run as a similar C program.  It illustrates one
>situation where C++ appears slower than C.

Now hold on!  We have done extensive testing of our C++  Math.h++ class library
versus both C and Fortran and found only a very slight penalty
that can very easily be recovered by using the more sophisticated
algorithms made possible by the encapsulation, etc. properties of C++.

Here's some hard data.  The times shown are the times required to multiply
two matrices together of the given size.  The machine is a Sun 3/260
with a 68881 coprocessor.

N**2            g++ -m68881         f77 -f68881
25                  4.7s               3.9
100                16.4               15.5
400                64.2               62.1

You might want to implement a special "small matrix" class if you are
going to be working with very small N (2, 3 or 4) matrices (e.g., graphics
transformation matricies).  But for larger N (5 and up), the time
differences are completely trivial.

The trick is careful attention to detail: use reference counting, 
minimize memory allocations, maintain a pool of the small (reference)
structures, etc.

-tk

Disclaimer:
  I work for a company that vends a C++ math class.

___
Thomas Keffer
Rogue Wave
PO Box 2328 Corvallis, OR 97330
(503) 745-5908

mat@mole-end.UUCP (Mark A Terribile) (01/14/91)

> ...  Of course, this summary isn't the "final word", so I hope people will
> post corrections, clarifications, and comments.

> THE ISSUE
> ---------
 
> A simple case was developed where a C++ program took almost twice as
> long (about 170%) to run as a similar C program.  It illustrates one
> situation where C++ appears slower than C.
 
> REASON FOR SLOWER PERFORMANCE
> -----------------------------
 
> The C program is written in terms of basic operations that the compiler
> can easily optimize while the C++ version isn't.  The C++ version ends up
> having to construct and copy a temporary class Vector object.  This extra
> work accounts for the difference in run time.

> DISCUSSION
> ----------
 
> The problem is that cfront isn't smart enough to construct the vector sum
> directly in the target variable 'a' instead of creating an intermediary
> temporary object.

Well, I'm not so sure of that.  There is a guarantee that the temporary
objects WILL be created so that the side effects of constructors will occur--
EXCEPT in a very few circumstances involving a copy constructor used as the
second constructor in an initialization.

The problem, as coded, explicitly excludes any of these circumstances.  (It's
not that difficult to change so that one of the circumstances occurs, as we
have seen.)

In theory, a `sufficiently smart' compiler could determine that there are no
side effects and computation can proceed without creating the temporary, and
since the function is expanded inline, this would be one of the easier cases
for the compiler to detect.  (Without the inlining, the optimization and
code generation would have to take place AFTER symbol resolution, that is,
in the middle of the linking process.)
 
>   1.   a[ 0 ] = a[ 0 ] + b[ 0 ] ;
>   2.   a[ 0 ] += b[ 0 ] ;
> 
> modern compilers should identify the two cases as equivalent and generate the
> most efficient code, but cfront is not so fortunate.  ...

Modern compilers which have semantics for the type built-in, yes.  For C++,
no.  Perhaps it will one day be possible to specify more about the operator
semantics, but for a language like C++, I suspect that awaits some theory
of programming language semantics.

> The problem goes beyond cfront not being able to recognize equivalent forms.
> The sort of C code cfront generates can actually interfer with the
> downstream C compiler's ability of perform optimization.  ...

C++ asks more of a compiler, and for the near future, that will mean that
certain things that can be optimized in C-like code cannot be optimized when
they cross the boundaries provided by C++'s ways of organizing programs.

If the low-level performance of a type is important, then the type must be
designed and written carefully.  It's true that we've come to rely a lot on
optimizing compilers and less on how we code, but we can't rely on compilers
to optimize across separate compilation boundaries, nor to replace weak
expensive algorithms with good ones (even if code hoisting and strength
reduction can sometimes have that effect).  We have to seek to reduce wasted
motion in more-or-less machine independent ways.  Such ways exist (e.g. the
value-building constructor in the return statement).

> Eventually, the code optimization restrictions and problems of cfront
> will be resolved by one or more of the following:

I would say rather `ameliorated,' since it's not clear to me that a compiler,
given seperate code generation, can analyze away all the extra motion that
C++'s model seems (in the absence of optimization) to imply.

>    o  cfront is modified to generate optimized C code,
> 
>    o  cfront is changed to generate C code in a form that can be
>       easily optimized by existing C compilers,
> 
>    o  C compilers are developed that are better tuned to the the
>       C code that cfront presently generates.
> 
>    o  reliable, fully optimized C++ native code comilers are
>       developed.

I believe that another step is needed: code generation and optimization AFTER
symbol resolution.  Unfortunately, it will be years before we see this, even
if one language or machine vendor proves conclusively its worth.

> WORKAROUND
> ----------
>  
> One way to work around the problem is to define a '+=' operator for the
> Vector class.  ...

Another is to reduce by at least half the wasted motion in the  operator+() ,
by using the constructor value builder.  With inlines, this may clue the
compiler that all the temporaries can be eliminated.

						Mark T.
!FunkyStuff
-- 

 (This man's opinions are his own.)
 From mole-end				Mark Terribile

don@usna.navy.mil (Mr. Donald W. Garner (CADIG STAFF) <don@usna.navy.mil>) (01/17/91)

... it was noted sometime ago that the use of stream.h in a C++ program can
make it larger and slower than a C program that performs the same functions
without streams ...

jimad@microsoft.UUCP (Jim ADCOCK) (01/17/91)

In article <470@mole-end.UUCP> mat@mole-end.UUCP (Mark A Terribile) writes:
|> The problem is that cfront isn't smart enough to construct the vector sum
|> directly in the target variable 'a' instead of creating an intermediary
|> temporary object.
|
|Well, I'm not so sure of that.  There is a guarantee that the temporary
|objects WILL be created so that the side effects of constructors will occur--
|EXCEPT in a very few circumstances involving a copy constructor used as the
|second constructor in an initialization.

Ack!  I disagree that any such guarantee exists -- I claim that contrar-wise
ARM goes out of its way to state that any and all such uses of temporaries is
implementation dependent, and that the compiler does indeed have latitude to 
generate or not generate temporaries as it so chooses.

The commentary  on Page 299 of ARM states:

"The fundamental rule is that the introduction of a temporary object and the 
calls of its constructor/destructor pair may be eliminated if the only way
the user can detect its elimination or introduction is by observing side
effects generated by the constructor or destructor calls...."

This "fundamental rule" unfortunately, does nothing to clarify where
a compiler _ought_ to be generating a temporary, that it then might possibly
be "optimized away" using the above "fundamental rule."

One thing [about the only thing] you can be sure of is that if you 
initialize a reference with a temporary, the temporary exists for
the duration of the scope it is created in [but I can't find the
pertinent chapter and verse in ARM today]

For a really interesting and related issue, see the Dec. "Journal of
C Language Translation" regards the issue of whether [C] functions must
return values by copying.  Clearly, in C++ return-by-reference requires
that answers be returned without copying -- but, where is it stated that
non-return-by-reference functions must return their "value" by making
a copy of that value?

....Kind of makes me wonder how, if at all, a C++ programmer can certainly
force a copy [other than introducing a named variable to represent that
copy, and is even that guaranteed to work?] and how in general C++ 
programmers are suppose to control object identity -- when they're doing
OOP, that is.

???

jimad@microsoft.UUCP (Jim ADCOCK) (01/26/91)

In article <471@mole-end.UUCP| mat@mole-end.UUCP (Mark A Terribile) writes:
|In article <70067@microsoft.UUCP>, jimad@microsoft.UUCP (Jim ADCOCK) writes:
|> |...  There is a guarantee that the temporary objects WILL be created so that
|> |the side effects of constructors will occur--EXCEPT in a very few circum-
|> |tances involving a copy constructor ... the second ... in an initialization.
|
|> Ack!  I disagree that any such guarantee exists -- I claim that contrar-wise
|> ARM goes out of its way to state that any and all such uses of temporaries is
|> implementation dependent, ...
|
|If you go back to the original problem, you'll see that the `temporaries'
|in questions were not compiler-generated, but were written by the programmer.
|(Perhaps I should have written `intermediate' instead of `temporary'?)  And
|there is a place where the compiler states that an EXCPLICITLY WRITTEN
|temporary/intermediate may, in effect, be eliminated:
|
|		X f(  .  .  .  )
|		{
|			. . .
|			return X( xarg1, xarg2 );
|		}
|
|Here the compiler, instead of creating a local X to evaluate the value
|builder, and then using it to initialize the returned value, may use the
|constructor value builder directly to initialize the returned value.

Agreed.  A named local variable is not a temporary.  Temporaries are
unnamed objects.  See ARM page 22:

"A NAMED LOCAL OBJECT may not be destroyed before the end of its block nor
 may a local named object of a class with a constructor or a destructor 
 with side effects be eliminated even if it appears to be unused."

[emphasis mine]

Contrarwise, whether an UNNAMED TEMPORARY is actually created, when 
either "explicitly" or "implicitly" called for by a programmer is 
strictly implementation dependent.  UNNAMED TEMPORARIES can be 
"optimized away" even if their constructors or destructors have side
effects.

The basic rule is that if you want to ensure that a copy is being made, 
you must explicitly give that copy a name.  This has some interesting
consequences.  Consider what the following function does:

Foo do_I_make_a_copy(Foo* foo) { return *foo; }

???


Interestingly, if a temporary is given a name by assigning a reference to
it, then the rules change and it essentially becomes a named object,
and is guaranteed to stay alive for the scope of the reference.  So,
it appears to be the named'ness of an object that is the 
determining factor in whether a compiler can optimize it away
[and therefor its constructor/destructor side effects]

Another unsolved question:  Can a compiler optimize-away a named object
that *doesn't* have a constructor nor destructor with side effects?  This
is still an interesting question in dealing with object identity.