[comp.lang.c++] C++ vs. Fortran for linear algebra.

jon@hanauma (Jon Claerbout) (07/26/89)

ABSTRACT:	C++ can be customized to look like Fortran in its
		array subscripting, but it has the advantage
		over Fortran that variable-dimensioned arrays
		can be dynamically allocated.
		A benchmark shows C++ about half as fast on a DEC 3100.

ANALYSIS:	fortran			C
		real x(n,m)		x = malloc( n*m* sizeof(float))
		x(i,j) = etc		x[ i + index[j] ] = etc

COMMENT:	This is a great example for C++ textbook writers.

COMPILER:	AT&T C++ version 1.2.1
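
SKETCH:		(not in the original post; names illustrative)
		the right-hand column of the ANALYSIS, spelled out in C:

#include <stdlib.h>

/* Allocate an n-by-m matrix plus a dope vector of column offsets,
   so that Fortran's x(i,j) becomes x[i + index[j]]. */
float *matrix_alloc(int n, int m, int **index_out)
{
	int j;
	int *index = (int *) malloc(m * sizeof(int));
	for (j = 0; j < m; j++)
		index[j] = j * n;	/* column j starts at offset j*n */
	*index_out = index;
	return (float *) malloc(n * m * sizeof(float));
}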

DEMONSTRATION BELOW -------------------------------------------------
/*	Benchmark Fortran equivalent (actually ratfor)
	real total, x(200,200)
	integer i,j,k,n
	n = 200
	total = 0.
	do i=1,n
		do j=1,n
			x(i,j) = sqrt(1.+i+j)
	do i=1,n
		do j=1,n
			do k=1,n
				total = total + x(i,k) * x(k,j)
	write(6,10) total
10      format('    1576677724. if double precision '/f15.0,' here in single')
	call exit(0);	end
*/
//------------------------------------ C++ user program equivalent to Fortran
#include <CC/stdio.h>
main () {
	double sqrt(double);
	int	i, j, k, n=200;
	float	total=0.;
	Matrix	x(n,n);
	for( i=0; i<n; i++ )
		for( j=0; j<n; j++ )
			x(i,j) = sqrt(1.+(i+1)+(j+1));
	for( i=0; i<n; i++ )
		for( j=0; j<n; j++ )
			for( k=0; k<n; k++ )
				total = total + x(i,k) * x(k,j);
	printf("total=%f\n    1576677724. if double precision\n",total );
}
//   7.0sec	1575556480. 	DEC 3100  (f77 -O3)
//  13.1sec	1575559552	DEC 3100  (C++ -O2)
//
//--------------- C++ header to handle arrays like fortran.
class Matrix {				// by dave nichols
	private:
		int *index;		// column offsets: index[j] = j*n1
		float *data;		// n1*n2 floats, column-major as in Fortran
	public:
		Matrix( int, int);
		inline float& operator()( int i, int j ) {
			return ( data[i + index[j] ] );	// Fortran-style x(i,j)
		}
		~Matrix();
	};
Matrix::Matrix( int n1, int n2) {
	index = new int[n2];
	for( int i =0; i<n2; i++ ) {
		index[i] = i*n1;
	}
	data = new float[n1*n2];
}
Matrix::~Matrix() { delete [] index; delete [] data; }

grunwald@flute.cs.uiuc.edu (Dirk Grunwald) (07/26/89)

I modified your example slightly and tried it on a Sun-3/60 w/68881
running SunOS 4.0.3, using g++ 1.35.1+

I got the following results:

Floating Point		Compiled by:				Time (sec)
--------------------------------------------------------------------------
real			f77 -O					84.7
real			f77 -O4					85.4
double precision	f77 -O4					88.1

float			g++ -O -fstrength-reduce		85.5
double			g++ -O -fstrength-reduce		95.9
double, global vars	g++ -O -fstrength-reduce		95.3
double, no dope vector	g++ -O -fstrength-reduce		97.3

Conclusions:
	+ I can't wait to get a DEC-3100

	+ G++ isn't so bad for numeric codes

	+ G++/Gcc needs better optimization for `doubles'

	+ G++ needs to improve optimization for induction variables
	  that are class member functions [posted to gnu.g++.bug]

Changes to the code & why:

	+ g++ doesn't grok inlines if they're in the class defn, so I
	  pulled it out

	+ G++ has dynamic local vectors, so you don't need the mallocs,
	  but you do need a macro [curse the language].

	  I think this saves MUCHO time. You could pull a similar trick
	  using ``alloca'' on a vanilla-flavoured C++ backend (a sketch
	  follows this list). You could also make the variables global,
	  since FORTRAN does that (no saving in and of itself).

	+ used inline math library (<math.h> includes <math-68881.h>)
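
	  [A sketch of the alloca variant, for concreteness -- an
	  illustration, not benchmark code. It assumes <alloca.h>
	  declares alloca, and uses the Matrix class defined below;
	  the stack storage is reclaimed automatically on return.]

#include <alloca.h>

void stack_matrix_demo(int n1, int n2)
{
    int    *indx = (int *)    alloca(n2 * sizeof(int));
    double *dat  = (double *) alloca(n1 * n2 * sizeof(double));
    Matrix  x(n1, n2, indx, dat);	// no freestore traffic at all
    x(0, 0) = 1.0;
}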

//--------------- C++ header to handle arrays like fortran.
// by dave nichols, hack by d. grunwald
class Matrix {
private:
    int *index;
    double *data;
public:
    Matrix( int n1, int n2, int *indx, double *dat);
    double& operator()( int i, int j );
};

static inline double&
Matrix::operator()( int i, int j ) {
    return ( data[i + index[j] ] );
}

static inline
Matrix::Matrix( int n1, int n2, int *indx, double *dat) {
    index = indx;
    data = dat;
    for( int i =0; i<n2; i++ ) {
	index[i] = i*n1;
    }
}

#include <stdio.h>
#include <math.h>
#include <generic.h>

#define MATRIX(name,n1,n2) \
    int name2(name,index)[n1]; \
    double name2(name,data)[n1*n2]; \
    Matrix name(n1, n2, name2(name,index), name2(name,data));
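
// For clarity (not in the original post): with name2() token-pasting
// from <generic.h>, MATRIX(x,n,n) expands to roughly
//
//	int    xindex[n];
//	double xdata[n*n];
//	Matrix x(n, n, xindex, xdata);
//
// i.e., the backing store is an ordinary local array (g++'s dynamic
// local vectors), not freestore.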

main () {
    int	i, j, k;
    double	total=0.;

    const int n=200;
    MATRIX(x,n,n);


    for( i=0; i<n; i++ )
	for( j=0; j<n; j++ )
	    x(i,j) = sqrt(1.+(i+1)+(j+1));
    for( i=0; i<n; i++ )
	for( j=0; j<n; j++ )
	    for( k=0; k<n; k++ )
		total = total + x(i,k) * x(k,j);
    printf("total=%f\n    1576677724. if double precision\n",
	   (double) total );
}
--
Dirk Grunwald -- Univ. of Illinois 		  (grunwald@flute.cs.uiuc.edu)

paul@gill.UUCP (Paul Nordstrom) (07/29/89)

In article <3967@portia.Stanford.EDU>, jon@hanauma (Jon Claerbout) writes:
> 
> ABSTRACT:	C++ can be customized to look like Fortran in its
> 		array subscripting, but it has the advantage
> 		over Fortran that variable-dimensioned arrays
> 		can be dynamically allocated.
> 		A benchmark shows C++ about half as fast on a DEC 3100.
> 
> ANALYSIS:	fortran			C
> 		real x(n,m)		x = malloc( n*m* sizeof(float))
> 		x(i,j) = etc		x[ i + index[j] ] = etc
> 
> COMMENT:	This is a great example for C++ textbook writers.
> 
> COMPILER:	AT&T C++ version 1.2.1
> 
> DEMONSTRATION BELOW -------------------------------------------------
> /*	Benchmark Fortran equivalent (actually ratfor)


code went here


>//   7.0sec	1575556480. 	DEC 3100  (f77 -O3)
>//  13.1sec	1575559552	DEC 3100  (C++ -O2)
>//

On a Sun 3/60 w/ 12 MB and Oregon C++ 1.2Ag

/usr/local/paul % time bench
total=1576677723.402206
    1576677724. if double precision
108.7u 0.3s 1:52 96% 0+12k 5+1io 4pf+0w
/usr/local/paul % !!
time bench
total=1576677723.402206
    1576677724. if double precision
108.8u 0.2s 1:50 98% 0+12k 0+0io 0pf+0w


One thing bothers me though... my results are so accurate I think that
the compiler is using double precision, even though the program
specifies float.  Haven't found anything relevant in the manual. Obviously
that would explain some of the difference in the timing.  I don't think
that is the entire difference, though :-).


-- 
Paul Nordstrom
Gill & Co., L.P.
uunet!gill!paul

schwartz@shire.cs.psu.edu (Scott Schwartz) (07/30/89)

In article <3967@portia.Stanford.EDU>, jon@hanauma (Jon Claerbout) writes:
|ABSTRACT:	C++ can be customized to look like Fortran in its
|		array subscripting, but it has the advantage
|		over Fortran that variable-dimensioned arrays
|		can be dynamically allocated.
|		A benchmark shows C++ about half as fast on a DEC 3100.
|
|ANALYSIS:	fortran			C
|		real x(n,m)		x = malloc( n*m* sizeof(float))
|		x(i,j) = etc		x[ i + index[j] ] = etc
|
|COMMENT:	This is a great example for C++ textbook writers.
|
|COMPILER:	AT&T C++ version 1.2.1
|
|DEMONSTRATION BELOW -------------------------------------------------
|//   7.0sec	1575556480. 	DEC 3100  (f77 -O3)
|//  13.1sec	1575559552	DEC 3100  (C++ -O2)


Results for a Sun4/280:

# f77 -O4 -R x.r /usr/lib/libm.il
time xr
    1576677724. if double precision
    1575556480. here in single
7.4u 0.1s 0:11 64% 0+432k 0+0io 0pf+0w

# g++ -O -fno-defer-pop -fstrength-reduce x.cc -lm
time xc
total=1575556480.000000
    1576677724. if double precision
19.3u 0.1s 0:34 56% 0+208k 0+0io 0pf+0w

Using Dirk Grunwald's modifications:
time yc
total=1576677723.737338
    1576677724. if double precision
24.6u 0.1s 0:27 90% 0+360k 0+0io 0pf+0w

--
Scott Schwartz		<schwartz@shire.cs.psu.edu>

grunwald@flute.cs.uiuc.edu (Dirk Grunwald) (07/30/89)

Is there a math inline library for the SPARC? The SQRT computation is a
fairly hefty part of the entire computation. On a Sun-3, the math-inline
library helps a good deal (that's what the .il library is doing). From
what I saw on the Sun-3, f77 -O4 appears to force the inlines (I didn't
specify them, but calls to fsqrtx were generated anyway).

It's hard to believe that using stacked arrays would make it *slower*
than a version using malloc; did my modified version eliminate the dope
vectors? I do notice that it used doubles, not floats.
--
Dirk Grunwald -- Univ. of Illinois 		  (grunwald@flute.cs.uiuc.edu)

dl@cmx.npac.syr.edu (Doug Lea) (08/05/89)

There is a thorny problem in attempting to write C++ matrix code
(among similar sorts of applications) that can be compiled, via an
optimizing compiler like g++, to run as fast as well-optimized Fortran
code. 

Instead of using the example of Dave Nichols & Dirk Grunwald,
I'll use a simpler example that makes this clearer. In fact, the
example can be written using only C constructs, even though it is
definitely a C++-ish problem.

struct imatrix /* a matrix of integers */
{
  int  rows;   /* number of rows */
  int  cols;   /* number of columns */
  int* data;   /* pointer to rows*cols ints from freestore (or wherever) */
};

void fill(struct imatrix* mp, int x)
{
  int i, j;
  for (i = 0; i < mp->rows; ++i)
    for (j = 0; j < mp->cols; ++j)
      *(mp->data + i * mp->cols + j) = x; /* traverse in row-major order */
}
    

[In C++, you'd probably write this as something like

class Imatrix
{
private:
  int  rows;
  int  cols;
  int* data;
public:
       Imatrix(int r, int c) :rows(r), cols(c), data(new int[r * c]) {}
  int& operator() (int i, int j) { return *(data + i * cols + j); }
  void fill(int x)
  {
    for (int i = 0; i < rows; ++i)
      for (int j = 0; j < cols; ++j)
        (*this)(i, j) = x;
  }
};

but nothing I'll say depends on these differences.]

Fill() is an example of a function that cannot be optimized as well in
C or C++ as it could in other languages because the compiler cannot
safely assume that memory accesses through mp->data do not affect
mp->rows or mp->cols. Thus, neither mp->rows nor mp->cols may be
cached in a register, and very few other optimizations are possible.
This is easy to see by comparing the code generated for fill with that
for a similar fill2.

void fill2(int rows, int cols, int* data, int x)
{
  int i, j;
  for (i = 0; i < rows; ++i)
    for (j = 0; j < cols; ++j)
      *(data + i * cols + j) = x;
}

Using g++ (or gcc) (-O -fstrength-reduce), on a 1000 X 1000 imatrix,
the first fill is FIFTY TO EIGHT-HUNDRED PERCENT SLOWER than fill2,
across a few different machines and versions of g++/gcc. The problem
ought to show up most dramatically on RISC machines, where memory
access requires significant load delays over register access. Without
register caching, every single matrix reference must first retrieve
`mp->cols' from memory. Worse, no strength-reducing optimizations are
possible that would replace the i * mp->cols multiply by simple
address increments each time through the loop, as may be done (and is
done, in most recent versions of g++, on most machines) with fill2.
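
[A sketch of the obvious hand workaround, for comparison; it is not
from the original post. Hoisting the fields into locals restores the
fill2 situation, since stores through `data' cannot change locals whose
address is never taken. Of course, repeating this in every function
defeats much of the point of the abstraction.]

void fill3(struct imatrix* mp, int x)
{
  int  rows = mp->rows;   /* local copies: stores through data */
  int  cols = mp->cols;   /* cannot change these               */
  int* data = mp->data;
  int i, j;
  for (i = 0; i < rows; ++i)
    for (j = 0; j < cols; ++j)
      *(data + i * cols + j) = x;
}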

Thus, programming this kind of function in an `object-oriented' manner
results in an unacceptable performance hit.  (Of course, there is no
reason to traverse in row-major order in `fill', making it a poor
example in this sense, but the principle holds across more realistic
situations. The differences are slightly less dramatic in the
nichols/grunwald example because the time-consuming `sqrt' calculation
masks these effects to some extent on many machines.)

This is a very real, and very frustrating problem in g++. It is not
really a `problem' in AT&T cfront, since cfront generates C code; thus
nearly all optimization is in the hands of the backend C compiler, and
fewer such C++ (non-C) language issues impact the resulting code
efficiency. 

Nevertheless, I see the problem as a C++ issue, even though it also
occurs in C, because in C, you can always program things along the
lines of `fill2', while in C++, you cannot do this if you want to
simultaneously take advantages of the benefits of object-oriented
programming features.

Optimizers cannot pull the kinds of invariants needed to produce good
code out of thin air. There is nothing about the imatrix declaration
that allows a compiler to safely assume that `data' does not point to
the `rows' or `cols' fields. In fact, in the g++/gcc optimizer, even
if `data' is of type `double*', as would be the case in floating point
matrices, the optimizer still isn't sure that `data' doesn't point to
these fields, since a programmer could have applied a coercion (i.e.,
mp->data = (double*)&mp->rows).

There *may* be one way out of this using existing C++ constructs. If
the `rows' and `cols' fields were declared to be `const',

struct imatrix { const int rows; const int cols; int* data; };

then the optimizer might be allowed to assume that access through
`data' couldn't affect them. However, even this is possibly unsafe,
since a programmer could still have legally applied a coercion (i.e.
mp->data = (int*)&mp->rows). I cannot find anything officially stated
in the C++ reference manual that would enable a compiler to assume
that const fields are never altered via this kind of `side-channel'
access. Moreover, even if there were, using const fields in order
to guide optimization would preclude ever resizing the matrix.
In fact, this is probably the most important drawback of using
`const' in such applications.
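
[For concreteness, a sketch of the const-field variant in class form;
an illustration, not from the original post. The initializer list can
still set the fields, but nothing can later resize the matrix.]

class ConstImatrix
{
private:
  const int rows;	// settable only in the initializer list
  const int cols;
  int*      data;
public:
  ConstImatrix(int r, int c) : rows(r), cols(c), data(new int[r * c]) {}
  int& operator()(int i, int j) { return *(data + i * cols + j); }
  // no resize() is possible: rows and cols cannot be reassigned
};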

It seems that any other solution requires some kind of minor extension
in the C++ language. Here are my best three ideas about this:


                   *** Array References ***

I'll start with the simplest, but probably most controversial idea.
Suppose there were a way to change C++ so that the difference
between pointers/references to arrays and pointers/references to
single objects were *enforced*. For example, if `data' were a pointer
or reference to an array of ints, it could not point to (refer to) a
single int.  Surprisingly, there is a tolerably natural syntax
available for this.  Consider

struct imatrix { int rows; int cols; int[]& data; };

Define this to mean that data *MUST* refer to an array. This would
cure the kinds of problems mentioned above with fill, since `data'
could not possibly refer (point) to `rows' or `cols'. 

`data' elements can be accessed in the usual way via data[i].

For good reason, this syntax bears strong resemblance to that
for `vector_new' (e.g., `new int[100]').

Unfortunately, the `star' version needs to be supported
as well, mainly for vector_new. `int[]* p' must mean that
p points to an array of ints, elementwise dereferenceable via (*p)[i].

This *might* be easier to implement than it first seems. I see only
a few major changes that would be required minimally:

    1) Any <type>[]& may be SILENTLY, AUTOMATICALLY coerced into a <type>*,
        without `wasting' a coercion (Recall the C++ only-one coercion rule).
    2) The opposite coercion <type>* to <type>[]& is NEVER performed,
        except by manual intervention. An optimizing compiler
        is allowed to ignore any aliasing to non-array objects created
        by such coercions.
    3) If `a' is declared as <type> a[<count>], the type of a
        mention of `a' is <type>[]&. (When coupled with (1), this
        change is transparent to older existing C & C++ code).
    4) `vector_new' must be retyped to return a <type>[]*,
        and `vector_delete' should take a <type>[]*.

Note that `int[]* a' is VERY different from `int* a[]' (== `int** a') 
and that `int a[]' still means the same as `int* a'. 

The only bit of strangeness to this proposal is that `int[]& a' and
`int[]* p' serve only as idioms. Just using `int[] a' isn't
meaningful.

Note also that, depending on details that ought to be hammered out,
these rules seem to allow for reference assignments, unlike
rules for regular references. I'm not sure whether this is good or
bad.

The construct might also be very useful for function arguments. In

void f(int[]& d, int& x);

the compiler knows that x isn't aliased by d (although x *could*
still be an element of d). More importantly, f will only take
arrays, not pointers to single objects, as arguments:

void usef()
{
  int a;
  int[]& b = *(new int[100]);
  int c[100];
  int* d = new int;
  int x;
  
  f(&a, x); // ERROR
  f(b, x);  // OK
  f(c, x);  // OK
  f(d, x);  // ERROR

  b = c;    //  what do you think?
}

If programmers used this construct consistently whenever using arrays,
they might be spared a great many pointer errors. Perhaps its greatest
attraction is that the construct significantly improves the type safety
of C++.

I think that this idea contains a lot of promise. However, it
sidesteps the basic aliasing issue. The next idea has the opposite
features.


             ***  Register struct/class fields ***

According to ANSI and the C++ reference manual, if an object is
declared as `register', there are two consequences: the compiler is
encouraged to allocate a register for it, and, even if not, its
address may not be taken. Thus a register variable or argument may
never have an alias.

With a good register-allocating compiler like g++/gcc, the storage
class specification aspect of `register' is almost archaic. Given
this, `register' is only useful in its second meaning, as a *type
qualifier*, in the same league as `const' and `volatile'.

This is in fact what I propose: That `register' be treated as a 
type qualifier with respect to struct/class components, as in

struct imatrix
{
  register int rows;
  register int cols;
  int* data;
};

So far as I can see, this usage of `register' does not create
any new C++ parsing difficulties.

Again, encouragement to place such fields in registers is only
indirectly related to the use of `register' in this context. The
meaning of interest would be that

For any struct imatrix imat, constructs of the form
     int* rp = &imat.rows;
     int& rr = imat.rows;
would be illegal.

Constructs like
     imatrix* mp = &imat;
would still be OK (unless, of course, imat itself were declared as register).

In other words, register fields may not have aliases, although the
structs they are a part of may be aliased.  Thus, imat's rows field
could be changed only by direct modification (imat.rows = x). An
imatrix* mp could have its rows changed either manually or indirectly
via any f(imatrix*) function. (I take it as irrelevant to the
definition of such fields as registers that `imat.rows = x' implicitly
requires an offset address to be computed inside the compiler.)

There is a very unfortunate snag: the cast
     int* p = (int*)(&imat);
would still enable side-channel modification (as in *p = 1234 causing
imat.rows to be 1234). There seem to be two routes in dealing with this.

A `weak' interpretation would disable optimization if any potential
anything* alias for the struct existed and were destructively accessed.
Sadly, the `weak' interpretation does not help out at all, since, for
example, fill's mp->data could be an alias for mp.  

A `strong' interpretation of `register' would be to assume that no
side-channel access occurs, treating any mp->rows as potentially
volatile only when mp (or a possible imatrix* alias) modifies rows or
is possibly changed via a f(imatrix*). It appears that a compiler
operating under such an interpretation would generate essentially
identical code for both fill and fill2.

This means that, in order to be useful, the `strong' interpretation
would be required, even though its assumptions cannot be enforced.  I
believe that it should be adopted anyway.  In fact, I think that an
optimizer should be able to assume lack of side channel access in all
three of the cases discussed:

    * when a field is marked as `register'
    * when a field is marked as `const'
    * when there is a pointer type mismatch (e.g., double* data vs int rows)

For the same reason that C++ compilers emit warnings when a const is
coerced to a non-const, it would be nice to have a warning if a
pointer to a struct containing register fields were coerced into
something else.

There are several other related niceties that could be introduced
to make this a stronger proposal. Entities other than struct
components could similarly profit from the `register-as-type-qualifier'
notion.

Note that this use of `register' is NOT the same as the proposed ANSI
`noalias' qualifier: `register' means `cannot have an alias', while
`noalias' means `does not have an alias'. This difference is actually
quite substantial: Noalias can be applied to pointers to indicate
uniqueness.  For example, 
    void f(noalias struct imatrix* a, noalias struct imatrix* b); 
means that *a and *b are guaranteed distinct. This condition is
extremely difficult to enforce, and, in any case applies to a whole
different set of problems (chiefly, parallelization issues). No
such effect seems inherent in any use of `register' as described here.


            *** Compile-time directives ***

It is possible, although incredibly tedious, to use some kind of
#pragmas or asserts that, if recognized by the compiler, would provide
the same kind of information.  For example:

#include <assert.h>

void fill(struct imatrix* mp, int x)
{
  int i, j;
  assert(!((char*)mp->data < (char*)mp + sizeof(struct imatrix) &&
           (char*)(mp->data + mp->rows * mp->cols) > (char*)mp));
  /* i.e., the data block does not overlap the struct itself, so no
     access through data can touch rows or cols */
  for (i = 0; i < mp->rows; ++i)
    for (j = 0; j < mp->cols; ++j)
      *(mp->data + i * mp->cols + j) = x;
}

Pretty ugly, huh?


----

If anyone has any other ideas or comments on this, I'd love to hear them!
This issue is among the chief reasons that the libg++ matrix library
has not yet been released. I refuse to distribute a package that is
guaranteed to convince users that C++ is not a serious candidate
for fast Matrix computations! 


Doug Lea, Computer Science Dept., SUNY Oswego, Oswego, NY, 13126 (315)341-2367
email: dl@oswego.edu              or dl%oswego.edu@nisc.nyser.net
UUCP :...cornell!devvax!oswego!dl or ...rutgers!sunybcs!oswego!dl