[comp.lang.c++] inline and vectorization

bk19+@andrew.cmu.edu (Bradley D. Keister) (12/07/89)

I'd like to see some discussion on the use of inline in C++ for vectorizing
code to run on a machine like a Cray Y-MP.  Typical programs that I write
involve multidimensional integrals whose integrands in turn involve
manipulating small vectors and matrices, with a complex or real number
as a result.  I have now made classes for most of these vector and matrix
types.  The advantage of declaring them inline is that calls to their
functions in the innermost loop can be eliminated (a single call in the
innermost loop will defeat vectorization where it counts most: executing
pieces of the integration loops in parallel).  However, most of the C++
literature and postings I've seen strongly discourage the use of inline
except under special circumstances, and indeed this is borne out when
the backend C compiler gets overloaded with nested inline expansions.

It appears that making vectorizable code in C++ does not have an advantage,
at least in terms of using inline.  Is there a way out of this?  Should I
go back to using macros?
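For concreteness, here is a sketch of the kind of class I mean (the
`Vec3` name and interface are illustrative, not my actual code):

```cpp
#include <cassert>

// Illustrative small-vector class: member functions defined inside
// the class body are implicitly inline, so calls in the innermost
// integration loop can be expanded away entirely.
struct Vec3 {
    double v[3];
    Vec3(double a = 0.0, double b = 0.0, double c = 0.0) {
        v[0] = a; v[1] = b; v[2] = c;
    }
    // Dot product: once inlined, this is three multiplies and two
    // adds with no call to defeat vectorization of the outer loop.
    double dot(const Vec3& o) const {
        return v[0]*o.v[0] + v[1]*o.v[1] + v[2]*o.v[2];
    }
};

inline Vec3 operator+(const Vec3& a, const Vec3& b) {
    return Vec3(a.v[0] + b.v[0], a.v[1] + b.v[1], a.v[2] + b.v[2]);
}
```

With everything inline, an integrand built from such objects can, in
principle, vectorize just like the equivalent hand-written loop.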

Brad Keister
Physics Department
Carnegie Mellon U

keister@iguana.psc.edu

(412) 268-2772

mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (12/07/89)

In article <40827@lll-winken.LLNL.GOV> jac@muslix.UUCP (James
Crotinger) writes:
>  However I also have other concerns, which are generic to languages
>that support vector data types (ala CFT77 and Fortran 8x). Suppose
>I have a vector type and the following code:

>   vector A, B, C, D, E
>   E = A + B*C     // meaning elementwise multiplication
>   D = A - B*C 

>This is the style of programming that the vector syntax promotes. 

This is also the style of programming that is appropriate to
memory-to-memory vector machines (Cyber 205 and ETA-10), and (more
importantly) for SIMD parallel machines like the Connection Machine.
The code above runs at the same speed on the ETA-10 (for example)
whether B*C is pre-calculated or not, since the extra multiply can be
completely overlapped with the subtract in the second line.

Of course coding for the ETA-10 is not an interesting issue for most
of us these days, but I consider it very important to maintain a
reasonable level of source code compatibility between codes for the
Connection Machine and the Cray Y/MP and for other machines which have
insufficient memory bandwidth to run vector operations in the
streaming mode shown above.  These machines include: Cray-2, Convex,
IBM 3090, most (all?) Japanese supercomputers, Ardent Titan, as well
as most machines on the drawing boards (names withheld since I am
under non-disclosure on several of these.)  Even the Cray X/MP and
Y/MP benefit from reducing the memory traffic since this minimizes the
bank conflicts suffered in a multi-processing environment.

>My question
>is, how smart will the compilers get. Will compilers evaluate the common
>subexpression (B*C) once or twice?

I don't know of *any* vectorizer/optimizer which will do this sort of
optimization on vector quantities. Anyone from Cray care to comment on
the current status of the Cray compiler on this code? 

It is *very* important that this capability be developed, since more
and more machines are going to be memory-bandwidth-deficient in the
next few years.

>With the cfront model, the B*C stuff will
>end up in separate loops and it is highly unlikely that the compiler's
>subexpression analyzer will pick it up. I think what it boils down to is
>this: will the compilers be able to do "loop jamming" on the loops that
>are implied by the vector syntax? Even in Fortran, if you coded:

>   do i = 1, n
>     E(i) = A(i) + B(i) * C(i)
>   end do
>   do i = 1, n
>     D(i) = A(i) - B(i) * C(i)
>   end do

>the optimizer would not eliminate the common subexpression. But in fortran
>you'd never do this (well, I'd never). The loops would be "jammed" together:

>   do i = 1, n
>     E(i) = A(i) + B(i) * C(i)
>     D(i) = A(i) - B(i) * C(i)
>   end do

And converting array notation into this combined form requires a
significant data dependency analysis....  I think that part of the
problem is that vectorizers have been developed as stand-alone
source-to-source translators, and the optimization aspect of this
translation has been pretty minimal to date.

>Now the optimizer will optimize the heck out of this. Not only will the
>B*C product only be evaluated once, but most likely A, B, and C will
>only be fetched once. This latter point is not insignificant on a vector
>machine. Thus even if you write:

>    vector TMP = B * C
>    E = A + TMP
>    D = A - TMP

>if the underlying C compiler (or fortran compiler) can't figure out
>how to "jam" the loops itself, this will still be less efficient
>than hand coding the loop (and it takes more memory). 

yep....
--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu
		   mccalpin@scri1.scri.fsu.edu
		   mccalpin@delocn.udel.edu

jac@muslix.llnl.gov (James Crotinger) (12/08/89)

  I'd also like to see some discussion about using C++ (or other
languages with vector syntax) on vectorizing machines. We're going to
be getting C++ working here soon as well, and I'm wondering if it'll
be possible to write efficient vectorizing C++ code.

  I can foresee some problems. First of all, the C code cfront produces
when inlining is often quite complex-looking.  I'm genuinely worried
that the C compilers we are currently using will not be able to
vectorize the resulting code. If this is the case, then we lose big.

  However I also have other concerns, which are generic to languages
that support vector data types (a la CFT77 and Fortran 8x). Suppose
I have a vector type and the following code:

   vector A, B, C, D, E

   E = A + B*C     // meaning elementwise multiplication
   D = A - B*C 

This is the style of programming that the vector syntax promotes. My question
is, how smart will the compilers get? Will compilers evaluate the common
subexpression (B*C) once or twice? With the cfront model, the B*C stuff will
end up in separate loops and it is highly unlikely that the compiler's
subexpression analyzer will pick it up. I think what it boils down to is
this: will the compilers be able to do "loop jamming" on the loops that
are implied by the vector syntax? Even in Fortran, if you coded:

   do i = 1, n

     E(i) = A(i) + B(i) * C(i)
   
   end do

   do i = 1, n

     D(i) = A(i) - B(i) * C(i)

   end do

the optimizer would not eliminate the common subexpression. But in Fortran
you'd never do this (well, I'd never). The loops would be "jammed" together:

   do i = 1, n

     E(i) = A(i) + B(i) * C(i)
     D(i) = A(i) - B(i) * C(i)

   end do

Now the optimizer will optimize the heck out of this. Not only will the
B*C product only be evaluated once, but most likely A, B, and C will
only be fetched once. This latter point is not insignificant on a vector
machine. Thus even if you write:

    vector TMP = B * C

    E = A + TMP
    D = A - TMP

if the underlying C compiler (or fortran compiler) can't figure out
how to "jam" the loops itself, this will still be less efficient
than hand coding the loop (and it takes more memory). 

  Now then, you can't jam loops that are buried in separate subroutine
calls.  This is an instance where inlining is clearly necessary for
full optimization. The common subexpression can't even be eliminated
unless you could convince the C compiler that the call to operator*(B,C) 
has no side effects! But I don't think CFRONT will inline this, since
it has a loop in it. Is this true? In spite of all the recent cautions
on inlining,  this is a case that is pretty clear to me. If you'd
otherwise code it out in long-hand, then you've got nothing to lose
from making it an inline function, and much to gain. Certainly, it is
much preferable to a macro. 
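To make the point concrete, here is roughly what I have in mind (a
hypothetical fixed-length `vec` type; a real class would carry a
length):

```cpp
#include <cassert>
#include <cstddef>

const std::size_t N = 64;

// Hypothetical fixed-length vector type, for illustration only.
struct vec {
    double d[N];
};

// Elementwise product: this is the loop-carrying operator that
// cfront may refuse to inline.  If it remains an out-of-line call,
// the common subexpression B*C cannot be eliminated and the loops
// can never be jammed.
inline vec operator*(const vec& b, const vec& c) {
    vec r;
    for (std::size_t i = 0; i < N; ++i)
        r.d[i] = b.d[i] * c.d[i];
    return r;
}

inline vec operator+(const vec& a, const vec& t) {
    vec r;
    for (std::size_t i = 0; i < N; ++i)
        r.d[i] = a.d[i] + t.d[i];
    return r;
}
```

With both operators expanded inline, E = A + B*C becomes two adjacent
loops over a compiler temporary, and jamming them is at least possible
for the backend to attempt.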

  So, can CFRONT inline this? Can it be fixed to do so? And if so, can
compilers be written to fully optimize this code by doing loop
jamming?  It almost seems like this type of optimization would be
easier to spot at the level of the C code, or the C++ code, rather
than after the code has been translated into some lower level
intermediate language, where most optimizers do their work.

  Jim

bright@Data-IO.COM (Walter Bright) (12/09/89)

In article <40827@lll-winken.LLNL.GOV> jac@muslix.UUCP (James Crotinger) writes:
<Suppose I have a vector type and the following code:
<   vector A, B, C, D, E
<   E = A + B*C     // meaning elementwise multiplication
<   D = A - B*C 
<will the compilers be able to do "loop jamming" on the loops that 
<are implied by the vector syntax.

No compiler I know of can do this (in any language) if the loops are
inserted by inline functions.

The types of optimizations done are driven by the type of code that
programmers tend to write. If a language encourages things to be written
a certain way, but the optimizer generates lousy code for that, then
there is pressure to change the optimizer. I've made a number of
modifications to the optimizer (for ZTC) to recognize certain things
that are common in C++, but very rare in ordinary C.

I don't see a serious technical obstacle to an optimizer doing the loop
jamming you described. The optimization is not done currently because
in C people just don't write code that such an optimization would improve.
C++ changes the way code is written, however, so there will be pressure
on C++ vendors to do those optimizations.

Of course, that doesn't solve your problem today. Perhaps 2-3 years from
now :-).

jac@muslix.llnl.gov (James Crotinger) (12/09/89)

In article <MCCALPIN.89Dec7105739@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>This is also the style of programming that is appropriate to
>memory-to-memory vector machines (Cyber 205 and ETA-10), and (more
>importantly) for SIMD parallel machines like the Connection Machine.
>The code above runs at the same speed on the ETA-10 (for example)
>whether B*C is pre-calculated or not, since the extra multiply can be
>completely overlapped with the subtract in the second line.
>
  I guess I find that a bit hard to buy. The X-MP also does chaining,
but the above example runs 33% slower on our X-MP when written using
Cray Fortran's vector notation than it does when the loop is written
out explicitly, with both loops jammed into one. Interestingly,
on the Cray 2, which does no chaining, the vector-style version is only
20% slower. I suspect that the memory bandwidth is what's really the
killer here. In the version written out as one jammed loop,
the Cray should do the following:

      loop:

         load 64 elements of A
         load 64 elements of B
         load 64 elements of C
         calculate E = A + B*C 	(for 64 elements)
         calculate D = A - B*C 	(ditto)
         store E		(ditto)
         store D		(ditto)
         goto loop

(with appropriate logic to end the loop). The savings of not having 
to go out to memory to get A, B, and C twice are not small at all. 
Furthermore, on the Cray 2 stuff like storing E can be overlapped
with the calculation of D...

>
>>My question
>>is, how smart will the compilers get. Will compilers evaluate the common
>>subexpression (B*C) once or twice?
>
>I don't know of *any* vectorizer/optimizer which will do this sort of
>optimization on vector quantities. Anyone from Cray care to comment on
>the current status of the Cray compiler on this code? 
>
>It is *very* important that this capability be developed, since more
>and more machines are going to be memory-bandwidth-deficient in the
>next few years.
>

  Exactly.

  Jim

jimm@amc-gw.amc.com (Jim McElroy) (12/13/89)

In article <40827@lll-winken.LLNL.GOV> jac@muslix.UUCP
(James Crotinger) writes:

>   However I also have other concerns, which are generic to
>   languages that support vector data types (ala CFT77 and
>   Fortran 8x).  Suppose I have a vector type and the following
>   code:
>   
>      vector A, B, C, D, E
>      E = A + B*C     // meaning elementwise multiplication
>      D = A - B*C 
>   
>   This is the style of programming that the vector syntax
>   promotes.
   
The kind of optimizations that you are hoping for would be very
difficult for a language that tries to be general purpose. 
There are many examples of operations on classes that would not
accomplish the programmer's intended effect if the compiler
began omitting operations, saving intermediate results or even
altering the order of operations across statement boundaries. 
The compiler would have to know that the operations (*, +, etc.)
had no side effects that the programmer was counting on.

There is, however, a way.  Suppose we have a set of vector and
matrix classes that encapsulate the data values.  We then
implement the operations (*, +, etc.) so that the operations are
not carried out immediately, but instead, a data structure is
built that records the desired operations and operands.  (A
"list of operations" structure.)  This structure would continue
to be added to until some operation is called for that would
carry a result value outside of the encapsulation -- that is,
outside of the set of vector and matrix classes.  For example, a
function such as "matrix.output()" would require that the
appropriate value actually be calculated.  Now that a computed
value is *really* needed, the big, hairy, optimizer-evaluator
function (that we have to figure out how to write) is called to
do the real work.  This function would analyze the "code" that
we have built in the "list of operations", study the data
dependencies, reorder the operations optimally, make necessary
copies of temporary results for later reuse, eliminate copies
that are not needed and so forth.  Now this optimizer-evaluator
function knows the problem domain and might also know the
peculiarities of the machine architecture that will finally
carry out the operations.  There is thus a very good opportunity
to exploit parallelism even in strongly machine-dependent ways.
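A bare-bones sketch of the idea (all the names here -- MulAddExpr,
madd, assign -- are made up for illustration, and a real version would
record arbitrary expressions, not just a +/- b*c):

```cpp
#include <cassert>
#include <cstddef>

const std::size_t N = 4;

struct vec {
    double d[N];
};

// Proxy recording "a + b*c" or "a - b*c" without computing it.
// Evaluation is deferred until a value is really needed.
struct MulAddExpr {
    const vec *a, *b, *c;
    int sign;                       // +1 for a + b*c, -1 for a - b*c
};

inline MulAddExpr madd(const vec& a, const vec& b, const vec& c,
                       int sign) {
    MulAddExpr e;
    e.a = &a; e.b = &b; e.c = &c; e.sign = sign;
    return e;
}

// The "optimizer-evaluator": here trivially a single fused loop,
// fetching each operand exactly once.  This is where knowledge of
// the machine architecture would live.
inline void assign(vec& dst, const MulAddExpr& e) {
    for (std::size_t i = 0; i < N; ++i)
        dst.d[i] = e.a->d[i] + e.sign * (e.b->d[i] * e.c->d[i]);
}
```

The application writes only vector expressions; the evaluator can be
retuned per machine without touching the application code.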

It is likely that such a set of classes would need to be "tuned"
when the system is ported to a new architecture, but the
application using the classes would remain unchanged.  Yay.

How far can this idea be pushed?  Loop unrolling?

I hope somebody out there can work on this.


Jim McElroy

jac@muslix.llnl.gov (James Crotinger) (12/14/89)

In article <1061@amc-gw.amc.com> jimm@amc-gw.amc.com (Jim McElroy) writes:
>The kind of optimizations that you are hoping for would be very
>difficult for a language that tries to be general purpose. 
...
>The compiler would have to know that the operations (*, +, etc.)
>had no side effects that the programmer was counting on.
>
  But if the operations are inlined, then the compiler can figure
this out!

>There is, however, a way.  Suppose we have a set of vector and
...describes a method where, if I understand it, the operators would
build up a syntax tree, and when a value was finally required (e.g.
operator=()) the syntax tree would be optimized and then evaluated.

  I've read this suggestion before, but in a different context as
a method to optimize complex vector expressions. (e.g. d = a + b + c)
This is related since currently d = a + b + c would compile into 
something like (I believe -- I really should check to make sure). 

  tmp = a + b;
  d = tmp + c;

which, if inlined, would become (fortran notation)

  do i = 1, n
    tmp(i) = a(i) + b(i)
  end do

  do i = 1, n
    d(i) = tmp(i) + c(i)
  end do

Now, if these loops were jammed and if the compiler can figure out 
that tmp is unused in the rest of the program, it can optimize
this to

  do i = 1, n
    d(i) = a(i) + b(i) + c(i)
  end do

which is exactly what we want. So, if we can inline loops, and if the
optimizers can be made smart enough to do optimizations like the above,
then I suspect that the above solution is better than the one which
would build up a parse tree out of the operators and then have the class
itself try to optimize the parse tree when operator=() is called. 
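Spelled out, the two forms the optimizer must prove equivalent look
like this (illustrative C++, fixed length, made-up function names):

```cpp
#include <cassert>
#include <cstddef>

const std::size_t N = 8;

// What "d = a + b + c" looks like after cfront-style inlining:
// two loops and a temporary, one per inlined operator+.
void add_unjammed(double* d, const double* a, const double* b,
                  const double* c) {
    double tmp[N];
    for (std::size_t i = 0; i < N; ++i)   // loop from operator+(a, b)
        tmp[i] = a[i] + b[i];
    for (std::size_t i = 0; i < N; ++i)   // loop from operator+(tmp, c)
        d[i] = tmp[i] + c[i];
}

// What a jamming optimizer should produce, given that tmp is dead
// afterwards: one loop, no temporary, each operand fetched once.
void add_jammed(double* d, const double* a, const double* b,
                const double* c) {
    for (std::size_t i = 0; i < N; ++i)
        d[i] = a[i] + b[i] + c[i];
}
```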

>How far can this idea be pushed?  Loop unrolling?

  I think our C compiler already does this. Certainly our Fortran
compilers do.
