[comp.lang.fortran] strength reduction of array address calculations

gl8f@astsun7.astro.Virginia.EDU (Greg Lindahl) (11/29/90)

In article <BURLEY.90Nov28084920@pogo.ai.mit.edu> burley@pogo.ai.mit.edu (Craig Burley) writes:

>Hmm, could you explain to me how to strength-reduce multiplication to addition
>for random references to array elements within loops, as in

You can't. But many references are regular. So in this loop:

   do j = 1, 100
      do i = 1, 100
         bleah = a( i, j )
      enddo         ^^^^ - all loop vars
   enddo

you know that you can *always* strength-reduce the address
calculation, without any fancy analysis whatsoever. Moreover, even
with a 1-d array, you can often easily see that an address is being
incremented by 4 or 8 bytes, and avoid having to calculate an index
and shift it left 2 or 3 bits. A C compiler, on the other hand, would
have to verify that the loop variables weren't modified inside the
loop.

Since you're working with a sophisticated compiler, the last point
isn't that important to you, but if you are working on a simple
compiler, simple optimization possibilities are nice to have around.
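For concreteness, here is a sketch in C (illustrative only; the names are
invented, and a real compiler performs this transformation itself) of what
that strength reduction buys. Fortran's a(i,j) with i varying fastest walks
memory contiguously, so the C analogue is a[j][i] with i in the inner loop:

```c
#define N 100

float a[N][N];

/* Naive form: the compiler sees an index computation j*N + i on
   every trip through the inner loop. */
float sum_naive(void)
{
    float bleah = 0.0f;
    for (int j = 0; j < N; ++j)
        for (int i = 0; i < N; ++i)
            bleah += a[j][i];
    return bleah;
}

/* Strength-reduced form: the multiply disappears; the address is
   one pointer bumped by a single element per iteration. */
float sum_reduced(void)
{
    float bleah = 0.0f;
    const float *p = &a[0][0];
    for (int trips = N * N; trips > 0; --trips)
        bleah += *p++;
    return bleah;
}
```

Both walk the array in the same memory order, so they produce identical
results; the second simply never multiplies in the loop.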

burley@pogo.ai.mit.edu (Craig Burley) (11/29/90)

In article <1990Nov28.213735.25563@murdoch.acc.Virginia.EDU> gl8f@astsun7.astro.Virginia.EDU (Greg Lindahl) writes:

   In article <BURLEY.90Nov28084920@pogo.ai.mit.edu> burley@pogo.ai.mit.edu (Craig Burley) writes:

   >Hmm, could you explain to me how to strength-reduce multiplication to addition
   >for random references to array elements within loops, as in

   You can't. But many references are regular. So in this loop:

      do j = 1, 100
	 do i = 1, 100
	    bleah = a( i, j )
	 enddo         ^^^^ - all loop vars
      enddo

C'mon guys, enough is enough: I said "for RANDOM REFERENCES to array
elements within loops" clearly enough!  We've beaten the non-random reference
issue to death.  I was responding (somewhat tongue in cheek, though hoping I'd
actually get an incredibly insightful answer) to someone who had said one can
strength-reduce the multiplies away in loops in response to yet another person
who clearly had been talking about, yes, RANDOM references.

So, yes, I know (or am pretty sure) you can't.  But I asked the question anyway,
because I wanted to know the poster's answer, or at least suggest without being
overly and publicly rude that perhaps he shouldn't have responded to a post
about random references to arrays saying they can be strength-reduced.

A common example of random access is the scatter/gather operation.  The array
element access computation in the following function in C:

    void foo(a,b,c)
    float a[100][100];
    int b[100];
    int c[100];
    {
    int i, j;

    for (i = 0; i < 100; ++i)
      for (j = 0; j < 100; ++j)
        a[b[i]][c[j]] = 3.1415*(i+j);
    }

can and must be implemented using pointer indirection, not multiplication.  The
same computation in this similar function in Fortran:

    SUBROUTINE FOO(A,B,C)
    REAL A(100,100)
    INTEGER B(100),C(100),I,J
C
    DO I=1,100
       DO J=1,100
          A(C(J),B(I)) = 3.1415*(I+J)
       END DO
    END DO
C
    END

can and must be implemented using multiplication, not pointer indirection.

For many applications, the choice is Fortran, not due to optimization but to
the requirement that the caller of FOO be able to pass any shape arrays to FOO,
as in:

    REAL ARRAY(10,10,10,10)
    INTEGER INDX1(10,10),INDX2(5,2,5,2)
    CALL FOO(ARRAY,INDX1,INDX2)

The corresponding call in C:

    float array[10][10][10][10];
    int indx1[10][10];
    int indx2[5][2][5][2];
    foo(array,indx1,indx2);

would not work.

Further, with regard to optimization, it is likely true that on many computers
(especially older ones) references to memory (and let's assume we do take cache misses)
are faster than integer multiplies.  However, it's my guess that on many machines
where people are actually coding fast numerical/scientific computing problems,
integer (and even floating-point) multiplies are as fast or faster, or are done
in a parallel fashion and end up being preferable in cases where memory is
a bottleneck but the multiplier is available.
--

James Craig Burley, Software Craftsperson    burley@ai.mit.edu

burley@pogo.ai.mit.edu (Craig Burley) (11/29/90)

Woops, among any errors anyone else finds, the Fortran version of the subroutine
FOO should have calculated 3.1415*(I+J-2), not 3.1415*(I+J), since the C version
iterated i and j from 0 thru 99 while the Fortran version iterated from 1 thru
100.  Sorry.
--

James Craig Burley, Software Craftsperson    burley@ai.mit.edu

gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) (11/29/90)

In article <BURLEY.90Nov28205003@pogo.ai.mit.edu> burley@pogo.ai.mit.edu (Craig Burley) writes:

> I was responding (somewhat tongue in cheek, though hoping I'd
>actually get an incredibly insightful answer) to someone who had said one can
>strength-reduce the multiplies away in loops in response to yet another person
>who clearly had been talking about, yes, RANDOM references.

I was not claiming that I could strength-reduce random references. I
was claiming that most references were in fact not random.

>So, yes, I know (or am pretty sure) you can't.  But I asked the
>question anyway, because I wanted to know the poster's answer, or at
>least suggest without being overly and publically rude that perhaps he
>shouldn't have responded to a post about random references to arrays
>saying they can be strength-reduced.

Well, I think you've managed to be publicly rude ;-)

I responded because the original poster, Dan, made a statement
comparing flat arrays and double-pointer arrays that I found untrue.
Whether or not he's talking about random references, I find that
strength reduction can reduce the cost of flat arrays IN MY CODE to be
less than double-pointer arrays. Why? Because my array accesses are
regular. And that's what my posting was about.  If you want to talk
about random arrays, well, it's a fascinating subject, but that
doesn't preclude discussion of flat arrays.

>A common example of random accesses are scatter/gather operations.  The array
>element access computation in the following function in C:
>
>    void foo(a,b,c)
>    float a[100][100];
>    int b[100];
>    int c[100];
>    {
>    int i, j;
>
>    for (i = 0; i < 100; ++i)
>      for (j = 0; j < 100; ++j)
>        a[b[i]][c[j]] = 3.1415*(i+j);
>    }

This can be strength-reduced, although I would be hard pressed to see
how the compiler could do it automagically. If, in Fortran, you do
the gather with each column getting one outer loop iteration, strength
reduction is possible. It'd look something like this:

    DO I = 1, NUMCOLS
       ICOL = B(I)
       DO J = 1, NUMROWS(I)
          A( C(I,J), ICOL ) = 3.1415*(whatever)
       ENDDO
    ENDDO

Since ICOL is constant across the inner loop, it can be strength
reduced. I hope I haven't switched rows and columns, I can never
remember which one's which by name although I know which index varies
fastest :-)

The major problem with implementing this is that you have a different
number of row elements to be gathered in each column, so using a flat
array C() wastes space if you don't have a good idea of the upper
bound. In cases where you are merely permuting things, this isn't a
problem. In other cases, it might be a big problem.

If you are using rectangular arrays, a smart FORTRAN compiler could
get the best of both worlds by using flat arrays and making a work
array of pointers to rows, which it could use for any address
calculation which couldn't be strength-reduced.
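A minimal sketch of that best-of-both-worlds idea in C (all names invented):
keep the data flat so regular accesses strength-reduce, but also build a
work array of row pointers, paying one multiply per row up front so that
any access that can't be strength-reduced costs only indirection:

```c
#define ROWS 4
#define COLS 5

float a2[ROWS * COLS];      /* the flat array */
float *row[ROWS];           /* work array of pointers to rows */

/* One multiply per row, done once, outside any hot loop. */
void build_rows(void)
{
    for (int r = 0; r < ROWS; ++r)
        row[r] = &a2[r * COLS];
}

/* A "random" element access now costs a load and an add, no multiply. */
float get(int r, int c)
{
    return row[r][c];
}
```

After build_rows(), get(r,c) and a2[r*COLS + c] name the same element; the
compiler (or programmer) picks whichever form suits the access pattern.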

>For many applications, the choice is Fortran, not due to optimization
>but to the requirement that the caller of FOO be able to pass any
>shape arrays to FOO, as in:
>
>    REAL ARRAY(10,10,10,10),INDX1(10,10),INDX2(5,2,5,2)
>    CALL FOO(ARRAY,INDX1,INDX2)

But you can use flat arrays in C. All it takes is a define... and then
your C compiler can often strength-reduce regular accesses.
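A hypothetical sketch of that "define" approach (IDX and fill are invented
names): pass a flat pointer and subscript through a macro, so the compiler
sees ordinary index arithmetic it can strength-reduce, and the caller can
reshape across the call just as in Fortran:

```c
/* 2-d subscripting over a flat array, row-major. */
#define IDX(a, i, j, ncols) ((a)[(i) * (ncols) + (j)])

void fill(float *a, int nrows, int ncols)
{
    for (int i = 0; i < nrows; ++i)
        for (int j = 0; j < ncols; ++j)
            IDX(a, i, j, ncols) = (float)(i * 10 + j);
}
```

A caller with `float big[6][7]` passes `&big[0][0]` along with whatever
dimensions it likes, provided the total size fits.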

burley@pogo.ai.mit.edu (Craig Burley) (11/29/90)

In article <1990Nov29.041958.2254@murdoch.acc.Virginia.EDU> gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) writes:

   In article <BURLEY.90Nov28205003@pogo.ai.mit.edu> burley@pogo.ai.mit.edu (Craig Burley) writes:

   > I was responding (somewhat tongue in cheek, though hoping I'd
   >actually get an incredibly insightful answer) to someone who had said one can
   >strength-reduce the multiplies away in loops in response to yet another person
   >who clearly had been talking about, yes, RANDOM references.

   I was not claiming that I could strength-reduce random references. I
   was claiming that most references were in fact not random.

   >So, yes, I know (or am pretty sure) you can't.  But I asked the
   >question anyway, because I wanted to know the poster's answer, or at
   >least suggest without being overly and publically rude that perhaps he
   >shouldn't have responded to a post about random references to arrays
   >saying they can be strength-reduced.

   Well, I think you've managed to be publically rude ;-)

I see the smiley, but I do want to clarify: in no way did I consider
Greg's posting, to which I originally responded, silly or anything.
It just seemed to be a case of losing track of how the discussion
progressed, as most of us do (ok, I know I do it).  I did get a little
annoyed at the private and public responses to my post talking about
non-random references when clearly the particular subtopic had become
random references as a subclass of array references in general: "You know,
mammals are ..." (big discussion); "Well, that's usually the case, but if
you have to deal with horses, they are an exception, because they are...";
"Yes, but most mammals aren't horses, so it really isn't important!"  HUH?
Sigh.  We need an object-oriented newsgroup system (inheritance in
particular) because it is really hard to keep track of all these threads
of discussion, especially as topics are refined into subtopics without it
being clear in postings about subtopics that they are intentionally being
discussed as subtopics rather than supertopics.  Who wants to trace back
the discussion of every thread before replying?  Oh well, dream on.

   >A common example of random accesses are scatter/gather operations.  The array
   >element access computation in the following function in C:
   >
   >    void foo(a,b,c)
   >    float a[100][100];
   >    int b[100];
   >    int c[100];
   >    {
   >    int i, j;
   >
   >    for (i = 0; i < 100; ++i)
   >      for (j = 0; j < 100; ++j)
   >        a[b[i]][c[j]] = 3.1415*(i+j);
   >    }
   
   This can be strength-reduced, although I would be hard pressed to see
   how the compiler could do it automagaically. If, in Fortran, you do
   the gather with each column getting one outer loop iteration, strength
   reduction is possible. It'd look something like this:
   
       DO I = 1, NUMCOLS
          ICOL = B(I)
          DO J = 1, NUMROWS(I)
             A( C(I,J), ICOL ) = 3.1415*(whatever)
          ENDDO
       ENDDO
   
   Since ICOL is constant across the inner loop, it can be strength
   reduced. I hope I haven't switched rows and columns, I can never
   remember which one's which by name although I know which index varies
   fastest :-)

Hmm, first off, I should have specified b[(i+j)/2] in my sample program just
to make things sufficiently "random" (and done the equivalent for the
Fortran procedure).

Next, I should have pointed out that while references to <a> were random,
the references to <b> and <c> weren't, so this particular example (some kind
of scatter/gather thing, I don't write this kind of code so I'm no
expert on what it's used for) still contains array references that are
within the scope of discussion of C's pointer accesses vs. Fortran's
array accesses (though as has been discussed, with one-dimensional arrays
and optimizing compilers, there ain't much overall difference across all
computing systems used today).

Third, I think Greg's new version of the Fortran has "C(I,J)" where it
should have "C(J)" -- but I may just be confused.

Fourth, I break out in cold sweats whenever I think about writing anything
down in Fortran vs. other languages regarding multi-dimensional arrays
due to column-major vs. row-major ordering.  But when it actually
comes down to it, all I have to do is think about it in my favorite way
(which I think is called row-major), and then reverse the dimensions when
writing Fortran, and it works.  I hope.  Greg's example, aside from the
third nit above, looks fine to me.

Finally, this whole thread has been fascinating and highly informative for
two reasons (yes, here's another numbered list -- that's what you get for
reading the ramblings of a one-time tech writer):

One, I'm saving some of the juicy stuff for review when GNU Fortran is
working to serve as optimizations to check for and perhaps implement
as time becomes available.

Two, it makes me think about how both C and Fortran have fundamental
omissions/commissions regarding arrays.  C doesn't let you work with
multi-dimensional arrays directly, you must express them using arrays
of arrays (which is a different means of expression, even though it may or
may not be more efficient for a specific problem) or by writing your
own subscripting expressions for a single monolithic array (which allows
the auto-reshaping-across-procedure-boundary often called for by Fortran
programmers).  Fortran doesn't let you say that when you are expecting
an argument A to a procedure and say DIMENSION A(10,10), you really know
that the array passed will itself be dimensioned that way, and that when
passing an array to a procedure, it'll expect any optimization info on its
known shape to be passed; the result being that, unless inter-procedural
optimization is done, arrays must be passed as flat and must be received
as flat, and pointer maps to speed up multi-dimensional array accesses
can be used only entirely within the given procedure.

I don't know how easy or reasonable it would be to add multi-dimensional
arrays to C (I can see the newsgroup discussions now: "C is just as good
as Fortran for scientific computing"; "No it isn't, C doesn't have
true multi-dimensional arrays"; "Well, you can accomplish it using other
means"; "That can be inefficient"; "Well, it could be added to C fairly
easily"; "Ok, let's add it"; "Hey!  Why are you making my favorite
language bigger?  I don't need this feature!"; "Sure you do, it's great
for scientific computing"; "But I don't use C for scientific computing!").

But with Fortran there could be a means for declaring the interface to a
procedure as requiring an array of a particular shape, and the definition
of that procedure would codify that expectation, so all callers of the
procedure would not only know that they had to pass an array of that shape,
but that they could also pass any pointer maps that might be useful to
access elements of the array optimally on a given architecture, and the
procedure could assume the existence of those pointer maps.  Gee, maybe
that's what some of that stuff in the Fortran 90 spec was trying to do; I'll
have to look at it again to see if it makes more sense to me in the light
of this idea for a feature. (-:

Ultimately I always believe the better language is one in which you can
state at the highest level what you want to accomplish.  C's requirement
for either arrays of arrays or hand-coded subscript calculation (and I
don't consider macros or inline functions to affect this line of
reasoning) when you really want to simply specify a multi-dimensional array
means it is not as good as Fortran in this respect.  Fortran's self-
imposed requirement that it be able to reshape arrays across procedure
calls in all cases, by virtue of not allowing you to say "I know this
array is indeed this shape in both the called procedure and in all its
calling procedures", means it is not as good as C if you use C's array of
arrays approach (where it is a given that the actual and formal arrays are
the same shape, sort of).

When you are able to say what you mean in a language, then it is better
for maintainability, retargetability (in terms of other languages),
portability (other machines), and optimization, or at least as good.
An optimizing compiler might well be able to optimize multi-dimensional
array accesses coded that way, but not coded as low-level expressions a
la the approach one would use in C when representing a conceptual
multi-dimensional array as a single-dimensioned array in the code.

Some compilers might be able to optimize the low-level expressions as well
by recognizing patterns.  Such compilers have no real trouble with the
higher-level constructs: for compilation speed, they can optimize them
directly, or, for internal elegance and verifiability, they can "lower" them
to the equivalent low-level expressions and let the powerful optimization
at that level take care of the job.

An example of this kind of lowering and letting powerful low-level
optimization do good things is taking:

      DO 10 I=1,100
      ...
10    CONTINUE

and turning it into:

      I = 1                              (code-visible iterator)
      TEMP1 = (100-I+1)/1                (# of trips)
      TEMP2 = 1                          (increment value for iterator)
T1    IF (TEMP1.LE.0) GOTO T2            (any trips left?)
      ...
10    CONTINUE
      I = I + TEMP2                      (bump iterator for program)
      TEMP1 = TEMP1 - 1                  (countdown trip)
      GO TO T1                           (next iteration)
T2    ...

Then a general flow analyzer detects that T1:T2 is a loop and determines
whether it fits into one of various models of loops with corresponding
optimized implementations.  This way, not only is the original DO loop
shown optimized via this lowering, but code written like the lowered version
also is transformed.  Do people really write code like the lowered version?
Apparently, when you do things like dead-code elimination, flow reordering,
and other things, yes, you can end up with things that are really loops
that didn't look that way at first.  Useless-variable elimination via
def-use chains, along with array element access elimination, not only
eliminates some or all of the temporary variables created (either by
replacing their values with machine instruction equivalents, like replacing
I = I + TEMP2 with the machine's instruction for C's ++I or by at
least overlaying the temp with other temps in separate loops) but
may eliminate, for example, I as well (if the array references involving
I can be strength-reduced into, say, pointers incremented each time the
loop iterates, as one would do in C).  And some machines loop faster when
the loop ends on a value reaching 0 than reaching some other number, which
is why TEMP1 is set up to do that by calculating the number of trips.
(FYI: It is not valid standard Fortran for the program to change the value
of a DO loop iterator while its DO loop is still active, so the above
transformation safely ignores the possibility of a change to I during the
loop.  This is another reason Fortran code can sometimes be faster than
"equivalent" C code; and, of course, if the programmer isn't aware of the
restriction, violates it by modifying I during the loop and expecting the
number of trips to be adjusted accordingly, and doesn't get a compile-time
message, the result is buggy code that will work on some systems and not
others -- hey, I'm gonna go add just that kind of check to GNU Fortran right
now!)
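The lowered form above can be sketched in C (names invented): trip count
computed up front, countdown to zero for a cheap end test, and the
reference x(I) strength-reduced to a bumped pointer, after which the
iterator I is dead and vanishes entirely:

```c
#define M 100

float x[M];

void fill_loop(void)
{
    float *p = x;                 /* replaces the address of x(I) */
    int trips = (M - 1 + 1) / 1;  /* TEMP1: (last - first + incr)/incr */
    while (trips-- > 0)
        *p++ = 1.5f;              /* was x(I) = 1.5; no I in sight */
}
```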

The same can be done for array references.  Change:

A(I,J)

to:

A[(I-lbound(A,1))+((J-lbound(A,2))*extent(A,1))]

where lbound(x,y) is the lower bound of array x in dimension y, hbound
(not mentioned above) is the same for the upper bound, and extent(x,y)
is hbound(x,y)-lbound(x,y)+1.
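That flattening can be checked mechanically; here is a small C sketch for
an array declared like REAL A(2:4,3:6) (the bounds are arbitrary example
values, and the names are invented):

```c
#define LB1 2
#define UB1 4
#define LB2 3
#define UB2 6
#define EXTENT1 (UB1 - LB1 + 1)
#define EXTENT2 (UB2 - LB2 + 1)

/* Flat offset of A(I,J): (I - lbound1) + (J - lbound2)*extent1 */
int flat(int i, int j)
{
    return (i - LB1) + (j - LB2) * EXTENT1;
}
```

Walking J in the outer loop and I in the inner loop visits offsets
0 through EXTENT1*EXTENT2 - 1 in order, which is exactly Fortran's
column-major storage sequence.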

Then, things like strength reduction and loop induction are still doable,
but might well apply to more cases than if restricted to explicit multi-
dimensional array accesses.

It is possible for an optimizer to strength-reduce too much, i.e. to the
point where the code ends up less optimal than it might otherwise have been,
on a machine where resource and instruction scheduling is important.  It
might be that the resulting intermediate code requires too many operations
of the adder unit and yet leaves the multiplier unit twiddling its thumbs
much of the time.  To keep this from happening, one could use a backtracking
mechanism that guesses at a best goal for some code, repeatedly tries
various optimizations, and schedules the results until that goal is
achieved, keeping track of the best result so far in case it isn't --
this handles the problem by not necessarily keeping the completely
strength-reduced version as the best achieved.  But I lean towards going
ahead and doing the strength-reduction, then providing information for
the instruction and resource scheduler so that it can selectively move
candidate operations off, in this case, a busy adder unit and onto an
idle multiplier unit, and automatically generate any support code that
kind of adjustment requires (and make sure things don't get worse in
the process), because it's the scheduler that knows just where such
movement is needed, not the code that decides whether to strength-reduce
or "induct" an expression.  But the whole proposition is RISCy... :-)
--

James Craig Burley, Software Craftsperson    burley@ai.mit.edu

3003jalp@ucsbuxa.ucsb.edu (Applied Magnetics) (11/30/90)

In article <BURLEY.90Nov29103957@pogo.ai.mit.edu> burley@pogo.ai.mit.edu (Craig Burley) writes:

>[Fascinating discussion of compiler internals]

Is it standard practice to translate the high-level language to
low-level construct and _then_ optimize?  Isn't it a case of
reverse-engineering?  (Then again, vectorizing Fortrans do just that:
reverse-engineer DO loops into vector operations.)

--P. Asselin, R&D, Applied Magnetics Corp.  I speak for me.

smryan@garth.UUCP (Steven Ryan) (12/04/90)

>... I was responding (somewhat tongue in cheek, though hoping I'd
>actually get an incredibly insightful answer) to someone who had said one can
>strength-reduce the multiplies away in loops in response to yet another person
>who clearly had been talking about, yes, RANDOM references.

Given something like

	real a(0:9,0:9),b(0:9,0:9)
	integer e,f,g,h
	do j=0,9
	  do i=0,9
	    a(e(i),f(j)) = b(g(i),h(j))
	  enddo
	enddo
	end

you can get rid of the multiplication by using

	real a(0:99),b(0:99)
	integer e,f,g,h
	integer m(0:9)
	data m/0,10,20,30,40,50,60,70,80,90/
	do j=0,9
	  do i=0,9
	    a(m(e(i))+f(j)) = b(m(g(i))+h(j))
	  enddo
	enddo
	end

which the compiler can do for you. Even when the multiplier is unknown,
as with adjustable arrays, the multiplier vector can be filled in with
a loop using only adds.
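The same trick sketched in C (names invented): fill mtab[k] = k*NDIM with a
loop of adds, so a "random" 2-d reference a(e,f) becomes a[mtab[e] + f], a
table lookup plus an add with no multiply in the inner loop, and the build
loop itself works even when the dimension is only known at run time:

```c
#define NDIM 10

int mtab[NDIM];

/* Build the multiplier vector 0, 10, 20, ... using additions only. */
void build_table(void)
{
    int step = 0;
    for (int k = 0; k < NDIM; ++k) {
        mtab[k] = step;
        step += NDIM;
    }
}
```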
-- 
...!uunet!ingr!apd!smryan                                       Steven Ryan
...!{apple|pyramid}!garth!smryan              2400 Geng Road, Palo Alto, CA