gl8f@astsun7.astro.Virginia.EDU (Greg Lindahl) (11/29/90)
In article <BURLEY.90Nov28084920@pogo.ai.mit.edu> burley@pogo.ai.mit.edu (Craig Burley) writes:

>Hmm, could you explain to me how to strength-reduce multiplication to addition
>for random references to array elements within loops, as in

You can't.  But many references are regular.  So in this loop:

      do j = 1, 100
        do i = 1, 100
          bleah = a( i, j )
                     ^^^^ - all loop vars
        enddo
      enddo

you know that you can *always* strength-reduce the address calculation,
without any fancy analysis whatsoever.  Moreover, even with a 1-d array,
you can often easily see that an address is being incremented by 4 or 8
bytes, and avoid having to calculate an index and shift it left 2 or 3
bits.

A C compiler, on the other hand, would have to verify that the loop
variables weren't modified inside the loop.  Since you're working with a
sophisticated compiler, the last point isn't that important to you, but
if you are working on a simple compiler, simple optimization
possibilities are nice to have around.
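[To make the regular case concrete, here is a hand-written C sketch of
what the strength reduction amounts to; the function and array names
are illustrative, not from the post, and a real compiler would apply
this to generated code, not source.  The multiply-based subscript
becomes a single pointer bumped by one element per trip.]

```c
#include <assert.h>

#define N 100

/* Multiply-based addressing: each reference computes j*N + i. */
double sum_indexed(double a[N][N]) {
    double s = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            s += a[j][i];          /* address = base + (j*N + i)*8 bytes */
    return s;
}

/* Strength-reduced addressing: one pointer bumped per element. */
double sum_reduced(double a[N][N]) {
    double s = 0.0;
    double *p = &a[0][0];
    for (int k = 0; k < N * N; k++)
        s += *p++;                 /* no multiply, no shift: just adds */
    return s;
}
```

Both walk the array in the same storage order, so they compute the same
sum; only the address arithmetic differs.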
burley@pogo.ai.mit.edu (Craig Burley) (11/29/90)
In article <1990Nov28.213735.25563@murdoch.acc.Virginia.EDU> gl8f@astsun7.astro.Virginia.EDU (Greg Lindahl) writes:

   In article <BURLEY.90Nov28084920@pogo.ai.mit.edu> burley@pogo.ai.mit.edu (Craig Burley) writes:

   >Hmm, could you explain to me how to strength-reduce multiplication to addition
   >for random references to array elements within loops, as in

   You can't.  But many references are regular.  So in this loop:

         do j = 1, 100
           do i = 1, 100
             bleah = a( i, j )
                        ^^^^ - all loop vars
           enddo
         enddo

C'mon, guys, enough is enough: I said "for RANDOM REFERENCES to array
elements within loops" clearly enough!  We've beaten the non-random
reference issue to death.

I was responding (somewhat tongue in cheek, though hoping I'd actually
get an incredibly insightful answer) to someone who had said one can
strength-reduce the multiplies away in loops, in response to yet another
person who clearly had been talking about, yes, RANDOM references.

So, yes, I know (or am pretty sure) you can't.  But I asked the question
anyway, because I wanted to know the poster's answer, or at least
suggest, without being overly and publicly rude, that perhaps he
shouldn't have responded to a post about random references to arrays by
saying they can be strength-reduced.

A common example of random accesses is scatter/gather operations.  The
array-element access computation in the following function in C:

	void foo(a,b,c)
	float a[100][100];
	int b[100];
	int c[100];
	{
	  int i, j;

	  for (i = 0; i < 100; ++i)
	    for (j = 0; j < 100; ++j)
	      a[b[i]][c[j]] = 3.1415*(i+j);
	}

can and must be implemented using pointer indirection, not
multiplication.  The same computation in this similar function in
Fortran:

	SUBROUTINE FOO(A,B,C)
	REAL A(100,100)
	INTEGER B(100),C(100),I,J
C
	DO I=1,100
	  DO J=1,100
	    A(C(J),B(I)) = 3.1415*(I+J)
	  END DO
	END DO
C
	END

can and must be implemented using multiplication, not pointer
indirection.
For many applications, the choice is Fortran, not due to optimization
but to the requirement that the caller of FOO be able to pass arrays of
any shape to FOO, as in:

	REAL ARRAY(10,10,10,10)
	INTEGER INDX1(10,10),INDX2(5,2,5,2)
	CALL FOO(ARRAY,INDX1,INDX2)

The corresponding call in C:

	float array[10][10][10][10];
	int indx1[10][10];
	int indx2[5][2][5][2];

	foo(array,indx1,indx2);

would not work.

Further, with regard to optimization, it is likely true that on many
computers (especially older ones) references to memory (let's assume no
cache misses) are faster than integer multiplies.  However, it's my
guess that on many machines where people are actually coding fast
numerical/scientific computing problems, integer (and even
floating-point) multiplies are as fast or faster, or are done in a
parallel fashion and end up being preferable in cases where memory is a
bottleneck but the multiplier is available.
--
James Craig Burley, Software Craftsperson    burley@ai.mit.edu
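[The usual C workaround for the shape problem above is to mimic
Fortran's flat convention by hand; a sketch, with names of my own
invention and the leading dimension made an explicit argument:]

```c
#include <assert.h>

/* A C analogue of Fortran's flat argument passing: FOO receives a bare
   float* plus the leading dimension lda, and computes the subscript
   A(i,j) = a[(j-1)*lda + (i-1)] itself (column-major, 1-based), so
   callers may pass storage declared with any shape. */
void foo_flat(float *a, int lda, int n, int m) {
    for (int j = 1; j <= m; j++)
        for (int i = 1; i <= n; i++)
            a[(j - 1) * lda + (i - 1)] = (float)(i + j);
}
```

The price is that the subscript arithmetic is now hand-coded rather than
expressed to the compiler as a multi-dimensional access.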
burley@pogo.ai.mit.edu (Craig Burley) (11/29/90)
Woops, among any errors anyone else finds, the Fortran version of the
subroutine FOO should have calculated 3.1415*(I+J-2), not 3.1415*(I+J),
since the C version iterated i and j from 0 thru 99 while the Fortran
version iterated from 1 thru 100.  Sorry.
--
James Craig Burley, Software Craftsperson    burley@ai.mit.edu
gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) (11/29/90)
In article <BURLEY.90Nov28205003@pogo.ai.mit.edu> burley@pogo.ai.mit.edu (Craig Burley) writes:

> I was responding (somewhat tongue in cheek, though hoping I'd
>actually get an incredibly insightful answer) to someone who had said one can
>strength-reduce the multiplies away in loops in response to yet another person
>who clearly had been talking about, yes, RANDOM references.

I was not claiming that I could strength-reduce random references.  I
was claiming that most references are in fact not random.

>So, yes, I know (or am pretty sure) you can't.  But I asked the
>question anyway, because I wanted to know the poster's answer, or at
>least suggest without being overly and publicly rude that perhaps he
>shouldn't have responded to a post about random references to arrays
>saying they can be strength-reduced.

Well, I think you've managed to be publicly rude ;-)

I responded because the original poster, Dan, made a statement comparing
flat arrays and double-pointer arrays that I found untrue.  Whether or
not he's talking about random references, I find that strength reduction
can reduce the cost of flat arrays IN MY CODE to less than that of
double-pointer arrays.  Why?  Because my array accesses are regular.
And that's what my posting was about.  If you want to talk about random
accesses, well, it's a fascinating subject, but that doesn't preclude
discussion of flat arrays.

>A common example of random accesses is scatter/gather operations.  The
>array-element access computation in the following function in C:
>
>	void foo(a,b,c)
>	float a[100][100];
>	int b[100];
>	int c[100];
>	{
>	  int i, j;
>
>	  for (i = 0; i < 100; ++i)
>	    for (j = 0; j < 100; ++j)
>	      a[b[i]][c[j]] = 3.1415*(i+j);
>	}

This can be strength-reduced, although I would be hard pressed to see
how the compiler could do it automagically.  If, in Fortran, you do the
gather with each column getting one outer loop iteration, strength
reduction is possible.
It'd look something like this:

	DO I = 1, NUMCOLS
	  ICOL = B(I)
	  DO J = 1, NUMROWS(I)
	    A( C(I,J), ICOL ) = 3.1415*(whatever)
	  ENDDO
	ENDDO

Since ICOL is constant across the inner loop, it can be
strength-reduced.  I hope I haven't switched rows and columns; I can
never remember which one's which by name, although I know which index
varies fastest :-)

The major problem with implementing this is that you have a different
number of row elements to be gathered in each column, so using a flat
array C() wastes space if you don't have a good idea of the upper bound.
In cases where you are merely permuting things, this isn't a problem.
In other cases, it might be a big problem.

If you are using rectangular arrays, a smart FORTRAN compiler could get
the best of both worlds by using flat arrays and making a work array of
pointers to rows, which it could use for any address calculation which
couldn't be strength-reduced.

>For many applications, the choice is Fortran, not due to optimization
>but to the requirement that the caller of FOO be able to pass any
>shape arrays to FOO, as in:
>
>	REAL ARRAY(10,10,10,10),INDX1(10,10),INDX2(5,2,5,2)
>	CALL FOO(ARRAY,INDX1,INDX2)

But you can use flat arrays in C.  All it takes is a define... and then
your C compiler can often strength-reduce regular accesses.
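[Greg's closing point, sketched in C; the macro, sizes, and function
name are mine, purely illustrative.  With a flat array behind a
subscripting define, a regular access is a plain linear expression the
compiler can strength-reduce just as it would a Fortran subscript.]

```c
#include <assert.h>

/* Flat 2-D array behind a define: A(i,j) expands to one linear
   subscript, strength-reducible when i and j are regular loop vars. */
#define NROWS 10
#define NCOLS 10
#define A(i, j) a[(i) * NCOLS + (j)]

void fill(float *a) {
    for (int i = 0; i < NROWS; i++)
        for (int j = 0; j < NCOLS; j++)
            A(i, j) = (float)(i * NCOLS + j);  /* element gets its flat index */
}
```

Any shape of storage with NROWS*NCOLS floats can be passed, just as
with a Fortran dummy array.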
burley@pogo.ai.mit.edu (Craig Burley) (11/29/90)
In article <1990Nov29.041958.2254@murdoch.acc.Virginia.EDU> gl8f@astsun.astro.Virginia.EDU (Greg Lindahl) writes:

   In article <BURLEY.90Nov28205003@pogo.ai.mit.edu> burley@pogo.ai.mit.edu (Craig Burley) writes:

   > I was responding (somewhat tongue in cheek, though hoping I'd
   >actually get an incredibly insightful answer) to someone who had said one can
   >strength-reduce the multiplies away in loops in response to yet another person
   >who clearly had been talking about, yes, RANDOM references.

   I was not claiming that I could strength-reduce random references.  I
   was claiming that most references were in fact not random.

   >So, yes, I know (or am pretty sure) you can't.  But I asked the
   >question anyway, because I wanted to know the poster's answer, or at
   >least suggest without being overly and publicly rude that perhaps he
   >shouldn't have responded to a post about random references to arrays
   >saying they can be strength-reduced.

   Well, I think you've managed to be publicly rude ;-)

I see the smiley, but I do want to clarify: in no way did I consider
Greg's posting, to which I originally responded, silly or anything.  It
just seemed to be a case of losing track of how the discussion
progressed, as most of us do (ok, I know I do it).  I did get a little
annoyed at the private and public responses to my post talking about
non-random references when clearly the particular subtopic had become
random references as a subclass of array references in general:

"You know, mammals are ..." (big discussion); "Well, that's usually the
case, but if you have to deal with horses, they are an exception,
because they are..."; "Yes, but most mammals aren't horses, so it
really isn't important!"  HUH?  Sigh.
We need an object-oriented newsgroup system (inheritance in particular)
because it is really hard to keep track of all these threads of
discussion, especially as topics are refined into subtopics without it
being clear in postings about subtopics that they are intentionally
being discussed as subtopics rather than supertopics.  Who wants to
trace back the discussion of every thread before replying?  Oh well,
dream on.

   >A common example of random accesses is scatter/gather operations.  The
   >array-element access computation in the following function in C:
   >
   >	void foo(a,b,c)
   >	float a[100][100];
   >	int b[100];
   >	int c[100];
   >	{
   >	  int i, j;
   >
   >	  for (i = 0; i < 100; ++i)
   >	    for (j = 0; j < 100; ++j)
   >	      a[b[i]][c[j]] = 3.1415*(i+j);
   >	}

   This can be strength-reduced, although I would be hard pressed to see
   how the compiler could do it automagically.  If, in Fortran, you do
   the gather with each column getting one outer loop iteration,
   strength reduction is possible.  It'd look something like this:

	 DO I = 1, NUMCOLS
	   ICOL = B(I)
	   DO J = 1, NUMROWS(I)
	     A( C(I,J), ICOL ) = 3.1415*(whatever)
	   ENDDO
	 ENDDO

   Since ICOL is constant across the inner loop, it can be
   strength-reduced.  I hope I haven't switched rows and columns; I can
   never remember which one's which by name, although I know which index
   varies fastest :-)

Hmm, first off, I should have specified b[(i+j)/2] in my sample program
just to make things sufficiently "random" (and done the equivalent for
the Fortran procedure).

Next, I should have pointed out that while references to <a> were
random, the references to <b> and <c> weren't, so this particular
example (some kind of scatter/gather thing; I don't write this kind of
code, so I'm no expert on what it's used for) still contains array
references that are within the scope of discussion of C's pointer
accesses vs.
Fortran's array accesses (though, as has been discussed, with
one-dimensional arrays and optimizing compilers, there ain't much
overall difference across all computing systems used today).

Third, I think Greg's new version of the Fortran has "C(I,J)" where it
should have "C(J)" -- but I may just be confused.

Fourth, I break out in cold sweats whenever I think about writing
anything down in Fortran vs. other languages regarding multi-dimensional
arrays, due to column-major vs. row-major ordering.  But when it
actually comes down to it, all I have to do is think about it in my
favorite way (which I think is called row-major), and then reverse the
dimensions when writing Fortran, and it works.  I hope.  Greg's example,
aside from the third nit above, looks fine to me.

Finally, this whole thread has been fascinating and highly informative
for two reasons (yes, here's another numbered list -- that's what you
get for reading the ramblings of a one-time tech writer):

One, I'm saving some of the juicy stuff for review when GNU Fortran is
working, to serve as optimizations to check for and perhaps implement
as time becomes available.

Two, it makes me think about how both C and Fortran have fundamental
omissions/commissions regarding arrays.  C doesn't let you work with
multi-dimensional arrays directly; you must express them using arrays of
arrays (which is a different means of expression, even though it may or
may not be more efficient for a specific problem) or by writing your own
subscripting expressions for a single monolithic array (which allows the
auto-reshaping-across-procedure-boundary often called for by Fortran
programmers).
Fortran doesn't let you say that when you are expecting an argument A to
a procedure and say DIMENSION A(10,10), you really know that the array
passed will itself be dimensioned that way, and that when passing an
array to a procedure, it'll expect any optimization info on its known
shape to be passed.  The result is that, unless inter-procedural
optimization is done, arrays must be passed as flat and must be received
as flat, and pointer maps to speed up multi-dimensional array accesses
can be used only entirely within the given procedure.

I don't know how easy or reasonable it would be to add multi-dimensional
arrays to C (I can see the newsgroup discussions now: "C is just as good
as Fortran for scientific computing"; "No it isn't, C doesn't have true
multi-dimensional arrays"; "Well, you can accomplish it using other
means"; "That can be inefficient"; "Well, it could be added to C fairly
easily"; "Ok, let's add it"; "Hey!  Why are you making my favorite
language bigger?  I don't need this feature!"; "Sure you do, it's great
for scientific computing"; "But I don't use C for scientific
computing!").

But with Fortran there could be a means for declaring the interface to a
procedure as requiring an array of a particular shape, and the
definition of that procedure would codify that expectation.  All callers
of the procedure would then not only know that they had to pass an array
of that shape, but also that they could pass any pointer maps that might
be useful to access elements of the array optimally on a given
architecture, and the procedure could assume the existence of those
pointer maps.  Gee, maybe that's what some of that stuff in the Fortran
90 spec was trying to do; I'll have to look at it again to see if it
makes more sense to me in the light of this idea for a feature.  (-:

Ultimately I always believe the better language is one in which you can
state at the highest level what you want to accomplish.
C's requirement for either arrays of arrays or hand-coded subscript
calculation (and I don't consider macros or inline functions to affect
this line of reasoning) when you really want to simply specify a
multi-dimensional array means it is not as good as Fortran in this
respect.  Fortran's self-imposed requirement that it be able to reshape
arrays across procedure calls in all cases, by virtue of not allowing
you to say "I know this array is indeed this shape in both the called
procedure and in all its calling procedures", means it is not as good as
C if you use C's array-of-arrays approach (where it is a given that the
actual and formal arrays are the same shape, sort of).

When you are able to say what you mean in a language, then it is better
for maintainability, retargetability (in terms of other languages),
portability (other machines), and optimization, or at least as good.  An
optimizing compiler might well be able to optimize multi-dimensional
array accesses coded that way, but not coded as low-level expressions a
la the approach one would use in C when representing a conceptual
multi-dimensional array as a single-dimensioned array in the code.  Some
compilers might be able to optimize the low-level expressions as well by
recognizing patterns.  These don't have a real hard time with the
higher-level constructs; for compilation speed, they can optimize them
directly, or for internal elegance and verifiability they can "lower"
them to the equivalent low-level expressions and let the powerful
optimization at that level take care of the job.

An example of this kind of lowering, letting powerful low-level
optimization do good things, is taking:

	DO 10 I=1,100
	...
10	CONTINUE

and turning it into:

	I = 1			(code-visible iterator)
	TEMP1 = (100-I+1)/1	(# of trips)
	TEMP2 = 1		(increment value for iterator)
T1	IF (TEMP1.LE.0) GOTO T2	(any trips left?)
	...
10	CONTINUE
	I = I + TEMP2		(bump iterator for program)
	TEMP1 = TEMP1 - 1	(countdown trip)
	GO TO T1		(next iteration)
T2	...
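[The same lowering, hand-written in C for concreteness -- this is my
own sketch, not actual compiler output.  The trip count is computed
once up front and counted down to zero, so the exit test never depends
on I.]

```c
#include <assert.h>

/* DO-loop lowering: trip count computed once, counted down to zero. */
int lowered_sum(int lo, int hi, int step) {
    int i = lo;                          /* code-visible iterator     */
    int trips = (hi - i + step) / step;  /* # of trips, computed once */
    int sum = 0;
    while (trips > 0) {                  /* T1: any trips left?       */
        sum += i;                        /* ... the loop body ...     */
        i += step;                       /* bump iterator for program */
        trips -= 1;                      /* countdown trip            */
    }                                    /* T2                        */
    return sum;
}
```

Note that the body may read and even clobber i without affecting the
number of trips -- exactly the property the flow analyzer relies on.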
Then a general flow analyzer detects that T1:T2 is a loop and determines
whether it fits one of various models of loops with corresponding
optimized implementations.  This way, not only is the original DO loop
optimized via this lowering, but code written like the lowered version
is also transformed.  Do people really write code like the lowered
version?  Apparently, when you do things like dead-code elimination,
flow reordering, and other things, yes, you can end up with things that
really are loops but didn't look that way at first.

Useless-variable elimination via def-use chains, along with
array-element access elimination, not only eliminates some or all of the
temporary variables created (either by replacing their values with
machine-instruction equivalents, like replacing I = I + TEMP2 with the
machine's instruction for C's ++I, or by at least overlaying the temp
with other temps in separate loops) but may eliminate, for example, I as
well (if the array references involving I can be strength-reduced into,
say, pointers incremented each time the loop iterates, as one would do
in C).  And some machines loop faster when the loop ends on a value
reaching 0 than on one reaching some other number, which is why TEMP1 is
set up to do that by calculating the number of trips.

(FYI: It is not valid standard Fortran for the program to change the
value of a DO-loop iterator while its DO loop is still active, so the
above transformation safely ignores the possibility of a change to I
during the loop.  This is another reason Fortran code can sometimes be
faster than "equivalent" C code; and, of course, if the programmer isn't
aware of the restriction, violates it by modifying I during the loop and
expecting the number of trips to be adjusted accordingly, and doesn't
get a compile-time message, the result is buggy code that will work on
some systems and not others -- hey, I'm gonna go add just that kind of
check to GNU Fortran right now!)
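[A hand-written C illustration of those two points -- my names, and a
compiler would perform this rewrite itself: in the second version the
index variable I is gone, its array references strength-reduced into
pointers bumped each trip, and the loop ends on a counter reaching
zero.]

```c
#include <assert.h>

/* Indexed form: the index is scaled into an address on every trip. */
long dot_indexed(const int *a, const int *b, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += (long)a[i] * b[i];
    return s;
}

/* Reduced form: no index variable at all -- pointers are bumped and
   the trip counter counts down to zero. */
long dot_reduced(const int *a, const int *b, int n) {
    long s = 0;
    while (n-- > 0)
        s += (long)*a++ * *b++;
    return s;
}
```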
The same can be done for array references.  Change:

	A(I,J)

to:

	A[(I-lbound(A,1)) + ((J-lbound(A,2))*extent(A,1))]

where lbound(x,y) is the lower bound of array x in dimension y, hbound
(not mentioned above) is the same for the upper bound, and extent(x,y)
is hbound(x,y)-lbound(x,y)+1.  Then things like strength reduction and
loop induction are still doable, but might well apply to more cases than
if they were restricted to explicit multi-dimensional array accesses.

It is possible for an optimizer to strength-reduce too much, i.e. to the
point where the code ends up less optimal than it might otherwise have
been, on a machine where resource and instruction scheduling is
important.  It might be that the resulting intermediate code requires
too many operations of the adder unit and yet leaves the multiplier unit
twiddling its thumbs much of the time.

To keep this from happening, one could use a backtracking mechanism
directed to guess at a best goal for some code and repeatedly try
various optimizations and schedule the results until that goal is
achieved, keeping track of the best achievement in the meantime if it
isn't -- this'll handle the problem by not keeping the completely
strength-reduced version as the best achieved.  But I lean towards going
ahead and doing the strength reduction, then providing information for
the instruction and resource scheduler so that it can selectively move
candidate operations off, in this case, a busy adder unit and onto an
idle multiplier unit, and automatically generate any support code that
kind of adjustment requires (and make sure things don't get worse in the
process), because it's the scheduler that knows just where such movement
is needed, not the code that decides whether to strength-reduce or
"induct" an expression.  But the whole proposition is RISCy...  :-)
--
James Craig Burley, Software Craftsperson    burley@ai.mit.edu
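[For concreteness, the rewritten subscript can be packaged as a tiny C
helper -- the function name is mine, with lbound and extent passed in
as arguments in the notation above:]

```c
#include <assert.h>

/* A(I,J) -> A[(I - lbound(A,1)) + ((J - lbound(A,2)) * extent(A,1))] */
int linear_index(int i, int lb1, int extent1, int j, int lb2) {
    return (i - lb1) + (j - lb2) * extent1;
}
```

For a REAL A(10,10), with both lower bounds 1 and extent(A,1) = 10,
A(1,1) maps to flat element 0 and A(1,2) to flat element 10, matching
Fortran's column-major storage.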
3003jalp@ucsbuxa.ucsb.edu (Applied Magnetics) (11/30/90)
In article <BURLEY.90Nov29103957@pogo.ai.mit.edu> burley@pogo.ai.mit.edu (Craig Burley) writes:

>[Fascinating discussion of compiler internals]

Is it standard practice to translate the high-level language to
low-level constructs and _then_ optimize?  Isn't that a case of
reverse-engineering?  (Then again, vectorizing Fortrans do just that:
reverse-engineer DO loops into vector operations.)

	--P. Asselin, R&D, Applied Magnetics Corp.  I speak for me.
smryan@garth.UUCP (Steven Ryan) (12/04/90)
>... I was responding (somewhat tongue in cheek, though hoping I'd
>actually get an incredibly insightful answer) to someone who had said one can
>strength-reduce the multiplies away in loops in response to yet another person
>who clearly had been talking about, yes, RANDOM references.

Given something like

	real a(0:9,0:9),b(0:9,0:9)
	integer e,f,g,h
	do j=0,9
	  do i=0,9
	    a(e(i),f(j)) = b(g(i),h(j))
	  enddo
	enddo
	end

you can get rid of the multiplication by using

	real a(0:99),b(0:99)
	integer e,f,g,h
	integer m(0:9)
	data m/0,10,20,30,40,50,60,70,80,90/
	do j=0,9
	  do i=0,9
	    a(m(e(i))+f(j)) = b(m(g(i))+h(j))
	  enddo
	enddo
	end

which the compiler can do for you.  Even when the multiplier is unknown,
as with adjustable arrays, the multiplier vector can be filled in with a
loop using only adds.
--
Steven Ryan	...!uunet!ingr!apd!smryan
		...!{apple|pyramid}!garth!smryan
2400 Geng Road, Palo Alto, CA
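[That last step -- filling the multiplier vector with only adds -- might
look like this in C; the fixed size and names are mine.  Once the table
exists, each 2-d access costs an add and an extra indirection instead of
a multiply.]

```c
#include <assert.h>

#define NROWS 10

/* Build m[k] = k * NROWS using a running sum: no multiply is issued,
   which also works when the stride is only known at run time, as with
   adjustable arrays. */
void build_table(int m[NROWS]) {
    int acc = 0;
    for (int k = 0; k < NROWS; k++) {
        m[k] = acc;      /* m[k] == k * NROWS, computed by addition */
        acc += NROWS;
    }
}
```

With the table in hand, the subscript m(e(i))+f(j) in the rewritten loop
above is just a[m[e] + f].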