mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (12/07/89)
In article <40827@lll-winken.LLNL.GOV> jac@muslix.UUCP (James Crotinger) writes:

>  However, I also have other concerns, which are generic to languages
>that support vector data types (a la CFT77 and Fortran 8x). Suppose
>I have a vector type and the following code:
>
>    vector A, B, C, D, E
>    E = A + B*C   // meaning elementwise multiplication
>    D = A - B*C
>
>This is the style of programming that the vector syntax promotes.

This is also the style of programming that is appropriate to
memory-to-memory vector machines (the Cyber 205 and ETA-10) and, more
importantly, to SIMD parallel machines like the Connection Machine.
The code above runs at the same speed on the ETA-10 (for example)
whether B*C is pre-calculated or not, since the extra multiply can be
completely overlapped with the subtract in the second line.

Of course, coding for the ETA-10 is not an interesting issue for most
of us these days, but I consider it very important to maintain a
reasonable level of source-code compatibility between codes for the
Connection Machine and the Cray Y-MP, and for other machines that have
insufficient memory bandwidth to run vector operations in the
streaming mode shown above. These machines include the Cray-2, Convex,
IBM 3090, most (all?) Japanese supercomputers, and the Ardent Titan,
as well as most machines on the drawing boards (names withheld, since
I am under non-disclosure on several of these). Even the Cray X-MP and
Y-MP benefit from reducing the memory traffic, since this minimizes
the bank conflicts suffered in a multi-processing environment.

>My question is, how smart will the compilers get? Will compilers
>evaluate the common subexpression (B*C) once or twice?

I don't know of *any* vectorizer/optimizer which will do this sort of
optimization on vector quantities. Anyone from Cray care to comment on
the current status of the Cray compiler on this code?
It is *very* important that this capability be developed, since more
and more machines are going to be memory-bandwidth-deficient in the
next few years.

>With the cfront model, the B*C stuff will end up in separate loops,
>and it is highly unlikely that the compiler's subexpression analyzer
>will pick it up. I think what it boils down to is this: will the
>compilers be able to do "loop jamming" on the loops that are implied
>by the vector syntax? Even in Fortran, if you coded:
>
>      do i = 1, n
>        E(i) = A(i) + B(i) * C(i)
>      end do
>      do i = 1, n
>        D(i) = A(i) - B(i) * C(i)
>      end do
>
>the optimizer would not eliminate the common subexpression. But in
>Fortran you'd never do this (well, I'd never). The loops would be
>"jammed" together:
>
>      do i = 1, n
>        E(i) = A(i) + B(i) * C(i)
>        D(i) = A(i) - B(i) * C(i)
>      end do

And converting array notation into this combined form requires a
significant data-dependency analysis.... I think that part of the
problem is that vectorizers have been developed as stand-alone
source-to-source translators, and the optimization aspect of this
translation has been pretty minimal to date.

>Now the optimizer will optimize the heck out of this. Not only will
>the B*C product be evaluated only once, but most likely A, B, and C
>will only be fetched once. This latter point is not insignificant on
>a vector machine. Thus even if you write:
>
>    vector TMP = B * C
>    E = A + TMP
>    D = A - TMP
>
>if the underlying C compiler (or Fortran compiler) can't figure out
>how to "jam" the loops itself, this will still be less efficient
>than hand-coding the loop (and it takes more memory).

Yep....
--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu
                   mccalpin@scri1.scri.fsu.edu
                   mccalpin@delocn.udel.edu
jac@muslix.llnl.gov (James Crotinger) (12/08/89)
I'd also like to see some discussion about using C++ (or other
languages with vector syntax) on vectorizing machines. We're going to
be getting C++ working here soon as well, and I'm wondering if it'll
be possible to write efficient vectorizing C++ code. I can foresee
some problems. First of all, the C code cfront produces when inlining
is often quite complex-looking. I'm genuinely worried that the C
compilers we are currently using will not be able to vectorize the
resulting code. If this is the case, then we lose big.

However, I also have other concerns, which are generic to languages
that support vector data types (a la CFT77 and Fortran 8x). Suppose I
have a vector type and the following code:

    vector A, B, C, D, E
    E = A + B*C   // meaning elementwise multiplication
    D = A - B*C

This is the style of programming that the vector syntax promotes. My
question is, how smart will the compilers get? Will compilers evaluate
the common subexpression (B*C) once or twice? With the cfront model,
the B*C stuff will end up in separate loops, and it is highly unlikely
that the compiler's subexpression analyzer will pick it up. I think
what it boils down to is this: will the compilers be able to do "loop
jamming" on the loops that are implied by the vector syntax? Even in
Fortran, if you coded:

      do i = 1, n
        E(i) = A(i) + B(i) * C(i)
      end do
      do i = 1, n
        D(i) = A(i) - B(i) * C(i)
      end do

the optimizer would not eliminate the common subexpression. But in
Fortran you'd never do this (well, I'd never). The loops would be
"jammed" together:

      do i = 1, n
        E(i) = A(i) + B(i) * C(i)
        D(i) = A(i) - B(i) * C(i)
      end do

Now the optimizer will optimize the heck out of this. Not only will
the B*C product be evaluated only once, but most likely A, B, and C
will only be fetched once. This latter point is not insignificant on a
vector machine.
Thus even if you write:

    vector TMP = B * C
    E = A + TMP
    D = A - TMP

if the underlying C compiler (or Fortran compiler) can't figure out
how to "jam" the loops itself, this will still be less efficient than
hand-coding the loop (and it takes more memory).

Now then, you can't jam loops that are buried in separate subroutine
calls. This is an instance where inlining is clearly necessary for
full optimization. The common subexpression can't even be eliminated
unless you could convince the C compiler that the call to
operator*(B,C) has no side effects! But I don't think cfront will
inline this, since it has a loop in it. Is this true?

In spite of all the recent cautions on inlining, this is a case that
is pretty clear to me. If you'd otherwise code it out in longhand,
then you've got nothing to lose from making it an inline function, and
much to gain. Certainly, it is much preferable to a macro. So, can
cfront inline this? Can it be fixed to do so? And if so, can compilers
be written to fully optimize this code by doing loop jamming?

It almost seems like this type of optimization would be easier to spot
at the level of the C code, or the C++ code, rather than after the
code has been translated into some lower-level intermediate language,
where most optimizers do their work.

  Jim
jac@muslix.llnl.gov (James Crotinger) (12/09/89)
In article <MCCALPIN.89Dec7105739@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:

>This is also the style of programming that is appropriate to
>memory-to-memory vector machines (Cyber 205 and ETA-10), and (more
>importantly) for SIMD parallel machines like the Connection Machine.
>The code above runs at the same speed on the ETA-10 (for example)
>whether B*C is pre-calculated or not, since the extra multiply can be
>completely overlapped with the subtract in the second line.

I guess I find that a bit hard to buy. The X-MP also does chaining,
but the above example runs 33% slower on our X-MP when written using
Cray Fortran's vector notation than it does when the loop is written
out explicitly, with both loops jammed into one. Interestingly, on the
Cray-2, which does no chaining, the vector-style version is only 20%
slower. I suspect that the memory bandwidth is what's really the
killer here. In the version which is written out as one jammed loop,
the Cray should do the following:

    loop: load 64 elements of A
          load 64 elements of B
          load 64 elements of C
          calculate E = A + B*C   (for 64 elements)
          calculate D = A - B*C   (ditto)
          store E                 (ditto)
          store D                 (ditto)
          goto loop               (with appropriate logic to end the loop)

The savings from not having to go out to memory to get A, B, and C
twice are not small at all. Furthermore, on the Cray-2, stuff like
storing E can be overlapped with the calculation of D...

>>My question is, how smart will the compilers get? Will compilers
>>evaluate the common subexpression (B*C) once or twice?
>
>I don't know of *any* vectorizer/optimizer which will do this sort of
>optimization on vector quantities. Anyone from Cray care to comment on
>the current status of the Cray compiler on this code?
>
>It is *very* important that this capability be developed, since more
>and more machines are going to be memory-bandwidth-deficient in the
>next few years.

Exactly.

>John D. McCalpin - mccalpin@masig1.ocean.fsu.edu
>                   mccalpin@scri1.scri.fsu.edu
>                   mccalpin@delocn.udel.edu

  Jim
sacilott@compass.com (Roger Sacilotto) (12/16/89)
On 7 Dec 89 15:57:39 GMT, mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) said:

> In article <40827@lll-winken.LLNL.GOV> jac@muslix.UUCP (James
> Crotinger) writes:
>> However, I also have other concerns, which are generic to languages
>>that support vector data types (a la CFT77 and Fortran 8x). Suppose
>>I have a vector type and the following code:
>>
>>    vector A, B, C, D, E
>>    E = A + B*C   // meaning elementwise multiplication
>>    D = A - B*C
>>
>>This is the style of programming that the vector syntax promotes.

> This is also the style of programming that is appropriate to
> memory-to-memory vector machines (Cyber 205 and ETA-10), and (more
> importantly) for SIMD parallel machines like the Connection
> Machine. [etc...]

>>My question is, how smart will the compilers get? Will compilers
>>evaluate the common subexpression (B*C) once or twice?

> I don't know of *any* vectorizer/optimizer which will do this sort
> of optimization on vector quantities. Anyone from Cray care to
> comment on the current status of the Cray compiler on this code?

Some compilers I have contributed to here at COMPASS include
vectorizer/optimizer technology which performs these types of
optimization on both source-level vector operations and vectorized
scalar source code. This technology has been applied in various
combinations to all COMPASS compilers, including the Fortran compiler
developed for the Suprenum supercomputer consortium in Germany, the
Connection Machine Fortran compiler developed for Thinking Machines,
Inc., and the MasPar Fortran compiler being developed for MasPar
Computer Corporation.

> It is *very* important that this capability be developed, since
> more and more machines are going to be memory-bandwidth-deficient
> in the next few years.

>>With the cfront model, the B*C stuff will end up in separate loops,
>>and it is highly unlikely that the compiler's subexpression analyzer
>>will pick it up. I think what it boils down to is this: will the
>>compilers be able to do "loop jamming" on the loops that are implied
>>by the vector syntax? Even in Fortran, if you coded:
>>
>>      do i = 1, n
>>        E(i) = A(i) + B(i) * C(i)
>>      end do
>>      do i = 1, n
>>        D(i) = A(i) - B(i) * C(i)
>>      end do
>>
>>the optimizer would not eliminate the common subexpression. But in
>>Fortran you'd never do this (well, I'd never). The loops would be
>>"jammed" together:
>>
>>      do i = 1, n
>>        E(i) = A(i) + B(i) * C(i)
>>        D(i) = A(i) - B(i) * C(i)
>>      end do

> And converting array notation into this combined form requires a
> significant data-dependency analysis.... I think that part of the
> problem is that vectorizers have been developed as stand-alone
> source-to-source translators, and the optimization aspect of this
> translation has been pretty minimal to date.

Our vectorizer is integrated into the middle of our compiler, after
semantic analysis and before strip mining (or sectioning). This
architecture allows the vectorizer to be very knowledgeable about
optimization. In fact, our vectorizer has the opposite problem in
terms of source-to-source translation: since knowledge of the
optimizer is embedded in the vectorizer, unparsed output can be hard
to produce in readable form.

>>Now the optimizer will optimize the heck out of this. Not only will
>>the B*C product be evaluated only once, but most likely A, B, and C
>>will only be fetched once. This latter point is not insignificant on
>>a vector machine. Thus even if you write:
>>
>>    vector TMP = B * C
>>    E = A + TMP
>>    D = A - TMP
>>
>>if the underlying C compiler (or Fortran compiler) can't figure out
>>how to "jam" the loops itself, this will still be less efficient
>>than hand-coding the loop (and it takes more memory).

> yep....

Loop jamming (we call it loop fusion) and loop interchange are
essential to our technology. Our compilers can fuse and interchange
user-written loops, vectorizer-generated loops, and
strip-mine-generated loops when possible.
Another issue, mainly in massively parallel SIMD architectures, is
inter-processor communication costs. The example

    E = A + B * C
    D = A - B * C

can still be expensive if the associated elements of A, B, C, D, and E
don't all "live" in the same processor. Using the common subexpression
helps, but if the result has to be moved to three separate locations
(E, D, and A), communication costs could swamp the costs of the
computation and memory instructions. Our approach here is to notice
that occurrences of arrays may be "allocated" (distributed among the
processors) differently depending on usage. This example was taken
from an existing program:

    DO I = 1,IMAX
      DO J = 1,JMAX
        TEMP(:) = A(I,J,:)
        A(I,J,:) = B(I,J,:)
        B(I,J,:) = TEMP(:)
      ENDDO
    ENDDO

If allocation is done naively, then this example will result in
communication in both the def and the use of TEMP, or 2*IMAX*JMAX
communications. Analysis of usage can tell you that A and B can be
allocated identically ("aligned") in all dimensions, and that TEMP can
be aligned with the third dimension of A and B. This alignment will
eliminate the need for any communication (i.e., cost*2*IMAX*JMAX -> 0).

If multiple occurrences of the same array are allocated differently,
then communication is inserted into the IR at the points which
minimize the cost. Notice the allocation of A in this example:

    A = B * C        ! allocation of A is LOC1

                     ! insert communication here of
                     ! A from LOC1 -> LOC2
    DO I = 1,N
      D = A + F(I)   ! allocation of A is LOC2
      etc...
    ENDDO

If the communication is "on demand" inside the loop, then there are N
communications of A, rather than 1 if done before the loop.

For more on data optimization, see:

  Knobe, Kathleen, Lukas, Joan D., and Steele, Guy L. Jr. "Massively
  Parallel Data Optimization." Proceedings of the 2nd Symposium on the
  Frontiers of Massively Parallel Computation (Oct 1988), 551-558.

  Knobe, Kathleen, Lukas, Joan D., and Steele, Guy L. Jr. "Data
  Optimization: Allocation of Arrays to Reduce Communication on SIMD
  Machines."
  Technical Report CA-8902-2801, Compass, Inc., Wakefield, MA. (To
  appear in the Feb 1990 issue of the Journal of Parallel and
  Distributed Computing.)

Roger Sacilotto, Compass, Inc.
550 Edgewater Dr., Wakefield, MA 01880, USA
think!compass!sacilott
(617) 245-9540