psmith@convex.com (Presley Smith) (01/04/91)
Thought this might be interesting to readers of comp.lang.fortran.  It
quantifies the potential performance decrease that one may experience when
converting from DO loop notation of FORTRAN 77 to the array notation
available on the Cray.  This is the same array notation that is specified
in Fortran 90.  Array notation provides a simplified way to specify array
operations in many cases, but may, in fact, decrease the performance of
the program.
------------------------------------------------------------------------
Path: convex!texsun!sundc!seismo!dimacs.rutgers.edu!mips!news.cs.indiana.edu!cica!tut.cis.ohio-state.edu!att!emory!hubcap!rcarter
From: rcarter@nas.nasa.gov (Russell L. Carter)
Newsgroups: comp.parallel
Subject: Re: CM Fortran (actually, CRI array syntax performance)
Message-ID: <12444@hubcap.clemson.edu>
Date: 2 Jan 91 18:06:35 GMT
References: <12301@hubcap.clemson.edu> <12324@hubcap.clemson.edu> <12381@hubcap.clemson.edu> <12398@hubcap.clemson.edu> <12410@hubcap.clemson.edu>
Sender: fpst@hubcap.clemson.edu
Reply-To: rcarter@wilbur.nas.nasa.gov (Russell L. Carter)
Organization: NAS Program, NASA Ames Research Center, Moffett Field, CA
Lines: 67
Approved: parallel@hubcap.clemson.edu

In article <12410@hubcap.clemson.edu> john%ghostwheel.unm.edu@ariel.unm.edu
(John Prentice) writes:
>In article <12398@hubcap.clemson.edu> serafini@amelia.nas.nasa.gov
>(David B. Serafini) writes:
>>
>>CFT77 has had a subset of the Fortran 90 array syntax for quite a while now,
>>at least a couple of years.  It hasn't caught on much (to my knowledge),
>>probably because of the portability problems it creates.  The Cray Fortran
>>preprocessor, fpp, can now put out do-loop based code when given array code,
>>so portability is less of a concern now.
>>
>This is rather interesting.  Does fpp generate a new do-loop for every
>array construct or is it smart enough to combine operations the way
>you would if you were writing in Fortran 77?  What is the effect on
>the optimizer of all this?
>
>John K. Prentice
>john@unmfys.unm.edu

Well, let's look at some data.  I converted the NAS Kernels, a popular
CFD benchmark (at least here at NAS), to array syntax, and ran it on our
YMP.  Here is what I get:

Using array syntax:

                 THE NAS KERNEL BENCHMARK PROGRAM

  PROGRAM     ERROR        FP OPS     SECONDS   MFLOPS
   MXM      1.8085E-13   4.1943E+08    1.5879   264.15
   CFFT2D   3.2001E-12   4.9807E+08   11.1267    44.76
   CHOLSKY  1.8256E-10   2.2103E+08    4.9911    44.29
   BTRIX    6.0622E-12   3.2197E+08    4.5159    71.30
   GMTRY    1.0082E+00   2.2650E+08    3.5258    64.24
   EMIT     1.5609E-13   2.2604E+08    1.3055   173.15
   VPENTA   2.3541E-13   2.5943E+08    7.1281    36.40
   TOTAL    1.0082E+00   2.1725E+09   34.1810    63.56

And using plain vanilla Fortran 77:

                 THE NAS KERNEL BENCHMARK PROGRAM

  PROGRAM     ERROR        FP OPS     SECONDS   MFLOPS
   MXM      1.8085E-13   4.1943E+08    1.5705   267.06
   CFFT2D   3.2001E-12   4.9807E+08    7.0951    70.20
   CHOLSKY  1.8256E-10   2.2103E+08    2.6393    83.75
   BTRIX    6.0622E-12   3.2197E+08    2.3717   135.76
   GMTRY    6.5609E-13   2.2650E+08    2.0910   108.32
   EMIT     1.5609E-13   2.2604E+08    1.2987   174.05
   VPENTA   2.3541E-13   2.5943E+08    4.7900    54.16
   TOTAL    1.9305E-10   2.1725E+09   21.8563    99.40

The MFLOPS definitely decrease for this code when the DO loops are
expressed in the array section syntax.  We have smart users; they HATE
to go slower.  So few apparently use the array syntax here at NAS,
unless they want a code that can be run with minimal changes on the CM.

russell
rcarter@wilbur.nas.nasa.gov
bernhold@qtp.ufl.edu (David E. Bernholdt) (01/04/91)
In article <1991Jan03.163532.22692@convex.com> psmith@convex.com (Presley Smith) writes:
>Array notation provides a simplified way to specify array operations
>in many cases, but may, in fact, decrease the performance of the program.
>[and gives a concrete example]

I know this has been discussed on this list before, and the point, as I
recall it, was that the F90 array syntax can require extra temporary
arrays, and consequently extra operations, to conform to the behavior
required by the standard.  (Please correct me if that's wrong.)

Does the standard somehow mandate _different_ behavior for the same loop
written in DO vs. array syntax?  Or is there some reason why the array
syntax loop can't be transformed into a more efficient DO loop structure
automatically?

The end results are the same using either DO or array syntax, right?  So
if the array syntax is slower, it sounds to me like the compiler
(optimizer?) writers have a problem, not the language.
--
David Bernholdt                  bernhold@qtp.ufl.edu
Quantum Theory Project           bernhold@ufpine.bitnet
University of Florida
Gainesville, FL 32611            904/392 6365
tholen@uhccux.uhcc.Hawaii.Edu (David Tholen) (01/04/91)
Presley Smith writes:

> Array notation provides a simplified way to specify array operations
> in many cases, but may, in fact, decrease the performance of the program.

Microsoft FORTRAN version 5.0 also supports array notation, and on page 41
of the reference manual is the following note:

   In processing array expressions, the compiler may generate a less
   efficient sequence of machine instructions than it would if the
   arrays were processed in a conventional DO loop.  If execution
   speed is critical, it may be more efficient to handle arrays
   element-by-element.

So the problem isn't limited to the Cray compiler.  What I'd like to know
are the kinds of situations in which array notation slows things down.
Something as simple as

      C = A + B

where all three are conforming arrays, can be done in two ways: with
sequential memory accesses or with nonsequential memory accesses.  The
former should be faster than the latter, especially if nonsequential
memory access results in a lot of paging to disk!  With DO loop syntax,
it's easy to handle the array subscripts in the wrong order, thereby
causing nonsequential memory access, and I would hope that array syntax
would eliminate that common programming error and execute as fast as the
DO loop version with the subscripts specified in the right order.

Maybe it's the more complex expressions that can be programmed more
efficiently with DO loops, but can somebody give some examples?  I've
avoided using array notation simply because I don't know under what
circumstances execution will be slowed, and given that it isn't standard
yet, it represents a compiler "enhancement" that I don't want to use.
You would think the compiler writers would provide a little more detail
on when to use it and when NOT to use it, but so far...
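[The memory-access-order point above can be made concrete with a short
FORTRAN 77 sketch; the array names and sizes are illustrative only.
Fortran stores arrays in column-major order, so for C = A + B the
compiler is free to walk memory sequentially, while a DO loop version
is sequential only when the loops are nested with the first subscript
varying fastest:]

```fortran
      PROGRAM ORDER
      INTEGER N
      PARAMETER (N = 500)
      REAL A(N,N), B(N,N), C(N,N)
      INTEGER I, J
C     Column-major friendly nesting: the INNER loop runs over the
C     FIRST subscript, so successive iterations touch adjacent memory.
      DO 20 J = 1, N
         DO 10 I = 1, N
            C(I,J) = A(I,J) + B(I,J)
   10    CONTINUE
   20 CONTINUE
C     Swapping the nesting (I outer, J inner) strides through memory
C     by N elements per access -- the common mistake that the array
C     expression  C = A + B  leaves no room to make.
      END
```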
jlg@lanl.gov (Jim Giles) (01/04/91)
From article <1991Jan03.163532.22692@convex.com>, by psmith@convex.com (Presley Smith):
> [...]
> Using array syntax:
>
>                 THE NAS KERNEL BENCHMARK PROGRAM
> [...]
>   TOTAL    1.0082E+00   2.1725E+09   34.1810    63.56
>
> And using plain vanilla fortran 77:
> [...]
>   TOTAL    1.9305E-10   2.1725E+09   21.8563    99.40
> [...]
> The MFLOPS definitely decrease for this code when the DO loops
> are expressed in the array section syntax.  We have smart users;
> they HATE to go slower.  So few apparently use the array
> syntax here at NAS, unless they want a code that can be run
> with minimal changes on the CM.

This is obviously an implementation problem - NOT a problem inherent to
the idea of using array syntax.  Since the product you were testing was a
preprocessor, I suspect that the compiler couldn't recover enough of the
information that the preprocessor filtered out.  (This is often a problem
with preprocessing.)  For example, consider the following
Fortran-Extended code fragment:

      real a(10), b(10), c(10), d(10)
      ...
      a = b
      c = d
      ...

If the preprocessor converts the two array assignments to separate do
loops, then the compiler may not see that the following code is
equivalent:

      real a(10), b(10), c(10), d(10)
      ...
      do 10 i=1,10
         a(i) = b(i)
         c(i) = d(i)
 10   continue
      ...

That is, the two loops can be combined.  CFT77, for example, cannot do
this.  Even so, it is clear that a compiler _could_ find this
optimization (and, given the original source rather than a preprocessed
version, it would probably be easier).  So, you shouldn't condemn a
language feature because of a single bad implementation.  (On the other
hand, if you have specific reasons that you think array syntax will be
_inherently_ slower regardless of compiler cleverness - speak up!)

J. Giles
john@ghostwheel.unm.edu (John Prentice) (01/04/91)
In article <10818@uhccux.uhcc.Hawaii.Edu> tholen@uhccux.uhcc.Hawaii.Edu (David Tholen) writes:
>
>So the problem isn't limited to the Cray compiler.  What I'd like to know are
>the kind of situations in which array notation slows things down.

The sort of place I expect the Cray has trouble with array syntax is:

      A=B+C
      D=E+F

where these arrays are all the same length.  In a do-loop, one would do:

      do 10 i=1,len
         a(i)=b(i)+c(i)
         d(i)=e(i)+f(i)
 10   continue

and this would easily vectorize within a single loop.  Will the Cray
compiler interpret the array syntax equivalent the same way?

John K. Prentice
Amparo Corporation
Albuquerque, NM
bleikamp@convex.com (Richard Bleikamp) (01/04/91)
The draft standard requires the entire right side of an assignment
statement to be evaluated before any values of the left side target are
updated.  This requires a temporary only when there is some overlap.
Unfortunately, only vectorizing/parallelizing compilers do enough
dependency analysis to get this right all the time.  So a scalar machine
compiler will often introduce unnecessary temps (at least until they add
dependency analysis).  For example:

      DO 10 I=1,N-1
         A(I) = B(I) + C(I)
         B(I+1) = D(I) * 3.14
 10   CONTINUE

contains a forward recurrence, which most vectorizing compilers can
happily vectorize.  Note that the equivalent F90 statements are:

      B(2:N) = D(1:N-1) * 3.14
      A(1:N-1) = B(1:N-1) + C(1:N-1)

and the order is CRITICAL.  An F90 compiler on a scalar machine will
probably produce worse code for these two F90 statements than for the
F77 equivalent (it will likely treat the two statements separately, in
essence introducing another loop, and screwing up the cache hit rates).
On a vector machine, the resulting code for the F90 statements won't be
better than the F77 version, and will sometimes be worse.  A massively
parallel machine might do better with the F90 code, but not necessarily.

In the case of a backward loop-carried dependency, i.e.

      DO 10 I=1,N
 10      D(I+1) = D(I) * B(I) + C(I)

there is NO LEGAL array notation equivalent.

The real problem for array notation is that scalar compilers don't know
enough to do a good job of converting array notation back into an
appropriate DO loop without introducing unnecessary loops and temps.

>
> The end results are the same using either DO or array syntax,
> right?  So if the array syntax is slower, it sounds to me like the
> compiler (optimizer?) writers have a problem, not the language.

The moral of these examples is ARRAY NOTATION is NOT a direct
replacement for DO loops.  They are not always interchangeable.
Array notation is sometimes a convenient shorthand for simple array
operations, but will not inherently allow compilers to generate better
code.
--
------------------------------------------------------------------------------
Rich Bleikamp                                       bleikamp@convex.com
Convex Computer Corporation
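[A minimal sketch of the overlap case described above; the array names
and values are made up for illustration.  The F90 assignment
A(2:N) = A(1:N-1) reads elements that the same statement also writes,
so the evaluate-the-whole-right-side rule forces either a reversed loop
or a temporary; a compiler that cannot prove anything about the overlap
must copy first:]

```fortran
      PROGRAM OVRLAP
      INTEGER N, I
      PARAMETER (N = 6)
      REAL A(N), T(N-1)
      DO 10 I = 1, N
         A(I) = REAL(I)
   10 CONTINUE
C     The F90 statement  A(2:N) = A(1:N-1)  must behave as if all of
C     the old values A(1:N-1) were read before any element of A(2:N)
C     is stored.  Expanded naively as
C         DO I = 2, N
C            A(I) = A(I-1)
C         END DO
C     every element would end up equal to A(1).  The conservative
C     translation copies the right side into a temporary first:
      DO 20 I = 1, N-1
         T(I) = A(I)
   20 CONTINUE
      DO 30 I = 1, N-1
         A(I+1) = T(I)
   30 CONTINUE
C     A is now 1. 1. 2. 3. 4. 5. -- the values shifted right by one.
C     (Here a backward-running loop would also work; the temporary
C     shows the general rule.)
      END
```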
djh@xipe.osc.edu (David Heisterberg) (01/05/91)
In article <1991Jan4.004719.21640@ariel.unm.edu> john@ghostwheel.unm.edu (John Prentice) writes:
>The sort of place I expect the Cray has trouble with array syntax is:
>      A=B+C
>      D=E+F
>where these arrays are all the same length.  In a do-loop, one would do:
>      do 10 i=1,len
>         a(i)=b(i)+c(i)
>         d(i)=e(i)+f(i)
> 10   continue

FPP 3.00Z36 will do exactly that.  Also, given

      A (N, M), B (N, M), C (N, M)

and

      C = A + B

FPP will produce

      DO xxxxx J1X = 1, N*M
         C (J1X, 1) = A (J1X, 1) + B (J1X, 1)
xxxxx CONTINUE

I love it when compilers get better.  But I do have CONVEX-envy when it
comes to some constructs that the CRAY compiler still won't attempt.
--
David J. Heisterberg                 djh@osc.edu      And you all know
The Ohio Supercomputer Center        djh@ohstpy.bitnet    security Is
Columbus, Ohio  43212                ohstpy::djh    mortals' chiefest enemy.
bernhold@qtp.ufl.edu (David E. Bernholdt) (01/05/91)
In article <1991Jan04.154747.18673@convex.com> bleikamp@convex.com (Richard Bleikamp) writes:
>The moral of these examples is ARRAY NOTATION is NOT a direct replacement for
>DO loops.  They are not always interchangeable.  Array notation is sometimes
>a convenient shorthand for simple array operations, but will not inherently
>allow compilers to generate better code.

However, every array syntax loop _does_ have an equivalent DO syntax.
This is the important point, since the complaints are against the array
syntax.

>The real problem for array notation is that scalar compilers don't know
>enough to do a good job of converting array notation back into an
>appropriate DO loop without introducing unnecessary loops and temps.

So, as I said before, it is the fault of the compiler, not the standard.
If the array notation is slow, complain to the vendor.  If the extra
analysis makes the compilation slow, put in a switch and let the _user_
decide if they want fast compilation or fast execution.  There is ample
precedent for this idea in the existing optimization switches available
on most compilers.
--
David Bernholdt                  bernhold@qtp.ufl.edu
Quantum Theory Project           bernhold@ufpine.bitnet
University of Florida
Gainesville, FL 32611            904/392 6365
maine@elxsi.dfrf.nasa.gov (Richard Maine) (01/05/91)
On 4 Jan 91 15:47:47 GMT, bleikamp@convex.com (Richard Bleikamp) said:

Richard> The draft standard requires the entire right side of an
Richard> assignment statement to be evaluated before any values of the
Richard> left side target are updated.  This requires a temporary only
Richard> when there is some overlap.  Unfortunately, only
Richard> vectorizing/parallelizing compilers do enough dependency
Richard> analysis to get this right all the time.  So a scalar machine
Richard> compiler will often introduce unnecessary temps (at least
Richard> until they add dependency analysis).

F77 character assignments have a similar problem with temps, avoided in
the F77 standard by the simple expedient of making overlap illegal.  Of
course, some compilers allow it as an extension, and other compilers
don't diagnose the problem but don't do it "right" either.  I've run
into several program bugs caused by people who didn't know about the
restriction and assumed that expressions like

      character a*16
      a(2:) = a

will do the intuitively "obvious" thing instead of what it really does
in many cases (a naive left-to-right copy propagates the first
character through the rest of the string instead of shifting it).

Richard> The moral of these examples is ARRAY NOTATION is NOT a direct
Richard> replacement for DO loops.  They are not always
Richard> interchangeable.  Array notation is sometimes a convenient
Richard> shorthand for simple array operations, but will not
Richard> inherently allow compilers to generate better code.

I'd agree with this statement.  Though I've occasionally seen people
suggest that array notation might or might not produce better code for
parallel machines, this doesn't seem the main point.  To me, the above
quoted phrase "a sometimes convenient shorthand" is much more to the
point.  Array notation is to make code more concise, clearer, easier to
maintain, and less buggy.  Of course it doesn't guarantee any of these
characteristics, but it can be used to aid them.
I'll take a 30% slower program that actually works instead of a fast
buggy one that gives wrong results any day.  (For instance, the F77
program bugs in character string overlap that I cited above would have
worked correctly in F90.)  But then I'm funny that way.  I swear there
are people that would rather have the fast wrong results.
(Unfortunately, that is sometimes literally true, rather than just a
sarcastic comment :-().

If it takes DO loops instead of array syntax to make a time-critical
section of code faster, then I'll certainly use the DO loops there.  If
it takes assembly language to make a time-critical section of code fast
enough, I'll use assembly language.  I've not found that necessary for
several years, though; whether because compilers have gotten enough
better or because machines have gotten enough faster is not clear.

The whole purpose of compilers is to make writing code (particularly
good, maintainable, bug-free, etc. code) easier for the user.  If speed
were the number 1 priority, we'd all be coding in assembly.  Array
notation has been high on the wish list of very many Fortran users for
decades.  I've certainly heard the request many times, and it's almost
never been because the user wanted his code to run faster, but because
he (or she - sorry) wanted it to be easier to write.  Indeed, the
complaint "Why doesn't F8x/F90 just add array notation and forget all
this other stuff?" has been heard regularly.  (I don't happen to agree
with that position, but I've certainly heard it.)
--
Richard Maine
maine@elxsi.dfrf.nasa.gov [130.134.64.6]
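[The character-overlap bug mentioned above can be illustrated with a
small F77 sketch; the string contents are an invented example.  The
shift is written here with the explicit temporary that the F77 overlap
restriction effectively forces on the programmer:]

```fortran
      PROGRAM SHIFT
      CHARACTER A*8, T*7
      A = 'ABCDEFGH'
C     The programmer wants to shift the string right by one character,
C     so that A becomes 'AABCDEFG'.  Writing  A(2:8) = A(1:7)  directly
C     is illegal in F77 because the substrings overlap; a naive
C     left-to-right copy would propagate 'A' through the whole string,
C     giving 'AAAAAAAA'.  The safe F77 spelling copies through a
C     temporary (F90 array assignment makes the copy implicit):
      T = A(1:7)
      A(2:8) = T
      PRINT *, A
      END
```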
john@ghostwheel.unm.edu (John Prentice) (01/05/91)
Let me bring this discussion full circle now.  The original question I
posted (in the parallel newsgroup?) had to do with the fact that the
CM-2 and a few other machines invoke parallelism by the use of array
syntax.  My argument was that this was a bit unfortunate since array
syntax is not part of any existing standard and Fortran Extended is yet
to even be ratified, much less a compiler written for it.  The response
was that many more compilers than I realized already support array
syntax and that it is therefore not as unportable as I argued (this
point is moot for us, by the way; we are contractually obligated to use
ANSI standard Fortran, so extensions are not helpful since we can only
use them when we have no other choice, such as on the CM).

However, now people are pointing out that array syntax is not a
replacement for do-loops and that it may produce slower code, even on a
vector or parallel computer.  This would seem to drive us back toward my
original point: array syntax for invoking parallelism is unfortunate
until such time as Fortran Extended becomes available (whenever that
might be).  Even then, what people are saying is that by writing my code
to run on a CM, which REQUIRES array syntax, I am going to hobble it on
most any other computer, where the array syntax may defeat or confuse
the optimizer.

Is this a fair statement, and if so, does anybody have any suggestions
for how to address this problem short of having two versions of the code
(one for the CM and one for everyone else)?

John Prentice
Amparo Corporation
Albuquerque, NM