[comp.lang.fortran] Array Notation

psmith@convex.com (Presley Smith) (01/04/91)

Thought this might be interesting to readers of comp.lang.fortran.  It
quantifies the potential performance decrease that one may experience
when converting from the DO-loop notation of FORTRAN 77 to the array
notation available on the Cray.  This is the same array notation that
is specified in Fortran 90.

Array notation provides a simplified way to specify array operations
in many cases, but may, in fact, decrease the performance of the program.

------------------------------------------------------------------------

Path: convex!texsun!sundc!seismo!dimacs.rutgers.edu!mips!news.cs.indiana.edu!cica!tut.cis.ohio-state.edu!att!emory!hubcap!rcarter
From: rcarter@nas.nasa.gov (Russell L. Carter)
Newsgroups: comp.parallel
Subject: Re: CM Fortran (actually, CRI array syntax performance)
Message-ID: <12444@hubcap.clemson.edu>
Date: 2 Jan 91 18:06:35 GMT
References: <12301@hubcap.clemson.edu> <12324@hubcap.clemson.edu> <12381@hubcap.clemson.edu> <12398@hubcap.clemson.edu> <12410@hubcap.clemson.edu>
Sender: fpst@hubcap.clemson.edu
Reply-To: rcarter@wilbur.nas.nasa.gov (Russell L. Carter)
Organization: NAS Program, NASA Ames Research Center, Moffett Field, CA
Lines: 67
Approved: parallel@hubcap.clemson.edu

In article <12410@hubcap.clemson.edu> john%ghostwheel.unm.edu@ariel.unm.edu (John Prentice) writes:
>In article <12398@hubcap.clemson.edu> serafini@amelia.nas.nasa.gov (David B. Serafini) writes:
>>
>>CFT77 has had a subset of the Fortran 90 array syntax for quite a while now,
>>at least a couple of years.  It hasn't caught on much, (to my knowledge)
>>probably because of the portability problems it creates.  The Cray Fortran
>>preprocessor, fpp, now can put out do-loop based code when given array code,
>>so portability is less of a concern now. 
>>
>This is rather interesting.  Does fpp generate a new do-loop for every
>array construct or is it smart enough to combine operations the way
>you would if you were writing in Fortran 77?  What is the effect on
>the optimizer of all this?
>
>John K. Prentice
>john@unmfys.unm.edu

Well, let's look at some data.  I converted the NAS Kernels, a popular
CFD benchmark (at least here at NAS), to array syntax and ran it
on our YMP.  Here is what I get:

Using array syntax:

                THE NAS KERNEL BENCHMARK PROGRAM


 PROGRAM        ERROR          FP OPS       SECONDS      MFLOPS

 MXM          1.8085E-13     4.1943E+08      1.5879      264.15
 CFFT2D       3.2001E-12     4.9807E+08     11.1267       44.76
 CHOLSKY      1.8256E-10     2.2103E+08      4.9911       44.29
 BTRIX        6.0622E-12     3.2197E+08      4.5159       71.30
 GMTRY        1.0082E+00     2.2650E+08      3.5258       64.24
 EMIT         1.5609E-13     2.2604E+08      1.3055      173.15
 VPENTA       2.3541E-13     2.5943E+08      7.1281       36.40

 TOTAL        1.0082E+00     2.1725E+09     34.1810       63.56


And using plain vanilla fortran 77:


                THE NAS KERNEL BENCHMARK PROGRAM


 PROGRAM        ERROR          FP OPS       SECONDS      MFLOPS

 MXM          1.8085E-13     4.1943E+08      1.5705      267.06
 CFFT2D       3.2001E-12     4.9807E+08      7.0951       70.20
 CHOLSKY      1.8256E-10     2.2103E+08      2.6393       83.75
 BTRIX        6.0622E-12     3.2197E+08      2.3717      135.76
 GMTRY        6.5609E-13     2.2650E+08      2.0910      108.32
 EMIT         1.5609E-13     2.2604E+08      1.2987      174.05
 VPENTA       2.3541E-13     2.5943E+08      4.7900       54.16

 TOTAL        1.9305E-10     2.1725E+09     21.8563       99.40



The MFLOPS definitely decrease for this code when the DO loops
are expressed in the array section syntax.  We have smart users;
they HATE to go slower.  So few of them apparently use the array
syntax here at NAS, unless they want a code that can be run
with minimal changes on the CM.

russell
rcarter@wilbur.nas.nasa.gov

bernhold@qtp.ufl.edu (David E. Bernholdt) (01/04/91)

In article <1991Jan03.163532.22692@convex.com> psmith@convex.com (Presley Smith) writes:
>Array notation provides a simplified way to specify array operations
>in many cases, but may, in fact, decrease the performance of the program.
>[and gives a concrete example]

I know this has been discussed on this list before, and the point, as
I recall it, was that the F90 array syntax can require extra temporary
arrays and consequently extra operations to conform to the behavior
required by the standard. (Please correct me if that's wrong.)

Does the standard somehow mandate _different_ behavior for the same
loop written in DO vs array syntax?  Or is there some reason why 
the array syntax loop can't be transformed into a more efficient DO
loop structure automatically?

The end results are the same using either DO or array syntax,
right?  So if the array syntax is slower, it sounds to me like the
compiler (optimizer?) writers have a problem, not the language.
-- 
David Bernholdt			bernhold@qtp.ufl.edu
Quantum Theory Project		bernhold@ufpine.bitnet
University of Florida
Gainesville, FL  32611		904/392 6365

tholen@uhccux.uhcc.Hawaii.Edu (David Tholen) (01/04/91)

Presley Smith writes:

> Array notation provides a simplified way to specify array operations
> in many cases, but may, in fact, decrease the performance of the program.

Microsoft FORTRAN version 5.0 also supports array notation, and on page 41
of the reference manual is the following note:

   In processing array expressions, the compiler may generate a less
   efficient sequence of machine instructions than it would if the
   arrays were processed in a conventional DO loop.  If execution
   speed is critical, it may be more efficient to handle arrays
   element-by-element.

So the problem isn't limited to the Cray compiler.  What I'd like to know
is the kinds of situations in which array notation slows things down.
Something as simple as
      C = A + B
where all three are conforming arrays, can be done in two ways: with
sequential memory accesses or with nonsequential memory accesses.  The former
should be faster than the latter, especially if nonsequential memory access
results in a lot of paging to disk!  With DO loop syntax, it's easy to handle
the array subscripts in the wrong order, thereby causing nonsequential memory
access, and I would hope that array syntax would eliminate that common
programming error and execute as fast as the DO loop version with the
subscripts specified in the right order.  Maybe it's the more complex 
expressions that can be programmed more efficiently with DO loops, but can
somebody give some examples?
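
To make the subscript-order point concrete, here is a rough sketch of
what I mean (array sizes picked arbitrarily).  Fortran stores arrays in
column-major order, so the leftmost subscript should vary fastest in the
innermost loop:

      real a(1000,1000), b(1000,1000), c(1000,1000)
c     Nonsequential access: the inner loop varies the SECOND subscript,
c     so successive iterations touch elements 1000 reals apart.
      do 20 i = 1, 1000
         do 10 j = 1, 1000
            c(i,j) = a(i,j) + b(i,j)
   10    continue
   20 continue
c     Sequential access: the inner loop varies the FIRST subscript,
c     so memory is touched in storage order.
      do 40 j = 1, 1000
         do 30 i = 1, 1000
            c(i,j) = a(i,j) + b(i,j)
   30    continue
   40 continue

One would hope that C = A + B always gets compiled like the second nest,
no matter how the programmer happened to be thinking about the subscripts.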

I've avoided using array notation simply because I don't know under what
circumstances execution will be slowed, and given that it isn't standard
yet, it represents a compiler "enhancement" that I don't want to use.  You
would think the compiler writers would provide a little more detail on
when to use it and when NOT to use it, but so far...

jlg@lanl.gov (Jim Giles) (01/04/91)

From article <1991Jan03.163532.22692@convex.com>, by psmith@convex.com (Presley Smith):
> [...]
> Using array syntax:
> 
>                 THE NAS KERNEL BENCHMARK PROGRAM
> [...]
>  TOTAL        1.0082E+00     2.1725E+09     34.1810       63.56
> 
> 
> And using plain vanilla fortran 77:
> [...]
>  TOTAL        1.9305E-10     2.1725E+09     21.8563       99.40
> [...]
> The MFLOPS definitely decrease for this code when the DO loops
> are expressed in the array section syntax.  We have smart users;
> They HATE to go slower.  So few apparently use the array
> syntax here at NAS, unless they want a code that can be run
> with minimal changes on the CM.

This is obviously an implementation problem - NOT a problem inherent
to the idea of using array syntax.  Since the product you were testing
was a preprocessor, I suspect that the compiler couldn't recover enough
of the information that the preprocessor filtered out.  (This is often
a problem with preprocessing.)  For example, consider the following
Fortran-Extended code fragment:

      real a(10), b(10), c(10), d(10)
      ...
      a = b
      c = d
      ...

If the preprocessor converts the two array assignments to separate
do loops, then the compiler may not see that the following code is
equivalent:

      real a(10), b(10), c(10), d(10)
      ...
      do 10 i=1,10
         a(i) = b(i)
         c(i) = d(i)
  10  continue
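
For comparison, a preprocessor that simply translates each array
assignment on its own would presumably emit something like this (my
sketch of the idea, not actual fpp output):

      do 10 i=1,10
         a(i) = b(i)
  10  continue
      do 20 i=1,10
         c(i) = d(i)
  20  continue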

That is, the two loops can be combined.  CFT77, for example, cannot do
this.  Even so, it is clear that a compiler _could_ find this optimization
(and, given the original source rather than a preprocessed version, it
would probably be easier).  So, you shouldn't condemn a language feature
because of a single bad implementation.  (On the other hand, if you have
specific reasons to think that array syntax will be _inherently_ slower
regardless of compiler cleverness - speak up!)

J. Giles

john@ghostwheel.unm.edu (John Prentice) (01/04/91)

In article <10818@uhccux.uhcc.Hawaii.Edu> tholen@uhccux.uhcc.Hawaii.Edu (David Tholen) writes:
>
>So the problem isn't limited to the Cray compiler.  What I'd like to know are
>the kind of situations in which array notation slows things down. 

The sort of place I expect the Cray has trouble with array syntax is:
       A=B+C
       D=E+F
where these arrays are all the same length.  In a do-loop, one would do:
       do 10 i=1,len
           a(i)=b(i)+c(i)
           d(i)=e(i)+f(i)
   10  continue
and this would easily vectorize within a single loop.  Will the Cray
compiler interpret the array syntax equivalent the same way?

John K. Prentice
Amparo Corporation
Albuquerque, NM

bleikamp@convex.com (Richard Bleikamp) (01/04/91)

The draft standard requires the entire right side of an assignment statement
to be evaluated before any values of the left side target are updated.  This
requires a temporary only when there is some overlap.  Unfortunately, only
vectorizing/parallelizing compilers do enough dependency analysis to get this
right all the time.  So a scalar machine compiler will often introduce
unnecessary temps (at least until they add dependency analysis).  For example:

    DO 10  I=1,N-1
        A(I) = B(I) + C(I)
        B(I+1) = D(I) * 3.14
10  CONTINUE

contains a forward recurrence, which most vectorizing compilers can happily
vectorize.  Note that the equivalent F90 statements are:

B(2:N) = D(1:N-1) * 3.14
A(1:N-1) = B(1:N-1) + C(1:N-1)

and the order is CRITICAL.  An F90 compiler on a scalar machine will probably
produce worse code for these two F90 statements than for the F77 equivalent
(it will likely treat the two statements separately, in essence introducing
another loop and screwing up the cache hit rates).
On a vector machine, the resulting code for the F90 statements will be no
better than the F77 version, and is sometimes worse.  A massively parallel
machine might do better with the F90 code, but not necessarily.

In the case of a backward loop-carried dependency, i.e.

    DO 10 I=1,N
10      D(I+1) = D(I) * B(I) + C(I)

there is NO LEGAL array notation equivalent.  
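
The nearest-looking array statement,

D(2:N+1) = D(1:N) * B(1:N) + C(1:N)

is legal but is not equivalent: the right side is evaluated entirely from
the OLD values of D, whereas the DO loop feeds each newly computed D(I)
into the next iteration.  (That statement is shown only to illustrate the
mismatch, not as a suggested rewrite.)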

The real problem for array notation is that scalar compilers don't know
enough to do a good job of converting array notation back into an appropriate
DO loop without introducing unnecessary loops and temps.
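
To make the unnecessary-temp problem concrete, consider an overlapping
assignment such as (my own example, and the translations below are only
sketches, not the output of any particular compiler):

    A(2:N) = A(1:N-1) + C(2:N)

A compiler that does no dependency analysis plays it safe and evaluates
the whole right side into a compiler-generated temporary T:

    DO 10 I = 1, N-1
        T(I) = A(I) + C(I+1)
10  CONTINUE
    DO 20 I = 1, N-1
        A(I+1) = T(I)
20  CONTINUE

A compiler that works out the dependence direction needs no temporary at
all; running the loop backwards guarantees each A(I) is still its old
value when it is read:

    DO 30 I = N-1, 1, -1
        A(I+1) = A(I) + C(I+1)
30  CONTINUE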

David Bernholdt (bernhold@qtp.ufl.edu) writes:
>
> The end results are the same using either DO or array syntax,
> right?  So if the array syntax is slower, it sounds to me like the
> compiler (optimizer?) writers have a problem, not the language.

The moral of these examples is that ARRAY NOTATION is NOT a direct replacement
for DO loops.  They are not always interchangeable.  Array notation is
sometimes a convenient shorthand for simple array operations, but it will not
inherently allow compilers to generate better code.

--
------------------------------------------------------------------------------
Rich Bleikamp			    bleikamp@convex.com
Convex Computer Corporation	    

djh@xipe.osc.edu (David Heisterberg) (01/05/91)

In article <1991Jan4.004719.21640@ariel.unm.edu> john@ghostwheel.unm.edu (John Prentice) writes:
>The sort of place I expect the Cray has trouble with array syntax is:
>       A=B+C
>       D=E+F
>where these arrays are all the same length.  In a do-loop, one would do:
>       do 10 i=1,len
>           a(i)=b(i)+c(i)
>           d(i)=e(i)+f(i)
>   10  continue

FPP 3.00Z36 will do exactly that.

Also, given A (N, M), B (N, M), C (N, M) and
      C = A + B
FPP will produce
      DO xxxxx J1X = 1, N*M
         C (J1X, 1) = A (J1X, 1) + B (J1X, 1)
xxxxx CONTINUE
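
For comparison, the straightforward translation would be a doubly nested
loop (my sketch, not FPP output):

      DO 20 J = 1, M
         DO 10 I = 1, N
            C (I, J) = A (I, J) + B (I, J)
   10    CONTINUE
   20 CONTINUE

Collapsing the nest the way FPP does turns M vector loops of length N
into one loop of length N*M, which pays off when N alone is short
relative to the machine's vector length.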

I love it when compilers get better.  But I do have CONVEX-envy when it
comes to some constructs that the CRAY compiler still won't attempt.
--
David J. Heisterberg		djh@osc.edu		And you all know
The Ohio Supercomputer Center	djh@ohstpy.bitnet	security Is mortals'
Columbus, Ohio  43212		ohstpy::djh		chiefest enemy.

bernhold@qtp.ufl.edu (David E. Bernholdt) (01/05/91)

In article <1991Jan04.154747.18673@convex.com> bleikamp@convex.com (Richard Bleikamp) writes:
>The moral of these examples is ARRAY NOTATION is NOT a direct replacement for
>DO loops.  They are not always interchangeable.  Array notation is sometimes
>a convenient shorthand for simple array operations, but will not inherently
>allow compilers to generate better code.

However, every array-syntax loop _does_ have an equivalent DO-loop form.
This is the important point, since the complaints are against the
array syntax.

>The real problem for array notation is that scalar compilers don't known
>enough to do a good job of converting array notation back into an appropriate
>DO loop without introducing unnecessary loops and temps.

So, as I said before, it is the fault of the compiler, not the
standard.  If the array notation is slow, complain to the vendor.  If
the extra analysis makes the compilation slow, put in a switch and let
the _user_ decide if they want fast compilation or fast execution.
There is ample precedent for this idea in the existing
optimization switches available on most compilers.
-- 
David Bernholdt			bernhold@qtp.ufl.edu
Quantum Theory Project		bernhold@ufpine.bitnet
University of Florida
Gainesville, FL  32611		904/392 6365

maine@elxsi.dfrf.nasa.gov (Richard Maine) (01/05/91)

On 4 Jan 91 15:47:47 GMT, bleikamp@convex.com (Richard Bleikamp) said:
Richard> Nntp-Posting-Host: mozart.convex.com

Richard> The draft standard requires the entire right side of an
Richard> assignment statement to be evaluated before any values of the
Richard> left side target are updated.  This requires a temporary only
Richard> when there is some overlap.  Unfortunately, only
Richard> vectorizing/parallelizing compilers do enough dependency
Richard> analysis to get this right all the time.  So a scalar machine
Richard> compiler will often introduce unnecessary temps (at least
Richard> until they add dependency analysis).

F77 character assignments have a similar problem with temps, avoided in
the F77 standard by the simple expedient of making overlap illegal.
Of course, some compilers allow it as an extension, and other compilers
don't diagnose the problem but don't do it "right" either.  I've run into
several program bugs caused by people who didn't know about the restriction
and assumed that an assignment like
     character a*16
     a(2:) = a
will do the intuitively "obvious" thing instead of what it really does in
many cases.
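
Here is a small sketch of the kind of surprise I mean (shorter string,
same idea):

      character a*8
      a = 'ABCDEFGH'
      a(2:) = a
c     The "obvious" intent is a right shift, giving 'AABCDEFG'.  A
c     compiler that copies character by character, left to right, reads
c     positions it has already overwritten and leaves 'AAAAAAAA'.  F77
c     sidesteps this by making the overlap illegal; F90 requires the
c     right side to be evaluated first, so the shift really happens.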

Richard> The moral of these examples is ARRAY NOTATION is NOT a direct
Richard> replacement for DO loops.  They are not always
Richard> interchangeable.  Array notation is sometimes a convenient
Richard> shorthand for simple array operations, but will not
Richard> inherently allow compilers to generate better code.

I'd agree with this statement.  Though I've occasionally seen people
suggest that array notation might or might not produce better code for
parallel machines, this doesn't seem the main point.  To me, the above
quoted phrase "a sometimes convenient shorthand" is much more to the point.
Array notation is meant to make code more concise, clearer, easier to
maintain, and less buggy.  Of course it doesn't guarantee any of these
characteristics, but it can be used to aid them.

I'll take a 30% slower program that actually works over a fast
buggy one that gives wrong results any day.  (For instance, the F77
character-string overlap bugs that I cited above would have worked
correctly in F90.)  But then I'm funny that way.  I swear
there are people who would rather have the fast wrong results.
(Unfortunately, that is sometimes literally true, rather than just
a sarcastic comment :-().

If it takes DO loops instead of array syntax to make a time-critical
section of code faster, then I'll certainly use the DO loops there.
If it takes assembly language to make a time-critical section of
code fast enough, I'll use assembly language.  I've not found that
necessary for several years though, whether because compilers have
gotten enough better or because machines have gotten enough faster
is not clear.

The whole purpose of compilers is to make writing code (particularly
good, maintainable, bug-free, etc. code) easier for the user.  If speed
were the number 1 priority, we'd all be coding in assembly.

Array notation has been high on the wish list of very many Fortran
users for decades.  I've certainly heard the request many times,
and it's almost never been because the user wanted his code to run
faster, but because he (or she - sorry) wanted it to be easier to write.

Indeed the complaint "Why doesn't F8x/F90 just add array notation and forget
all this other stuff?" has been heard regularly. (I don't happen to
agree with that position, but I've certainly heard it).
--

Richard Maine
maine@elxsi.dfrf.nasa.gov [130.134.64.6]

john@ghostwheel.unm.edu (John Prentice) (01/05/91)

Let me bring this discussion full circle now.  The original question I
posted (in the parallel newsgroup?) had to do with the fact that the CM-2
and a few other machines invoke parallelism through array syntax.  My
argument was that this was a bit unfortunate, since array syntax is not
part of any existing standard and Fortran Extended has yet to even be
ratified, much less have a compiler written for it.  The response was that
many more compilers than I realized already support array syntax, and that
it is therefore not as unportable as I argued.  (This point is moot for us,
by the way: we are contractually obligated to use ANSI standard Fortran, so
extensions are not helpful, since we can only use them when we have no
other choice, such as on the CM.)

However, now people are pointing out that array syntax is not a replacement
for do-loops and that it may produce slower code, even on a vector or
parallel computer.  This would seem to drive us back toward my original
point: array syntax for invoking parallelism is unfortunate until such time
as Fortran Extended becomes available (whenever that might be).  Even then,
what people are saying is that by writing my code to run on a CM, which
REQUIRES array syntax, I am going to hobble it on most any other computer,
where the array syntax may defeat or confuse the optimizer.  Is this a fair
statement, and if so, does anybody have any suggestions for how to address
this problem, short of having two versions of the code (one for the CM and
one for everyone else)?

John Prentice
Amparo Corporation
Albuquerque, NM