[comp.lang.fortran] Fortran vs. C, and noalias: old postings, new data

dgh@validgh.com (David G. Hough on validgh) (12/06/90)

The current threads of discussion prove that nothing is ever settled
for long on USENET.   Following are some new data followed by some
old postings which explain the data, even though they apply to the
hardware and software available a couple of years ago.

The current data reflect Sun-4/75 and Sun-4/470 hardware, and developmental
versions of Sun Fortran 1.4 and (conventional, not ANSI) C 1.1.
(Released versions of these products may vary).
Fortran compilation options included -dalign, -libmil, -O4, -cg89, and
-Bstatic, while C added -fsingle to these.  So the C compilation 
environment was made as much like Fortran's as possible; these particular
compilers share global and local optimizers, code generators, and math
libraries.  Note that -O4 implies that some source-level inlining may
occur, while -libmil implies that assembly-language inline expansion
templates (that don't set errno or call matherr)
are used to remove most of C's disadvantage with respect to Fortran in
this regard.  The following table shows RATES (operations/second).
SP is single precision, DP double precision; roll is a Fortran source-rolled
100x100 Linpack source, croll is a fairly direct hand translation of that into C;
unroll and cunroll are corresponding source-unrolled versions;
similarly for whetstone and cwhetstone.  Fortran and C 
sources were adapted from versions available publicly on netlib
[send a message "send index from benchmark" to netlib@ornl.gov.
Neither of these benchmarks is good for comparing systems but they are
adequate for comparing certain language features.

			4/75		4/490

SP.roll	MFLOPS		6.3		5.2
SP.croll		4.7		3.5

DP.roll			3.8		4.1
DP.croll		3.0		2.8

SP.unroll		4.6		3.4
SP.cunroll		4.5		3.4

DP.unroll		3.0		2.7
DP.cunroll		3.0		2.7

SP.whetstone MWHETS	29		25
SP.cwhetstone		22		20

DP.whetstone		23		22
DP.cwhetstone		20		19

A proprietary  geometric-optics benchmark from one of our customers showed comparable
C and Fortran performance because it's dominated by sqrt rather than array
processing.  C performance would
have been much worse than Fortran, but for -libmil and -fsingle.

I would recommend obtaining the source codes from netlib and comparing the results
on other systems that have common-component C and Fortran compilers.  
You won't see any difference on Macintoshes or PC clones
that lack hardware floating point or have simple implementations like 6888x
or 80x87.  The differences should be apparent on on aggressively optimized
80486 and 68040 systems, however, as well as on most of the current RISC-based
systems.

Here's the explanation.  The context was a recent posting from dmr demanding
removal of noalias from the ANSI-C draft; the following is slightly edited
to preserve the main points:

-------------------------------------------------------------------------------

>From postnews Fri May 20 18:33:48 1988
Subject: no noalias not negligible - a difference between C and Fortran - long
Newsgroups: comp.lang.c,comp.lang.fortran
Status: R

noalias may be non-negotiable, but it may be non-negligible, as I found 
out somewhat to my surprise this week.
 
At various times I've needed a version of The Linpack Benchmark
written in C, usually because a Fortran compiler was not available;
The Linpack Benchmark is very useful for lots of surprising purposes,
like debugging caches and virtual memory systems.  Of course, The Linpack
Benchmark is by definition written entirely in Fortran, so for benchmarking
purposes a C translation is more or less useless.  This despite an
observation by one of my colleagues, viewing the C version, that
it appeared to still be written in Fortran.  But faithful preservation of
Fortran semantics, including memory access patterns, was one of the goals
of the translation.

My new manager wanted to tackle a small technical job to keep in shape,
and so I suggested this translation.  It was not quite as small a job
as we expected, but eventually she got it to produce identical numerical
results as the Fortran version, but she never could get comparable performance.

The results for double precision linpack on Sun-4 using SunOS 4.0 and
Fortran 1.1 were:

		Rolled Source		Unrolled Source

Fortran		1080 KFLOPS		875 KFLOPS
C		 850 KFLOPS		875 KFLOPS

Why was C slower than Fortran in the rolled case?

It turns out that almost all The Linpack Benchmark does is execute
a short subroutine of the following simplified Fortran forms:

      subroutine daxpy(n,da,dx,dy)
      doubleprecision dx(1),dy(1),da
      integer i,n
      do 30 i = 1,n,4
        dy(i)   = dy(i)   + da*dx(i)
        dy(i+1) = dy(i+1) + da*dx(i+1)
        dy(i+2) = dy(i+2) + da*dx(i+2)
        dy(i+3) = dy(i+3) + da*dx(i+3)
 30   continue
      return
      end

OR

      subroutine daxpy(n,da,dx,dy)
      doubleprecision dx(1),dy(1),da
      integer i,n
      do 30 i = 1,n
        dy(i) = dy(i) + da*dx(i)
 30   continue
      return
      end

The first of these is the standard UNROLLED form of the program;
the second is a questionably legal modification called the ROLLED form.  The
original Fortran was written with unrolled source because that generated
more efficient code ten years ago when most compilers didn't unroll
loops for you automatically.  Nowadays many Fortran compilers,
including Sun's, can get better
performance by unrolling the loops themselves than by attempting to
figure out unrolled source code.

Most of the benefit of loop unrolling in high-performance systems derives
from the possibilities it opens up for instruction scheduling across
independent iterations.  The current multiplication can overlap the
previous addition, or current computation can overlap previous stores
and subsequent loads; what is worthwhile varies among implementations.

The corresponding rolled C code could be written with a for loop

daxpy(n, da, dx, dy )
        double          dx[], dy[], da;
        int             n;
{
        int             i;

        for (i = 0; i < n; i++) {
                dy[i] = dy[i] + da * dx[i];
        }
}
 
[and this is actually moved inline to the calling function (dgefa),
where it gets unrolled.  But much of the benefit of unrolling is lost.]
If the source form is unrolled, however, 
the optimizer can't do as much with Fortran,
and C performance is the same: 
optimizers do a lot better work with
simple loops than with clever ones.
 
Investigation revealed that the reason had to do with noalias:  all Fortran
pointer variables (which happen to be exactly the set of procedure parameters)
are defined by the Fortran standard to be "noalias", meaning a compiler
may optimize code based on the assumption that the pointers never reference
the same memory.  Alleged Fortran programs which break under such optimization
are declared by the Fortran standard to be non-standard.  Very neat.

C, in contrast, has other kinds of pointer variables than procedure
parameters, and many people believe that a global decree of the Fortran type 
would
break a lot of existing C programs.   So the default is that optimization
must assume that any two pointers may be pointing to the same thing unless
it can prove otherwise.  For a while X3J11 had a local "noalias" attribute
that you could attach to pointers, but later recanted in consideration to
assertions like 1) nobody had done it before, which is probably true, 
2) nobody could agree on exactly what it meant, which appeared to be
true, and 3) optimizing compilers should be able to figure out if aliasing
exists, which is definitely false in a separate compilation environment
(unless you want the linker to recompile everything, in which case the
linker is the compiler, and you're back to no separate compilation).
Anyway there is no
portable way in draft ANSI C to say "this pointer is guaranteed
to have no aliases".

Thus the first part of the C compiler does NOT tell the optimizer that any
pointers are guaranteed unaliased; the optimizer won't unroll anything if
there are potential aliasing problems: you don't dare load dx[i+1] before
you store dy[i] if there is any danger that they point to the same place.
The Fortran compiler need have no such qualms.

What is to be done?  I submitted extensive commentary to X3J11 during the
last public review period about numerical issues, but didn't mention noalias
because it was such a hot potato and I didn't think it mattered much, not
having investigated the possibilities.  Even if noalias could be proved
to be unquestionably a good idea, I doubt X3J11 would want to change its draft
again, since such proofs seem so easy to overturn.  Perhaps what will happen
is that high-performance
C compilers will adopt the questionable CDC/Cray Fortran practice 
of providing "unsafe" optimization levels that, for instance,
assume all pointers are unaliased.

-------------------------------------------------------------------------------

>From postnews Thu May 26 16:59:09 1988
Status: R

It's also worth noting that several commentators referred to this as a 
problem with vector machines, so that for instance PC programmers may have
concluded that it was irrelevant to their careers.  However the 
measurements taken above were on a Sun-4/260 which is not a vector machine,
but does allow the integer unit, the floating-point adder, and the
floating-point multiplier to operate in parallel in certain circumstances
which arise frequently when the instruction scheduler does its job well.

It's to be expected that floating-point units of comparable complexity
will be common in PC-class machines soon; the Weitek 1167 is an example
of such a unit intended to be installed in 80386 systems.  It has the
same floating-point ALU and multiplier as the Sun-4/260.  Don't assume
that consequently 80386-based PC's will have floating-point throughput
comparable to the Sun-4, however; there is a small matter of memory
bandwidth to contend with, which becomes more and more noticeable as the
floating-point hardware becomes faster, until at the top end it is the
main issue and the floating-point hardware is a minor distraction.

So the issues of loop unrolling and instruction scheduling (and
identifying unaliased operands) will soon
become important to everybody interested in getting the most out of
even simple scalar-oriented systems.
-- 

David Hough

dgh@validgh.com		uunet!validgh!dgh	na.hough@na-net.stanford.edu

tmb@bambleweenie57.ai.mit.edu (Thomas M. Breuel) (12/08/90)

In article <221@validgh.com> dgh@validgh.com (David G. Hough on validgh) quotes
the following facts:

   --> some benchmarcks showing that C versions of linpack run somewhat
       slower on Sun4's that FORTRAN versions]

   --> noalias may be non-negotiable, but it may be non-negligible, as I found 
       out somewhat to my surprise this week.

   --> unrolling of loops may not be as profitable in C as it is in FORTRAN:

   daxpy(n,da,dx,dy)
   double *dx,*dy,da;
   {
	   int i;
	   for(i=0;i<n;i++) dy[i]=dy[i]+da*dx[i];
   }

   dx[i+1] cannot be loaded before dy[i] is stored because dy might
   be aliased to dx+1

   --> X3J11 seems to be reluctant to include a noalias declaration
       in C

   --> if something isn't done soon, vendors will go their own way

These are important issues to be addressed by the community of
numerical C programmers who want to get out the last bit of
efficiency.

Embracing the FORTRAN notion of "noalias" is, however, probably not
the right thing to do. Many C programmers, including myself, feel
that insuring absence of aliasing is a great burden in a language
that encourages the construction of highly linked data structures.

Traditionally, some of the effects of a "noalias" construct in C have
been gotten by the introduction of explicit temporaries. This is a
tradeoff between the inconvenience of having to do a certain amount
of CSE by hand and having to worry about avoiding aliasing.

Hand optimizations of this form are completely portable, and they
are only needed in critical sections of the code. Traditionally,
C programmers have never even expected the compiler to perform
any kind of CSE.

With more modern CPU's, certain kinds of optimizations cannot
be done by hand anymore. Loop unrolling is one example. The
noalias assumption certainly solves a lot of these problems with
one simple concept. However, I feel that it is the wrong
abstraction.

For example, in the loop given above, you don't really care about the
fact that dx and dy are not aliased. In fact, if they pointed at the same
array, the code would make perfect sense. What you really want to state
to the optimizer is the fact that there are no sequential dependencies
in this loop.

A clean way of doing this is by providing a separate looping construct
with special semantics. For example, you could add a "parallel"
construct, which carries with it the implicit assumption that there
are no sequential dependencies among the loop body, or a "vectorloop"
construct, which carries with it the implicit assumption that all
sequential dependencies are lexically apparent to the compiler (a kind
of "noalias" assumption for loop bodies).

Another way of avoiding the problem of aliasing in C completely is
to make sure that all assignments are to locations such that the
compiler can determine statically that they are not aliased. This
is actually a common programming practice. For example, in the above
loop, you could have written:

   #include <malloc.h>
   double *daxpy(n,da,dx,dy)
   double *dx,*dy,da;
   {
	   int i;
           double *result=malloc(n*sizeof(double));
	   for(i=0;i<n;i++) result[i]=dy[i]+da*dx[i];
	   return result;
   }

If you think about it, this is somewhat dual to the fact that
in FORTRAN you may have to copy one of the argument arrays
in order to make sure that the aliasing restriction is satisfied.
You may quibble that that occurs less frequently--but I think
that depends on your programming style.

Writing code like the above does actually not necessarily carry much
overhead compared to the code written under the "noalias" assumption,
and while in C, you have to worry about freeing "result" explicitly
later, in C++ all that can be taken care of automatically as well.

In summary, I think providing a specialized looping construct in C and
C++ that declares absent or limited sequential dependencies is
preferable to introducing a "noalias" construct. With it, and with
techniques that C programmers have been accustomed to for a long time,
it should be possible to write numerical code that compiles as efficiently
as FORTRAN, probably even on vector architectures. The language
that you get will have a very different flavor from FORTRAN for numerical
programming, but that's OK.

Adding a general notion of "noalias" to a language with lots of
pointers and nested data structures would probably lead to chaos, and
I anticipate that once FORTRAN 90 programmers start using dynamic
allocation and pointers heavily, they will discover that it can
be difficult to ensure absence of aliasing.

							Thomas.

PS: I actually am not convinced that in the loop example that dgh
gave, aliasing is actually even responsible for inhibiting the loop
unrolling or the optimization of the unrolled loop on a Sun4.
Certainly, the arguments he gave in his original posting weren't
sufficient to support that assertion. But there are architectures
where assuming absence of sequential dependencies is, of course, a big
win.

henry@zoo.toronto.edu (Henry Spencer) (12/09/90)

In article <TMB.90Dec7234544@bambleweenie57.ai.mit.edu> tmb@bambleweenie57.ai.mit.edu (Thomas M. Breuel) writes:
>   --> X3J11 seems to be reluctant to include a noalias declaration
>       in C

Not really right.  X3J11 believed that there was a fairly strong need for it.
The trouble was the near-total lack of any *experience* with such things in
C.  There is no substitute for actual experience with proposed language
features:  "expert intuition" is wrong more often than it's right.

X3J11 did indeed attempt to put a `noalias' feature into C, but persistent
technical problems (people knew what was wanted, or thought they did, but
nobody seemed to be able to come up with words that said it satisfactorily)
and some singularly bad timing (they waited until too late) provoked a
firestorm of protest and it was removed.

(Incidentally, note the past tense.  ANSI C is a standard now, and has been
for some time.  X3J11 is no longer actively considering language changes,
and won't be until the standard comes up for review several years hence.)

My own opinion is that we don't really know how to make this work in C yet.
We need actual working experience with some tentative solutions.  Standards
committees -- the ones that do their job right -- are supposed to standardize
proven existing practice, not invent their own untried solutions.  Contrary
to the popular misconceptions in certain quarters, the right thing for a
standards committee to do when faced with a demand for a standard in some
poorly-understood area with no existing practice is to *refuse*.  Bad
standards (actively bad ones, I mean, not just suboptimal ones) really are
worse than no standards, and committees have no way to magically produce
good standards when nobody understands the subject area well enough.
-- 
"The average pointer, statistically,    |Henry Spencer at U of Toronto Zoology
points somewhere in X." -Hugh Redelmeier| henry@zoo.toronto.edu   utzoo!henry

brnstnd@kramden.acf.nyu.edu (Dan Bernstein) (12/10/90)

To skip to the big question search for ``throw''.

In article <TMB.90Dec7234544@bambleweenie57.ai.mit.edu> tmb@bambleweenie57.ai.mit.edu (Thomas M. Breuel) writes:
> A clean way of doing this is by providing a separate looping construct
> with special semantics. For example, you could add a "parallel"
> construct, which carries with it the implicit assumption that there
> are no sequential dependencies among the loop body, or a "vectorloop"
> construct, which carries with it the implicit assumption that all
> sequential dependencies are lexically apparent to the compiler (a kind
> of "noalias" assumption for loop bodies).

I've been trying that in Q but have hit some snags.

My first try was forall. forall is just like for, but first executes the
increment and loop test until it knows what all the values of all index
variables are, then runs the loop in parallel for all the values, or for
a random order of the values. The problem is that a vector machine may
produce pure garbage if you have unexpected aliasing. That result
wouldn't correspond to any executions of the loop in any order.

So I added the obvious restriction: the program must not depend on the
order of execution. But what does this mean? What if a program did
forall(i = 0;i < 100;i++) x[i] = y[i], then used some sort of checksum
to test whether x was valid, then repeated the computation with more
reliable techniques and produced a guaranteed answer? I didn't want to
allow this, but the program would produce correct results no matter
what. How can you disallow the above loop if x and y are aliased?

Then I tried vector. vector loops a specified set of linearly dependent
variables through a linear set of values. This makes some things easier
to define but still doesn't avoid the problem.

My latest attempt is to say that no execution or evaluation of any
statement in the loop can depend on the order of the indices. So in

  forall(i = 0;i < 100;i++) x[i] = y[i]

we can *almost* disallow y[j] being at the same location of x[k]. After
all, i could take on the value of k first, then y[j] would be
overwritten with y[k], and when i took on the value of j, y[k] would be
corrupted. But if y overlaps x exactly, or if y[j] equals y[k], then
this still doesn't change.

So I throw this problem to the net. What restrictions can you put on
forall() so that the optimizer can know x and y are unaliased in this
loop?

  forall(i = 0;i < 100;i++) x[i] = y[i];

I want forall to work just like for, but to work ``the same'' if i runs
through 0..99 in a different order. But there has to be a further rule
to imply that x and y are unaliased. Maybe something like ``no input
inside the loop can be from the same location as an output inside the
loop''? But I worry that this will prohibit something useful.

> Adding a general notion of "noalias" to a language with lots of
> pointers and nested data structures would probably lead to chaos,

I doubt it. If in Q you say

  assert(x notin 100 of y);
  assert(y notin 100 of x);
  for(i = 0;i < 100;i++) x[i] = y[i];

then even my primitive q2c knows that the loop can be vectorized (not
that it can do anything about it). Note that ``assert'' is a pure
assertion mechanism---it isn't defined to do anything in particular if
the expression isn't true. The expression *must* be true.

---Dan