[comp.lang.c] no noalias not negligible - a difference between C and Fortran - long

dgh@dgh.UUCP (05/21/88)

noalias may be non-negotiable, but it may be non-negligible, as I found 
out somewhat to my surprise this week.
 
At various times I've needed a version of The Linpack Benchmark
written in C, usually because a Fortran compiler was not available;
The Linpack Benchmark is very useful for lots of surprising purposes,
like debugging caches and virtual memory systems.  Of course, The Linpack
Benchmark is by definition written entirely in Fortran, so for benchmarking
purposes a C translation is more or less useless.  This despite an
observation by one of my colleagues, viewing the C version, that
it appeared to still be written in Fortran.  But faithful preservation of
Fortran semantics, including memory access patterns, was one of the goals
of the translation.

My new manager wanted to tackle a small technical job to keep in shape,
and so I suggested this translation.  It was not quite as small a job
as we expected, but eventually she got it to produce numerical results
identical to the Fortran version's; she never could get comparable performance.

The results for double precision linpack on Sun-4 using SunOS 4.0 and
Fortran 1.1 were:

		Rolled Source		Unrolled Source

Fortran		1080 KFLOPS		875 KFLOPS
C		 850 KFLOPS		875 KFLOPS

Why was C slower than Fortran in the rolled case?

It turns out that almost all The Linpack Benchmark does is execute
a short subroutine that takes one of the following simplified Fortran forms:

      subroutine daxpy(n,da,dx,dy)
      doubleprecision dx(1),dy(1),da
      integer i,n
      do 30 i = 1,n,4
        dy(i)   = dy(i)   + da*dx(i)
        dy(i+1) = dy(i+1) + da*dx(i+1)
        dy(i+2) = dy(i+2) + da*dx(i+2)
        dy(i+3) = dy(i+3) + da*dx(i+3)
 30   continue
      return
      end

OR

      subroutine daxpy(n,da,dx,dy)
      doubleprecision dx(1),dy(1),da
      integer i,n
      do 30 i = 1,n
        dy(i) = dy(i) + da*dx(i)
 30   continue
      return
      end

The first of these is the standard UNROLLED form of the program;
the second is a questionably legal modification called the ROLLED form.  The
original Fortran was written with unrolled source because that generated
more efficient code ten years ago when most compilers didn't unroll
loops for you automatically.  Nowadays many Fortran compilers,
including Sun's, can get better
performance by unrolling the loops themselves than by attempting to
figure out unrolled source code.

Most of the benefit of loop unrolling in high-performance systems derives
from the possibilities it opens up for instruction scheduling across
independent iterations.  The current multiplication can overlap the
previous addition, or current computation can overlap previous stores
and subsequent loads; what is worthwhile varies among implementations.

The corresponding rolled C code could be written with a for loop

daxpy(n, da, dx, dy )
        double          dx[], dy[], da;
        int             n;
{
        int             i;

        for (i = 0; i < n; i++) {
                dy[i] = dy[i] + da * dx[i];
        }
}
 
but the Sun compilers stumble here on an implementation detail: for loops
are compiled into two basic blocks, and the unroller won't unroll loops
with more than one basic block.  So to bypass that issue, rewrite a little
bit more in the form

daxpy(n, da, dx, dy )
        double          dx[], dy[], da;
        int             n;
{
        int             i;

        i = 0;
        do {
                dy[i] = dy[i] + da * dx[i];
        } while (++i < n);
}

Even in this form the optimizer won't unroll it,
and consequently the performance falls short of Fortran's.  If the source
form is unrolled by hand, however, the Fortran optimizer can't do as much
either, and C performance is the same.

Investigation revealed that the reason had to do with noalias:  all Fortran
pointer variables (which happen to be exactly the set of procedure parameters)
are defined by the Fortran standard to be "noalias", meaning a compiler
may optimize code based on the assumption that the pointers never reference
the same memory.  Alleged Fortran programs which break under such optimization
are declared by the Fortran standard to be non-standard.  Very neat.

C, in contrast, has other kinds of pointer variables than procedure
parameters, and many people believe that a global decree of the Fortran
type would break a lot of existing C programs.  So the default is that
optimization
must assume that any two pointers may be pointing to the same thing unless
it can prove otherwise.  For a while X3J11 had a local "noalias" attribute
that you could attach to pointers, but later recanted in the face of
assertions like 1) nobody had done it before, which is probably true, 
2) nobody could agree on exactly what it meant, which appeared to be
true, and 3) optimizing compilers should be able to figure out if aliasing
exists, which is definitely false in a separate compilation environment
(unless you want the linker to recompile everything, in which case the
linker is the compiler, and you're back to no separate compilation).
Anyway, there is no
portable way in draft ANSI C to say "these pointers are guaranteed
to have no aliases".

Thus the first part of the C compiler does NOT tell the optimizer that any
pointers are guaranteed unaliased; the optimizer won't unroll anything if
there are potential aliasing problems: you don't dare load dx[i+1] before
you store dy[i] if there is any danger that they point to the same place.
The Fortran compiler need have no such qualms.
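The hazard is easy to demonstrate with a small sketch (the function
names here are mine, not from the benchmark): when dx and dy overlap, a
version that loads ahead of its stores -- exactly the reordering an
unroller wants to do -- computes a different answer.

```c
#include <assert.h>

/* Strictly sequential daxpy: each store completes before the next load. */
static void daxpy_seq(int n, double da, double *dx, double *dy)
{
        int i;
        for (i = 0; i < n; i++)
                dy[i] = dy[i] + da * dx[i];
}

/* Pairwise-scheduled daxpy: loads dx[i+1] and dy[i+1] before storing
 * dy[i], the kind of reordering that unrolling makes possible.
 * Correct only if dx and dy do not overlap; n is assumed even. */
static void daxpy_pair(int n, double da, double *dx, double *dy)
{
        int i;
        for (i = 0; i < n; i += 2) {
                double a = dy[i]     + da * dx[i];
                double b = dy[i + 1] + da * dx[i + 1];
                dy[i]     = a;
                dy[i + 1] = b;
        }
}
```

With dx and dy one element apart in the same buffer, the sequential
version sees each store before the next load and the scheduled version
does not, so their results diverge; on disjoint arrays they agree.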

What is to be done?  I submitted extensive commentary to X3J11 during the
last public review period about numerical issues, but didn't mention noalias
because it was such a hot potato and I didn't think it mattered much, not
having investigated the possibilities.  Even if noalias could be proved
to be unquestionably a good idea, I doubt X3J11 would want to change its draft
again, since such proofs seem so easy to overturn.  Perhaps what will happen
is that high-performance
C compilers will adopt the questionable CDC/Cray Fortran practice 
of providing "unsafe" optimization levels that, for instance,
assume all pointers are unaliased.


David Hough

dhough@sun.com   
na.hough@na-net.stanford.edu
{ucbvax,decvax,decwrl,seismo}!sun!dhough

bvs@light.uucp (Bakul Shah) (05/21/88)

In article <54080@sun.uucp> dgh%dgh@Sun.COM (David Hough) writes:
>noalias may be non-negotiable, but it may be non-negligible, as I found 
>out somewhat to my surprise this week.
...
>daxpy(n, da, dx, dy )
>        double          dx[], dy[], da;
>        int             n;
>{
>        int             i;
>
>        for (i = 0; i < n; i++) {
>                dy[i] = dy[i] + da * dx[i];
>        }
>}
...
>Anyway there is no
>portable way in draft ANSI C to say "these pointers are guaranteed
>to have no aliases".

How about adding a test before the for loop?  Something like:

#define overlap(x,y,n)    (!((x) + (n) <= (y) || (y) + (n) <= (x)))

	if (overlap(dx, dy, n))
		return complain("overlapping arrays\n");

Now a smart compiler can figure out that dx, dy don't overlap at the
for-loop point and schedule load/stores or vectorize or whatever.  Also,
note that in a sense this is an explicit translation of fortran's
implicit noalias rule -- something you'd probably want in a similar
routine translated from fortran.
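For concreteness, here is the overlap test written out as a function (a
sketch; the const qualifiers are mine).  One caveat worth noting: ANSI C
only defines relational comparison between pointers into the same array,
so the check itself is strictly portable only in the very case it guards
against.

```c
#include <assert.h>

/* Do the n-element arrays starting at x and y share any storage? */
static int overlap(const double *x, const double *y, int n)
{
        return !(x + n <= y || y + n <= x);
}
```

A compiler that can see this check on the path to the loop knows dx and
dy are disjoint inside it, which is the explicit analogue of Fortran's
implicit rule.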

On a related topic:

Noalias had too many problems but I think a proper extension can be made
to C so that restrictions like the one above can be expressed at the
__function declaration point__.  Something like:

extern int
	daxpy(int n, double da, double dx[], double dy[])
		/* call */if (!overlap(dx, dy, n));

Here daxpy is called only if the if-expression is true; else an error is
generated.  Within the function body the condition can be assumed to be
true and that fact can be used for whatever purpose.  The if-expression
can be any expression involving constants, globals and function
parameters.

You can achieve the same effect at the point of call by using a macro
but that doesn't buy you any optimization since the compiler won't know
about it.  By expressing the same information in the language you are
telling the compiler something useful.  It can generate checking code at
every call if necessary.

The syntax is unambiguous, only one or two extra rules need to be
learned, the implementation is straightforward, and the concept is much
more powerful than noalias.  What do people think?  Of course, right now
this is just an intellectual exercise.

----
Bakul Shah <..!{ucbvax,sun}!amdcad!light!bvs>

gwyn@brl-smoke.ARPA (Doug Gwyn ) (05/21/88)

In article <54080@sun.uucp> dgh%dgh@Sun.COM (David Hough) writes:
>Perhaps what will happen is that high-performance
>C compilers will adopt the questionable CDC/Cray Fortran practice 
>of providing "unsafe" optimization levels that, for instance,
>assume all pointers are unaliased.

David's article did an excellent job of illustrating why many X3J11
members thought something like "noalias" would have been useful.
He correctly points out that ANSI C as it now stands requires that
aliasing be handled "right", and that this interferes with some
forms of optimization.

I think one of two possible "solutions" will be implemented by
vendors who care about this issue (for example, on vector hardware).
One is to adopt essentially "noalias", i.e. a type qualifier that
makes an appropriate guarantee so the optimizer is free to do its
thing (it would be best if this were "__noalias" or some other non-
conflicting extension to ANSI C).  The other approach is to provide
a compiler switch to enable such optimization globally on a per-
source file basis.  Both approaches are known to be viable; the
first is more general but requires more work on the part of the
programmer, and the resulting code is not strictly conforming.
It is even possible to support both methods simultaneously.

ark@alice.UUCP (05/22/88)

In article <54080@sun.uucp>, dgh@dgh.UUCP writes:

> What is to be done?

Before writing off C for lacking noalias, try working on the
program a little harder.  Here is the original program:

	daxpy(n, da, dx, dy )
		double          dx[], dy[], da;
		int             n;
	{
		int             i;

		i = 0;
		do {
			dy[i] = dy[i] + da * dx[i];
		} while (++i < n);
	}

I don't really think there's much here that noalias would help:
the object code has to fetch dy[i] and dx[i], multiply dx[i] by da,
add, and store the result in dy[i], and aliasing is irrelevant to
each of these operations.

Four things could speed it up, though, on many C implementations:

	1. Make `i', `n', and `da' into register variables.

	2. Use pointers instead of subscripts.

	3. Use the += operator to avoid fetching and then storing dy[i].

	4. Replace i++<n by the faster, but equivalent, ++i<=n.

Doing all these things gives the following revised program:

	daxpy(n, da, dx, dy)
		register double *dx, *dy, da;
		int n;
	{
		register double *dylim = dy + n;

		do *dy += da * *dx++;
		while (++dy <= dylim);
	}

Why not try this and let us know the results?

henry@utzoo.uucp (Henry Spencer) (05/23/88)

Everyone agrees that the problem needs attention; unfortunately, there is
no well-proven solution to it for C, and thus it is inappropriate to try
to deal with it as part of standardization of the existing language.

> ...  Perhaps what will happen is that high-performance
> C compilers will adopt the questionable CDC/Cray Fortran practice 
> of providing "unsafe" optimization levels that, for instance,
> assume all pointers are unaliased.

High-performance compilers will have to have *SOME WAY* to find out about
aliasing or lack thereof.  The obvious way is to have some sort of magic
cookie in the source to selectively indicate when promises are made about
aliasing.  A compile-time option that covers a whole file isn't really a
good idea, considering how pervasive pointers are in C and how common
aliasing is.  The obvious type of cookie is #pragma.  (It is claimed,
probably correctly, that #pragma technically is not allowed to change
the semantics of the language, but that's not the way it will happen
in the real world.)
-- 
NASA is to spaceflight as            |  Henry Spencer @ U of Toronto Zoology
the Post Office is to mail.          | {ihnp4,decvax,uunet!mnetor}!utzoo!henry

tim@amdcad.AMD.COM (Tim Olson) (05/23/88)

In article <7879@alice.UUCP> ark@alice.UUCP writes:
| I don't really think there's much here that noalias would help:
| the object code has to fetch dy[i] and dx[i], multiply dx[i] by da,
| add, and store the result in dy[i], and aliasing is irrelevant to
| each of these operations.

But that forces each iteration of the loop to be performed sequentially.
I believe the original intent was to guarantee non-overlapping arrays to
allow vector machines to compute the loop iterations in parallel.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)

bright@Data-IO.COM (Walter Bright) (05/25/88)

In article <1988May21.030207.25063@light.uucp> bvs@light.UUCP (Bakul Shah) writes:
>#define overlap(x,y,n)    (!((x) + (n) <= (y) || (y) + (n) <= (x)))
>	if (overlap(dx, dy, n))
>		return complain("overlapping arrays\n");
>Now a smart compiler can figure out that dx, dy don't overlap at the
>for-loop point and schedule load/stores or vectorize or whatever.

Current production optimizers only do flow analysis to the point that they
can determine if variable v is/isn't always a specific value at a specific
point in the program. Extending this to analyse for an open-ended list
of ranges for v such as:
	(c1 < v <= c2) || (c3 <= v < c4) || ...
would wind up being v-e-r-y s-l-o-w, a memory hog, and prone to lots of
bugs.  Optimizers are already notorious for having strange bugs in them;
increasing the complexity of the flow analysis by a couple of orders of
magnitude is impractical.

The most pragmatic solution is to simply bag the aliasing problem. Optimizers
should do what they can assuming worst case aliasing. Functions where
aliasing assumptions kill performance should be written in another language
(dare I say Fortran), and then linked in.

C shouldn't be force-fit to solve all applications. The aliasing problem
is only reasonably solved by changing the semantics of the language, i.e.
creating a different language.

BTW, I do all my coding in C and ASM. When I have speed problems with a
function in C, I recode it in ASM, and then use every trick I know. The
C version of the function is left in (but #if'd out) for
documentation/debugging purposes. (Some of my ASM subroutines wind up
as a page of the most convoluted stuff you've ever seen, but it really
flies!)

njk@diku.dk (Niels Jørgen Kruse) (05/26/88)

In article <54080@sun.uucp>, dgh%dgh@Sun.COM (David Hough) writes:
>(...)
> it appeared to still be written in Fortran.  But faithful preservation of
> Fortran semantics, including memory access patterns, was one of the goals
> of the translation.

Since nobody else commented on this ...

If you didn't want to preserve memory access patterns so badly,
you could have done some hand scheduling on the unrolled loop :

(similarly simplified)

daxpy(n, da, dx, dy )
        double          dx[], dy[], da;
        int             n;
{
        int             i;
        double a,b,c,d;

        for (i = 0; i < n; i += 4) {    /* n a multiple of 4, as in the Fortran */
                /*
                 *  Compute 4 independent expressions into
                 *  registers a,b,c,d.
                 */
                a = dy[i]   + da * dx[i];
                b = dy[i+1] + da * dx[i+1];
                c = dy[i+2] + da * dx[i+2];
                d = dy[i+3] + da * dx[i+3];
                /*
                 *  Store results back.
                 */
                dy[i] = a, dy[i+1] = b, dy[i+2] = c, dy[i+3] = d;
        }
}

This alleviates the constraints imposed on scheduling by
potential aliasing. The first store can be scheduled as soon as
the last load has completed. Given that a sufficient number of
registers are available for scheduling loads into, enough time
should be left before computations terminate to catch up on the
stores. If the fortran compiler doesn't unroll more than 4 times,
i see no reason why this should be slower than the rolled fortran
version.

On the other hand, i see no reason why the unrolled fortran
version should be slower than the rolled version either,
so what do i know.

        Niels Jørgen Kruse

bill@proxftl.UUCP (T. William Wells) (06/02/88)

In article <7879@alice.UUCP>, ark@alice.UUCP writes:
>       daxpy(n, da, dx, dy)
>               register double *dx, *dy, da;
>               int n;
>       {
>               register double *dylim = dy + n;
>
>               do *dy += da * *dx++;
>               while (++dy <= dylim);
>       }

If you go this far, you should do this:

	daxpy(n, da, dx, dy)
		register double *dx, *dy, da;
		int n;
	{
		register double *dylim = dy + n;

		do *dy++ += da * *dx++;
		while (dy <= dylim);
	}

[If n is the number of elements in dy, shouldn't the test be a <, not a <=?]

The only difference is moving the ++ from the test condition to the
assignment. For some machines (680?0 ?, VAX ?), this might give slightly
better results because it may be possible to use an addressing mode to
accomplish the increment. Also, the register variables should be ordered dy
and dx instead of dx and dy, this might not make any difference here, but had
these register variables been the second and third, this would have made a
minor difference (on an 80?86, for example).
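The bracketed question above is right: with the test changed to `<`, the
pointer form processes exactly n elements and matches the subscripted
loop.  A sketch (function names mine):

```c
#include <assert.h>

/* Subscripted daxpy, for reference. */
static void daxpy_idx(int n, double da, double *dx, double *dy)
{
        int i;
        for (i = 0; i < n; i++)
                dy[i] += da * dx[i];
}

/* The pointer form with the test changed from <= to <; processes
 * exactly n elements (requires n >= 1, since the body runs once
 * before the first test). */
static void daxpy_ptr(int n, double da, double *dx, double *dy)
{
        double *dylim = dy + n;
        do
                *dy++ += da * *dx++;
        while (dy < dylim);
}
```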

ssd@sugar.UUCP (Scott Denham) (06/04/88)

In article <21727@amdcad.AMD.COM>, tim@amdcad.AMD.COM (Tim Olson) writes:
> In article <7879@alice.UUCP> ark@alice.UUCP writes:
> | I don't really think there's much here that noalias would help:
> | the object code has to fetch dy[i] and dx[i], multiply dx[i] by da,
> | add, and store the result in dy[i], and aliasing is irrelevant to
> | each of these operations.
> 
> But that forces each iteration of the loop to be performed sequentially.
> I believe the original intent was to guarantee non-overlapping arrays to
> allow vector machines to compute the loop iterations in parallel.
> 
> 	-- Tim Olson

 There is a real need to be able to assure the compiler that there is no
 recursion present in an expression when it is impossible for the
 compiler to decide this for itself from the information available in
 the code.  This is particularly true in subroutines - the designer can
 know things about the values of arguments that the compiler could never
 guess, and in the case of any doubt, the compiler must play it "safe".
 IBM made a serious misjudgement in their first attempt at a vector 
 Fortran compiler and included no such facility - and it took some very
 nasty coding to "fool" the compiler into vectorizing code that, when 
 written in a straightforward way, left open the possibility that the 
 loop would not give consistent results if pipelined or split into 
 parallel tasks.