[comp.lang.fortran] no noalias not negligible: more red herrings, conclusion unchanged

dgh%dgh@Sun.COM (David Hough) (05/27/88)

My previous posting on this subject was slightly in error.  It turns out
that a properly crafted C source will fool the intermediate optimizer
into thinking it's Fortran and loop unrolling will occur, at least
for SunOS 4.0.

However, the lack of noalias information is propagated to the instruction
scheduler which steadfastly refuses to schedule loads of x[i+1] before
stores of y[i].  Consequently the overall performance is identical between
the following versions of ROLLED source; the first results in unrolled
object code; the second and most other equivalent ones do not:

daxpy(n, da, dx, dy)
        double          dx[], dy[], da;
        int             n;
{
        int             i;

        i = 0;
        do {
                dy[i] = dy[i] + da * dx[i];
                i++;
        } while (i < n);
}

daxpy(n, da, dx, dy)
        double          dx[], dy[], da;
        int             n;
{
        int             i;

        i = 0;
        do {
                dy[i] = dy[i] + da * dx[i];
        } while (i++ < n);
}

Niels J|rgen Kruse suggested hand unrolling; the following is an
example of the idea (add more stuff if n is not even):

daxpy(n, da, dx, dy )
        double          dx[], dy[], da;
        int             n;
{
        int             i;
        double a,b;
 
        do {
                a = dy[i]   + da * dx[i];
                b = dy[i+1] + da * dx[i+1];
                dy[i] = a; 
		dy[i+1] = b;
                i++; 
		i++;
        } while (i < n);
}

But this improves performance only slightly (850->900) due
to another important effect: optimizers do a lot better work with
simple loops than with clever ones.   The particular Sun release
in question generates some unnecessary fmove instructions for this loop.  
Maybe that
will get fixed in the future, but it will always be a good idea to make
your programs as simple as possible if you are going to depend on
optimizers.

It's also worth noting that several commentators referred to this as a 
problem with vector machines, so that for instance PC programmers may have
concluded that it was irrelevant to their careers.  However the 
measurements taken above were on a Sun-4/260 which is not a vector machine,
but does allow the integer unit, the floating-point adder, and the
floating-point multiplier to operate in parallel in certain circumstances
which arise frequently when the instruction scheduler does its job well.

It's to be expected that floating-point units of comparable complexity
will be common in PC-class machines soon; the Weitek 1167 is an example
of such a unit intended to be installed in 80386 systems.  It has the
same floating-point ALU and multiplier as the Sun-4/260.  Don't assume
that consequently 80386-based PC's will have floating-point throughput
comparable to the Sun-4, however; there is a small matter of memory
bandwidth to contend with, which becomes more and more noticeable as the
floating-point hardware becomes faster, until at the top end it is the
main issue and the floating-point hardware is a minor distraction.

So the issues of loop unrolling and instruction scheduling (and
identifying unaliased operands) will soon
become important to everybody interested in getting the most out of
even simple scalar-oriented systems.

David Hough

dhough@sun.com   
na.hough@na-net.stanford.edu
{ucbvax,decvax,decwrl,seismo}!sun!dhough

chris@mimsy.UUCP (Chris Torek) (05/29/88)

In article <54713@sun.uucp> dgh%dgh@Sun.COM (David Hough) writes:
>... the following versions of ROLLED source; the first results in unrolled
>object code; the second and most other equivalent ones do not:

[Some text deleted.  First version:]
>        i = 0;
>        do {
>                dy[i] = dy[i] + da * dx[i];
>                i++;
>        } while (i < n);

[Second:]
>        i = 0;
>        do {
>                dy[i] = dy[i] + da * dx[i];
>        } while (i++ < n);

This illustrates a very important point.  These two loops are *not*
equivalent!  If n is 2, the first loop executes twice (as it should);
the second executes three times.  So remember:

  Make it right before you make it fast, and do try not to break it
  when you make it fast. :-)

(I still do not understand the insistence on avoiding the form

	do {
		dy[i] += da * dx[i];
	} while (++i < n);

which is much easier to read, and remains correct.)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris