dgh%dgh@Sun.COM (David Hough) (05/27/88)
My previous posting on this subject was slightly in error.  It turns
out that a properly crafted C source will fool the intermediate
optimizer into thinking it's Fortran, and loop unrolling will occur,
at least for SunOS 4.0.  However, the lack of noalias information is
propagated to the instruction scheduler, which steadfastly refuses to
schedule loads of x[i+1] before stores of y[i].  Consequently the
overall performance is identical between the following versions of
ROLLED source; the first results in unrolled object code; the second
and most other equivalent ones do not:

	daxpy(n, da, dx, dy)
	double dx[], dy[], da;
	int n;
	{
		int i;

		i = 0;
		do {
			dy[i] = dy[i] + da * dx[i];
			i++;
		} while (i < n);
	}

	daxpy(n, da, dx, dy)
	double dx[], dy[], da;
	int n;
	{
		int i;

		i = 0;
		do {
			dy[i] = dy[i] + da * dx[i];
		} while (i++ < n);
	}

Niels Jørgen Kruse suggested hand unrolling; the following is an
example of the idea (add more cleanup code if n is not even):

	daxpy(n, da, dx, dy)
	double dx[], dy[], da;
	int n;
	{
		int i;
		double a, b;

		i = 0;
		do {
			a = dy[i] + da * dx[i];
			b = dy[i+1] + da * dx[i+1];
			dy[i] = a;
			dy[i+1] = b;
			i++;
			i++;
		} while (i < n);
	}

But this improves performance only slightly (850->900) due to another
important effect: optimizers do much better work with simple loops
than with clever ones.  The particular Sun release in question
generates some unnecessary fmove instructions for this loop.  Maybe
that will get fixed in the future, but it will always be a good idea
to make your programs as simple as possible if you are going to
depend on optimizers.

It's also worth noting that several commentators referred to this as
a problem with vector machines, so that, for instance, PC programmers
may have concluded that it was irrelevant to their careers.
However, the measurements above were taken on a Sun-4/260, which is
not a vector machine, but does allow the integer unit, the
floating-point adder, and the floating-point multiplier to operate in
parallel in certain circumstances which arise frequently when the
instruction scheduler does its job well.  It's to be expected that
floating-point units of comparable complexity will be common in
PC-class machines soon; the Weitek 1167, intended to be installed in
80386 systems, is an example of such a unit.  It has the same
floating-point ALU and multiplier as the Sun-4/260.  Don't assume
that 80386-based PC's will consequently have floating-point
throughput comparable to the Sun-4's, however; there is a small
matter of memory bandwidth to contend with, which becomes more and
more noticeable as the floating-point hardware becomes faster, until
at the top end it is the main issue and the floating-point hardware a
minor distraction.  So the issues of loop unrolling and instruction
scheduling (and identifying unaliased operands) will soon become
important to everybody interested in getting the most out of even
simple scalar-oriented systems.

David Hough

dhough@sun.com
na.hough@na-net.stanford.edu
{ucbvax,decvax,decwrl,seismo}!sun!dhough
chris@mimsy.UUCP (Chris Torek) (05/29/88)
In article <54713@sun.uucp> dgh%dgh@Sun.COM (David Hough) writes:
>... the following versions of ROLLED source; the first results in unrolled
>object code; the second and most other equivalent ones do not:

[Some text deleted.  First version:]

>	i = 0;
>	do {
>		dy[i] = dy[i] + da * dx[i];
>		i++;
>	} while (i < n);

[Second:]

>	i = 0;
>	do {
>		dy[i] = dy[i] + da * dx[i];
>	} while (i++ < n);

This illustrates a very important point.  These two loops are *not*
equivalent!  If n is 2, the first loop executes twice (as it should);
the second executes three times.  So remember:  Make it right before
you make it fast, and do try not to break it when you make it
fast. :-)

(I still do not understand the insistence on avoiding the form

	do {
		dy[i] += da * dx[i];
	} while (++i < n);

which is much easier to read, and remains correct.)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu	Path: uunet!mimsy!chris