[comp.arch] Vectorising conditional code.

smryan@garth.UUCP (Steven Ryan) (07/08/88)

>The loop can be ``partially vectorized'' if not all of
>the statements are part of the computation of the
>exit branch

Given something like

        for i
          a[i] := ...
          exit if p[i]
          b[i] := ...

The actual number of iterations is controlled by the predicate p[]. This means
the actual number of elements stored into a[] is determined by a subsequent
statement. I see no simple way to handle this.

Two possible solutions would be (1) buffer the assignments to a[] and b[];
stores back to arrays are deferred until the actual iteration count is
determined. (2) Float the computation of p[] to the top of loop and then
split it into a separate scalar loop. Anything between p[] and the top of
loop also has to be in the scalar loop, unless it will be bufferred as in (1).

If I were actually working on this at the moment, I would like see enough
typical cases where all the extra work is worth the effort.

weemba@garnet.berkeley.edu (Obnoxious Math Grad Student) (07/09/88)

In article <893@garth.UUCP>, smryan@garth (Steven Ryan) writes:

>        for i
>          a[i] := ...
>          exit if p[i]
>          b[i] := ...

>The actual number of iterations is controlled by the predicate p[].
>This means the actual number of elements stored into a[] is determined
>by a subsequent statement. I see no simple way to handle this.

See my comments below.  If the above code fragment is the inner loop of
a bigger calculation, and each inner loop is independent of all others,
then vectorization can be done straightforwardly.

>If I were actually working on this at the moment, I would like see enough
>typical cases

I once played with the Mandelbrot set on a Cray-1.  The relevant code frag-
ment is (with complex arithmetic):

	for c in [set of pixels]
	{	for(z=i=0; |z|<2 && i<MAX_ITER; i++)
			z=z*z+c;
		count[c]=i;
	}

Version 1 was written with complex arithmetic.  My main concern at this
point was with figuring out to use the available graphics.  Its speed
was roughly .5 Mflops.

Version 2 was written with real arithmetic.  In particular, the loop con-
dition cabs(z)<2.0 became x*x+y*y<4.0, and redundant intermediates were
not recalculated.  This one went by at about 5 Mflops.

Version 3 was written in assembly, and vectorized by hand.  A small frac-
tion of the time one of the pixels currently in the vector register would
finish its loop and get replaced.  I'm essentially pretending that the
Cray consists of 64 parallel processors.  Thanks to the miracles of vec-
tor chaining, this went by like a bat out of hell at some 50 Mflops.

>	       where all the extra work is worth the effort.

In this case, it was definitely worth it.

Now back to your example and my promised comments.

>        for i
>          a[i] := ...
>          exit if p[i]
>          b[i] := ...

What I did, then, was to vectorize the a[i] calculation.  I had no b[i]
calculation--this is true for any while loop.  But this could be handled
just as easily.  I write one part of the code that does A-P, and another
part that does B-A-P.  Every pixel would do one round of A-P, and then
it's just a B-A-P while loop.

More fun happens if you replace the "exit" with "continue": then the pix-
els start in the A-P batch and eventually enough migrate to B-A-P allow-
ing both to loop for a while.  Keeping the vector registers full and beat-
ing off gridlock is tedious, but it is not overly difficult.

Remember, this method works if you have a huge outer loop that guarantees
that the code fragments A-P and B-A-P are vector calculations.  It can,
in principle, vectorize arbitrarily complex conditionals.

ucbvax!garnet!weemba	Matthew P Wiener/Brahms Gang/Berkeley CA 94720

smryan@garth.UUCP (Steven Ryan) (07/11/88)

>I once played with the Mandelbrot set on a Cray-1.  The relevant code frag-
>ment is (with complex arithmetic):
>
>	for c in [set of pixels]
>	{	for(z=i=0; |z|<2 && i<MAX_ITER; i++)
>			z=z*z+c;
>		count[c]=i;
>	}
>
>Version 3 was written in assembly, and vectorized by hand.

I don't know what version 3 is but version 1 is recursive. I don't if the
Cray can generate a geometric progression in one swoop. If it can, fine.
Otherwise, I would have changed this to

          bit in_range[] := all true
	  z[] := 0
	  i := 1;
          while some bit set in in_range and i<=max_iter
            for c in pixel set
              if in_range[c]
                z[c] := z[c]*z[c] + c
              in_range[c] := in_range[c] and abs z[c] <2
            i+:=1

The inner loop is not recursive and it has single entry/single exit body.
Such loops are vectorisable, in principle anyway. I'm not sure what this
has to do with early exits.
 
>Now back to your example and my promised comments.
>
>>        for i
>>          a[i] := ...
>>          exit if p[i]
>>          b[i] := ...
>
>What I did, then, was to vectorize the a[i] calculation.

Sorry, don't understand.
 
>More fun happens if you replace the "exit" with "continue"

Then it very easy.  "continue if p[i]" becomes

         for i
           a[i] := ...
           if not p[i]
             b[i] := ...

Again, a single entry/single exit body.

aglew@urbsdc.Urbana.Gould.COM (07/12/88)

..> weemba playing with Mandelbrot on a Cray-1:
..> .5 Mflops using complex arithmetic,
..> 5 Mflops using real
..> 50 Mflops using assembly

This causes me to wonder what the state of the art is in
vectorizing complex arithmetic. Of course, the transformations
from complex to real, and the appropriate vector optimizations
with short circuits, are all well known, but they do no good
if they aren't fully implemented. How good are available
compilers at vectorizing complex arithmetic? Do people short
shrift complex because most "important" programs have already
been translated to use reals? Is this a self-fulfilling prophecy?

smryan@garth.UUCP (Steven Ryan) (07/15/88)

>                                  How good are available
>compilers at vectorizing complex arithmetic? Do people short
>shrift complex because most "important" programs have already
>been translated to use reals? Is this a self-fulfilling prophecy?

On the 205, complex is vectorised but into library calls. Vector complex
operators require multiple scratch vectors and the code generator is
only set up to define one scratch vector per operator. (Modular code?
We don't need no steenking modular code.)

dik@cwi.nl (Dik T. Winter) (07/16/88)

In article <963@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes:
 > >                                  How good are available
 > >compilers at vectorizing complex arithmetic? Do people short
 > >shrift complex because most "important" programs have already
 > >been translated to use reals? Is this a self-fulfilling prophecy?
 > 
 > On the 205, complex is vectorised but into library calls. Vector complex
 > operators require multiple scratch vectors and the code generator is
 > only set up to define one scratch vector per operator. (Modular code?
 > We don't need no steenking modular code.)

And we might want some better library routines.  Complex on the 205 is
much slower than it needs to be.
-- 
dik t. winter, cwi, amsterdam, nederland
INTERNET   : dik@cwi.nl
BITNET/EARN: dik@mcvax

cik@l.cc.purdue.edu (Herman Rubin) (07/18/88)

In article <7584@boring.cwi.nl>, dik@cwi.nl (Dik T. Winter) writes:
> In article <963@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes:
>  > >                                  How good are available
>  > >compilers at vectorizing complex arithmetic? Do people short
>  > >shrift complex because most "important" programs have already
>  > >been translated to use reals? Is this a self-fulfilling prophecy?
>  > 
>  > On the 205, complex is vectorised but into library calls. Vector complex
>  > operators require multiple scratch vectors and the code generator is
>  > only set up to define one scratch vector per operator. (Modular code?
>  > We don't need no steenking modular code.)
> 
> And we might want some better library routines.  Complex on the 205 is
> much slower than it needs to be.

The present types of complex and also of multiple, but fixed, precision where
the hardware does not already handle the precision (i.e, the double precision
on machines such as the VAX etc. for which operations on double precision
quantities are hardware operations do not qualify as multiple precision) could
be better handled if we used an array of vectors.

For example, suppose that the present procedure to deal with double precision
comples numbers is to have a vector DC in which each item is four machine
words, and the vector operations can do such things as compute both the
upper and lower parts of a double precision operation on two words.  Then
we should have four vectors, say RU (the upper part of the real part), RL,
CU, and CL.  This way it would not be necessary to set up the scratch vectors 
and to start out by separating the DC vector into the four parts, which we
are likely to have to do anyhow!

Other than the changes in the languages needed to accomplish this, I see little
problem except for input and output, and possibly the concatenation of vectors.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

dik@cwi.nl (Dik T. Winter) (07/18/88)

In article <837@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
 > For example, suppose that the present procedure to deal with double precision
 > comples numbers is to have a vector DC in which each item is four machine
 > words, and the vector operations can do such things as compute both the
 > upper and lower parts of a double precision operation on two words.  Then
 > we should have four vectors, say RU (the upper part of the real part), RL,
 > CU, and CL.  This way it would not be necessary to set up the scratch vectors 
 > and to start out by separating the DC vector into the four parts, which we
 > are likely to have to do anyhow!
 > 
 > Other than the changes in the languages needed to accomplish this, I see little
 > problem except for input and output, and possibly the concatenation of vectors.
A problem arises if we want to pass an element as a variable to a subroutine.
(Things like this were possible in Algol 68 anyway.)
-- 
dik t. winter, cwi, amsterdam, nederland
INTERNET   : dik@cwi.nl
BITNET/EARN: dik@mcvax

wallach@convex.UUCP (07/20/88)

on the convex c series complex data types - complex *8 and complex *16
are vectorized directly without library calls of any type.  this
includes complex add,subtract, multiply and divide, as well as real
and imag part.  this is generalized to include complex data types
executed under if statements or intrinsics that will accept complex
data type.