smryan@garth.UUCP (Steven Ryan) (07/08/88)
>The loop can be ``partially vectorized'' if not all of >the statements are part of the computation of the >exit branch Given something like for i a[i] := ... exit if p[i] b[i] := ... The actual number of iterations is controlled by the predicate p[]. This means the actual number of elements stored into a[] is determined by a subsequent statement. I see no simple way to handle this. Two possible solutions would be (1) buffer the assignments to a[] and b[]; stores back to arrays are deferred until the actual iteration count is determined. (2) Float the computation of p[] to the top of loop and then split it into a separate scalar loop. Anything between p[] and the top of loop also has to be in the scalar loop, unless it will be bufferred as in (1). If I were actually working on this at the moment, I would like see enough typical cases where all the extra work is worth the effort.
weemba@garnet.berkeley.edu (Obnoxious Math Grad Student) (07/09/88)
In article <893@garth.UUCP>, smryan@garth (Steven Ryan) writes: > for i > a[i] := ... > exit if p[i] > b[i] := ... >The actual number of iterations is controlled by the predicate p[]. >This means the actual number of elements stored into a[] is determined >by a subsequent statement. I see no simple way to handle this. See my comments below. If the above code fragment is the inner loop of a bigger calculation, and each inner loop is independent of all others, then vectorization can be done straightforwardly. >If I were actually working on this at the moment, I would like see enough >typical cases I once played with the Mandelbrot set on a Cray-1. The relevant code frag- ment is (with complex arithmetic): for c in [set of pixels] { for(z=i=0; |z|<2 && i<MAX_ITER; i++) z=z*z+c; count[c]=i; } Version 1 was written with complex arithmetic. My main concern at this point was with figuring out to use the available graphics. Its speed was roughly .5 Mflops. Version 2 was written with real arithmetic. In particular, the loop con- dition cabs(z)<2.0 became x*x+y*y<4.0, and redundant intermediates were not recalculated. This one went by at about 5 Mflops. Version 3 was written in assembly, and vectorized by hand. A small frac- tion of the time one of the pixels currently in the vector register would finish its loop and get replaced. I'm essentially pretending that the Cray consists of 64 parallel processors. Thanks to the miracles of vec- tor chaining, this went by like a bat out of hell at some 50 Mflops. > where all the extra work is worth the effort. In this case, it was definitely worth it. Now back to your example and my promised comments. > for i > a[i] := ... > exit if p[i] > b[i] := ... What I did, then, was to vectorize the a[i] calculation. I had no b[i] calculation--this is true for any while loop. But this could be handled just as easily. I write one part of the code that does A-P, and another part that does B-A-P. Every pixel would do one round of A-P, and then it's just a B-A-P while loop. More fun happens if you replace the "exit" with "continue": then the pix- els start in the A-P batch and eventually enough migrate to B-A-P allow- ing both to loop for a while. Keeping the vector registers full and beat- ing off gridlock is tedious, but it is not overly difficult. Remember, this method works if you have a huge outer loop that guarantees that the code fragments A-P and B-A-P are vector calculations. It can, in principle, vectorize arbitrarily complex conditionals. ucbvax!garnet!weemba Matthew P Wiener/Brahms Gang/Berkeley CA 94720
smryan@garth.UUCP (Steven Ryan) (07/11/88)
>I once played with the Mandelbrot set on a Cray-1. The relevant code frag- >ment is (with complex arithmetic): > > for c in [set of pixels] > { for(z=i=0; |z|<2 && i<MAX_ITER; i++) > z=z*z+c; > count[c]=i; > } > >Version 3 was written in assembly, and vectorized by hand. I don't know what version 3 is but version 1 is recursive. I don't if the Cray can generate a geometric progression in one swoop. If it can, fine. Otherwise, I would have changed this to bit in_range[] := all true z[] := 0 i := 1; while some bit set in in_range and i<=max_iter for c in pixel set if in_range[c] z[c] := z[c]*z[c] + c in_range[c] := in_range[c] and abs z[c] <2 i+:=1 The inner loop is not recursive and it has single entry/single exit body. Such loops are vectorisable, in principle anyway. I'm not sure what this has to do with early exits. >Now back to your example and my promised comments. > >> for i >> a[i] := ... >> exit if p[i] >> b[i] := ... > >What I did, then, was to vectorize the a[i] calculation. Sorry, don't understand. >More fun happens if you replace the "exit" with "continue" Then it very easy. "continue if p[i]" becomes for i a[i] := ... if not p[i] b[i] := ... Again, a single entry/single exit body.
aglew@urbsdc.Urbana.Gould.COM (07/12/88)
..> weemba playing with Mandelbrot on a Cray-1: ..> .5 Mflops using complex arithmetic, ..> 5 Mflops using real ..> 50 Mflops using assembly This causes me to wonder what the state of the art is in vectorizing complex arithmetic. Of course, the transformations from complex to real, and the appropriate vector optimizations with short circuits, are all well known, but they do no good if they aren't fully implemented. How good are available compilers at vectorizing complex arithmetic? Do people short shrift complex because most "important" programs have already been translated to use reals? Is this a self-fulfilling prophecy?
smryan@garth.UUCP (Steven Ryan) (07/15/88)
> How good are available >compilers at vectorizing complex arithmetic? Do people short >shrift complex because most "important" programs have already >been translated to use reals? Is this a self-fulfilling prophecy? On the 205, complex is vectorised but into library calls. Vector complex operators require multiple scratch vectors and the code generator is only set up to define one scratch vector per operator. (Modular code? We don't need no steenking modular code.)
dik@cwi.nl (Dik T. Winter) (07/16/88)
In article <963@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes: > > How good are available > >compilers at vectorizing complex arithmetic? Do people short > >shrift complex because most "important" programs have already > >been translated to use reals? Is this a self-fulfilling prophecy? > > On the 205, complex is vectorised but into library calls. Vector complex > operators require multiple scratch vectors and the code generator is > only set up to define one scratch vector per operator. (Modular code? > We don't need no steenking modular code.) And we might want some better library routines. Complex on the 205 is much slower than it needs to be. -- dik t. winter, cwi, amsterdam, nederland INTERNET : dik@cwi.nl BITNET/EARN: dik@mcvax
cik@l.cc.purdue.edu (Herman Rubin) (07/18/88)
In article <7584@boring.cwi.nl>, dik@cwi.nl (Dik T. Winter) writes: > In article <963@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes: > > > How good are available > > >compilers at vectorizing complex arithmetic? Do people short > > >shrift complex because most "important" programs have already > > >been translated to use reals? Is this a self-fulfilling prophecy? > > > > On the 205, complex is vectorised but into library calls. Vector complex > > operators require multiple scratch vectors and the code generator is > > only set up to define one scratch vector per operator. (Modular code? > > We don't need no steenking modular code.) > > And we might want some better library routines. Complex on the 205 is > much slower than it needs to be. The present types of complex and also of multiple, but fixed, precision where the hardware does not already handle the precision (i.e, the double precision on machines such as the VAX etc. for which operations on double precision quantities are hardware operations do not qualify as multiple precision) could be better handled if we used an array of vectors. For example, suppose that the present procedure to deal with double precision comples numbers is to have a vector DC in which each item is four machine words, and the vector operations can do such things as compute both the upper and lower parts of a double precision operation on two words. Then we should have four vectors, say RU (the upper part of the real part), RL, CU, and CL. This way it would not be necessary to set up the scratch vectors and to start out by separating the DC vector into the four parts, which we are likely to have to do anyhow! Other than the changes in the languages needed to accomplish this, I see little problem except for input and output, and possibly the concatenation of vectors. -- Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907 Phone: (317)494-6054 hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
dik@cwi.nl (Dik T. Winter) (07/18/88)
In article <837@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes: > For example, suppose that the present procedure to deal with double precision > comples numbers is to have a vector DC in which each item is four machine > words, and the vector operations can do such things as compute both the > upper and lower parts of a double precision operation on two words. Then > we should have four vectors, say RU (the upper part of the real part), RL, > CU, and CL. This way it would not be necessary to set up the scratch vectors > and to start out by separating the DC vector into the four parts, which we > are likely to have to do anyhow! > > Other than the changes in the languages needed to accomplish this, I see little > problem except for input and output, and possibly the concatenation of vectors. A problem arises if we want to pass an element as a variable to a subroutine. (Things like this were possible in Algol 68 anyway.) -- dik t. winter, cwi, amsterdam, nederland INTERNET : dik@cwi.nl BITNET/EARN: dik@mcvax
wallach@convex.UUCP (07/20/88)
on the convex c series complex data types - complex *8 and complex *16 are vectorized directly without library calls of any type. this includes complex add,subtract, multiply and divide, as well as real and imag part. this is generalized to include complex data types executed under if statements or intrinsics that will accept complex data type.