mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (11/27/89)
Does anyone know offhand how the Cray autotasking splits loops between
processors?  Specifically, in the following loop:

      do 200 iseg = 1, 8
         do 100 i = istart(iseg), istop(iseg)
            a(i,j(iseg)) = b(i,j(iseg)) + two*c(i,j(iseg))
 100     continue
 200  continue

Does a 4-cpu system do iterations like this:

        cpu #   iseg
        -----   ----
          0     1,2
          1     3,4
          2     5,6
          3     7,8

or like this:

        cpu #   iseg
        -----   ----
          0     1,5
          1     2,6
          2     3,7
          3     4,8

I plan to sort the segments by length to help with load balancing, and
need to know which segments go to which cpu so that I can decide exactly
how to sort them.
--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu
                   mccalpin@scri1.scri.fsu.edu
                   mccalpin@delocn.udel.edu
koeninger@apple.com (R. Kent Koeninger) (11/28/89)
In article <MCCALPIN.89Nov27080806@masig3.ocean.fsu.edu>
mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
> Does anyone know offhand how the Cray autotasking splits loops between
> processors?

I have not looked at autotasking lately, but I assume it works much like
microtasking did.  The allocation of tasks to processors is dynamic and
therefore cannot be statically determined.  You must assume they will be
allocated in any order to any number of available processors.  (Welcome
to the wonderful non-deterministic world of parallel processing.)

The most efficient way to autotask is to have many tasks of vectorized
loops.  The efficiency of vectorization increases with the vector length.
The efficiency of load balancing the tasks increases with the number of
tasks.  Lengthening the vectors decreases the number of tasks.
Optimizing vectorization is usually more important than optimizing
autotasking.

Breaking the problem into 8 tasks to match 8 CPUs will work well in
benchmark or dedicated situations, but will be counter-productive in a
mixed job environment.  You will probably get less than the full
complement of 8 CPUs, leaving straggling long vectors to process at the
end.  In general, you split the loop into tasks of vectors of length 64
to 128.  This provides fairly optimal vector lengths while still
maximizing the number of tasks.

If autotasking is up to snuff, you would present it with one long loop,
and it would vectorize and parallelize that one loop for you, choosing a
fairly optimal distribution.

Does anyone know if autotasking will split one loop into multiple tasks
of vectorized loops yet?  My assumption is that autotasking works like
microtasking.  Is this assumption valid?

Kent Koeninger
Cray Evangelist
Apple Computer  <koeninger@apple.com>
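[The strip-mining Kent describes amounts to something like the following
hand-written sketch.  The chunk size of 128 and the names n, nchunk, ilo,
and ihi are illustrative only, not anything the compiler actually emits.]

      nchunk = 128
c     each pass of the outer loop is one candidate task; the inner
c     loop is a full-length (64-128 element) vector operation
      do 20 ilo = 1, n, nchunk
         ihi = min(ilo+nchunk-1, n)
         do 10 i = ilo, ihi
            a(i) = b(i) + two*c(i)
   10    continue
   20 continue

[With n much larger than nchunk there are many more chunks than CPUs, so
late-arriving processors in a busy machine still find work to do.]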
bernhold@qtp.ufl.edu (David E. Bernholdt) (11/28/89)
In article <5414@internal.Apple.COM> koeninger@apple.com
(R. Kent Koeninger) writes:
>If autotasking is up to snuff, you would present it with one long loop,
>and it would vectorize and parallelize that one loop for you, choosing a
>fairly optimal distribution.
>
>Does anyone know if autotasking will split one loop into multiple tasks
>of vectorized loops yet?  My assumption is that autotasking works like
>microtasking.  Is this assumption valid?

I've messed around with autotasking a little.  It will do this.  In fact,
if you look at the source code after it has been through the compiler's
preprocessors, you'll see that tests are inserted on the length of the
vectors to decide how to distribute the work.  In other words, if the
loop isn't long enough to warrant it, it won't be parallelized; but if it
is, the preprocessor will "stripmine" the loop -- send pieces out to be
done in parallel on the available CPUs.

By giving command-line or (perhaps) source-embedded directives, you can
change the tolerances it uses to make these decisions.

I recommend the Cray Autotasking User's Guide (unfortunately I'm not at
the office, so I can't get the number for it) -- interesting reading.
--
David Bernholdt                   bernhold@qtp.ufl.edu
Quantum Theory Project            bernhold@ufpine.bitnet
University of Florida
Gainesville, FL 32611             904/392-6365
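[The preprocessor output David describes has roughly the shape below.
This is only a sketch: the threshold lmin, the chunk size nchunk, and the
serial fallback are made up for illustration, not actual fpp output.]

      if (n .ge. lmin) then
c        long enough to repay the tasking overhead: strip-mine into
c        chunks and hand the chunks out as parallel tasks
         do 20 ilo = 1, n, nchunk
            ihi = min(ilo+nchunk-1, n)
            do 10 i = ilo, ihi
               a(i) = b(i) + two*c(i)
   10       continue
   20    continue
      else
c        too short to be worth parallelizing: run it serially
c        (it still vectorizes)
         do 30 i = 1, n
            a(i) = b(i) + two*c(i)
   30    continue
      endif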
malcolm@Apple.COM (Malcolm Slaney) (11/29/89)
In article <MCCALPIN.89Nov27080806@masig3.ocean.fsu.edu>
mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>Does anyone know offhand how the Cray autotasking splits loops between
>processors?  Specifically, in the following loop:
>      do 200 iseg = 1, 8
>         do 100 i = istart(iseg), istop(iseg)
>            a(i,j(iseg)) = b(i,j(iseg)) + two*c(i,j(iseg))
> 100     continue
> 200  continue

You should sort them so that the segments taking the longest time come
first in the loop.  For example, if you have one very large segment and
lots of little ones, you don't want to do all the little ones first and
then have all the other processors waiting for the last (big) one to
finish.

I think I remember looking at the generated assembler and seeing that
each processor just takes the next iteration in sequence.  I'm pretty
sure that Autotasking and Microtasking are identical in this regard.

Now, if I can just get my hands on a version of Standard C that
autotasks....

                                                        Malcolm
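[The sort Malcolm suggests can be done with an index permutation built
before the segmented loop.  A sketch only: lseg, idx, itmp, and the
simple exchange sort are illustrative scaffolding, not part of any
posting in this thread.]

      integer lseg(8), idx(8)
c
c     record each segment's length and an identity permutation
      do 10 iseg = 1, 8
         lseg(iseg) = istop(iseg) - istart(iseg) + 1
         idx(iseg)  = iseg
   10 continue
c
c     sort the permutation so the longest segments come first
      do 30 k = 1, 7
         do 20 l = k+1, 8
            if (lseg(idx(l)) .gt. lseg(idx(k))) then
               itmp   = idx(k)
               idx(k) = idx(l)
               idx(l) = itmp
            endif
   20    continue
   30 continue
c
c     run the original segmented loop in the permuted order
      do 50 k = 1, 8
         iseg = idx(k)
         do 40 i = istart(iseg), istop(iseg)
            a(i,j(iseg)) = b(i,j(iseg)) + two*c(i,j(iseg))
   40    continue
   50 continue

[Since the iterations are handed out dynamically in loop order, starting
with the longest segment keeps the biggest piece of work from being the
straggler at the end.]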
desj@idacrd.UUCP (David desJardins) (11/29/89)
From article <MCCALPIN.89Nov27080806@masig3.ocean.fsu.edu>, by
mccalpin@masig3.ocean.fsu.edu (John D. McCalpin):
> Does anyone know offhand how the Cray autotasking splits loops
> between processors?  I plan to sort the segments by length to help
> with load balancing, and need to know which segments go to which cpu
> so that I can decide exactly how to sort them.

The tasks are dynamically allocated, in the order that they occur in the
loop.  For load balancing, my strong intuition is that you will do pretty
well if you simply sort the tasks in descending order of work.  (This is
particularly true if you aren't sure how many processors you will have or
what else will be running.)

   -- David desJardins