mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (11/27/89)
Does anyone know offhand how the Cray autotasking splits loops between
processors?  Specifically, in the following loop:

      do 200 iseg = 1, 8
         do 100 i = istart(iseg), istop(iseg)
            a(i,j(iseg)) = b(i,j(iseg)) + two*c(i,j(iseg))
 100     continue
 200  continue

Does a 4-cpu system do iterations like this:

        cpu #   iseg
        -----   ----
          0     1,2
          1     3,4
          2     5,6
          3     7,8

or like this:

        cpu #   iseg
        -----   ----
          0     1,5
          1     2,6
          2     3,7
          3     4,8

I plan to sort the segments by length to help with load balancing, and
need to know which segments go to which cpu so that I can decide exactly
how to sort them.
--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu
                   mccalpin@scri1.scri.fsu.edu
                   mccalpin@delocn.udel.edu
koeninger@apple.com (R. Kent Koeninger) (11/28/89)
In article <MCCALPIN.89Nov27080806@masig3.ocean.fsu.edu>
mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
> Does anyone know offhand how the Cray autotasking splits loops between
> processors?

I have not looked at autotasking lately, but I assume it works much like
microtasking did.  The allocation of tasks to processors is dynamic and
therefore cannot be statically determined.  You must assume they will be
allocated in any order to any number of available processors.  (Welcome
to the wonderful non-deterministic world of parallel processing.)

The most efficient way to autotask is to have many tasks of vectorized
loops.  The efficiency of vectorization increases with the vector length.
The efficiency of load balancing the tasks increases with the number of
tasks.  Lengthening the vectors decreases the number of tasks.
Optimizing vectorization is usually more important than optimizing
autotasking.

Breaking the problem into 8 tasks to match 8 CPUs will work well in
benchmark or dedicated situations, but will be counter-productive in a
mixed job environment.  You will probably get less than the full
complement of 8 CPUs, leaving straggling long vectors to process at the
end.  In general, you split the loop into tasks of vectors of length 64
to 128.  This provides fairly optimal vector lengths while still
maximizing the number of tasks.

If autotasking is up to snuff, you would present it with one long loop,
and it would vectorize and parallelize that one loop for you, choosing a
fairly optimal distribution.

Does anyone know if autotasking will split one loop into multiple tasks
of vectorized loops yet?  My assumption is that autotasking works like
microtasking.  Is this assumption valid?

Kent Koeninger
Cray Evangelist
Apple Computer  <koeninger@apple.com>
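[The strip-mining Kent describes amounts to something like the following
hand-written sketch.  The chunk size of 128 and the names n, nchunk, ilo,
and ihi are illustrative only, not anything the compiler actually emits.]

      nchunk = 128
c     each pass of the outer loop is one candidate task; the inner
c     loop is a full-length (64-128 element) vector operation
      do 20 ilo = 1, n, nchunk
         ihi = min(ilo+nchunk-1, n)
         do 10 i = ilo, ihi
            a(i) = b(i) + two*c(i)
   10    continue
   20 continue

[With n much larger than nchunk there are many more chunks than CPUs, so
late-arriving processors in a busy machine still find work to do.]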
bernhold@qtp.ufl.edu (David E. Bernholdt) (11/28/89)
In article <5414@internal.Apple.COM> koeninger@apple.com
(R. Kent Koeninger) writes:
>If autotasking is up to snuff, you would present it with one long loop,
>and it would vectorize and parallelize that one loop for you, choosing a
>fairly optimal distribution.
>
>Does anyone know if autotasking will split one loop into multiple tasks
>of vectorized loops yet?  My assumption is that autotasking works like
>microtasking.  Is this assumption valid?

I've messed around with autotasking a little.  It will do this.  In fact,
if you look at the source code after it has been through the compiler's
preprocessors, you'll see that tests are inserted on the length of the
vectors to decide how to distribute the work.  In other words, if the
loop isn't long enough to warrant it, it won't be parallelized; but if it
is, the preprocessor will "stripmine" the loop -- send pieces out to be
done in parallel on the available CPUs.

By giving command-line or (perhaps) source-embedded directives, you can
change the tolerances it uses to make these decisions.

I recommend the Cray Autotasking User's Guide (unfortunately I'm not at
the office, so I can't get the number for it) -- interesting reading.
--
David Bernholdt                   bernhold@qtp.ufl.edu
Quantum Theory Project            bernhold@ufpine.bitnet
University of Florida
Gainesville, FL 32611             904/392-6365
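[The preprocessor output David describes has roughly the shape below.
This is only a sketch: the threshold lmin, the chunk size nchunk, and the
serial fallback are made up for illustration, not actual fpp output.]

      if (n .ge. lmin) then
c        long enough to repay the tasking overhead: strip-mine into
c        chunks and hand the chunks out as parallel tasks
         do 20 ilo = 1, n, nchunk
            ihi = min(ilo+nchunk-1, n)
            do 10 i = ilo, ihi
               a(i) = b(i) + two*c(i)
   10       continue
   20    continue
      else
c        too short to be worth parallelizing: run it serially
c        (it still vectorizes)
         do 30 i = 1, n
            a(i) = b(i) + two*c(i)
   30    continue
      endif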
malcolm@Apple.COM (Malcolm Slaney) (11/29/89)
In article <MCCALPIN.89Nov27080806@masig3.ocean.fsu.edu>
mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>Does anyone know offhand how the Cray autotasking splits loops between
>processors?  Specifically, in the following loop:
>      do 200 iseg = 1, 8
>         do 100 i = istart(iseg), istop(iseg)
>            a(i,j(iseg)) = b(i,j(iseg)) + two*c(i,j(iseg))
> 100     continue
> 200  continue

You should sort them so that the segments taking the longest time come
first in the loop.  For example, if you have one very large segment and
lots of little ones, you don't want to do all the little ones first and
then have all the other processors waiting for the last (big) one to
finish.

I think I remember looking at the generated assembler and seeing that
each processor just takes the next iteration in sequence.  I'm pretty
sure that Autotasking and Microtasking are identical in this regard.

Now, if I can just get my hands on a version of Standard C that
autotasks....

                                                        Malcolm
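[The sort Malcolm suggests can be done with an index permutation built
before the segmented loop.  A sketch only: lseg, idx, itmp, and the
simple exchange sort are illustrative scaffolding, not part of any
posting in this thread.]

      integer lseg(8), idx(8)
c
c     record each segment's length and an identity permutation
      do 10 iseg = 1, 8
         lseg(iseg) = istop(iseg) - istart(iseg) + 1
         idx(iseg)  = iseg
   10 continue
c
c     sort the permutation so the longest segments come first
      do 30 k = 1, 7
         do 20 l = k+1, 8
            if (lseg(idx(l)) .gt. lseg(idx(k))) then
               itmp   = idx(k)
               idx(k) = idx(l)
               idx(l) = itmp
            endif
   20    continue
   30 continue
c
c     run the original segmented loop in the permuted order
      do 50 k = 1, 8
         iseg = idx(k)
         do 40 i = istart(iseg), istop(iseg)
            a(i,j(iseg)) = b(i,j(iseg)) + two*c(i,j(iseg))
   40    continue
   50 continue

[Since the iterations are handed out dynamically in loop order, starting
with the longest segment keeps the biggest piece of work from being the
straggler at the end.]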
desj@idacrd.UUCP (David desJardins) (11/29/89)
From article <MCCALPIN.89Nov27080806@masig3.ocean.fsu.edu>, by
mccalpin@masig3.ocean.fsu.edu (John D. McCalpin):
> Does anyone know offhand how the Cray autotasking splits loops
> between processors?  I plan to sort the segments by length to help
> with load balancing, and need to know which segments go to which cpu
> so that I can decide exactly how to sort them.

The tasks are dynamically allocated, in the order that they occur in the
loop.  For load balancing, my strong intuition is that you will do pretty
well if you simply sort the tasks in descending order of work.  (This is
particularly true if you aren't sure how many processors you will have or
what else will be running.)

   -- David desJardins