[comp.sys.alliant] inter-loop locking

xxrich@alliant1 (Rich Rinehart) (03/27/90)

I would like to force some MIMD like operation on the following 
loop but am having troubles figuring out the synchronization.
Can anyone be of help?  The problem is something like this....

cvd$ concur
	do 100 I=1,100
		call sub1
		call sub2
c do sub3 only after sub1 and 2 have completed
		call sub3
100	continue

Do I need to use the lock system call or is there another way?  Anyone 
have any examples?  It's been a while since the optimization class...

--
-----------------------------------------------------------------------------
Rich Rinehart                  |     phone: 216-433-5211
NASA Lewis Research Center     |     email: xxrich@alliant1.lerc.nasa.gov
-----------------------------------------------------------------------------

muller@Alliant.COM (Jim Muller) (03/28/90)

In article <1990Mar27.132340.4243@eagle.lerc.nasa.gov>
    xxrich@alliant1.lerc.nasa.gov (Rich Rinehart) writes:

>I would like to force some MIMD like operation on the following 
>loop but am having troubles figuring out the synchronization.
>Can anyone be of help?  The problem is something like this....

>  cvd$ concur
>          do 100 I=1,100
>             call sub1
>             call sub2
>  c do sub3 only after sub1 and 2 have completed
>             call sub3
>  100     continue

>Do I need to use the lock system call or is there another way?  Anyone 
>have any examples?  It's been a while since the optimization class...

(I'll try to answer this, but I'm not sure you meant it exactly as written.)

As written, every iteration in the concurrent 'do 100' loop will be assigned
its own processor, and each will execute its own copy of sub1 then sub2 then
sub3, in order.  So as it stands, it will do exactly what you asked for, on a
per-processor basis until 100 copies have been run, without using any lock or
unlock calls.  The only requirement is that all subroutines called concurrently
(and all that they also call, etc.) must be compiled 'recursive', via either
the '-recursive' switch from the fortran command or with the subroutine header
'recursive subroutine sub1'.  By the way, your directive 'cvd$ concur' should
be 'cvd$ cncall' to tell the compiler that concurrent calls are okay in this
loop.  Concurrency must be invoked from the fortran command; by itself, the
directive 'cvd$ concur' simply turns concurrency on if it was invoked from the
fortran command but turned off earlier in the file with a similar directive.
If the fortran command specifies concurrency optimization, the default is for
all loops to be analyzed for concurrency unless a directive turns it off.
[Note:  Be careful about what the subroutines do in common blocks, since
common blocks are not replicated by 'recursive'.  You can use the -interface
mechanism to let the compiler determine if those subroutines are "cncall-able"
but it will not examine common blocks either.  Anyway, you might get better
performance by inlining those subroutines from the fortran command with
-inline sub1 -inline sub2 -inline sub3.]
----------
However, I suspect perhaps you meant something different.  One possibility is
to force *simultaneous execution* of sub1 and sub2, *then* do sub3:

	call sub1   !   do sub1 on one processor
	call sub2   !   do sub2 on a different processor
   Sub1 and sub2 will be running simultaneously on different processors.
   When both sub1 and sub2 are done, then do:
	call sub3   !   do sub3 only after sub1 and 2 have completed

In this case, one way to do it would be:

cvd$ cncall
	do 100 I=1,2
		if (I.eq.1) call sub1
		if (I.eq.2) call sub2
100	continue
	call sub3

The 'do 100' loop is a trick to create a concurrent loop with each iteration
doing something different.  Again, no locks are required, but both sub1 and
sub2 (and any child-subroutines) must be compiled 'recursive'.  (The test
for iteration number 'if (I.eq.n)' can be kept in this subroutine or moved
into the called subroutines by passing the iteration number I as a parameter.
If it is moved into the called subroutines, sub1 and sub2 could be merged.)
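
A minimal sketch of that merged form, for illustration only.  The name
'sub12', its dummy argument 'iter', and the placeholder bodies are
hypothetical and do not come from the original code; the directive and the
'recursive' keyword are used as described above.

```fortran
c  Sketch only: 'sub12' and its placeholder bodies are hypothetical.
      program merged
      integer i
c  Tell the compiler that concurrent calls are okay in this loop.
cvd$ cncall
      do 100 i = 1, 2
         call sub12(i)
100   continue
c  Reached only after both iterations of the loop have finished.
      call sub3
      end

      recursive subroutine sub12(iter)
      integer iter
c  The iteration number selects which piece of work this copy does.
      if (iter .eq. 1) then
c        work formerly done by sub1 goes here
         continue
      else if (iter .eq. 2) then
c        work formerly done by sub2 goes here
         continue
      end if
      return
      end

      subroutine sub3
c  Placeholder for the work that must follow sub1 and sub2.
      return
      end
```

On a compiler that does not recognize the directive, the 'cvd$' line is an
ordinary comment and the loop simply runs serially, so the ordering of sub3
after both pieces of work still holds.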
----------
Another possibility is that multiple copies of sub1 and sub2 are needed, and
after all of them are done, call sub3.  This is like your original code, but
with 'call sub3' outside of the loop:

cvd$ cncall
	do 100 I=1,100
		call sub1
		call sub2
100	continue
	call sub3

Again, no locks are needed, but 'recursive' is needed for sub1 and sub2.
----------
The only reason you might ever need lock/unlock is if you have a concurrent
loop in which a portion of the code (which may contain a subroutine call) must
be *forced* into one-copy-at-a-time execution.  A typical situation would
be if different iterations write to the same memory location or common block,
or if a cncall'ed subroutine might possibly do concurrent writes (which causes
a run-time error if it happens).  In essence, you want each processor, as it
enters that critical code, to lock out all the others if it "gets there first"
or stall until the code is free if another processor has already locked it.

The usual syntax to do this is to declare a variable to hold the lock state
(typically as an integer though it will be used as a Boolean) and initialize
it to 0.  Then the statement 'call lock(l)' (where l is that variable) will do
exactly what is wanted, locking out other processors until that lock is opened
with 'call unlock(l)', or stalling this processor until some other processor
opens it.  When a lock is set, other processors are not affected until each
encounters that lock too, at which point it will stall.  If several processors
are waiting for the same lock, only one will be allowed access when it opens,
and of course, its first action will be to lock it up again.  In this way the
processors are allowed one-at-a-time entry to that code.  The order of access
between several stalled processors is not predictable (at least by a normal
user).  The state of the lock can be read, e.g. if(l.ne.0), but it must *not*
be changed except with the lock/unlock calls, i.e. you must not do l = 0 to
unlock the lock l.  It is not necessary that the same processor that sets a
lock be the one to unlock it, nor that it be locked when an unlock call is
executed.  However, if a lock is set but never unlocked and the other processors
encounter it, they will stall until the job is finished.  Likewise, if a
lock is set and no other processors unlock it, and the processor that set it
encounters it *again* before unlocking it, it too will stall, so the job may
never finish.

Here is an example that uses lock to prevent concurrent writes from within a
function that would otherwise run concurrently:

	program lockdemo
	real x(100)
	integer i, l

c  Initialize the lock l.
	l = 0

cvd$ cncall sub1
	do i = 1,100
	   x(i) = sub1(i,l)
	enddo

	write(6,*) 'Done.  x(100) =', x(100)
	stop
	end


	recursive function sub1(i,l)
	integer i, l
	real sub1

	sub1 = .002 * float(i) - .003 * float(i*i)
c  Do more work here.

c  Set lock.  Other processors cannot pass here until lock is opened.
	call lock(l)
	write(20,*) 'In sub1, completed:',i
	call unlock(l)

	return
	end

Jim Muller

xxrich@alliant1 (Rich Rinehart) (03/29/90)

>>I would like to force some MIMD like operation on the following 
>>loop but am having troubles figuring out the synchronization.
>>Can anyone be of help?  The problem is something like this....
>
>>  cvd$ concur
>>          do 100 I=1,100
>>             call sub1
>>             call sub2
>>  c do sub3 only after sub1 and 2 have completed
>>             call sub3
>>  100     continue
>
>
>As written, every iteration in the concurrent 'do 100' loop will be assigned
>its own processor, and each will execute its own copy of sub1 then sub2 then
>sub3, in order.  So as it stands, it will do exactly what you asked for, on a

This isn't what I wanted, but I didn't realize that.  Would it complete 100
iterations of sub1, then go on to sub2?

>However, I suspect perhaps you meant something different.  One possibility is
>to force *simultaneous execution* of sub1 and sub2, *then* do sub3:

Yes, this is what I wanted.

>cvd$ cncall
>	do 100 I=1,2
>		if (I.eq.1) call sub1
>		if (I.eq.2) call sub2
>100	continue
>	call sub3
>
>The 'do 100' loop is a trick to create a concurrent loop with each iteration
>doing something different.  Again, no locks are required, but both sub1 and
>sub2 (and any child-subroutines) must be compiled 'recursive'.  (The test

Neat trick.
How can I guarantee that sub1 and sub2 will act in a MIMD fashion?  What
if sub1 calls other routines in which the compiler finds what it thinks
is a 'better' loop to parallelize?

Why must it be compiled recursively for this?  (Or is this the answer?)

Thank you for your locking explanation, as I plan to use it for another
analysis.

Thanks again for the help!
-rich
