[comp.arch] Task synchronization in multiprocessors?

fouts@orville.nas.nasa.gov (Marty Fouts) (03/03/88)

I've just finished Stone's recent book (High Performance Computer
Architecture), which has left me wanting to know more about
architectural support for task synchronization in multiprocessors.

I'm particularly interested in production MIMD shared-memory systems
and in alternatives to semaphore-based critical regions and barrier
synchronization.  A large part of the problem with using a semaphore
for synchronization is that it introduces a hot spot in the memory
system and forces some potentially parallel activities to be serial.
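
To make the hot spot concrete, here is a minimal sketch of the naive
case, written in modern C notation (C11 atomics, standing in for
whatever test-and-set or swap primitive a given machine provides).
Every waiting processor re-executes an atomic read-modify-write on the
same word, so the waiters themselves saturate the path to the memory
module holding that word:

#include <stdatomic.h>

/* A binary semaphore as one shared word: 1 = free, 0 = held. */
atomic_int sem = 1;

void P_wait(void)
{
    /* Each iteration is an atomic read-modify-write on the single
     * shared word 'sem'.  With many waiting processors this one
     * location becomes a hot spot in the memory system, no matter
     * how short the critical region it guards. */
    while (atomic_exchange(&sem, 0) == 0)
        ;  /* spin */
}

void V_signal(void)
{
    atomic_store(&sem, 1);
}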

Stone mentions combining networks, "compare and swap", and "fetch and
add" as alternatives to a simple semaphore.  He doesn't discuss
software organizations for synchronization, such as monitors and
blocking sends, because his subject is large-grain numeric processing.
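
For illustration, a sketch of the fetch-and-add style (again in C11
notation, with atomic_fetch_add standing in for the hardware
primitive, and the variable names mine; the NYU Ultracomputer is the
classic combining-network design).  Processors claim loop iterations
from one shared counter; with a combining network, simultaneous adds
to that counter are merged in the switches instead of serializing at
the memory module:

#include <stdatomic.h>

#define N 1000000

atomic_int next_index = 0;        /* shared iteration counter */
double a[N], b[N];

/* Each processor runs worker().  Fetch-and-add hands out disjoint
 * iterations with no critical region at all; the counter is the
 * only shared synchronization state. */
void worker(void)
{
    for (;;) {
        int i = atomic_fetch_add(&next_index, 1);
        if (i >= N)
            break;
        a[i] = 2.0 * b[i];        /* the parallel work */
    }
}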

Does anyone care to comment on current implementations of these mechanisms,
especially comparing them?  Also, is anyone aware of any other
mechanisms in use in commercially available MIMD machines?

Marty

brooks@lll-crg.llnl.gov (Eugene D. Brooks III) (03/04/88)

In article <317@amelia.nas.nasa.gov> fouts@orville.nas.nasa.gov (Marty Fouts) writes:
>I've just finished Stone's recent book (High Performance Computer
>Architecture), which has left me wanting to know more about
>architectural support for task synchronization in multiprocessors.
>
>I'm particularly interested in production MIMD shared-memory systems
>and in alternatives to semaphore-based critical regions and barrier
>synchronization.  A large part of the problem with using a semaphore
>for synchronization is that it introduces a hot spot in the memory
>system and forces some potentially parallel activities to be serial.
For a bottleneck-free barrier synchronization algorithm, which is heavily
used on several existing multiprocessors, see "The Butterfly Barrier",
International Journal of Parallel Programming, Vol. 15, No. 4, Aug. 1986,
pp. 295-307.  This software algorithm, which is bottleneck-free and runs
in logarithmic time, is critically compared to direct hardware barrier
support (in the context of Gaussian elimination algorithms) using the
Cerberus multiprocessor simulator in the proceedings of the 1987 SIAM
conference on parallel processing for scientific computing, which should
be out any month now....  See the article titled "Gaussian techniques
on shared memory multiprocessors"; the Cerberus multiprocessor simulator
itself is also described in those proceedings.
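
For readers without the paper at hand, the idea in brief (the sketch
below is mine, in C11 atomics, and the published algorithm treats
non-power-of-two processor counts and flag reuse more carefully): in
round k each processor pairs with the processor whose index differs in
bit k, signals that partner, and spins on a flag that only the partner
writes.  Every processor spins on a distinct location, so there is no
hot spot, and the whole barrier takes log2(P) rounds:

#include <stdatomic.h>

#define P    8                    /* processors; assume a power of two */
#define LOGP 3

/* flag[k][i] is written only by i's partner in round k and spun on
 * only by processor i -- one waiter per location, hence no hot spot.
 * sense[i] alternates per barrier episode so flags need no reset. */
atomic_int flag[LOGP][P];
int sense[P];

void butterfly_barrier(int me)
{
    int s = 1 - sense[me];            /* this episode's sense value */
    for (int k = 0; k < LOGP; k++) {
        int partner = me ^ (1 << k);  /* flip bit k of my index */
        atomic_store(&flag[k][partner], s);   /* signal partner */
        while (atomic_load(&flag[k][me]) != s)
            ;                         /* spin on my own flag only */
    }
    sense[me] = s;                    /* flip for the next episode */
}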

As for alternatives to simple "locks" (implemented with a swap-register-
with-memory instruction) and barriers (implemented either in hardware or
with a good software algorithm): you can go a surprisingly long way with
just these two, and the "livelock" problems incurred with combining
operations should not be ignored.
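
In modern C notation (C11 atomics; a standard refinement, not anything
from the barrier paper): a swap-based lock need not pound memory the
way a naive one does.  Have waiters spin on an ordinary load, which a
cache can satisfy locally, and reach for the swap only when the lock
looks free:

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool lock_word = false;        /* false = free, true = held */

/* Test-and-test-and-set: the inner while generates no global
 * traffic while the lock is held; the atomic exchange (the "swap
 * register with memory") is attempted only on an apparent release. */
void lock(void)
{
    for (;;) {
        while (atomic_load(&lock_word))
            ;                         /* spin locally in the cache */
        if (!atomic_exchange(&lock_word, true))
            return;                   /* swap returned "free": ours */
    }
}

void unlock(void)
{
    atomic_store(&lock_word, false);
}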


What?  You want to use a thousand processors?  We would be happy to
get a few tens of processors jumping on a single application
efficiently!


						Eugene

lisper@yale.UUCP (Bjorn Lisper) (03/04/88)

In article <4530@lll-winken.llnl.gov> brooks@lll-crg.llnl.gov.UUCP (Eugene
D. Brooks III) writes:
>In article <317@amelia.nas.nasa.gov> fouts@orville.nas.nasa.gov (Marty
>Fouts) writes:
>>I've just finished Stone's recent book (High Performance Computer
>>Architecture), which has left me wanting to know more about
>>architectural support for task synchronization in multiprocessors.
>>
>>I'm particularly interested in production MIMD shared-memory systems
>>and in alternatives to semaphore-based critical regions and barrier
>>synchronization.  A large part of the problem with using a semaphore
>>for synchronization is that it introduces a hot spot in the memory
>>system and forces some potentially parallel activities to be serial.
>For a bottleneck-free barrier synchronization algorithm, which is heavily
>used on several existing multiprocessors, see "The Butterfly Barrier",
>International Journal of Parallel Programming, Vol. 15, No. 4, Aug. 1986,
>pp. 295-307.  This software algorithm, which is bottleneck-free and runs
>in logarithmic time, is critically compared to direct hardware barrier
....

If you are interested in what might come in the future, see Ranade,
Bhatt, Johnsson: "The Fluent Abstract Machine", technical report
YALEU/DCS/TR-573, CS Dept., Yale University, January 1988.  Abhiram
(Ranade) has found a simple way of routing in a distributed system so
that memory accesses will, with overwhelming probability, take
logarithmic time (in the number of processors) no matter how far the
requesting processor is from the processor holding the referenced
memory.  This will make the machine look like a shared memory
architecture, but without the congestion bottlenecks such an
architecture has if implemented directly.

I think copies can be requested from Ms Susan McBride,
mcbride-susan@cs.yale.edu.

Bjorn Lisper