fouts@orville.nas.nasa.gov (Marty Fouts) (03/03/88)
I've just finished Stone's recent book (High Performance Computer Architecture), which has left me wanting to know more about architectural support for task synchronization in multiprocessors.

I'm particularly interested in production MIMD shared memory systems and in alternatives to semaphore critical region or barrier synchronization. A large part of the problem with using a semaphore for synchronization is that it introduces a hot spot in the memory system and forces some potentially parallel activities to be serial.

Stone mentions combining networks, "compare and swap", and "fetch and add" as alternatives to a simple semaphore. He doesn't talk about any of the other software organizations for synchronization, such as monitors and blocking sends, because he is discussing high granularity numeric processes.

Does anyone care to comment on current implementations of these mechanisms, especially comparing them? Also, is anyone aware of any other mechanisms in use in commercially available MIMD machines?

Marty
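[The contrast between the two primitives can be sketched in modern C, with C11 atomics standing in for the hardware instructions; the names `cas_lock`, `ticket_lock`, and `lock_demo` are illustrative, not from Stone's book. A compare-and-swap lock makes every waiter hammer the same word (the hot spot described above), while a fetch-and-add ticket lock takes exactly one atomic add per arrival:]

```c
#include <pthread.h>
#include <stdatomic.h>

/* A simple lock built on compare-and-swap: every waiter retries CAS on the
 * same word, which is exactly the memory hot spot described above. */
static atomic_int cas_word;

static void cas_lock(void) {
    int expect = 0;
    while (!atomic_compare_exchange_weak(&cas_word, &expect, 1))
        expect = 0;                    /* lock was held; reset and retry */
}
static void cas_unlock(void) { atomic_store(&cas_word, 0); }

/* A ticket lock built on fetch-and-add: each arrival performs exactly one
 * atomic add to draw a ticket, then waits its turn, so arrivals are
 * serviced first-come first-served rather than racing on one CAS word. */
static atomic_int next_ticket, now_serving;

static void ticket_lock(void) {
    int mine = atomic_fetch_add(&next_ticket, 1);  /* draw a ticket */
    while (atomic_load(&now_serving) != mine)
        ;                              /* spin until my ticket is called */
}
static void ticket_unlock(void) { atomic_fetch_add(&now_serving, 1); }

static int shared;                     /* protected by the ticket lock */

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 1000; i++) {
        ticket_lock();
        shared++;                      /* serialized critical region */
        ticket_unlock();
    }
    return NULL;
}

/* Run n threads of 1000 increments each; the result is exactly 1000*n
 * only if the lock really provides mutual exclusion. */
int lock_demo(int n) {
    pthread_t t[16];
    shared = 0;
    atomic_store(&next_ticket, 0);
    atomic_store(&now_serving, 0);
    for (int i = 0; i < n; i++) pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < n; i++) pthread_join(&t[i], NULL);
    return shared;
}
```

[Note that the ticket lock still serializes the critical region itself; fetch-and-add only removes the retry storm at lock acquisition.]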
brooks@lll-crg.llnl.gov (Eugene D. Brooks III) (03/04/88)
In article <317@amelia.nas.nasa.gov> fouts@orville.nas.nasa.gov (Marty Fouts) writes:
>I've just finished Stone's recent book (High Performance Computer
>Architecture), which has left me wanting to know more about
>architectural support for task synchronization in multiprocessors.
>
>I'm particularly interested in production MIMD shared memory systems
>and alternatives to semaphore critical region or barrier
>synchronization. A large part of the problem with using a semaphore
>for synchronization is that it introduces a hot spot in the memory
>system and forces some potentially parallel activities to be serial.

For a bottleneck-free barrier synchronization algorithm, which is heavily used on several existing multiprocessors, see "The Butterfly Barrier", International Journal of Parallel Programming, Vol. 15, No. 4, Aug 86, pp. 295-307. This software algorithm, which is bottleneck free and runs in logarithmic time, is critically compared to direct hardware barrier support (in the context of Gauss elimination algorithms) using the Cerberus multiprocessor simulator in the proceedings of the 1987 SIAM conference on parallel processing for scientific computing, which should be out any month now..... See the article titled "Gaussian techniques on shared memory multiprocessors"; the Cerberus multiprocessor simulator itself is also described in that proceedings.

As for alternatives to simple "locks" (implemented with swap register with memory) and barriers (implemented either in hardware or with a good software algorithm): you can go a surprisingly long way with these, and the "live lock" problems incurred with the combining operations should not be ignored. What? You want to use a thousand processors? We would be happy to get a few tens jumping on a single application efficiently!

Eugene
lisper@yale.UUCP (Bjorn Lisper) (03/04/88)
In article <4530@lll-winken.llnl.gov> brooks@lll-crg.llnl.gov.UUCP (Eugene D. Brooks III) writes:
>In article <317@amelia.nas.nasa.gov> fouts@orville.nas.nasa.gov (Marty
>Fouts) writes:
>>I've just finished Stone's recent book (High Performance Computer
>>Architecture), which has left me wanting to know more about
>>architectural support for task synchronization in multiprocessors.
>>
>>I'm particularly interested in production MIMD shared memory systems
>>and alternatives to semaphore critical region or barrier
>>synchronization. A large part of the problem with using a semaphore
>>for synchronization is that it introduces a hot spot in the memory
>>system and forces some potentially parallel activities to be serial.
>For a bottleneck-free barrier synchronization algorithm, which is heavily
>used on several existing multiprocessors, see "The Butterfly Barrier",
>International Journal of Parallel Programming, Vol. 15, No. 4, Aug 86,
>pp. 295-307. This software algorithm, which is bottleneck free and runs
>in logarithmic time, is critically compared to direct hardware barrier
....

If you are interested in what might come in the future, check Ranade, Bhatt, Johnsson: "The Fluent Abstract Machine", technical report YALEU/DCS/TR-573, CS Dept., Yale University, January 1988. Abhiram (Ranade) has found a simple way of routing in a distributed system so that memory accesses will, with overwhelming probability, take logarithmic time (in the number of processors) no matter how far the requesting processor is from the processor holding the referenced memory. This makes the machine look like a shared memory architecture, without the congestion bottlenecks such an architecture has if implemented directly. I think copies can be requested from Ms Susan McBride, mcbride-susan@cs.yale.edu.

Bjorn Lisper
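[One ingredient of such designs, sketched here purely as an illustration and not as Ranade's actual routing scheme, is hashing addresses pseudo-randomly across memory modules so that a burst of consecutive references (a classic hot-spot pattern) scatters over many modules instead of queueing at one; `module_of` and `worst_load` are invented names, and the multiplicative hash is just one convenient choice:]

```c
#include <stdint.h>

#define MODULES 16

/* Hypothetical address-to-module map: a multiplicative hash whose top
 * four bits pick one of 16 memory modules.  Consecutive addresses are
 * spread nearly evenly, so no single module becomes a hot spot. */
static int module_of(uint32_t addr) {
    return (int)((addr * 2654435761u) >> 28);  /* top 4 bits of the hash */
}

/* Spread check: hash n consecutive addresses and report the load on the
 * most heavily loaded module (the closer to n/MODULES, the better). */
int worst_load(uint32_t n) {
    int counts[MODULES] = {0}, worst = 0;
    for (uint32_t a = 0; a < n; a++)
        counts[module_of(a)]++;
    for (int m = 0; m < MODULES; m++)
        if (counts[m] > worst)
            worst = counts[m];
    return worst;
}
```

[A direct (unhashed) interleave would send a stride-MODULES access pattern entirely to one module; the hash trades that worst case for near-uniform expected load, which is the flavor of probabilistic guarantee the Fluent machine work builds on.]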