af@spice.cs.cmu.edu (Alessandro Forin) (10/18/89)
A number of posts in various newsgroups raised the issue of how to write parallel programs for MIPS-based machines, given the lack of any form of interlocked instruction in the instruction set. This article describes how the problem was solved in the port of Mach to the MIPS M-500 and DECStation-3100 done at CMU.

We wanted a machine-independent solution, but with attention to performance issues. User programs should be allowed to run unmodified on different MIPS boxes running Mach, modulo byte order. We decided to provide two solutions: one that is guaranteed to work under all conditions, and one that might be more efficient but might fail if the hardware or the user program does not meet certain restrictions. Programs are only guaranteed to be portable if they use the former, as the latter might make use of special hardware features only available on some specific MIPS (multi)processor. Users are expected to write parallel programs using appropriate compilers or support libraries, such as the Mach C-Threads library. The library will export the appropriate interface functions, e.g. mutex_try_lock() for C-Threads.

The first solution is implemented by adding a new instruction to the MIPS architecture: a test-and-set instruction (you guessed it). It is encoded as a "special" instruction with "function" code

	#define tas_op 0x0F

which is emulated in the kernel. Since the MIPS architecture can handle exceptions very efficiently, the emulation is also efficient: a constant cost of less than 8 usec on the Pmax, in the absence of page faults. If one is smart enough to use test-and-test-and-set spin locks, the latency on a busy lock is then just one memory reference.

The second solution for the Pmax is currently based on a new algorithm (derived from Lamport's) that does not need hardware support, as the hardware provides none.
Given the type of cache used on the Pmax (write-through), the performance of this solution is not as good as one would like: about 2 usec in the good case. The algorithm also works only on uniprocessor machines, and is known to fail on multiprocessors. We will continue to explore this software alternative for the user-level libraries and see if we can make it a guaranteed, machine-independent solution. Clearly, a multiprocessor without cache coherency in hardware is hopeless in this respect, but might stand a chance of implementing the first solution.

Since Lamport's algorithms were mentioned, it is worth noting that those described in the TOCS paper "A Fast Mutual-Exclusion Algorithm" are inappropriate for user-level programs on MIPS. The first problem is that Lamport explicitly noted that the first (fast) algorithm was only intended for interrupt-level routines in the kernel, i.e. with no pre-emption. Too many people fail to realize this, even though it is clearly stated in the paper. The second, much more subtle problem is that the other algorithms described in the paper use an atomic "compare register with memory" instruction, which is not available on any RISC machine. Those algorithms are instead directly applicable to CISC machines like the Vax.

sandro-

Alessandro Forin
School of Computer Science
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
ARPA: af@cs.cmu.edu		Phone: (412) 268-6861
roger@wraxall.inmos.co.uk (Roger Shepherd) (10/19/89)
In article <6565@pt.cs.cmu.edu> af@spice.cs.cmu.edu (Alessandro Forin) writes:
>
>A number of posts in various newsgroups raised the issue of how to write
>parallel programs for MIPS-based machines, given the lack of any form of
>interlocked instruction in the instruction set. This article describes how
>the problem was solved in the port of Mach to the MIPS M-500 and
>DECStation-3100 done at CMU.
> ...
>The second solution for the Pmax is currently based on a new algorithm
>(derived from Lamport's) that does not need hardware support, as the
>hardware provides none.
> ...
>Since Lamport's algorithms were mentioned, it is worth noting that
>those described in the TOCS paper "A Fast Mutual-Exclusion Algorithm"
>are inappropriate for user-level programs on MIPS. The first problem
>is that Lamport explicitly noted that the first (fast) algorithm was
>only intended for interrupt-level routines in the kernel, e.g. with no
>pre-emption. Too many people fail to realize this, even if it is
>clearly stated in the paper.

The problem of how to write parallel programs without hardware support for interlocking is well known, and a solution was published by E. W. Dijkstra in Communications of the ACM, Volume 8, Number 9, September 1965. The abstract reads: "A number of mainly independent sequential-cyclic processes with restricted means of communication with each other can be made in such a way that at any moment one and only one of them is engaged in the `critical section' of its cycle."

I suspect there are serious efficiency problems in using his solution on a modern cache-coherent multiprocessor. However, it continues to amaze me that it is possible to implement critical regions using only read and write operations, even in the presence of pre-emption.

Roger Shepherd, INMOS Ltd	JANET:    roger@uk.co.inmos
1000 Aztec West			UUCP:     ukc!inmos!roger or uunet!inmos-c!roger
Almondsbury			INTERNET: roger@inmos.com
+44 454 616616			ROW:      roger@inmos.com OR roger@inmos.co.uk
karsh@trifolium.esd.sgi.com (Bruce Karsh) (10/20/89)
In article <6565@pt.cs.cmu.edu> af@spice.cs.cmu.edu (Alessandro Forin) writes:
>The second solution for the Pmax is currently based on a new algorithm
>(derived from Lamport's) that does not need hardware support, as the
>hardware provides none.

So what is this new algorithm?
--
Bruce Karsh
karsh@sgi.com
mark@hubcap.clemson.edu (Mark Smotherman) (10/20/89)
From article <2591@ganymede.inmos.co.uk>, by roger@wraxall.inmos.co.uk (Roger Shepherd):
> In article <6565@pt.cs.cmu.edu> af@spice.cs.cmu.edu (Alessandro Forin) writes:
>>
>>Since Lamport's algorithms were mentioned, it is worth noting that
>>those described in the TOCS paper "A Fast Mutual-Exclusion Algorithm"
>>are inappropriate for user-level programs on MIPS. The first problem
>>is that Lamport explicitly noted that the first (fast) algorithm was
>>only intended for interrupt-level routines in the kernel, e.g. with no
                                                     ^^^^^^
>>pre-emption. Too many people fail to realize this, even if it is
  ^^^^^^^^^^^
>>clearly stated in the paper.
>
> The problem of how to write parallel programs without hardware support
> for interlocking is well known and a solution was published by E. W.
> Dijkstra in Communications of the ACM, Volume 8, Number 9, September
> 1965. ... However, it continues to amaze
> me that it is possible to implement critical regions using only read and
> write operations even in the presence of pre-emption.
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Several elegant busy-waiting solutions to the mutual exclusion problem (including Dekker's, Peterson's simpler version of 1981, and the unadorned use of test and set) often rely on an unstated assumption of "fair" scheduling. That is, they assume that the process currently in its critical section will not be indefinitely postponed by scheduling precedence or priorities.

Unfortunately, consider the following scenario. A low-priority process is executing inside its critical section. An interrupt, say I/O completion, is serviced and unblocks a higher-priority process. The interrupt handler exits through the dispatcher, which selects the higher-priority process based on a preemptive HPF policy. The higher-priority process now executes and tries to enter its critical section. But how can it enter when the low-priority process cannot be dispatched and thus allowed to leave its critical section?
** DEADLOCK **

This is why the IBM S/370 P.O. manual says to only use test and set when running in supervisor state with interrupts off. That is, use it only for synchronization between OS modules in different CPUs. Compare and swap is the appropriate instruction to implement busy-waiting synchronization between user processes, but note that the shared resource must be either a word or a doubleword. (Of course, the IBM OS provides users with blocking primitives such as WAIT and POST, etc.)

The potential for deadlock is also why Univac implemented test and set on later versions of the 1108 to cause an interrupt whenever the instruction found the bit (byte/word?) already set. Thus the OS gets involved immediately and blocks the second process attempting to enter.
--
Mark Smotherman, Comp. Sci. Dept., Clemson University, Clemson, SC 29634
INTERNET: mark@hubcap.clemson.edu	UUCP: gatech!hubcap!mark
vaughan@mcc.com (Paul Vaughan) (10/21/89)
This is a little off the topic of test-and-set, because it has to do with multiple-processor synchronization rather than multiple-process synchronization. But, . . .

How many of you are familiar with the fetch&op circuit for interprocessor arithmetic and synchronization developed by G.J. Lipovski at the University of Texas? It is a tree-based circuit with as few as 5 gates per node (bit serial) that allows a form of combining addition (fetch&add), as well as a wide variety of synchronization and priority operations between multiple processors. As far as hardware support for multiple-processor synchronization goes, it is about as simple as can be implemented, but it can do a lot of useful things. How popular is the fetch&add hardware model for designing parallel algorithms?

Paul Vaughan, MCC CAD Program | ARPA: vaughan@mcc.com | Phone: [512] 338-3639
Box 200195, Austin, TX 78720  | UUCP: ...!cs.utexas.edu!milano!cadillac!vaughan
dolf@idca.tds.PHILIPS.nl (Dolf Grunbauer) (10/23/89)
In article <6831@hubcap.clemson.edu> mark@hubcap.clemson.edu (Mark Smotherman) writes:
->Several elegant busy-waiting solutions to the mutual exclusion problem
->(including Dekker's, Peterson's simpler version of 1981, and the unadorned
->use of test and set) often rely on an unstated assumption of "fair"
->scheduling. That is, they assume that the process currently in its critical
->section will not be indefinitely postponed by scheduling precedence or
->priorities.
->Unfortunately, consider the following scenario. A low-priority process is
->executing inside its critical section. An interrupt, say I/O completion,
->is serviced and unblocks a higher-priority process. The interrupt handler
->exits through the dispatcher, which selects the higher-priority process
->based on a preemptive HPF policy. The higher-priority process now executes
->and tries to enter its critical section. But how can it enter when the low-
->priority process cannot be dispatched and thus allowed to leave its critical
->section? ** DEADLOCK **
Although the above scenario is real, I do hope that no operating system behaves like this. I worked on an operating system which also had to deal with this problem. Actually there are two deadlock possibilities: user and operating system. A deadlock in user mode (i.e. made by user processes) cannot be detected with 100% certainty (as the situation described above might be just what the user wanted :-). A deadlock in the operating system or kernel of course must be detected.

There are a few options to bypass this deadlock. One of them (in fact: Unix) raises the priority of a ready-to-run process at every scheduling pass in which it does not get to run, so eventually the low-priority process gets to run. Another solution, implemented for example in the DRM System, is priority semaphores. When a process enters a critical section, its priority is changed to a fixed (possibly very high) priority, to guarantee that it will execute and leave the section, avoiding the deadlock described above. When it leaves the critical section, its priority is restored. If a process has to wait before entering the critical section (because another process is already in it), it may be put in a priority-based FIFO queue: the higher the original process priority, the sooner it will be the one awoken by the process leaving the section, and within one priority the processes are queued on a FIFO basis.
--
Dolf Grunbauer Tel: +31 55 432764 Internet dolf@idca.tds.philips.nl
Philips Telecommunication and Data Systems UUCP ....!mcvax!philapd!dolf
Dept. SSP, P.O. Box 245, 7300 AE Apeldoorn, The Netherlands
joe@modcomp.UUCP (10/24/89)
> Unfortunately, consider the following scenario. A low-priority process
> is executing inside its critical section. An interrupt, say I/O
> completion, is serviced and unblocks a higher-priority process. The
> interrupt handler exits through the dispatcher, which selects the
> higher-priority process based on a preemptive HPF policy. The
> higher-priority process now executes and tries to enter its critical
> section. But how can it enter when the low-priority process cannot be
> dispatched and thus allowed to leave its critical section? ** DEADLOCK **

This problem is of course fundamental to the uniprocessor/spinlock combination (and still exists to a lesser degree on MPs), and is a primary reason why user-level code should stick to sleep semaphores. The only safe place spinlocks can be used is in the kernel, where context switching (and possibly other interrupts) can be locked out for the duration of the critical section.
--
joe korty			"for every vengence there is an equal and
uunet!modcomp!joe		opposite revengence."  (from Cartoon Laws)
aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (10/25/89)
..> [Deadlock avoidance by priority manipulation when synchronizing]

Trouble is, changing priority usually requires kernel intervention. What we need is a fast, non-kernel path to change priority (and also a way to handle issues like faults in the critical section). Similarly, if the user process is trying to synchronize with an interrupt routine, then you need a kernel-free path for changing not only process scheduling priority, but also interrupt priority.

Kernel-free == free of excessive kernel overhead. Fast-path syscalls possible.
--
Andy "Krazy" Glew, Motorola MCD, aglew@urbana.mcd.mot.com
1101 E. University, Urbana, IL 61801, USA. {uunet!,}uiucuxc!udc!aglew
My opinions are my own; I indicate my company only so that the reader
may account for any possible bias I may have towards our products.