[comp.arch] Mips, Mach, test-and-set

af@spice.cs.cmu.edu (Alessandro Forin) (10/18/89)

A number of posts in various newsgroups raised the issue of how to write
parallel programs for MIPS-based machines, given the lack of any form of
interlocked instruction in the instruction set.  This article describes how
the problem was solved in the port of Mach to the MIPS M-500 and
DECStation-3100 done at CMU.

We wanted a machine-independent solution, but with attention to performance
issues.  User programs should be allowed to run unmodified on different MIPS
boxes running Mach, modulo the byte-order.
We decided to provide two solutions: one that is guaranteed to work under
all conditions, and one that might be more efficient, but which might fail
if the hardware or the user program does not meet certain restrictions.
Programs are only guaranteed to be portable if they use the former solution,
as the second might make use of special hardware features only available on
some specific MIPS (multi)processor.  Users are expected to write parallel
programs using appropriate compilers or support libraries, such as the Mach
C-Threads library.  The library will export the appropriate interface
functions, e.g.  mutex_try_lock() for C-Threads.  

The first solution is implemented by adding a new instruction to
the MIPS architecture, a test-and-set instruction (you guessed it).
It is encoded as a "special" instruction with "function" code 
	#define tas_op	0x0F
which is emulated in the kernel.  Since the MIPS architecture
can handle exceptions very efficiently, the emulation is also
efficient: it costs a constant time of less than 8 usec on the Pmax,
in the absence of page faults.
If one is smart enough to use test-and-test-and-set spin locks,
the latency on a busy lock is then just one memory reference.

The second solution for the Pmax is currently based on a new algorithm
(derived from Lamport's) that does not need hardware support, as the
hardware provides none.  Given the type of cache used on the Pmax
(write-through) the performance of this solution is not as good as one would
like: about 2 usecs in the successful case.  This algorithm also works
only on uniprocessor machines, and is known to fail on multiprocessor
machines.  We will continue to explore this software alternative for the
user-level libraries and see if we can make it a guaranteed,
machine-independent solution.  Clearly, a multiprocessor without
cache-coherency in hardware is hopeless in this respect, but might stand a
chance of implementing the first solution.

Since Lamport's algorithms were mentioned, it is worth noting that
those described in the TOCS paper "A Fast Mutual-Exclusion Algorithm"
are inappropriate for user-level programs on MIPS.  The first problem
is that Lamport explicitly noted that the first (fast) algorithm was
only intended for interrupt-level routines in the kernel, e.g. with no
pre-emption.  Too many people fail to realize this, even if it is
clearly stated in the paper.
The second, much more subtle problem, is that the other algorithms described
in the paper use an atomic "compare register with memory" instruction, which
is not available on any RISC machine.  These algorithms are instead directly
applicable to CISC machines like the Vax.

sandro-

	Alessandro Forin
	School of Computer Science
	Carnegie Mellon University
	5000 Forbes Avenue
	Pittsburgh, PA 15213
	ARPA: af@cs.cmu.edu
	Phone: (412) 268-6861

roger@wraxall.inmos.co.uk (Roger Shepherd) (10/19/89)

In article <6565@pt.cs.cmu.edu> af@spice.cs.cmu.edu (Alessandro Forin) writes:
>
>A number of posts in various newsgroups raised the issue of how to write
>parallel programs for MIPS-based machines, given the lack of any form of
>interlocked instruction in the instruction set.  This article describes how
>the problem was solved in the port of Mach to the MIPS M-500 and
>DECStation-3100 done at CMU.
> ...
>The second solution for the Pmax is currently based on a new algorithm
>(derived from Lamport's) that does not need hardware support, as the
>hardware provides none. 
> ...
>Since Lamport's algorithms were mentioned, it is worth noting that
>those described in the TOCS paper "A Fast Mutual-Exclusion Algorithm"
>are inappropriate for user-level programs on MIPS.  The first problem
>is that Lamport explicitly noted that the first (fast) algorithm was
>only intended for interrupt-level routines in the kernel, e.g. with no
>pre-emption.  Too many people fail to realize this, even if it is
>clearly stated in the paper.

The problem of how to write parallel programs without hardware support
for interlocking is well known and a solution was published by E. W.
Dijkstra in Communications of the ACM, Volume 8, Number 9, September
1965. The abstract reads 

	"A number of mainly independent sequential-cyclic processes
	 with restricted means of communication with each other can
	 be made in such a way that at any moment one and only one
	 of them is engaged in the 'critical section' of its cycle."

I suspect there are serious efficiency problems in using his solution on
a modern cache-coherent multiprocessor. However, it continues to amaze
me that it is possible to implement critical regions using only read and
write operations even in the presence of pre-emption.
 

Roger Shepherd, INMOS Ltd   JANET:    roger@uk.co.inmos 
1000 Aztec West             UUCP:     ukc!inmos!roger or uunet!inmos-c!roger
Almondsbury                 INTERNET: roger@inmos.com
+44 454 616616              ROW:      roger@inmos.com OR roger@inmos.co.uk

karsh@trifolium.esd.sgi.com (Bruce Karsh) (10/20/89)

In article <6565@pt.cs.cmu.edu> af@spice.cs.cmu.edu (Alessandro Forin) writes:
>The second solution for the Pmax is currently based on a new algorithm
>(derived from Lamport's) that does not need hardware support, as the
>hardware provides none.  

So what is this new algorithm?

--

			Bruce Karsh
			karsh@sgi.com

mark@hubcap.clemson.edu (Mark Smotherman) (10/20/89)

From article <2591@ganymede.inmos.co.uk>, by roger@wraxall.inmos.co.uk (Roger Shepherd):
> In article <6565@pt.cs.cmu.edu> af@spice.cs.cmu.edu (Alessandro Forin) writes:
>>
>>Since Lamport's algorithms were mentioned, it is worth noting that
>>those described in the TOCS paper "A Fast Mutual-Exclusion Algorithm"
>>are inappropriate for user-level programs on MIPS.  The first problem
>>is that Lamport explicitly noted that the first (fast) algorithm was
>>only intended for interrupt-level routines in the kernel, e.g. with no
                                                                 ^^^^^^^
>>pre-emption.  Too many people fail to realize this, even if it is
  ^^^^^^^^^^^
>>clearly stated in the paper.
> 
> The problem of how to write parallel programs without hardware support
> for interlocking is well known and a solution was published by E. W.
> Dijkstra in Communications of the ACM, Volume 8, Number 9, September
> 1965.  ... However, it continues to amaze
> me that it is possible to implement critical regions using only read and
> write operations even in the presence of pre-emption.
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Several elegant busy-waiting solutions to the mutual exclusion problem
(including Dekker's, Peterson's simpler version of 1981, and the unadorned
use of test and set) often rely on an unstated assumption of "fair"
scheduling.  That is, they assume that the process currently in its critical
section will not be indefinitely postponed by scheduling precedence or
priorities.

Unfortunately, consider the following scenario.  A low-priority process is
executing inside its critical section.  An interrupt, say I/O completion,
is serviced and unblocks a higher-priority process.  The interrupt handler
exits through the dispatcher, which selects the higher-priority process
based on a preemptive highest-priority-first (HPF) policy.  The
higher-priority process now executes
and tries to enter its critical section.  But how can it enter when the low-
priority process cannot be dispatched and thus allowed to leave its critical
section?  ** DEADLOCK **

This is why the IBM S/370 P.O. manual says to only use test and set when
running in supervisor state with interrupts off.  That is, use it only for
synchronization between OS modules in different CPUs.  Compare and swap is
the appropriate instruction to implement busy waiting synchronization between
user processes, but note that the shared resource must be either a word or
a doubleword.  (Of course, the IBM OS provides users with blocking primitives
such as WAIT and POST, etc.)

The potential for deadlock is also why Univac implemented test and set on later
versions of the 1108 to cause an interrupt whenever the instruction found the
bit (byte/word?) already set.  Thus the OS gets involved immediately and blocks
the second process attempting to enter.
-- 
Mark Smotherman, Comp. Sci. Dept., Clemson University, Clemson, SC 29634
INTERNET: mark@hubcap.clemson.edu    UUCP: gatech!hubcap!mark

vaughan@mcc.com (Paul Vaughan) (10/21/89)

	This is a little off the topic of test-and-set, because it has
to do with multiple processor synchronization rather than multiple
process synchronization. But, . . .
	How many of you are familiar with the fetch&op circuit for
interprocessor arithmetic and synchronization developed by G.J.
Lipovski at the University of Texas?  It is a tree based circuit with
as few as 5 gates per node (bit serial) that allows a form of
combining addition (fetch&add), as well as a wide variety of
synchronization and priority operations between multiple processors.
As far as hardware support for multiple processor synchronization
goes, it is about as simple as can be implemented, but it can do a lot
of useful things.  How popular is the fetch&add hardware model for
designing parallel algorithms?


 Paul Vaughan, MCC CAD Program | ARPA: vaughan@mcc.com | Phone: [512] 338-3639
 Box 200195, Austin, TX 78720  | UUCP: ...!cs.utexas.edu!milano!cadillac!vaughan

dolf@idca.tds.PHILIPS.nl (Dolf Grunbauer) (10/23/89)

In article <6831@hubcap.clemson.edu> mark@hubcap.clemson.edu (Mark Smotherman) writes:
->Several elegant busy-waiting solutions to the mutual exclusion problem
->(including Dekker's, Peterson's simpler version of 1981, and the unadorned
->use of test and set) often rely on an unstated assumption of "fair"
->scheduling.  That is, they assume that the process currently in its critical
->section will not be indefinitely postponed by scheduling precedence or
->priorities.
->Unfortunately, consider the following scenario.  A low-priority process is
->executing inside its critical section.  An interrupt, say I/O completion,
->is serviced and unblocks a higher-priority process.  The interrupt handler
->exits through the dispatcher, which selects the higher-priority process
->based on a preemptive HPF policy.  The higher-priority process now executes
->and tries to enter its critical section.  But how can it enter when the low-
->priority process cannot be dispatched and thus allowed to leave its critical
->section?  ** DEADLOCK **

Although the scenario stated above is real, I do hope no operating system
behaves like this.  I worked on an operating system which also had to deal
with this problem.  There are actually two deadlock possibilities: in user
code and in the operating system.
A deadlock in user mode (i.e. among user processes) cannot be detected
with 100% certainty (the situation described above might be just what the
user wanted :-).
A deadlock in the operating system or kernel must of course be prevented.
There are a few options for bypassing this deadlock.  One of them (in
fact: Unix) raises the priority of a ready-to-run process at every
scheduling decision on which it fails to run, so that eventually the
low-priority process gets to run.  Another solution, implemented for
example in the DRM system, is priority semaphores.  When a process enters
a critical section, its priority is changed to a fixed (possibly very
high) priority, to guarantee that it will keep executing and leave the
section, avoiding the deadlock described above.  On leaving the critical
section, the process priority is restored.  If the process has to wait
before entering the critical section (because another process is already
in it), it may be put on a priority-ordered FIFO queue: the higher the
original process priority, the sooner it is the one awoken by the process
leaving the section, and within one priority level processes are queued
on a FIFO basis.
-- 
Dolf Grunbauer          Tel: +31 55 432764  Internet dolf@idca.tds.philips.nl
Philips Telecommunication and Data Systems  UUCP ....!mcvax!philapd!dolf
Dept. SSP, P.O. Box 245, 7300 AE Apeldoorn, The Netherlands

joe@modcomp.UUCP (10/24/89)

> Unfortunately, consider the following scenario.  A low-priority process
> is executing inside its critical section.  An interrupt, say I/O
> completion, is serviced and unblocks a higher-priority process.  The
> interrupt handler exits through the dispatcher, which selects the
> higher-priority process based on a preemptive HPF policy.  The
> higher-priority process now executes and tries to enter its critical
> section.  But how can it enter when the low- priority process cannot be
> dispatched and thus allowed to leave its critical section? ** DEADLOCK **

This problem is of course fundamental to the uniprocessor/spinlock
combination (and still exists to a lesser degree on mp's), and is a
primary reason why user-level code should stick to sleep semaphores.

The only safe place spinlocks can be used is in the kernel, where
context switching (and possibly other interrupts) can be locked out for
the duration of the critical section.
--
joe korty			"for every vengence there is an equal and
uunet!modcomp!joe		opposite revengence." (from Cartoon Laws)

aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (10/25/89)

..> [Deadlock avoidance by priority manipulation when synchronizing]

Trouble is, changing priority usually requires kernel intervention.
What we need is a fast, non-kernel, path to change priority (and also
handle issues like faults in the critical section).

Similarly, if the user process is trying to synchronize with an interrupt
routine, then not only do you need a kernel free path for changing
process scheduling priority, but also interrupt priority.

Kernel-free == free of excessive kernel overhead. Fast path syscalls 
possible.
--
Andy "Krazy" Glew,  Motorola MCD,    	    	    aglew@urbana.mcd.mot.com
1101 E. University, Urbana, IL 61801, USA.          {uunet!,}uiucuxc!udc!aglew
   
My opinions are my own; I indicate my company only so that the reader
may account for any possible bias I may have towards our products.