[comp.arch] Atomic operations

hamilton@siberia.rtp.dg.com (Eric Hamilton) (05/03/91)

In article <1991Apr22.175410.9840@decvax.dec.com>, kenton@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) writes:
|> 
|> The 88K has xmem, not test and set, which is guaranteed to be an atomic
|> operation (the bus is locked between the read and write cycles).  This
|> can be used to create locks or compare and swap without disabling itnerrupts.
|> 
Now I'm curious.  How does one fabricate compare-and-swap out of xmem
without disabling interrupts?  Xmem has essentially the same restriction as
test-and-set, which is that the stored value must be a constant independent of
the loaded value.

-- 
----------------------------------------------------------------------
Eric Hamilton				+1 919 248 6172
Data General Corporation		hamilton@dg-rtp.rtp.dg.com
62 Alexander Drive			...!mcnc!rti!xyzzy!hamilton
Research Triangle Park, NC  27709, USA

johnl@iecc.cambridge.ma.us (John R. Levine) (05/04/91)

In article <1991May2.201917.15062@dg-rtp.dg.com> hamilton@siberia.rtp.dg.com (Eric Hamilton) writes:
>Now I'm curious.  How does one fabricate compare-and-swap out of xmem
>without disabling interrupts?

You can't, at least not without a scheme that uses spin locks which are at
least as bad as disabling interrupts.  See Herlihy's article on "Wait-free
Synchronization" in the January TOPLAS.  It turns out that CAS is
fundamentally more powerful than most other synchronization primitives.

-- 
John R. Levine, IECC, POB 349, Cambridge MA 02238, +1 617 492 3869
johnl@iecc.cambridge.ma.us, {ima|spdcc|world}!iecc!johnl
Cheap oil is an oxymoron.

kenton@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) (05/06/91)

In article <1991May2.201917.15062@dg-rtp.dg.com>,
hamilton@siberia.rtp.dg.com (Eric Hamilton) writes:
|> In article <1991Apr22.175410.9840@decvax.dec.com>,
kenton@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) writes:
|> |> 
|> |> The 88K has xmem, not test and set, which is guaranteed to be an atomic
|> |> operation (the bus is locked between the read and write cycles).  This
|> |> can be used to create locks or compare and swap without disabling
itnerrupts.
|> |> 
|> Now I'm curious.  How does one fabricate compare-and-swap out of xmem
|> without disabling interrupts?  Xmem has essentially the same restriction as
|> test-and-set, which is that the stored value must be a constant
independent of
|> the loaded value.
|> 

Once you've created a lock (with xmem) you can do almost anything within the
locked region of code without disabling interrupts.

-----------------------------------------------------------------------------
==	jeff kenton		Consulting at kenton@decvax.dec.com        ==
==	(617) 894-4508			(603) 881-0011			   ==
-----------------------------------------------------------------------------

jthomas@nmsu.edu (James Thomas) (05/13/91)

In article <1991May6.113149.14531@decvax.dec.com> kenton@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) writes:

jk> Once you've created a lock (with xmem) you can do almost anything
jk> within the locked region of code without disabling interrupts.

Once I've created a lock with Dekker's algorithm, I can do anything :-)

Jim

glew@pdx007.intel.com (Andy Glew) (05/13/91)

>Once you've created a lock (with xmem) you can do almost anything within the
>locked region of code without disabling interrupts.

Including being preempted, leaving all the other processors spinning
for a lock held by a process which is swapped out.

Gang scheduling is only half a solution.

In my synchronization survey I call this the "spinning on a
non-running processor" problem.  So far, only the i860 has a general
solution to this problem, with its operation that blocks interrupts
for a short length of time from user mode.

--

Andy Glew, glew@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway, 
Hillsboro, Oregon 97124-6497

This is a private posting; it does not indicate opinions or positions
of Intel Corp.

hamilton@siberia.rtp.dg.com (Eric Hamilton) (05/13/91)

In article <1991May6.113149.14531@decvax.dec.com>, kenton@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) writes:
|> In article <1991May2.201917.15062@dg-rtp.dg.com>,
|> hamilton@siberia.rtp.dg.com (Eric Hamilton) writes:
|> |> In article <1991Apr22.175410.9840@decvax.dec.com>,
|> kenton@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) writes:
|> |> |> 
|> |> |> The 88K has xmem, not test and set, which is guaranteed to be an atomic
|> |> |> operation (the bus is locked between the read and write cycles).  This
|> |> |> can be used to create locks or compare and swap without disabling
|> itnerrupts.
|> |> |> 
|> |> Now I'm curious.  How does one fabricate compare-and-swap out of xmem
|> |> without disabling interrupts?  Xmem has essentially the same restriction as
|> |> test-and-set, which is that the stored value must be a constant
|> independent of
|> |> the loaded value.
|> |> 
|> 
|> Once you've created a lock (with xmem) you can do almost anything within the
|> locked region of code without disabling interrupts.
|> 

True, but implementing compare-and-swap is not one of those things.  Two reasons,
one of which is has nothing to do with interrupt disable.

1) Compare-and-swap will be atomic only with respect to other compare-and-swap
operations (or other operations which honor the lock).  It will not be
atomic with respect to stores, nor to any non-CPU accesses, such as I/O
references, ref/mod bit updates in page table entries, and the like.  Note
that xmem doesn't suffer from this defect.

Of course, we can't remove this restriction even if we had interrupt
disable.  So, for present purposes, let's restrict compare-and-swap so
that it is only atomic with respect to other compare-and-swap operations.
Of course, this is a big restriction, and by itself may be enough to
dispose of the misbegotten notion that test-and-set, exchange-memory, or the
like are sufficient in an architecture.

2) Compare-and-swap is non-blocking, in the sense that it is guaranteed to
complete one way or another, store or no store, succeed or fail, in a bounded
time (to be useful the bound must be fairly low - a few tens of cycles is nice).
The "you can do anything within the locked region" approach doesn't have this
non-blocking property.  Suppose one process holds the lock because it is in the
middle of a compare-and-swap, and a second process tries to obtain the lock.
The second process will block until the first process releases the lock.

But there is no assurance that the first process will ever release the
lock.  It may take a time slice interrupt, be descheduled, swapped out,
and languish there indefinitely.  It may be killed.  It may take a signal,
and in its signal handler try to do another compare-and-swap, need
another lock, and deadlock.  Meanwhile, the second process will wait
forever.

The only way to guarantee the non-blocking behavior of compare-and-swap
is to guarantee that whenever a process begins a compare-and-swap, it will
expeditiously complete it and release the lock.  And the only way to
make this guarantee is to....

		DISABLE INTERRUPTS WHILE THE LOCK IS HELD

Note that xmem doesn't have this defect either.  Xmem is non-blocking
and atomic with respect to interrupts - there's no such thing as an interrupt
in the middle of an xmem.

The moral of this is roughly as follows:
Atomic operations, such as compare-and-swap, that update memory with a value which
is a function of the old value are generally useful.  They allow you to do
things that cannot be done with operations such as xmem or test-and-set, which
update memory with a value that cannot be a function of the previous value.

Compare-and-swap cannot be fabricated out of xmem or test-and-set without
disabling interrupts or severely constraining its use.  This is unsatisfactory
for user code and many places inside the operating system.

Therefore, cost-benefit tradeoffs permitting, one should make an effort to
include compare-and-swap or something similar in the architecture.  Omitting
it is a defect.
 
-- 
----------------------------------------------------------------------
Eric Hamilton				+1 919 248 6172
Data General Corporation		hamilton@dg-rtp.rtp.dg.com
62 Alexander Drive			...!mcnc!rti!xyzzy!hamilton
Research Triangle Park, NC  27709, USA

johnl@iecc.cambridge.ma.us (John R. Levine) (05/14/91)

In article <1991May13.140634.17669@dg-rtp.dg.com> hamilton@siberia.rtp.dg.com (Eric Hamilton) writes:
>Atomic operations, such as compare-and-swap, that update memory with a value
>which is a function of the old value are generally useful.  They allow you
>to do things that cannot be done with operations such as xmem or test-and-
>set, which update memory with a value that cannot be a function of the
>previous value.
>
>Compare-and-swap cannot be fabricated out of xmem or test-and-set without
>disabling interrupts or severely constraining its use.  This is
>unsatisfactory for user code and many places inside the operating system.

This is entirely true.  Everyone who has any interest in locking primitives
should immediately get a copy of the January 1991 ACM TOPLAS and read the
article by Maurice Herlihy entitled "Wait-free synchronization."  He shows
that for a realistic model problem (getting N processors to agree on the
value of a shared variable) CAS can synchronize arbitrarily many processes or
processors without blocking, while test-and-set or xmem can only do two.

Most other locking primitives are less powerful than CAS.  The only ones that
are as powerful are at least as complex, e.g. atomic queue operations with
peeking or atomic move from one shared memory location to another.

Many people seem to believe that one can't do useful synchronization without
disabled interrupts, spin locks, or some other scheme that makes one
processor hang around and wait for another one to get out of its way.  They
are wrong.

-- 
John R. Levine, IECC, POB 349, Cambridge MA 02238, +1 617 492 3869
johnl@iecc.cambridge.ma.us, {ima|spdcc|world}!iecc!johnl
Cheap oil is an oxymoron.

kenton@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) (05/14/91)

In article <GLEW.91May13102912@pdx007.intel.com>, glew@pdx007.intel.com
(Andy Glew) writes:
|> 
|> >Once you've created a lock (with xmem) you can do almost anything
within the
|> >locked region of code without disabling interrupts.
|> 
|> Including being preempted, leaving all the other processors spinning
|> for a lock held by a process which is swapped out.
|> 

Several postings have convinced me on this issue.

|> 
|> In my synchronization survey I call this the "spinning on a
|> non-running processor" problem.  So far, only the i860 has a general
|> solution to this problem, with its operation that blocks interrupts
|> for a short length of time from user mode.
|> 

BBN's latest Butterfly was designed with this feature, although I don't
think it got used.  (Yes, I know they called it the TC-2000, not the
Butterfly 2.  What do they know.)  Locking was done with xmem instead.

-----------------------------------------------------------------------------
==	jeff kenton		Consulting at kenton@decvax.dec.com        ==
==	(617) 894-4508			(603) 881-0011			   ==
-----------------------------------------------------------------------------

terry@venus.sunquest.com (Terry R. Friedrichsen) (05/14/91)

While we're on this subject, can somebody give me a rundown on how locking
primitives cam be implemented on the MIPS RISC chips?

I know that load-linked (LL) and store-conditionally (SC) are the answer
for the R6000, but what about the R2000 and R3000?

AdTHANKSvance.

Terry R. Friedrichsen

terry@venus.sunquest.com  (Internet)
uunet!sunquest!terry	  (Usenet)
terry@sds.sdsc.edu        (alternate address; I live in Tucson)

Quote:  "Do, or do not.  There is no 'try'." - Yoda, The Empire Strikes Back

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (05/14/91)

In article <1991May14.033023.26083@iecc.cambridge.ma.us> johnl@iecc.cambridge.ma.us (John R. Levine) writes:

| Many people seem to believe that one can't do useful synchronization without
| disabled interrupts, spin locks, or some other scheme that makes one
| processor hang around and wait for another one to get out of its way.  They
| are wrong.

  Spin locking is generally less wasteful of CPU than a complete context
switch to a process which is not blocked, and back again. There are only
two types of lock which seem to make sense to me: for short action a
spin lock, and for long stuff something like pseudo-io which blocks the
waiting process and enables it back to the dispatcher when the inlock
occurs.

  Short is anything up to 4x a context switch, since a spin lock will
average half the worst case time and a block will always cost 2x context
switch. 
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
        "Most of the VAX instructions are in microcode,
         but halt and no-op are in hardware for efficiency"

bob@tera.com (Bob Alverson) (05/14/91)

In article <1991May14.033023.26083@iecc.cambridge.ma.us> johnl@iecc.cambridge.ma.us (John R. Levine) writes:
>In article <1991May13.140634.17669@dg-rtp.dg.com> hamilton@siberia.rtp.dg.com (Eric Hamilton) writes:
>>Atomic operations, such as compare-and-swap, that update memory with a value
>>which is a function of the old value are generally useful.  They allow you
>>to do things that cannot be done with operations such as xmem or test-and-
>>set, which update memory with a value that cannot be a function of the
>>previous value.
>>
>>Compare-and-swap cannot be fabricated out of xmem or test-and-set without
>>disabling interrupts or severely constraining its use.  This is
>>unsatisfactory for user code and many places inside the operating system.
>
> ...
>
>Most other locking primitives are less powerful than CAS.  The only ones that
>are as powerful are at least as complex, e.g. atomic queue operations with
>peeking or atomic move from one shared memory location to another.
>

Unfortunately, CAS is much more costly in hardware than a plain swap
or a fetch and op.  A load uses an address and returns a word.  A
store uses an address and a word.  Swap is the union of load and
store, taking an address and a data word and returning a data word.
Fetch and op is a little more, making the data stored a function of
the data read.  CAS is still more, needing an extra data word which
may blaze new word wide datapaths through the memory access logic.

I also take issue with the idea of n threads reaching consensus by
acting on a single memory location.  Using CAS the software is wait
free, but the hardware must serialize all the accesses.  I suppose you
could do a combining network (NYU style), but can't the combining be
implemented in software somehow?  Or is that fundamentally impossible
to do without waiting?

Bob

moss@cs.umass.edu (Eliot Moss) (05/14/91)

I posted something liek this recently, but would like to repeat it. I have
been working with Maurice Herlihy for a little while now on problems such as
lock-free garbage collection (that work will be presented at the next SPAA and
will appear in an IEEE journal next year). I have a firm belief (but no proof
yet) that CAS2, which allows atomic compare and swap of TWO, ARBITRARY memory
locations is strictly more powerful than CAS, in the sense that it allows
implementation of EFFICIENT lock-free (and wait-free) algorithms that CAS is
not strong enought to implement efficiently. In particular, to update many
elements of a shared structure atomically with CAS appears to require building
a copy of the ENTIRE structure and then using CAS to swing a pointer to the
new version. CAS2 supports a reasonable, atomic, update-in-place of any number
of elements of a shared data structure. I hope to have firm results and proofs
before too long. Meanwhile, it is something hardware designers may wish to
start thinking about ....					Eliot Moss
--

		J. Eliot B. Moss, Assistant Professor
		Department of Computer and Information Science
		Lederle Graduate Research Center
		University of Massachusetts
		Amherst, MA  01003
		(413) 545-4206, 545-1249 (fax); Moss@cs.umass.edu

zalman@mips.com (Zalman Stern) (05/15/91)

In article <19305@sunquest.UUCP> terry@venus.sunquest.com (Terry R. Friedrichsen) writes:
[How do you do atomic locking using the MIPS-I instruction set (i.e. R[23]000).]

Either operating system support or extra hardware. Here are a few
design points:

MIPS RISC/os (and Ultrix and others) provide a system call which masks
    interrupts and does a test-and-set like operation. (You can also use
    SysV semaphores of course, but there are many problems there.)
Mach traps an illegal instruction to do the operation. This is faster than
    a system call. They also claim to have a completely user space
    algorithm that works even with preemption. I don't know the details.
SGI provides hardware on their multi-processor machines. This involves
    special memory (a section of) which is mapped into a processes address
    space via a system call. A load from the memory location tries to
    acquire the lock and returns success/or failure as the loaded value. A
    store clears the lock.
DEC SRC is doing work where the OS and compiler cooperate such that the OS
    can tell when a user program was trying to acquire a lock. If
    preemption occurs during a lock operation, the OS allows it to complete
    (or completes the operation itself, I'm not sure). (This is for their
    Topaz OS not UNIX.)
-- 
Zalman Stern, MIPS Computer Systems, 928 E. Arques 1-03, Sunnyvale, CA 94088
zalman@mips.com OR {ames,decwrl,prls,pyramid}!mips!zalman     (408) 524 8395
  "So you see, what we can do is try something new / If you're crazy too
   I don't really see why can't we go on as three" Triad By David Crosby

glew@pdx007.intel.com (Andy Glew) (05/17/91)

>|> In my synchronization survey I call this the "spinning on a
>|> non-running processor" problem.  So far, only the i860 has a general
>|> solution to this problem, with its operation that blocks interrupts
>|> for a short length of time from user mode.
>|> 
>
>BBN's latest Butterfly was designed with this feature, although I don't
>think it got used.  (Yes, I know they called it the TC-2000, not the
>Butterfly 2.  What do they know.)  Locking was done with xmem instead.

I have the TC2000 manuals, and I believe that I looked closely for any
feature that avoids the "spinning on a non-running processor" problem.
There does not seem to be any hardware support for this, although if I
have missed it I would be happy to be corrected.  If there is support
for the "spinning on a non-running processor" problem in the BBN
TC2000, I conjecture that it is purely in the OS in the form of
something like gang scheduling.

The TC2000 does have the ability to block interrupts, but not from
user mode. Again, correct me if I am wrong.

The TC2000 UNIX does provide a variety of atomic synchronization
primitives, like compare-and-swap, fetch-and-add, fetch-and-or, etc.
These are implemented as system calls. The system call blocks
interrupts, XMEM is used for the hardware lock that excludes other
processors, and the operation is performed.  
    I think that what we are all interested in is a way of doing these
forms of synchronization without system calls.

Spinlocks on the TC2000 must be test-and-set spinloops.  You can't
even use a cache test to do a test-and-test-and-set spinloop.  The
TC2000 manual goes into great detail about adding delays to a
test-and-set spinloop to reduce contention.



By the way - MIPS' load-locked/store-conditional synchronization
primitives might conceivably be used to block interrupts as well as
providing atomic interprocessor RMWs.
    Question for MIPS: are interrupts blocked inside a
load-locked/store-conditional sequence?
    Personal preference: it's better to separate the interrupt
blocking and the atomic RMW functionality.  



[*] thanks to someone at BBN - if you want the synchronization
features of your machine to be discussed in the December, 1991,
edition of the synchronization survey, tell me about it now. "Telling
me about it" can comprise email about the synchronization features,
although spec sheets, users manuals, etc. are always preferred (mainly
because half the things people have told me by email have been wrong).
Send to:

    My email address is glew@ichips.intel.com

    My office phone is 503-696-4119
    Office FAX: 503-648-7397

    My mail address is

	Andy Glew
	Mail stop JF1-19
	Intel Corporation
	5200 NE Elam Young Parkway
	Hillsboro, OR 97124-6497
--

Andy Glew, glew@ichips.intel.com
Intel Corp., M/S JF1-19, 5200 NE Elam Young Parkway, 
Hillsboro, Oregon 97124-6497

This is a private posting; it does not indicate opinions or positions
of Intel Corp.

bret@orac.UUCP (Bret Indrelee) (05/19/91)

>kenton@abyss.zk3.dec.com (Jeff Kenton OSG/UEG) writes:

>Once you've created a lock (with xmem) you can do almost anything within the
>locked region of code without disabling interrupts.
>

So let me see if I have this right?  Instead of a Compare-And-Swap, you
would compete for a lock bit, do your atomic instruction, and then release
the lock bit?

I would much rather have the CAS in hardware.  With CAS, you can emulate
the other common atomic accesses (Test and Set, Fetch and Add, parallel
linked list manipulation, and the list goes on) without need for a
special lock bit on the data.

The importance of atomic accesses is even more important when you
start putting more than one processor into a single system.
I kind of like being able to manipulate linked lists without having
to lock the whole list.

-Bret
-- 
------------------------------------------------------------------------------
Bret Indrelee		|	<This space left intentionally blink>
bret@orac.edgar.mn.org	|					;^)

lkaplan@star-trek.bbn.com (Larry Kaplan) (05/28/91)

In article <GLEW.91May17100952@pdx007.intel.com> glew@pdx007.intel.com (Andy Glew) writes:
>>|> In my synchronization survey I call this the "spinning on a
>>|> non-running processor" problem.  So far, only the i860 has a general
>>|> solution to this problem, with its operation that blocks interrupts
>>|> for a short length of time from user mode.
>>|> 
>>
>>BBN's latest Butterfly was designed with this feature, although I don't
>>think it got used.  (Yes, I know they called it the TC-2000, not the
>>Butterfly 2.  What do they know.)  Locking was done with xmem instead.
>
>The TC2000 does have the ability to block interrupts, but not from
>user mode. Again, correct me if I am wrong.

At one time it was thought that the TC2000 could give the user the option of
blocking out interrupts via something called a user augmentations register
available in user space.  This turned out to not work well enough for O/S
related reasons.

Instead, the way we currently deal with the problem is to not worry about
the interrupts themselves.  They are not going to cause user mode locks any
problems since no interrupt code is ever going to take a user lock.  The
real problem, as stated before, is getting context switched as a result of
an interrupt (e.g. a scheduler interrupt).  To solve this, we provide
various system calls for the user to disable preemption.  Look at
user_nopreempt(2).  This allows XMEM based locks to avoid the "spinning on a
non-running processor" problem.

>The TC2000 UNIX does provide a variety of atomic synchronization
>primitives, like compare-and-swap, fetch-and-add, fetch-and-or, etc.
>These are implemented as system calls. The system call blocks
>interrupts, XMEM is used for the hardware lock that excludes other
>processors, and the operation is performed.

This is not exactly correct.  System calls are used for these atomic
operations, but they are of the "fast trap" kind where no context is saved
and no interrupts or other exceptions are handled during the call.  Each
operation is fairly short so this works well.  Also, XMEMs are NOT used to
implement any of these operations.  The TC2000 hardware has true locking
semantics on its memories and interconnection network.  This works by
locking a memory when a locked operation starts, and keeping it locked until
the entire operation is complete.  These are activated automatically by a
user mode XMEM and manually in the kernel to do the other atomic operations.
The general sequence for a fetch and op is:

Do preparatory work like checking that the memory is resident
Disable interrupts
Set lock bit (lock not taken until next real memory reference)
Read location (sets lock)
Operate on data in registers
Write result
Clear locks
Return original value

More details of the locking protocol are available in
"Inside the TC2000".

>    I think that what we are all interested in is a way of doing these
>forms of synchronization without system calls.
True enough.

#include <std_disclaimer>
_______________________________________________________________________________
				 ____ \ / ____
Laurence S. Kaplan		|    \ 0 /    |		BBN Advanced Computers
lkaplan@bbn.com			 \____|||____/		70 Fawcett St.
(617) 873-2431			  /__/ | \__\		Cambridge, MA  02238