[comp.arch] H/W Write Buffers, S/W Synchronization

viggy@hpsal2.HP.COM (Viggy Mokkarala) (10/26/87)

    Hardware Write Buffers and Software Synchronization

I have been looking into a problem involving software synchronization
using shared variables in tightly coupled private cache multiprocessor
systems with hardware write buffers.

I would like to hear other people's experiences with the problem: should
an architecture allow the performance advantage of write buffers and
restrict the way software synchronizes?

Hardware write (store) buffers provide queueing to smooth out the total
instruction flow by allowing the execution unit to proceed in spite of
unpredictable delays caused by the storage unit (e.g., a cache miss).
In a shared-memory, private-cache (write-back) multiprocessor system,
a write buffer can cause temporary staleness of data.  If such data is
being shared between processes executing on different processors, as
in the example below, there can be serious problems with
inconsistencies or deadlocks.

             Master                              Slave

Create work;                           Consume work;
Block;                                 completed++;
available++;                           if (completed < available)
if ((available - completed) > 1)           wakeup(master);
    sleep;                             else sleep;
else
    wakeup(slave);

In this example, synchronization is accomplished through modification
of the shared variables 'available' and 'completed'.  Changes to these
variables are not instantaneously visible in the other processor
modules, so each processor can act on a temporarily stale cached copy.
That is what breaks the example: both master and slave can go to sleep
forever.

The question is not "how to synchronize with write buffers", but rather the
following:

  1.  How much code already uses this?

  2.  Is it difficult to write software with such a restriction? and

  3.  Would it be appropriate to force software writers to identify shared
      variables?

John Mashey, are you listening?

Viggy Mokkarala (hplabs!hpda!viggy)
(408)447-5983
19420 Homestead Road, Cupertino, CA 95014.

mash@mips.UUCP (10/28/87)

In article <2280002@hpsal2.HP.COM> viggy@hpsal2.HP.COM (Viggy Mokkarala) writes:

>    Hardware Write Buffers and Software Synchronization

>I have been looking into a problem involving software synchronization
>using shared variables in tightly coupled private cache multiprocessor
>systems with hardware write buffers.
....discussion....

>The question is not "how to synchronize with write buffers", but rather the
>following:

>  1.  How much code already uses this?

>  2.  Is it difficult to write software with such a restriction? and

>  3.  Would it be appropriate to force software writers to identify shared
>      variables?

>John Mashey, are you listening?

Yes, but I can't really say too much on this.  Existing MIPS boxes use
write-thru caches with 4-deep write buffers, so the only time they have
anything like this problem is when synchronizing with smart controller
cards.  In some cases, this requires a wbflush() routine that loops
until the write-buffer is empty.  This is often needed when you're
writing to a control register, then expect to look at a status register
that is not in the same word as the control register, so the normal
read/write conflict detection has no chance to synchronize it by itself.
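
For illustration, such a sequence might look like this (the device,
its registers, and their addresses are invented for the example; only
wbflush() is from the real kernels):

    /* Sketch: write a hypothetical control register, drain the
       write buffer, then poll a status register in a different
       word.  The addresses are made up. */

    #define DEV_CTL    ((volatile unsigned long *) 0xBF000000)
    #define DEV_STATUS ((volatile unsigned long *) 0xBF000004)

    extern void wbflush();   /* loops until the write buffer drains */

    void
    start_io(cmd)
        unsigned long cmd;
    {
        *DEV_CTL = cmd;      /* may sit in the write buffer a while */
        wbflush();           /* force it out to the device          */
        while ((*DEV_STATUS & 1) == 0)   /* poll is now meaningful  */
            continue;
    }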

I've never worked on a system that had the particular combination
described.  A few general principles seem to apply, though:
1) The further away you get from normal memory semantics, the more of
a pain it is for software.  Exactly how much pain can be tolerated
depends on overall system design tradeoffs.  It is absolutely clear
that current trends in system design tend to make caches more visible
to software than they used to be.

2) If only the kernel has to deal with such things, it's not too bad
if it can be carefully encapsulated.  There are usually ways around it,
like using uncached memory for locks, although that would place
restrictions on where locks go (i.e., it becomes more expensive to
get locks into existing data structures that want caching); a sketch
of this appears after the list.

3) If you want user-level locks to be pleasant, more hardware
support is likely to be needed.

4) In general, I'd rather have cache coherency thought about from day
one and get the right mechanisms in there.  Multiprocessors are hard
enough to get right without having to do weird things.
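
A minimal sketch of the uncached-lock idea from point 2 (it assumes
the lock word has been mapped uncached by some means not shown, and
that a test_and_set() primitive exists; both are assumptions here):

    struct lock {
        volatile int busy;       /* must live in uncached memory */
    };

    extern int test_and_set();   /* atomic: set to 1, return old value */

    void
    acquire(lk)
        struct lock *lk;
    {
        while (test_and_set(&lk->busy))  /* spinning is safe: no stale */
            continue;                    /* cached copy can linger,    */
    }                                    /* since the page is uncached */

    void
    release(lk)
        struct lock *lk;
    {
        lk->busy = 0;
    }
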
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

zs01+@andrew.cmu.edu.UUCP (10/28/87)

Seems to me that code like this already has to use an interlocked
read-modify-write cycle for the incrementing of the <available> and
<completed> shared variables. Otherwise there are race conditions in the
code. Therefore my answer to question 1 is that code which accesses shared
variables without some sort of locking mechanism is broken.

I am not sure what restriction you are referring to in question 2. I think it
is impossible to write correct multi-threaded (asynchronous) code without
some kind of indivisible read-modify-write memory operation.

Most modern processors offer at least a test-and-set instruction,
which implements a single-bit semaphore.  Some processors offer larger
granularity locking mechanisms (e.g., the lock bit in the AMD 29000).
Correct code will use one of these mechanisms to access shared
variables.  In any event, these instructions already identify shared
variable accesses to the memory subsystem.  So my answer to question
number 3 is that it is entirely appropriate to require code to
identify shared variables.
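
Applied to the counters in the original example, a sketch (here
test_and_set() stands in for whatever instruction the processor
provides, and is assumed atomic with respect to all caches and
buffers):

    extern int test_and_set();   /* atomic: set to 1, return old value */

    volatile int mutex = 0;      /* guards both counters */
    volatile int available = 0;
    volatile int completed = 0;

    void
    post_work()
    {
        while (test_and_set(&mutex))   /* acquire */
            continue;
        available++;                   /* increment is now indivisible */
        mutex = 0;                     /* release */
    }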

Sincerely,
Zalman Stern
Internet: zs01+@andrew.cmu.edu     Usenet: ...seismo!andrew.cmu.edu!zs01+
Information Technology Center, Carnegie Mellon, Pittsburgh, PA 15213-3890

crowl@rochester.UUCP (10/28/87)

In article <QVVPcty00Vs8yy40q5@andrew.cmu.edu> zs01+@andrew.cmu.edu
(Zalman Stern) writes:
>Seems to me that code like this already has to use an interlocked
>read-modify-write cycle for the incrementing of the <available> and
><completed> shared variables. Otherwise there are race conditions in the
>code. Therefore my answer to question 1 is that code which accesses shared
>variables without some sort of locking mechanism is broken.

The original poster did not indicate whether there was a single slave process
or multiple slave processes.  If there is a single slave process, then each
of the counter variables is modified by only one process and the protocol will
work without read-modify-write (assuming values are indeed propagated).  There
are many concurrent algorithms for which certain variables may only be modified
by one process at a time, so the absence of a locking mechanism does not imply
broken code.
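
A sketch of that single-writer discipline (volatile keeps the compiler
honest; the caveat about values actually propagating still applies):

    volatile unsigned available = 0;   /* written only by the master */
    volatile unsigned completed = 0;   /* written only by the slave  */

    void
    master_post_work()      /* the sole writer of 'available' */
    {
        available++;        /* no lock needed: one writer per counter */
    }

    void
    slave_loop()            /* the sole writer of 'completed' */
    {
        for (;;) {
            while (completed == available)   /* wait for work */
                continue;
            /* ... consume one unit of work ... */
            completed++;
        }
    }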

>I am not sure what restriction you are referring to in question 2. I think it
>is impossible to write correct multi-threaded (asynchronous) code without
>some kind of indivisible read-modify-write memory operation.

It is possible; I have done it.  As above, you must ensure that only
one process ever modifies a given variable at a time.

>Most modern processors offer at least a test-and-set instruction which
>implements a single bit semaphore. Some processors offer larger granularity
>locking mechanisms (e.g., the lock bit in the AMD 29000). Correct code will
>use one of these mechanisms to access shared variables. In any event, these
>instructions are already identifying shared variable accesses to the memory
>subsystem. So my answer to question number 3 is that it is entirely
>appropriate to require code to identify shared variables.

It is appropriate to dynamically identify shared variables, but it is not
appropriate to statically identify shared variables.  Putting my locking bits
off in a special memory discourages me from coding highly concurrent data
structures.

-- 
  Lawrence Crowl		716-275-9499	University of Rochester
		      crowl@cs.rochester.edu	Computer Science Department
...!{allegra,decvax,rutgers}!rochester!crowl	Rochester, New York,  14627

hansen@mips.UUCP (Craig Hansen) (10/29/87)

In article <2280002@hpsal2.HP.COM> viggy@hpsal2.HP.COM (Viggy Mokkarala) writes:
>I have been looking into a problem involving software synchronization
>using shared variables in tightly coupled private cache multiprocessor
>systems with hardware write buffers.

>The question is not "how to synchronize with write buffers", but rather the
>following:

>  1.  How much code already uses this?

>  2.  Is it difficult to write software with such a restriction? and

>  3.  Would it be appropriate to force software writers to identify shared
>      variables?

The problem you've been looking at is well known in the multiprocessor
environment.  The question really is how tightly coupled (synchronized)
two processes on separate processors are.  The mechanism used by the
code example works only if writes are highly synchronized: for example,
if you ensure that before a write completes, all writes that appeared
on the bus prior to it are reflected in the state of the processor's
cache.  Such a synchronization requirement may be violated not only by
write buffers (which permit the processor to continue execution before
the common bus sees the write), but also by cache-invalidate or
cache-update buffering (which permits the bus to proceed before the
cache is invalidated or updated in response to the bus write).

Many multiprocessor systems are being built (I am aware of several being
built with MIPS CPUs) that do not have this property, and software which
runs on such machines must perform a synchronization operation of some kind
before reading a shared variable that may have been written by another CPU.

This doesn't mean that all reads of shared variables need to be
specially synchronized, nor that all writes do; it means that between
a write of a shared variable by one processor and a read of the same
variable by another, at least one synchronization must occur between
those two processors, or between each of those two processors and a
common bus.
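
For illustration, a minimal sketch of that discipline in C (the
sync_shared() operation is hypothetical, standing for whatever the
machine provides to drain the write buffer and any pending
invalidates):

    /* One writer, one reader, on different CPUs.  The text above
       requires at least one synchronization between the write and
       the read; this sketch shows one on each side for clarity. */

    extern void sync_shared();  /* hypothetical: drains the write
                                   buffer and pending invalidates */

    volatile int shared_flag = 0;

    void
    announce()                  /* runs on CPU A */
    {
        shared_flag = 1;
        sync_shared();          /* push the write out before going on */
    }

    int
    check()                     /* runs on CPU B */
    {
        sync_shared();          /* discard any stale buffered state */
        return shared_flag;
    }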

As to whether it's difficult to write software with such a restriction,
that's somewhat of an open issue. Current research efforts are
investigating ways of writing parallelized code without explicit
synchronization operations, and some of these methods assume rather
strong cache-coherence requirements. These strong requirements end up
forcing implicit synchronization operations into basic operations, and
so are undesirable (because the extra synchronization operations impede
buffering, and thus parallelism itself).

Identifying shared variables is either very easy or very difficult depending
on the application and environment. For example, when a monolithic kernel
(such as UNIX) is run on a shared-memory MP, essentially all variables end
up shared (when any processor can run any portion of the kernel).  Process
migration may also permit any variable (even in a single-process
application) to be shared between processors.  In these cases an
explicit synchronization operation normally occurs between the write
by one processor and the read by another, so while explicitly marking
shared variables is impractical, the required explicit synchronization
is present anyway.

What would be useful for this sort of hardware/software environment is an
indication, not of whether a variable is "shared," because all variables
essentially are shared or sharable, but whether such a variable requires
implicit synchronization, associated with reads and writes of the variable.
I would presume that when such a variable is so indicated to the HLL
compiler system, such "implicit" synchronization can actually be
performed explicitly by software, so that the basic hardware mechanism
can always use the weaker coherence forms that can be efficiently
buffered.
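
If the compiler is told which variables need it, the expansion could
be as simple as a pair of macros; a sketch (the names are invented,
and sync_shared() is the same hypothetical barrier as above):

    /* Sketch: attaching explicit synchronization to accesses of
       variables marked as needing it.  All names are illustrative. */

    extern void sync_shared();   /* hypothetical barrier operation */

    #define SYNC_STORE(var, val) ((var) = (val), sync_shared())
    #define SYNC_LOAD(var)       (sync_shared(), (var))

    volatile int available;

    /* a marked 'available++' might then expand to:         */
    /*     SYNC_STORE(available, SYNC_LOAD(available) + 1); */

The cost is then paid only at accesses the programmer has marked,
while the hardware keeps the weaker, buffer-friendly coherence forms.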

-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen

paulsc@orca.TEK.COM (Paul Scherf) (10/30/87)

In article <QVVPcty00Vs8yy40q5@andrew.cmu.edu> zs01+@andrew.cmu.edu (Zalman Stern) writes:
>Seems to me that code like this already has to use an interlocked
>read-modify-write cycle for the incrementing of the <available> and
><completed> shared variables. Otherwise there are race conditions in the
>code. Therefore my answer to question 1 is that code which accesses shared
>variables without some sort of locking mechanism is broken.

>I am not sure what restriction you are refering to in question 2. I think it
>is impossible to write correct multi-threaded (asynchronous) code without
>some kind of indivisible read-modify-write memory operation.

I agree that indivisible read-modify-write memory operations are the
mechanism most often used for multiprocessor shared-memory
synchronization.  My point is that they are not required: there are
pure software algorithms for mutual exclusion (e.g., Dekker's
algorithm).  I am told that, in practice, the software solutions have
been very inefficient, so the read-modify-write solutions are usually
preferred.
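
For the record, here is a sketch of Dekker's algorithm in C for two
processes (i is 0 or 1, j = 1 - i).  Like all such pure software
schemes, it is correct only if each store becomes visible to the other
processor in program order, which is exactly what the write buffers
under discussion may fail to guarantee:

    volatile int flag[2] = { 0, 0 };  /* flag[i]: process i wants in */
    volatile int turn = 0;            /* whose turn to insist        */

    void
    enter(i)
        int i;
    {
        int j = 1 - i;

        flag[i] = 1;
        while (flag[j]) {             /* contention with process j */
            if (turn != i) {
                flag[i] = 0;          /* back off ...              */
                while (turn != i)     /* ... until it is our turn  */
                    continue;
                flag[i] = 1;          /* and try again             */
            }
        }
        /* critical section follows */
    }

    void
    leave(i)
        int i;
    {
        turn = 1 - i;                 /* hand priority to the other */
        flag[i] = 0;
    }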

This is probably part of the reason why, as Zalman Stern points out,
most modern processors have some sort of read-modify-write capability.

Paul Scherf, Tektronix, Box 1000, MS 61-033, Wilsonville, OR, USA
paulsc@orca.GWD.Tek.COM			tektronix!orca!paulsc

chips@usfvax2.UUCP (10/31/87)

In article <849@mips.UUCP>, hansen@mips.UUCP (Craig Hansen) writes:
} What would be useful for this sort of hardware/software environment is an
} indication, not of whether a variable is "shared," because all variables
} essentially are shared or sharable, but whether such a variable requires
} implicit synchronization, associated with reads and writes of the variable.

As, for example, the C keyword "volatile"...
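
In (draft ANSI) C, that marking might read:

    /* 'volatile' tells the compiler that every access to these
       variables matters and may not be optimized away or held in a
       register.  By itself it constrains only the compiler, not the
       hardware write buffers; but it is exactly the kind of
       per-variable indication a compiler for such a machine could
       use to attach explicit synchronization to each access. */

    volatile int available;
    volatile int completed;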

-- 
Chip Salzenberg            "chip%ateng.uucp@UU.NET"  or  "uunet!ateng!chip"
A.T. Engineering, Tampa    CIS: 73717,366    Last Chance: "chips@usfvax2.UUCP"
"Use the Source, Luke!"    My opinions do not necessarily agree with anything.

steve@edm.UUCP (Stephen Samuel) (11/06/87)

The only way I can see of dealing with multiple processors and a write
buffer is to have some method of forcing a write-thru.  Some
processors (like the 68000 series) have a signal to indicate a
read/modify/write cycle.  With processors like these, you would have
to have your buffer respond to this signal by flushing the cache for
that address -- both before the read and after the write.  You have to
flush before the read, or else you can end up reading stale data in
the cache.

On systems without a locked-cycle signal, the only solution I can see
is having some address to write to in order to signal that the next
write (barring an interrupt) is going to be a locked cycle.  In this
case, the caches of ALL processors have to respond to the signal, or
else you'll once again end up with the problem of stale read caches.
A sketch of this idea follows.
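
A sketch of that last idea (everything here is hypothetical hardware:
the address, the protocol, and the primitive built on top of it):

    /* LOCK_NEXT is an invented address; a write to it would tell
       every cache and write buffer that the next data access is a
       locked cycle.  Interrupts would have to be disabled around
       the sequence, as noted above. */

    #define LOCK_NEXT ((volatile int *) 0xFFFF0000)

    int
    locked_test_and_set(p)
        volatile int *p;
    {
        int old;

        *LOCK_NEXT = 1;   /* signal: the next access is locked      */
        old = *p;         /* caches flush/bypass this location ...  */
        *p = 1;           /* ... and the read-write pair is atomic  */
        return old;
    }
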
-- 
-------------
 Stephen Samuel 			Disclaimer: You betcha!
  {ihnp4,ubc-vision,seismo!mnetor,vax135}!alberta!edm!steve
  BITNET: USERZXCV@UQV-MTS

ted@cadnetix.UUCP (Ted Smith) (11/13/87)

In article <QVVPcty00Vs8yy40q5@andrew.cmu.edu> zs01+@andrew.cmu.edu (Zalman Stern) writes:
>I think it
>is impossible to write correct multi-threaded (asynchronous) code without
>some kind of indivisible read-modify-write memory operation.

The book
   Algorithms for Mutual Exclusion
   Michel Raynal
   The MIT Press, Cambridge, MA  02142
   ISBN 0-262-18119-3
contains many solutions to the mutual exclusion problem.

Here is a simple protocol for two processes which is fair (from
   Peterson, G. L. (1981): Myths about the mutual exclusion problem;
   Inf. Proc. Lett., 12(3), 115-116):

The variables flag and turn are shared between the two processes; process
i executes the following code (i = 0 or 1 and j = (i + 1) mod 2):
   var
     flag : array [0..1] of boolean   (both initially false)
     turn : 0..1                      (initial value immaterial)

   flag[i] := true
   turn := i
   wait until (not flag[j]) or (turn = j)
   < critical section >
   flag[i] := false
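
In C, the same protocol might read as follows; like all pure software
mutual exclusion, it is correct only if each processor sees the
other's stores in program order, the very property the write buffers
under discussion can break:

    volatile int flag[2] = { 0, 0 };  /* both initially false */
    volatile int turn;

    void
    enter(i)                 /* i is 0 or 1 */
        int i;
    {
        int j = 1 - i;

        flag[i] = 1;
        turn = i;
        while (flag[j] && turn != j)  /* wait until (not flag[j]) */
            continue;                 /* or (turn = j)            */
        /* critical section follows */
    }

    void
    leave(i)
        int i;
    {
        flag[i] = 0;
    }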

-- 
    Ted Smith     	UUCP: cadnetix!ted
    Cadnetix Corp.	      hao!ico!cadnetix!ted
    5757 Central Ave.
    Boulder, CO 80301