viggy@hpsal2.HP.COM (Viggy Mokkarala) (10/26/87)
	Hardware Write Buffers and Software Synchronization

I have been looking into a problem involving software synchronization using
shared variables in tightly coupled private cache multiprocessor systems with
hardware write buffers. I would like to hear other people's experiences with
the problem: should an architecture allow the performance advantage of write
buffers and restrict the way software synchronizes?

Hardware write (store) buffers provide queueing to smooth out the total
instruction flow by allowing the execution unit to proceed in spite of
unpredictable delays caused by the storage unit (cache miss). In a shared
memory, private cache (write back) multiprocessor system, a write buffer can
cause temporary staleness of data. If such data is being shared between
processes that are executing on different processors, as in the example below,
there can be serious problems with inconsistencies, or deadlocks.

    Master                              Slave

    Create work;                        Block;
    available++;                        Consume work;
    if ((available - completed) > 1)    completed++;
        sleep;                          if (completed < available)
    else                                    wakeup(master);
        wakeup(slave);                  else
                                            sleep;

In this example, synchronization is accomplished through modification of the
shared variables 'available' and 'completed'. Changes to these variables are
not instantaneously visible in the other processor modules. This causes
caches to become temporarily stale, which causes the problem: both master and
slave go to sleep forever.

The question is not "how to synchronize with write buffers", but rather the
following:

    1. How much code already uses this?
    2. Is it difficult to write software with such a restriction? and
    3. Would it be appropriate to force software writers to identify shared
       variables?

John Mashey, are you listening?

Viggy Mokkarala (hplabs!hpda!viggy)
(408)447-5983
19420 Homestead Road, Cupertino, CA 95014
mash@mips.UUCP (10/28/87)
In article <2280002@hpsal2.HP.COM> viggy@hpsal2.HP.COM (Viggy Mokkarala) writes:
>	Hardware Write Buffers and Software Synchronization
>I have been looking into a problem involving software synchronization
>using shared variables in tightly coupled private cache multiprocessor
>systems with hardware write buffers.
	....discussion....
>The question is not "how to synchronize with write buffers", but rather the
>following:
>    1. How much code already uses this?
>    2. Is it difficult to write software with such a restriction? and
>    3. Would it be appropriate to force software writers to identify shared
>       variables?
>John Mashey, are you listening?

Yes, but I can't really say too much on this. Existing MIPS boxes use
write-thru caches with 4-deep write buffers, so the only time they have
anything like this problem is when synchronizing with smart controller cards.
In some cases, this requires a wbflush() routine that loops until the write
buffer is empty. This is often needed when you're writing to a control
register and then expect to look at a status register that is not in the same
word as the control register, so the normal read/write conflict detection has
no chance to synchronize it by itself.

I've never worked on a system that had the particular combination described.
A few general principles seem to obtain, though:

1) The further away you get from normal memory semantics, the more of a pain
it is for software. Exactly how much pain can be stood depends on overall
system design tradeoffs. It is absolutely clear that current trends in system
design tend to make caches more visible to software than they used to be.

2) If only the kernel has to deal with such things, it's not too bad, if it
can be carefully encapsulated. There are usually ways around it, like using
uncached memory for locks, although that would place restrictions on where
locks go (i.e., it becomes more expensive to get locks into existing data
structures that want caching).
3) If you want user-level locks to be pleasant, more hardware support is
likely to be needed.

4) In general, I'd rather have cache coherency thought about from day one and
get the right mechanisms in there. Multiprocessors are hard enough to get
right without having to do weird things.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
zs01+@andrew.cmu.edu.UUCP (10/28/87)
Seems to me that code like this already has to use an interlocked
read-modify-write cycle for the incrementing of the <available> and
<completed> shared variables. Otherwise there are race conditions in the
code. Therefore my answer to question 1 is that code which accesses shared
variables without some sort of locking mechanism is broken.

I am not sure what restriction you are referring to in question 2. I think it
is impossible to write correct multi-threaded (asynchronous) code without
some kind of indivisible read-modify-write memory operation. Most modern
processors offer at least a test-and-set instruction which implements a
single-bit semaphore. Some processors offer larger granularity locking
mechanisms (e.g., the lock bit in the AMD 29000). Correct code will use one
of these mechanisms to access shared variables. In any event, these
instructions already identify shared variable accesses to the memory
subsystem. So my answer to question number 3 is that it is entirely
appropriate to require code to identify shared variables.

Sincerely,
Zalman Stern
Internet: zs01+@andrew.cmu.edu     Usenet: ...seismo!andrew.cmu.edu!zs01+
Information Technology Center, Carnegie Mellon, Pittsburgh, PA 15213-3890
crowl@rochester.UUCP (10/28/87)
In article <QVVPcty00Vs8yy40q5@andrew.cmu.edu> zs01+@andrew.cmu.edu (Zalman Stern) writes:
>Seems to me that code like this already has to use an interlocked
>read-modify-write cycle for the incrementing of the <available> and
><completed> shared variables. Otherwise there are race conditions in the
>code. Therefore my answer to question 1 is that code which accesses shared
>variables without some sort of locking mechanism is broken.

The original poster did not indicate whether there was a single slave process
or multiple slave processes. If there is a single slave process, then each of
the counter variables is modified by only one process, and the protocol will
work without read-modify-write (assuming values are indeed propagated). There
are many concurrent algorithms in which a given variable may only ever be
modified by one process, so the absence of a locking mechanism does not imply
broken code.

>I am not sure what restriction you are referring to in question 2. I think it
>is impossible to write correct multi-threaded (asynchronous) code without
>some kind of indivisible read-modify-write memory operation.

It is possible; I have done it. As above, you must ensure that only one
process may ever modify a given variable.

>Most modern processors offer at least a test-and-set instruction which
>implements a single-bit semaphore. Some processors offer larger granularity
>locking mechanisms (e.g., the lock bit in the AMD 29000). Correct code will
>use one of these mechanisms to access shared variables. In any event, these
>instructions already identify shared variable accesses to the memory
>subsystem. So my answer to question number 3 is that it is entirely
>appropriate to require code to identify shared variables.

It is appropriate to dynamically identify shared variables, but it is not
appropriate to statically identify shared variables. Putting my locking bits
off in a special memory discourages me from coding highly concurrent data
structures.
-- 
Lawrence Crowl		716-275-9499	University of Rochester
crowl@cs.rochester.edu			Computer Science Department
...!{allegra,decvax,rutgers}!rochester!crowl	Rochester, New York, 14627
hansen@mips.UUCP (Craig Hansen) (10/29/87)
In article <2280002@hpsal2.HP.COM> viggy@hpsal2.HP.COM (Viggy Mokkarala) writes:
>I have been looking into a problem involving software synchronization
>using shared variables in tightly coupled private cache multiprocessor
>systems with hardware write buffers.
>The question is not "how to synchronize with write buffers", but rather the
>following:
>    1. How much code already uses this?
>    2. Is it difficult to write software with such a restriction? and
>    3. Would it be appropriate to force software writers to identify shared
>       variables?

The problem you've been looking at is well known in the multiprocessor
environment. The question really is how tightly coupled (synchronized) two
processes on separate processors really are. The mechanism used by the code
example works only if writes are highly synchronized, for example, if you
ensure that before a write completes, all writes that appeared on the bus
prior to it are reflected in the state of the processor's cache.

Such a synchronization requirement may be violated not only because of write
buffers (which permit the processor to continue execution before the common
bus sees the write), but also because of cache-invalidate or cache-update
buffering (which permits the bus to continue before the cache is invalidated
or updated due to the bus write). Many multiprocessor systems are being built
(I am aware of several being built with MIPS CPUs) that do not have this
property, and software which runs on such machines must perform a
synchronization operation of some kind before reading a shared variable that
may have been written by another CPU.
This doesn't mean that all reads of shared variables need to be specially
synchronized, nor that all writes of shared variables do; it means that
between a write of a shared variable by one processor and a read of the same
variable by another, at least one synchronization must occur between those
two processors, or between each of them and a common bus.

As to whether it's difficult to write software with such a restriction, that
is somewhat an open issue. Current research efforts are investigating ways of
writing parallelized code without explicit synchronization operations, and
some of these methods assume rather strong requirements on cache coherence.
These strong requirements end up demanding implicit synchronization in basic
operations, and so are undesirable (because the extra synchronization
operations impede buffering, and thus parallelism itself).

Identifying shared variables is either very easy or very difficult depending
on the application and environment. For example, when a monolithic kernel
(such as UNIX) is run on a shared-memory MP, essentially all variables end up
shared (when any processor can run any portion of the kernel). Process
migration may also permit any variable (even in a single-process application)
to be shared between processors. In these cases an explicit synchronization
operation normally occurs between the write by one processor and the read by
another, so while it's hard to get away with explicitly marking shared
variables, the explicit synchronization operation does occur.

What would be useful for this sort of hardware/software environment is an
indication, not of whether a variable is "shared" (because essentially all
variables are shared or sharable), but of whether it requires implicit
synchronization associated with its reads and writes.
I would presume that when such a variable is so indicated to the HLL compiler
system, the "implicit" synchronization can actually be performed explicitly
by software, so that the basic hardware mechanism can always use the weaker
coherence forms that can be efficiently buffered.
-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen
paulsc@orca.TEK.COM (Paul Scherf) (10/30/87)
In article <QVVPcty00Vs8yy40q5@andrew.cmu.edu> zs01+@andrew.cmu.edu (Zalman Stern) writes:
>Seems to me that code like this already has to use an interlocked
>read-modify-write cycle for the incrementing of the <available> and
><completed> shared variables. Otherwise there are race conditions in the
>code. Therefore my answer to question 1 is that code which accesses shared
>variables without some sort of locking mechanism is broken.
>I am not sure what restriction you are referring to in question 2. I think it
>is impossible to write correct multi-threaded (asynchronous) code without
>some kind of indivisible read-modify-write memory operation.

I agree that indivisible read-modify-write memory operations are most often
used for multiprocessor shared-memory synchronization. My point is that they
are not required. There are software algorithms for this (e.g., Dekker's
algorithm). I am told that, in practice, software solutions have been very
inefficient, so the read-modify-write solutions are usually preferred. This
is probably part of the reason why, as Zalman Stern points out, most modern
processors have some sort of read-modify-write capability.

Paul Scherf, Tektronix, Box 1000, MS 61-033, Wilsonville, OR, USA
paulsc@orca.GWD.Tek.COM    tektronix!orca!paulsc
chips@usfvax2.UUCP (10/31/87)
In article <849@mips.UUCP>, hansen@mips.UUCP (Craig Hansen) writes:
} What would be useful for this sort of hardware/software environment is an
} indication, not of whether a variable is "shared," because all variables
} essentially are shared or sharable, but whether such a variable requires
} implicit synchronization, associated with reads and writes of the variable.

As, for example, the C keyword "volatile"...
-- 
Chip Salzenberg		"chip%ateng.uucp@UU.NET" or "uunet!ateng!chip"
A.T. Engineering, Tampa	CIS: 73717,366
Last Chance: "chips@usfvax2.UUCP"
"Use the Source, Luke!"	My opinions do not necessarily agree with anything.
steve@edm.UUCP (Stephen Samuel) (11/06/87)
The only way I could see of dealing with multiple processors and a write
buffer would be to have some method of forcing a write-through. Some
processors (like the 68000 series) have a signal to indicate a
read/modify/write cycle. With processors like these, you would have to have
your buffer respond to this signal by flushing the cache for that address,
both before the read and after the write. You have to flush before the read
or else you can end up reading stale data from the cache.

On systems without a locked-cycle signal, the only solution I can see is
having some address to write to in order to signal that the next write
(barring an interrupt) is going to be a locked cycle. In this case, the
caches of ALL processors are going to have to respond to the signal, or else
you'll once again end up with the problem of stale read caches.
-- 
Stephen Samuel			Disclaimer: You betcha!
{ihnp4,ubc-vision,seismo!mnetor,vax135}!alberta!edm!steve
BITNET: USERZXCV@UQV-MTS
ted@cadnetix.UUCP (Ted Smith) (11/13/87)
In article <QVVPcty00Vs8yy40q5@andrew.cmu.edu> zs01+@andrew.cmu.edu (Zalman Stern) writes:
>I think it
>is impossible to write correct multi-threaded (asynchronous) code without
>some kind of indivisible read-modify-write memory operation.

The book

	Algorithms for Mutual Exclusion
	Michel Raynal
	The MIT Press, Cambridge, MA 02142
	ISBN 0-262-18119-3

contains many solutions to the mutual exclusion problem. Here is a simple
protocol for two processes which is fair (from Peterson, G. L. (1981): Myths
about the mutual exclusion problem; Inf. Proc. Lett., 12(3), 115-116). The
variables flag and turn are shared between the two processes; process i
executes the following code (i = 0 or 1 and j = (i + 1) mod 2):

	var flag : array [0..1] of boolean
	    turn : 0..1

	flag[i] := true
	turn := i
	wait until (not flag[j]) or (turn = j)
	< critical section >
	flag[i] := false
-- 
Ted Smith	UUCP: cadnetix!ted
Cadnetix Corp.	      hao!ico!cadnetix!ted
5757 Central Ave.
Boulder, CO 80301