herlihy@crl.dec.com (Maurice Herlihy) (06/05/91)
Is the Multimax memory sequentially consistent? (I.e., do reads and writes
to memory occur in the order requested?) In the course of tracking down
strange behavior in an experimental mutual exclusion protocol, I wrote the
following simple memory exerciser (code at the end of the message).

Processes P_0 and P_1 share a two-element buffer, initially [0,0]. The
processes enter a loop. On the k-th iteration, process P_i writes k to
slot i and then reads the other slot. An easy case analysis shows that if
P_i and P_j respectively read v_i and v_j on round k, then one or both of
v_i and v_j must be greater than or equal to k, provided the memory is
sequentially consistent. Every few million iterations, however, this
property fails.

I wrote the exerciser in C both using explicit system calls (proc_fork,
share_malloc, etc.) and using the C parallel extensions (shared volatile
declarations, etc.). The assembly language listings from the C compiler
don't show any reordering.

Any enlightenment would be greatly appreciated.

Maurice Herlihy

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

/* sequential consistency checker */

#include <stdio.h>
#include <parallel.h>

#define TEST_LENGTH (1024*1024)

main()
{
    int i, j, k;
    int *buffer, *log_0, *log_1, *log_i, *log_j;

    /* allocate shared data structures */
    share_malloc_init(4 * TEST_LENGTH * sizeof(int));
    buffer = (int *) share_malloc(2 * sizeof(int));
    log_0  = (int *) share_malloc(TEST_LENGTH * sizeof(int));
    log_1  = (int *) share_malloc(TEST_LENGTH * sizeof(int));

    /* initialize shared data structures */
    for (k = 0; k < TEST_LENGTH; k++) {
        log_0[k] = 0;
        log_1[k] = 0;
    }
    buffer[0] = 0;
    buffer[1] = 0;

    /* start processes */
    i = proc_fork(2);             /* i is my process id */
    j = i ? 0 : 1;                /* j is the other's process id */
    log_i = i ? log_1 : log_0;    /* log_i is my log array */
    log_j = j ? log_1 : log_0;    /* log_j is the other's log array */

    for (k = 0; k < TEST_LENGTH; k++) {
        buffer[i] = k;            /* write my slot */
        log_i[k] = buffer[j];     /* read and record other's slot */
    }
    proc_join();

    /* check for inconsistencies */
    for (k = 0; k < TEST_LENGTH; k++) {
        if (log_0[k] < k && log_1[k] < k)
            printf("log_0[%d] = %d \t log_1[%d] = %d\n",
                   k, log_0[k], k, log_1[k]);
    }
}
alan@encore.encore.COM (Alan Langerman) (06/21/91)
In article <1991Jun4.210114.11355@crl.dec.com>, herlihy@crl.dec.com
(Maurice Herlihy) writes:
|> Is the Multimax memory sequentially consistent? (I.e., do reads and writes
|> to memory occur in the order requested?) In the course of tracking down
|> strange behavior in an experimental mutual exclusion protocol, I wrote the
|> following simple memory exerciser (code at the end of the message).

The Multimax obeys weak sequential consistency. In a strongly consistent
system, all reads and writes from a given processor would immediately
invalidate/update all caches in the system; however, there is a
non-trivial performance penalty to be paid for this consistency. In a
weakly consistent system, we can take full advantage of a
split-transaction, pended bus: operations are never synchronous across
the bus, so the bus is never tied up waiting for a memory or cache to
respond. In a tightly-coupled, shared-memory multiprocessor built around
a central system bus, bus bandwidth becomes a very precious resource
indeed, so there is a significant advantage to using a split-transaction,
pended bus. The implication, however, is that multiprocessor read/write
traffic can become interleaved and results can be returned in an
unexpected order.

In the Multimax, all single-stream operations are strongly consistent.
That is, the sequence of reads and writes generated by a single processor
will always be seen by that processor in the original order. For
instance, if a write operation is pending in a buffer, a subsequent read
operation by the same processor will hit in the write buffer, rather than
obtaining a stale value from cache or memory.

However, the story becomes more complex when processors interact.
Invalidate traffic queues up through a FIFO for each cache. These
invalidates are normally processed asynchronously to the stream of
requests flowing from processor to cache. Thus, the following scenario
becomes possible:

1. Processor A writes memory location X.
2. An invalidate for location X is queued in Processor B's FIFO.
3. Processor B issues a read request for location X.
4. Processor B hits in its cache and uses stale data.
5. Processor B's cache finally processes the invalidate request. Too late.

Of course, the story does not end here. A lock instruction (sbitib) is
detected by the cache hardware, causing special processing. In brief,
this processing guarantees that the particular processor/cache's view of
the world will be consistent: all pending write operations are flushed
back to memory and all pending invalidates are handled by the cache
before the processor is allowed to continue. (The processor may then
acquire the lock, or fail to acquire it; that's irrelevant.) By using
lock instructions to guard access to shared data structures, processors
guarantee correct software synchronization. The hardware detects lock
instructions to guarantee strong sequential consistency around
synchronization points.

So, in sum: the Multimax implements weak sequential consistency. Strong
sequential consistency is guaranteed around synchronization points, but
not elsewhere.

Alan

P.S. If memory serves me, the "cinv" instruction exists only on NS32532
processors (and possibly follow-ons), and it is privileged. If you use
lock instructions, you will not need to flush the cache.