[comp.sys.encore] Sequential Consistency?

herlihy@crl.dec.com (Maurice Herlihy) (06/05/91)

Is the Multimax memory sequentially consistent?  (I.e., do reads and writes to memory occur in the order requested?)  In the course of tracking down strange behavior in an experimental mutual exclusion protocol, I wrote the following simple memory exerciser (code at the end of the message).

Processes P_0 and P_1 share a two-element buffer, initially [0,0].  The
processes enter a loop.  On the k-th iteration, process P_i writes k to slot i
and reads the other slot.  An easy case analysis shows that if P_i and P_j
respectively read v_i and v_j on round k, then one or both of v_i and v_j
must be greater than or equal to k, provided the memory is sequentially
consistent: whichever round-k write occurs second in the global order, the
read that follows it in program order must find the other slot already
holding k or more, since slot values only increase.

Every few million iterations, however, this property fails.  I wrote the
exerciser in C twice: once using explicit system calls (proc_fork,
share_malloc, etc.) and once using the C parallel extensions (shared
volatile declarations, etc.).  The assembly language listings from the C
compiler don't show any reordering.
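The parallel-extensions version declared the shared data roughly along
these lines (a sketch, modulo the exact syntax of the extensions; only the
system-call version is reproduced in full below):

	shared volatile int buffer[2];
	shared volatile int log_0[TEST_LENGTH];
	shared volatile int log_1[TEST_LENGTH];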

Any enlightenment would be greatly appreciated.

	Maurice Herlihy

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
/*
  
  sequential consistency checker
  
  */

#include <stdio.h>
#include <parallel.h>

#define TEST_LENGTH (1024*1024)

main() {
  int i, j, k;
  int *buffer, *log_0, *log_1, *log_i, *log_j;

  /* allocate shared data structures */
  share_malloc_init(4 * TEST_LENGTH * sizeof(int));
  buffer = (int*) share_malloc(2 * sizeof(int));
  log_0 = (int*) share_malloc(TEST_LENGTH * sizeof(int));
  log_1 = (int*) share_malloc(TEST_LENGTH * sizeof(int));
  
  /* initialize shared data structures */
  for (k = 0; k < TEST_LENGTH; k++){
    log_0[k] = 0;
    log_1[k] = 0;
  }
  buffer[0] = 0;
  buffer[1] = 0;

  /* start processes */
  i = proc_fork(2);		/* i is my process id */
  j = i ? 0 : 1;		/* j is the other's process id */
  log_i = i ? log_1 : log_0;	/* log_i is my log array */
  log_j = j ? log_1 : log_0;	/* log_j is the other's log array */
  for (k = 0; k < TEST_LENGTH; k++){
    buffer[i] = k;		/* write my slot */
    log_i[k] = buffer[j];	/* read and record other's slot */
  }
  proc_join();

  /* check for inconsistencies */
  for (k = 0; k < TEST_LENGTH; k++){
    if (log_0[k] < k && log_1[k] < k)
      printf("log_0[%d] = %d \t log_1[%d] = %d\n",
	     k, log_0[k], k, log_1[k]);
  }
}

alan@encore.encore.COM (Alan Langerman) (06/21/91)

In article <1991Jun4.210114.11355@crl.dec.com>, herlihy@crl.dec.com (Maurice Herlihy) writes:
|> Is the Multimax memory sequentially consistent?  (I.e., do reads and writes
|> to memory occur in the order requested?)  In the course of tracking down
|> strange behavior in an experimental mutual exclusion protocol, I wrote the
|> following simple memory exerciser (code at the end of the message).

The Multimax obeys weak sequential consistency.

In a strongly consistent system, all reads and writes from a given
processor would immediately invalidate/update all caches in the
system.  However, there is a non-trivial performance penalty to be paid
for this consistency.

In a weakly consistent system, we can take full advantage of a split-
transaction, pended bus; that means that operations are never synchronous
across the bus, so the bus is never tied up waiting for a memory or cache
to respond.  In a tightly-coupled, shared-memory multiprocessor built
around a central system bus, bus bandwidth becomes a very precious resource
indeed, so there is a significant advantage to using a split-transaction,
pended bus.  However, the implication is that multiprocessor read/write
traffic can become interleaved and results returned in an unexpected order.

In the Multimax, all single-stream operations are strongly consistent.
That is, the sequence of reads and writes generated by a single
processor will always be seen by the processor itself in the original
order.  For instance, if a write operation is pending in a buffer, a
subsequent read operation by the same processor will hit in the write
buffer, rather than obtaining a stale value from cache or memory.
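Concretely: if a shared variable x is initially 0 and one processor
executes

	x = 1;	/* may sit in this processor's write buffer for a while */
	r = x;	/* serviced from the write buffer, so r is always 1 */

the read can never return the stale 0, however long the write stays
buffered.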

However, the story becomes more complex when processors interact.
Invalidate traffic queues up through a FIFO for each cache.  These
invalidates are normally processed asynchronously to the stream of
requests flowing from processor to cache.  Thus, the following scenario
becomes possible (mapped onto Herlihy's exerciser below):
	1.  Processor A writes memory location X.
	2.  Invalidate for location X queued in Processor B's FIFO.
	3.  Processor B issues read request for location X.
	4.  Processor B hits in cache and uses stale data.
	5.  Processor B's cache finally processes invalidate request.
		Too late.
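
In the exerciser, that scenario can strike both processors on the same
round, which is exactly the condition the checking loop flags:

	1.  P0 writes buffer[0] = k; invalidate queued in P1's FIFO.
	2.  P1 writes buffer[1] = k; invalidate queued in P0's FIFO.
	3.  P0 reads buffer[1] from its own cache: stale, so log_0[k] < k.
	4.  P1 reads buffer[0] from its own cache: stale, so log_1[k] < k.

(One illustrative interleaving; others produce the same effect.)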

Of course, the story does not end here.

A lock instruction (sbitib) is detected by the cache hardware, causing
special processing.  In brief, this processing guarantees that the particular
processor/cache's view of the world will be consistent.  All pending write
operations are flushed back to memory and all pending invalidates are handled
by the cache before the processor is allowed to continue.  (The processor may
then acquire the lock, or fail to acquire it; that's irrelevant.)

By using lock instructions to guard access to shared data structures,
processors guarantee correct software synchronization.  The hardware detects
lock instructions to guarantee strong sequential consistency around
synchronization points.
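
As a sketch, Herlihy's loop could be guarded like this.  lock_acquire and
lock_release are stand-in names for whatever lock primitives the library
provides; the essential point is that the acquire executes an sbitib,
which flushes pending writes and drains the invalidate FIFO before the
loop body runs:

	int *lock = (int*) share_malloc(sizeof(int));	/* shared lock word */
	*lock = 0;
	...
	for (k = 0; k < TEST_LENGTH; k++){
	  lock_acquire(lock);	/* sbitib: flush writes, drain invalidates */
	  buffer[i] = k;	/* write my slot */
	  log_i[k] = buffer[j];	/* this read is now guaranteed fresh */
	  lock_release(lock);
	}

Whichever processor runs its round-k critical section second is then
guaranteed to see the other's round-k write, so the invariant holds.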

So, in sum:  the Multimax implements weak sequential consistency:  strong
sequential consistency is guaranteed around synchronization points but not
elsewhere.

Alan

P.S.  If memory serves me, the "cinv" instruction exists only on NS32532
processors (and possibly follow-ons), and is privileged.  If you use lock
instructions, you will not need to flush the cache.