[comp.arch] Synchronization primitives and cache coherence

huck@aspen.IAG.HP.COM (Jerry Huck) (06/13/90)

I'm trying to catalog the synchronization approaches that current commercial
architectures are using between processors and I/O modules.  Could others
send me (or post) information on architectures they are aware of?  I'll be
happy to tabulate and summarize.  The questions are:
  1.) Which instruction level synchronization primitives are supported?
  2.) How are I/O modules accesses synchronized with processor caches?

For example, PA-RISC defines the "load and clear" instruction as an
atomic memory-based synchronization primitive (its just like test and
set except it clears the location).  Cache coherence with I/O modules
requires Software support to flush the memory addresses at the appropriate
times.

For those curious 'why?', I'm looking into the current standards
activity in cache coherent buses (like futurebus+ and SCI) and their
requirements on future processors.  These efforts define both coherence
protocols and synchronization primitives (but that would be a different
note).

Thanks,
Jerry Huck (Hewlett Packard)
           (huck@iag.hp.com)

aglew@lasso.csg.uiuc.edu (Andy Glew) (06/14/90)

I just noticed that Jerry is specifically interested in
synchronization between processor and I/O modules, whereas my survey
was between processor and memory (and other processors).
Although it should apply to memory mapped I/O.

Nonetheless, I hope that some people may be interested in the
comparative tables of survey primitives I have posted.

--
Andy Glew, aglew@uiuc.edu

aglew@dual.csg.uiuc.edu (Andy Glew) (06/14/90)

(Second try at posting - I hate silent errors)

I've been working in this area, and have written a survey paper which
is on hold pending finishing my thesis.

I'm currently doing work on this, with a work in progress paper coming
out in the midsummer Computer Architecture News issue, entitled
"Snoopy Cache Test-and-test-and-set Without Excessive Bus Contention".

The survey paper is 77K of troff.
I could split that and email it to you if you want
(or do the same with Postscript)
or I could physical mail you the paper.

Your post prompted me to posting the comparative tables of this
paper to comp.arch. I'll email them to you as well.

The comparative tables are 132 columns wide.  The terms used in them are
explained in the paper itself.


1.1.  Table 1: Processors and Synchronization
- -   ----- -  ---------- --- ---------------

--------------------------------------------------------------------------------------------------------------------------------
|                                           Table 1: Processors and Synchronization					       |
| Processor                       Architecture                  |                         Implementation		       |
|            |     Instruction    |             Type            |            Cache          |  External|        Comments       |
|            |                    |                             |                           |  Signal  |		       |
|------------|--------------------|-----------------------------|---------------------------|----------|-----------------------|
|MIPS R2000  |  none              |                             |                           |          |  see Table 2	       |
|------------|--------------------|-----------------------------|---------------------------|----------|-----------------------|
|NS32xxx     |  CBIT/SBIT         |  test-and-set-arbitrary-bit |  bypass                   |  ILO     |		       |
|------------|--------------------|-----------------------------|---------------------------|----------|-----------------------|
|MC68030     |  TAS               |  load-store-fixed-bit	|       		    |          |		       |
|            | -------------------|-----------------------------|                           |          | ----------------------|
|            |  CAS               |  compare-and-swap           |  bypass                   |  RMC     |  Not always	       |
|            |  CAS2              |  " " queues                 |                           |          |  implemented	       |
|------------|--------------------|-----------------------------|---------------------------|----------|-----------------------|
|MC88x00     |  XMEM              |  exchange-with-memory       |  bypass                   |  LOCK    |		       |
|------------|--------------------|-----------------------------|---------------------------|----------|-----------------------|
|WE32100     |                    |  exchange-with-memory	|       		    |          |		       |
|------------|--------------------|-----------------------------|---------------------------|----------|-----------------------|
|Pyramid     |  BITSW             |  test-and-set-arbitrary-bits|                           |          |                       |
|            | -------------------|-----------------------------|  bypass                   |          |                       |
|            |  ICMPW             |  fetch-and-add              |  ------                   |          |		       |
|------------|--------------------|-----------------------------|---------------------------|----------|-----------------------|
|Clipper C300|  TSTS              |  test-and-set-arbitrary-bit |  bypass                   |  LOCK    |                       |
|------------|--------------------|-----------------------------|---------------------------|----------|-----------------------|
|Intel 80286 |  LOCK prefix       |  arbitrary operations       |  -                        |  LOCK#   |		       |
|------------|--------------------|-----------------------------|---------------------------|----------|-----------------------|
|Intel i486  |  LOCK prefix       |  short RMW instructions     |                           |  LOCK#   |                       |
|            | -------------------|-----------------------------|                           |          |                       |
|            |  XCHG              |  exchange-with-memory       |  bypass cache             |          |		       |
|            | -------------------|-----------------------------|  flush buffers            |          |		       |
|            |  XADD              |  fetch-and-add              |  ----- -------            |          |                       |
|            | -------------------|-----------------------------|                           |          |                       |
|            |  CMPXCHG           |  compare-and-swap           |  checks in cache first    |          |		       |
|------------|--------------------|-----------------------------|---------------------------|----------|-----------------------|
|Intel i860  |  lock              |  locks bus and              |  bypasses    unsnooped    |  LOCK#   |  user                 |
|            |  arbitrary code    |  blocks interrupts          |  chip cache               |          |                       |
|            |  unlock            |  for 32 cycles              |                           |          |                       |
|            | -------------------|-----------------------------|                           |          | ----------------------|
|            |  LK bit            |  arbitrary sequences        |                           |          |  priviliged	       |
|            |                    |                             |                           |          |  no time limit	       |
|------------|--------------------|-----------------------------|---------------------------|----------|-----------------------|
|AMD29000    |  LOADSET           |  load-store-fixed           |                           |  LOCK    |                       |
|            | -------------------|-----------------------------|                           |          | ----------------------|
|            |  LOADL, STOREL     |  not RMW                    |                           |          |  For use by systems   |
|            |                    |                             |                           |          |  with special support?|
|            | -------------------|-----------------------------|                           |          | ----------------------|
|            |  LK bit in register|  arbitrary sequences        |                           |          |  priviliged           |
|------------|--------------------|-----------------------------|---------------------------|----------|-----------------------|
|SPUR        |  TEST AND SET      |  load-store-fixed           |  RMW in cache             |          |		       |
|------------|--------------------|-----------------------------|---------------------------|----------|-----------------------|
|SPARC       |  LDSTUB            |  load-store-fixed           |  bypass                   |  LDST    |                       |
|            | -------------------|-----------------------------|                           |          |                       |
|            |  SWAP              |  exchange-with-memory       |                           |          |		       |
|------------|--------------------|-----------------------------|---------------------------|----------|-----------------------|
|Gould NP1   |  TSM, ZSM, SSM     |  test-and-set-arbitrary-bits|  bypass cache             |          |                       |
|            |                    |                             |  flush local              |          |                       |
|            |                    |                             |   and ISBL buffers        |          |                       |
|            |                    |                             |  changes priority         |          |                       |
|            |                    |                             |   on access to local cache|          |                       |
|------------|--------------------|-----------------------------|---------------------------|----------|-----------------------|
|IBM RT      |  TSH               |  load-store-fixed           |                           |          |                       |
|------------------------------------------------------------------------------------------------------------------------------|
|IBM America                               801 Storage - lock bits on cache lines, implicitly locked			       |
|------------------------------------------------------------------------------------------------------------------------------|
|    HEP                                              Full/empty bits on words in memory                                       |
--------------------------------------------------------------------------------------------------------------------------------




1.2.  Table 2: Systems and Synchronization
- -   ----- -  ------- --- ---------------

---------------------------------------------------------------------------------------------------------------------------------
|                                             Table 2: Systems and Synchronization                                              |
|   System    |  Processor |        Cache       |  Synchronization|        Type      |           Features         |   Privilige |
|             |            |                    |  Support        |                  |                            |             |
|-----------------------------------------------------------------|-------------------------------------------------------------|
|             |            |                    |  SLIC gates     |  test-and-set    |  special distributed       |  kernel     |
|             |            |                    |                 |                  |  synchronization registers.|             |
|             |            |                    |                 |                  |                            |             |
|             |            |                    |                 |                  |  special bus for           |             |
|             |            |                    |                 |                  |  synchronization           |             |
|Sequent      |            |  write-through     |                 |                                                             |
|             |            |                    |                 |                  |  flush buffers             |		|
|Balance      |  NS32000   |  invalidate        |                 |                                                             |
|             |            |                    |                 |                  |  on release                |             |
|             |            |                    | ----------------|------------------+----------------------------+-------------|
|             |            |                    |  ALM            |  load-store-fixed|  special memory            |  user       |
|             |            |                    |                 |                  |  normal bus                |             |
|             |            |                    |                 |                  |  mapped                    |             |
|             |            |                    | ----------------|-------------------------------------------------------------|
|             |            |                    |  xchg           |  exchange-       |  normal memory             |  user       |
|             |            |                    |                 |   with-memory    |  normal bus                |             |
|-----------------------------------------------------------------|-------------------------------------------------------------|
|Sequent      |  386?      |  copy-back         |  parallel locks |                  |                            |  kernel     |
|Symmetry     |            |                    |                 |                  |                            |  user       |
|-------------+------------+--------------------+-----------------|------------------+----------------------------+-------------|
|SGI          |  MIPS R2000|  copy-back         |                 |  test-and-set    |  special bus               |  kernel     |
|Powerseries  |            |                    |                 |                  |  special memory            |  user       |
|             |            |                    |                 |                  |  mapped                    |             |
|-----------------------------------------------------------------|-------------------------------------------------------------|
|Alliant FX   |  custom    |  shared cache      |  CCU, CCB       |  fetch-and-add   |  special distributed       |  user (gang)|
|             |  (68000    |                    |                 |                  |  synchronization registers |             |
|             |  inst. set)|                    |                 |                  |                            |		|
|             |            |                    |                 |                  |  special bus               |             |
|             |            |                    |                 |                  |                            |             |
|             |            |                    |                 |                  |  bus based combining       |             |
|-------------+------------+--------------------+-----------------|------------------+----------------------------+-------------|
|Gould NP1    |  custom    |  write-through     |  semaphore      |  test-and-set-   |  flush local and           |  user       |
|             |            |  invalidate        |  instructions   |   arbitrary-bit  |   remote buffers           |             |
|             |            |                    |  TSM, ZSM, SSM  |                  |  changes priority          |             |
|             |            |                    |                 |                  |   on cache access          |             |
|             |            |                    | ----------------|-------------------------------------------------------------|
|             |            |                    |  queue ops      |  complex         |                            |  user       |
|-------------+------------|--------------------+-----------------+------------------+----------------------------+--------------
|Encore       |  NS32x32   |  copy-back         |                 |  load-store-fixed|  special bus               |  user       |
|Multimax     |            |                    |                 |                  |   transaction              |             |
|             |            |                    |                 |                  |  normal memory             |             |
|             |            |                    |                 |                  |  changed processor         |             |
|             |            |                    |                 |                  |   semantics                |             |
|-------------|------------|--------------------|-----------------|------------------|----------------------------|-------------|
|             |            |  trap, interrupts  |                 |                  |                            |		|
|DECSTATION   |            |                    |                 |                  |                            |             |
|             |            |   blocked in kernel|                 |                  |                            |		|
|3100 (MACH)  |  MIPS      |                    |  test-and-set   |  uniprocessor    |  user                      |             |
|             |            | -------------------|                 |                  |                            |		|
|             |            |  Lamport           |                 |                  |                            |             |
|-------------|------------|--------------------|-----------------|------------------|----------------------------|-------------|
|NYU          |            |  write-back        |                 |  fetch-and-add   |  non-bus based             |  user       |
|Ultracomputer|            |                    |                 |                  |  combining in              |             |
|             |            |                    |                 |                  |   interconnect             |             |
---------------------------------------------------------------------------------------------------------------------------------





1.3.  Table 3: Cache Interfaces and Synchronization
- -   ----- -  ----- ---------- --- ---------------
   
   ------------------------------------------------------
 |     Table 3: Cache Interfaces and Synchronization   |
 |    Interface    |      Actions    |     Comments    |
 |-----------------------------------------------------+
    SPUR    cache  |   TEST AND SET  |   Modified
 |                         -   -                       |
 |  controller     |   performed  in |   NuBus.        |
 |                 |   cache         |                 |
  -----------------+-----------------+------------------
 |  Austek A38152  |   bypasses      |   80x86   cache |
 |  Microcache     |   cache invali- |   controller    |
 |                 |   dates  cached |                 |
 |                 |   copies     of |                 |
 |                 |   lock          |                 |
 |-----------------------------------------------------+
 |  Encore NanoBus |   Converts      |                 |
 |                 |   NS32xxx   set |                 |
 |                 |   bit    inter- |                 |
 |                 |   locked   into |                 |
 |                 |   load-store-   |                 |
 |                 |   fixed         |                 |
  ------------------------------------------------------



1.4.  Table 4: Busses and Synchronization
- -   ----- -  ------ --- ---------------

----------------------------------------------------------------------------------
|                      Table 4: Busses and Synchronization                       |
|      Bus      |           Synchronization Support          |      Comments     |
+--------------------------------------------------------------------------------+
| EISA          |   LOCK signal                              |                   |
----------------+--------------------------------------------+--------------------
| Encore        |   special      separate                    |   Memory  cards   |
| nanoBus       |   address/data   return                    |   perform         |
|               |   operation to  acquire                    |   load-store-     |
|               |   lock                                     |   fixed           |
|               |                                            |                   |
                |                                            |                   |
|               |                                            |        inter-     |
|               |                                            |   locked opera-   |
|               |                                            |   tions     can   |
|               |                                            |   overlap         |
+--------------------------------------------------------------------------------+
| NuBus         |   non-preemptive    re-                    |   Based on  bus   |
|               |   arbitration                              |   scheduling      |
----------------+--------------------------------------------+--------------------
| VSB           |   LOCK* signal                             |   INTRASEQs       |
|                                                              ------------------+
|               |                                            |   Higher  level   |
|               |                                            |   protocols re-   |
|               |                                            |   quired    for   |
|               |                                            |   INTERSEQs       |
----------------+--------------------------------------------+--------------------
| Futurebus     |   locked transactions                      |   Busy unlocking  |
|               |                                            |   multiple slaves |
+--------------------------------------------------------------------------------+
| Futurebus+    |   locked transactions                      |   in-order        |
                | -------------------------------------------+--------------------
|               |   Lock operations:                         |   out-of-order    |
|               |    exchange                                |                   |
|               |    fetch and add                           |                   |
|               |    compare and swap                        |                   |
+--------------------------------------------------------------------------------+
| VME           |   non-preemptive                           |   bus lock        |
                | -------------------------------------------+--------------------
|               |   continuous address strobe                |   interpreted     |
|               |                                            |   as a resource   |
|               |                                            |   lock in  some   |
|               |                                            |   systems         |
|                 ---------------------------------------------------------------+
|               |   non-standard LOCK signal on reserved pin |   resource lock   |
----------------------------------------------------------------------------------


1.5.  Table 5: Papers
- -   ----- -  ------

----------------------------------------------------------------------------------------
|                                   Table 5: Papers                                    |
|      Authors       |      Synchronization Technique     |           Comments         |
+--------------------------------------------------------------------------------------+
| VMP                |   software snoop actions           |   implemented              |
|                    |   lock release bus signal          |                            |
---------------------+------------------------------------+-----------------------------
| Rudolph and Segall |   test-and-test-and-set            |   ideal                    |
|                        test-and-test-and-set                coded - with race        |
+--------------------------------------------------------------------------------------+
  Bitar and Despain  |   lock states in cache protocol    |
---------------------+------------------------------------+-----------------------------
| Eggers                 two write brodcasts                  after Rudolph and Segall |
|                        then write invalidate                                         |
+--------------------------------------------------------------------------------------+
  Goodman et al      |   queues through empty cache lines |   multiple busses
---------------------+------------------------------------+-----------------------------
| Anderson               burst on lock release                software solutions       |
|                                                             backoff                  |
+--------------------------------------------------------------------------------------+
  Dubois and Briggs  |   LOADSYNC, CONDSTORE              |   use snooper
---------------------+------------------------------------+-----------------------------
| Glew and Hwu           test-and-test-and-set                bursts eliminated        |
|                    |   with bus abandonment             |                            |
----------------------------------------------------------------------------------------




--
Andy Glew, aglew@uiuc.edu

aglew@dwarfs.csg.uiuc.edu (Andy Glew) (06/14/90)

By the way, I am maintaining the paper "A Survey of Synchronization
Primitives" from which the previously posted comparative tables are
extracted.
    "Maintaining" means that I am trying to keep it reasonably up to
date and complete; ideally it would get distributed once every year or
two with the latest updates.

If you know the details of the synchronization primitives of systems
that are not in the tables or the paper, or if I have made any errors,
I'd appreciate learning about them.

[Ideally]
    Send me a technical reference manual for the system cpu and/or cache
    and/or bus and/or memory?  I need all or most of these because the aspects
    that I'm interested in involve interactions between all of the
    components. (Eg. many 68000 based *systems* do not implement CAS)
    But any single manual helps.

    (Well, you can't blame me for trying).

[Less ideally]
    Tell me where to send for such a reference.  (If it costs money I'm
    unlikely to be able to afford it, but the address or phone may be
    useful.  I'm getting good at begging for free info.  If you had a copy
    to lend I'd send it back to you when I'm done)

[Least ideally]
    Give me the scoop yourself?  I usually don't quote information
    received by hearsay or email, because I've found it t be less than
    reliable - even when the engineer who designed the part is talking!
    But your description might (1) sensitize me to things I should be
    looking for if I get the real stuff, and (2) tell me about things that
    aren't in the documentation.

I'm particularly interested in:

a) what the atomic instructions actually are
b) what bus transactions are actually produced
c) are the bus transactions split?
d) how is atomicity maintained?
    is the bus exclusively locked throughout?
    or is there a lock maintained at the memory controller?
e) does your atomic operation use the processor cache,
    or just bypass it?
    does it invalidate other caches?
    does it invalidate its own cache?
f) are there conditions where the atomic operation 
    can short circuit without going through a full RMW?
    after reading the cache?
    after going to the bus?
g) how are bus transactions scheduled?
h) once a processor requests the bus for the atomic operation,
    can it abandon its request?

(The last point is important. If you can abandon a pending bus request
the time for a lock transfer in a test-and-test-and-set spinloop goes
to O(1) from O(n), or even O(n^2) if the bus scheduling is, eg., fixed
priority, where n is the number of processors).


    
--
Andy Glew, aglew@uiuc.edu