[net.arch] Long-winded cache coherence

mzp@uicsg.UUCP (04/26/84)


[Graduate.  Graduate.  Graduate.  Graduate.  AAARRRGGHH!]

  Ah, cache coherence, a subject near and dear to my heart, especially now
that my master's is almost completed.  Well, I have devoted a lot of research
to this, and there seem to be a bunch of schools of thought:

  1.  The old IBM 3081 type, with a central store controller, keeping track
	  of the directories of each processor.  Grotesque if you want more than
	  2 processors.  In fact, the 4-processor 3084 uses 2 of these things
	  interconnected.  Yuck.  Algorithms are pretty straightforward.
  2.  Global directories.  I hate these things.  (sorry about the personal
	  bias creeping into my list here)  Associate between 2 and N+1 bits with
	  each block in main memory, telling whether it is resident in any
	  processor, possibly exactly where.  Algorithms are also obvious.
  3.  Bus type solutions.  I like these a lot better, especially when 
	  microprocessors are involved.  True, you don't get the possibility
	  of various interconnection networks, but you don't have to pay for
	  them either!  Personally, I would rather have a system that I can
	  just plug boards into.  But then, I feed on micros...

But anyway, here is what we have done, as food for thought.  Given a bus
of some reasonable width and a bunch of slots.  Each processor board that
plugs in has a processor, cache, cache controller, and one segment of a
distributed bus arbiter (built into the cache controller).  You still need
some local memory (ROM, please) for bootstrapping.  Every cached block has
one of 4 statuses:
	Invalid
	Exclusive, Unmodified (held by no other cache, equivalent to memory copy)
	Exclusive, Modified (written since being loaded)
	Shared, Unmodified (held by more than one cache, probably)
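For the people who like to see things written down concretely, here is the
state set as a tiny Python sketch (the enum and its names are mine, purely
for illustration - we obviously don't build hardware out of Python):

```python
from enum import Enum, auto

class BlockState(Enum):
    """The four per-block cache states described above (names are mine)."""
    INVALID = auto()               # no valid copy in this cache
    EXCLUSIVE_UNMODIFIED = auto()  # only cached copy, same as memory's copy
    EXCLUSIVE_MODIFIED = auto()    # only cached copy, written since loaded
    SHARED_UNMODIFIED = auto()     # (probably) held by more than one cache
```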

On a read operation, you just read.  If you miss, send a request for the
block over the bus.  Each cache has 2 directories, the second one is devoted
almost exclusively to handling external requests.  While all the caches are
looking up the request, main memory is fetching and OUTPUTTING the block.
When all caches have looked up, some control signals determine whether the
requestor can just take the block off the bus as it stands.  Otherwise, one
of the responders has to preempt the main directory/data and send the block
itself, tristating the memory output.  The end effect of all this garbage is
that the requestor gets the most current copy of the block.  If it came from
cache, the block gets marked Shared, otherwise, Exclusive.  Simple as that.
If the copy in the other cache was modified, it is invalidated in that cache
and marked modified in the requesting cache.  Now, you say, what happens
when the other cache evicts its copy of the block?  Nothing, I say.  Who
cares?  You get some spurious invalidations, that's all, but all you waste
is a bus request - no one actually gets preempted by those invalidates.
Besides, this isn't going to happen very often.
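To make the read-miss resolution above concrete, here is a rough Python
sketch of the state transitions (the state abbreviations, the function, and
the dict-of-caches representation are all mine, just for illustration; the
real thing is control signals on a bus, not function calls):

```python
# States: 'I' invalid, 'EU' exclusive-unmodified, 'EM' exclusive-modified,
# 'SU' shared-unmodified (abbreviations are mine).

def read_miss(others):
    """Resolve a read miss as described above.

    `others` maps cache id -> that cache's state for the requested block.
    Returns (new state for the requester, updated states for the others).
    """
    holders = [c for c, s in others.items() if s != 'I']
    if not holders:
        # No cache responds: the requestor takes the block straight off
        # the bus from memory and is now the only holder.
        return 'EU', dict(others)
    if any(others[c] == 'EM' for c in holders):
        # A responder holds a modified copy: it preempts the memory output,
        # sends the block itself, and invalidates its own copy; the
        # requester now owns the dirty block.
        return 'EM', {c: 'I' for c in others}
    # Clean copy supplied from a cache: everybody holding it is now Shared.
    return 'SU', {c: ('SU' if s != 'I' else 'I') for c, s in others.items()}
```

So `read_miss({'B': 'EM'})` hands the dirty block to the requester and
invalidates B's copy, while `read_miss({'B': 'EU'})` leaves both caches
holding it Shared.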

On a write operation, if the block is exclusive, you write it and set it to
modified.  No invalidate.  No nonsense.  If the block is shared, you have to
send an invalidate to everybody else.  On a write miss, things go pretty much
the same as for read misses, except that the responder must always mark the
block invalid.
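The write side is even shorter.  Same caveats as before - this little
function is my own sketch of the transitions, not our implementation:

```python
# States: 'I' invalid, 'EU' exclusive-unmodified, 'EM' exclusive-modified,
# 'SU' shared-unmodified (abbreviations are mine).

def write(my_state, others):
    """Handle a write as described above.

    `my_state` is the writer's state for the block; `others` maps cache id
    -> state.  Returns (writer's new state, updated states for the others).
    """
    if my_state in ('EU', 'EM'):
        # Exclusive: just write and mark modified.  No invalidate,
        # no nonsense, no bus traffic.
        return 'EM', dict(others)
    if my_state == 'SU':
        # Shared: send an invalidate to everybody else, then write.
        return 'EM', {c: 'I' for c in others}
    # Write miss: handled like a read miss, except any responder always
    # marks its copy invalid, so the writer ends up with the dirty block.
    return 'EM', {c: 'I' for c in others}
```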

Some miscellaneous considerations here.  First of all, you ask, "What about
dual directory interference?"  There isn't any.  All the dual directory has
to keep track of is whether the block is in the cache or not (1 bit).  Since
this information can only change when blocks are being loaded in from external
sources or when we are processing a successful invalidate, there isn't ever
any contention.  The bus master always has control over the dual directories.
The bus arbitration is daisy-chained, with a synchronous block transfer cycle.
On the last bus cycle you are using the bus for the current operation, you
send a release signal to all other processors.  The serial arbitration proceeds
during this bus cycle and the next bus master is decided in time for him to
take charge.  You also need a serial arbitration mechanism to decide who is
going to transmit a block if there are multiple copies.  This, as mentioned
above, happens during main memory lookup.
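The daisy-chained arbitration boils down to: the grant ripples down the
chain and stops at the first slot that wants the bus.  A trivial sketch
(the fixed slot-order priority here is my simplifying assumption - it just
shows the serial-priority idea, not our actual arbiter logic):

```python
def daisy_chain_grant(requesting, n_slots):
    """Serial (daisy-chained) arbitration: the grant propagates slot by
    slot and is captured by the first requesting slot it reaches.

    `requesting` is the set of slot indices asserting a bus request;
    returns the winning slot, or None if nobody wants the bus.
    """
    for slot in range(n_slots):  # grant enters at slot 0, ripples down
        if slot in requesting:
            return slot          # this slot captures the grant
    return None
```

In the scheme described above, this resolution overlaps the release cycle
(or the main-memory lookup, for choosing a responder), so the serial delay
doesn't cost any extra bus cycles.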

We have done some performance analysis based
on what we think is a pretty reasonable model (miss rate, write frequency,
1st write frequency, %sharing, writeback probability, transfer times), and
the approximate analysis agrees nicely with a simulation of the model.  What
we find is that it is not unreasonable to get 8x uniprocessor performance
with a 5% miss ratio.  Cut the miss ratio and things get very nice.  All the
graphs go up to N=20 (!).  It gets to the point where bus contention is no
longer the problem.  Even though it takes 5, 10, even 20 bus cycles to get
bus mastership, when the miss ratio is very small the penalty is not
significant.
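For flavor, here is a crude back-of-envelope speedup model - emphatically
mine, not the analysis from the paper, and it ignores queueing, writebacks,
and invalidation traffic.  Each processor puts `miss_ratio` bus transactions
per reference on the bus, each occupying it for `bus_cycles` cycles; once
aggregate demand saturates the bus, the bus is the bottleneck:

```python
def speedup(n, miss_ratio, bus_cycles, cycles_per_ref=1.0):
    """Saturation-bound speedup estimate for n processors on one bus.

    This is only an illustrative upper bound: below saturation every
    processor runs at full speed; above it, throughput is bus-limited.
    """
    bus_demand = n * miss_ratio * bus_cycles / cycles_per_ref
    return n if bus_demand <= 1.0 else n / bus_demand
```

Even a toy model like this shows the point made above: shrink the miss
ratio and the bus stops being the problem, so near-linear speedup holds
out to surprisingly large n.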

Problems?  I/O, particularly fast I/O, is a major bitch.  There are a few
ways of resolving this.  The gross way is to have processor control of the
cache.  Nicer ways involve cached I/O, using either this scheme or a variation
(like write-through).  You are permitted to have write-through caches in our
system, just as long as they conform to the bus protocol.

What does all this mean?  I think it means a lot.  If I didn't think so, I
wouldn't be blabbing it all over the net.  If I sound incoherent (:-), it is
because I am incredibly tired and busy working on the final version of all
this.  Any questions, comments, suggestions, whatever would be welcomed.
Personally, I believe that this has a lot of promise for high-performance
multi-micros, and nothing would delight me more than for someone to do
something with it.  Look for it in Proc IEEE Comp Arch, this June.

Mark Papamarcos		ihnp4!uiucdcs!uicsg!mzp
'It sure feels good to be able to explain this stuff in my "normal" voice.'
'Damn, but you sure talk funny.'