mzp@uicsg.UUCP (04/26/84)
[Graduate. Graduate. Graduate. Graduate. AAARRRGGHH!]

Ah, cache coherence, a subject near and dear to my heart, especially now that my master's is almost completed. Well, I have devoted a lot of corporate research to this, and there seem to be a bunch of schools of thought:

1. The old IBM 3081 type, with a central store controller keeping track of the directories of each processor. Grotesque if you want more than 2 processors. In fact, the 4-processor 3084 uses 2 of these things interconnected. Yuck. The algorithms are pretty straightforward.

2. Global directories. I hate these things. (Sorry about the personal bias creeping into my list here.) Associate between 2 and N+1 bits with each block in main memory, telling whether it is resident in any processor's cache, and possibly exactly which. The algorithms are also obvious.

3. Bus-type solutions. I like these a lot better, especially when microprocessors are involved. True, you don't get the possibility of various interconnection networks, but you don't have to pay for them either! Personally, I would rather have a system that I can just plug boards into. But then, I feed on micros...

Anyway, here is what we have done, as food for thought. Given a bus of some reasonable width and a bunch of slots, each processor board that plugs in has a processor, a cache, a cache controller, and one segment of a distributed bus arbiter (built into the cache controller). You still need some local memory (ROM, please) for bootstrapping.

Every cached block has one of 4 statuses:

    Invalid
    Exclusive, Unmodified  (held by no other cache; equivalent to the memory copy)
    Exclusive, Modified    (written since being loaded)
    Shared, Unmodified     (held by more than one cache, probably*)

On a read operation, you just read. If you miss, send a request for the block over the bus. Each cache has 2 directories; the second one is devoted almost exclusively to handling external requests.
While all the caches are looking up the request, main memory is fetching and OUTPUTTING the block. When all the caches have looked up, some control signals determine whether the requestor can just take the block off the bus as it stands. Otherwise, one of the responders has to preempt the main memory directory/data and send the block itself, tristating the memory output. The end effect of all this garbage is that the requestor gets the most current copy of the block. If it came from a cache, the block gets marked Shared; otherwise, Exclusive. Simple as that. If the copy in the other cache was modified, it is invalidated in that cache and marked modified in the requesting cache.

Now, you say, what happens when the other cache evicts its copy of the block? Nothing, I say. Who cares? You get some spurious invalidations, that's all, but all you waste is a bus request -- no one actually gets preempted by those invalidates. Besides, this is going to happen very often.

On a write operation, if the block is exclusive, you write it and set it to modified. No invalidate. No nonsense. If the block is shared, you have to send an invalidate to everybody else. On a write miss, things go pretty much the same as for read misses, except that the responder must always mark the block invalid.

Some miscellaneous considerations here. First of all, you ask, "What about dual directory interference?" There isn't any. All the dual directory has to keep track of is whether the block is in the cache or not (1 bit). Since this information can only change when blocks are being loaded in from external sources or when we are processing a successful invalidate, there isn't ever any contention. The bus master always has control over the dual directories.

The bus arbitration is daisy-chained, with a synchronous block transfer cycle. On the last bus cycle you are using the bus for the current operation, you send a release signal to all the other processors.
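To keep the state transitions straight, here is a toy software sketch of the protocol as described so far -- read hits/misses, supplier caches, and write invalidations. All the class and method names are mine, invented for illustration; it models none of the bus timing, dual directories, or arbitration, just who ends up in which state.

```python
from enum import Enum

class St(Enum):
    INV = 0        # Invalid
    EX_UNMOD = 1   # Exclusive, Unmodified
    EX_MOD = 2     # Exclusive, Modified
    SH_UNMOD = 3   # Shared, Unmodified

class Cache:
    def __init__(self):
        self.state = {}                     # block address -> St

    def lookup(self, addr):
        return self.state.get(addr, St.INV)

class Bus:
    """Toy single-bus model: one Cache per plugged-in board, memory implicit."""
    def __init__(self, n):
        self.caches = [Cache() for _ in range(n)]

    def read(self, i, addr):
        me = self.caches[i]
        if me.lookup(addr) is not St.INV:
            return                          # read hit: just read
        # read miss: broadcast the request; every other cache snoops it
        holders = [c for j, c in enumerate(self.caches)
                   if j != i and c.lookup(addr) is not St.INV]
        if not holders:
            me.state[addr] = St.EX_UNMOD    # block came from memory
        elif any(c.lookup(addr) is St.EX_MOD for c in holders):
            # modified copy elsewhere: that cache supplies the block and
            # invalidates it; the requestor takes it over as modified
            for c in holders:
                c.state[addr] = St.INV
            me.state[addr] = St.EX_MOD
        else:
            # clean copy elsewhere: supplier and requestor both go Shared
            for c in holders:
                c.state[addr] = St.SH_UNMOD
            me.state[addr] = St.SH_UNMOD

    def write(self, i, addr):
        me = self.caches[i]
        if me.lookup(addr) in (St.EX_UNMOD, St.EX_MOD):
            me.state[addr] = St.EX_MOD      # exclusive: no bus traffic at all
            return
        # shared hit or write miss: invalidate every other copy over the bus
        for j, c in enumerate(self.caches):
            if j != i:
                c.state.pop(addr, None)     # responders mark the block invalid
        me.state[addr] = St.EX_MOD
```

Run a read from a second board after a first board has loaded a block and you can watch both copies go Shared; write from either one and the other's copy vanishes, exactly as in the prose above.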
The serial arbitration proceeds during this bus cycle, and the next bus master is decided in time for him to take charge. You also need a serial arbitration mechanism to decide who is going to transmit a block if there are multiple copies. This, as mentioned above, happens during the main memory lookup.

We have done some performance analysis based on what we think is a pretty reasonable model (miss rate, write frequency, 1st-write frequency, % sharing, writeback probability, transfer times), and the approximate analysis agrees nicely with a simulation of the model. What we find is that it is not unreasonable to get 8x uniprocessor performance with a 5% miss ratio. Cut the miss ratio and things get very nice. All the graphs go up to N=20 (!). It gets to the point where bus contention is no longer the problem. Even though it takes 5, 10, even 20 bus cycles to get bus mastership, when the miss ratio is very small the penalty is not significant.

Problems? I/O, particularly fast I/O, is a major bitch. There are a few ways of resolving this. The gross way is to have processor control of the cache. Nicer ways involve cached I/O, using either this scheme or a variation (like write-through). You are permitted to have write-through caches in our system, just as long as they conform to the bus protocol.

What does all this mean? I think it means a lot. If I didn't think so, I wouldn't be blabbing it all over the net. If I sound incoherent (:-), it is because I am incredibly tired and busy working on the final version of all this. Any questions, comments, suggestions, whatever would be welcomed. Personally, I believe that this has a lot of promise for high-performance multi-micros, and nothing would delight me more than for someone to do something with it. Look for it in Proc. IEEE Comp Arch, this June.

Mark Papamarcos
ihnp4!uiucdcs!uicsg!mzp

'It sure feels good to be able to explain this stuff in my "normal" voice.'
'Damn, but you sure talk funny.'
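P.S. A back-of-envelope way to see the miss-ratio point above. This is my own toy saturation bound, NOT the model from our analysis: charge each reference one processor cycle plus, on a miss, some number of bus cycles, and cap the speedup by whichever runs out first -- the processors or the single shared bus.

```python
def bus_bound_speedup(n, miss_ratio, bus_cycles_per_miss):
    # Toy bound, for illustration only (not the model from the paper).
    # Each reference costs 1 processor cycle and, on a miss,
    # bus_cycles_per_miss cycles of bus occupancy.
    bus_per_ref = miss_ratio * bus_cycles_per_miss   # bus demand per reference
    # The single bus can keep at most this many processors busy:
    bus_cap = (1 + bus_per_ref) / bus_per_ref
    return min(n, bus_cap)

# e.g. 5% misses at 4 bus cycles each caps speedup around 6x no matter
# how many boards you plug in; cut the miss ratio and the cap climbs fast.
```

Crude as it is, it shows the shape of the result: once the miss ratio gets small, the bus stops being the bottleneck and adding boards keeps paying off.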