jeff@Alliant.COM (Jeff Collins) (04/29/88)
Here's an interesting question for the SPARC gurus of the world:  Given that the SPARC must use a virtual cache to get optimal performance, how does one build a multiprocessor with a SPARC?  As far as I know, no one has solved the virtual cache coherency problem yet...
guy@gorodish.Sun.COM (Guy Harris) (04/29/88)
> Given that the SPARC must use a virtual cache to get optimal
> performance, how does one build a multiprocessor with a SPARC?

Excuse me?  Where is it "given that the SPARC must use a virtual cache to get optimal performance"?  It is the case that the SPARC requires some form of very fast memory to get optimal performance, but that's true of a hell of a lot of machines these days.  The virtual cache permits you to bypass the MMU on a cache hit, but I don't know that this is *required* for SPARC - or for any of a number of other microprocessor architectures.

It may be the case that with the current SPARC implementations, with no on-chip MMU, it's easier to get high performance with a virtual cache than with a physical cache, or even that you can't get optimal performance *for those implementations* with a physical cache (although I suspect the latter is not true).  This certainly doesn't say that the SPARC *architecture* requires a virtual cache.

Another way of putting this is that I have no particular reason to believe that all high-end SPARC machines built by Sun will have virtual caches (it is already the case that not all SPARC machines built by Sun have virtual caches; the 4/110 SCRAM memory acts more like a physical cache than a virtual one).
gert@prls.UUCP (Gert Slavenburg) (04/29/88)
> As far as I know, no one has solved the virtual cache coherency
> problem yet...

Coherency can be maintained in a virtual address cache in exactly the same way as it is maintained among the physical caches of a multiprocessor: pick your favorite 'bus watch' style multiprocessor consistency protocol.  Now use this protocol at the back (bus) end of a virtual address cache, where requests go out in terms of physical addresses, applying it TO YOURSELF AS WELL AS TO OTHERS.  For example, if you have an ownership-like protocol, acquiring ownership of a main memory slot (the memory unit that goes into a cache line) requires 'negotiation' with other caches that hold a copy of this slot, including yourself if you hold such copies under different virtual address names.

In order to implement this, the virtually addressed cache needs to be able to 'observe' (and/or 'interact with', in some ownership schemes) bus transactions that occur on those physical addresses of which this cache holds copies.  Again, this can be done in a variety of ways, either involving a double set of tags or some form of reverse address translation (this may be a one-to-many mapping).

The architecture of a multiprocessor with virtual address caches that applies the above ideas in some unconventional ways was presented at the 13th Symposium on Computer Architecture in Tokyo (1986) (`Software-controlled caches in the VMP Multiprocessor', by D.R. Cheriton, G.A. Slavenburg and P.D. Boyle).  At this year's 15th Symposium on Computer Architecture in Hawaii, measurement results from the system, as in operation at Stanford University, will be presented.

Just thought you might like to know that this problem has been solved,

Gerrit A. Slavenburg
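[Editor's sketch]  The dual-tag scheme Slavenburg describes can be modeled in a few lines of Python.  This is a toy model, not VMP: the class and function names (VirtualCache, bus_write, etc.) are invented for illustration.  The cache is looked up by virtual address, but it also remembers each line's physical tag, so the bus watcher can drop every copy of a physical slot - including the cache's own aliases of it.

```python
class VirtualCache:
    def __init__(self, translate):
        self.translate = translate    # virtual -> physical mapping (the MMU)
        self.lines = {}               # virtual addr -> cached data
        self.ptags = {}               # virtual addr -> physical tag

    def load(self, vaddr, memory):
        # Hit path uses only the virtual address; the physical tag is
        # recorded at fill time for the benefit of the bus watcher.
        if vaddr not in self.lines:
            paddr = self.translate(vaddr)
            self.lines[vaddr] = memory[paddr]
            self.ptags[vaddr] = paddr
        return self.lines[vaddr]

    def snoop_invalidate(self, paddr):
        # Bus watch: invalidate every copy of this physical slot,
        # under *any* virtual name.
        stale = [v for v, p in self.ptags.items() if p == paddr]
        for v in stale:
            del self.lines[v]
            del self.ptags[v]

def bus_write(caches, memory, paddr, value):
    memory[paddr] = value
    for c in caches:                  # the writer snoops itself as well
        c.snoop_invalidate(paddr)
```

Note that applying the invalidation "TO YOURSELF AS WELL AS TO OTHERS" is what makes virtual-address synonyms within one cache come out consistent.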
walter@garth.UUCP (Walter Bays) (04/30/88)
In article <1671@alliant.Alliant.COM> jeff@alliant.UUCP (Jeff Collins) writes:
> Given that the SPARC must use a virtual cache to get optimal
> performance, how does one build a multiprocessor with a SPARC?
> As far as I know, no one has solved the virtual cache coherency
> problem yet...

Clipper uses 'bus watch' to invalidate references to stale data, when used in multiprocessor (including CPU-IOP) modes, and when using 'copy-back' (as opposed to write-through) cache modes.  The newly announced Motorola 88000 uses a similar scheme, called 'bus snoop'.

With SPARC, Clipper without CAMMU chips, or 88000 without CAMMU chips, you implement your own cache, and can build whatever you choose.
-- 
------------------------------------------------------------------------------
Any similarities between my opinions and those of the person who signs
my paychecks is purely coincidental.
E-Mail route: ...!pyramid!garth!walter
USPS: Intergraph APD, 2400 Geng Road, Palo Alto, California 94303
Phone: (415) 852-2384
------------------------------------------------------------------------------
jeff@Alliant.COM (Jeff Collins) (04/30/88)
In article <51321@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes:
>> Given that the SPARC must use a virtual cache to get optimal
>> performance, how does one build a multiprocessor with a SPARC?
>
>Excuse me?  Where is it "given that the SPARC must use a virtual cache to get
>optimal performance?"  It is the case that the SPARC requires some form of very
>fast memory to get optimal performance, but that's true of a hell of a lot of
>machines these days.  The virtual cache permits you to bypass the MMU on a
>cache hit, but I don't know that this is *required* for SPARC - or for any of a
>number of other microprocessor architectures.
>
>It may be the case that with the current SPARC implementations, with no on-chip
>MMU, that it's easier to get high performance with a virtual cache than with a
>physical cache, or even that you can't get optimal performance *for those
>implementations* with a physical cache (although I suspect the latter is not
>true).  This certainly doesn't say that the SPARC *architecture* requires a
>virtual cache.

	Sorry, Guy, but you misunderstood the purpose of my posting.  I was not attempting to attack the SPARC.  I am simply attempting to see if anyone has given any thought to the problem of using the SPARC with a virtual cache in a multiprocessor.  It is my understanding (and I admit that you are closer to the issue than I am) that it is not easy, and may not be possible, to put a virtual-to-physical translation before the SPARC's cache and still get no-wait-state performance.  (Please note that I didn't say that it was impossible to do this, only that the performance would be degraded.)  If one looks at other current microprocessors (NS32532, Mot. 88000, MIPS 3000, etc.), they all have standard MMUs available with the part and a fairly credible story of how to make them perform in a multiprocessor.
	There is no standard MMU (to my knowledge) available for the SPARC, and at least the current implementations seem to have a major drawback when attempting to put them in a multiprocessor.  The question still remains - if you want to run the SPARC with a virtual cache in a multiprocessor, how do you do it?  Other questions that get to the same issue are:

	- How does one get no-wait-state performance in a multiprocessor
	  using the SPARC?
	- Is there a standard MMU going to be available, and how will it
	  work with a multiprocessor?
	- Is there a solution to the virtual cache problems?
	- Is there an announced version of the SPARC that allows time
	  between address-ready and data-ready to have an MMU before the
	  cache?
mat@amdahl.uts.amdahl.com (Mike Taylor) (04/30/88)
In article <1671@alliant.Alliant.COM>, jeff@Alliant.COM (Jeff Collins) writes:
> Here's an interesting question for the SPARC gurus of the world:
>
> Given that the SPARC must use a virtual cache to get optimal
> performance, how does one build a multiprocessor with a SPARC?
>
> As far as I know, no one has solved the virtual cache coherency
> problem yet...

Amdahl machines starting with the 580 series in 1982 have used a coherent virtual cache.
-- 
Mike Taylor                   ...!{ihnp4,hplabs,amdcad,sun}!amdahl!mat
[ This may not reflect my opinion, let alone anyone else's. ]
dre%ember@Sun.COM (David Emberson) (04/30/88)
1) There is nothing inherent in the SPARC architecture that requires the use of a virtual cache to obtain high performance.

2) It is possible to implement snoopy caches with caches that appear to the processors as virtual caches.  'Nuff said.

3) We have a very good architectural solution to this problem for SPARC which we are developing now.  We are not yet prepared to go public with it, but it has been presented to some selected customers on a non-disclosure basis.  It is somewhat frustrating to us not to have an announced solution, but we'd rather do it right than do something that does not scale well (yes, I did get my hands on the 88200 data sheet!).  All I can say is that your patience will be rewarded.

4) When the pieces are in place, you can bet that the technology will be open (i.e. available to all) and will embrace the concept of standardization that is so dear to us here at Sun.  In other words, it is the stated policy of the company that we will make the components available so that if you want to build a SPARC machine--even one that competes with ours--Sun will encourage you to do so.  A license from Sun is not required to use SPARC components.

I may be shot for talking about this stuff, but I think it is important to the success of SPARC that Sun not be perceived as ignoring this important technology.  Support for multiprocessors has already been announced as a deliverable item in Phase III of the AT&T-Sun Unix merge.

			Dave Emberson (dre@sun.com)
guy@gorodish.Sun.COM (Guy Harris) (04/30/88)
> Sorry, Guy, but you misunderstood the purpose of my posting.  I was not
> attempting to attack the SPARC.

No, I didn't.  I merely pointed out that you were, as far as I could tell, making an unwarranted assumption at the start of your discussion; this assumption unnecessarily colored the rest of your discussion.  David Emberson of Sun has noted that there is nothing in the *architecture* that demands a virtual cache.  Current *implementations* may make a virtual cache the best, or only, way to get maximum performance, but that's a different matter.  If you were, in fact, referring to the current chips, rather than the SPARC architecture, I apologize.

> I am simply attempting to see if anyone has given any thought to
> the problem of using the SPARC with a virtual cache in a
> multiprocessor.

David Emberson has also already given an answer to this question; the answer is "yes, Sun has".  Unfortunately, as he indicated, he's not in a position to discuss it in detail now.
bcase@Apple.COM (Brian Case) (05/01/88)
In article <1671@alliant.Alliant.COM> jeff@alliant.UUCP (Jeff Collins) writes:
> Here's an interesting question for the SPARC gurus of the world:
>
> Given that the SPARC must use a virtual cache to get optimal
> performance, how does one build a multiprocessor with a SPARC?
>
> As far as I know, no one has solved the virtual cache coherency
> problem yet...

Read about the SPUR project being done at Berkeley.  They have a large virtual cache, in-cache address translation (only done on misses), and the system concept is a small (about 10 processors) multiprocessor.  Some neat ideas.
brooks@lll-crg.llnl.gov (Eugene D. Brooks III) (05/01/88)
In article <1680@alliant.Alliant.COM> jeff@alliant.UUCP (Jeff Collins) writes:
> - Is there an announced version of the SPARC that allows time
>   between address-ready and data-ready to have an MMU before the
>   cache?

This is the key to bringing these RISC chips into the realm of real supercomputing, and it will of course happen, as it will be driven by market pressures.  You need to allow an "arbitrary" time between address-ready and data-ready to allow successful use in a multiprocessor environment where the latency of memory is more or less undetermined due to conflicts in the shared memory subsystem.

Basically, the address-ready lines include a tag which identifies the request; when the response to the request returns, the copy of the tag that arrives with it allows the cpu to figure out what to do with the data.  The number of tag bits limits the number of outstanding requests for the cpu.  One is likely to sequence the tag bits in order for speed, so the number of outstanding requests will be further limited by fluctuations in the arrival order of the responses.  If the request gets satisfied by the cache it comes back with a low latency, but if it goes to main memory (shared memory in a multiprocessor) it might have a substantial latency.  By allowing many requests to be pending at once, one can get "no wait state" performance in the same sense that internal pipelining delivers "no wait state" performance for the internal cpu functions.  The cpu must be able to efficiently handle the fact that requests come back out of order, which means that simple FIFOs a la the WM machine won't do.

Rest assured that some future SPARC, MOT88000, Clipper, ..., implementation will provide this capability by some means, as it will be the only way to further increase performance in the face of the memory latency of a shared memory multiprocessor.
Whether MIPS will pull this off is not clear; their basic design principle (religion) of not having hardware interlocks would seem to be at odds with doing it.  In short, they will find that their basic design principle was wrong when they try to pipeline cache misses in a shared memory environment.
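[Editor's sketch]  The tagged-request scheme Brooks describes can be modeled in a few lines of Python.  The class names are invented for illustration, and the "bus" simply replays responses in an arbitrary order, to show that it is the tag, not the arrival order, that routes each datum to its destination register.

```python
class SplitTransactionCPU:
    def __init__(self, num_tags):
        self.free_tags = list(range(num_tags))  # tag bits limit outstanding loads
        self.pending = {}                       # tag -> destination register
        self.regs = {}

    def issue_load(self, dest_reg, addr, bus):
        tag = self.free_tags.pop(0)             # a real CPU stalls if none free
        self.pending[tag] = dest_reg
        bus.request(tag, addr)

    def accept_response(self, tag, data):
        # Responses may arrive in any order; the tag disambiguates them.
        dest = self.pending.pop(tag)
        self.regs[dest] = data
        self.free_tags.append(tag)

class ToyBus:
    def __init__(self, memory):
        self.memory = memory
        self.in_flight = []

    def request(self, tag, addr):
        self.in_flight.append((tag, addr))

    def deliver(self, cpu, order):
        # Deliver the queued responses in an arbitrary (out-of-order) order.
        for i in order:
            tag, addr = self.in_flight[i]
            cpu.accept_response(tag, self.memory[addr])
        self.in_flight = []
```

A FIFO would mis-route the data here; the per-request tag is exactly what lets the responses return in any order.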
hankd@pur-ee.UUCP (Hank Dietz) (05/01/88)
I hate to add to this pile of news, but why hasn't anyone talked about the fact that processors like SPARC are not really designed for large-scale multiprocessing, e.g., they have no provision for "hiding" big, stochastic, memory reference delays across a log n stage interconnection network, etc.?  I think it's pretty uninteresting to talk about multiprocessor systems which are small enough that "snooping caches" work; as many of you have pointed out, that's been essentially a non-problem for quite some years.  How about some discussion of what processors do when simple cache protocols aren't good enough or are not implementable?

It seems we nearly got into such a discussion when WM was brought up, but WM isn't really billed as a processor design for large-scale multiprocessors (read MIMDs), hence people didn't seem to notice that it was addressing the stochastic-delay memory reference problem so endemic to big MIMDs.  Consider all those other processors with microtasking or other out-of-order or multiple-memory-pipeline structures.  Anyone care to get a discussion along these lines going?

								-hankd
mike@arizona.edu (Mike Coffin) (05/02/88)
From article <8029@pur-ee.UUCP>, by hankd@pur-ee.UUCP (Hank Dietz):
> I hate to add to this pile of news, but why hasn't anyone talked about the
> fact that processors like SPARC are not really designed for large-scale
> multiprocessing, e.g., they have no provision for "hiding" big, stochastic,
> memory reference delays across a log n stage interconnection network, etc.?
> [...]
> Anyone care to get a discussion along these lines going?
> -hankd

One machine that attacked this problem was the Denelcor HEP.  Although it's now defunct, my impression is that its demise had little to do with technical merit.
-- 
Mike Coffin                           mike@arizona.edu
Univ. of Ariz. Dept. of Comp. Sci.    {allegra,cmcl2,ihnp4}!arizona!mike
Tucson, AZ  85721                     (602)621-4252
mbutts@mntgfx.mentor.com (Mike Butts) (05/02/88)
From article <1671@alliant.Alliant.COM>, by jeff@Alliant.COM (Jeff Collins):
> Given that the SPARC must use a virtual cache to get optimal
> performance, how does one build a multiprocessor with a SPARC?
>
> As far as I know, no one has solved the virtual cache coherency
> problem yet...

Here's another log for the fire...  Apollo says in their marketing booklet about the new DN10000 architecture (which has 1 to 4 15-MIPS RISC processors sharing main memory):

"...the Series 10000's caches incorporate the best features of both fully physical and fully virtual caches.  The result is a new *virtually indexed, physically tagged write-through cache* that lets cache RAM access proceed entirely overlapped in time with any virtual-to-physical address translation.  This overlap reduces memory access pipeline depth, guaranteeing single-cycle execution and eliminating the translation penalty typical with physically indexed designs.  The physical tags allow cache validation across processors.  The addressing scheme, based on the virtual address coupled with the physical tag, maintains coherency and allows data to be shared by multiple processors."
-- 
Mike Butts, Research Engineer         KC7IT           503-626-1302
Mentor Graphics Corp., 8500 SW Creekside Place, Beaverton OR 97005
...!{sequent,tessi,apollo}!mntgfx!mbutts OR mbutts@pdx.MENTOR.COM
These are my opinions, & not necessarily those of Mentor Graphics.
jeff@Alliant.COM (Jeff Collins) (05/03/88)
In article <623@garth.UUCP> walter@garth.UUCP (Walter Bays) writes:
>In article <1671@alliant.Alliant.COM> jeff@alliant.UUCP (Jeff Collins) writes:
>> Given that the SPARC must use a virtual cache to get optimal
>> performance, how does one build a multiprocessor with a SPARC?
>> As far as I know, no one has solved the virtual cache coherency
>> problem yet...
>
>Clipper uses 'bus watch' to invalidate references to stale data, when
>used in multiprocessor (including CPU-IOP) modes, and when using
>'copy-back' (as opposed to write-through) cache modes.  The newly
>announced Motorola 88000 uses a similar scheme, called 'bus snoop'.
>
>With SPARC, Clipper without CAMMU chips, or 88000 without CAMMU chips,
>you implement your own cache, and can build whatever you choose.

	Actually I am familiar with the Clipper and the 88000.  I know how they support multiprocessing.  The point here was that these chips put the cache after the MMU.  This means that the caches contain physical addresses and the tag stores record physical addresses.  When an address goes across the bus it is not a big deal to "watch" it and determine whether the address is in the cache or not.  Current implementations of the SPARC, on the other hand, get optimal performance using a virtual cache - i.e. the cache is BEFORE the MMU.  It is a much more complicated procedure to "watch" these addresses.  With a virtual cache, each physical address on the bus must first be translated to its corresponding virtual address through some sort of reverse TLB; then the address can be provided to the bus watcher to see if the data is in the cache.  This presents a number of problems that are not present in the case of the 88000 or the Clipper (or any other micro that uses physical caches).  One of the problems is aliasing - this is when multiple virtual addresses refer to the same physical address.  In order to properly snoop, all of the aliases must be checked for in the cache.
	Another problem is simply the implementation of the reverse TLB itself - what happens when the reverse TLB misses?  Which set of page tables should the TLB walk?
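[Editor's sketch]  The "reverse TLB" Collins describes is just a one-to-many map from physical pages back to virtual names: the bus watcher sees only physical addresses, so every virtual alias must be recovered and probed in the cache tags.  A minimal model (names invented for illustration):

```python
from collections import defaultdict

class ReverseTLB:
    def __init__(self):
        # physical page -> set of virtual names; one-to-many when
        # aliases exist, which is exactly what makes snooping hard
        self.p2v = defaultdict(set)

    def add_mapping(self, vaddr, paddr):
        self.p2v[paddr].add(vaddr)

    def snoop(self, paddr):
        # every alias returned here must be checked in the cache tags
        return sorted(self.p2v[paddr])
```

The unanswered questions in the posting remain: this table can miss (a physical address the table has never seen), and nothing in the hardware says whose page tables should be walked to refill it.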
walter@garth.UUCP (Walter Bays) (05/04/88)
>>In article <1671@alliant.Alliant.COM> jeff@alliant.UUCP (Jeff Collins) writes:
>>> Given that the SPARC must use a virtual cache to get optimal
>>> performance, how does one build a multiprocessor with a SPARC?
>>> As far as I know, no one has solved the virtual cache coherency
>>> problem yet...
>>[I replied about 'bus watch' citing Clipper and 88000.]

In article <1699@alliant.Alliant.COM> jeff@alliant.UUCP (Jeff Collins) writes:
> Actually I am familiar with the Clipper and the 88000.  I know how they
> support multiprocessing.  The point here was that these chips put the
> cache after the MMU.  [Good description of the problems in 'bus
> watching' with virtual caches.]

You're right.  I missed your point.  Clipper has a physical cache, while the Sun 4 SPARC has a virtual cache.
-- 
------------------------------------------------------------------------------
Any similarities between my opinions and those of the person who signs
my paychecks is purely coincidental.
E-Mail route: ...!pyramid!garth!walter
USPS: Intergraph APD, 2400 Geng Road, Palo Alto, California 94303
Phone: (415) 852-2384
------------------------------------------------------------------------------
gruger@convex.UUCP (05/04/88)
>/* Written 5:51 pm May 2, 1988 by jeff@alliant.Sun.Com
> ...  It is a much more
> complicated procedure to "watch" these addresses.  With a virtual
> cache, each physical address on the bus must first be translated to
> its corresponding virtual address through some sort of reverse TLB,
> then the address can be provided to the bus watcher to see if the data
> is in the cache.
> ...  In order to
> properly snoop, all of the aliases must be checked for in the cache.
> Another problem is simply the implementation of the reverse TLB -
> what happens when the reverse TLB misses?  Which set of page tables
> should the TLB walk?

Nah, it ain't that hard.  One can have a virtually mapped, physically tagged cache.  As the virtual cache is filled with data, the translated physical address from the MMU is written into the tag RAMs.  The physical tags are then used by the bus watcher to selectively invalidate the cache entries.

This is in fact how cache coherency is maintained in the Convex C-2 machines.  Each of the CPUs has a set of remote invalidation tag memories which watch all of the other physical address buses (3 other CPUs and one I/O).  When there is a hit, the validity bits are cleared for that particular entry in the virtually mapped cache.

Jeff Gruger (ihnp4!convex!gruger)
przemek@gondor.cs.psu.edu (Przemyslaw Klosowski) (05/04/88)
In article <1671@alliant.Alliant.COM> jeff@alliant.UUCP (Jeff Collins) writes:
> As far as I know, no one has solved the virtual cache coherency
> problem yet...

See ``An in-cache address translation mechanism'' by D.A. Wood et al., in: Proc. 13th Annual ACM/IEEE Symposium on Computer Architecture, Tokyo, June 1986.  They describe the SPUR solution to the problem.  Basically it consists of forcing the Global Virtual address (which is almost, but not quite, a virtual address) to be the same for all shared data.  Other solutions used reverse TLBs to retrieve the virtual address and then applied normal (i.e. bus snooping) techniques.  Last but not least, why not delegate the responsibility to software?  It is only a question of finding a convenient paradigm.

		przemek@psuvaxg.bitnet
		psuvax1!gondor!przemek
przemek@gondor.cs.psu.edu (Przemyslaw Klosowski) (05/04/88)
In article <51321@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes:
>> Given that the SPARC must use a virtual cache to get optimal
>> performance, how does one build a multiprocessor with a SPARC?
>
>Excuse me?  Where is it "given that the SPARC must use a virtual cache to get
>optimal performance?"

IMHO the virtual cache ($) has serious advantages, one of which is that a separate TLB (which is nothing else but a separate cache for translation data) is not necessary.  And of course the convenient overlapping of $ access with V->R translation.

		przemek@psuvaxg.bitnet
		psuvax1!gondor!przemek
jh@mancol.UUCP (John Hanley) (05/04/88)
In article <8029@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>...why hasn't anyone talked about the fact that processors like SPARC are not
>really designed for large-scale multiprocessing, e.g., they have no provision
>for "hiding" BIG, stochastic, memory reference delays across a log n stage
>interconnection network, etc.?  I think it's pretty uninteresting to talk about
>multi-processor systems which are small enough that "snooping caches" work....

My favorite method of keeping the CPU busy while a memory-read request is traversing the network is the one used by the Denelcor HEP: context-switch on a cache miss.  When a process requests that a disk block be read on a time-sharing system, the scheduler tries to get useful work done during the rotational latency by blocking the process and executing another one; when a process requests a memory word on a HEP'ish system, the CPU tries to get useful work done during the network latency by executing another process.  This, of course, requires extremely light-weight processes, so that the time to context-switch is comparable to the time to execute any other instruction.

One way of doing this is to have a very memory-intensive architecture, with almost no registers besides PC and PSW (and even the status word can be dispensed with; c.f. recent comp.arch discussion).  Another method is to sacrifice single-process speed for parallel speedup, by putting a multiplexor in front of every single register, so that a context switch is effected by simply changing the index register that addresses the MUX.  To prevent the need for reloading the MMU's page-descriptors on every context-switch, it is preferable for most switches to be between "threads" of the same program (same virtual address space) rather than between processes running unrelated programs.

Another tack is to have a conventional ("context switches are expensive") processor that rarely waits on cache misses.
Writing to a memory location is always fast because you don't have to wait around for it to finish.  Reads cause problems.  Suppose you come to a code fragment that is about to do three array references (low probability of being in the cache).  Rather than saying LD A, <wait>, LD B, <wait>, LD C, <wait>, you could say PREFETCH A, PREFETCH B, PREFETCH C, LD A, LD B, LD C.  If the time to execute the three non-blocking prefetch instructions is comparable to the network latency, you win big, since they execute during time that would have been spent idle anyway.  Code density is shot to hell, and the compiler has to be _very_ smart about cache-hit likelihoods (or else runtime profiling has to be done, which is tricky because adding and removing PREFETCH instructions changes the pattern of what's in the cache when).

Something I haven't seen is the above PREFETCH instructions implemented in hardware.  Call it an intelligent look-ahead cache, or an aux. CPU.  Predictive memory requests are made not only on the instruction stream, but also on the data stream, a few instructions ahead of time.  Is this impractical because the aux. CPU has to be nearly as complicated as the CPU itself, so you'd get better elapsed times from dual processors that spend a lot of time waiting, rather than a single processor that hardly ever waits on a cache miss?  (The aux. CPU could be on the same chip as the CPU -- do any available processors do data prefetch as well as instruction prefetch?)

In some cases the address to prefetch simply can't be computed soon enough (LD i, LD base_addr+4*i), but usually there's enough play in the data dependencies that instructions can be rescheduled to allow predictions to be made in time (LD i; do something else useful while simultaneously computing base_addr+4*i; prefetch base_addr+4*i while doing some other useful things; do the array reference, either by recalculating base_addr+4*i or by grabbing the already-computed result from the aux. CPU).  Since all we're interested in is reducing the percentage of cache misses, it is by no means necessary to make the aux. CPU as intelligent as the primary CPU; the aux. is permitted to give up on a complicated address calculation and say, "I don't know," incurring only the penalty of a few wasted cycles on the primary.  Is this a loose enough constraint to make the aux. CPU practical, or is the idea economically infeasible (the dollars for extra compute power would be better spent on another processor)?

--John Hanley
  System Programmer, Manhattan College
  ..!cmcl2.nyu.edu!manhat!jh  or  hanley@nyu.edu  (CMCL2<=>NYU.EDU)
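[Editor's sketch]  Hanley's PREFETCH argument can be made concrete with a toy cycle count.  The numbers are assumptions chosen for illustration (a 10-cycle miss latency, one cycle per instruction issue), not measurements of any machine.

```python
MISS_LATENCY = 10   # assumed network/memory latency, in cycles

def blocking_loads(n):
    # LD A, <wait>, LD B, <wait>, ...: each load pays the full latency
    return n * (1 + MISS_LATENCY)

def prefetched_loads(n):
    # n one-cycle PREFETCH issues whose latencies overlap, then a stall
    # until the first line arrives, then n one-cycle loads that all hit
    # (later lines arrive in time, since they were issued a cycle apart)
    stall = max(0, MISS_LATENCY - n)
    return n + stall + n
```

With three loads and a 10-cycle latency, the blocking sequence costs 3 * 11 = 33 cycles, while the prefetched sequence costs 3 + 7 + 3 = 13, because all three latencies overlap - the "win big" case in the posting.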
tve@alice.UUCP (05/05/88)
As far as the "new virtually indexed, physically tagged write-through cache" in the Apollo goes, there ain't anything *new*.  The name implies they use the virtual address to address the cache and then compare the tags to the result of the translation, which will have occurred simultaneously (remember: the TLB is a cache as well and thus has about the same access time).

Since nothing comes for free (usually), there is a penalty in using this scheme: the cache set size is limited by the number of address bits available before translation (i.e. the bits which aren't translated).  This usually means the cache set size is limited to the size of a page, which is usually around 4K bytes these days.  If you want a larger cache, you have to make it set-associative, which is a pain in discrete logic (multiplexing etc...).  I think virtual-index/physical-tag caches will become very attractive when integrated on a chip, since on a chip, set-associative caches are easier to build and are (more or less) required for speed anyway.

	Thorsten von Eicken
	AT&T Bell Laboratories
	research!tve    tve@research.att.com
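[Editor's sketch]  von Eicken's size limit is simple arithmetic.  The two helpers below are hypothetical, written only to make the constraint explicit: every index bit of a virtually indexed, physically tagged cache must lie within the untranslated page offset, so one set (way) can cover at most one page, and a bigger cache must grow sideways via associativity.

```python
def max_way_size(page_bytes):
    # All index bits must come from the page offset, which is the same
    # in the virtual and physical address; so a way is at most one page.
    return page_bytes

def ways_needed(cache_bytes, page_bytes):
    # Growing the cache beyond one page per way means adding ways
    # (associativity), not index bits.
    return (cache_bytes + page_bytes - 1) // page_bytes
```

So with 4K-byte pages, a 16K-byte virtually indexed, physically tagged cache needs at least 4-way associativity, which is why the scheme is much more attractive on-chip than in discrete logic.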
aglew@urbsdc.Urbana.Gould.COM (05/05/88)
> Another problem is simply the implementation of the reverse TLB -
> what happens when the reverse TLB misses?  Which set of page tables
> should the TLB walk?

If we are still talking about virtual caches, then you don't need a reverse TLB.  What you do is have two sets of cache directories, one physical, one virtual.  Use the virtual directory on local access to the cache.  Use the physical directory when snooping on the bus.  The bus carries physical addresses, not virtual.

The virtual/physical synonym problem can be solved in several ways:

(1) Arrange the cache so all virtual addresses for the same physical location map to the same set.  This requires a bit of work when replacing synonyms in the same set, but is better than having to search the entire cache.

(2) Still use a virtual-to-physical translation, but take it off the critical path.  I.e., access the cache using the virtual address, assuming that there are no synonyms.  But, at the same time, initiate the virtual-to-physical translation.  Later - on the next cycle if your TLB is fast enough, or much later if the TLB misses (you may have to stall the processor, but TLB misses are rare) - when you've got the physical address, see if there was already such a location in the cache, and back up and repair if stale data was read.

(3) Arrange with the OS so that condition (1) is satisfied, or even to maintain the invariant that there are no synonyms, or that synonyms are used only in carefully controlled situations.
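[Editor's sketch]  Options (1) and (3) above - constraining addresses so all synonyms share a set - amount to what is often called page coloring.  A hypothetical sketch; the constant and function names are invented, and a real OS allocator would of course do this inside its virtual-memory bookkeeping:

```python
INDEX_BITS = 16          # illustrative: a 64K-byte direct-mapped virtual cache

def color(addr):
    # The cache-index bits of an address; synonyms must agree on these.
    return addr % (1 << INDEX_BITS)

def os_alloc_vaddr(paddr, region_base):
    # Choose the first virtual address at or above region_base whose
    # index bits match the physical page's, so every synonym of this
    # physical location lands in the same cache set.
    delta = (color(paddr) - color(region_base)) % (1 << INDEX_BITS)
    return region_base + delta
```

With this invariant in force, the physical snoop directory only ever has to probe one set, and replacing a synonym within that set is the only extra work left.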
hankd@pur-ee.UUCP (Hank Dietz) (05/06/88)
In article <381@mancol.UUCP>, jh@mancol.UUCP (John Hanley) writes:
> In article <8029@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
> >...why hasn't anyone talked about the fact that processors like SPARC are not
> >really designed for large-scale multiprocessing, e.g., they have no provision
> >for "hiding" BIG, stochastic, memory reference delays across a log n stage
> >interconnection network, etc.?
>
> My favorite method of keeping the CPU busy while a memory-read request is
> traversing the network is the one used by the Denelcor HEP: context-switch
> on a cache miss....  [or alternatively....]
> LD A, <wait>, LD B, <wait>, LD C, <wait>, you could say PREFETCH A, PREFETCH B,
> PREFETCH C, LD A, LD B, LD C.  If the time to execute the three non-blocking
> prefetch instructions is comparable to the network latency, you win big, since
> they execute during time that would have been spent idle anyway....
> Something I haven't seen is the above PREFETCH instructions implemented in
> hardware.  Call it an intelligent look-ahead cache, or an aux. CPU.
> Predictive memory requests are made not only on the instruction stream,
> but also on the data stream, a few instructions ahead of time....

Burton Smith, of HEP fame, is sort of doing both in his latest machine; so are we (CARP -- the Compiler-oriented Architecture Research group at Purdue).  I believe Burton's machine microtasks, a la HEP, but he also has a method whereby many memory references (or other slow operations) can be initiated without waiting for earlier ones to complete.  (I still don't know how much of his design is in the public domain, so I can't say much more about it.)  The CARP machine doesn't microtask, but we do some very sneaky interrupt enabling...
the CARP machine processor has provision for multiple delayed operations to be initiated without waiting for earlier ones to complete, and interrupts are only enabled when the processor has to wait LONGER than the compiler expected.  The interrupt latency is high this way (perhaps dozens of instructions between interrupt accept states), but this isn't such a bad problem when you consider a multiprocessor machine where ANY processor could service ANY interrupt.

There are actually two separate delayed-operation mechanisms in the CARP machine: one for compile-time-known delays and one for delays where only the expected delay is known at compile time.  For some operations, the expected-delay-based mechanism is late targeting; i.e., the destination register in register address space is not specified until the item has arrived, hence the usable register address space is not reduced by having multiple items pending (selection of a staging register is implicit in the type of the delayed operation).

We look at it this way: if you want to get high speedup by multiprocessing, since not everything can be parallelized, we don't want to slow the sequential parts by microtasking.  The result is that we implement machine use priorities by dynamically changing the parallelism-width dedicated to each task, and we concentrate on other mechanisms for hiding delays... preferably mechanisms which do not "use up" parallelism that we could have used to achieve speedup through parallel execution.

								-hankd
daveb@geac.UUCP (David Collier-Brown) (05/06/88)
In article <381@mancol.UUCP> jh@mancol.UUCP (John Hanley) writes:
>Something I haven't seen is the above PREFETCH instructions implemented in
>hardware.  Call it an intelligent look-ahead cache, or an aux. CPU.
>Predictive memory requests are made not only on the instruction stream,
>but also on the data stream, a few instructions ahead of time.

Well, not on a RISC machine...  It is logically similar to (perhaps identical
to?) a pipeline on a CISC.  My old 'bun used to look "forward" in the
instruction stream with wild abandon, because it took so long to decode the
instructions (:-)).

--dave (data, now, is quite a different matter) c-b
--
David Collier-Brown.                  {mnetor yunexus utgpu}!geac!daveb
Geac Computers International Inc.,    | Computer Science loses its
350 Steelcase Road, Markham, Ontario, | memory (if not its mind)
CANADA, L3R 1B3 (416) 475-0525 x3279  | every 6 months.
jeffs@pyrthoth (Jeff Sewall) (05/07/88)
In article <63900014@convex> gruger@convex.UUCP writes:
>
>>/* Written 5:51 pm May 2, 1988 by jeff@alliant.Sun.Com
>> ...In order to
>> properly snoop, all of the aliases must be checked for in the cache.
>
>Nah, it ain't that hard.  One can have a virtually mapped, physically tagged
>cache.  As the virtual cache is filled with data, the translated physical
>address from the MMU is written into the tag RAMs.  The physical tags are
>then used by the bus watcher to selectively invalidate the cache entries.
>
>This is in fact how cache coherency is maintained in the Convex C-2 machines.
>Each of the CPUs has a set of remote invalidation tag memories which watches
>all of the other physical address buses (3 other CPUs and one I/O).  When
>there is a hit, the validity bits are cleared for that particular entry in
>the virtually mapped cache.
>
>Jeff Gruger

This only works for small caches.  Once the size of each set of the cache
exceeds your page size, the location of an entry in a physically mapped
"remote invalidation tag memory" will be different from the location of that
entry in the virtually mapped cache, because the page offset alone is no
longer sufficient to index the cache.

This problem can be solved by passing part of the virtual page address along
with the physical address on memory requests.  Then the remote tag can be
addressed with the virtual address, guaranteeing a mapping to the same
location.  But this works only if synonyms are not allowed.  BTW, this is the
approach taken in the SPUR architecture.

I think that the original poster's question is still valid.  Is there a good
solution to snooping a virtually mapped cache with the following constraints:
(1) Synonyms are allowed
(2) The cache is large enough that the size of a set exceeds the page size
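The size constraint above can be seen with a little address arithmetic (the page, line, and cache sizes here are made-up illustrative numbers): once the cache index needs bits above the page offset, a virtual address and its physical translation can select different cache entries.

```python
PAGE_SIZE = 4096        # assumed page size, bytes
LINE_SIZE = 32
CACHE_SET = 16 * 1024   # one set of the cache, deeper than a page

def index(addr, depth):
    # cache line selected by an address in a direct-mapped store 'depth' bytes deep
    return (addr % depth) // LINE_SIZE

# One memory location seen through the MMU: virtual page 3, physical page 9,
# same offset 0x40 within the page.
vaddr = 3 * PAGE_SIZE + 0x40
paddr = 9 * PAGE_SIZE + 0x40

deep_virtual  = index(vaddr, CACHE_SET)   # entry the CPU uses
deep_physical = index(paddr, CACHE_SET)   # entry the bus watcher would pick
page_virtual  = index(vaddr, PAGE_SIZE)   # with a page-deep set...
page_physical = index(paddr, PAGE_SIZE)   # ...the two always agree
```

Here the physically indexed invalidation tags would look at entry 130 while the data actually sits at entry 386; shrink the set to one page deep and the indices coincide, which is the small-cache case where the Convex scheme works without any extra machinery.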
petolino%joe@Sun.COM (Joe Petolino) (05/07/88)
>>> ...In order to
>>> properly snoop, all of the aliases must be checked for in the cache.
>>
>>Nah, it ain't that hard.  One can have a virtually mapped, physically tagged
>>cache.  As the virtual cache is filled with data, the translated physical
>>address from the MMU is written into the tag RAMs.  The physical tags are
>>then used by the bus watcher to selectively invalidate the cache entries.
>
>This only works for small caches.  Once the size of each set of the cache
>exceeds your page size, the location of an entry in a physically mapped
>"remote invalidation tag memory" will be different than the location of
>that entry in the virtually mapped cache.  This is because the page index
>alone is no longer sufficient to address the cache.
>
>This problem can be solved by passing part of the virtual page address
>along with the physical address on memory requests.  Then the remote tag
>can be addressed with the virtual address and guarantee mapping to the
>same location.  But this works only if synonyms are not allowed.
>
> . . . Is there a
>good solution to snooping a virtually mapped cache with the following
>constraints:
>(1) Synonyms are allowed
>(2) The cache is large enough that the size of a set exceeds the page size
                                                        ^^^
(you don't really mean 'set' here, do you :-) )

The problem of finding virtual aliases in a virtually-addressed cache is not
unique to multiprocessor systems.  A cache deeper than the page size can
harbor more than one copy of the same line if unrestricted aliases are
allowed.  That can cause problems even with a single processor.  I've seen a
few solutions to this problem:

1) Use a high enough degree of cache associativity so that you get the
   desired cache size with a one-page-deep cache.  This was used on the
   Amdahl 470/V7 (eight-way set associative!).  This violates constraint 2
   (above).

2) Restrict virtual addresses so that no two aliases map into different
   cache sets ('set' used differently here than above).  This is used by
   Sun systems.
It technically violates constraint 1 (but not by much).

3) When doing the snooping, sequentially search all the possible places
   where that data might be (this is done while waiting for memory to
   respond).  This is used in the Amdahl 5890, which has to look in four
   2-element instruction-cache sets and eight 2-element data-cache sets
   on each processor.

My first two examples are from single-processor machines.

-Joe
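Solution 1 above is just arithmetic: keeping each way one page deep forces the associativity to equal the total cache size divided by the page size.  A minimal sketch (the 32 KB / 4 KB numbers are illustrative; the posts don't give the 470/V7's actual cache or page sizes):

```python
def ways_for_page_deep_cache(cache_size, page_size):
    # degree of associativity needed so that no index bit lies above the
    # page offset, i.e. each way is exactly one page deep
    assert cache_size % page_size == 0
    return cache_size // page_size

# e.g. a 32 KB cache with 4 KB pages needs eight-way set associativity,
# matching the eight ways mentioned for the Amdahl 470/V7
print(ways_for_page_deep_cache(32 * 1024, 4096))  # → 8
```

The cost is visible in the formula: doubling the cache size while keeping this property doubles the number of tag comparators, which is why large caches push designers toward the other two solutions.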
chris@mimsy.UUCP (Chris Torek) (05/07/88)
[Virtual caches vs bus snooping with physical addresses]

Another possibility---if perhaps somewhat unusual and maybe quite
difficult---would be to spot stale cache accesses and fault the offending
instruction before it is allowed to complete.  That is, allow the instruction
to execute using the cached data, and if that data turns out to be stale, zap
the instruction before it is too late.  I imagine this would get nightmarish
when debugging the hardware.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris
news@dartvax.Dartmouth.EDU (USENET News System) (05/10/88)
References: <1521@pt.cs.cmu.edu> <28200135@urbsdc> <4921@bloom-beacon.MIT.EDU> <1671@alliant.Alliant.COM>
From: fagin@eleazar.dartmouth.edu (Barry S. Fagin)
Organization: Dartmouth College, Hanover, NH

Jeff Collins writes:
> As far as I know, no one has solved the virtual cache coherency
> problem yet...

Check out the SPUR project at Berkeley, a RISC multiprocessor that uses
virtual address caches.

--Barry
bartlett@encore.UUCP (John Bartlett) (05/11/88)
In article <63900014@convex> gruger@convex.UUCP writes:
>
>>/* Written 5:51 pm May 2, 1988 by jeff@alliant.Sun.Com
>> ... It is a much more
>> complicated procedure to "watch" these addresses.  With a virtual
>
>Nah, it ain't that hard.  One can have a virtually mapped, physically tagged
>cache.  As the virtual cache is filled with data, the translated physical
>address from the MMU is written into the tag RAMs.  The physical tags are
>then used by the bus watcher to selectively invalidate the cache entries.

This sounds easy, but every time I have analyzed this I have come to the
conclusion that one of those tag stores has to be fully associative, to
ensure that the two tag stores will always have the same addresses allocated.
Am I missing something here?

In our systems, we can't afford the real estate for a fully associative tag
store for each processor cache.

John Bartlett              {ihnp4,decvax,allegra,linus}!encore!bartlett
Encore Computer Corp.
257 Ceder Hill Street
Marlboro, Mass. 01752
(617) 460-0500
Opinions are not necessarily those of Encore Computer Corp.
gruger@convex.UUCP (05/12/88)
>/* Written 9:32 pm May 10, 1988 by bartlett@encore.Sun.COM in convex:comp.arch */
>
>This sounds easy, but every time I have anaylzed this I have come to the
>conclusion that one of those tag stores has to be fully associative, to insure
>that the two tag stores will always have the same addresses allocated.  Am I
>missing something here?
>
>In our systems, we can't afford the realestate for a fully associative tag
>store for each processor cache.

Maybe there is some semantics problem here in the communication...  Which
"two tag stores" are you referring to?  By "fully associative" I take it you
mean a true content addressable memory for the entire tag memory?  I believe
you only need high associativity when searching through multiple sets of your
data cache.

Our cache structures consist of:
  a) a virtually addressed data RAM
  b) a virtually addressed validity RAM
  c) a physically addressed tag RAM
The tag RAM is written only as read data returns to the cache.  The tag RAM
is read only as remote processor writes occur, and if there is a hit, the
validity bits are cleared.

All these RAMs certainly take up a lot of space (although we manage to pull a
lot inside gate arrays).  We also reached the conclusion that we could not
afford the real estate of multiple cache sets, given their increased
complexity and cost for low pay-back.

Prior responses have discussed the increasing complexity of the tag RAM as
your data cache gets deeper - you have to search more than just one tag as
your cache size increases beyond the page size.

We have found it quite effective in a _vector_processing_ machine to have a
fairly small cache, equal to page size, for scalar operands only.  We bypass
vector operands around the cache and invalidate entries if a vector
load/store encounters one.  There is NO performance improvement in running
vector data through a cache if you have enough basic bandwidth (which not all
parallel-vector machines have).
Large caches best serve plain old scalar machines that have to stuff entire data arrays into cache in order to achieve performance. Jeff Gruger
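The three-RAM organization described above can be sketched as a toy model (a hypothetical direct-mapped, one-page-deep cache; the class and field names are mine, not Convex's): the CPU indexes by virtual address, while the bus watcher touches only the physical tag RAM and clears validity bits on a hit.

```python
LINE  = 32
LINES = 128   # 4 KB deep: one page, so virtual and physical index agree

class SnoopyCache:
    def __init__(self):
        self.data  = [None] * LINES    # virtually addressed data RAM
        self.valid = [False] * LINES   # virtually addressed validity RAM
        self.ptag  = [None] * LINES    # physically addressed tag RAM

    def index(self, addr):
        return (addr // LINE) % LINES

    def fill(self, vaddr, paddr, value):
        # tag RAM is written only as read data returns to the cache
        i = self.index(vaddr)
        self.data[i] = value
        self.ptag[i] = paddr // (LINE * LINES)
        self.valid[i] = True

    def load(self, vaddr):
        i = self.index(vaddr)
        return self.data[i] if self.valid[i] else None   # None == miss

    def snoop_write(self, paddr):
        # bus watcher: a remote processor wrote paddr; invalidate on tag hit
        i = self.index(paddr)          # same index, since depth == page size
        if self.valid[i] and self.ptag[i] == paddr // (LINE * LINES):
            self.valid[i] = False

c = SnoopyCache()
c.fill(vaddr=0x40, paddr=0x2040, value="old")
assert c.load(0x40) == "old"     # hit
c.snoop_write(0x2040)            # remote write to the same physical line
assert c.load(0x40) is None      # entry invalidated; CPU must re-fetch
```

Note that `snoop_write` never reads the data or validity RAMs' contents by virtual address: because the cache is no deeper than a page, the physical address alone locates the candidate entry, which is the property Bartlett's fully-associative objection and Gruger's "physical index = virtual index" reply are arguing about.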
bartlett@encore.UUCP (John Bartlett) (05/23/88)
In article <63900015@convex> gruger@convex.UUCP writes:
>Which "two tag stores" are you referring to?

In our system we have two separate tag stores, one for processor access to
the cache, one for bus invalidation.  This is because the bus and the
processors run on different clocks.

>By "fully associative"
>I take it you mean a true content addressable memory for the entire tag
>memory??  I believe you only need to have high associativity when
>searching through multiple sets of your data cache.
>
>Our cache structures consist of:
>  a) a virtually addressed data RAM
>  b) a virtually addressed validity RAM
>  c) a physically addressed tag RAM
>The tag ram is only written as read data returns to the cache.  The
>tag ram is read only as remote processor writes occur, and if there
>is a hit, the validity bits are cleared.

You must have some trick for ensuring complete mapping between your virtual
index and your physical index.  In order to ensure complete mapping, one of
them has to be fully associative (yes, CAM), does it not?  If you limit the
combinations in some way to prevent this problem, don't you take a hit on
hit rate?  (oh, sorry 'bout the bad pun)

John Bartlett              {ihnp4,decvax,allegra,linus}!encore!bartlett
Encore Computer Corp.
257 Ceder Hill Street
Marlboro, Mass. 01752
(617) 460-0500
Opinions are not necessarily those of Encore Computer Corp.
gruger@convex.UUCP (05/25/88)
>/* Written 9:02 pm May 22, 1988 by bartlett@encore.Sun.COM
>You must have some trick for insuring complete mapping between your virtual
>index and your physical index.  In order to insure complete mapping, one
>of them has to be fully associateive (yes CAM) does it not?  If you limit
>the combinations in some way to prevent this problem, don't you take a hit
>on hit rate ?  (oh, sorry 'bout the bad pun)
>
>John Bartlett {ihnp4,decvax,allegra,linus}!encore!bartlett
>Encore Computer Corp.

It's not a very fancy trick.  Physical index = virtual index.  Yes, the cache
has to be smaller than you might like it to be to maximize hit rate.
However, when you work with 5nsec ECL RAMs it's difficult (impossible today)
to find any larger than 4K bits.  The raw cycle time this permits more than
compensates for a small sacrifice of hit rate.

Jeff Gruger
Convex Computer Corp.
{ihnp4!convex!gruger}
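The "trick" works because address translation only replaces the page number; the offset within the page passes through the MMU unchanged, so a cache no deeper than a page computes the identical index from the virtual and the physical address.  A minimal sketch (the page size and page numbers are assumptions for illustration):

```python
PAGE = 4096   # assumed page size, bytes
LINE = 32

def translate(vaddr, phys_page):
    # hypothetical MMU: swap in the physical page number, keep the page offset
    return phys_page * PAGE + vaddr % PAGE

def index(addr):
    # page-deep, direct-mapped cache: index bits lie entirely below the
    # page offset, so translation cannot change them
    return (addr % PAGE) // LINE

vaddr = 0x3F6A0
paddr = translate(vaddr, 0x129)
assert index(vaddr) == index(paddr)   # physical index = virtual index
```

This is the same page-offset observation that makes the snoop side work: the bus watcher can index the tag RAM with a physical address and be certain it lands on the entry the CPU filled through the virtual address.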