zs01+@andrew.cmu.edu (Zalman Stern) (07/02/90)
The main problem with unsynchronized split I&D caches is the lack of efficient software support for flushing the caches and synchronizing them. For example, the MIPS cachectl system call requires one to invalidate at least an entire page. That artificially inflates the cost of leaving cache consistency out of the architecture for many cases. (Not to mention the overhead of the system call to begin with.) Perhaps this is why John suggests using a large chunk of memory and only flushing it when part of it is reused.

A better example is the IBM RIOS, which includes user-level instructions for cache flushing and synchronization. The flush instruction takes a virtual address and flushes the cache line containing that address. (If the line is not present, it does nothing.) The cache flush instruction is also set up so that it can be used efficiently in a loop.

One still has to pay the overhead of going through main memory on the RIOS. In addition to the data cache writes to memory (the RIOS uses write-back), there will be I-cache misses when the written instructions are executed. The I-cache misses may not come to many cycles, though. In theory this could be avoided by providing some explicit mechanism to forward data from the data cache to the instruction cache. This isn't so unreasonable to contemplate on the RIOS, in that I-cache reload goes through the D-cache chips anyway. (My understanding is that this was done for lack of hardware resources and is not good for performance.)

Sincerely, Zalman Stern | Internet: zs01+@andrew.cmu.edu | Usenet: I'm soooo confused... Information Technology Center, Carnegie Mellon, Pittsburgh, PA 15213-3890 *** Friends don't let friends program in C++ ***
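The per-line flush loop described above can be sketched in C. Everything here is illustrative: the 64-byte line size is an assumption, and `flush_line` is a counting stub standing in for the real user-level flush instruction (which would be emitted as inline assembly on an actual machine).

```c
#include <stdint.h>
#include <stddef.h>

#define LINE_SIZE 64  /* assumed line size; the real value is implementation-defined */

/* Hypothetical per-line flush primitive.  On the RIOS this would be the
   user-level flush instruction; here it is a counting stub so the loop
   structure can be shown. */
unsigned lines_flushed;
static void flush_line(uintptr_t addr) { (void)addr; lines_flushed++; }

/* Flush every cache line overlapping [addr, addr+len), stepping by the
   line size -- the loop shape the flush instruction is "set up" for. */
void flush_range(uintptr_t addr, size_t len)
{
    if (len == 0)
        return;
    uintptr_t first = addr & ~(uintptr_t)(LINE_SIZE - 1);
    uintptr_t last  = (addr + len - 1) & ~(uintptr_t)(LINE_SIZE - 1);
    for (uintptr_t p = first; p <= last; p += LINE_SIZE)
        flush_line(p);
}
```

Note that flushing by line rather than by page is exactly what makes the cost proportional to the amount of code actually written, instead of to the page size.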
yuhara@minako.stars.flab.fujitsu.co.jp (Masanobu Yuhara) (07/02/90)
When I designed the cache/MMU for a microprocessor, I came across the same problem (consistency between I-cache and D-cache). Self-modifying code does not appear frequently, but processors should support it somehow. What I had in mind was the garbage collector of Lisp: at least one of our Lisps does garbage collection of the code area as well as the heap area.

Our solution was to provide an instruction called PIB (Purge Instruction Buffer). This instruction guarantees that modifications of code become visible to instruction fetch. It is NOT a privileged instruction. On most processors, cache purge is privileged, but not all operating systems provide a cache purge system call. (I heard SUN-OS has such a system call, though it is not documented.)

The processor I designed has separate instruction and data caches with snooping (physical cache, write-through, dual-port tags, invalidation-type snooping). The instruction cache snoops stores not only by other processors but also by the same processor. This means the PIB instruction does not have to purge the instruction cache at all. All you have to do is purge the instruction queue and the pipeline, and ensure that a write from the same processor has been snooped. This scheme requires very little additional hardware and has no impact on performance.

------ yuhara@flab.fujitsu.co.jp Masanobu YUHARA Fujitsu Laboratories
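The scheme above can be modeled with a toy direct-mapped I-cache whose lines are invalidated by snooping every store, including the processor's own. All sizes are made up, and whole addresses stand in for tags; the point is only that after a store, no stale copy survives in the I-cache, so a PIB need only drain the instruction queue and pipeline.

```c
#include <stdint.h>

#define IC_LINES 64
#define IC_INDEX(a) (((a) / 4) % IC_LINES)   /* word-addressed, direct-mapped */

struct iline { uint32_t tag; int valid; };
struct iline icache[IC_LINES];

/* Instruction fetch: fill the line on a miss; return 1 on hit, 0 on miss. */
int ifetch(uint32_t addr)
{
    struct iline *l = &icache[IC_INDEX(addr)];
    if (l->valid && l->tag == addr) return 1;
    l->tag = addr;
    l->valid = 1;
    return 0;
}

/* Store snoop: any store -- including one by this same processor --
   invalidates a matching I-cache line. */
void snoop_store(uint32_t addr)
{
    struct iline *l = &icache[IC_INDEX(addr)];
    if (l->valid && l->tag == addr) l->valid = 0;
}
```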
mo@messy.bellcore.com (Michael O'Dell) (07/02/90)
The SPARCitecture provides for a FLUSH instruction specifically required to be issued by a program when it diddles instructions. (Earlier versions of the SPARCitecture document called it "IFLUSH", but it probably must diddle both caches if there are two, and the single cache if there is only one, so they changed the name at some point.) At Prisma, the first program we found which didn't issue them was the dynamic linker! When we first tried to run a dynolink pgm on the Prisma P1 simulator (with 2 caches), the results were spectacular. The bug was trivial to fix and everyone was appropriately chagrined. -Mike
hawkes@mips.COM (John Hawkes) (07/02/90)
In article <gaXkbY600asBMBA8cI@andrew.cmu.edu> zs01+@andrew.cmu.edu (Zalman Stern) writes: >The main problem with unsynchronized split I&D caches is the lack of >efficient software support for flushing the caches and synchronizing them. >For example, the MIPS cachectl system call requires one to invalidate at >least an entire page. That artificially inflates the cost of leaving cache >consistency out of the architecture for many cases. (Not to mention the >overhead of the system call to begin with.) Perhaps this is why John >suggests using a large chunk of memory and only flushing it when part of it >is reused. The MIPS 'cachectl' syscall flips a virtual address range from cached to uncached, and vice versa, and indeed it is implemented with a page-sized granularity. I suggest you instead consider 'cacheflush', which flushes/invalidates arbitrary sized pieces of the address space from the I-cache, D-cache, or both. Granted, this isn't as efficient as an inline user-executable instruction sequence, but it is more efficient than the sledgehammer approach of 'cachectl'. -- -- John Hawkes {ames,decwrl}!mips!hawkes OR hawkes@mips.com
jack@cwi.nl (Jack Jansen) (07/03/90)
This I/D cache discussion sparked a thought: I get the impression that separate I/D caches are used mainly to get the benefit of two-way set associative caches without the cost involved. If there aren't any other added advantages, why not just have two caches that cache both instructions and data, and steal the top address bit to denote which cache to use? The OS can easily set things up so that text segments are mapped in with the top bit clear, and data segments with the top bit set. You could even extend the scheme to add more caches, and let kernel-I references use a different cache from user-I refs, so that the poor application program whose inner loop happens to collide with the clock interrupt routine doesn't suffer unduly. And, of course, this neatly solves the closure problem. -- -- Een volk dat voor tirannen zwicht | Oral: Jack Jansen zal meer dan lijf en goed verliezen | Internet: jack@cwi.nl dan dooft het licht | Uucp: hp4nl!cwi.nl!jack
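One reading of the proposal, sketched with made-up sizes: the top virtual-address bit selects one of two otherwise-identical banks, and each bank is indexed direct-mapped as usual.

```c
#include <stdint.h>

#define BANK_SIZE (16 * 1024)            /* 16 KB per bank (assumed) */
#define LINE_SIZE 16                     /* bytes per line (assumed) */
#define NUM_LINES (BANK_SIZE / LINE_SIZE)

typedef struct { unsigned bank, index; } cache_slot;

/* Top address bit picks the bank (0 = text, 1 = data, per the mapping
   the OS would set up); low-order bits index within the bank. */
cache_slot select_slot(uint32_t vaddr)
{
    cache_slot s;
    s.bank  = vaddr >> 31;
    s.index = (vaddr / LINE_SIZE) % NUM_LINES;
    return s;
}
```

As the follow-ups in this thread point out, the stolen bit is simply one more index bit, so this behaves as a single direct-mapped cache of twice the size.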
mash@mips.COM (John Mashey) (07/03/90)
In article <1756@charon.cwi.nl> jack@cwi.nl (Jack Jansen) writes:
>This I/D cache discussion sparked a thought: I get the impression that
>separate I/D caches are used mainly to get the benefit of two-way
>set associative caches without the cost involved. If there aren't any
>other added advantages,...

No, the main reason is to double the peak bandwidth from cache -> CPU with the same-speed SRAMs. The more usual effect is that a RISC CPU can fetch an instruction every cycle, and could still do a load or store every cycle without penalty. Of course, realistically, since the usual rule of thumb is 20% loads, 10% stores, using a joint cache would give you a 30% cycle-count increase, assuming a 1-cycle hit per load/store. Straightforward write-back caches may add more, double-width I-fetches may reduce the penalty, etc., etc. In general, the decision to use split caches comes from the wish to avoid this overhead. The side effect of partially simulating 2-way set-associativity is welcome, also. See pp. 423-425 of H&P, also section 6.4.

-- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
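The 30% figure above is simple arithmetic, made explicit here as a back-of-envelope model: with a joint (single-ported) cache, every load or store steals one cache cycle from instruction fetch.

```c
/* Back-of-envelope model: base CPI of 1.0, plus one stolen fetch cycle
   per memory reference.  The 20%/10% load/store frequencies are the
   rule-of-thumb values quoted above. */
double joint_cache_cpi(double f_load, double f_store)
{
    return 1.0 + f_load + f_store;
}
```

With the quoted frequencies, joint_cache_cpi(0.20, 0.10) evaluates to 1.30, i.e. the 30% cycle-count increase over the split-cache CPI of 1.0.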
jack@cwi.nl (Jack Jansen) (07/06/90)
In article <39868@mips.mips.COM> mash@mips.COM (John Mashey) writes: >In article <1756@charon.cwi.nl> jack@cwi.nl (Jack Jansen) writes: >>This I/D cache discussion sparked a thought: I get the impression that >>separate I/D caches are used mainly to get the benefit of two-way >>set associative caches without the cost involved. If there aren't any >>other added advantages,... > >No, the main reason is to double the peak bandwidth from cache -> CPU >with the same speed SRAMs. Ok, but that doesn't invalidate the point: separate I/D caches limit the OS in the ways it can make best use of the available cache, while a cache system that uses an address bit (or more address bits) to select the cache (or cache bank, more correctly) should give the same performance and more freedom. -- -- Een volk dat voor tirannen zwicht | Oral: Jack Jansen zal meer dan lijf en goed verliezen | Internet: jack@cwi.nl dan dooft het licht | Uucp: hp4nl!cwi.nl!jack
mash@mips.COM (John Mashey) (07/06/90)
In article <1782@charon.cwi.nl> jack@cwi.nl (Jack Jansen) writes:
>In article <39868@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>In article <1756@charon.cwi.nl> jack@cwi.nl (Jack Jansen) writes:
>>>This I/D cache discussion sparked a thought: I get the impression that
>>>separate I/D caches are used mainly to get the benefit of two-way
>>>set associative caches without the cost involved. If there aren't any
>>>other added advantages,...
>>No, the main reason is to double the peak bandwidth from cache -> CPU
>>with the same speed SRAMs.
>Ok, but that doesn't invalidate the point: separate I/D caches limit
>the OS in the ways it can make best use of the available cache, while
>a cache system that uses an address bit (or more address bits) to select
>the cache (or cache bank, more correctly) should give the same performance
>and more freedom.

I must have misunderstood the original posting, as I didn't really understand how this design was supposed to work at all. Either:
a) An instruction reference and a data reference to the same address reference the same bits in the same SRAM. OR
b) They don't.
People design computers where:
a) It is no problem, hence consistency is there, except maybe for prefetch buffers & similar things.
b) Consistency is handled by software, explicitly, which is what almost everybody does that has separate I & D.
c) Like b), except that extra hardware (somewhere) is used to maintain consistency, most likely by using a multiprocessor cache-coherency protocol that must gain ownership of a cache line before doing the write. (Amdahl)
Having multiple banks of SRAMs, decoded by address bits, doesn't make the fundamental problem go away, unless I completely misunderstand the description.
A convincing description would start with a virtual address, showing the Virtual Page Number and Byte Offset fields, and work through the translation process to a physical address (i.e., page-frame and byte-offset fields), which, combined with an I/D bit, selects and accesses the SRAMs. Remember, the standard split I & D cache never makes any I-reference or D-reference wait for the other one.

-- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
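The address path asked for above can be sketched as a small model: split the virtual address into VPN and byte offset, translate the VPN to a page frame number, and rebuild the physical address that (together with the I/D bit) would select the SRAM bank. The 4 KB page size is an assumption, and the translation function is a toy stand-in for a real TLB or page-table walk.

```c
#include <stdint.h>

#define PAGE_SHIFT 12                    /* 4 KB pages (assumed) */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

typedef uint32_t (*translate_fn)(uint32_t vpn);   /* VPN -> PFN, toy TLB */

/* Split the virtual address, translate the page number, and reassemble
   the physical address.  A separate I/D bit -- not derived from the
   address -- would then steer this physical address to the I- or D-SRAMs. */
uint32_t physical_address(uint32_t vaddr, translate_fn translate)
{
    uint32_t vpn    = vaddr >> PAGE_SHIFT;
    uint32_t offset = vaddr & PAGE_MASK;
    uint32_t pfn    = translate(vpn);
    return (pfn << PAGE_SHIFT) | offset;
}
```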
alvitar@xavax.com (Phillip Harbison) (07/07/90)
In article <1756@charon.cwi.nl> jack@cwi.nl (Jack Jansen) writes:
> This I/D cache discussion sparked a thought: I get the impression that
> separate I/D caches are used mainly to get the benefit of two-way
> set associative caches without the cost involved. If there aren't any
> other added advantages, why not just have two caches ...

This is not the reason for split I&D caches. If you think about it, split I&D caches require about as much hardware as any two-way set-associative cache (two sets of comparators, two sets of tag RAMs, two data paths, two bus connections, etc.). There appear to be two major reasons for using a split I&D cache. [1] It allows the CPU to implement a Harvard architecture and enjoy the associated benefits, i.e. twice as much memory bandwidth. [2] It allows each cache to be tuned for its task. For example, large cache block sizes may be very beneficial in an I-cache but less useful in a D-cache. Also, the I-cache doesn't have to be writeable by the CPU.

> ... instructions and data, and steal the top address bit to denote
> which cache to use? The OS can easily set things up so that text
> segments are mapped in with the top bit clear, and data segments
> with the top bit set.

If you design a cache in this manner, it is not a 2-way set-associative cache as your previous comments imply. In fact, it is just a direct-mapped cache that is twice as large.

-- Live: Phil Harbison, Xavax, P.O. Box 7413, Huntsville, AL 35807 Uucp: alvitar@xavax.com Bell: 205-883-4233, 205-880-8951
mash@mips.COM (John Mashey) (07/09/90)
In article <1990Jul7.041100.2413@xavax.com> alvitar@xavax.com (Phillip Harbison) writes:
>In article <1756@charon.cwi.nl> jack@cwi.nl (Jack Jansen) writes:
>> This I/D cache discussion sparked a thought: I get the impression that
>> separate I/D caches are used mainly to get the benefit of two-way
>This is not the reason for split I&D caches. If you think about it,
>the split I&D caches require about as much hardware as any two-way
>set associative cache (two sets of comparators, two sets of tag RAMS,
>two data paths, two bus connections, etc.). There appears to be two
>major reasons for using a split I&D cache. [1] It allows the CPU to
>implement a Harvard architecture and enjoy the associated benefits,
>i.e. twice as much memory bandwidth. [2] It allows each cache to be
>tuned for performing a task. For example, large cache block sizes
>may be very beneficial in an I-cache but less useful in a D-cache.
>Also, the I-cache doesn't have to be writeable by the CPU.

All of these are good comments. In addition:
1) In some cases, especially with an off-chip cache, the tag comparison on a load or instruction fetch is in series with delivery to the processor. It is probably easier to hide this for an I-fetch than for a load instruction, by various trickery. However, in some designs, it can be part of the critical path for a load instruction, either:
a) Adding a cycle to the load latency. OR
b) Lengthening the cycle time.
All of this is easier to deal with, with on-chip caches.
2) With direct-mapped caches, it is sometimes easier to optimize the (most-common) cache-hit case, i.e., because you can send the data from cache -> pipeline, and be doing the tag check and parity check in parallel, and then suppress the load and restart the pipeline as needed.
3) Depending on the nature of the cache, sometimes a direct-mapped cache can easily implement a 1-cycle store, which is difficult for a 2-way set-associative cache.
Of course, for any store into a cache for which there is not a separate tag for the unit being stored, you're likely to pay 2 cycles or more anyway. Fortunately, some degree of write buffering helps mask all of this in almost any cache design.
4) See: Steven Przybylski, CACHE AND MEMORY HIERARCHY DESIGN, Morgan Kaufmann, Spring 1990, for thorough analyses of the tradeoffs among cache designs, and good references to important papers in this area. It especially talks about the cycle-time impacts of various design choices, or why:
    What is your cache-miss rate?
is NOT a very useful question compared to
    What % of the total cycles are lost to the memory system?
which is a better question. Of course, the best question is:
    How long does it take to run?
because it is quite possible for a system to look WORSE on either of the measures above, if it turns out that a "more-efficient-looking" design lengthens the cycle time of the machine MORE than its efficiency improvement.

-- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash DDD: 408-524-7015, 524-8253 or (main number) 408-720-1700 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
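The point about asking the right question can be made concrete with a back-of-envelope run-time model (every number below is hypothetical): a design with a lower miss rate can still run slower if the "improvement" lengthens the cycle time.

```c
/* Total run time = instructions x (base CPI + misses/instr x penalty)
   x cycle time -- i.e., the "how long does it take to run?" metric. */
double run_time_ns(double instructions, double base_cpi,
                   double miss_per_instr, double miss_penalty_cycles,
                   double cycle_ns)
{
    double cycles = instructions * (base_cpi + miss_per_instr * miss_penalty_cycles);
    return cycles * cycle_ns;
}
```

For example, 5% misses at a 10-cycle penalty with a 20 ns cycle gives 30 ns per instruction, while a "more-efficient-looking" design with 2% misses but a 26 ns cycle gives 31.2 ns per instruction: the design with the better miss rate loses on run time.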