paulr@decwrl.DEC.COM (Paul Richardson) (03/05/86)
Has anyone taken data on the performance gains/losses attributable to
separate data/instruction caches in processors? You can reply to me at the
above address or to the net, but I am interested in any info.

Paul Richardson
aglew@ccvaxa.UUCP (03/07/86)
> /* Written 10:40 pm Mar 4, 1986 by paulr@decwrl.DEC.COM */
> Has anyone taken data on the performance gains/losses attributable to
> separate data/instruction caches in processors? You can reply to me at
> the above address or to the net, but I am interested in any info.
>
> Paul Richardson

I'd be interested in the results.
holmer@ji.berkeley.edu (Bruce K. Holmer) (03/10/86)
In article <5100023@ccvaxa> aglew@ccvaxa.UUCP writes:
>
>> /* Written 10:40 pm Mar 4, 1986 by paulr@decwrl.DEC.COM */
>> Has anyone taken data on the performance gains/losses attributable to
>> separate data/instruction caches in processors? You can reply to me at
>> the above address or to the net, but I am interested in any info.
>>
>> Paul Richardson
>
> I'd be interested in the results.

If you haven't already done so, be sure to take a look at the article:

    Smith, A.J., "Cache Memories," Computing Surveys 14, 3 (Sept. 1982),
    pp. 473-530.

Section 2.8 of this paper gives trace-driven simulations for separate
data/instruction caches. Also, several machines are mentioned (S-1, 801,
Hitachi H200, and Itel AS/6), but no measurements were available at the
time the article was written.
aglew@ccvaxa.UUCP (03/13/86)
Separate instruction/data caches can perform much better than "two-way"
caches: (1) because different algorithms can be used in each cache, to take
into account the different behaviour of instructions and data (e.g., keep
loop heads and return points in the cache longer); but the big win is (2)
you can fetch both instruction and data (for the previous instruction)
simultaneously. This way all you have to do is provide two ports on your
CPU and cache chips, which is expensive, but possible. You don't have to
dual-port all your memory chips. If you can squeeze both caches onto the
CPU chip, you only need one off-chip memory port. By this simple fiat, you
can almost double throughput.

Andy "Krazy" Glew. Gould CSD-Urbana.
USEnet:  ...!ihnp4!uiucdcs!ccvaxa!aglew
ARPAnet: aglew@gswd-vms
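To make the win from splitting concrete, here is a minimal trace-driven
sketch in C. Everything in it is made up for illustration -- the tiny
synthetic trace, the line size, and the cache sizes are not measurements
from any machine in this thread. It compares split direct-mapped I & D
caches against a unified direct-mapped cache of the same total size, on a
trace deliberately contrived so that instruction and data lines collide in
the unified cache:

    /* Trace-driven sketch: split I/D direct-mapped caches vs. a unified
     * direct-mapped cache of the same total size.  All sizes and the
     * trace itself are illustrative assumptions. */
    #include <stdio.h>

    #define LINE_SIZE   16     /* bytes per cache line */
    #define SPLIT_LINES 64     /* lines in each of the I and D caches */
    #define UNIF_LINES  128    /* lines in the unified cache (same total) */

    static unsigned long icache[SPLIT_LINES], dcache[SPLIT_LINES];
    static unsigned long ucache[UNIF_LINES];

    /* Probe a direct-mapped cache; return 1 on a hit, else fill the
     * indexed line and return 0. */
    static int probe(unsigned long *tags, int nlines, unsigned long addr)
    {
        unsigned long line = addr / LINE_SIZE;
        int index = (int)(line % (unsigned long)nlines);

        if (tags[index] == line)
            return 1;
        tags[index] = line;
        return 0;
    }

    int main(void)
    {
        /* 'I' = instruction fetch, 'D' = data reference. */
        static const struct { char kind; unsigned long addr; } trace[] = {
            {'I', 0x1000}, {'D', 0x8000}, {'I', 0x1004}, {'D', 0x8010},
            {'I', 0x1008}, {'D', 0x8000}, {'I', 0x1000}, {'D', 0x8010},
        };
        int n = sizeof trace / sizeof trace[0];
        int split_hits = 0, unif_hits = 0, i;

        for (i = 0; i < SPLIT_LINES; i++)   /* mark all lines empty */
            icache[i] = dcache[i] = ~0UL;
        for (i = 0; i < UNIF_LINES; i++)
            ucache[i] = ~0UL;

        for (i = 0; i < n; i++) {
            split_hits += probe(trace[i].kind == 'I' ? icache : dcache,
                                SPLIT_LINES, trace[i].addr);
            unif_hits  += probe(ucache, UNIF_LINES, trace[i].addr);
        }
        printf("split I/D: %d/%d hits   unified: %d/%d hits\n",
               split_hits, n, unif_hits, n);
        return 0;
    }

On this (admittedly rigged) trace the split caches hit 5 of 8 references,
while the unified cache, whose instruction and data lines map to the same
index, hits only 2 of 8. A real comparison would of course use real traces,
as in the Smith paper cited earlier in the thread.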
ackerman@garth.UUCP (Mike Ackerman) (03/22/86)
John Mashey of MIPS writes:

> Andy "Krazy" Glew, Gould CSD-Urbana, writes:
>
> > Separate instruction/data caches can perform much better than "two-way"
> > caches: (1) because different algorithms can be used in each cache, to
> > take into account the different behaviour of instructions and data
> > (e.g., keep loop heads and return points in the cache longer); but the
> > big win is (2) you can fetch both instruction and data (for the
> > previous instruction) simultaneously. This way all you have to do is
> > provide two ports on your CPU and cache chips, which is expensive, but
> > possible. You don't have to dual-port all your memory chips. If you can
> > squeeze both caches onto the CPU chip, you only need one off-chip
> > memory port. By this simple fiat, you can almost double throughput.
>
> The general idea seems right [i.e., split I & D caches are Good Things],
> but some more detail might be useful:
>
> 1) From our simulations [unpublished], the hit rate for split,
> direct-mapped (i.e., associativity == 1) I & D caches, each of size N, is
> almost as good, but not quite as good, as a joint, 2-way-set-associative
> cache of size 2N.

There seems to be some confusion here over terminology. In Andy's initial
statement, "two-way caches" seems to refer to what most people call
"unified caches", in other words, caches that hold both instructions and
data. Comparing separate size N direct-mapped caches to a unified size 2N
two-way set-associative cache is not a meaningful way to illustrate your
point. According to Alan Jay Smith of UC Berkeley, "there is approximately
a 20 percent to 40 percent advantage for two-way set associative over
direct mapped."* For a given level of associativity, a unified cache of
size N does achieve a slightly better hit ratio than separate I & D caches
of size N.

> 2) Regarding the use of different algorithms for each cache, I'd be
> interested in hearing of live examples of doing this. Except for machines
> with explicit branch caches, I haven't seen this.

The Fairchild CLIPPER allows caching strategies to be specified on a
per-page basis. Pages can be dynamically designated as noncacheable,
write-through, or copy-back. Copy-back (also known as write-back) is a mode
in which writes to the cache do not necessarily result in writes to main
memory; this can produce a significant reduction in main memory bus
traffic. Copy-back is new to the minicomputer world and is implemented by
only one microprocessor manufacturer.

Having separate I & D caches allows the I-cache to implement a prefetch
algorithm: when line N of the cache (16-byte lines) is accessed, line N+1
is fetched if it is not already present. Cache prefetching increases the
I-cache hit ratio in CLIPPER from 88 to 96 percent.**

> 3) As stated, the big win is the increased bandwidth. If you're sneaky
> about it, you don't need to fetch both I & D simultaneously, but can
> interleave them. In that case, the cache chips can be ordinary
> single-ported SRAMs, and the CPU chip needs just 1 off-chip memory port.

Bus bandwidth is a traditional bottleneck whose effects have been discussed
in detail in the literature.*** Being "sneaky" entails time-multiplexing
the bus through the use of a multi-phase bus clock. This approach to
increasing bus bandwidth tends to work at low frequencies such as 8 MHz but
does not scale easily to higher frequencies.
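The prefetch algorithm described above is simple enough to sketch. The
following C fragment is an illustration of the idea only -- the cache
geometry, the fill routine, and the straight-line fetch loop are
assumptions, not the CLIPPER's actual implementation:

    /* One-line-ahead instruction prefetch: on any access to line N,
     * line N+1 is fetched if it is not already present.  Geometry and
     * trace are illustrative assumptions. */
    #include <stdio.h>

    #define LINE_SIZE 16
    #define NLINES    256

    static unsigned long tag[NLINES];
    static int valid[NLINES];

    static void fill(unsigned long line)    /* fetch a line from memory */
    {
        tag[line % NLINES] = line;
        valid[line % NLINES] = 1;
    }

    static int present(unsigned long line)
    {
        return valid[line % NLINES] && tag[line % NLINES] == line;
    }

    static int ifetch(unsigned long addr)   /* returns 1 on a hit */
    {
        unsigned long line = addr / LINE_SIZE;
        int hit = present(line);

        if (!hit)
            fill(line);
        if (!present(line + 1))             /* the prefetch step */
            fill(line + 1);
        return hit;
    }

    int main(void)
    {
        unsigned long pc;
        int hits = 0, refs = 0;

        /* Straight-line code: after the one compulsory miss, every
         * later line has already been prefetched. */
        for (pc = 0x2000; pc < 0x2200; pc += 4) {
            hits += ifetch(pc);
            refs++;
        }
        printf("%d/%d instruction fetches hit\n", hits, refs);
        return 0;
    }

On straight-line code this turns every miss but the first into a hit (127
of 128 fetches here); on real workloads the gain is smaller but still
substantial, as the 88-to-96-percent figure above indicates.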
> 4) At the current state of technology, think twice about trying to
> squeeze the caches on-chip, unless there's an efficient 2nd-level cache
> off-chip, and unless there's no more useful function (like an MMU) to use
> the real estate on-chip. In general, I-caches would like to be at least
> 2-4K bytes, and D-caches bigger. At the current state of technology, it
> is very hard to get even 8K bytes on a CPU chip, given everything else
> that must be done. It's not until you get down around 1-micron CMOS that
> there's even much of a hope. About the only plausible use for a small
> on-chip I-cache on a RISC system is for speedups of otherwise cacheless
> designs, i.e., you could think of doing that with a PC/RT's ROMP chip,
> for example.

At the current state of technology, one could incorporate an FPU into the
execution unit of the CPU, where it belongs, and build a separate
integrated cache and memory management chip. This way the nasty coprocessor
overhead would be eliminated, and the MMU overhead could be hidden by cache
access. Single-chip cache solutions make desirable features such as
copy-back, line sizes greater than 4 bytes, and two-way set associativity
available in microprocessor systems.

At the current state of technology one can create a compact module with a
CPU and separate I & D cache and memory management units. 2-micron CMOS
permits the design of a cache-and-MMU chip with a 128-entry two-way set
associative TLB and a 4K-byte two-way set associative cache with a 16-byte
line size. This organization yields the following hit rates:

                    TLB       Cache
    Instructions    > 99.5%   > 96%
    Data            > 99.5%     90%

Most discrete caches are unified and direct-mapped with a 4-byte line size.
That organization requires a cache size of 128K bytes to achieve hit rates
similar to those of the VLSI example given. To quote a previous John Mashey
posting, "isn't VLSI nice!"

Mike Ackerman
Fairchild Semiconductor
{ihnp4, hplabs}!qantel!vlsvax1!garth!ackerman

  * Alan Jay Smith, "The Effect of the Degree of Associativity on Miss
    Ratio in a Set Associative Cache," October 15, 1985.
 ** Alan Jay Smith, "Cache Evaluation and the Impact of Workload Choice,"
    Report UCB/CSD 85/229, March 1985; Proc. 12th International Symposium
    on Computer Architecture, June 17-19, 1985, Boston, MA, pp. 64-75.
*** Howard Sachs, "Improved Cache Scheme Boosts System Performance,"
    Computer Design, November 1, 1985.
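To see what the hit rates in the table above buy, a back-of-the-envelope
effective-access-time calculation helps. The cycle counts below (1 cycle
for a cache hit, 10 to fetch a line on a miss, 20 to reload a TLB entry)
are assumptions for illustration, not figures from the posting; only the
hit rates come from the table:

    /* Effective access time from the quoted hit rates, using assumed
     * miss penalties. */
    #include <stdio.h>

    int main(void)
    {
        double tlb_hit    = 0.995;  /* quoted: > 99.5% */
        double icache_hit = 0.96;   /* quoted: > 96% (instructions) */
        double dcache_hit = 0.90;   /* quoted: 90% (data) */

        double hit_cost  = 1.0;     /* assumed cycles on a cache hit */
        double miss_cost = 10.0;    /* assumed cycles to fetch a line */
        double tlbm_cost = 20.0;    /* assumed cycles to reload the TLB */

        /* TLB reload cost is simply added on top of the cache access. */
        double i_eat = icache_hit * hit_cost
                     + (1.0 - icache_hit) * miss_cost
                     + (1.0 - tlb_hit) * tlbm_cost;
        double d_eat = dcache_hit * hit_cost
                     + (1.0 - dcache_hit) * miss_cost
                     + (1.0 - tlb_hit) * tlbm_cost;

        printf("effective instruction access: %.2f cycles\n", i_eat);
        printf("effective data access:        %.2f cycles\n", d_eat);
        return 0;
    }

With these assumed penalties the organization above averages about 1.46
cycles per instruction reference and 2.00 per data reference; the point of
the large TLB and the two-way set associative cache is to keep those
numbers close to 1.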