[net.arch] risc questions

paulr@decwrl.DEC.COM (Paul Richardson) (03/05/86)

Has anyone gathered data on the performance gains/losses attributable
to separate data/instruction caches in processors? You can reply to me
at the above address or to the net; either way, I am interested in the info.

						Paul Richardson

aglew@ccvaxa.UUCP (03/07/86)

> /* Written 10:40 pm  Mar  4, 1986 by paulr@decwrl.DEC.COM */
> Has anyone gathered data on the performance gains/losses attributable
> to separate data/instruction caches in processors? You can reply to me
> at the above address or to the net; either way, I am interested in the info.
> 
> 						Paul Richardson

I'd be interested in the results. 

holmer@ji.berkeley.edu (Bruce K. Holmer) (03/10/86)

In article <5100023@ccvaxa> aglew@ccvaxa.UUCP writes:
>
>> /* Written 10:40 pm  Mar  4, 1986 by paulr@decwrl.DEC.COM */
>> Has anyone gathered data on the performance gains/losses attributable
>> to separate data/instruction caches in processors? You can reply to me
>> at the above address or to the net; either way, I am interested in the info.
>> 
>> 						Paul Richardson
>
>I'd be interested in the results. 

If you haven't already done so, be sure to take a look at the article:
	Smith, A.J., "Cache Memories," Computing Surveys 14, 3 
	(Sept. 1982) pp. 473-530.
Section 2.8 of this paper gives trace-driven simulations for separate
data/instruction caches.  Also, several machines are mentioned (S-1,
801, Hitachi H200, and Itel AS/6), but no measurements were available at
the time the article was written.

aglew@ccvaxa.UUCP (03/13/86)

Separate instruction/data caches can perform much better than "two-way"
caches: (1) different algorithms can be used in each cache to take into
account the different behaviour of instructions and data (e.g., keep loop
heads and return points in the cache longer); but the big win is (2) you
can fetch both an instruction and data (for the previous instruction)
simultaneously. All you have to do is provide two ports on your CPU and
cache chips, which is expensive but possible; you don't have to dual-port
all your memory chips. If you can squeeze both caches onto the CPU chip,
you only need one off-chip memory port. By this simple fiat, you can
almost double throughput.

Andy "Krazy" Glew. Gould CSD-Urbana. 
USEnet: ...!ihnp4!uiucdcs!ccvaxa!aglew
ARPAnet: aglew@gswd-vms

ackerman@garth.UUCP (Mike Ackerman) (03/22/86)

John Mashey of MIPS writes:
> Andy "Krazy" Glew. Gould CSD-Urbana, writes:
> > Separate instruction/data caches can perform much better than "two-way"
> > caches: (1) different algorithms can be used in each cache to take into
> > account the different behaviour of instructions and data (e.g., keep loop
> > heads and return points in the cache longer); but the big win is (2) you
> > can fetch both an instruction and data (for the previous instruction)
> > simultaneously. All you have to do is provide two ports on your CPU and
> > cache chips, which is expensive but possible; you don't have to dual-port
> > all your memory chips. If you can squeeze both caches onto the CPU chip,
> > you only need one off-chip memory port. By this simple fiat, you can
> > almost double throughput.
> 
> The general idea seems right [i.e., split I & D caches are Good Things],
> but some more detail might be useful:
> 
> 1) From our simulations [unpublished], the hit rate for split, direct-mapped
> (i.e. associativity == 1) I & D caches, each of size N, is almost as good,
> but not quite as good, as a joint, 2-way-set-associative cache of size
> 2N.

There seems to be some confusion here over terminology.  In Andy's
initial statement, "two-way caches" seems to refer to what most
people call "unified caches"; in other words, caches that hold
both instructions and data.

Comparing separate size N direct-mapped caches to a unified
size 2N two-way set associative cache is not a meaningful way
to illustrate your point.  According to Alan Jay Smith of UC
Berkeley, "there is approximately a 20 percent to 40 percent
advantage for two-way set associative over direct mapped."*
For a given level of associativity, a unified cache of size N does
achieve a slightly better hit ratio than separate I & D caches of
total size N.
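
For readers unfamiliar with the terms being compared, here is a hedged
sketch in C of a two-way set-associative lookup with LRU replacement
(sizes and field layout are illustrative, not CLIPPER's or anyone
else's actual design); a direct-mapped cache is simply the one-way
degenerate case, indexing a single candidate line per address:

#define LINE  16
#define NSETS 64                   /* 2 ways x 64 sets x 16 bytes = 2K bytes */

struct set { unsigned long tag[2]; int valid[2]; int lru; };
static struct set sets[NSETS];

/* Returns 1 on a hit, 0 on a miss (refilling the least recently used way). */
int lookup_2way(unsigned long addr)
{
    unsigned long line = addr / LINE;
    struct set *s = &sets[line % NSETS];
    int w;

    for (w = 0; w < 2; w++)
        if (s->valid[w] && s->tag[w] == line) {
            s->lru = !w;           /* the other way becomes LRU */
            return 1;
        }
    w = s->lru;                    /* miss: victimize the LRU way */
    s->valid[w] = 1;
    s->tag[w] = line;
    s->lru = !w;
    return 0;
}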

> 2) Regarding the use of different algorithms for each cache, I'd be interested
> in hearing of live examples of doing this. Except for machines with explicit
> branch caches, I haven't seen this.

The Fairchild CLIPPER allows caching strategies to be specified on
a per-page basis.  Pages can be dynamically designated as noncacheable,
write-through, or copy-back.  Copy-back (also known as write-back) is a
mode in which writes to the cache do not necessarily result in writes to
main memory, which can significantly reduce main memory bus traffic.
Copy-back is new to the minicomputer world and is currently implemented
by only one microprocessor manufacturer.
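
The bus-traffic saving is easy to see in a small self-contained C
sketch (mine, not Fairchild's; the store counts are arbitrary):
repeated stores to one cached line cost one memory write apiece under
write-through, but only a single line write at eviction under copy-back.

#include <stdio.h>

enum policy { WRITE_THROUGH, COPY_BACK };

struct line { int valid, dirty; unsigned char data[16]; };

static long bus_bytes;                  /* main-memory write traffic */

static void mem_write(int nbytes) { bus_bytes += nbytes; }

static void cache_write(struct line *l, unsigned addr,
                        unsigned char byte, enum policy p)
{
    l->data[addr % 16] = byte;
    if (p == WRITE_THROUGH)
        mem_write(1);                   /* every store reaches memory */
    else
        l->dirty = 1;                   /* copy-back: defer the write */
}

static void evict(struct line *l)
{
    if (l->valid && l->dirty)
        mem_write(16);                  /* one line written back */
    l->valid = l->dirty = 0;
}

int main(void)
{
    struct line l = {1, 0, {0}};
    int i;

    for (i = 0; i < 1000; i++)
        cache_write(&l, i % 16, (unsigned char)i, WRITE_THROUGH);
    printf("write-through: %ld bus bytes\n", bus_bytes);

    bus_bytes = 0;
    l.valid = 1;
    for (i = 0; i < 1000; i++)
        cache_write(&l, i % 16, (unsigned char)i, COPY_BACK);
    evict(&l);
    printf("copy-back:     %ld bus bytes\n", bus_bytes);
    return 0;
}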

Having separate I & D caches allows the I-cache to implement a prefetch
algorithm.  When line N of the cache (lines are 16 bytes) is accessed,
line N + 1 is fetched if it is not already present.  Cache prefetching
increases the I-cache hit ratio in CLIPPER from 88 to 96 percent.**
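
The prefetch rule itself is simple; here is a sketch of it over a toy
direct-mapped tag array (CLIPPER's cache is in fact two-way set
associative, so this illustrates only the rule, not the hardware):

#define LINE   16
#define NLINES 256

static unsigned long tags[NLINES];
static int valid[NLINES];

static int present(unsigned long line)
{
    return valid[line % NLINES] && tags[line % NLINES] == line;
}

static void fill(unsigned long line)    /* fetch one line from memory */
{
    tags[line % NLINES] = line;
    valid[line % NLINES] = 1;
}

/* Instruction fetch: on any access to line N, line N + 1 is brought
 * in if it is not already present.  Returns 1 on a hit. */
int ifetch(unsigned long addr)
{
    unsigned long line = addr / LINE;
    int hit = present(line);

    if (!hit)
        fill(line);
    if (!present(line + 1))
        fill(line + 1);                 /* the prefetch */
    return hit;
}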

> 3) As stated, the big win is the increased bandwidth.  If you're sneaky
> about it, you don't need to fetch both I & D simultaneously, but can
> interleave them.  In that case, the cache chips can be ordinary single-ported
> SRAMS, and the CPU chip needs just 1 off-chip memory port.

Bus bandwidth is a traditional bottleneck whose effects have been
discussed in detail in the literature.***  Being "sneaky" entails
time-multiplexing the bus through the use of a multi-phase bus clock.
This approach to increasing bus bandwidth tends to work at low
frequencies such as 8 MHz but does not scale easily to higher
frequencies.

> 4) At the current state of technology, think twice about trying
> to squeeze the caches on-chip, unless there's an efficient 2nd-level
> cache off-chip, and unless there's no more useful function (like an MMU)
> to use the real-estate on-chip.  In general, I-caches would like to be
> at least 2-4K bytes, and D-caches bigger.  At the current state of technology,
> it is very hard to get even 8K bytes on a CPU chip, given everything
> else that must be done. It's not until you get down to around 1-micron
> CMOS that there's even much of a hope.  About the only plausible use for
> a small on-chip I-cache on a RISC system is for speedups of otherwise
> cacheless designs, i.e., you could think of doing that with a PC/RT's
> ROMP chip, for example.

At the current state of technology, one could incorporate an FPU
into the execution unit of the CPU, where it belongs, and build
a separate integrated cache and memory management chip.  That
way the nasty coprocessor overhead would be eliminated and the
MMU overhead could be hidden by cache access.  Single-chip cache
solutions make desirable features such as copy-back, line sizes
greater than 4 bytes, and two-way set associativity available in
microprocessor systems.

At the current state of technology one can also create a compact module
with a CPU and separate I & D Cache and Memory Management Units.
2-micron CMOS permits the design of a cache/MMU chip with
a 128-entry two-way set-associative TLB and a 4K-byte two-way
set-associative cache with a 16-byte line size.  This organization
yields the following hit rates:

			TLB		Cache
	Instructions	> 99.5%		> 96%
	Data		> 99.5%		90%
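
The field widths implied by those parameters work out as follows (a
worked example derived from the stated sizes, assuming 32-bit
addresses; it is not taken from any CLIPPER documentation):

/* 4096 bytes / 16 bytes per line = 256 lines; 256 lines / 2 ways =
 * 128 sets.  So an address splits as:
 *
 *     | tag (21 bits) | set index (7 bits) | byte offset (4 bits) |
 */
#define OFFSET_BITS 4                   /* 16-byte line */
#define INDEX_BITS  7                   /* 128 sets */

unsigned cache_set(unsigned long addr)
{
    return (addr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
}

unsigned long cache_tag(unsigned long addr)
{
    return addr >> (OFFSET_BITS + INDEX_BITS);
}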

Most discrete caches are unified and direct-mapped with a 4-byte line
size.  That organization requires a cache size of 128K bytes to
achieve hit rates similar to those of the VLSI example given.  To quote
a previous John Mashey posting, "isn't VLSI nice!"


Mike Ackerman
Fairchild Semiconductor
{ihnp4, hplabs}!qantel!vlsvax1!garth!ackerman


* Alan Jay Smith, "The Effect of the Degree of Associativity on Miss Ratio
in a Set Associative Cache," October 15, 1985.

** Alan Jay Smith, "Cache Evaluation and the Impact of Workload Choice,"
Report UCB/CSD 85/229, March 1985; also in Proc. 12th International
Symposium on Computer Architecture, June 17-19, 1985, Boston, MA,
pp. 64-75.

*** Howard Sachs, "Improved Cache Scheme Boosts System Performance,"
Computer Design, November 1, 1985.