[comp.arch] Instruction caches and closures

zs01+@andrew.cmu.edu (Zalman Stern) (07/02/90)

The main problem with unsynchronized split I&D caches is the lack of
efficient software support for flushing the caches and synchronizing them.
For example, the MIPS cachectl system call requires one to invalidate at
least an entire page. That artificially inflates the cost of leaving cache
consistency out of the architecture for many cases. (Not to mention the
overhead of the system call to begin with.) Perhaps this is why John
suggests using a large chunk of memory and only flushing it when part of it
is reused.

A better example is the IBM RIOS which includes user level instructions for
cache flushing and synchronization. The flush instruction takes a virtual
address and flushes the cache line containing that address. (If the line is
not present, then it does nothing.) The cache flush instruction is also
set up so that it can be used efficiently in a loop.
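
For concreteness, here is a hedged sketch of such a flush loop in C with
inline assembly. The mnemonics are the PowerPC descendants of the RIOS
cache instructions (dcbst/sync/icbi/isync); the loop structure is the
point, not the exact opcodes, so check the assembler manual for the RIOS
originals:

	/* Make a freshly written code range executable.  A minimal
	 * sketch, assuming 32-byte cache lines and GCC-style inline
	 * assembly; the mnemonics stand in for the RIOS originals.
	 */
	#include <stddef.h>

	#define LINE 32                     /* assumed cache-line size */

	void flush_code_range(char *addr, size_t len)
	{
	    char *p, *end = addr + len;

	    for (p = addr; p < end; p += LINE)      /* write D-lines back */
	        __asm__ volatile ("dcbst 0,%0" : : "r"(p) : "memory");
	    __asm__ volatile ("sync");              /* wait for the writes */
	    for (p = addr; p < end; p += LINE)      /* kill stale I-lines */
	        __asm__ volatile ("icbi 0,%0" : : "r"(p) : "memory");
	    __asm__ volatile ("isync");             /* drop prefetched insns */
	}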

One still has to pay the overhead of going through main memory on the RIOS.
In addition to the data cache writes to memory (the RIOS uses write
back), there will be I-cache misses when the written instructions are
executed.  The I-cache misses may not amount to many cycles, though. In
theory this could be avoided by providing some explicit mechanism to
forward data from the data cache to the instruction cache. This isn't so
unreasonable to contemplate on the RIOS in that I-cache reload goes through
the D-cache chips anyway. (My understanding is that this was done for lack
of hardware resources and is not good for performance.)

Sincerely,
Zalman Stern | Internet: zs01+@andrew.cmu.edu | Usenet: I'm soooo confused...
Information Technology Center, Carnegie Mellon, Pittsburgh, PA 15213-3890
*** Friends don't let friends program in C++ ***

yuhara@minako.stars.flab.fujitsu.co.jp (Masanobu Yuhara) (07/02/90)

When I designed the cache/MMU for a microprocessor,
I came across the same problem (consistency between the I-cache and D-cache).

Self-modifying code does not appear frequently,
but processors should support it somehow.
What I had in mind was the garbage collector of a Lisp system.
At least one of our Lisps does garbage collection of the code area as well
as the heap area.

Our solution was to provide an instruction called PIB (Purge Instruction
Buffer). This instruction guarantees that modifications to code become
visible to subsequent instruction fetches. It is NOT a privileged
instruction.

On most processors, cache purge is privileged,
but not all operating systems provide a cache-purge system call.
(I have heard SunOS has such a system call, though it is not documented.)

The processor I designed has separate instruction and data caches with
snooping
(physical caches, write-through, dual-ported tags, invalidation-type
snooping).
The instruction cache snoops stores not only by other processors but also
by the same processor. This means the PIB instruction does not have to
purge the instruction cache at all. All it has to do is purge the
instruction queue and the pipeline, and ensure that writes from the same
processor have been snooped.

This scheme requires very little additional hardware and
has no impact on performance.
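
A hedged usage sketch in C (the "pib" mnemonic is taken from the
description above; the inline-assembly syntax is a guess):

	#include <string.h>

	/* After the collector moves a code object, one PIB suffices:
	 * the I-cache has already snooped the processor's own stores,
	 * so PIB only has to drain the instruction queue and pipeline.
	 */
	void gc_move_code(void *dst, const void *src, size_t len)
	{
	    memcpy(dst, src, len);                  /* ordinary stores */
	    __asm__ volatile ("pib" ::: "memory");  /* assumed syntax  */
	}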

------
yuhara@flab.fujitsu.co.jp
Masanobu YUHARA
Fujitsu Laboratories

mo@messy.bellcore.com (Michael O'Dell) (07/02/90)

The SPARCitecture provides for a FLUSH instruction specifically
required to be issued by a program when it diddles instructions.
(Earlier versions of the SPARCitecture document called it
"IFLUSH", but it probably must diddle both caches if there
are two, and the single cache if there is only one,
so they changed the name at some point.)

At Prisma, the first program we found which didn't issue
them was the dynamic linker! When we first tried to run a
dynamically linked program on the Prisma P1 simulator
(with two caches), the results were spectacular.  The bug was
trivial to fix and everyone was appropriately chagrined.
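
A hedged sketch of the fix, in C with inline assembly (SPARC V8 FLUSH
covers at least a doubleword per issue; the alignment and stride here are
assumptions):

	/* Issue FLUSH over a range of freshly written instructions,
	 * e.g. the jump-table slots the dynamic linker patches.
	 */
	void flush_range(char *addr, unsigned long len)
	{
	    char *p   = (char *)((unsigned long)addr & ~7UL);  /* align */
	    char *end = addr + len;

	    for (; p < end; p += 8)
	        __asm__ volatile ("flush %0" : : "r"(p) : "memory");
	}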

	-Mike

hawkes@mips.COM (John Hawkes) (07/02/90)

In article <gaXkbY600asBMBA8cI@andrew.cmu.edu> zs01+@andrew.cmu.edu (Zalman Stern) writes:
>The main problem with unsynchronized split I&D caches is the lack of
>efficient software support for flushing the caches and synchronizing them.
>For example, the MIPS cachectl system call requires one to invalidate at
>least an entire page. That artificially inflates the cost of leaving cache
>consistency out of the architecture for many cases. (Not to mention the
>overhead of the system call to begin with.) Perhaps this is why John
>suggests using a large chunk of memory and only flushing it when part of it
>is reused.

The MIPS 'cachectl' syscall flips a virtual address range from cached to
uncached, and vice versa, and indeed it is implemented with a page-sized
granularity.

I suggest you instead consider 'cacheflush', which flushes/invalidates
arbitrarily sized pieces of the address space from the I-cache, D-cache, or
both.  Granted, this isn't as efficient as an inline user-executable
instruction sequence, but it is more efficient than the sledgehammer
approach of 'cachectl'.
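
A minimal sketch of the call, assuming the constants in <sys/cachectl.h>
(check the manual page for the exact names):

	#include <sys/cachectl.h>

	/* Make a range of freshly written instructions visible to
	 * fetch: write back the D-cache and invalidate the I-cache
	 * over the range.
	 */
	int make_code_visible(void *addr, int nbytes)
	{
	    return cacheflush(addr, nbytes, BCACHE);  /* both caches */
	}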

-- 
John Hawkes
{ames,decwrl}!mips!hawkes  OR  hawkes@mips.com

jack@cwi.nl (Jack Jansen) (07/03/90)

This I/D cache discussion sparked a thought: I get the impression that
separate I/D caches are used mainly to get the benefit of two-way
set associative caches without the cost involved. If there aren't any
other added advantages, why not just have two caches that cache both
instructions and data, and steal the top address bit to denote which
cache to use? The OS can easily set things up so that text segments
are mapped in with the top bit clear, and data segments with the top
bit set. You could even extend the scheme to add more caches, and
let kernel-I references use a different cache from user-I refs, so that
the poor application program whose inner loop happens to collide with
the clock interrupt routine doesn't suffer unduly.

And, of course, this neatly solves the closure problem.
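
A sketch of the proposed lookup, with all parameters invented for
illustration (32-bit addresses, 16-byte lines, two 64 KB direct-mapped
banks):

	#define LINE_BITS   4                   /* 16-byte lines */
	#define INDEX_BITS 12                   /* 4096 lines per bank */

	/* The top address bit picks the bank; the usual low-order bits
	 * pick the line within it.  The OS maps text with the top bit
	 * clear and data with it set.
	 */
	unsigned bank(unsigned addr) { return addr >> 31; }

	unsigned line(unsigned addr)
	{
	    return (addr >> LINE_BITS) & ((1u << INDEX_BITS) - 1);
	}
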
--
A people that yields to tyrants		| Oral:     Jack Jansen
will lose more than life and goods;	| Internet: jack@cwi.nl
then the light goes out			| Uucp:     hp4nl!cwi.nl!jack

mash@mips.COM (John Mashey) (07/03/90)

In article <1756@charon.cwi.nl> jack@cwi.nl (Jack Jansen) writes:
>This I/D cache discussion sparked a thought: I get the impression that
>separate I/D caches are used mainly to get the benefit of two-way
>set associative caches without the cost involved. If there aren't any
>other added advantages,...

No, the main reason is to double the peak bandwidth from cache -> CPU
with the same speed SRAMs.  The more usual effect is that a RISC CPU
can fetch an instruction every cycle, and still do a load or store
every cycle without penalty.  Of course, realistically, since the usual
rule of thumb is 20% loads, 10% stores, using a joint cache would
give you a 30% cycle count increase, assuming a 1-cycle hit per load/store.
Straightforward writeback caches may add more, double-width I-fetches
may reduce the penalty, etc., etc.  In general, the decision to use
split caches comes from the wish to avoid this overhead.
The effect of partially simulating 2-way set-associativity is a nice side
benefit, too.  See pp. 423-425 of H&P, also Section 6.4.
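
The arithmetic behind the 30% figure, as a toy model (the 20%/10%
frequencies are the rule-of-thumb numbers above):

	#include <stdio.h>

	int main(void)
	{
	    double loads = 0.20, stores = 0.10;  /* rule-of-thumb mix  */
	    double split = 1.0;                  /* fetch overlaps mem */
	    double joint = 1.0 + loads + stores; /* each mem op steals
	                                            one fetch cycle    */

	    printf("joint/split = %.2f\n", joint / split);  /* 1.30 */
	    return 0;
	}
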
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

jack@cwi.nl (Jack Jansen) (07/06/90)

In article <39868@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <1756@charon.cwi.nl> jack@cwi.nl (Jack Jansen) writes:
>>This I/D cache discussion sparked a thought: I get the impression that
>>separate I/D caches are used mainly to get the benefit of two-way
>>set associative caches without the cost involved. If there aren't any
>>other added advantages,...
>
>No, the main reason is to double the peak bandwidth from cache -> CPU
>with the same speed SRAMs.

OK, but that doesn't invalidate the point: separate I/D caches limit
the ways in which the OS can make best use of the available cache, while
a cache system that uses an address bit (or more address bits) to select
the cache (or, more correctly, cache bank) should give the same performance
and more freedom.
--
A people that yields to tyrants		| Oral:     Jack Jansen
will lose more than life and goods;	| Internet: jack@cwi.nl
then the light goes out			| Uucp:     hp4nl!cwi.nl!jack

mash@mips.COM (John Mashey) (07/06/90)

In article <1782@charon.cwi.nl> jack@cwi.nl (Jack Jansen) writes:
>In article <39868@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>In article <1756@charon.cwi.nl> jack@cwi.nl (Jack Jansen) writes:
>>>This I/D cache discussion sparked a thought: I get the impression that
>>>separate I/D caches are used mainly to get the benefit of two-way
>>>set associative caches without the cost involved. If there aren't any
>>>other added advantages,...

>>No, the main reason is to double the peak bandwidth from cache -> CPU
>>with the same speed SRAMs.

>Ok, but that doesn't invalidate the point: separate I/D caches limit
>the OS in the ways it can make best use of the available cache, while
>a cache system that uses an address bit (or more address bits) to select
>the cache (or cache bank, more correctly) should give the same performance
>and more freedom.

I must have misunderstood the original posting, as I didn't really
understand how this design was supposed to work at all.
Either:
	a) An instruction reference, and a data reference to the same
	address reference the same bits in the same SRAM.
OR
	b) They don't.

People design computers where:
	a) is the case, hence consistency is there, except maybe for
	prefetch buffers & similar things.
	b) is the case, and consistency is handled by software, explicitly,
	which is what almost everybody does that has separate I & D caches.
	c) is like b), except that extra hardware (somewhere) is
	used to maintain consistency, most likely by using
	a multiprocessor cache-coherency protocol that must
	gain ownership of a cache line before doing the write. (Amdahl)

Having multiple banks of SRAMs, decoded by address bits, doesn't make
the fundamental problem go away, unless I completely misunderstand
the description.  A convincing description would start with a
virtual address, showing Virtual Page Number and Byte Offset fields,
and work thru the translation process to a physical address
(i.e., page frame and byte-offset fields), which, combined with
an I/D bit selection, accesses the SRAMs.  Remember, the standard split
I & D cache never makes any I reference or D reference wait for
the other one.
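
To make the request concrete, here is that walk-through in C, with page
size, line size, and index width all assumed, and tlb_lookup() standing in
for the real translation hardware:

	#define PAGE_BITS  12                   /* 4 KB pages, assumed */
	#define LINE_BITS   4                   /* 16-byte lines, assumed */
	#define INDEX_BITS 12

	extern unsigned tlb_lookup(unsigned vpn);   /* VPN -> page frame */

	unsigned sram_index(unsigned vaddr, unsigned is_ifetch)
	{
	    unsigned vpn    = vaddr >> PAGE_BITS;
	    unsigned offset = vaddr & ((1u << PAGE_BITS) - 1);
	    unsigned paddr  = (tlb_lookup(vpn) << PAGE_BITS) | offset;

	    /* The I/D bit becomes one more index bit, so an I-fetch and
	     * a data reference to the same physical address land in
	     * different SRAM words -- case b) above, and the consistency
	     * problem is still there.
	     */
	    return (is_ifetch << INDEX_BITS)
	         | ((paddr >> LINE_BITS) & ((1u << INDEX_BITS) - 1));
	}
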
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

alvitar@xavax.com (Phillip Harbison) (07/07/90)

In article <1756@charon.cwi.nl> jack@cwi.nl (Jack Jansen) writes:
> This I/D cache discussion sparked a thought: I get the impression that
> separate I/D caches are used mainly to get the benefit of two-way
> set associative caches without the cost involved. If there aren't any
> other added advantages, why not just have two caches ...

This is not the reason for split I&D caches.  If you think about it,
split I&D caches require about as much hardware as a two-way
set-associative cache (two sets of comparators, two sets of tag RAMs,
two data paths, two bus connections, etc.).  There appear to be two
major reasons for using a split I&D cache.  [1] It allows the CPU to
implement a Harvard architecture and enjoy the associated benefits,
i.e. twice as much memory bandwidth.  [2] It allows each cache to be
tuned to its particular task.  For example, large cache block sizes
may be very beneficial in an I-cache but less useful in a D-cache.
Also, the I-cache doesn't have to be writeable by the CPU.

> ... instructions and data, and steal the top address bit to denote
> which cache to use? The OS can easily set things up so that text
> segments are mapped in with the top bit clear, and data segments
> with the top bit set.

If you design a cache in this manner, it is not a 2-way set associative
cache as your previous comments imply.  In fact, it is just a direct-
mapped cache that is twice as large.

-- 
Live: Phil Harbison, Xavax, P.O. Box 7413, Huntsville, AL 35807
Uucp: alvitar@xavax.com
Bell: 205-883-4233, 205-880-8951

mash@mips.COM (John Mashey) (07/09/90)

In article <1990Jul7.041100.2413@xavax.com> alvitar@xavax.com (Phillip Harbison) writes:
>In article <1756@charon.cwi.nl> jack@cwi.nl (Jack Jansen) writes:
>> This I/D cache discussion sparked a thought: I get the impression that
>> separate I/D caches are used mainly to get the benefit of two-way

>This is not the reason for split I&D caches.  If you think about it,
>split I&D caches require about as much hardware as a two-way
>set-associative cache (two sets of comparators, two sets of tag RAMs,
>two data paths, two bus connections, etc.).  There appear to be two
>major reasons for using a split I&D cache.  [1] It allows the CPU to
>implement a Harvard architecture and enjoy the associated benefits,
>i.e. twice as much memory bandwidth.  [2] It allows each cache to be
>tuned to its particular task.  For example, large cache block sizes
>may be very beneficial in an I-cache but less useful in a D-cache.
>Also, the I-cache doesn't have to be writeable by the CPU.

All of these are good comments.  In addition:
1) In some cases, especially with an off-chip cache, the tag-comparison
on a load or instruction fetch is in series with delivery to the
processor.  It is probably easier to hide this for an I-fetch than for
a load instruction, by various trickery.  However, in some designs,
it can be part of the critical path for a load instruction, either:
	a) Adding a cycle to the load latency.
	OR
	b) Lenghtening the cycle time.
ALl of this is easier to deal with, with on-chip caches.
2) With direct-mapped caches, it is sometimes easier to optimize the
(most-common) cache-hit case, i.e., because you can send the data
from cache -> pipeline, and be doing tag check and parity check in
parallel, and then suppress the load and restart the pipeline as needed.

3) Depending on the nature of the cache, sometimes a direct-mapped cache
can easily implement a 1-cycle store, which is difficult for a 2-way
set-associative cache.  Of course, for any store into a cache for which
there is not a separate tag for the unit being stored, you're likely to pay
2 cycles or more anyway.  Fortunately, some degree of write buffering helps
mask all of this in almost any cache design.

4) See: Steven Przybylski, CACHE AND MEMORY HIERARCHY DESIGN,
Morgan Kaufmann, Spring 1990, for thorough analyses of the tradeoffs
among cache designs, and good references to important papers in this area.
It especially talks about the cycle-time impacts of various design choices,
or why:
	What is your cache-miss rate?
			is NOT a very useful question compared to
	What % of the total cycles are lost to the memory system?
			which is a better question.
Of course, the best question is:
	How long does it take to run?
		because it is quite possible for a system to look WORSE
		on either of the measures above, if it turns out that
		a "more-efficient-looking" design lengthens the
		cycle time of the machine MORE than its efficiency
		improvement.
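
A toy comparison (all numbers invented) showing how the design with the
lower miss rate can still lose on total run time:

	#include <stdio.h>

	int main(void)
	{
	    double insns = 1e6, miss_penalty = 10.0;

	    /* A: direct-mapped, 5% misses, 20 ns cycle */
	    double a = insns * (1.0 + 0.05 * miss_penalty) * 20e-9;

	    /* B: 2-way set-associative, 3% misses, 25 ns cycle */
	    double b = insns * (1.0 + 0.03 * miss_penalty) * 25e-9;

	    printf("A: %.1f ms  B: %.1f ms\n", a * 1e3, b * 1e3);
	    return 0;   /* A: 30.0 ms  B: 32.5 ms -- B's lower miss
	                   rate loses to its longer cycle */
	}
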
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086