[net.arch] cache designs

chuck@dartvax.UUCP (Chuck Simmons) (11/15/84)

I don't know much about architecture, so try not to be too harsh
with me as I parade my ignorance...

All the cacheing schemes I have ever heard of seem to work on a
demand paged basis.  When a piece of information is asked for,
it is sucked into the cache, replacing the least recently used
piece of data in the cache.  (Here I am not making any distinctions
between a cache used to speed up main memory and a cache used to speed
up accesses to a disk.)

One problem with this approach is that when the processor realizes
that a desired piece of data is not in cache, then the processor must
wait until the data can be read into cache.

Now it seems to me that human brains don't work this way.  Rather,
human brains are more associative.  It seems to me that within the
brain, each piece of information points to other pieces of information
which are often used in association with the first piece of information.
Thus, while part of the brain is cogitating one hunk of data, other parts
of the brain can start swapping in pieces of data that will soon be
needed.

Now suppose we had a cache that was much more under a programmer's control.
To be concrete, suppose we have a cache of say, 32 elements each containing
32 words.  And suppose our processor has a load cache instruction with
syntax:

    load cache <cache address> <memory address>

When this instruction was executed, the instruction execution processor
would quickly tell the cache processor to pick up some data from memory.
Now, while the cache processor is picking up the data, the instruction
processor continues executing instructions which are already in cache.

Using this instruction, I could imagine a very careful programmer helping
the system to obtain cache hit rates of at least 90%.  For example, at
the beginning of each 32 word chunk of code, the programmer might load
the next 32 word chunk of code into cache, a chunk of code that might
be branched to, and a few memory locations that would soon be accessed.
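
To sketch what I mean in C (the load_cache() routine here is imaginary,
standing in for the proposed instruction, and stubbed out so the example
compiles):

    /* Hypothetical use of the proposed "load cache" instruction.
     * load_cache() is invented: on the imagined machine it would tell
     * the cache processor to fetch one 32-word block into the named
     * cache element, without stalling the instruction processor. */
    #define BLOCK 32                        /* words per cache element */

    static void load_cache(int element, const long *addr)
    {
        (void)element; (void)addr;          /* stub: real hardware only */
    }

    long sum(const long *a, int n)
    {
        long total = 0;
        int i, j, limit;

        for (i = 0; i < n; i += BLOCK) {
            if (i + BLOCK < n)
                load_cache(1, a + i + BLOCK);   /* start fetching the
                                                   next block now... */
            limit = (i + BLOCK < n) ? i + BLOCK : n;
            for (j = i; j < limit; j++)         /* ...while we chew on
                                                   this one */
                total += a[j];
        }
        return total;
    }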

So maybe someone out there can tell me about interesting cache designs,
or tell me why a cache such as I have described wouldn't work (the cache
processor conflicts with the instruction processor?  a compiler could
never use the load cache instruction effectively?  no programmer in
her right mind would want to bother with the instruction?).

You bring the flames, I'll bring the marshmallows.

dartvax!chuck

kissell@flairvax.UUCP (Kevin Kissell) (11/17/84)

Well, the 68020 has cache-control instructions of a sort -  you can freeze 
the on-chip cache.  That's probably worthwhile, because that cache is a little 
small.  I believe it is better to have a cache big enough to get a 90+%
hit ratio without intervention than to use dedicated cacheing instructions.

Kevin D. Kissell
Fairchild Advanced Processor Development
uucp: {ihnp4 decvax}!decwrl!\
                             >flairvax!kissell
    {ucbvax sdcrdcf}!hplabs!/

abc@brl-tgr.ARPA (Brint Cooper ) (11/20/84)

> 
> All the cacheing schemes I have ever heard of seem to work on a
> demand paged basis.  When a piece of information is asked for,
> it is sucked into the cache, replacing the least recently used
> piece of data in the cache. 

Not exactly.  Data moves into the cache in whole "blocks," so one
request brings along a block's worth of neighboring words.
> 
> One problem with this approach is that when the processor realizes
> that a desired piece of data is not in cache, then the processor must
> wait until the data can be read into cache.  

But only for a few microseconds.  The miss is handled entirely in
hardware; the delay is so short that the OS isn't even aware of it.
...
> 
> Now suppose we had a cache that was much more under a programmer's control.
> To be concrete, suppose we have a cache of say, 32 elements each containing
> 32 words.  And suppose our processor has a load cache instruction with
> syntax:
> 
>     load cache <cache address> <memory address>
> 
> When this instruction was executed, the instruction execution processor
> would quickly tell the cache processor to pick up some data from memory.
> Now, while the cache processor is picking up the data, the instruction
> processor continues executing instructions which are already in cache.
> 
> Using this instruction, I could imagine a very careful programmer helping
> the system to obtain cache hit rates of at least 90%. 
> 
> So maybe someone out there can tell me about interesting cache designs,
> or tell me why a cache such as I have described wouldn't work... 

It probably would work, but it isn't needed.  Further, this is the sort
of drudgery from which programmers must be liberated.

Cache memory is based upon the "principle of locality," which roughly
means that the next memory location to be referenced by the program is
probably quite close to the last one referenced.  Therefore, with a
"reasonable" cache size and a fast cache, high "hit ratios" should be
possible.  The exact numbers, of course, depend upon the program being
run and the design parameters.
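
To illustrate in C (my own example): both loops below sum the same
matrix, but the first walks memory in address order, so every word of
each fetched cache block gets used, while the second strides across
blocks and can miss on nearly every reference.

    /* Spatial locality illustrated.  grid is stored row-major, so
     * sum_rows() touches memory sequentially and reuses each cache
     * block fully; sum_cols() jumps N words at a time and defeats
     * the block fetch. */
    #define N 512
    long grid[N][N];

    long sum_rows(void)
    {
        long t = 0;
        int i, j;
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                t += grid[i][j];    /* consecutive addresses */
        return t;
    }

    long sum_cols(void)
    {
        long t = 0;
        int i, j;
        for (j = 0; j < N; j++)
            for (i = 0; i < N; i++)
                t += grid[i][j];    /* stride of N words */
        return t;
    }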


Brint

(301) 278-6883    AV:  283-6883     FTS: 939-6883

ArpaNet:  abc@brl
UUCP:     ...!{decvax,cbosgd}!brl-bmd!abc
Postal:

  Dr Brinton Cooper
  U.S. Army Ballistic Research Laboratory
  Attn: AMXBR-SECAD (Cooper)
  Aberdeen Proving Ground, Md  21005

bcase@uiucdcs.UUCP (11/22/84)

[bug lunch]

Your software-controlled cache is a very good idea.  It is so good
that it is being implemented by smart compilers everywhere.  If you
imagine that the register file of a machine is a cache that must be
managed by software, and that at any time the things in the register
file also have home memory locations associated with them, then the
code generator (compiler or human being, but the most interesting
case is the compiler) or optimizer (as you wish) does the cache
management by emitting the proper load and store instructions.  Thus,
with a really smart and globally optimizing compiler, a large register
file starts to perform like (better than?) a good (great?) data cache.

This is one interpretation of your idea; another is honest-to-goodness
cache hardware controlled by software.  The IBM 801 had some such
instructions, but I am not qualified to comment.  My point is that your
idea is a good one, but it is not really new (sigh, there are no new
ideas, only old ones.  Many times I thought I had thought of something
new, only to read about it in some paper somewhere).

One good place to start reading about this idea is the paper by Sites,
"How to Use 1000 Registers."  Although his ideas sound a little like the
RISC register file, the essence is there.  Then, if you are into
compilers, read about register allocation by coloring and spilling (a
paper by some guys at IBM; I can get the reference for you if you are
interested) and about register allocation by priority-based coloring (by
some Stanford guys, Chow and Hennessy).  I don't know if anyone has done
any work with software-controlled instruction caches, but it should be
looked into (although it sounds a little scary!).
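
A trivial C rendition of that software cache management (a sketch of
mine, with invented names): the value's home location is read once into
a register-resident local, all the work happens on the register copy,
and the result is stored back home at the end, which is exactly the
discipline a hardware cache automates.

    /* The register file as a software-managed cache.  Instead of
     * referencing *count in memory on every iteration, the compiler
     * (or programmer) "caches" it in a register and writes it back
     * once. */
    void count_blanks(const char *s, int *count)
    {
        int c = *count;         /* load: fetch the word into the cache */
        while (*s != '\0') {
            if (*s == ' ')
                c++;            /* operate on the register copy only */
            s++;
        }
        *count = c;             /* store: write the word back home */
    }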

srm@nsc.UUCP (Richard Mateosian) (11/23/84)

> Well, the 68020 has cache-control instructions of a sort -  you can freeze 
> the on-chip cache.

There is no way that a user-mode program can use that feature, even through
a system call.
-- 
Richard Mateosian
{amd,decwrl,fortune,hplabs,ihnp4}!nsc!srm    nsc!srm@decwrl.ARPA

dgreen@ucla-cs.UUCP (11/27/84)

Regarding predictive hardware caches, and indeed almost everything else you
wanted to know about performance of hardware caches, please refer to the
following survey article:

Alan J. Smith, "Cache Memories," ACM Computing Surveys, Vol. 14, No. 3,
September 1982.

He outlines three cache prefetch algorithms:

1. Always prefetch.  Whenever you access memory block x, prefetch memory block
   x+1.
2. Prefetch on misses.  Whenever you attempt to access memory block x, and x is
   missing from the cache, prefetch memory block x+1.
3. Tagged prefetch.  "We associate with each line a single bit called the tag,
   which is set to one whenever the line is accessed by a program.  It is
   initially zero and is reset to zero when the line is removed from the cache.
   Any line brought to the cache by a prefetch operation retains its tag of
   zero.  When a tag changes from 0 to 1 (i.e., when the line is referenced for
   the first time after prefetching or is demand-fetched), a prefetch is
   initiated for the next sequential line.  The idea is very similar to
prefetching on misses only, except that a miss which did not occur because the
   line was prefetched (i.e., had there not been a prefetch, there would have
   been a miss to this line) also initiates a prefetch."

"... always prefetching cut the miss ratio by 50 to 90 percent ... and tagged
prefetch was almost equally effective.  Prefetching only on misses was less
than half as good ..."

(all quotes from Smith)
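
A toy simulation of the tagged-prefetch scheme, in C (my own sketch of
Smith's description, using a direct-mapped cache for simplicity):

    /* Tagged prefetch on a direct-mapped cache of LINES lines.
     * A line arriving by prefetch carries tag 0; the first demand
     * reference to it sets the tag to 1 and prefetches the next
     * sequential block, as does an ordinary miss. */
    #include <stdio.h>

    #define LINES 64

    static long block_of[LINES];    /* which memory block each line holds */
    static int  valid[LINES];
    static int  tag[LINES];         /* 0 = not yet referenced by program */
    static long misses;

    static void install(long block, int t)
    {
        int i = (int)(block % LINES);
        block_of[i] = block;
        valid[i] = 1;
        tag[i] = t;
    }

    static void prefetch(long block)
    {
        int i = (int)(block % LINES);
        if (valid[i] && block_of[i] == block)
            return;                 /* already resident: nothing to do */
        install(block, 0);          /* prefetched lines keep tag 0 */
    }

    static void reference(long block)
    {
        int i = (int)(block % LINES);
        if (!valid[i] || block_of[i] != block) {
            misses++;               /* demand miss: fetch and prefetch */
            install(block, 1);
            prefetch(block + 1);
        } else if (tag[i] == 0) {
            tag[i] = 1;             /* first touch of a prefetched line */
            prefetch(block + 1);    /* the miss that "didn't happen"
                                       still triggers a prefetch */
        }
    }

    int main(void)
    {
        long b;
        for (b = 0; b < 1000; b++)  /* purely sequential reference string */
            reference(b);
        printf("misses: %ld\n", misses);    /* prints 1: every block after
                                               the first arrives early */
        return 0;
    }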

Dan Greening, Computer Science Department UCLA.

kissell@flairvax.UUCP (Kevin Kissell) (11/27/84)

Cache control by a programmer or compiler differs from register file management
in that cacheing affects both data and program memory accesses, while register 
management has a direct effect only on data memory accesses.  The exception is
a machine whose registers are addressable as memory by the instruction unit,
but not many contemporary designs have that property.  (As I recall, DEC 10's
and 20's did.)

Kevin D. Kissell
Fairchild Advanced Processor Development
uucp: {ihnp4 decvax}!decwrl!\
                             >flairvax!kissell
    {ucbvax sdcrdcf}!hplabs!/

paul@ISM780B.UUCP (11/29/84)

> Now suppose we had a cache that was much more under a programmer's control.
> To be concrete, suppose we have a cache of say, 32 elements each containing
> 32 words.  And suppose our processor has a load cache instruction with
> syntax:

>     load cache <cache address> <memory address>

> When this instruction was executed, the instruction execution processor
> would quickly tell the cache processor to pick up some data from memory.
> Now, while the cache processor is picking up the data, the instruction
> processor continues executing instructions which are already in cache.

You may have just re-invented registers, and part of the rationale
for a RISC machine with tons of registers and memory addressing restricted
to explicit load and store instructions.  Current supercomputers tend to
be built this way already.

> ...For example, at
> the beginning of each 32 word chunk of code, the programmer might load
> the next 32 word chunk of code into cache, a chunk of code that might
> be branched to, and a few memory locations that would soon be accessed.

Automatic prefetching of instructions is a common way of speeding
up a cpu.  The bigger the machine, the bigger the prefetch cache.
It is sometimes suggested that branch instructions should allow
the programmer (or compiler) to indicate which way the branch is
most likely to go; this would make "automatic" instruction prefetching
"virtually" equivalent to explicit instruction cache control.

Paul Perkins    --      INTERACTIVE Systems
USENET:     {ihnp4|allegra|harvard|amd}!ima!paul
	    decvax!cca!ima!paul
	    harpo!inmet!ima!paul
	    ucbvax!cbosgd!ima!paul
	    {uscvax|ucla-vax|vortex}!ism780!paul
Disclaimer: This message is provided AS IS.  The reader assumes all risk as
	    to the accuracy of the content and its usefulness for any
	    particular purpose.  Opinions expressed are my own and not
	    necessarily those of my employer or anyone else.  Void where
	    taxed or prohibited.