[comp.arch] flexible caches

levisonm@qucis.queensu.CA (Mark Levison) (09/14/89)

   As the caches start to get as smart as we have discussed, they
begin to look more and more like a piece of fast local memory or a
large register set.  A good example of this can be seen in the
Cray-2, which has 2K (I think, maybe more) of very fast local
memory.  The compiler, in all of its wisdom, makes decisions about
what it wants to store there and hopefully does a better job for
vectors and other pathological cases than normal cache management
does.  The only problem with this scheme, as I see it, is that the
program must know exactly how much local memory it has in advance
(although that might also be a problem for smart compilers using
smart caches), and so programs must be recompiled for machines with
different-sized local memories.
Does anyone else have any more semi-rational thoughts on this
and where caches might go in the future?
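
To make the recompilation problem concrete, here is a toy sketch in C
(this is not actual Cray-2 code; LOCAL_MEM_WORDS and the staging array
are just stand-ins for compiler-managed local memory):

    #define LOCAL_MEM_WORDS 2048            /* assumed size of the fast local memory */

    void sum_blocks(const double *a, long n, double *result)
    {
        double local[LOCAL_MEM_WORDS];      /* stands in for the local memory */
        double s = 0.0;
        long i, j;

        for (i = 0; i < n; i += LOCAL_MEM_WORDS) {
            long len = (n - i < LOCAL_MEM_WORDS) ? (n - i) : LOCAL_MEM_WORDS;
            for (j = 0; j < len; j++)       /* stage a block into local memory */
                local[j] = a[i + j];
            for (j = 0; j < len; j++)       /* then operate out of it */
                s += local[j];
        }
        *result = s;
    }

Change the size of the local memory and LOCAL_MEM_WORDS must change
with it, so the program has to be recompiled for the new machine.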

Mark Levison
levisonm@qucis.queensu.ca
Queen's University
Kingston, Ont.
Canada

les@unicads.UUCP (Les Milash) (09/14/89)

In article <224@qusunr.queensu.CA> levisonm@qucis.queensu.CA (Mark Levison) writes:
>[caching tricks are like compiler-directed use of registers]
>The only problem with this scheme
>[is that] programs might have to be recompiled [...]
My old friend Griff Hamlin said that at LANL they'd never keep the object
code, since you could compile, link, load, and start spewing results faster
than 9600 baud could keep up anyway.  (We're talking Crays.)

mmm@cup.portal.com (Mark Robert Thorson) (09/15/89)

A very smart cache controller might gather statistics on the usage of
particular memory locations, which might be used to decide which memory
locations to cache.

However, there are many possible cases in which two programs seem to make
the same use of some memory locations, but one program would be very
sensitive to the caching of these locations, and the other would not.

In these cases, an even smarter cache might make use of a neural network
to recognize which program should get caching of these memory locations.
The training signal for this network could come from some indicator of
the speed of execution of the program, for example number of instructions
executed during a sample period.  Or, the compiler might insert code to 
provide each program with a heartbeat visible to the cache.

The data gathered on caching effectiveness might be part of the module
loaded when the program is run.  And of course, it would be written back
to disk when the program terminates.  A smart compiler might prepare an
initial set of predicted dynamic characteristics for the module, and the
neural network could optimize these characteristics with data gathered
from the actual use of the program.
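
As a very rough sketch of that bookkeeping (the file format and the
names below are pure invention, not any real loader's interface), the
per-page statistics could be read in when the module is loaded and
written back when the program terminates:

    #include <stdio.h>

    struct page_stats {
        unsigned long page;      /* virtual page number */
        long          benefit;   /* predicted/learned caching benefit */
    };

    /* Read the profile shipped with the module; returns the number of
     * entries found, or 0 on a first run (fall back to the compiler's
     * initial guesses). */
    int load_profile(const char *path, struct page_stats *tbl, int max)
    {
        FILE *f = fopen(path, "rb");
        int n = 0;
        if (f != NULL) {
            n = (int) fread(tbl, sizeof tbl[0], max, f);
            fclose(f);
        }
        return n;
    }

    /* Write the updated statistics back at program termination. */
    void save_profile(const char *path, const struct page_stats *tbl, int n)
    {
        FILE *f = fopen(path, "wb");
        if (f != NULL) {
            fwrite(tbl, sizeof tbl[0], n, f);
            fclose(f);
        }
    }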

The caching effectiveness of particular memory locations might even change
on the fly.  For example, a program might have some distinct operating modes,
like restructuring a tree and refreshing a display.  This is like having two
separate programs knit into one.  A neural network which learned and forgot
quickly might make determinations like "Hey!  Caching address 0177566 doesn't
work anymore.  Try caching 0000400!"  Another possibility is that a register
would be updated by the program to indicate what mode it's in.  The cache
could then select different dynamic characteristics for each mode.  Programs
which do not provide this information might have their different modes
recognized by correlating dynamic behavior with references to particular
areas in their instruction space (e.g. when the REFRESH_WINDOW subroutine
is running, don't bother to cache anything but the display list).

The nice thing about this application for neural networks is that nothing
croaks if the neural network occasionally makes a mistake.  If the net
mistakenly decides to cache one memory location in preference to another,
no big deal.  The system just runs a little slower until the net realizes
it has made a less than optimal decision, and corrects itself.  

ok@cs.mu.oz.au (Richard O'Keefe) (09/16/89)

In article <22151@cup.portal.com>, mmm@cup.portal.com (Mark Robert Thorson) writes:
> In these cases, an even smarter cache might make use of a neural network
> to recognize which program should get caching of these memory locations.

This is a joke, right?  I mean, neural nets typically require THOUSANDS
of training runs to learn even the simplest things.  In this case, each
training run corresponds to a complete execution of the program.  If you
don't believe that neural nets learn slowly, get a copy of PDP vol 3 and
play with the programs in the enclosed floppies.  ["slowly" in the sense
of requiring many trials.  Obviously those programs could be a lot faster.]

mmm@cup.portal.com (Mark Robert Thorson) (09/17/89)

ok@cs.mu.oz.au (Richard O'Keefe) says:

> This is a joke, right?  I mean, neural nets typically require THOUSANDS
> of training runs to learn even the simplest things.  In this case, each
> training run corresponds to a complete execution of the program.  If you
> don't believe that neural nets learn slowly, get a copy of PDP vol 3 and
> play with the programs in the enclosed floppies.  ["slowly" in the sense
> of requiring many trials.  Obviously those programs could be a lot faster.]

The speed of learning depends on the model being used.  Would you have
agreed with my statement if I had used the term "advanced statistical
techniques" rather than the emotion-laden "neural networks"?  It seems many
people are resistant to use of neural networks whenever they can see 
another way to solve the problem, much like the situation when micro-
programming was first introduced.  (Many people continued to prefer
hardwired implementations.)

Also, note that many programs provide the opportunity to gather statistics
over thousands of runs.  A display list interpreter for computer animation
might get called 60 times a second.  1000 runs would take less than 17 seconds.

I have received criticism by e-mail that cache prediction is a purely
static problem, to be solved entirely by compiler cleverness.  I'm not
sure I can agree with this.  The compiler can see the instruction space,
but it has no way to anticipate the kind of data which might be thrown
at it.  Nor can it anticipate how often the different modes of the program
might get exercised.

For example, if the program maintains some sort of tree data structure,
the order in which data arrives could affect how many times that tree
needs to be restructured.  This could in turn have consequences for the
cache.  Normally, you might want to cache the tree for speed, but under
conditions of frequent restructuring you might want the tree to be
uncached so that more important stuff won't be pushed out of the cache.
In this case, a statistical algorithm which incorporates a learning model
could anticipate what sort of data is going to come in, and make preparations
for the most efficient way to handle it.

As another example, a CAD program might have several computation-intensive
functions which are called in response to user actions.  Depending upon
which functions are frequently called, the effectiveness of caching different
parts of the memory space could vary widely.  If the user is doing something
which causes frequent screen updates, caching the display list might be
appropriate.  If the user is performing many simulation runs, caching the
description of the model might be the right thing to do.
In this case, a learning model would recognize what kind of work the user
is doing and adjust its predictions of his memory usage accordingly.

ok@cs.mu.oz.au (Richard O'Keefe) (09/18/89)

(Mark Robert Thorson) wrote suggesting that a neural net should watch
program behaviour in order to learn how to manage the cache for that
program.

I wrote
> This is a joke, right?  I mean, neural nets typically require THOUSANDS
> of training runs to learn even the simplest things.  In this case, each
> training run corresponds to a complete execution of the program.  ...
rather grossly understating my case.

In article <22211@cup.portal.com>, mmm@cup.portal.com (Mark Robert Thorson) writes:
> The speed of learning depends on the model being used.  Would you have
> agreed with my statement if I had used the term "advanced statistical
> techniques" rather than the emotion-laden "neural networks"?

H*** no!  If you had said and meant "utterly simple methods justified by
advanced statistical arguments" I might have agreed.  Since Thorson
proposed no model, I assumed models like the ones in the PDP book, which
I have tried.

Fact 1:  there is nothing that a neural net can do that a parallel boolean
         circuit of similar complexity cannot do.  (Don't forget that the
         complexity of the coefficients has to be accounted for.)  If you
         want to call parallel boolean circuits "neural nets", fine.
Fact 2:  a system which can represent information has to be at least as
         complex as the information it represents.
Fact 3:  the kind of information Thorson envisaged his hypothetical nets
         learning is very complex:  this variable ought to be cached NOW
         even though it shouldn't have been cached a few seconds ago.
Fact 4:  it follows from facts 2 and 3 that a neural net, parallel boolean
         circuit, advanced statistical technique, or thaumatron system
         for Thorson's caching scheme must be complex.
One is left wondering whether it might not be a better use of resources
to just spend the effort on a bigger cache.

[PS: "thaumatron" is a joke.
 It means an electronic device for performing miracles.]

hankd@pur-ee.UUCP (Hank Dietz) (09/19/89)

In article <22211@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes:
[...as to use of neural nets to dynamically adjust caching...]
>The speed of learning depends on the model being used.  Would you have
>agreed with my statement if I had used the term "advanced statistical
>techniques" rather than the emotion-laden "neural networks"? ....

No, I would not agree.  I do prefer calling this "statistical prediction" --
because that's what it is ;-).  Many optimizing-compiler writers have
considered improving the compiler's analysis by using stats collected by
trial executions, but it rarely helps.  Statistics based on runtime data are
notoriously unstable.

Remember that, given the (stochastic) reference pattern, compiler algorithms
do exist for determining optimal cache control...  see Chi's PhD thesis.
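
Just to give the flavor of compile-time cache control (this toy is NOT
Chi's algorithm; CACHE_LINES and the precomputed reuse distances are
assumptions), mark a reference cacheable only if the compiler predicts
the line will be reused before roughly a cache's worth of other lines
has gone by:

    #define CACHE_LINES 256    /* assumed cache capacity, in lines */

    /* reuse_dist[i] = number of distinct lines the compiler predicts will
     * be referenced between reference i and the next reference to the
     * same line; -1 if the line is never touched again. */
    void mark_cacheable(const long *reuse_dist, int n, int *cacheable)
    {
        int i;
        for (i = 0; i < n; i++)
            cacheable[i] = (reuse_dist[i] >= 0 && reuse_dist[i] < CACHE_LINES);
    }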

>Also, note that many programs provide the opportunity to gather statistics
>over thousands of runs.  A display list interpreter for computer animation
>might get called 60 times a second.  1000 runs would take less than 17 seconds.

An interesting example...  one doesn't usually recompile for each display list.
I'll get back to you on this one....

							-hankd@ecn.purdue.edu

mmm@cup.portal.com (Mark Robert Thorson) (09/19/89)

ok@cs.mu.oz.au (Richard O'Keefe) says:
 
> Fact 3:  the kind of information Thorson envisaged his hypothetical nets
>        learning is very complex:  this variable ought to be cached NOW
>        even though it shouldn't have been cached a few seconds ago.
 
His whole argument hinges on this point.  I suppose I must agree that at the
byte level this is true.  The storage of the synaptic weights and the
current level of stimulation will occupy more silicon area than the data
itself, so your money is better spent on more cache.

But for page-level cache control, I think the argument is still valid.
Here's a crude sketch of what I have in mind:

1)  Equip the CPU with a mechanism for measuring its own performance.
This might require the compiler to insert code which gives the program
a "heartbeat".  Also provide an indication of which mode a program is in;
the compiler could insert code to load an immediate into a register 
whenever a call to one of the major functional blocks of your program
occurs;  the mode would be reported in a register.

For each mode, the performance of the program during the previous run
would be available.  This also could be a register loaded during a call.

Current performance would be available in a register maintained by hardware.

2)  Extend the page table entries to support synaptic weights which
associate the caching effectiveness of a page with the mode a program is in.
For 16 modes, 16 bytes of synaptic weights would be enough.  (I could
argue that 16 nibbles would be enough.)

3)  When current performance is better than previous performance, uptick
the synaptic values for all cached pages.  When current performance is
worse, downtick the synaptic values.

4)  Equip the CPU with sufficient neurons for the pages in the working set.
The neuron model should support both spatial and temporal summation.
When a reference is made to an uncached page, it replaces a cached page
if there is a cached page for which the neuron output is below a threshold
value.  It may be necessary to inject a little random noise into the
neurons to make sure every page gets a chance to improve execution
performance.

Pages eventually will get high synaptic values for the modes in which
they contribute to performance, hence will be difficult to push out of
the cache.  Pages which compete with the winners will eventually get
low (or even negative) synaptic values, so will be vulnerable to being
pushed out.
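
To make steps 2) through 4) a little more concrete, here is a very rough
sketch in C.  (Every name and constant below is invented for illustration;
the real bookkeeping would live in the MMU and cache controller, not in
software, and saturation of the weights is omitted.)

    #include <stdlib.h>

    #define MODES      16
    #define WSET_PAGES 64     /* pages tracked in the working set */
    #define THRESHOLD  0

    struct page_entry {
        unsigned long page;              /* virtual page number, 0 if unused */
        signed char   weight[MODES];     /* 16 bytes of per-mode "synaptic" values */
    };

    static struct page_entry wset[WSET_PAGES];

    /* Step 3: after each sample period, reward or punish every cached page
     * according to whether the heartbeat rate improved over the last period. */
    void update_weights(int mode, long perf_now, long perf_prev)
    {
        int delta = (perf_now > perf_prev) ? 1 : -1;
        int i;

        for (i = 0; i < WSET_PAGES; i++)
            if (wset[i].page != 0)
                wset[i].weight[mode] += delta;
    }

    /* Step 4: on a reference to an uncached page, pick a victim among the
     * cached pages whose (slightly noisy) output for the current mode falls
     * below the threshold; -1 means "don't cache the new page at all". */
    int choose_victim(int mode)
    {
        int victim = -1, lowest = THRESHOLD;
        int i;

        for (i = 0; i < WSET_PAGES; i++) {
            int out = wset[i].weight[mode] + (rand() % 3 - 1);  /* a little noise */
            if (wset[i].page != 0 && out < lowest) {
                lowest = out;
                victim = i;
            }
        }
        return victim;
    }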

rang@cs.wisc.edu (Anton Rang) (09/19/89)

About the whole idea of improving cache prediction with different
techniques...just make sure that it doesn't slow down the cache on the
average.  It's better to use a simple, fast algorithm than a more
impressive but slower one.  (Yes, I know everybody probably realizes
this already.)

About neural nets or anything fancy...suppose that you have a cache
which is 100 times faster than main memory, and only a 90% hit rate.
You come up with a magic algorithm which will give you a 100% hit
rate.  You have to implement it to be no more than 10.9 times slower
or the overall system speed will be worse.  Real caches may be less than
100 times faster than main memory, and typically do better than a 90% hit
rate, so the real constraint is even tighter.  I doubt a reasonable cache
learning
algorithm can be implemented which runs on a 5-10 ns cycle, say...at
least with current technology.  (Then again I could be wrong :-)....
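
For what it's worth, the 10.9 figure falls out of a one-line calculation;
the little function below just restates it:

    /* With a cache access costing 1 time unit and main memory 100:
     *
     *   plain cache:  0.90 * 1 + 0.10 * 100 = 10.9 time units per access
     *   magic cache:  1.00 * T               (it always hits)
     *
     * so the magic cache only breaks even if T <= 10.9, i.e. each of its
     * accesses may be at most 10.9 times slower than a plain cache hit. */
    double break_even_factor(double hit_rate, double miss_penalty)
    {
        return hit_rate * 1.0 + (1.0 - hit_rate) * miss_penalty;
    }
    /* break_even_factor(0.90, 100.0) == 10.9 */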
   
+----------------------------------+------------------+
| Anton Rang (grad student)        | rang@cs.wisc.edu |
| University of Wisconsin--Madison |                  |
+----------------------------------+------------------+

"You are in a twisty little maze of Unix versions, all different."

aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (09/19/89)

[Hank Dietz writes:]
>Many optimizing-compiler writers have considered improving the
>compiler's analysis by using stats collected by trial executions, but
>it rarely helps.  Statistics based on runtime data are notoriously
>unstable.

Whoa!  The group that I am peripherally associated with at the
University of Illinois has been feeding runtime statistics back into
compiler optimization for a while (see Hwu and Chang, or Hwu, Conte, and
Chang's papers in the most recent Int'l Symp on Computer Architecture)
and getting good results.  I think it depends on what questions you
expect to answer with the runtime stats, and what sort of "trials" 
they are collected from.
--
Andy "Krazy" Glew,  Motorola MCD,    	    	    aglew@urbana.mcd.mot.com
1101 E. University, Urbana, IL 61801, USA.          {uunet!,}uiucuxc!udc!aglew
   
My opinions are my own; I indicate my company only so that the reader
may account for any possible bias I may have towards our products.