levisonm@qucis.queensu.CA (Mark Levison) (09/14/89)
As the caches start to get as smart as we have discussed they begin to look more and more similar to a peice of fast local memory or large register set. A good example of this can be seen with the Cray-2 which has a 2k (I think maybe more) of very fast local memory. The compiler in all of its wisdom makes decisions about what it wants to store and hopefully does a better job for vectors and other pathological cases than normal cache management does. The only problem with this scheme as I see it is that the program must know exactly how much local memory it has in advance (although that might also be a problem for smart compilers using smart caches) and so programs must be recompiled for machines with different size local memories. Does anyone else have any more semi-rational thoughts on this and where caches might go in the future? Mark Levison levisonm@qucis.queensu.ca Queen's University Kingston, Ont. Canada
les@unicads.UUCP (Les Milash) (09/14/89)
In article <224@qusunr.queensu.CA> levisonm@qucis.queensu.CA (Mark Levison) writes:
>[caching tricks are like compiler-directed use of registers]
>The only problem with this scheme
>[is that] programs might have to be recompiled [...]

My old friend Griff Hamlin said that at LANL they'd never keep the object
code, since you could compile, link, load, and start spewing results faster
than 9600 baud could keep up anyway.  (We're talking Crays.)
mmm@cup.portal.com (Mark Robert Thorson) (09/15/89)
A very smart cache controller might gather statistics on the usage of
particular memory locations, which could be used to decide which locations
to cache.  However, there are many possible cases in which two programs seem
to make the same use of some memory locations, but one program would be very
sensitive to the caching of those locations and the other would not.  In
these cases, an even smarter cache might make use of a neural network to
recognize which program should get caching of these memory locations.

The training signal for this network could come from some indicator of the
speed of execution of the program, for example the number of instructions
executed during a sample period.  Or, the compiler might insert code to
provide each program with a heartbeat visible to the cache.

The data gathered on caching effectiveness might be part of the module
loaded when the program is run.  And of course, it would be written back to
disk when the program terminates.  A smart compiler might prepare an initial
set of predicted dynamic characteristics for the module, and the neural
network could optimize these characteristics with data gathered from the
actual use of the program.

The caching effectiveness of particular memory locations might even change
on the fly.  For example, a program might have some distinct operating
modes, like restructuring a tree and refreshing a display.  This is like
having two separate programs knit into one.  A neural network which learned
and forgot quickly might make determinations like "Hey!  Caching address
0177566 doesn't work anymore.  Try caching 0000400!"

Another possibility is that a register would be updated by the program to
indicate what mode it's in.  The cache could then select different dynamic
characteristics for each mode.  Programs which do not provide this
information might have their different modes recognized by correlating
dynamic behavior with references to particular areas in their instruction
space (e.g. when the REFRESH_WINDOW subroutine is running, don't bother to
cache anything but the display list).

The nice thing about this application for neural networks is that nothing
croaks if the neural network occasionally makes a mistake.  If the net
mistakenly decides to cache one memory location in preference to another,
no big deal.  The system just runs a little slower until the net realizes
it has made a less than optimal decision, and corrects itself.
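[A minimal software sketch of that heartbeat-driven training signal, in C.
All names and the +/-1 reward rule are invented for illustration: per-line
scores are nudged up when the heartbeat (e.g. instructions per sample
period) improves while the line is cached, nudged down when it worsens, and
the lowest-scoring cached line becomes the eviction candidate.]

```c
#include <assert.h>

#define NLINES 8   /* illustrative number of tracked cache lines */

typedef struct {
    int  score[NLINES];    /* per-line caching-effectiveness score */
    long prev_heartbeat;   /* performance indicator from last sample */
} cache_trainer;

void trainer_init(cache_trainer *t, long heartbeat)
{
    for (int i = 0; i < NLINES; i++)
        t->score[i] = 0;
    t->prev_heartbeat = heartbeat;
}

/* On each sample period: reward every currently-cached line if
 * performance improved since the last sample, punish otherwise. */
void trainer_sample(cache_trainer *t, const int cached[NLINES], long heartbeat)
{
    int delta = heartbeat > t->prev_heartbeat ? +1 : -1;
    for (int i = 0; i < NLINES; i++)
        if (cached[i])
            t->score[i] += delta;
    t->prev_heartbeat = heartbeat;
}

/* Victim selection: the cached line with the lowest score (-1 if none). */
int trainer_pick_victim(const cache_trainer *t, const int cached[NLINES])
{
    int victim = -1;
    for (int i = 0; i < NLINES; i++)
        if (cached[i] && (victim < 0 || t->score[i] < t->score[victim]))
            victim = i;
    return victim;
}
```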
ok@cs.mu.oz.au (Richard O'Keefe) (09/16/89)
In article <22151@cup.portal.com>, mmm@cup.portal.com (Mark Robert Thorson) writes:
> In these cases, an even smarter cache might make use of a neural network
> to recognize which program should get caching of these memory locations.

This is a joke, right?  I mean, neural nets typically require THOUSANDS of
training runs to learn even the simplest things.  In this case, each
training run corresponds to a complete execution of the program.  If you
don't believe that neural nets learn slowly, get a copy of PDP vol 3 and
play with the programs on the enclosed floppies.  ["Slowly" in the sense of
requiring many trials.  Obviously those programs could be a lot faster.]
mmm@cup.portal.com (Mark Robert Thorson) (09/17/89)
ok@cs.mu.oz.au (Richard O'Keefe) says:
> This is a joke, right?  I mean, neural nets typically require THOUSANDS
> of training runs to learn even the simplest things.  In this case, each
> training run corresponds to a complete execution of the program.  If you
> don't believe that neural nets learn slowly, get a copy of PDP vol 3 and
> play with the programs on the enclosed floppies.  ["slowly" in the sense
> of requiring many trials.  Obviously those programs could be a lot faster.]

The speed of learning depends on the model being used.  Would you have
agreed with my statement if I had used the term "advanced statistical
techniques" rather than the emotion-laden "neural networks"?  It seems many
people are resistant to the use of neural networks whenever they can see
another way to solve the problem, much like the situation when
microprogramming was first introduced.  (Many people continued to prefer
hardwired implementations.)

Also, note that many programs provide the opportunity to gather statistics
over thousands of runs.  A display list interpreter for computer animation
might get called 60 times a second; 1000 runs would take less than 17
seconds.

I have received criticism by e-mail that cache prediction is a purely static
problem, to be solved entirely by compiler cleverness.  I'm not sure I can
agree with this.  The compiler can see the instruction space, but it has no
way to anticipate the kind of data which might be thrown at the program.
Nor can it anticipate how often the different modes of the program might get
exercised.  For example, if the program maintains some sort of tree data
structure, the order in which data arrives could affect how many times that
tree needs to be restructured.  This could in turn have consequences for the
cache.  Normally, you might want to cache the tree for speed, but under
conditions of frequent restructuring you might want the tree to be uncached
so that more important stuff won't be pushed out of the cache.
In this case, a statistical algorithm which incorporates a learning model
could anticipate what sort of data is going to come in, and make
preparations for the most efficient way to handle it.

As another example, a CAD program might have several computation-intensive
functions which are called in response to user actions.  Depending upon
which functions are frequently called, the effectiveness of caching
different parts of the memory space could vary widely.  If the user is doing
something which causes frequent screen updates, caching the display list
might be appropriate.  If the user is performing many simulation runs,
caching the description of the model might be the right thing to do.  In
this case, a learning model would recognize what kind of work the user is
doing and adjust its predictions of his memory usage accordingly.
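[One crude way to cast the CAD example in code; the two-mode split and all
names are mine, purely illustrative.  Count calls into each functional block
over a window, and favor the working set -- display list or model
description -- of whichever mode currently dominates.]

```c
#include <assert.h>

enum { MODE_DISPLAY = 0, MODE_SIMULATE = 1, NMODES = 2 };  /* illustrative */

typedef struct { long calls[NMODES]; } mode_monitor;

/* Compiler-inserted stubs at each major functional block would call this. */
void monitor_note_call(mode_monitor *m, int mode) { m->calls[mode]++; }

/* The mode the user is currently exercising hardest; its working set is
 * the part of memory we would prefer to keep cached. */
int monitor_dominant(const mode_monitor *m)
{
    int best = 0;
    for (int i = 1; i < NMODES; i++)
        if (m->calls[i] > m->calls[best])
            best = i;
    return best;
}
```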
ok@cs.mu.oz.au (Richard O'Keefe) (09/18/89)
(Mark Robert Thorson) wrote suggesting that a neural net should watch
program behaviour in order to learn how to manage the cache for that
program.  I wrote
> This is a joke, right?  I mean, neural nets typically require THOUSANDS
> of training runs to learn even the simplest things.  In this case, each
> training run corresponds to a complete execution of the program.
... rather grossly understating my case.

In article <22211@cup.portal.com>, mmm@cup.portal.com (Mark Robert Thorson) writes:
> The speed of learning depends on the model being used.  Would you have
> agreed with my statement if I had used the term "advanced statistical
> techniques" rather than the emotion-laden "neural networks"?

H*** no!  If you had said and meant "utterly simple methods justified by
advanced statistical arguments" I might have agreed.  Since Thorson proposed
no model, I assumed models like the ones in the PDP book, which I have
tried.

Fact 1: there is nothing that a neural net can do that a parallel boolean
        circuit of similar complexity cannot do.  (Don't forget that the
        complexity of the coefficients has to be accounted for.)  If you
        want to call parallel boolean circuits "neural nets", fine.
Fact 2: a system which can represent information has to be at least as
        complex as the information it represents.
Fact 3: the kind of information Thorson envisaged his hypothetical nets
        learning is very complex: this variable ought to be cached NOW
        even though it shouldn't have been cached a few seconds ago.
Fact 4: it follows from facts 2 and 3 that a neural net, parallel boolean
        circuit, advanced statistical technique, or thaumatron system for
        Thorson's caching scheme must be complex.

One is left wondering whether it might not be a better use of resources to
just spend the effort on a bigger cache.

[PS: "thaumatron" is a joke.  It means an electronic device for performing
miracles.]
hankd@pur-ee.UUCP (Hank Dietz) (09/19/89)
In article <22211@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes:
[...as to use of neural nets to dynamically adjust caching...]
>The speed of learning depends on the model being used.  Would you have
>agreed with my statement if I had used the term "advanced statistical
>techniques" rather than the emotion-laden "neural networks"? ....

No, I would not agree.  I do prefer calling this "statistical prediction" --
because that's what it is ;-).  Many optimizing-compiler writers have
considered improving the compiler's analysis by using stats collected by
trial executions, but it rarely helps.  Statistics based on runtime data are
notoriously unstable.  Remember that, given the (stochastic) reference
pattern, compiler algorithms do exist for determining optimal cache
control... see Chi's PhD thesis.

>Also, note that many programs provide the opportunity to gather statistics
>over thousands of runs.  A display list interpreter for computer animation
>might get called 60 times a second.  1000 runs would take less than 17 seconds.

An interesting example... one doesn't usually recompile for each display
list.  I'll get back to you on this one....

-hankd@ecn.purdue.edu
mmm@cup.portal.com (Mark Robert Thorson) (09/19/89)
ok@cs.mu.oz.au (Richard O'Keefe) says:
> Fact 3: the kind of information Thorson envisaged his hypothetical nets
>         learning is very complex: this variable ought to be cached NOW
>         even though it shouldn't have been cached a few seconds ago.

His whole argument hinges on this point.  I suppose I must agree that at the
byte level this is true.  The storage of the synaptic weights and the
current level of stimulation will occupy more silicon area than the data
itself, so your money is better spent on more cache.  But for page-level
cache control, I think the argument is still valid.  Here's a crude sketch
of what I have in mind:

1) Equip the CPU with a mechanism for measuring its own performance.  This
   might require the compiler to insert code which gives the program a
   "heartbeat".  Also provide an indication of which mode a program is in;
   the compiler could insert code to load an immediate into a register
   whenever a call to one of the major functional blocks of your program
   occurs, so the mode would be reported in a register.  For each mode, the
   performance of the program during the previous run would be available.
   This also could be a register loaded during a call.  Current performance
   would be available in a register maintained by hardware.

2) Extend the page table entries to support synaptic weights which
   associate the caching effectiveness of a page with the mode a program is
   in.  For 16 modes, 16 bytes of synaptic weights would be enough.  (I
   could argue that 16 nibbles would be enough.)

3) When current performance is better than previous performance, uptick the
   synaptic values for all cached pages.  When current performance is
   worse, downtick the synaptic values.

4) Equip the CPU with sufficient neurons for the pages in the working set.
   The neuron model should support both spatial and temporal summation.
   When a reference is made to an uncached page, it replaces a cached page
   if there is a cached page for which the neuron output is below a
   threshold value.

It may be necessary to inject a little random noise into the neurons to make
sure every page gets a chance to improve execution performance.  Pages
eventually will get high synaptic values for the modes in which they
contribute to performance, hence will be difficult to push out of the cache.
Pages which compete with the winners will eventually get low (or even
negative) synaptic values, so will be vulnerable to being pushed out.
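[The page-level scheme can be mocked up in software; this C sketch is one
reading of the proposal, with all names, sizes, and the +/-1 update rule
chosen for illustration: per-page, per-mode synaptic weights as in step 2,
the uptick/downtick of step 3, and a noisy threshold test for eviction as
in step 4.]

```c
#include <assert.h>
#include <stdlib.h>

#define NPAGES    16
#define NMODES    16
#define THRESHOLD 0

typedef struct {
    signed char weight[NPAGES][NMODES];  /* one byte per (page, mode) pair */
    int         cached[NPAGES];          /* 1 if the page is in the cache */
    long        prev_perf;               /* performance at the last sample */
} page_net;

/* Step 3: reward or punish all cached pages for the current mode,
 * according to whether the performance heartbeat went up or down. */
void net_feedback(page_net *n, int mode, long perf)
{
    int delta = perf > n->prev_perf ? +1 : -1;
    for (int p = 0; p < NPAGES; p++)
        if (n->cached[p])
            n->weight[p][mode] += delta;
    n->prev_perf = perf;
}

/* Step 4: on a miss, evict a cached page whose weight for the current
 * mode, plus a little random noise, falls below the threshold.  The noise
 * gives every page an occasional chance to prove itself.  Returns -1 if
 * no cached page qualifies. */
int net_pick_victim(page_net *n, int mode)
{
    for (int p = 0; p < NPAGES; p++) {
        int noise = rand() % 3 - 1;      /* -1, 0, or +1 */
        if (n->cached[p] && n->weight[p][mode] + noise < THRESHOLD)
            return p;
    }
    return -1;
}
```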
rang@cs.wisc.edu (Anton Rang) (09/19/89)
About the whole idea of improving cache prediction with different
techniques... just make sure that it doesn't slow down the cache on the
average.  It's better to use a simple, fast algorithm than a more impressive
but slower one.  (Yes, I know everybody probably realizes this already.)

About neural nets or anything fancy... suppose that you have a cache which
is 100 times faster than main memory, and only a 90% hit rate.  You come up
with a magic algorithm which will give you a 100% hit rate.  You have to
implement it to be no more than 10.9 times slower than the original cache,
or the overall system speed will be worse.  Real caches may be slower WRT
main memory, and probably can do better than 90% typically, so the problem
is probably worse.  I doubt a reasonable cache learning algorithm can be
implemented which runs on a 5-10 ns cycle, say... at least with current
technology.  (Then again I could be wrong :-)....

+----------------------------------+------------------+
| Anton Rang (grad student)        | rang@cs.wisc.edu |
| University of Wisconsin--Madison |                  |
+----------------------------------+------------------+
"You are in a twisty little maze of Unix versions, all different."
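[The arithmetic behind the 10.9 figure: with hits costing 1 cache cycle and
misses 100, a 90% hit rate averages 0.9*1 + 0.1*100 = 10.9 cycles per
access, so even a perfect predictor must keep its per-access cost under
10.9 cache cycles to win.  A one-line C helper, added just to show the
calculation:]

```c
#include <assert.h>
#include <math.h>

/* Expected cost per memory access for a cache with the given hit rate,
 * hit cost, and miss cost (all costs in the same time unit). */
double avg_access_cycles(double hit_rate, double hit_cost, double miss_cost)
{
    return hit_rate * hit_cost + (1.0 - hit_rate) * miss_cost;
}
```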
aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (09/19/89)
[Hank Dietz writes:]
>Many optimizing-compiler writers have considered improving the
>compiler's analysis by using stats collected by trial executions, but
>it rarely helps.  Statistics based on runtime data are notoriously
>unstable.

Whoa!  The group that I am peripherally associated with at the University of
Illinois has been using runtime feedback to compiler optimization for a
while (see Hwu and Chang, or Hwu, Conte, and Chang's papers in the most
recent Int'l Symp on Computer Architecture) and getting good results.  I
think it depends on what questions you expect to answer with the runtime
stats, and what sort of "trials" they are collected from.
--
Andy "Krazy" Glew, Motorola MCD, aglew@urbana.mcd.mot.com
1101 E. University, Urbana, IL 61801, USA.  {uunet!,}uiucuxc!udc!aglew
My opinions are my own; I indicate my company only so that the reader may
account for any possible bias I may have towards our products.