jonathan@cs.pitt.edu (Jonathan Eunice) (03/17/91)
Two of the tweaks of the forthcoming "Snake" (HP-PA 1.1) systems from HP are: 1) cache pre-load instructions (the compiler inserts these into the instr stream, and hopefully, the appropriate cache line will be available by the time it's needed, avoiding delays and speeding up single-task execution) 2) cache no-load hints as a part of store instructions (useful to avoid useless cache loading for initialization statements, for faster program startup, and perhaps in other situations too) How effective are these optimizations likely to be? (While they aren't going to give the same kind of speedup as making the system super-scalar or super-pipelined, they strike me as effective tweaks.) Does anyone else have them? I seem to recall a posting to the effect that the RS/6000 POWER architecture does not. What about MIPS, SPARC, etc? Is this a me-too feature?
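Idea (1) can be sketched in modern terms: the compiler hoists a preload a fixed number of iterations ahead of the use, so the line is (hopefully) resident by the time the load executes. The sketch below uses GNU C's __builtin_prefetch as a stand-in for HP-PA's preload instruction, and the prefetch distance is purely illustrative, not an HP figure:

```c
#include <stddef.h>

/* Sketch of idea (1): issue a cache preload several iterations ahead
 * of the use.  __builtin_prefetch is a modern stand-in here, not the
 * HP-PA instruction; the distance would be tuned to miss latency. */
#define PREFETCH_DISTANCE 8   /* illustrative, not HP's number */

double sum_with_preload(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
#ifdef __GNUC__
        if (i + PREFETCH_DISTANCE < n)
            /* args: address, 0 = read, low temporal locality */
            __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 1);
#endif
        s += a[i];  /* by now a[i]'s line should already be resident */
    }
    return s;
}
```

The result is unchanged either way; the hint only affects timing, which is what makes it safe for a compiler to insert speculatively.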
gbyrd@mcnc.org (Gregory T. Byrd) (03/18/91)
In article <JONATHAN.91Mar17034438@speedy.cs.pitt.edu> jonathan@cs.pitt.edu (Jonathan Eunice) writes: >Two of the tweaks of the forthcoming "Snake" (HP-PA 1.1) systems from >HP are: > >1) cache pre-load instructions [...] > >2) cache no-load hints as a part of store instructions (useful to avoid >useless cache loading for initialization statements, for faster program >startup, and perhaps in other situations too) > >[...] I just want to make sure I understand what (2) means. Is this a way to implement a Store-Allocate-No-Fetch (SANF) policy for a given cache block, rather than the default behavior (fetch-on-store)? Would this work with shared data in a multiprocessor system? Does HP-PA say anything about cache coherence, or is that up to the system implementor? -- ...Greg Byrd MCNC/Digital Equipment Corp. gbyrd@mcnc.org P.O. Box 12889 (919)248-1439 Research Triangle Park, NC 27709
carters@ajpo.sei.cmu.edu (Scott Carter) (03/21/91)
In article <JONATHAN.91Mar17034438@speedy.cs.pitt.edu> jonathan@cs.pitt.edu (Jonathan Eunice) writes: >Two of the tweaks of the forthcoming "Snake" (HP-PA 1.1) systems from >HP are: > >1) cache pre-load instructions (the compiler inserts these into the >instr stream, and hopefully, the appropriate cache line will be available >by the time it's needed, avoiding delays and speeding up single-task >execution) > >2) cache no-load hints as a part of store instructions (useful to avoid >useless cache loading for initialization statements, for faster program >startup, and perhaps in other situations too) > >How effective are these optimizations likely to be? (While they aren't going >to give the same kind of speedup as making the system super-scalar or >super-pipelined, they strike me as effective tweaks.) > A military machine we're working on (whose name may Not be Mentioned) has some similar capabilities. BTW, load to R0 may well turn out to be a cache preload in some RISCs, depending on how the pipe control is implemented. Definitely an unsupported feature :} The technology we had to use had small caches and fairly slow memory, so minimizing miss penalty certainly counted, as did not knocking a line out with another line from which you only needed one word, and whose locality was poor. Both individual loads and pages can be marked as no-allocate (the data comes from the cache if there is a hit, but avoids the cache lockup [and replace] on a miss). The cache is physically addressed, so we can have allocating and non-allocating aliases of important data structures. This is mostly useful with large arrays which are sometimes addressed row-wise and sometimes column-wise. The performance gain on matrix-multiplication type operations (which we spend a lot of time doing) is fairly good versus just treating the off-stride access as uncached. The pipeline control to handle aborting a cache preload when a real miss comes along is fairly unpleasant. 
There's no store hinting because our cache is writethrough, non-allocating. Note that the application programmer has to insert compiler pragmas to make use of this (though the pipeline scheduler does have some heuristics about which loads to promote the most). Methinks fully heuristic compilers for this are still a research topic. See CARP at Purdue, for example. >Does anyone else have them? I seem to recall a posting to the effect that >the RS/6000 POWER architecture does not. What about MIPS, SPARC, etc? Is >this a me-too feature? I don't know of any other GP ISAs that have this. Scott Carter - McDonnell Douglas Electronic Systems Company carter%csvax.decnet@mdcgwy.mdc.com (preferred and faster) - or - carters@ajpo.sei.cmu.edu (714)-896-3097 The opinions expressed herein are solely those of the author, and are not necessarily those of McDonnell Douglas.
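The no-allocate load policy Carter describes can be modeled with a toy direct-mapped cache: a no-allocate miss returns the data without installing the line, so an off-stride (column-wise) scan cannot knock the row-wise working set out. All sizes below are invented for illustration, not taken from any real machine:

```c
/* Toy direct-mapped cache modeling no-allocate loads.  On a miss with
 * allocate == 0, the data bypasses the cache and no line is replaced. */
#define LINE_WORDS 4
#define NLINES 4

static long tag[NLINES];

void cache_reset(void)
{
    for (int i = 0; i < NLINES; i++)
        tag[i] = -1;  /* empty */
}

/* returns 1 on hit, 0 on miss; installs the line only if allocate */
int cache_access(long word_addr, int allocate)
{
    long line_tag = word_addr / LINE_WORDS;
    int idx = (int)(line_tag % NLINES);
    if (tag[idx] == line_tag)
        return 1;
    if (allocate)
        tag[idx] = line_tag;  /* replacement: the off-stride hazard */
    return 0;
}
```

Warming words 0..15 and then touching word 100 with allocate=0 leaves the working set intact; the same touch with allocate=1 evicts the line that word 5 lives in, which is exactly the "knocking a line out ... from which you only needed one word" cost.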
preston@ariel.rice.edu (Preston Briggs) (03/22/91)
jonathan@cs.pitt.edu (Jonathan Eunice) writes: >>Two of the tweaks of the forthcoming "Snake" (HP-PA 1.1) systems from >> >>1) cache pre-load instructions (the compiler inserts these into the >>instr stream, and hopefully, the appropriate cache line will be available >>by the time it's needed, avoiding delays and speeding up single-task >>execution) >> >>2) cache no-load hints as a part of store instructions (useful to avoid >>useless cache loading for initialization statements, for faster program >>startup, and perhaps in other situations too) At the upcoming ASPLOS, there's a paper called "Software Prefetching", by Callahan, Kennedy, and Porterfield, describing compiler mechanisms to take advantage of cache pre-fetch instructions (1 above). They seem very effective for scientific code. The RS/6000 includes 2 interesting possibilities. An instruction that zeroes a line in the data cache (without fetching it). May be used like (2 above); additionally handy for zeroing big chunks of memory. They also include an "invalidate line" instruction which says: "don't bother writing this one back to memory." >>How effective are these optimizations likely to be? (While they aren't going >>to give the same kind of speedup as making the system super-scalar or >>super-pipelined, they strike me as effective tweaks.) This sort of thing can be very important. One of the basic problems of the i860 (for an example) is its low off-chip memory bandwidth, at least in relation to its FP performance. Instruction-level parallelism (pipelines, wide instructions, superscalar, speculative execution) is ok for getting the FP performance up, but the processor will starve without lots of bandwidth. Preston Briggs
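The "zero a line without fetching it" instruction is handy for bulk zeroing precisely because only fully covered, line-aligned blocks may be zeroed that way; head and tail bytes still need ordinary stores. A sketch of the splitting, with memset standing in for the hypothetical zero_line primitive:

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define LINE 128  /* RS/6000 line size mentioned in this thread */

/* Stand-in for a "zero cache line without fetching" instruction. */
static void zero_line(void *p) { memset(p, 0, LINE); }

/* Zero [p, p+n): head and tail bytes by ordinary stores, every fully
 * covered line-aligned block by the (hypothetical) line primitive. */
void bulk_zero(char *p, size_t n)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t first = (a + LINE - 1) & ~(uintptr_t)(LINE - 1);
    uintptr_t last  = (a + n) & ~(uintptr_t)(LINE - 1);
    if (first >= last) {          /* region covers no full line */
        memset(p, 0, n);
        return;
    }
    memset(p, 0, first - a);                   /* unaligned head */
    for (uintptr_t q = first; q < last; q += LINE)
        zero_line((void *)q);                  /* full lines */
    memset((void *)last, 0, a + n - last);     /* unaligned tail */
}
```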
maf@hpfcso.FC.HP.COM (Mark Forsyth) (03/22/91)
>From: jonathan@cs.pitt.edu (Jonathan Eunice) >Message-ID: <JONATHAN.91Mar17034438@speedy.cs.pitt.edu> >Two of the tweaks of the forthcoming "Snake" (HP-PA 1.1) systems from These were presented as extensions to the PA-RISC architecture, which is used in several product lines, NOT features of any particular products. >HP are: > >1) cache pre-load instructions (the compiler inserts these into the > >2) cache no-load hints as a part of store instructions (useful to avoid > >How effective are these optimizations likely to be? Extremely effective at eliminating cache-miss bottlenecks in certain intended cases. 1) is employed for applications which access very large uniform data sets and perform a fair amount of manipulation or calculations on individual data items (allowing enough time to prefetch the next group of operands). In some cases, cache miss penalties can be COMPLETELY eliminated from the performance equation. 2) is intended to be used primarily by the OS for page initialization, block moves, etc. >(While they aren't going >to give the same kind of speedup as making the system super-scalar or >super-pipelined, they strike me as effective tweaks.) Comparing these features to pipeline implementations is apples to oranges. Cache hints address classes of applications which are dominated by memory system performance, whereas superscalar pipelines improve primarily certain floating point applications dominated by the pipeline CPI. In applications dominated by cache misses these give far bigger performance improvements than a superscalar pipeline would. > >Does anyone else have them? I seem to recall a posting to the effect that >the RS/6000 POWER architecture does not. What about MIPS, SPARC, etc? Is >this a me-too feature? The features were defined as a result of extensive analysis of bottlenecks in important customer applications, not imitation. I'm not aware of any similar features in other architectures.
--- Mark Forsyth Hewlett-Packard maf@hpesmaf.fc.hp.com Engineering Systems Laboratory Fort Collins, Colorado
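Forsyth's condition of "allowing enough time to prefetch the next group of operands" comes down to arithmetic: the preload must be issued at least miss-latency/loop-body-time iterations ahead of the use. A back-of-the-envelope helper (all cycle counts illustrative, nothing here is an HP figure):

```c
/* How many iterations ahead a preload must be issued so that the
 * full miss latency is covered by loop work (ceiling division). */
unsigned prefetch_distance(unsigned miss_latency_cycles,
                           unsigned cycles_per_iteration)
{
    return (miss_latency_cycles + cycles_per_iteration - 1)
           / cycles_per_iteration;
}
```

With a 50-cycle miss and 10 cycles of work per item, preloading 5 iterations ahead fully hides the miss; with only 7 cycles of work per item the distance grows to 8, which is why the technique pays off best when there is "a fair amount of manipulation" per data item.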
brandis@inf.ethz.ch (Marc Brandis) (03/22/91)
In article <1991Mar21.161044.2898@rice.edu> preston@ariel.rice.edu (Preston Briggs) writes: >The RS/6000 includes 2 interesting possibilities. >An instruction that zeroes a line in the data cache (without >fetching it). May be used like (2 above); additionally handy for zeroing >big chunks of memory. They also include an "invalidate line" >instruction which says: "don't bother writing this one back to memory." > Unfortunately, IBM made these instructions privileged. They had some good reasons to do it, as the instructions ignore lock and protection bits. I do not know the reasons why they could not make them check the bits, however. I am not sure whether having these instructions in user mode would be a great advantage. DCLSZ (data cache line set zero) can be used to initialize large chunks of memory, of course. The other obvious target for the DCLSZ and CLI (cache line invalidate) instructions is to control the allocation and deallocation of procedure frames on the stack so that no memory references are generated for newly allocated stack space and that no deallocated stack space will be written back to memory. I do not think that this mechanism would really improve the performance of current programs. Many programs consume only a few kilobytes of stack space and exhibit a large amount of spatial locality on their references. The number of frames on the stack is almost constant over large fractions of many programs and so is the top of the stack. From this standpoint, it is very unlikely that stack references cause cache misses, so that this 'optimization' would not reduce the number of cache misses at all. Now consider the cost of it. Considering the static overhead of a procedure frame on the RS/6000 (6 words header, at least 8 words for output parameters) and the typical number of saved registers (I assume 16 words) as well as some additional local stack space (I assume another 16 words), a frame is about 46 words or 184 bytes large.
The cache line size on the RS/6000 is 128 bytes, so you would need two additional instructions at each procedure entry and two additional instructions at each procedure exit (or three+three for the cost-reduced CPU in the models 320 and 520 with a 64 byte line size), adding some overhead to each procedure call. While the overhead is not large, it may well eat up the benefits that we are getting from the scheme. Note that in order to make the same program run on machines with different cache line sizes, some additional overhead to parametrize the entry and exit code would have to be paid. Marc-Michael Brandis Computer Systems Laboratory, ETH-Zentrum (Swiss Federal Institute of Technology) CH-8092 Zurich, Switzerland email: brandis@inf.ethz.ch
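Brandis's cost estimate can be checked mechanically: 6 + 8 + 16 + 16 = 46 words = 184 bytes, which needs two 128-byte line operations but three 64-byte ones, hence two (or three) extra instructions at each procedure entry and again at each exit:

```c
/* Number of line-granular cache ops (DCLSZ on entry, CLI on exit)
 * needed to cover a procedure frame: ceiling of bytes / line size. */
unsigned ops_per_frame(unsigned frame_bytes, unsigned line_bytes)
{
    return (frame_bytes + line_bytes - 1) / line_bytes;
}
```

Note this counts only fully or partially touched lines, the optimistic case; a frame not aligned to a line boundary can straddle one more line, making the overhead slightly worse than the estimate.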
oehler@arnor.UUCP (Rich Oehler) (03/22/91)
In article <JONATHAN.91Mar17034438@speedy.cs.pitt.edu> jonathan@cs.pitt.edu (Jonathan Eunice) writes: |> |> >Does anyone else have them? I seem to recall a posting to the effect that |> >the RS/6000 POWER architecture does not. What about MIPS, SPARC, etc? Is |> >this a me-too feature? |> The RISC System/6000 has cache control instructions, but not touch (to prefetch a line) nor set (to establish a line without fetching). The original 801 (circa 1975) had set and subsequent 801 designs had touch. -- Richard Oehler (oehler@ibm.com)
preston@ariel.rice.edu (Preston Briggs) (03/23/91)
I wrote: >>The RS/6000 includes 2 interesting possibilities. >>An instruction that zeroes a line in the data cache (without >>fetching it). May be used like (2 above); additionally handy for zeroing >>big chunks of memory. They also include an "invalidate line" >>instruction which says: "don't bother writing this one back to memory." and brandis@inf.ethz.ch (Marc Brandis) writes: >Unfortunately, IBM made these instructions privileged. They had some good >reasons to do it, as the instructions ignore lock and protection bits. I do >not know the reasons why they could not make them check the bits, however. > >I am not sure whether having these instructions in user mode would be a great >advantage. DCLSZ (data cache line set zero) can be used to initialize large >chunks of memory, of course. The other obvious target for the DCLSZ and CLI >(cache line invalidate) instructions is to control the allocation and >deallocation of procedure frames on the stack so that no memory references >are generated for newly allocated stack space and that no deallocated stack >space will be written back to memory. Implementing Fortran, I would have used them on large arrays. When you're doing one of the BLAS routines and the destination is merely overwritten, then we can save a lot of cache-misses by not fetching it. Similarly, when we're done with a temporary workspace, we may simply invalidate it. The difficulty is alignment. It seems difficult to ensure that nothing extraneous is accidentally zeroed when using long cache lines. Brandis also makes the point that the compiler would have to be parameterized to account properly for cache line length. True! Generally, compilers are written to the architecture, not the implementation; cache is usually part of the implementation. However, instruction schedulers are bending this idea already. Further, various cache blocking techniques (often used at the source level) bend it further. You have to work hard for performance.
Summarizing, I can't argue that the RS/6000's instructions are practical as they stand, and I don't have compiler techniques to use them yet. However, they (along with HP's cache instructions) are interesting ideas and probably worth some study. Preston Briggs
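Briggs's alignment difficulty can be quantified: on a purely overwritten destination, only lines fully inside the region may safely skip the fetch; partial head and tail lines must still be fetched normally. A helper counting the fetches a no-fetch store policy would avoid (line size is a parameter; nothing here is RS/6000-specific):

```c
/* Fetches avoided on an n-byte destination that is purely overwritten,
 * given its byte offset within a cache line: only fully covered lines
 * may skip the fetch; partial head/tail lines cannot (the alignment
 * hazard -- zeroing a whole line could clobber adjacent data). */
unsigned long fetches_avoided(unsigned long offset_in_line,
                              unsigned long n,
                              unsigned long line)
{
    unsigned long head = (line - offset_in_line % line) % line;
    if (n <= head)
        return 0;                 /* region covers no full line */
    return (n - head) / line;
}
```

An aligned 1024-byte destination with 128-byte lines skips all 8 fetches, but the same destination starting 64 bytes into a line skips only 7, and a small misaligned region may skip none, which is why long lines make the technique hard to apply safely.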
gbyrd@mcnc.org (Gregory T. Byrd) (03/23/91)
In article <765@ajpo.sei.cmu.edu> carter%csvax.decnet@mdcgwy.mdc.com writes: >Both individual loads and pages can be marked as no-allocate (the data >comes from the cache if there is a hit, but avoids the cache lockup [and >replace] on a miss). The cache is physically addressed, so we can have >allocating and non-allocating aliases of important data structures. Do you mean "virtually addressed," or am I missing something? If you really mean physically addressed, could you elaborate? ...Greg -- ...Greg Byrd MCNC/Digital Equipment Corp. gbyrd@mcnc.org P.O. Box 12889 (919)248-1439 Research Triangle Park, NC 27709
jesup@cbmvax.commodore.com (Randell Jesup) (03/23/91)
In article <JONATHAN.91Mar17034438@speedy.cs.pitt.edu> jonathan@cs.pitt.edu (Jonathan Eunice) writes: >Two of the tweaks of the forthcoming "Snake" (HP-PA 1.1) systems from >HP are: > >1) cache pre-load instructions (the compiler inserts these into the >instr stream, and hopefully, the appropriate cache line will be available >by the time it's needed, avoiding delays and speeding up single-task >execution) Can be fairly effective, especially on a machine with long latencies and therefore more NOPs to fill in various places. Certain algorithms can get big wins from this sort of thing. Another interesting tweak (designed, never implemented, for the RPM-40 external cache chip) was access-knowledgeable external cache. The cache would be told that accesses via a given register would normally have leaps of X words, and whenever an access was made via that register the cache would prefetch the line at <addr+X>. This can make certain types of algorithms (mainly array-based ones, like matrix multiplies) go much faster (instead of taking a miss on just about every memory access, they can approach a 100% hit rate for inner loops). X could also be negative, 0 (off), or 1 (i.e. sequential - if this is the end of a line, fetch the next one). The RPM-40 cache could enable or disable caching of stores based on what register they were stored via. It also had a conventional portion, writeback queue, and a context-stack cache. Items written to the stack cache would not be written to memory unless they were forced out by stack growth, or when otherwise unused main-memory cycles were available, or when instructed to by an instruction. Randell Jesup, former member of the RPM-40 backend-software/design team. -- Randell Jesup, Keeper of AmigaDos, Commodore Engineering. {uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com BIX: rjesup The compiler runs Like a swift-flowing river I wait in silence. (From "The Zen of Programming") ;-)
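The RPM-40-style stride hint can be mimicked with a toy model: after each access through the hinted register, the cache prefetches the line at addr+X, so a strided walk misses only on its first access instead of on nearly every one. The model below tracks just the last demand-fetched and last prefetched lines, which suffices for a single strided stream (sizes illustrative, not RPM-40's):

```c
#define LINE_WORDS 8

/* Misses incurred walking n word accesses with the given stride.
 * With prefetch_on, the cache fetches the line at addr+stride after
 * every access, modeling the "leaps of X words" register hint. */
unsigned stride_walk_misses(unsigned n, unsigned stride, int prefetch_on)
{
    long resident = -1;    /* line brought in by the last demand fetch */
    long prefetched = -1;  /* line brought in by the stride engine */
    unsigned misses = 0;
    long addr = 0;
    for (unsigned i = 0; i < n; i++) {
        long ln = addr / LINE_WORDS;
        if (ln != resident && ln != prefetched)
            misses++;
        resident = ln;
        if (prefetch_on)
            prefetched = (addr + stride) / LINE_WORDS;
        addr += stride;
    }
    return misses;
}
```

A stride-16 walk (one line per access) misses on every access without the hint but only on the first access with it, which is the "approach a 100% hit rate for inner loops" behavior Jesup describes.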
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (03/24/91)
In article <20054@cbmvax.commodore.com> jesup@cbmvax.commodore.com (Randell Jesup) writes: ::cache pre-load instructions (the compiler inserts these into the ::instr stream, and hopefully, the appropriate cache line will be available ::by the time it's needed, avoiding delays and speeding up single-task ::execution) >Can be fairly effective, especially on a machine with long latencies >and therefore more NOPs to fill in various places. Certain >algorithms can get big wins from this sort of thing. I agree. In particular, ensemble machines (e.g. hypercubes) could find some wins here, because it is their nature to have long-latency accesses. Also, highly parallel algorithms are the most likely to find a use for such features. (They have predictable access patterns, or they access cache-busting quantities of data, or both.) -- Don D.C.Lindsay .. temporarily at Carnegie Mellon Robotics
cet1@cl.cam.ac.uk (C.E. Thompson) (03/24/91)
In article <27671@neptune.inf.ethz.ch> brandis@inf.ethz.ch (Marc Brandis) writes: >In article <1991Mar21.161044.2898@rice.edu> preston@ariel.rice.edu (Preston Briggs) writes: >>The RS/6000 includes 2 interesting possibilities. >>An instruction that zeroes a line in the data cache (without >>fetching it). May be used like (2 above); additionally handy for zeroing >>big chunks of memory. They also include an "invalidate line" >>instruction which says: "don't bother writing this one back to memory." >> > >Unfortunately, IBM made these instructions privileged. They had some good >reasons to do it, as the instructions ignore lock and protection bits. I do >not know the reasons why they could not make them check the bits, however. > Simplicity, I suppose: CLF (cache line flush) and DCLST (data cache line store) don't check either: but they don't have to, because they don't alter the relative consistency of cache and main memory. Even if DCLZ and CLI did check protection, you have to allow for scenarios such as the following. User program touches a never-before-referenced page in a work segment; kernel allocates a real page frame and zeros it with DCLZ instructions (this must be the archetypal use of DCLZ in practice). Now the user program must *not* be allowed to use CLI on this page, or it could read the previous contents of the real page frame which the kernel was trying to hide from it. Chris Thompson JANET: cet1@uk.ac.cam.phx Internet: cet1%phx.cam.ac.uk@nsfnet-relay.ac.uk
lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (03/27/91)
In article <1991Mar24.151523.21921@cl.cam.ac.uk> cet1@cl.cam.ac.uk (C.E. Thompson) writes: :::An instruction that zeroes a line in the data cache (without :::fetching it). ::Unfortunately, IBM made these instructions privileged. :Even if DCLZ and CLI did check protection, you have to allow for scenarios :such as the following. User program touches a never-before-referenced page :in a work segment; kernel allocates a real page frame and zeros it with DCLZ :instructions (this must be the archetypal use of DCLZ in practice). Now the :user program must *not* be allowed to use CLI on this page, or it could read :the previous contents of the real page frame which the kernel was trying to :hide from it. I thought that read-uncached would check for valid cached data before going to memory? -- Don D.C.Lindsay .. temporarily at Carnegie Mellon Robotics
cet1@cl.cam.ac.uk (C.E. Thompson) (03/27/91)
In article <12487@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes: >In article <1991Mar24.151523.21921@cl.cam.ac.uk> > cet1@cl.cam.ac.uk (C.E. Thompson) writes: >:Even if DCLZ and CLI did check protection, you have to allow for scenarios >:such as the following. User program touches a never-before-referenced page >:in a work segment; kernel allocates a real page frame and zeros it with DCLZ >:instructions (this must be the archetypal use of DCLZ in practice). Now the >:user program must *not* be allowed to use CLI on this page, or it could read >:the previous contents of the real page frame which the kernel was trying to >:hide from it. > >I thought that read-uncached would check for valid cached data before >going to memory? This sub-thread was about the cache manipulation instructions on the RS/6000. It doesn't have a read-uncached mode (except at the hardware test level of "do this memory cycle"). CLI invalidates the cache line, even if dirty, without affecting main memory: it is privileged, and I was trying to explain one reason why it needs to be. Chris Thompson JANET: cet1@uk.ac.cam.phx Internet: cet1%phx.cam.ac.uk@nsfnet-relay.ac.uk