[comp.arch] cache pre-load/no-load instructions

jonathan@cs.pitt.edu (Jonathan Eunice) (03/17/91)

Two of the tweaks of the forthcoming "Snake" (HP-PA 1.1) systems from 
HP are:

1)  cache pre-load instructions (the compiler inserts these into the
instr stream, and hopefully, the appropriate cache line will be available
by the time it's needed, avoiding delays and speeding up single-task 
execution) 

2) cache no-load hints as a part of store instructions (useful to avoid
useless cache loading for initialization statements, for faster program
startup, and perhaps in other situations too)
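A loose modern sketch of (1): compilers today can emit something like GCC's
`__builtin_prefetch` intrinsic ahead of a use. This is only an analogue, not
the PA-RISC encoding, and the prefetch distance `PF_DIST` is a tuning
assumption, not an HP parameter:

```c
#include <stddef.h>

/* Sum an array, prefetching a fixed distance ahead so the cache
 * line is (hopefully) resident by the time it is loaded for real.
 * PF_DIST is an illustrative tuning constant. */
#define PF_DIST 16

double prefetch_sum(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
#ifdef __GNUC__
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0 /* read */, 1);
#endif
        s += a[i];
    }
    return s;
}
```

The prefetch is purely a hint: the result is identical with or without it,
only the miss latency changes.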

How effective are these optimizations likely to be?  (While they aren't going
to give the same kind of speedup as making the system super-scalar or 
super-pipelined, they strike me as effective tweaks.)  

Does anyone else have them?  I seem to recall a posting to the effect that
the RS/6000 POWER architecture does not.  What about MIPS, SPARC, etc?  Is
this a me-too feature?

gbyrd@mcnc.org (Gregory T. Byrd) (03/18/91)

In article <JONATHAN.91Mar17034438@speedy.cs.pitt.edu> jonathan@cs.pitt.edu (Jonathan Eunice) writes:
>Two of the tweaks of the forthcoming "Snake" (HP-PA 1.1) systems from 
>HP are:
>
>1)  cache pre-load instructions [...]
>
>2) cache no-load hints as a part of store instructions (useful to avoid
>useless cache loading for initialization statements, for faster program
>startup, and perhaps in other situations too)
>
>[...]

I just want to make sure I understand what (2) means.

Is this a way to implement a Store-Allocate-No-Fetch (SANF)
policy for a given cache block, rather than the default behavior
(fetch-on-store)?

Would this work with shared data in a multiprocessor system?
Does HP-PA say anything about cache coherence, or is that up
to the system implementor?

...Greg Byrd             Digital Equipment Corp./MCNC
   gbyrd@mcnc.org        P.O. Box 12889
   (919)248-1439         Research Triangle Park, NC  27709

carters@ajpo.sei.cmu.edu (Scott Carter) (03/21/91)

In article <JONATHAN.91Mar17034438@speedy.cs.pitt.edu> jonathan@cs.pitt.edu (Jonathan Eunice) writes:
>Two of the tweaks of the forthcoming "Snake" (HP-PA 1.1) systems from 
>HP are:
>
>1)  cache pre-load instructions (the compiler inserts these into the
>instr stream, and hopefully, the appropriate cache line will be available
>by the time it's needed, avoiding delays and speeding up single-task 
>execution) 
>
>2) cache no-load hints as a part of store instructions (useful to avoid
>useless cache loading for initialization statements, for faster program
>startup, and perhaps in other situations too)
>
>How effective are these optimizations likely to be?  (While they aren't going
>to give the same kind of speedup as making the system super-scalar or 
>super-pipelined, they strike me as effective tweaks.)  
>

A military machine we're working on (whose name may Not be Mentioned) has some
similar capabilities.  BTW, load to R0 may well turn out to be a cache preload
in some RISCs, depending on how the pipe control is implemented.  Definitely
an unsupported feature :}

The technology we had to use had small caches and fairly slow memory, so
minimizing miss penalty certainly counted, as did not knocking a line out
with another line from which you only needed one word, and whose locality
was poor.

Both individual loads and pages can be marked as no-allocate (the data 
comes from the cache if there is a hit, but avoids the cache lockup [and 
replace] on a miss).  The cache is physically addressed, so we can have
allocating and non-allocating aliases of important data structures.  This is
mostly useful with large arrays which are sometimes addressed row-wise and
sometimes column-wise.  The performance gain on matrix-multiplication type
operations (which we spend a lot of time doing) is fairly good versus
just treating the off-stride access as uncached.
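The effect Carter describes can be sketched with a toy direct-mapped cache
model (the sizes and strides here are illustrative assumptions, not the
actual machine's): interleave a well-behaved unit-stride stream with a
large-stride column stream, and compare total misses when the column
accesses do and do not allocate.

```c
#include <stdint.h>
#include <stdbool.h>

#define LINES      64   /* direct-mapped, illustrative size */
#define LINE_WORDS 8

typedef struct {
    uint32_t tag[LINES];
    bool     valid[LINES];
    long     misses;
} Cache;

static void cache_init(Cache *c)
{
    for (int i = 0; i < LINES; i++)
        c->valid[i] = false;
    c->misses = 0;
}

/* One word access; on a miss, install the line only if `allocate`. */
static void cache_access(Cache *c, uint32_t addr, bool allocate)
{
    uint32_t line = addr / LINE_WORDS;
    uint32_t set  = line % LINES;
    if (c->valid[set] && c->tag[set] == line)
        return;                       /* hit */
    c->misses++;
    if (allocate) {
        c->valid[set] = true;
        c->tag[set]   = line;
    }
}

/* Row-wise stream (unit stride) interleaved with a poor-locality
 * column stream (stride 512 words).  Returns total misses. */
long run_interleaved(bool allocate_columns)
{
    Cache c;
    cache_init(&c);
    for (uint32_t i = 0; i < 4096; i++) {
        cache_access(&c, i, true);                 /* row-wise stream */
        cache_access(&c, 100000u + i * 512u,       /* column stream   */
                     allocate_columns);
    }
    return c.misses;
}
```

With these parameters the column stream misses either way, but when it
allocates it repeatedly evicts the row-stream line it conflicts with, so
the no-allocate run comes out with fewer total misses.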

The pipeline control to handle aborting a cache preload when a real miss comes
along is fairly unpleasant.

There's no store hinting because our cache is writethrough, non-allocating.

Note that the application programmer has to insert compiler pragmas to
make use of this (though the pipeline scheduler does have some heuristics
about which loads to promote the most).  Methinks fully heuristic compilers
for this are still a research topic.  See CARP at Purdue, for example.

>Does anyone else have them?  I seem to recall a posting to the effect that
>the RS/6000 POWER architecture does not.  What about MIPS, SPARC, etc?  Is
>this a me-too feature?

I don't know of any other GP ISAs that have this.  

Scott Carter - McDonnell Douglas Electronic Systems Company
carter%csvax.decnet@mdcgwy.mdc.com (preferred and faster) - or -
carters@ajpo.sei.cmu.edu		 (714)-896-3097
The opinions expressed herein are solely those of the author, and are not
necessarily those of McDonnell Douglas.

preston@ariel.rice.edu (Preston Briggs) (03/22/91)

jonathan@cs.pitt.edu (Jonathan Eunice) writes:
>>Two of the tweaks of the forthcoming "Snake" (HP-PA 1.1) systems from 
>>
>>1)  cache pre-load instructions (the compiler inserts these into the
>>instr stream, and hopefully, the appropriate cache line will be available
>>by the time it's needed, avoiding delays and speeding up single-task 
>>execution) 
>>
>>2) cache no-load hints as a part of store instructions (useful to avoid
>>useless cache loading for initialization statements, for faster program
>>startup, and perhaps in other situations too)

At the upcoming ASPLOS, there's a paper called "Software Prefetching",
by Callahan, Kennedy, and Porterfield, describing compiler mechanisms
to take advantage of cache pre-fetch instructions (1 above).  They seem very
effective for scientific code.

The RS/6000 includes 2 interesting possibilities.
An instruction that zeroes a line in the data cache (without
fetching it).  May be used like (2 above); additionally handy for zeroing
big chunks of memory.  They also include an "invalidate line"
instruction which says: "don't bother writing this one back to memory."

>>How effective are these optimizations likely to be?  (While they aren't going
>>to give the same kind of speedup as making the system super-scalar or 
>>super-pipelined, they strike me as effective tweaks.)  

This sort of thing can be very important.  One of the basic problems
of the i860 (for example) is its low off-chip memory bandwidth,
at least in relation to its FP performance.  Instruction-level
parallelism (pipelines, wide instructions, superscalar, speculative execution)
is ok for getting the FP performance up, but the processor will starve
without lots of bandwidth.

Preston Briggs

maf@hpfcso.FC.HP.COM (Mark Forsyth) (03/22/91)

>From: jonathan@cs.pitt.edu (Jonathan Eunice)
>Message-ID: <JONATHAN.91Mar17034438@speedy.cs.pitt.edu>
>Two of the tweaks of the forthcoming "Snake" (HP-PA 1.1) systems from 

These were presented as extensions to the PA-RISC architecture, which is
used in several product lines, NOT features of any particular products. 

>HP are:
>
>1)  cache pre-load instructions (the compiler inserts these into the
>
>2) cache no-load hints as a part of store instructions (useful to avoid
>
>How effective are these optimizations likely to be?  

Extremely effective at eliminating cache-miss bottlenecks in certain
intended cases. 1) is employed for applications which access very large
uniform data sets and perform a fair amount of manipulation or calculations
on individual data items (allowing enough time to prefetch the next group
of operands). In some cases, cache miss penalties can be COMPLETELY
eliminated from the performance equation. 2) is intended to be used primarily
by the OS for page initialization, block moves, etc.  

>(While they aren't going
>to give the same kind of speedup as making the system super-scalar or 
>super-pipelined, they strike me as effective tweaks.)  

Comparing these features to pipeline implementations is apples to oranges.
Cache hints address classes of applications which are dominated by memory
system performance, whereas superscalar pipelines primarily improve
certain floating-point applications dominated by the pipeline CPI. In
applications dominated by cache misses, these hints give far bigger
performance improvements than a superscalar pipeline would.

>
>Does anyone else have them?  I seem to recall a posting to the effect that
>the RS/6000 POWER architecture does not.  What about MIPS, SPARC, etc?  Is
>this a me-too feature?

The features were defined as a result of extensive analysis of bottlenecks
in important customer applications, not imitation. I'm not aware of any
similar features in other architectures.

---
Mark Forsyth                        Hewlett-Packard
maf@hpesmaf.fc.hp.com               Engineering Systems Laboratory
                                    Fort Collins, Colorado

brandis@inf.ethz.ch (Marc Brandis) (03/22/91)

In article <1991Mar21.161044.2898@rice.edu> preston@ariel.rice.edu (Preston Briggs) writes:
>The RS/6000 includes 2 interesting possibilities.
>An instruction that zeroes a line in the data cache (without
>fetching it).  May be used like (2 above); additionally handy for zeroing
>big chunks of memory.  They also include an "invalidate line"
>instruction which says: "don't bother writing this one back to memory."
>

Unfortunately, IBM made these instructions privileged. They had some good
reasons to do it, as the instructions ignore lock and protection bits. I do
not know the reasons why they could not make them check the bits, however.

I am not sure whether having these instructions in user mode would be a great
advantage. DCLSZ (data cache line set zero) can be used to initialize large
chunks of memory, of course. The other obvious target for the DCLSZ and CLI
(cache line invalidate) instructions is to control the allocation and 
deallocation of procedure frames on the stack so that no memory references
are generated for newly allocated stack space and that no deallocated stack
space will be written back to memory. 

I do not think that this mechanism would really improve the performance of
current programs. Many programs consume only a few kilobytes of stack space
and exhibit a large amount of spatial locality on their references. The number
of frames on the stack is almost constant over large fractions of many programs
and so is the top of the stack. From this standpoint, it is very
unlikely that stack references cause cache misses, so this 'optimization'
would not reduce the number of cache misses at all.

Now consider the cost of it. Considering the static overhead of a procedure
frame on the RS/6000 (6 words header, at least 8 words for output parameters)
and the typical number of saved registers (I assume 16 words) as well as some
additional local stack space (I assume another 16 words), a frame is about
46 words or 184 bytes large. The cache line size on the RS/6000 is 128 bytes,
so you would need two additional instructions at each procedure entry and two
additional instructions at each procedure exit (or three+three for the cost
reduced CPU in the models 320 and 520 with a 64 byte line size), adding some
overhead to each procedure call. While the overhead is not large, it may well
eat up the benefits that we are getting from the scheme.
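Brandis's arithmetic checks out and can be stated mechanically (the frame
layout is his set of assumptions, not a fixed ABI figure):

```c
/* Brandis's assumed frame layout, in 4-byte words:
 * 6-word header + 8 words of output parameters
 * + 16 saved registers + 16 words of locals = 46 words. */
enum { FRAME_WORDS = 6 + 8 + 16 + 16, FRAME_BYTES = FRAME_WORDS * 4 };

/* Cache lines spanned by a frame that starts on a line boundary. */
unsigned lines_spanned(unsigned frame_bytes, unsigned line_bytes)
{
    return (frame_bytes + line_bytes - 1) / line_bytes;  /* ceiling */
}
```

A 184-byte frame spans two 128-byte lines but three 64-byte lines, matching
the two/three instruction counts in the paragraph above (and a misaligned
frame could straddle one line more in each case).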

Note that in order to make the same program run on machines with different
cache line sizes, some additional overhead to parametrize the entry and exit
code would have to be paid.


Marc-Michael Brandis
Computer Systems Laboratory, ETH-Zentrum (Swiss Federal Institute of Technology)
CH-8092 Zurich, Switzerland
email: brandis@inf.ethz.ch

oehler@arnor.UUCP (Rich Oehler) (03/22/91)

In article <JONATHAN.91Mar17034438@speedy.cs.pitt.edu> jonathan@cs.pitt.edu (Jonathan Eunice) writes:

|> 
|> >Does anyone else have them?  I seem to recall a posting to the effect that
|> >the RS/6000 POWER architecture does not.  What about MIPS, SPARC, etc?  Is
|> >this a me-too feature?
|>

The RISC System/6000 has cache control instructions, but not touch (to prefetch
a line) nor set (to establish a line without fetching).

The original 801 (circa 1975) had set and subsequent 801 designs had touch.
-- 
Richard Oehler		(oehler@ibm.com)

preston@ariel.rice.edu (Preston Briggs) (03/23/91)

I wrote:
>>The RS/6000 includes 2 interesting possibilities.
>>An instruction that zeroes a line in the data cache (without
>>fetching it).  May be used like (2 above); additionally handy for zeroing
>>big chunks of memory.  They also include an "invalidate line"
>>instruction which says: "don't bother writing this one back to memory."

and brandis@inf.ethz.ch (Marc Brandis) writes:

>Unfortunately, IBM made these instructions privileged. They had some good
>reasons to do it, as the instructions ignore lock and protection bits. I do
>not know the reasons why they could not make them check the bits, however.
>
>I am not sure whether having these instructions in user mode would be a great
>advantage. DCLSZ (data cache line set zero) can be used to initialize large
>chunks of memory, of course. The other obvious target for the DCLSZ and CLI
>(cache line invalidate) instructions is to control the allocation and 
>deallocation of procedure frames on the stack so that no memory references
>are generated for newly allocated stack space and that no deallocated stack
>space will be written back to memory. 

Implementing Fortran, I would have used them on large arrays.  When you're
doing one of the BLAS routines and the destination is merely overwritten,
then we can save a lot of cache-misses by not fetching it.  Similarly,
when we're done with a temporary workspace, we may simply invalidate it.
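The saving Briggs is after can be put in numbers with a toy model (the line
size and policy names are assumptions): a destination that is only
overwritten costs one memory read per line under fetch-on-store, and none
under a store-allocate-no-fetch hint.

```c
#include <stdint.h>
#include <stdbool.h>

#define LINE_WORDS 8

/* Memory-read traffic for overwriting `n_words` of a destination
 * that is not cached beforehand.  Under fetch-on-store, the first
 * store to each line fetches it from memory; under
 * store-allocate-no-fetch the line is established without a read. */
long overwrite_fetches(uint32_t n_words, bool fetch_on_store)
{
    long    fetches  = 0;
    int32_t cur_line = -1;
    for (uint32_t w = 0; w < n_words; w++) {
        int32_t line = (int32_t)(w / LINE_WORDS);
        if (line != cur_line) {        /* first store to this line */
            if (fetch_on_store)
                fetches++;             /* line read from memory */
            cur_line = line;           /* allocated either way */
        }
    }
    return fetches;
}
```

For a BLAS-style destination that is written end to end, every one of those
fetches is wasted work that the hint eliminates.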

The difficulty is alignment.  It seems difficult to ensure that nothing
extraneous is accidentally zeroed when using long cache lines.

Brandis also makes the point that the compiler would have to be parameterized
to account properly for cache line length.  True!  Generally, compilers
are written to the architecture, not the implementation; cache is usually
part of the implementation.  However, instruction schedulers are bending
this idea already.  Further, various cache blocking techniques (often
used at the source level) bend it further.  You have to work hard
for performance.

Summarizing, I can't argue that the RS/6000's instructions are practical
as they stand, and I don't have compiler techniques to use them yet.
However, they (along with HP's cache instructions) are interesting ideas
and probably worth some study.

Preston Briggs

gbyrd@mcnc.org (Gregory T. Byrd) (03/23/91)

In article <765@ajpo.sei.cmu.edu> carter%csvax.decnet@mdcgwy.mdc.com writes:
>Both individual loads and pages can be marked as no-allocate (the data 
>comes from the cache if there is a hit, but avoids the cache lockup [and 
>replace] on a miss).  The cache is physically addressed, so we can have
>allocating and non-allocating aliases of important data structures.

Do you mean "virtually addressed," or am I missing something?
If you really mean physically addressed, could you elaborate?

...Greg

-- 
...Greg Byrd             MCNC/Digital Equipment Corp.
   gbyrd@mcnc.org        P.O. Box 12889
   (919)248-1439         Research Triangle Park, NC  27709

jesup@cbmvax.commodore.com (Randell Jesup) (03/23/91)

In article <JONATHAN.91Mar17034438@speedy.cs.pitt.edu> jonathan@cs.pitt.edu (Jonathan Eunice) writes:
>Two of the tweaks of the forthcoming "Snake" (HP-PA 1.1) systems from 
>HP are:
>
>1)  cache pre-load instructions (the compiler inserts these into the
>instr stream, and hopefully, the appropriate cache line will be available
>by the time it's needed, avoiding delays and speeding up single-task 
>execution) 

	Can be fairly effective, especially on a machine with long latencies
and therefore more NOPs to fill in various places.  Certain algorithms can get
big wins from this sort of thing.

	Another interesting tweak (designed, never implemented, for the
RPM-40 external cache chip) was an access-knowledgeable external cache.  The cache
would be told that accesses via a given register would normally have leaps
of X words, and whenever an access was made via that register the cache would
prefetch the line at <addr+X>.  This can make certain types of algorithms
(mainly array-based ones, like matrix multiplies) go much faster (instead of
taking a miss on just about every memory access, they can approach a 100% hit
rate for inner loops).  X could also be negative, 0 (off), or 1 (i.e.
sequential - if this is the end of a line, fetch the next one).
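The register-stride scheme can be sketched with another toy direct-mapped
model (the sizes, and the assumption that a prefetch completes instantly,
are mine, not RPM-40 parameters): once the cache is told the stride X, it
installs the line at <addr+X> on every access, so a long strided sweep
misses only on its first element.

```c
#include <stdint.h>
#include <stdbool.h>

#define LINES      64
#define LINE_WORDS 8

typedef struct {
    uint32_t tag[LINES];
    bool     valid[LINES];
} Cache;

static void install(Cache *c, uint32_t addr)
{
    uint32_t line = addr / LINE_WORDS;
    uint32_t set  = line % LINES;
    c->valid[set] = true;
    c->tag[set]   = line;
}

static bool hit(const Cache *c, uint32_t addr)
{
    uint32_t line = addr / LINE_WORDS;
    uint32_t set  = line % LINES;
    return c->valid[set] && c->tag[set] == line;
}

/* Sweep n accesses spaced `stride` words apart.  With stride_hint
 * set, each access also installs the line at addr + stride,
 * modelling the prefetch as completing instantly (optimistic). */
long strided_misses(uint32_t n, uint32_t stride, bool stride_hint)
{
    Cache c = {0};
    long  misses = 0;
    for (uint32_t i = 0; i < n; i++) {
        uint32_t addr = i * stride;
        if (!hit(&c, addr)) {
            misses++;
            install(&c, addr);
        }
        if (stride_hint)
            install(&c, addr + stride); /* prefetch next element's line */
    }
    return misses;
}
```

With a stride larger than a line, the unhinted sweep misses on every single
access, while the hinted sweep approaches the 100% inner-loop hit rate
described above.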

	The RPM-40 cache could enable or disable caching of stores based on
what register they were stored via.  It also had a conventional portion,
writeback queue, and a context-stack cache.  Items written to the stack
cache would not be written to memory unless they were forced out by stack
growth, or when otherwise unused main-memory cycles were available, or when
instructed to by an instruction.

Randell Jesup, former member of the RPM-40 backend-software/design team.
-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com  BIX: rjesup  
The compiler runs
Like a swift-flowing river
I wait in silence.  (From "The Zen of Programming")  ;-)

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (03/24/91)

In article <20054@cbmvax.commodore.com> 
	jesup@cbmvax.commodore.com (Randell Jesup) writes:
::cache pre-load instructions (the compiler inserts these into the
::instr stream, and hopefully, the appropriate cache line will be available
::by the time it's needed, avoiding delays and speeding up single-task 
::execution) 
>Can be fairly effective, especially on a machine with long latencies
>and therefore more NOPs to fill in various places.  Certain
>algorithms can get big wins from this sort of thing.

I agree. In particular, ensemble machines (e.g. hypercubes) could
find some wins here, because it is their nature to have long-latency
accesses.  Also, highly parallel algorithms are the most likely to
find a use for such features.  (They have predictable access
patterns, or they access cache-busting quantities of data, or both.)
-- 
Don		D.C.Lindsay .. temporarily at Carnegie Mellon Robotics

cet1@cl.cam.ac.uk (C.E. Thompson) (03/24/91)

In article <27671@neptune.inf.ethz.ch> brandis@inf.ethz.ch (Marc Brandis) writes:
>In article <1991Mar21.161044.2898@rice.edu> preston@ariel.rice.edu (Preston Briggs) writes:
>>The RS/6000 includes 2 interesting possibilities.
>>An instruction that zeroes a line in the data cache (without
>>fetching it).  May be used like (2 above); additionally handy for zeroing
>>big chunks of memory.  They also include an "invalidate line"
>>instruction which says: "don't bother writing this one back to memory."
>>
>
>Unfortunately, IBM made these instructions privileged. They had some good
>reasons to do it, as the instructions ignore lock and protection bits. I do
>not know the reasons why they could not make them check the bits, however.
>

Simplicity, I suppose: CLF (cache line flush) and DCLST (data cache line
store) don't check either: but they don't have to, because they don't alter
the relative consistency of cache and main memory.

Even if DCLZ and CLI did check protection, you have to allow for scenarios
such as the following. User program touches a never-before-referenced page
in a work segment; kernel allocates a real page frame and zeros it with DCLZ
instructions (this must be the archetypal use of DCLZ in practice). Now the
user program must *not* be allowed to use CLI on this page, or it could read
the previous contents of the real page frame which the kernel was trying to
hide from it.

Chris Thompson
JANET:    cet1@uk.ac.cam.phx
Internet: cet1%phx.cam.ac.uk@nsfnet-relay.ac.uk

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (03/27/91)

In article <1991Mar24.151523.21921@cl.cam.ac.uk>
	cet1@cl.cam.ac.uk (C.E. Thompson) writes:
:::An instruction that zeroes a line in the data cache (without
:::fetching it).
::Unfortunately, IBM made these instructions privileged.
:Even if DCLZ and CLI did check protection, you have to allow for scenarios
:such as the following. User program touches a never-before-referenced page
:in a work segment; kernel allocates a real page frame and zeros it with DCLZ
:instructions (this must be the archetypal use of DCLZ in practice). Now the
:user program must *not* be allowed to use CLI on this page, or it could read
:the previous contents of the real page frame which the kernel was trying to
:hide from it.

I thought that read-uncached would check for valid cached data before
going to memory?  
-- 
Don		D.C.Lindsay .. temporarily at Carnegie Mellon Robotics

cet1@cl.cam.ac.uk (C.E. Thompson) (03/27/91)

In article <12487@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>In article <1991Mar24.151523.21921@cl.cam.ac.uk>
>	cet1@cl.cam.ac.uk (C.E. Thompson) writes:
>:Even if DCLZ and CLI did check protection, you have to allow for scenarios
>:such as the following. User program touches a never-before-referenced page
>:in a work segment; kernel allocates a real page frame and zeros it with DCLZ
>:instructions (this must be the archetypal use of DCLZ in practice). Now the
>:user program must *not* be allowed to use CLI on this page, or it could read
>:the previous contents of the real page frame which the kernel was trying to
>:hide from it.
>
>I thought that read-uncached would check for valid cached data before
>going to memory?  

This sub-thread was about the cache manipulation instructions on the RS/6000.
It doesn't have a read-uncached mode (except at the hardware test level
of "do this memory cycle"). CLI invalidates the cache line, even if dirty,
without affecting main memory: it is privileged, and I was trying to explain
one reason why it needs to be.

Chris Thompson
JANET:    cet1@uk.ac.cam.phx
Internet: cet1%phx.cam.ac.uk@nsfnet-relay.ac.uk