[comp.arch] How Caches Work

tbray@watsol.waterloo.edu (Tim Bray) (09/11/89)

In article <1082@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
+You may be running software that has a very low cache hit rate if you
+are doing CAD work or scientific calculations.  Take this little loop
+for example:
+
+      SUM = 0.0
+      DO 10 I = 1, 1000000
+	SUM = SUM + VEC(I)
+   10 CONTINUE
+
+A data cache is *no use at all* for this problem.  You will get a
+cache miss on every data access.  

Now hold on just a dag-blaggin' minute.  I'm a software weenie who's never
built a cache, but I thought I understood how they work.  If this is right
obviously I don't at all.  Somebody who knows should either debunk this or
explain what's really going on, because I'm probably not alone in my
ignorance.

I thought caches respected the principle of locality.  And this code has
really good locality.  In fact, I thought they were block- or page-based.  And 
when VEC(I) hits a page for the first time, it'll be cached, and then it'll 
keep hitting the cache (bar nasty context switches, etc.), until VEC(I) moves
off that page.  One cache miss per page; in the worst case, if SUM is DOUBLE 
and the page size is 512, you do 64 times as well as hitting main memory per 
loop iteration.  Nyet?
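
[For concreteness, here is that loop and the block arithmetic rendered as a
rough C sketch; the 8-byte doubles and 512-byte blocks (pages) are just the
assumptions from the paragraph above.]

    #include <stdio.h>

    #define N 1000000

    static double vec[N];

    int main(void)
    {
        double sum = 0.0;
        long   i;

        /* Sequential pass over vec: with 8-byte doubles and 512-byte
         * blocks, 64 elements share each block, so if the cache loads
         * a whole block on a miss, only about 1 access in 64 should go
         * all the way to main memory.
         */
        for (i = 0; i < N; i++)
            sum += vec[i];

        printf("sum = %g\n", sum);
        return 0;
    }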

+Similarly, copying data from one bit
+of memory to another will be limited by the raw memory speed.  

Say what?

Tim Bray, New OED Project, U of Waterloo, Ontario

rang@cs.wisc.edu (Anton Rang) (09/11/89)

In article <16306@watdragon.waterloo.edu> tbray@watsol.waterloo.edu (Tim Bray) writes:
>In article <1082@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
>+You may be running software that has a very low cache hit rate if you
>+are doing CAD work or scientific calculations.  Take this little loop
>+for example:
>+
>+      SUM = 0.0
>+      DO 10 I = 1, 1000000
>+	SUM = SUM + VEC(I)
>+   10 CONTINUE
>+
>+A data cache is *no use at all* for this problem.  You will get a
>+cache miss on every data access.  
>
>I thought caches respected the principle of locality.  And this code has
>really good locality.

  The instruction fetches here have good locality; the data fetches
don't.  The problem is that 1 million words (assuming 1 word per
vector element) of memory are being accessed.  No matter HOW your
cache is arranged, you are going to have to go to main memory for
every one of those words (conceivably barring a few already in the
cache when you start).
  In fact, a (data) cache can hurt in this kind of code, because you
have 1 million data fetch times, plus the overhead of the cache
(small, but significant for very large problems).

>  And when VEC(I) hits a page for the first time, it'll be cached, [ ... ]
                                                   ^^^^^^^^^^^^^^^
  The problem is that caching something is not without cost.  Loading
128 words into the cache from main memory is generally just as
expensive as fetching 128 words from main memory directly.  The cache
only helps if you're doing multiple fetches of the same memory
location before it's overwritten with other data.

>+Similarly, copying data from one bit
>+of memory to another will be limited by the raw memory speed.  
>
>Say what?

  A := B, where A and B are 1 MW arrays, will require at least 2
million memory cycle times (1 million reads, 1 million writes).
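
[A rough C rendering of that point; the names and the size are only for
illustration.  Every word of both arrays is touched exactly once, so there
is nothing for a cache to reuse.]

    #define N 1000000

    static double a[N], b[N];

    /* A := B -- one million reads of b plus one million writes of a.
     * Each location is touched exactly once, so every access has to
     * reach main memory no matter what the cache does.
     */
    void copy(void)
    {
        long i;

        for (i = 0; i < N; i++)
            a[i] = b[i];
    }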
   
+----------------------------------+------------------+
| Anton Rang (grad student)        | "VMS Forever!"   |
| University of Wisconsin--Madison | rang@cs.wisc.edu |
+----------------------------------+------------------+

elm@chilli.Berkeley.EDU (ethan miller) (09/11/89)

In article <RANG.89Sep10184900@derby.cs.wisc.edu> rang@cs.wisc.edu (Anton Rang) writes:
%  The problem is that caching something is not without cost.  Loading
%128 words into the cache from main memory is generally just as
%expensive as fetching 128 words from main memory directly.  The cache
%only helps if you're doing multiple fetches of the same memory
%location before it's overwritten with other data.

It is often faster to load 128 words sequentially than it is to have
to fetch those words from memory.  If you can use static column mode on
your RAMs, for example, you can get words from memory about twice as fast
as normal mode.  Unless you have smart hardware that knows successive
accesses refer to the same bank and treats the accesses as static column,
it'll be faster to read 128 words at once.  However, the cache overhead
might cut into that advantage.

ethan
=================================
ethan miller--cs grad student   elm@ginger.berkeley.edu
#include <std/disclaimer.h>     {...}!ucbvax!ginger!elm
"I like the Austrian way better." -- Dr. Henry Jones, Jr.

swarren@eugene.uucp (Steve Warren) (09/11/89)

In article <RANG.89Sep10184900@derby.cs.wisc.edu> rang@cs.wisc.edu (Anton Rang) writes:
>  The problem is that caching something is not without cost.  Loading
>128 words into the cache from main memory is generally just as
>expensive as fetching 128 words from main memory directly.  The cache
>only helps if you're doing multiple fetches of the same memory
>location before it's overwritten with other data.

Not if main memory is pipelined and interleaved.  In that case the
cache may be filled fairly rapidly since consecutive locations are
accessed.

--Steve
-------------------------------------------------------------------------
	  {uunet,sun}!convex!swarren; swarren@convex.COM

roy@phri.UUCP (Roy Smith) (09/11/89)

>       SUM = 0.0
>       DO 10 I = 1, 1000000
>       SUM = SUM + VEC(I)
> 10    CONTINUE
>
> A data cache is *no use at all* for this problem.  You will get a
> cache miss on every data access.  

	Well, not quite.  True, you're going to miss on all the VEC(I)
references, but you should hit on all the I and SUM references.  Rewrite
the loop as follows (and no, I don't care if I got the test-at-the-top or
test-at-the-bottom backwards)

10	IF (I .GT. 1000000) GOTO 20	# fetch I
	I = I + 1			# fetch I, store I
	SUM = SUM + VEC(I)		# fetch I, fetch SUM, store SUM
	GOTO 10
20	CONTINUE

and you can count 4 data fetches (as pseudo-commented above), all of which
should be cache hits.  This far outnumbers the 1 cache miss per loop due to
the main data array accesses.  Some FORTRAN compilers have even been known
to put constants in memory and refer to them indirectly; if that is true
here, you get another two cache hits for the 1 and the 1000000.  You could
argue, of course, that I and SUM might be in registers, but that's a whole
other argument.
-- 
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
{att,philabs,cmcl2,rutgers,hombre}!phri!roy -or- roy@alanine.phri.nyu.edu
"The connector is the network"

ttl@astroatc.UUCP (Tony Laundrie) (09/11/89)

In article <31224@ucbvax.BERKELEY.EDU> elm@chilli.Berkeley.EDU (ethan miller) writes:
>In article <RANG.89Sep10184900@derby.cs.wisc.edu> rang@cs.wisc.edu (Anton Rang) writes:
>%  The problem is that caching something is not without cost.  Loading
>%128 words into the cache from main memory is generally just as
>%expensive as fetching 128 words from main memory directly.  The cache
>
>It is often faster to load 128 words sequentially than it is to have
>to fetch those words from memory.  If you can use static column mode on
>

Loading 128 words into the cache from memory is much faster than loading them
one at a time because the path between cache and memory is usually many
words wide (like 16).

dik@cwi.nl (Dik T. Winter) (09/12/89)

In article <RANG.89Sep10184900@derby.cs.wisc.edu> rang@cs.wisc.edu (Anton Rang) writes:
 > In article <16306@watdragon.waterloo.edu> tbray@watsol.waterloo.edu (Tim Bray) writes:
 > >+      SUM = 0.0
 > >+      DO 10 I = 1, 1000000
 > >+	SUM = SUM + VEC(I)
 > >+   10 CONTINUE
 > >+
 > >+A data cache is *no use at all* for this problem.  You will get a
 > >+cache miss on every data access.  
 > >
 > >I thought caches respected the principle of locality.  And this code has
 > >really good locality.
 > 
 >   The instruction fetches here have good locality; the data fetches
 > don't.  The problem is that 1 million words (assuming 1 word per
 > vector element) of memory are being accessed.  No matter HOW your
 > cache is arranged, you are going to have to go to main memory for
 > every one of those words (conceivably barring a few already in the
 > cache when you start).
 >   In fact, a (data) cache can hurt in this kind of code, because you
 > have 1 million data fetch times, plus the overhead of the cache
 > (small, but significant for very large problems).
This is not entirely true.  There are systems where a cache acts more or
less like the paging of VM.  Take for instance a system where a cache
line consists of 32 bytes and the width of the bus between memory and
cache is 32 bytes (i.e. in memory a word is 256 (8*32) bits, but for the
processor it is still 16 or 32 bits).  In that case fetching a single
operand into the processor will cache a memory word (or a number of
processor words).  So, assuming the elements of array VEC are 4 bytes,
there will be a cache miss every eighth element, or a cache hit rate of
87.5%.  Changing the index in the loop to (33*I) mod 1000000 will reduce
the hit rate to 0%.  Because the bus between the cache and memory is wide
enough, no speed penalty is involved (except for cache overhead).
This is also true (although to a lesser degree) if the bus between memory
and cache is narrower but the cache is refilled in lines of multiple words,
because the cache refill generally goes much faster than an individual
memory access for each word would (typically something like one word per
cycle, or faster).
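
[A rough C sketch of the two access patterns just described, using the
32-byte lines and 4-byte elements assumed above and the array size from
the original loop.]

    #define N 1000000

    float vec[N];

    /* Unit stride: eight consecutive 4-byte elements share each
     * 32-byte line, so roughly one access in eight misses -- an
     * 87.5% hit rate.
     */
    float sum_sequential(void)
    {
        float sum = 0.0f;
        long  i;

        for (i = 0; i < N; i++)
            sum += vec[i];
        return sum;
    }

    /* Scattered index: consecutive accesses are 33 elements (132
     * bytes) apart, so each lands in a different line and essentially
     * every access misses.
     */
    float sum_scattered(void)
    {
        float sum = 0.0f;
        long  i;

        for (i = 0; i < N; i++)
            sum += vec[(33 * i) % N];
        return sum;
    }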
-- 
dik t. winter, cwi, amsterdam, nederland
INTERNET   : dik@cwi.nl
BITNET/EARN: dik@mcvax

roy@phri.UUCP (Roy Smith) (09/12/89)

	Here's a (possibly crazy) idea for cache design.  The current EUD
(Example Under Debate) shows that caches just don't work for sequential
access, but we knew that already.  You can argue about how doing very wide
memory fetches will preload the cache for sequential accesses, but that's
certainly a second order effect.

	What if you segmented the virtual memory space (Oh no!  Not
segmented address spaces again!  Shades of Intel!) so that the top bit was
a hint to the cache on probable access patterns?  Variables which were
expected to hit the cache a lot (SUM and I in the EUD) would be put in the
"normal" part of the address space.  Variables which were expected to be
sequential access and thus never hit (VEC in the EUD) would be put in the
other half of the address space.  The cache would know not to bother doing
a tag match on this kind of access.  The advantages would be faster access
time (a memory fetch should be faster than a cache miss followed by a memory
fetch), but more importantly it wouldn't cause bogus cache flushes.
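
[Purely as an illustration of the idea -- nothing below is a real
architecture; the hint bit, the addresses, and the cache geometry are all
made up -- here is a toy software model of such a hinted cache.]

    #include <stdio.h>

    #define HINT_BIT   0x80000000UL   /* made-up "don't cache me" bit */
    #define NLINES     1024           /* toy direct-mapped cache      */
    #define LINE_BYTES 32

    static unsigned long tag[NLINES];
    static int           valid[NLINES];
    static long hits, misses, bypasses;

    /* Model one data reference.  A hinted address skips the cache
     * entirely: no tag match, and nothing useful gets displaced.
     */
    static void reference(unsigned long addr)
    {
        unsigned long line, t;

        if (addr & HINT_BIT) {
            bypasses++;                      /* straight to memory    */
            return;
        }
        line = (addr / LINE_BYTES) % NLINES;
        t    = addr / (LINE_BYTES * NLINES);
        if (valid[line] && tag[line] == t)
            hits++;
        else {
            misses++;
            valid[line] = 1;
            tag[line]   = t;                 /* displace the old line */
        }
    }

    int main(void)
    {
        long i;

        /* The EUD: SUM and I live in the "normal" half of the address
         * space; VEC is marked sequential-access, not worth caching.
         * Drop HINT_BIT from the VEC reference to watch VEC start
         * displacing SUM and I.
         */
        for (i = 0; i < 1000000; i++) {
            reference((0x10000UL + (unsigned long)i * 8) | HINT_BIT); /* VEC(I) */
            reference(0x100UL);                                       /* SUM    */
            reference(0x108UL);                                       /* I      */
        }
        printf("hits %ld  misses %ld  bypasses %ld\n", hits, misses, bypasses);
        return 0;
    }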

	As with every crazy idea, you can expand on this in all sorts of
ways.  You might use more than one high order bit to provide lots of
different sorts of hints to the cache.  You might want to be able to turn
the hinting on and off, or even make it programable.  Of course, every bell
and whistle adds cost and complexity, and reduces the chance that you will
use the fancy features well, or at all.

	So what do you think?  Has this been done before?
-- 
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
{att,philabs,cmcl2,rutgers,hombre}!phri!roy -or- roy@alanine.phri.nyu.edu
"The connector is the network"

mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (09/12/89)

In message <3989@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:

>Here's a (possibly crazy) idea for cache design.  The current EUD
>(Example Under Debate) shows that caches just don't work for sequential
>access, but we knew that already.

	[ Roy describes a system for which only a portion of the
	  address space is run through the cache ]

>	So what do you think?  Has this been done before?

Along a similar line, the Convex machines cache accesses going to the
scalar unit only.  The load/store unit for the vector unit operates
directly from main memory to the vector registers.  This is pretty
close to what Roy describes, but it is not controlled by any tags,
merely by which functional unit executes the load instruction.

--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu
		   mccalpin@scri1.scri.fsu.edu
		   mccalpin@delocn.udel.edu

mat@uts.amdahl.com (Mike Taylor) (09/13/89)

In article <3985@phri.UUCP>, roy@phri.UUCP (Roy Smith) writes:
> >       SUM = 0.0
> >       DO 10 I = 1, 1000000
> >       SUM = SUM + VEC(I)
> > 10    CONTINUE
> >
> > A data cache is *no use at all* for this problem.  You will get a
> > cache miss on every data access.  
> 
> 	Well, not quite.  True, you're going to miss on all the VEC(I)
> references, but you should hit on all the I and SUM references.  Rewrite
> the loop as follows (and no, I don't care if I got the test-at-the-top or
> test-at-the-bottom backwards)
> 
> 10	IF (I .GT. 1000000) GOTO 20	# fetch I
> 	I = I + 1			# fetch I, store I
> 	SUM = SUM + VEC(I)		# fetch I, fetch SUM, store SUM
> 	GOTO 10
> 20	CONTINUE
> 
> and you can count 4 data fetches (as pseudo-commented above), all of which
> should be cache hits.  This far outnumbers the 1 cache miss per loop due to
> the main data array accesses.  Some FORTRAN compilers have even been known
> to put constants in memory and refer to them indirectly; if that is true
> here, you get another two cache hits for the 1 and the 1000000.  You could
> argue, of course, that I and SUM might be in registers, but that's a whole
> other argument.
All perfectly right, but the cache miss rate also depends on the cache
line size - how many words are fetched per miss - which might be (say)
128 bytes or 16 words.  In this case, you'd get a miss rate of 1/16 on
the VEC(I) references.  Also, some mainframe caches are smart enough to
look ahead and prefetch the next sequential line (usually only on I-fetch,
but sometimes on data as well).  This could result in an actual miss rate
of zero for the example.  Even without this, you have an operand miss rate
of 1/80, or .0125 (one miss per 16 VEC(I) fetches, spread over the five
operand fetches of each iteration).  Quite respectable.
-- 
Mike Taylor                               ...!{hplabs,amdcad,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's.  ]

hankd@pur-ee.UUCP (Hank Dietz) (09/13/89)

In article <3989@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>	Here's a (possibly crazy) idea for cache design.  The current EUD
...
>	What if you segmented the virtual memory space (Oh no!  Not
>segmented address spaces again!  Shades of Intel!) so that the top bit was
>a hint to the cache on probable access patterns?  Variables which were
>expected to hit the cache a lot (SUM and I in the EUD) would be put in the
>"normal" part of the address space.  Variables which were expected to be
>sequential access and thus never hit (VEC in the EUD) would be put in the
>other half of the address space.  The cache would know not to bother doing
>a tag match on this kind of access.  The advantages would be faster access
>time (a memory fetch should be faster than a cache miss followed by a memory
>fetch), but more importantly it wouldn't cause bogus cache flushes.

Look at Chi's PhD thesis (see my last posting).  The idea of using a bit to
control mapping is "older than dirt" and the idea of using it in sequential
machines to control cacheability is part of what Chi's thesis suggests.

The catch is that you don't divide the address space -- one simply has the
compiler tag individual REFERENCES with the "don't cache me" bit.  The reason
you just ignore the high bit for addressing rather than use it with creative
data layout is that the variable X might be cached along one program flow
path, but not cached along another flow path.  Chi also gives an O(n) typical
performance compiler algorithm for setting the "don't cache me" bits...  your
"sequential access" rule is a VERY crude approximation.

You get a BIG performance gain this way...  results also in Chi's thesis.

							-hankd@ecn.purdue.edu

PS: Chi and I have come to call this mechanism "Cache Bypass."  Parts of
    this work have appeared in a bunch of papers as well as in his PhD
    thesis...  I think the first was at HICSS 87; the most recent was at the
    SIGPLAN 89 conf. on programming lang. design & implementation.

retrac@titan.rice.edu (John Carter) (09/13/89)

In article <3989@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>
>	Here's a (possibly crazy) idea for cache design.  [Stuff deleted.]
>
>	What if you segmented the virtual memory space (Oh no!  Not
>segmented address spaces again!  Shades of Intel!) so that the top bit was
>a hint to the cache on probable access patterns?  Variables which were
>expected to hit the cache a lot (SUM and I in the EUD) would be put in the
>"normal" part of the address space.  Variables which were expected to be
>sequential access and thus never hit (VEC in the EUD) would be put in the
>other half of the address space.  The cache would know not to bother doing
>a tag match on this kind of access.  The advantages would be faster access
>time (a memory fetch should be faster than a cache miss followed by a memory
>fetch), but more importantly it wouldn't cause bogus cache flushes.
>
>	As with every crazy idea, you can expand on this in all sorts of
>ways.  You might use more than one high order bit to provide lots of
>different sorts of hints to the cache.  You might want to be able to turn
>the hinting on and off, or even make it programable.  Of course, every bell
>and whistle adds cost and complexity, and reduces the chance that you will
>use the fancy features well, or at all.
>
>	So what do you think?  Has this been done before?


I don't know of anybody else doing exactly this, but I/we have been
examining something similar here at Rice.  Our motivation for examining what
I call adaptive caching schemes (adaptive because the caching mechanism
adapts to the way the shared data is being accessed) is in the design of a
`highly efficient' distributed shared memory system (I put highly efficient
in quotes because we're just starting to move out from paper design, and
while efficiency is a major goal, I don't know how much we'll succeed until
we actually get some hard numbers).  Implementing distributed shared memory
is quite similar to implementing a multiprocessor cache (ref. work by Kai Li
among others), but has a few important differences, including much longer
latency to `main memory' and lower bus bandwidth (which may or may not
support efficient broadcast, depending on your target distributed memory
multiprocessor).  The high cost of `page faults' in such a system motivated
us to study ways to significantly reduce the number of `page faults' by
taking advantage of semantic information either derived automatically by the
compiler or specified by the user.  Since distributed shared memory systems
are essentially software-implemented caches, you can implement a more
complicated caching mechanism than can be reasonably implemented in
hardware.

I don't have the time right now to go into details (both Sigmetrics and
PPoPP have immediate deadlines that we're shooting for), but our basic idea
is that you can characterize different common `types' of shared memory
objects (data items), provide multiple cache coherence mechanisms designed
to efficiently support these different `types', and have each shared object
be handled with a caching mechanism tuned to its needs.  The current thrust
of the work has been two-fold: characterizing the way shared memory is
accessed (somewhat related to earlier work by Weber & Gupta, Eggers & Katz,
and Agarwal et al.) and developing the basic design of a distributed shared
memory system that can take advantage of this knowledge (Munin -- if you
want to know the etymology of the name, you'll have to ask :-).  If anybody
is interested, we should have a few tech reports available soon (next week).
One will describe our work on characterizing shared memory access patterns
and the other will discuss the preliminary design of Munin.

Hope you found this interesting.

>Roy Smith, Public Health Research Institute

John Carter               Internet: retrac@rice.edu
Dept of Computer Science  UUCP:     {internet node or backbone}!rice!retrac
Rice University
Houston, TX      "Badgers?!  We don't need no stinking badgers!" - from UHF

mash@mips.COM (John Mashey) (09/13/89)

In article <1174@brazos.Rice.edu> retrac@titan.rice.edu (John Carter) writes:
>In article <3989@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>>	Here's a (possibly crazy) idea for cache design.  [Stuff deleted.]
>>	What if you segmented the virtual memory space (Oh no!  Not
>>segmented address spaces again!  Shades of Intel!) so that the top bit was
>>a hint to the cache on probable access patterns?  Variables which were
>>expected to hit the cache a lot (SUM and I in the EUD) would be put in the
>>"normal" part of the address space......
>>	So what do you think?  Has this been done before?

This is mostly a software issue.  Most microprocessors designed
with caching in mind from day one, and many implementations of earlier
architectures, include MMUs that offer at least a cache/nocache bit
per MMU map entry.  Any of the high-performance RISCs, and most of the
CISC chips, do this.  In any of the MMU designs that permit double-mapping,
the same physical page can be double-mapped with 2 MMU entries, one
cached, the other uncached, so if the software can figure out what to do,
and keep reasonable consistency, all might be well.

MIPS R3000s use one of the address bits to distinguish between
cached/nocached access for the "unmapped" part of the kernel address
space, and actually uses this in various ways.  For example, you can
call a function, cached, in the usual way.  If you call it by jumping to
the same address | 0x20000000, when you get there, you're running
uncached.....  this feature is actually used, by the way.
RISC/OS also takes advantage of uncached address space in those
configurations (i.e., most R2000-based machines) where it is faster
to read disk blocks into buffers and copy from them uncached,
thus avoiding trashing the cache unnecessarily.  (This is not so
clear-cut on R3000s that do multi-word refill; another strategy might
be better.)
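
[A sketch of what that address-bit trick looks like from kernel code.  The
macro name is invented for the example, and this only makes sense in the
unmapped kernel segments of an R3000-style machine.]

    /* OR-ing in 0x20000000 turns a cached (kseg0) kernel address into
     * its uncached (kseg1) alias for the same physical memory.
     */
    #define CACHED_TO_UNCACHED(p) \
        ((const char *)((unsigned long)(p) | 0x20000000UL))

    void copy_from_disk_buffer(char *dst, const char *buf_cached, int n)
    {
        const char *src = CACHED_TO_UNCACHED(buf_cached);
        int i;

        for (i = 0; i < n; i++)
            dst[i] = src[i];   /* read the buffer uncached so the copy
                                  doesn't trash the data cache         */
    }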

However, to my knowledge, we do not make use of this for ordinary user
programs of the form implied by the original question; rather it is
restricted to manual choice where people know a priori that this is
what they want.

Understanding the usability of software-controlled caching is
a good area for research over the next few years; the answers are not
yet clear.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

amos@taux01.UUCP (Amos Shapir) (09/13/89)

This idea is already in use by some MMU systems - it's called uncacheable
(or memory-locked) pages.  Such schemes let you specify which pages
will bypass the cache.  (Specifying a page address is equivalent
to marking a few address bits, which is what you suggested.)

-- 
	Amos Shapir		amos@taux01.nsc.com or amos@nsc.nsc.com
National Semiconductor (Israel) P.O.B. 3007, Herzlia 46104, Israel
Tel. +972 52 522261  TWX: 33691, fax: +972-52-558322
34 48 E / 32 10 N			(My other cpu is a NS32532)

les@unicads.UUCP (Les Milash) (09/13/89)

ok. since we're talkin about "How Caches Work" here's a question.

i realize that cache lines can be read faster than N individual read
cycles, either by
	0.	having the mem-to-cache path wider than the cache-to-
		processor path.
	1.0	doing burst (page-mode) access to a bank of dram.
	1.1	using interleaved banks.
so my question is:

who does #0?  i don't think i've seen it done in a micro (i.e. i don't think
i've seen any few-chip CMMU that comprehends (say) 16-byte wide paths to
the dram).  (i think i heard that the Lilith did fetch 8-byte instructions
into an 8-byte wide x 1 deep prefetch buffer).  now i have no problem
imagining Mr. Cray and the Fast Crowd doing wide things, but in the micro
world i haven't heard of it (yet--seems obvious now that big pin-count pkgs
are starting to appear).

also, what other things that i haven't seen prevent this from being an
easy performance-boost?

the gentleman from japan who was asking about integrating cache and dram
on one chip sounded like he was onto a good idea.

faster, faster, till the thrill of gips outweighs the cost of pins!

thanks folks,
Les Milash

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (09/13/89)

In article <MCCALPIN.89Sep12111256@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:
>Along a similar line, the Convex machines cache accesses going to the
>scalar unit only.  The load/store unit for the vector unit operates
>directly from main memory to the vector registers.  This is pretty
>close to what Roy describes, but it is not controlled by any tags,
>merely by which functional unit executes the load instruction.

This is one of the advantages of having
a vector instruction set, because this is *exactly* what is needed for
programs of this type.  This point is not always appreciated by systems
people who are used to dealing with kernel code and utilities, so it is
worth noting.  You do *not* want to flush the cache by routing vector
reads and writes through it.  A typical code on getting 10-20 average MFLOPS
on a machine with ~100 MBytes of real memory may easily chop though
25 MBytes on each time step, much bigger than *any* current cache.  If you
flush the cache, you slow down the intermixed scalar parts of the code.
Instead, the vector read/writes can go directly to memory, with a bus
watcher on the cache to update anything which happens to be in the cache.
It may seem like a kludge to have the instruction used determine whether
something is put in the cache or not, but there is actually a lot of
information in the fact that the data is being accessed with a vector
instruction.  Unfortunately, it won't save you from flushing the cache
occasionally when you have to do certain operations which won't vectorize.
There is a considerable art to optimizing the CPU/memory interaction,
just as there is to optimizing branch strategies, etc. 

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (09/13/89)

In article <27445@winchester.mips.COM> mash@mips.COM (John Mashey) writes:

>MIPS R3000s use one of the address bits to distinguish between
>cached/nocached access for the "unmapped" part of the kernel address
>space, and actually uses this in various ways.  For example, you can

It would appear that a "general" form of this would allow arbitrary
memory sections to be cache/nocache.  This is an interesting
alternative which is similar to the option on the CDC 2xx architecture
of having user selectable page sizes.  It could be a very useful feature
to have on a "superscalar" type machine, because you generally don't
want array accesses to be cached.  (On the CDC machines, the OS let
you specify at load time which arrays would go on large pages...)

>Understanding the usability of software-controlled caching is
>a good area for research over the next few years; the answers are not
>yet clear.

Since there are times when you *do* want array accesses to be cached,
this might be a case for having two different load instructions, one
cached and one not cached.  That would allow the compiler and/or user
to choose at any moment what the optimal choice is for that load.

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

mbutts@mentor.com (Mike Butts) (09/13/89)

From article <MCCALPIN.89Sep12111256@masig3.ocean.fsu.edu>, by mccalpin@masig3.ocean.fsu.edu (John D. McCalpin):
> In message <3989@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
> 
>>Here's a (possibly crazy) idea for cache design.  The current EUD
>>(Example Under Debate) shows that caches just don't work for sequential
>>access, but we knew that already.
> 
> 	[ Roy describes a system for which only a portion of the
> 	  address space is run through the cache ]
> 
>>	So what do you think?  Has this been done before?
> 
> Along a similar line, the Convex machines cache accesses going to the
> scalar unit only.  The load/store unit for the vector unit operates
> directly from main memory to the vector registers.  This is pretty
> close to what Roy describes, but it is not controlled by any tags,
> merely by which functional unit executes the load instruction.

It's not terribly hard to build a vector cache, or vector buffer, call it what
you will, into which vector elements are prefetched according to an address
and stride detected by the compiler for a predictable loop.
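
[A conceptual C model of what such a vector buffer does; the buffer depth
and the interface are invented, and a real one would be hardware fed by the
compiler-detected base address and stride.]

    #define VBUF 64   /* invented buffer depth */

    /* Software model of a vector buffer: given the base address and
     * stride the compiler found for a loop, fetch a batch of strided
     * elements ahead of time, then let the loop consume them without
     * going through the ordinary data cache.
     */
    double sum_strided(const double *base, long n, long stride)
    {
        double vbuf[VBUF];
        double sum = 0.0;
        long   i, j, chunk;

        for (i = 0; i < n; i += chunk) {
            chunk = (n - i < VBUF) ? n - i : VBUF;
            for (j = 0; j < chunk; j++)          /* prefetch phase */
                vbuf[j] = base[(i + j) * stride];
            for (j = 0; j < chunk; j++)          /* consume phase  */
                sum += vbuf[j];
        }
        return sum;
    }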
-- 
Michael Butts, Research Engineer       KC7IT           503-626-1302
Mentor Graphics Corp., 8500 SW Creekside Place, Beaverton, OR 97005
...!{sequent,tessi,apollo}!mntgfx!mbutts  OR  mbutts@pdx.MENTOR.COM
Opinions are my own, not necessarily those of Mentor Graphics Corp.

colin@watcsc.waterloo.edu (Colin Plumb) (09/14/89)

Not caching certain accesses is something the i860 does... there are
normal loads that the data cache tries to hit on, and pipelined loads
that have a programmer-visible delay and whose values are *not* loaded
into the cache.  They have some example code for matrix multiplication
that loads the rows into the cache and uses pipelined loads on the
columns to avoid memory delays (it's pipelined with the
multiply/accumulates) despite the awful locality of reference.
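
[Roughly what that i860 example amounts to, sketched as plain C.  The
comments mark which operands would use ordinary cached loads and which
would use the non-caching pipelined loads; the matrix size is arbitrary.]

    #define N 64   /* arbitrary matrix size */

    void matmul(double c[N][N], double a[N][N], double b[N][N])
    {
        int i, j, k;

        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++) {
                double sum = 0.0;

                for (k = 0; k < N; k++)
                    /* a[i][k]: row data, unit stride and reused for
                     * every j -- ordinary (cached) loads.  b[k][j]:
                     * column data with awful locality -- on the i860
                     * these would be the pipelined, non-caching loads.
                     */
                    sum += a[i][k] * b[k][j];
                c[i][j] = sum;
            }
    }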
-- 
	-Colin

preston@titan.rice.edu (Preston Briggs) (09/14/89)

In article <31815@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>In article <27445@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>
>>MIPS R3000s use one of the address bits to distinguish between
>>cached/nocached access for the "unmapped" part of the kernel address
>>space, and actually uses this in various ways.  For example, you can
>
>It would appear that a "general" form of this would allow arbitrary
>memory sections to be cache/nocache.  This is an interesting


>Since there are times when you *do* want array accesses to be cached,
>this might be a case for having two different load instructions, one
>cached and one not cached.  That would allow the compiler and/or user
>to choose at any moment what the optimal choice is for that load.

I think you're right.  The i860 allows both cached and non-cached loads.
It would sometimes be nice if they allowed a variable line size
(small, to prevent loading extra unneeded data,
or large, to preload data that will be needed soon).
A fetch-block-to-cache instruction might actually be what's wanted.

I expect compiler people will want more and more control over the cache
in the future.  Current hardware management will continue
to be demanded for un-cache-optimized code.

An analogy might be stack machines vs. register machines.
Without good register allocation, stack machines with top-of-stack
caching are pretty good.  With good register allocation,
we can do better by explicitly controlling the registers.

Preston Briggs
preston@titan.rice.edu

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (09/14/89)

In article <1184@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>In article <31815@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>
>>Since there are times when you *do* want array accesses to be cached,
>>this might be a case for having two different load instructions, one
>>cached and one not cached.  That would allow the compiler and/or user

>I think you're right.  The i860 allows both cached and non-cached loads.

You are right.  If I had bothered to read the Aug. 89 IEEE Micro before
touching my keyboard, I would have known that the i860 has exactly this
feature.  Good for Intel!

Aug. 89 also has a comparison of RISC architectures, although for some
reason it ignores MIPSCo...

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

rodman@mfci.UUCP (Paul Rodman) (09/14/89)

In article <31814@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>>Along a similar line, the Convex machines cache accesses going to the
>>scalar unit only.  The load/store unit for the vector unit operates
>>directly from main memory to the vector registers.  This is pretty
>
>This is one of the advantages of having
>a vector instruction set, because this is *exactly* what is needed for
>programs of this type.  

What you should have said is: "This is one of the advantages of having
control over whether or not to bypass the cache". A vector instruction
set gives a pretty coarse method of making the choice.

Another point worth noting is that many vector machines only do the
cache bypass if the stride = 1; other strides may go through the cache.
It depends, of course, on how good the memory system is.

pkr

david@daisy.UUCP (David Schachter) (09/14/89)

In article <639@unicads.UUCP> les@unicads.UUCP (Les Milash) asks about micro-
computers with a cache line size greater than the memory word size, and wonders
about the absence of such machines.

The Everex Step 386/20 uses a cache line size that depends on the amount of
cache installed.  64KB -> 4 byte line.  128KB -> 8 byte line.  256KB -> 16 byte
line.  The machine is/was the fastest 20 MHz 386 PC clone around.  The Step
386/25 uses the same approach and gets the same results.  Everex calls this
"AMMA", which stands for some marketing drivel.  I don't recall the details on
the Step 386/33.

The cache is a direct-mapped write-back cache, so they clearly brute-forced the
lower hit rate of direct mapping by putting in a lot of cache, and devoted the
engineering resources that could have gone to multi-way associativity to
write-back logic instead.

The design doesn't use ASICs or custom VLSI; it does use some PALs.

					-- David Schachter

"With friends like you, who needs hallucinations?"  -- "Night Court" (NBC)

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (09/15/89)

In article <1029@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes:
>In article <31814@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:

>>a vector instruction set, because this is *exactly* what is needed for

>What you should have said is: "This is one of the advantages of having
>control over whether or not to bypass the cache". A vector instruction
>set gives a pretty coarse method of making the choice.

You are right.  I should have noticed that the terms of the debate
have changed during the last year from: "Nobody runs applications
like that" to "What is the optimal way to support calculations involving
large arrays of floating point variables?"  Needless to say I am pleased.

I am now hoping that fast multiport memory subsystems will become the
next thing which we didn't need, but which we now can't live without.


  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117