[comp.arch] Cache Size

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (02/26/90)

In article <WAYNE.90Feb23091735@dsndata.uucp> wayne@dsndata.uucp 
	(Wayne Schlitt) writes:
>i am sure there _are_ real programs that can fit entirely in the 68030's
>massive (:-) 256 byte data and instruction caches.  i am sure there
>are a lot more real programs that can fit entirely in a 8k cache.  i
>am sure that a lot of real programs will fit nicely in 32k-128k
>caches.

Programs certainly do vary. The transaction people at Concurrent say
that their idea of a cache is a megabyte. There may also be programs
with essentially no locality: this has been claimed for some network
communications software.  (The point was that the packets would be
processed exactly once, and then passed along, never to be seen
again.)
-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science

wilson@carcoar.Stanford.EDU (Paul Wilson) (02/26/90)

In article <8168@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>In article <WAYNE.90Feb23091735@dsndata.uucp> wayne@dsndata.uucp 
>	(Wayne Schlitt) writes:
>>i am sure there _are_ real programs that can fit entirely in the 68030's
>>massive (:-) 256 byte data and instruction caches.  i am sure there
>>are a lot more real programs that can fit entirely in a 8k cache.  i
>>am sure that a lot of real programs will fit nicely in 32k-128k
>>caches.
>
>Programs certainly do vary. The transaction people at Concurrent say
>that their idea of a cache is a megabyte. There may also be programs
>with essentially no locality: this has been claimed for some network
>communications software.  (The point was that the packets would be
>processed exactly once, and then passed along, never to be seen
>again.)

Garbage collected languages have similar characteristics.  If you
use a generational garbage collector, you may be able to fit the
youngest generation entirely in the cache.  But for it to be 
worthwhile, you want a fairly big cache.

The problem is that you allocate upward through a region of memory,
allocating mostly very-short-lived objects that you touch a
few times and then never reference again.  A while later, you
copy the minority of live objects to some other region, and
allocate upward through the same area again.  If you can
keep this whole area in cache, you don't repeatedly miss on it.
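
To make the pattern concrete, here's a minimal sketch of a
bump-pointer nursery of the kind many generational collectors use.
Names and sizes are illustrative only, not from any particular
collector:

    /* Illustrative bump-pointer nursery; all names hypothetical. */
    #include <stddef.h>

    #define NURSERY_SIZE (256 * 1024)    /* ideally <= cache size */

    static char nursery[NURSERY_SIZE];
    static char *alloc_ptr = nursery;

    /* Allocation is a pointer bump: it sweeps linearly upward
       through the nursery, touching each block a few times at most. */
    void *gc_alloc(size_t n)
    {
        if (alloc_ptr + n > nursery + NURSERY_SIZE)
            return NULL;                 /* caller triggers a minor GC */
        void *p = alloc_ptr;
        alloc_ptr += n;
        return p;
    }

    /* A minor collection evacuates the few survivors to an older
       generation and resets the pointer, so the next sweep reuses
       the same addresses -- ideally still resident in the cache. */
    void minor_gc(void (*copy_survivors)(void))
    {
        copy_survivors();
        alloc_ptr = nursery;
    }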

This *may* require a *big* cache.  Ben Zorn's thesis from Berkeley
(Dec '89) shows that misses drop off only slowly for a large
direct-mapped cache like SPUR's, so that you might need
1-2MB caches to get really good hit rates.  I have a theory that
you might do much better with set-associative caches, if they're
around 200KB or more.  But the numbers aren't in yet.

Note that if your cache is not at least a bit bigger than your
youngest generation, you don't win -- you tend to evict blocks
before you reuse them -- LRU is your enemy here.  So the miss
rate is fairly insensitive to cache size until the cache
gets to be as large as the youngest generation.
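
A toy simulation shows how sharp the threshold is: give a
fully-associative LRU cache of C blocks a cyclic sweep over even C+1
blocks, and every single reference misses (parameters below are
arbitrary):

    /* Toy LRU simulation: cyclic sweep over one block more than
       the cache holds -> 100% misses.  Illustrative only. */
    #include <stdio.h>

    #define CACHE_BLOCKS 64

    static int cache[CACHE_BLOCKS];      /* resident block numbers */
    static int age[CACHE_BLOCKS];        /* time of last touch */

    static int touch(int block, int now)
    {
        int i, victim = 0;
        for (i = 0; i < CACHE_BLOCKS; i++) {
            if (cache[i] == block) { age[i] = now; return 0; }  /* hit */
            if (age[i] < age[victim]) victim = i;
        }
        cache[victim] = block;           /* miss: replace LRU block */
        age[victim] = now;
        return 1;
    }

    int main(void)
    {
        int i, t, misses = 0, n = CACHE_BLOCKS + 1;
        for (i = 0; i < CACHE_BLOCKS; i++) { cache[i] = -1; age[i] = -1; }
        for (t = 0; t < 10 * n; t++)
            misses += touch(t % n, t);
        printf("%d misses out of %d references\n", misses, 10 * n);
        return 0;
    }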

A smarter replacement policy, which knew about this, could
probably cut misses significantly for caches too small to
hold the whole allocation area.  (But is it worth it?  Maybe.
It might do a good job for the above-mentioned networking
software as well, and some other things.)
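
For instance (pure speculation on the policy, reusing the cache[],
age[], and CACHE_BLOCKS declarations from the toy simulator above):
evicting the *most* recently used block when the cache is full keeps
a fixed chunk of the sweep resident, and leaves only about one miss
per pass instead of a miss on every reference:

    /* MRU-style replacement for sweep patterns; a speculative
       sketch, not a policy from any real machine.  filled counts
       occupied slots (initially 0). */
    static int filled = 0;

    static int touch_mru(int block, int now)
    {
        int i, victim = 0;
        for (i = 0; i < filled; i++)
            if (cache[i] == block) { age[i] = now; return 0; }  /* hit */
        if (filled < CACHE_BLOCKS)
            victim = filled++;           /* use an empty slot */
        else
            for (i = 1; i < CACHE_BLOCKS; i++)
                if (age[i] > age[victim]) victim = i;  /* MRU slot */
        cache[victim] = block;
        age[victim] = now;
        return 1;
    }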

>-- 
>Don		D.C.Lindsay 	Carnegie Mellon Computer Science

   -- Paul

Paul R. Wilson                         
Software Systems Laboratory               lab ph.: (312) 996-9216
U. of Illin. at C. EECS Dept. (M/C 154)   wilson@bert.eecs.uic.edu
Box 4348   Chicago,IL 60680 

koopman@a.gp.cs.cmu.edu (Philip Koopman) (02/27/90)

In article <1990Feb26.022057.28461@Neon.Stanford.EDU>, wilson@carcoar.Stanford.EDU (Paul Wilson) writes:
> Garbage collected languages have similar characteristics.  If you
> use a generational garbage collector, you may be able to fit the
> youngest generation entirely in the cache.  But for it to be 
> worthwhile, you want a fairly big cache.
> ...

My thesis work showed that functional programming languages that
use combinator graph reduction have many unexpected cache
characteristics.  They all seem to make excellent use of even very
small caches of a few Kbytes.

On the other hand, scientific/engineering number-crunching is
notorious for breaking caches.  Any time the numeric arrays
you are operating on get substantially bigger than the cache size,
you end up heading towards a 0% hit rate.  And, for any cache size
you can *always* find someone who wants to use an array bigger
than you can handle.  Yes, block-manipulation techniques can help,
but they are a hack to get around this basic problem -- and are
just about as effective with smaller, cheaper vector register files.
So, that's why most supercomputers seem to use vector register
files instead of caches for their vector units.
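
For concreteness, the block-manipulation hack mentioned above is
usually loop tiling.  A generic textbook sketch (sizes arbitrary,
not from any particular machine; assumes N divisible by B and c[]
zeroed by the caller):

    /* Generic cache blocking (tiling) for matrix multiply.  B is
       chosen so a few BxB tiles fit in the cache. */
    #define N 512
    #define B 32                         /* tile edge; tune to cache */

    void matmul_blocked(const double *a, const double *b, double *c)
    {
        int i0, j0, k0, i, j, k;
        for (i0 = 0; i0 < N; i0 += B)
            for (j0 = 0; j0 < N; j0 += B)
                for (k0 = 0; k0 < N; k0 += B)
                    /* one tile-pair product; operands stay cached */
                    for (i = i0; i < i0 + B; i++)
                        for (j = j0; j < j0 + B; j++)
                            for (k = k0; k < k0 + B; k++)
                                c[i*N + j] += a[i*N + k] * b[k*N + j];
    }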

  Phil Koopman                koopman@greyhound.ece.cmu.edu   Arpanet
  2525A Wexford Run Rd.
  Wexford, PA  15090
Senior Scientist at Harris Semiconductor, adjunct professor at CMU.
I don't speak for them, and they don't speak for me.
 

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (02/28/90)

In article <8189@pt.cs.cmu.edu> koopman@a.gp.cs.cmu.edu (Philip Koopman) writes:
>In article <1990Feb26.022057.28461@Neon.Stanford.EDU>, wilson@carcoar.Stanford.EDU (Paul Wilson) writes:
>So, that's why most supercomputers seem to use vector register
>files instead of caches for their vector units.

Or, at least semi-bypass the scalar data cache for vector reads and writes 
(coherency can become a performance issue, though, if you have a real cache).

On the Cray machines, scalar data "caches" have always been explicitly 
programmed as "registers" anyway, so the bypassing is easy.

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)604-6117       

dik@cwi.nl (Dik T. Winter) (02/28/90)

In article <8189@pt.cs.cmu.edu> koopman@a.gp.cs.cmu.edu (Philip Koopman) writes:
 > So, that's why most supercomputers seem to use vector register
 > files instead of caches for their vector units.
 > 
Well, no (depends on your definition of supercomputer of course; let us
assume vector processor).  There are systems without cache that use vector
registers (Cray, NEC), or have memory to memory operations (Cyber 205).
And there are processors with cache.  Possibilities are:
1.  No vector registers, bypass cache (Cyber 995).
2.  Vector registers, bypass cache (i know none).
3.  No vector registers, through cache (again, i know none).
4.  Vector registers, through cache (IBM 3090, Convex, Alliant, Gould).
So, no, vector registers are not a replacement for cache.
-- 
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl

ps@fps.com (Patricia Shanahan) (03/01/90)

In article <8848@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In article <8189@pt.cs.cmu.edu> koopman@a.gp.cs.cmu.edu (Philip Koopman) writes:
> > So, that's why most supercomputers seem to use vector register
> > files instead of caches for their vector units.
> > 
>Well, no (depends on your definition of supercomputer of course; let us
>assume vector processor).  There are systems without cache that use vector
>registers (Cray, NEC), or have memory to memory operations (Cyber 205).
>And there are processors with cache.  Possibilities are:
>1.  No vector registers, bypass cache (Cyber 995).
>2.  Vector registers, bypass cache (i know none).

The FPS Model 500 is in this category. The scalar processors use
caches for instructions and scalar data references. The vector processors
are connected to the System Memory Bus and fetch data from memory without
going through cache. 

The only interaction is that the data caches snoop on the bus, and stores
by the vector processors cause corresponding purges in the data caches.

>3.  No vector registers, through cache (again, i know none).
>4.  Vector registers, through cache (IBM 3090, Convex, Alliant, Gould).

I would add the VAX 9000 to this list, but question whether the Convex C2
belongs here.  I think it is a vector-registers, bypass-cache system.

>So, no, vector registers are not a replacement for cache.
>-- 
>dik t. winter, cwi, amsterdam, nederland
>dik@cwi.nl

Vector registers and caches both perform the same basic function: holding
data that is likely to be used in future calculations close to the arithmetic
unit that will do those calculations.  They also serve to buffer between
the arithmetic unit's need for individual numbers and the memory system's
preference for highly parallel processing of larger blocks.

Caches are very good when reference patterns are basically unpredictable and
recent past reference to a data item, or to something close to it in memory, is
positively correlated with the probability of near-future reference.  They work
by fetching a block surrounding a referenced item (because of spatial locality)
and keeping it around for a while to exploit temporal locality.

Vector memory references are not like that.  The reference pattern is set at
compile time and known to the compiler.  During a strided fetch there is no
spatial locality.  There is frequently a strong negative correlation between
recent past reference and near-future reference because of the large-scale
sequential characteristics of vector processing.  Any cache substantially
smaller than main memory can be flooded by some vector jobs.
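
A concrete instance of the strided case (array shape arbitrary):
summing one column of a row-major array.  Every reference lands in a
different cache line, so the block the cache fetches around each
element is pure overhead:

    /* Strided fetch with no spatial locality: a column walk
       through a row-major array touches a new cache line on
       every reference. */
    #define N 1024                       /* one row = 8 KB */

    double sum_column(double a[N][N], int col)
    {
        double s = 0.0;
        int row;
        for (row = 0; row < N; row++)
            s += a[row][col];            /* stride of N*8 bytes */
        return s;
    }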

The very big manufacturers, especially IBM, are something of an exception
because they can persuade third party software vendors to do a lot of special
tuning for their systems. It is usually possible, with care, to arrange a
vector job so that it will fit well with a particular cache size.
--
	Patricia Shanahan
	ps@fps.com
        uucp : {decvax!ucbvax || ihnp4 || philabs}!ucsd!celerity!ps
	phone: (619) 271-9940

patrick@convex.com (Patrick F. McGehearty) (03/01/90)

In article <8848@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In article <8189@pt.cs.cmu.edu> koopman@a.gp.cs.cmu.edu (Philip Koopman) writes:
>1.  No vector registers, bypass cache (Cyber 995).
>2.  Vector registers, bypass cache (i know none).
>3.  No vector registers, through cache (again, i know none).
>4.  Vector registers, through cache (IBM 3090, Convex, Alliant, Gould).

Actually, in the Convex C2 series, vector loads and stores bypass the cache.
Scalar operations use the cache.  One argument for this design is that
the large amount of data streamed by vector operations would quickly displace
most scalar data in the cache, such as local variables, loop limits, etc.

mccalpin@vax1.acs.udel.EDU (John D Mccalpin) (03/01/90)

In article <8848@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In article <8189@pt.cs.cmu.edu> koopman@a.gp.cs.cmu.edu (Philip Koopman) writes:
> > So, that's why most supercomputers seem to use vector register
> > files instead of caches for their vector units.
> > 
>Well, no (depends on your definition of supercomputer of course; let us
>assume vector processor).  There are systems without cache that use vector
>registers (Cray, NEC), or have memory to memory operations (Cyber 205).
>And there are processors with cache.  Possibilities are:
>1.  No vector registers, bypass cache (Cyber 995).
>2.  Vector registers, bypass cache (i know none).
>3.  No vector registers, through cache (again, i know none).
>4.  Vector registers, through cache (IBM 3090, Convex, Alliant, Gould).
>So, no, vector registers are not a replacement for cache.
>-- 
>dik t. winter, cwi, amsterdam, nederland
>dik@cwi.nl

I believe that the Convex C-2 series bypass the cache to load/store 
the vector registers, so it should be in category 2, not category 4.

The ETA-10 is a special case, different from any of the above
categories.  It had no vector registers (being a memory-to-memory
machine), but it had a set of cache-like registers (the "short-stop"
registers) that were used to cache only short vectors.  I never got a
definitive answer from any of my ETA friends about exactly how big this
register set was or exactly how the hardware decided to use it.  I
do know from direct observation that repeated access to short vectors
showed a lower start-up cost than the equivalent long-vector
operations.

The impression that I got was that this use is different from category
3, since the short-stop registers were only used for vector operands.
I think that short vectors were stuffed into these registers in
parallel with their first use, and then they could be re-loaded
directly from the short-stop registers if needed again soon.  The
output of the vector unit also seems to be loaded into the short-stop
registers (if short enough) to make back-to-back vector operations run
with slightly less overhead --- this is in the spirit of chaining.

Corrections from informed sources welcome....

bdg@tetons.UUCP (Blaine Gaither) (03/02/90)

>In article <8848@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>>In article <8189@pt.cs.cmu.edu> koopman@a.gp.cs.cmu.edu (Philip Koopman) 
>>writes:
>>1.  No vector registers, bypass cache (Cyber 995).
>>2.  Vector registers, bypass cache (i know none).
>>3.  No vector registers, through cache (again, i know none).
>>4.  Vector registers, through cache (IBM 3090, Convex, Alliant, Gould).
>
>Actually, in the Convex C2 series, vector loads and stores bypass the cache.
>Scalar operations use the cache.  One argument for this design is that
>the large amount of data used in vector operations would quickly invalidate
>most scalar data in the cache, such as local variables, loop limits, etc.

I have to agree with the arguments stated by Pat.  The Gould NP1 was
(still is) a cache-based vector mini-super.  All vector operations went
through a store-through cache.  The cache size was small (16KB data).
It turned out to perform rather poorly on some scientific codes.  The
basic machine was about 2x its competition at scalar, and about equal
on vector codes if their data fit in the data cache.  If an algorithm's
data didn't fit in the cache, it ran at 0.3 to 0.5 times the
in-cache speed.

There were several other factors of the hw/sw design that were not
optimal for the scientific market, but the cache-based design did not
help.  One problem is that the smaller the cache, the greater the
emphasis that must be placed on algorithms that make better use of the
memory hierarchy.  If users and 3rd-party software vendors use vendor
libraries, they can get good performance (if the vendor bothers to do
the requisite algorithm work).  If they don't, it is an uphill battle
for compiler writers to recognize poor algorithms and substitute
better ones.  These tools must be ready before 3rd-party ports start.

Cache-based vector machines can in general be assumed to take the same
number of cache misses as a machine executing the same algorithm in
scalar mode.  The big problem is that a vector machine takes them
over a much shorter period of time, so they become a relatively more
serious problem than they would be on a typical scalar machine.

Another usability problem with a cache-based approach is the
discontinuity it causes in performance when plotted vs. problem
size.  This is very disconcerting.  If you do a matrix multiply
of 2 nxn matrices and then multiply 2 (n+5)x(n+5) matrices, you can
experience a startling slowdown (m sec vs 3m sec).  This causes some
very interesting measurement problems in scientific benchmarks which
attempt to find [N sub 1/2]; i.e., you have to find the peak vector
performance (and it ain't at 1000x1000 matrices).  So measurement
techniques that expect monotonically increasing (or worse, continuous)
values of performance vs. problem size get wrong answers.
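
The effect is easy to expose with a size sweep.  The sketch below
times a simple dot product at doubling sizes; the time per unit of
work jumps once the working set outgrows the cache (kernel and sizes
are generic, not measurements of any particular machine):

    /* Size sweep to expose the cache cliff.  Illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double dot(const double *x, const double *y, long n)
    {
        double s = 0.0;
        long i;
        for (i = 0; i < n; i++)
            s += x[i] * y[i];
        return s;
    }

    int main(void)
    {
        long n;
        for (n = 256; n <= (1L << 20); n *= 2) {
            double *x = malloc(n * sizeof *x);
            double *y = malloc(n * sizeof *y);
            long i, reps = (1L << 24) / n;   /* equal work per size */
            volatile double s = 0.0;
            clock_t t0;
            if (!x || !y) return 1;
            for (i = 0; i < n; i++) { x[i] = 1.0; y[i] = 2.0; }
            t0 = clock();
            for (i = 0; i < reps; i++)
                s += dot(x, y, n);
            printf("n=%8ld  %.3f sec\n",
                   n, (double)(clock() - t0) / CLOCKS_PER_SEC);
            free(x); free(y);
        }
        return 0;
    }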

Well, 16K is a very small cache for scientific problems.  What about
larger caches?  Well, larger is better.  It moves the discontinuity so
it affects fewer problems.  At some point you might not care.  But if
the reason you are buying the architecture (as well as the
implementation) is that you see ever-increasing problem sizes, then
even if it works fast enough on today's problems, will it be fast
enough for tomorrow's problem sizes?

What can a cache-based vector machine do?

	Use a dual ported cache with parallel prefetch.

	Use a very large cache.

	Use a very very large secondary cache to reduce miss penalty.