[comp.arch] 68040 and caches

jesup@cbmvax.commodore.com (Randell Jesup) (02/27/91)

	Random question concerning the 68040: what do people think about
the utility/cost effectiveness/need for external caches, given that it
has ?4-way? associative 4K I and D caches internally and a single
external bus?  What sort of speedups/cache size do you think you'd be
likely to get?  I would suspect you don't need as large caches as with
most "risc" chips, because of the more complex, higher-density instructions,
but how much effect does this really have (are there any recent figures out
there)?

	What about external caches on other CISC's, such as 68030's, x86's
(yech), etc?  Certainly at some point you get insufficient gain for the
expense of adding more cache (I know, "insufficient" is a subjective term).
I'm interested in where people think the crossovers are (and I suppose for
RISC's too while we're at it).

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com  BIX: rjesup  
The compiler runs
Like a swift-flowing river
I wait in silence.  (From "The Zen of Programming")  ;-)

torrie@cs.stanford.edu (Evan Torrie) (02/27/91)

jesup@cbmvax.commodore.com (Randell Jesup) writes:

>	Random question concerning the 68040: what do people think about
>the utility/cost effectiveness/need for external caches (given that it
>has ?4-way? associative 4K I and D caches internally and a single
>external bus.  

  I don't have any figures for the 68040, but for a very good
explanation of the details and tradeoffs in cache design, take a look
at Steven Przybylski's "Cache and Memory Hierarchy Design:  A
Performance-Directed Approach", Morgan Kaufmann, 1990.
  There are, of course, many issues which would dictate whether adding
a second-level cache is "worth it".  Workload is a big factor - Unix
type development environments are very different from personal
computers.  
  Cost also plays an important part.  If you're striving for the last
percentage point of performance, you can afford to spend a lot on the
cache.  If you're prepared to sacrifice peak speed in order to get a
low cost machine (as it seems NeXT's designers have chosen), the 4K
I/D caches are probably sufficient.
  Writing from a MIPS RISC perspective, Przybylski suggests that 4K
caches are far from optimal.  He suggests 64K - 256K for an external
cache, and argues a case for direct-mapped over set-associative
caches.
  You mention the issue of code density on the 040 vs RISC type
machines.  I wonder if this will actually be less of a factor than it
is with the 68030.  I believe it's true that the 040, taking its
RISC-like approach, is actually optimised for the very simple
addressing modes, and will have a lower overall CPI if the
code contains more of these simple instructions in place of
instructions using a complicated 680x0 addressing mode.
  Perhaps someone else can confirm this, along with whether this is
being implemented in any 68040 specific compilers.

>	What about external caches on other CISC's, such as 68030's, x86's
>(yech), etc?  Certainly at some point you get insufficient gain for the
>expense of adding more cache (I know, "insufficient" is a subjective term).
>I'm interested in where people think the crossovers are (and I suppose for
>RISC's too while we're at it).

  My opinion...  For a Unix type workload, 64K-256K of cache is about
where you should be now.  For a PC type workload, 32K-64K.  Apple seems
to have done studies on its IIfx and IIci cache designs which indicate
that anything more than 32K of cache for the Mac design is overkill.
  Other crossover points...  set associativity of more than 2-4 is
wasted.  A block size of 4 words - 8 words is usually optimal.
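
  To put sizes, associativity and block size on one footing, here is a
small Python sketch of how those three parameters fix the number of
sets and the address split.  The 16-byte line (4 words of 4 bytes) and
the 4-way figure for the 4K on-chip caches are assumptions taken from
the numbers floated in this thread, not from a data sheet, and the
function names are invented for the example.

```python
def cache_geometry(size_bytes, ways, line_bytes):
    """Return (sets, offset_bits, index_bits) for a cache.

    Assumes size, ways and line size are powers of two.
    """
    sets = size_bytes // (ways * line_bytes)
    offset_bits = line_bytes.bit_length() - 1   # log2(line_bytes)
    index_bits = sets.bit_length() - 1          # log2(sets)
    return sets, offset_bits, index_bits


def split_address(addr, offset_bits, index_bits):
    """Split an address into (tag, set index, byte offset within the line)."""
    offset = addr & ((1 << offset_bits) - 1)
    index = (addr >> offset_bits) & ((1 << index_bits) - 1)
    tag = addr >> (offset_bits + index_bits)
    return tag, index, offset


# Assumed on-chip cache: 4K, 4-way, 16-byte lines.
on_chip = cache_geometry(4 * 1024, 4, 16)
# Assumed external cache: 64K, direct-mapped (1-way), 16-byte lines.
external = cache_geometry(64 * 1024, 1, 16)

print("on-chip :", on_chip)
print("external:", external)
print("0x12345 in on-chip cache:", split_address(0x12345, on_chip[1], on_chip[2]))
```

  Under these assumptions the 4K 4-way cache has only 64 sets, while a
64K direct-mapped cache has 4096, so the external cache spreads
conflicting addresses over far more sets even with no associativity.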

  My $0.02 worth...

-- 
------------------------------------------------------------------------------
Evan Torrie.  Stanford University, Class of 199?       torrie@cs.stanford.edu   
"And in the death, as the last few corpses lay rotting in the slimy
 thoroughfare, the shutters lifted in inches, high on Poacher's Hill, and

lindsay@gandalf.cs.cmu.edu (Donald Lindsay) (02/28/91)

In article <19330@cbmvax.commodore.com>
	jesup@cbmvax.commodore.com (Randell Jesup) writes:
>	Random question concerning the 68040: what do people think about
>the utility/cost effectiveness/need for external caches (given that it
>has ?4-way? associative 4K I and D caches internally and a single
>external bus.

Claimed results from HP (for the HP 425t and HP 425s, both 25 MHz) are:

KB of external cache:     0       128

Overall SPECmark          11      11.8
Integer SPECs             12.3    12.9
Float   SPECs             10.2    11

Note that this is at 25 MHz. What this data is saying is that the
disparity between on-chip cache and main memory is not extreme
enough.  [For this benchmark suite] few systems can justify adding an
intermediate level to the memory hierarchy.

A higher clock rate, or a slower main memory, would cause a bigger
disparity. Eventually, the external cache would be reasonable, to
reduce the penalty of the onchip cache misses.
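
To make the clock-rate argument concrete, here is a back-of-the-envelope
model in Python. Every number in it (base CPI, misses per instruction,
external-cache hit cost, external-cache local miss rate) is a made-up
assumption for illustration, not a 68040 or HP measurement; only the
shape of the result matters.

```python
def effective_cpi(base_cpi, misses_per_instr, mem_penalty,
                  l2_hit_penalty=None, l2_local_miss_rate=0.0):
    """CPI including stalls caused by on-chip cache misses.

    Penalties are in CPU cycles. If l2_hit_penalty is None there is no
    external cache and every on-chip miss pays the full memory penalty.
    """
    if l2_hit_penalty is None:
        miss_penalty = mem_penalty
    else:
        # Every on-chip miss pays the external-cache lookup; the fraction
        # that also misses there pays the main-memory penalty on top.
        miss_penalty = l2_hit_penalty + l2_local_miss_rate * mem_penalty
    return base_cpi + misses_per_instr * miss_penalty


def l2_speedup(mem_penalty, base_cpi=1.5, misses_per_instr=0.05,
               l2_hit_penalty=4, l2_local_miss_rate=0.3):
    """Speedup from adding the hypothetical external cache (>1 is a win)."""
    without = effective_cpi(base_cpi, misses_per_instr, mem_penalty)
    with_l2 = effective_cpi(base_cpi, misses_per_instr, mem_penalty,
                            l2_hit_penalty, l2_local_miss_rate)
    return without / with_l2


# Main-memory penalty in CPU cycles: grows with clock rate or slower DRAM.
for mem_penalty in (5, 8, 30):
    print(mem_penalty, round(l2_speedup(mem_penalty), 3))
```

With these assumed numbers the external cache is a slight loss when
memory is only 5 cycles away, a gain of a few percent at 8 cycles, and
a large gain at 30 cycles.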

Note, by the way, that second-level caches have to be much larger
than first-level caches. This is because the first-level cache skims
the cream, and any second-level cache sees an address stream with
most of its locality removed. With bad locality, a small cache isn't
going to help. Luckily, the second-level cache only has to be prompt,
not screamingly fast, so it isn't *that* expensive to build one.
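
The "skims the cream" effect can be shown with a toy simulation. The
Python sketch below is an illustration only: it assumes fully
associative LRU caches and a synthetic looping trace, neither of which
describes the 68040 or any real workload, and the class and function
names are invented for the example.

```python
from collections import OrderedDict


class LRUCache:
    """Fully associative cache of `capacity` lines with LRU replacement."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()

    def access(self, line):
        """Return True on a hit; on a miss, load the line, evicting the LRU one."""
        if line in self.lines:
            self.lines.move_to_end(line)      # mark as most recently used
            return True
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)    # evict least recently used
        self.lines[line] = None
        return False


def l2_local_hit_rate(trace, l1_lines, l2_lines):
    """Fraction of first-level misses that hit in the second-level cache."""
    l1, l2 = LRUCache(l1_lines), LRUCache(l2_lines)
    l2_refs = l2_hits = 0
    for line in trace:
        if not l1.access(line):               # L2 only sees L1 misses
            l2_refs += 1
            l2_hits += l2.access(line)
    return l2_hits / l2_refs if l2_refs else 0.0


# Synthetic trace: a loop touching 12 lines, repeated 50 times.
trace = list(range(12)) * 50
small = l2_local_hit_rate(trace, l1_lines=8, l2_lines=8)
large = l2_local_hit_rate(trace, l1_lines=8, l2_lines=16)
print("L2 same size as L1:", small)
print("L2 twice the size :", large)
```

On this trace a second-level cache no bigger than the first catches
none of the first-level misses, while one large enough to hold the
whole loop catches everything after the first pass.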

-- 
Don		D.C.Lindsay .. temporarily at Carnegie Mellon Robotics

edwardm@hpcuhd.HP.COM (Edward McClanahan) (03/02/91)

> >	Random question concerning the 68040: what do people think about
> >the utility/cost effectiveness/need for external caches (given that it
> >has ?4-way? associative 4K I and D caches internally and a single
> >external bus.  

>   My opinion...  For a Unix type workload, 64K-256K of cache is about
> where you should be now.  For a PC type workload, 32K-64K.  Apple seems
> to have done studies on its IIfx and IIci cache designs which indicate
> that anything more than 32K of cache for the Mac design is overkill.

Another issue to consider is cache coherency in systems with multiple CPUs
or DMA (e.g. most of the machines mentioned).  I do not know what the '040
internal caches do to solve this problem, but external caches which don't
address it (cache coherency) can actually be quite a hassle to program
around.

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

  Edward McClanahan
  Hewlett Packard Company     -or-     edwardm@cup.hp.com
  Mail Stop 42UN
  11000 Wolfe Road                     Phone: (480)447-5651
  Cupertino, CA  95014                 Fax:   (408)447-5039

mash@mips.com (John Mashey) (03/05/91)

In article <12143@pt.cs.cmu.edu> lindsay@gandalf.cs.cmu.edu (Donald Lindsay) writes:
>
>In article <19330@cbmvax.commodore.com>
>	jesup@cbmvax.commodore.com (Randell Jesup) writes:
>>	Random question concerning the 68040: what do people think about
>>the utility/cost effectiveness/need for external caches (given that it
>>has ?4-way? associative 4K I and D caches internally and a single
>>external bus.
>
>Claimed results from HP (for the HP 425t and HP 425s, both 25 MHz) are:
>
>KB of external cache:	0	128
>
>overall SPECmark	11	11.8
>Integer SPECs		12.3	12.9
>Float   SPECs		10.2	11
>
>Note that this is at 25 MHz. What this data is saying is that the
>disparity between on-chip cache and main memory is not extreme
>enough.  [For this benchmark suite] few systems can justify adding an
>intermediate level to the memory hierarchy.
>
>A higher clock rate, or a slower main memory, would cause a bigger
>disparity. Eventually, the external cache would be reasonable, to
>reduce the penalty of the onchip cache misses.

>Note, by the way, that second-level caches have to be much larger
>than first-level caches. This is because the first-level cache skims
>the cream, and any second-level cache sees an address stream with
>most of its locality removed. With bad locality, a small cache isn't
>going to help. Luckily, the second-level cache only has to be prompt,
>not screamingly fast, so it isn't *that* expensive to build one.

Yes, all of this is an interesting effect; maybe somebody who knows
the details can help sort out some of the following conjectures.
First, a few observations:
1) Using MHz/SPECint as an approximation for CPI, we get about 2.0,
+/- a little, depending on external cache, or not.  The lower the
CPI, the worse cache-missing hurts you; the higher the CPI, less hurt.
For comparison, does anyone have any SPEC numbers for 486s that do NOT
have secondary caches? (The CPI is similar, but the interface is
probably different).
2) The miss rate for the on-chip caches is, of course, identical
between the two configurations.
3) Thus, this just really says:
	The average miss penalty is only slightly better with
	the external cache than without it, for these programs, on a 68040.
4) Let's consider some reasons why that might be:
	a) Main memory is fast, using page-mode DRAM or whatever to
	get fairly fast refills, but also (if that's what they're doing),
	it costs you some to switch between pages.
	One would assume that the secondary cache buys MORE in a larger
	machine with a larger (and probably, longer latency) memory
	system, and LESS in a tight-coupled workstation / embedded design.
	b) Perhaps there is something in the 68040-interface+external
	secondary cache control that has a higher penalty than one
	would expect.  I assume the secondary cache is writeback (?).
	Maybe the design requires flushing dirty data back to DRAM
	before initiating the refill?  Maybe there are extra cycles
	for synchronizing everything?
Anyway, maybe somebody who actually knows can post a few details,
since the rest of us are just guessing.
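
For anyone who wants to redo the arithmetic in observation 1, here is a
short Python sketch using only the SPEC figures quoted above. Dividing
MHz by SPECint treats one SPECint unit as roughly one native MIPS,
which is the approximation named above and nothing more exact than
that; the variable names are invented for the example.

```python
CLOCK_MHZ = 25.0

# SPEC figures as quoted above, keyed by KB of external cache.
SPEC = {
    0:   {"overall": 11.0, "int": 12.3, "fp": 10.2},
    128: {"overall": 11.8, "int": 12.9, "fp": 11.0},
}


def approx_cpi(clock_mhz, specint):
    """MHz / SPECint, used as a rough stand-in for cycles per instruction."""
    return clock_mhz / specint


cpi_no_l2 = approx_cpi(CLOCK_MHZ, SPEC[0]["int"])
cpi_l2 = approx_cpi(CLOCK_MHZ, SPEC[128]["int"])

# With identical on-chip miss rates, the CPI gap is the average number of
# stall cycles per instruction that the external cache removes.
stall_cycles_saved = cpi_no_l2 - cpi_l2
overall_speedup = SPEC[128]["overall"] / SPEC[0]["overall"]

print("approx CPI, no external cache:", round(cpi_no_l2, 2))
print("approx CPI, 128 KB external  :", round(cpi_l2, 2))
print("stall cycles saved per instr :", round(stall_cycles_saved, 2))
print("overall SPECmark ratio       :", round(overall_speedup, 3))
```

Both approximate CPIs land close to 2.0, and the external cache removes
less than a tenth of a cycle per instruction on these numbers.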
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086

mash@mips.com (John Mashey) (03/05/91)

Oops, I forgot.
A really good analysis of memory hierarchy design is:
Steven A. Przybylski, CACHE AND MEMORY HIERARCHY DESIGN: A Performance-
Directed Approach, Morgan Kaufmann, San Mateo, CA, 1990.

There's a lot of analysis on multi-level caches.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086