[net.arch] Cache revisited + page tables

kds@intelca.UUCP (Ken Shoemaker) (08/13/85)

> John Van Zandt of Loral Instrumentation (ucbvax!sdcsvax!jvz) said:
> > Caches are strange creatures; exactly how they are designed can impact
> > performance significantly.  Your idea of separate caches for supervisor
> > and user is a good one.
> I believe (uninformed opinion) that this makes it perform worse, all
> other things equal.  It means that at any given time, half the cache
> is unusable, thus if you spend 90% of your time in user state you
> only have a halfsize cache.  (Ditto if you are doing system state stuff.)

I believe (uninformed opinion) that a better solution is to use a
multi-way set-associative cache.  Also, the performance enhancement or
degradation is all tied to the amount of time spent bouncing between
user/supervisor mode and the average lifetime of items in the cache.
If you have a large cache (with presumably large hit rates) in a program
that does lots of system calls, you could easily thrash your cache
if it spends all its time updating the same cache cells shared between
the user code and the supervisor code.  A multi-way associative cache
is really the best of both worlds, since it allows both options (a large
non-split cache or two half size split caches), albeit at increased expense...
The same argument could be applied to separate instruction/data caches.

> is something like 50%.  The reason for external caches is probably
> to speed up data accesses, which otherwise would go at main memory speeds.
> 
> > >P.S. Both Altos and Charles River Data had both the 020 cache and
> > >an external 8K cache, Dual did not. Anyone know why?
> Dual probably wanted to build a cheaper system.  Depending on the hit
> rate and timings, a cache system may LOSE performance because it takes
> longer to do a cache miss in a cached system than it takes to do a
> memory access in a noncache system.  Remember, performance with a cache
> = (hitrate*hitspeed) + ((1-hitrate)*missspeed).  What you buy in
> hitspeed may not reclaim all that you lose in missspeed.

But what is the real story here, i.e., does anyone out there have real
numbers for the systems (or others) mentioned?

One more thing, doesn't putting all the page tables in dedicated memory:

	1) waste lots of high-speed expensive memory
	2) limit the virtual memory size of processes

Maybe this wasn't a problem with 68010s, since they only had 24 address
bits, but with 68020s, I would think it a problem, especially if
someone wants to do something as foolish as direct file mapping
onto optical disks.  Any solutions, or do I not understand the gist
of how these puppies work?
-- 
...and I'm sure it wouldn't interest anybody outside of a small circle
of friends...

Ken Shoemaker, Microprocessor Design for a large, Silicon Valley firm

{pur-ee,hplabs,amd,scgvaxd,dual,qantel}!intelca!kds
---the above views are personal.  They may not represent those of the
	employer of their submitter.

mash@mips.UUCP (John Mashey) (08/14/85)

Ken Shoemaker writes:
> > John Van Zandt of Loral Instrumentation (ucbvax!sdcsvax!jvz) said:
> > > Caches are strange creatures; exactly how they are designed can impact
> > > performance significantly.  Your idea of separate caches for supervisor
> > > and user is a good one.
> > I believe (uninformed opinion) that this makes it perform worse, all
> > other things equal.  It means that at any given time, half the cache
> > is unusable, thus if you spend 90% of your time in user state you
> > only have a halfsize cache.  (Ditto if you are doing system state stuff.)
> 
> I believe (uninformed opinion) that a better solution is to use a
> multi-way set associative cache.  ...  A multi-way associative cache
> is really the best of both worlds, since it allows both options (a large
> non-split cache or two half size split caches), albeit at increased expense...
> The same argument could be applied to separate instruction/data caches.

Most of this is true, except for caveats to the last sentence.
Direct-mapped, split I & D caches behave somewhat like a 2-way set-assoc. cache,
i.e., they have measurably higher hit rates than a direct-mapped joint I&D cache.
[With direct-mapped joint, all you need is one frequently-used loop that
happens to reference clashing data, and you get many misses.]
Caches are indeed strange things, but they do follow a few reasonable rules.
Given a fixed total amount of cache memory:

1) (N+1)-way set-associative cache has a higher hit rate than N-way.
(in particular, 2-way is higher than 1-way (direct)).

2) Joint caches have higher hit rates than split ones.

Unfortunately,
1) (N+1)-way is more expensive than N-way (for the same speed).
Even worse, unless you're building everything from scratch, you may
be able to buy parts to make N-way go fast enough, but maybe not for N+1.
In particular, there's a big jump from N=1 to anything higher, and fast
CPUs demand fast cache access.

2) Split I&D caches can get away with using slower SRAMs than do joint
I&D caches, at least for some architectures, because you can more-or-less
alternate accesses to the 2 caches.

Although true, the above is a ferocious over-simplification - it's very
hard to evaluate cache designs without knowing:
a) CPU chip nature  (speed; execution cycles per I-fetch - i.e., CISC vs RISC).
b) Choice of write-thru vs write-back caches; use of write buffers.
c) Approach to cache coherency, i.e., bus-watching cache vs software methods.
d) Main bus speed.
e) Requirements (or lack thereof) for real multi-processor support.
f) Use of optimizing compilers that put things in registers, often driving
the hit rate down [yes, down], although the speed is improved and there are
fewer total memory references.
g) Physical versus virtual caches, and interaction with timing of
whatever memory management scheme is used; related are interactions
on operating system, given style of shared address spaces, and requirements,
if any, for cache-flushing - there are many tradeoff combinations possible.
h) Cost, speed, and availability of fast RAMs.
i) Board size and power limitations.
[I'm sure I've forgotten some, but these come to mind quickly.]

Nontrivial stuff; not necessarily intuitive; easy to do wrong.
-- 
-john mashey
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	415-960-1200
USPS: 	MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043

ed@mtxinu.UUCP (Ed Gould) (08/19/85)

In article <168@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
>
>Nontrivial stuff; not necessarily intuitive; easy to do wrong.

Which means Simulate, Simulate, Simulate.  Don't try to figure it
out in your head, and don't guess.

-- 
Ed Gould                    mt Xinu, 2910 Seventh St., Berkeley, CA  94710  USA
{ucbvax,decvax}!mtxinu!ed   +1 415 644 0146

"A man of quality is not threatened by a woman of equality."

hammond@petrus.UUCP (Rich A. Hammond) (10/10/85)

On caches: the hit rates reported are published in the literature; try
ACM's Transactions on Computer Systems for several articles on DEC VAX
performance measurements.  (TOCS isn't that old, so you can check all the issues.)
Another publication would be Computer Architecture News, particularly the
annual Computer Architecture Conference proceedings.  When I get to my office
I'll try to get the exact issues.

Multi-way set-associative caches are implemented on some machines; I suspect
that the performance gain drops off beyond 2- to 4-way associativity.  Keeping
separate user/supervisor caches works better in a multi-user operating system,
since one user doesn't wipe out the cache for everyone, and frequently
occurring system code (like serial-line interrupt handlers, maybe) stays in
cache.  Similarly, split instruction/data caches are simpler to build than a
multi-way set-associative cache but are better than a single cache serving both.

> Ken Shoemaker asks:
> One more thing, doesn't putting all the page tables in dedicated memory:
> 
> 	1) waste lots of high-speed expensive memory
> 	2) limit the virtual memory size of processes
> 
> Maybe this wasn't a problem with 68010s, since they only had 24 address
> bits, but with 68020s, I would think it a problem, especially if
> someone wants to do something as foolish as direct file mapping
> onto optical disks.  Any solutions, or do I not understand the gist
> of how these puppies work?

You don't have to keep the page tables in dedicated memory (I assume you
mean fast static RAM); the DEC VAX and the NS32000 family both keep caches
(translation buffers) of frequently used page table entries and leave the
rest in main memory.  The thing I would hope Motorola does better is to
allow a larger page size (the VAX and NS32000 use 512 bytes/page).  As
dynamic RAM gets less expensive per bit, it makes sense to accept greater
internal fragmentation within pages in return for smaller page tables.
This matters because user processes tend to grow to fill available memory,
so as more memory becomes available the user tries things like direct
mapping of large files.

Rich Hammond, Bellcore	[ihnp4, allegra, ucbvax]!bellcore!hammond