[net.arch] Cache Revisited

wall@fortune.UUCP (Jim Wall) (07/18/85)

    I have a feeling that this topic may once again start the
religious wars of cache types and implementations, but just maybe
I can get the info that I'm looking for.  Given a multiuser
UNIX environment, and given a cache that is split so that user
and supervisor/kernel are separate, what kind of performance
improvement will the cache yield?  More information: the cache is
a simple direct-mapped implementation; as the CPU fetches something
from RAM (I/O is not cached, obviously), it is stored in the cache.
If that physical address is ever used again, the data will come from
the cache.
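
    (To make the mechanism concrete, here is a rough sketch in C of the
kind of direct-mapped lookup described above.  The line size, table size,
and field widths are invented for illustration only, not taken from any
particular board.)

/* Hypothetical 8K direct-mapped cache with 4-byte lines, indexed by
 * physical address.  All sizes are made up for illustration. */
#include <stdint.h>
#include <string.h>

#define LINE_SIZE   4                       /* bytes per cache line */
#define NUM_LINES   (8*1024 / LINE_SIZE)    /* 2048 lines in an 8K cache */

struct line {
    int      valid;
    uint32_t tag;                  /* high-order physical address bits */
    uint8_t  data[LINE_SIZE];
};

static struct line cache[NUM_LINES];

/* Returns 1 on a hit (data copied out of the cache), 0 on a miss
 * (the caller fetches from RAM and then calls cache_fill). */
int cache_lookup(uint32_t paddr, uint8_t *out)
{
    uint32_t index = (paddr / LINE_SIZE) % NUM_LINES;
    uint32_t tag   = paddr / (LINE_SIZE * NUM_LINES);

    if (cache[index].valid && cache[index].tag == tag) {
        memcpy(out, cache[index].data, LINE_SIZE);
        return 1;                  /* data comes from the cache */
    }
    return 0;                      /* go to main memory */
}

void cache_fill(uint32_t paddr, const uint8_t *ram_data)
{
    uint32_t index = (paddr / LINE_SIZE) % NUM_LINES;

    cache[index].valid = 1;
    cache[index].tag   = paddr / (LINE_SIZE * NUM_LINES);
    memcpy(cache[index].data, ram_data, LINE_SIZE);
}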

    So where this will help is in program loops and often-used
routines. The cache is 16K bytes, split into 8K for user and 8K for
supervisor. THE QUESTION IS: in UNIX, how much improvement can the
cache make? Let's face it, you won't be making a 70% - 90% hit rate.

   O.K., an extra credit part to the problem: If the CPU were a 
68020 with the internal 256 byte direct mapped instruction cache, is 
there a need for the external cache?

   						-Jim Wall
						...amd!fortune!wall

P.S. Both Altos and Charles River Data had both the 020 cache and
an external 8K cache, Dual did not. Anyone know why? Not that I'm 
trying to have someone else do my design work, I believe that for the
external cache to be really useful, the system architecture must be
designed for the cache, such as 64 bit wide memory, or block transfers
on cache misses. Does CRD or Altos do any of these?

brad@gcc-bill.ARPA (Brad Parker) (07/22/85)

Jim Wall's question seems like a good one. Unfortunately, it only leads
me to ask another... (please excuse the horrible spelling)

Could someone who has a decent understanding of memory management systems
give me a short discourse on the following?

I'd like to compare and contrast the difference in performance between a
simple single level paged memory manager using a ram (a la Sage 68000) and
a system like the IBM DAT box, where the page tables are stored in main memory
and cached in hardware. The point being that switching context is MUCH
faster if you only need to change the pointer to the page tables, rather than
copy 8K of paging information into the page table ram. It is assumed that
the cache used to speed up the main memory page table accesses is sufficiently
large to get a good hit rate (whatever that may be).

Elaboration: The simple hardware system is a ram which uses as its address
the upper part of the access address, with the lower part of the access
address concatenated to the contents of the ram - additionally you'd
need some flag ram used to mark pages swapped out (no doubt generating
a page fault interrupt).
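
(A rough model in C of that simple scheme, to pin the idea down.  The page
size and table size below are illustrative only, not those of the Sage or
any other particular machine.)

/* Mapping-RAM translation: the upper bits of the logical address index a
 * small RAM whose contents become the upper bits of the physical address;
 * a present flag models the "flag ram" for swapped-out pages. */
#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT 12                        /* 4K pages (assumed) */
#define NUM_PAGES  256                       /* entries in the mapping RAM */

struct map_entry {
    uint32_t frame;      /* physical page frame number */
    int      present;    /* clear if the page is swapped out */
};

static struct map_entry map_ram[NUM_PAGES];  /* reloaded on every context switch */

uint32_t translate(uint32_t laddr)
{
    uint32_t page   = (laddr >> PAGE_SHIFT) % NUM_PAGES;
    uint32_t offset = laddr & ((1u << PAGE_SHIFT) - 1);

    if (!map_ram[page].present) {
        /* in hardware this is where the page-fault interrupt is raised */
        fprintf(stderr, "page fault on page %u\n", page);
        return 0;
    }
    return (map_ram[page].frame << PAGE_SHIFT) | offset;
}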

The main memory system would need a state machine to access the page tables
from main memory and some sort of nifty cache to keep the most recently
accessed translations around - this is more or less similar to the hardware
version except the page tables are in main memory, not in dedicated ram.

The goal is to allow for simple hardware without a huge overhead in context
switching. Any ideas?
-- 

J Bradford Parker
uucp: seismo!harvard!gcc-bill!brad

"She said you know how to spell AUDACIOUSLY? I could tell I was in love...
You want to go to heaven? or would you rather not be saved?" - Lloyd Coal

richardt@orstcs.UUCP (richardt) (07/23/85)

I wouldn't know about UNIX, but a cache of the type you're describing
certainly does speed things up, if you can run it fast enough.  The '020
internal cache is nice, but there are a number of things it has problems
with:  any loop which relies on a lookup table; any loop which uses a
constant which is in data space because it is constant throughout a run but
changes from run to run; and generally any loop which uses fixed tables in
data space.  A larger cache could also handle entire control loops instead
of merely operation loops.  For example, the Forth main interpreter loop
or editor loop aren't very long, but they are just barely long enough to fall
outside of the '020 cache.  With an 8K cache, you could probably
make an interpreter or any other reasonably complex interactive program
run with a very high hit ratio.

You'll note I haven't given any figures.  That's because I haven't gotten
my own design completed or up and running yet.
					orstcs!richardt
"Is there an Assembly Language Programmer in the house?"

herbie@watdcsu.UUCP (Herb Chong [DCS]) (07/25/85)

In article <268@gcc-bill.ARPA> brad@gcc-bill.UUCP (Brad Parker) writes:
>I'd like to compare and contrast the difference in performance between a
>simple single level paged memory manager using a ram (a la Sage 68000) and
>a system like the IBM DAT box, where the page tables are stored in main memory
>and cached in hardware. The point being that switching context is MUCH
>faster if you only need to change the pointer to the page tables, rather than
>copy 8K of paging information into the page table ram. It is assumed that
>the cache used to speed up the main memory page table accesses is sufficiently
>large to get a good hit rate (whatever that may be).

i've been doing a lot of reading lately on storage management at the
kernel (or as IBM prefers to call it, the nucleus) level of 370 and
370-XA machines because i may be working on kernel code for those
machines soon.  anyway, i should point out that there are two sets of
tables used by DAT.  there are segment tables in addition to page
tables.  segments are 1Mbyte and pages are 4Kbytes.
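
as a rough sketch, the address arithmetic that split implies looks like the
code below.  the 1M/4K figures are from above; everything else (field names,
types) is just illustration, and the real segment and page table entries
carry far more state than this.

/* splitting a virtual address for 1M segments and 4K pages */
#include <stdint.h>

struct vaddr_fields {
    uint32_t segment;   /* index into the segment table (1M per segment)   */
    uint32_t page;      /* index into that segment's page table (4K pages) */
    uint32_t offset;    /* byte offset within the page                     */
};

struct vaddr_fields split_address(uint32_t vaddr)
{
    struct vaddr_fields f;

    f.segment = vaddr >> 20;           /* 1M = 2^20 bytes per segment */
    f.page    = (vaddr >> 12) & 0xff;  /* 256 pages of 4K per segment */
    f.offset  = vaddr & 0xfff;         /* 4K = 2^12 bytes per page    */
    return f;
}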

each address space (which can contain many processes but all owned by
the same user) has its own segment table entries which point to page
tables for that user.  all the processes in a single address space
occupy various sections of virtual memory and operate as co-routines so
that only one process can ever be running at one time in an address
space and control is transferred between processes by explicit calls by
the co-routines.  because all processes in an address space share the
same virtual memory, each can see all the others if it wants to, unlike
unix processes which are isolated from each other in terms of storage.

when a context switch is performed by the CPU, the hardware saves away
status in some block of storage and changes a segment table pointer
before loading new status of the next address space to execute.
i believe the actual size of information moved is on the order of
128 bytes, but i'm not completely sure.

the DAT hardware maintains a cache of segment and page table entries
(called the Translation Lookaside Buffer, TLB) which improves overall
performance because all storage references, whether by instruction
fetch or operand access, require information in the segment and page
tables.  the hardware maintains this cache, although there are
instructions provided for manipulating the entries.

the net result is a much more complex CPU and memory manager.  it would
be very interesting to compare a 68000 system to a single chip (or even
dozen chip) implementation of the full 370 hardware.  there is also
provision for prefix control where multiple CPUs can refer to the same
real address, but the memory manager uses the CPU prefix to decide
where the real block of storage is in real memory.  this only happens
for page 0 of memory.  you get the idea.

Herb Chong...

I'm user-friendly -- I don't byte, I nybble....

UUCP:  {decvax|utzoo|ihnp4|allegra|clyde}!watmath!water!watdcsu!herbie
CSNET: herbie%watdcsu@waterloo.csnet
ARPA:  herbie%watdcsu%waterloo.csnet@csnet-relay.arpa
NETNORTH, BITNET, EARN: herbie@watdcs, herbie@watdcsu

mat@amdahl.UUCP (Mike Taylor) (07/25/85)

> Could someone who has a decent understanding of memory management systems
> give me a short discourse on the following?

The fact that I make a comment does not imply any pretensions of
a decent understanding.

> 
> I'd like to compare and contrast the difference in performance between a
> simple single level paged memory manager using a ram (a la Sage 68000) and
> a system like the IBM DAT box, where the page tables are stored in main memory
> and cached in hardware. The point being that switching context is MUCH
> faster if you only need to change the pointer to the page tables, rather than
> copy 8K of paging information into the page table ram. It is assumed that
> the cache used to speed up the main memory page table accesses is sufficiently
> large to get a good hit rate (whatever that may be).
> 
In fact, the context switch in S/370 does not require any massive copies.
A CPU control register contains the address of the segment tables
associated with the current address space. This is called the
Segment Table Origin (STO). A cached list contains some
(implementation-dependent) set of these values, and maps them to a
small number, the STO ID. Translations are cached in a buffer
called the Translation Lookaside Buffer (TLB). Each translation
in the TLB is associated with a particular STO ID, or else is marked as
being common to all address spaces (Common Segment). Therefore,
many translations for the same virtual address may reside in the TLB,
each associated with a different address space by means of the STO ID.
Instructions are provided to selectively or completely invalidate
entries in the TLB.
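
To make the STO ID/TLB interplay concrete, here is a rough software model
of the lookup.  The sizes and field layout are invented for illustration;
the real hardware does the comparison associatively rather than by looping.

/* TLB entries tagged with a STO ID, or marked common to all address spaces */
#include <stdint.h>

#define TLB_ENTRIES 64

struct tlb_entry {
    int      valid;
    int      common;     /* common segment: valid for every address space */
    uint8_t  sto_id;     /* which address space owns this translation */
    uint32_t vpage;      /* virtual page number */
    uint32_t frame;      /* translated real page frame */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Returns 1 and sets *frame on a hit for the current address space;
 * 0 means the segment and page tables in storage must be walked. */
int tlb_lookup(uint8_t cur_sto_id, uint32_t vpage, uint32_t *frame)
{
    int i;

    for (i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpage == vpage &&
            (tlb[i].common || tlb[i].sto_id == cur_sto_id)) {
            *frame = tlb[i].frame;
            return 1;
        }
    }
    return 0;
}

The point about context switching falls out directly: switching address
spaces only changes cur_sto_id, so nothing has to be flushed unless the
invalidation instructions are used.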

The reason for caching the entries relates to the cycle time objectives
for the machine.  If you use the simple hardware, then main storage access
time is factored into the cycle time for address translation.  In our
implementation of S/370, this would mean substituting (say) 200 ns.
main storage for the 7.5 ns. rams used. The difference would add
directly to cycle time (simplistically, at least), which would result
in making the machine run about 9 times slower, ignoring the effects of
TLB misses, which are very closely related to cache misses in our
machine.   The reason for the relation is that we use a virtually
addressed cache and therefore include the TLB information in the cache
tag.  The effects of TLB misses, however, are generally quite small in
high-end systems.

This dramatic difference comes directly from the performance difference
between the cache RAM and main storage, relative to the machine
cycle time (23.25 ns. - 43 MHz.).
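
A back-of-the-envelope reading of those numbers, treating the extra
translation time as adding straight onto every cycle (the simplistic view
mentioned above):

    (23.25 + (200 - 7.5)) / 23.25  =  215.75 / 23.25  =  roughly 9.3

which is where the "about 9 times slower" comes from.
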
-- 
Mike Taylor                        ...!{ihnp4,hplabs,amd,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's.  ]

jvz@loral.UUCP (John Van Zandt) (07/26/85)

In article <5374@fortune.UUCP> wall@fortune.UUCP (Jim Wall) writes:
>
>    I have a feeling that this topic may once again start the
>religious wars of cache types and implementations, but just maybe
>I can get the info that I'm looking for.  Given a multiuser
>UNIX environment, and given a cache that is split so that user
>and supervisor/kernel are separate, what kind of performance
>improvement will the cache yield?  More information: the cache is
>a simple direct-mapped implementation; as the CPU fetches something
>from RAM (I/O is not cached, obviously), it is stored in the cache.
>If that physical address is ever used again, the data will come from
>the cache.
>
>    So where this will help is in program loops and often-used
>routines. The cache is 16K bytes, split into 8K for user and 8K for
>supervisor. THE QUESTION IS: in UNIX, how much improvement can the
>cache make? Let's face it, you won't be making a 70% - 90% hit rate.

Caches are strange creatures; exactly how they are designed can impact
performance significantly.  Your idea of separate caches for supervisor
and user is a good one.  However, you haven't given enough information
to determine the performance improvement.  All that a cache can do is
to improve the memory performance for the system.  The standard formula
for computing the speedup looks at the percentage of cache hits times the
cache speed plus the percentage of cache misses times the memory speed.
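
Spelled out in C with made-up numbers (the figures below are purely
illustrative, not measurements of anything; note that this simple form
charges a miss only the main-memory time, with no extra penalty for
having probed the cache first):

/* effective access time from the standard hit/miss formula */
#include <stdio.h>

double effective_access_ns(double hit_rate, double cache_ns, double memory_ns)
{
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns;
}

int main(void)
{
    /* e.g. a 70% hit rate, 100 ns cache, 400 ns main memory -> 190 ns */
    printf("%.0f ns\n", effective_access_ns(0.70, 100.0, 400.0));
    return 0;
}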

And there are tricks that affect performance, such as the number of cache
sets, whether the cache has a 'dirty' bit, etc.  I haven't checked the
M68000 lately, but if lines coming from the chip can signal whether
an instruction or data fetch is occurring, then another speedup would be
to have separate instruction and data caches.

>
>   O.K., an extra credit part to the problem: If the CPU were a 
>68020 with the internal 256 byte direct mapped instruction cache, is 
>there a need for the external cache?

Again, depends on the hits for the internal cache; but my guess is that the
internal cache does not take into account the different spaces (supervisor
and user), therefore the external cache would give better performance on
context switches.

>P.S. Both Altos and Charles River Data had both the 020 cache and
>an external 8K cache, Dual did not. Anyone know why? Not that I'm 
>trying to have someone else do my design work, I believe that for the
>external cache to be really useful, the system architecture must be
>designed for the cache, such as 64 bit wide memory, or block transfers
>on cache misses. Does CRD or Altos do any of these?

I don't agree with the 'designed for the cache' statement, but I agree that
you can get slightly better performance if you have designed the memory system
to work with a cache... though I suspect the percentage improvement would
be quite small, and only worthwhile in very high performance systems.

   John Van Zandt
   Loral Instrumentation
   (619) 560-5888

   uucp: ucbvax!sdcsvax!jvz
   arpa: jvz@UCSD

P.S. Of course, the above are my opinions alone and not necessarily those
     of my employer.

alan@sun.uucp (Alan Marr, Sun Graphics) (07/28/85)

Is there any architectural reason why Motorola at some future date
could not issue a 6802x with a larger instruction cache?

rbt@sftig.UUCP (R.Thomas) (08/02/85)

> 
> I'd like to compare and contrast the difference in performance between a
> simple single level paged memory manager using a ram (a la Sage 68000) and
> a system like the IBM DAT box, where the page tables are stored in main memory
> and cached in hardware. The point being that switching context is MUCH
> faster if you only need to change the pointer to the page tables, rather than
> copy 8K of paging information into the page table ram. It is assumed that
> the cache used to speed up the main memory page table accesses is sufficiently
> large to get a good hit rate (whatever that may be).
>

You pay to load the page table when you context switch with either system.
In one system, you load the page table registers explicitly, all at once, during
the actual context switch; in the other you pay to load it in a more leisurely
fashion as the DAT box cache faults it in.  The reference to main memory to
load the DAT box cache line costs just as much as the reference to main
memory to load the page table entry.

Mitigating circumstances -- with the DAT box, you only pay to load the ones
you actually use; with the page table in its own register file, you have to
load each register in the file whether you intend to use that page or not.
But if the cache is too small (and it *always* is too small -- there is no
economic incentive to make it too big!), you may have to load each cache
line several times.

If you only care about interrupt response time, then the DAT-box/cache is a
win.  But you take the same throughput hit either way.


Rick Thomas

gnu@sun.uucp (John Gilmore) (08/09/85)

John Van Zandt of Loral Instrumentation (ucbvax!sdcsvax!jvz) said:
> Caches are strange creatures; exactly how they are designed can impact
> performance significantly.  Your idea of separate caches for supervisor
> and user is a good one.
I believe (uninformed opinion) that this makes it perform worse, all
other things being equal.  It means that at any given time, half the cache
is unusable; thus if you spend 90% of your time in user state you
only have a half-size cache.  (Ditto if you are doing system state stuff.)

>                                                  I haven't checked the
> M68000 lately, but if lines coming from the chip can signal whether
> an instruction or data fetch is occurring, then another speedup would be
> to have separate instruction and data caches.
The 68000 signals supervisor/user as well as instruction/data.

Again, this creates an artificial split.  If indeed the CPU is spending
50% of its time on instruction fetches and 50% on data cycles, this
could be OK, but it won't adapt dynamically as the instruction/data
mix changes.

> >                                              If the CPU were a 
> >68020 with the internal 256 byte direct mapped instruction cache, is 
> >there a need for the external cache?
> Again, depends on the hits for the internal cache; but my guess is that the
> internal cache does not take into account the different spaces (supervisor
> and user), therefore the external cache would give better performance on
> context switches.
The internal instruction cache on the 68020 definitely takes supervisor/user
mode into account.  It does NOT take context switches into account, thus it
must be flushed on context switch.  I believe the hit rate on the icache
is something like 50%.  The reason for external caches is probably
to speed up data accesses, which otherwise would go at main memory speeds.

> >P.S. Both Altos and Charles River Data had both the 020 cache and
> >an external 8K cache, Dual did not. Anyone know why?
Dual probably wanted to build a cheaper system.  Depending on the hit
rate and timings, a cache system may LOSE performance because it takes
longer to do a cache miss in a cached system than it takes to do a
memory access in a noncache system.  Remember, performance with a cache
= (hitrate*hitspeed) + ((1-hitrate)*missspeed).  What you buy in
hitspeed may not reclaim all that you lose in missspeed.
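
To put invented numbers on that last point: suppose an uncached memory
access takes 400 ns, a hit takes 100 ns, and a miss takes 500 ns (the
extra 100 ns being the time spent probing the cache before going to
memory anyway).  Then

    0.25 * 100 + 0.75 * 500  =  400

so the cached system only breaks even at a 25% hit rate; below that it
is slower than the plain one.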

wall@fortune.UUCP (Jim Wall) (08/13/85)

    Someone in replying to my original article on cache said that
the hit rate on the internal cache in the 68020 is about 50%. 
Anyone care to agree with that?  Anyone care to tell me what 
reasonable application or operating system spends 50% of its time
in loops that are smaller than 256 bytes??

    The numbers that are claimed for the hit rates on caches are 
nothing short of incredible. I think the CPU manufacturers are the
instigators, and nobody bothers to question them. 

    But, hey, I could be wrong. It's happened before. So let's hear
it. Anyone who claims high cache hit rates on normal applications,
let's hear the justification for them.

						-Jim Wall
					...amd!fortune!wall

blarson@oberon.UUCP (Bob Larson) (08/14/85)

In article <5459@fortune.UUCP> wall@fortune.UUCP (Jim wall) writes:
>    The numbers that are claimed for the hit rates on caches are 
>nothing short of incredible. I think the CPU manufacturers are the
>instigators, and nobody bothers to question them. 
>
I know of one computer manufacturer that has one set of numbers for
cache hit rate given out by the salespeople and another by the
technical people.  (95% vs 98%)  The manuals list the higher figure.

They also just got around to revising all the manuals to say that a word
is 32 bits rather than 16.  Since the basic addressing unit of the machine
is 16 bits, this makes talking about the machine awkward.  (The basic 
addressing unit of a VAX is 8 bits.)

Bob Larson
Arpa: Blarson@Usc-Ecl.Arpa
Uucp: {ihnp4,hplabs,...}!sdcrdcf!uscvax!oberon!blarson

john@frog.UUCP (John Woods) (08/15/85)

>Someone...said that the hit rate on the internal cache in the 68020 is about
> 50%. Anyone care to agree with that?  Anyone care to tell me what 
> reasonable application or operating system spends 50% of its time
> in loops that are smaller than 256 bytes??
> ... 
> But, hey, I could be wrong. It's happened before. So let's hear it. Anyone
> who claims high cache hit rates on normal applications, let's hear the
> justification for them.
> 
We recently measured the cache-hit performance of our prototype 68020 board
on a number of programs, some useful, some benchmark type programs (including
the good old Knight's tour).  Results were:

	The '20 I-cache had a hit rate of between 30% and 58%.  Our 8Kb
external cache had a hit rate of 70-83%; between them (when both turned on),
they had a hit rate of 76-89%.  The 68000 board (our current product, and
from which the current 68020 board was devised [roughly]) had a cache hit
rate (only an 8Kb external cache, of course) of between 86% and 93% on the
same programs.

	And indeed, we found that few reasonable loops are small enough to
fit into the I-cache (especially since the Greenhills C compiler tries to
be really clever about loop unrolling and re-ordering of code).


--
John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw%mit-ccc@MIT-XX.ARPA

davet@oakhill.UUCP (Dave Trissel) (08/16/85)

In article <5459@fortune.UUCP> wall@fortune.UUCP (Jim wall) writes:
>
>    Someone in replying to my original article on cache said that
>the hit rate on the internal cache in the 68020 is about 50%. 
>Anyone care to agree with that?  Anyone care to tell me what 
>reasonable application or operating system spends 50% of its time
>in loops that are smaller than 256 bytes??
>

The problem is that cache hit values are so variable that it really doesn't
make sense to talk about an average figure.

The lowest I've seen for the '020 is a range of 10 to 15 percent, which was
taken from a monitoring of the Unix operating system.  (Sorry, I don't remember
which version.)  One would suspect that operating systems would be among the
worst performers.  On the other hand, we have lots of reports ranging from
30 to 65 percent for measured applications.

Yet another problem in measuring cache hits specifically on the '020 is the
fact that since the chip always does a 32-bit longword instruction fetch from
the bus, if only the first word is needed immediately (e.g. it finishes an
earlier instruction or is itself a 16-bit instruction) then the other word is
treated as a cache hit when it is later used.  This tends to boost cache hit
rate values depending on just what you define a cache hit to be.

BTW, the 10 to 15 percent cache hit rate is nothing to sneeze at when you look
at real performance gain.  Take a hit rate of 10 percent.  That 10 percent
amounts to a much higher realized performance improvement when you consider
that the reads which hit the cache would have taken 1.5 to 2 times longer on
the external bus, and that the resulting 15 to 20 percent drop in bus activity
can then be used for simultaneous data reads and writes by the processor.  So
your real improvement could be anywhere up to 30 percent.
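
(Presumably the arithmetic in the middle there is: 10 percent of fetches,
each of which would have tied up the bus for 1.5 to 2 access times, works
out to roughly 0.10 * 1.5 = 15% to 0.10 * 2 = 20% of the bus traffic
avoided, and that freed bandwidth is what becomes available for the data
reads and writes.)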

 --  Dave Trissel
     Motorola Semiconductor
     Austin, Texas           {ihnp4,seismo}!ut-sally!oakhill!davet

chris@umcp-cs.UUCP (Chris Torek) (08/16/85)

>And indeed, we found that few reasonable loops are small enough to
>fit into the I-cache (especially since the Greenhills C compiler tries to
>be really clever about loop unrolling and re-ordering of code).

Oh no!  A pessimizing compiler!  :-)

(Of course the unrolled code might still run faster; I just thought
this was a neat example of optimization backfiring.)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 4251)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@maryland

hes@ecsvax.UUCP (Henry Schaffer) (08/19/85)

>     Someone in replying to my original article on cache said that
> the hit rate on the internal cache in the 68020 is about 50%. 
> Anyone care to agree with that?  Anyone care to tell me what 
> reasonable application or operating system spends 50% of its time
> in loops that are smaller than 256 bytes??
The hit rate depends on the total cache size and on the amount loaded
for each miss, as well as program and data locality.  Tight loops do
give programs a high hit rate, and so can other things like character
I/O.  Vectors and matrices are common in scientific computation, and
systematic processing of these data structures also tends to give data
locality and so produces a high hit rate.  (This is why scientific
programmers are usually concerned whether matrices are stored row- or
column-wise.  Cache performance, as well as address calculation are
assisted by accessing sequential addresses.)  On mainframes with large
caches a hit rate of >80% can be achieved.  This is often verified
using a hardware monitor, not by trusting the vendor's opinion.
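
A standard illustration of the row/column point in C, where arrays are
stored row-wise (the array size is arbitrary): the first loop below walks
memory sequentially, the second strides by a whole row on every access.

#include <stdio.h>

#define N 512

static double a[N][N];

double sum_row_order(void)        /* sequential addresses: good locality */
{
    double s = 0.0;
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            s += a[i][j];
    return s;
}

double sum_column_order(void)     /* strides of N doubles: poor locality */
{
    double s = 0.0;
    int i, j;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            s += a[i][j];
    return s;
}

int main(void)
{
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            a[i][j] = 1.0;
    printf("%g %g\n", sum_row_order(), sum_column_order());
    return 0;
}
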
--henry schaffer  n c state univ

edhall@randvax.UUCP (Ed Hall) (08/21/85)

Concerning the 68020's cache:

I can think of a lot of places where a loop would fit in a 256-byte
cache, especially in string-processing applications.  Remember, in many
applications a lot of time is spent simply copying memory, making
searches, and so forth.  This isn't just limited to strings: matrix
operations usually include small inner loops where the bulk of computer
time is spent.  The same is true of bit-map graphics.  And it is true
for a lot of other CPU-hungry applications.
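
For instance, a byte-copy inner loop like the one below compiles to only a
handful of instructions -- far smaller than 256 bytes -- yet can account
for a large share of the run time in the applications just mentioned.
(The function is illustrative, not anyone's measured workload.)

#include <stddef.h>

void copy_bytes(char *dst, const char *src, size_t n)
{
    while (n-- > 0)
        *dst++ = *src++;
}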

So something close to a 50% hit rate wouldn't surprise me for a fairly
large class of programs, though there is probably a larger class of
programs that wouldn't do nearly that well.   If Motorola were claiming
it as an *average* I'd wonder who they thought they were fooling, but
I don't believe they are doing so.

		-Ed Hall
		decvax!randvax!edhall

mash@mips.UUCP (John Mashey) (08/21/85)

This is a response to a question from huguet@LOCUS.UCLA.EDU [sorry, mail
kept bouncing] about an earlier assertion of mine:

> f) Use of optimizing compilers that put things in registers, often driving
> the hit rate down [yes, down], although the speed is improved and there are
> fewer total memory references.

I don't know of any published numbers to back this up.  The effect has
been seen in [unpublished] simulations; might be a good topic for research.
It does make sense, at least for data cache (instruction cache effects may
vary wildly).  The better an optimizer is, the more likely it is to put
frequently-used variables in registers, thus reducing the number of
references that are likely to be cache hits.  Consider the ultimate
case: a smart compiler and a machine with many registers, such that
most code sequences fetch a variable just once, so that most data references
are cache misses.  Passing arguments in registers also drives the hit rate down.
-- 
-john mashey
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	415-960-1200
USPS: 	MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043

kds@intelca.UUCP (Ken Shoemaker) (08/29/85)

> BTW, the 10 to 15 percent cache hit rate is nothing to sneeze at when you look
> at real performance gain.  Take a hit rate of 10 percent.  That 10 percent

This assumes, of course, that there is no miss penalty...
-- 
...and I'm sure it wouldn't interest anybody outside of a small circle
of friends...

Ken Shoemaker, Microprocessor Design for a large, Silicon Valley firm

{pur-ee,hplabs,amd,scgvaxd,dual,qantel}!intelca!kds
	
---the above views are personal.  They may not represent those of the
	employer of its submitter.

ed@mtxinu.UUCP (Ed Gould) (08/31/85)

In article <170@mips.UUCP> mash@mips.UUCP (John Mashey) writes:

>                                                 Consider the ultimate
>case: a smart compiler and a machine with many registers, such that
>most code sequences fetch a variable just once, so that most data references
>are cache misses.  Passing arguments in registers also drives the hit
>rate down.

With this ultimate machine/compiler combination it seems intuitively
that a data cache would then be a *bad* idea, since having a cache
can't be faster than an uncached memory reference (for what would
be a miss) and is often slower.  We can then use the real estate saved
for even more registers!

-- 
Ed Gould                    mt Xinu, 2910 Seventh St., Berkeley, CA  94710  USA
{ucbvax,decvax}!mtxinu!ed   +1 415 644 0146

"A man of quality is not threatened by a woman of equality."

csg@pyramid.UUCP (Carl S. Gutekunst) (09/06/85)

>>Consider the ultimate
>>case: a smart compiler and a machine with many registers, such that
>>most code sequences fetch a variable just once, so that most data references
>>are cache misses.  Passing arguments in registers also drives the hit
>>rate down.
>
>With this ultimate machine/compiler combination it seems intuitively
>that a data cache would then be a *bad* idea, since having a cache
>can't be faster than an uncached memory reference (for what would
>be a miss) and is often slower.  We can then use the real estate saved
>for even more registers!

In fact, a data cache greatly improves throughput on real large-register-set
machines, like the Pyramid. Many operations in program development (e.g.
compiling) require repetitive searching/sorting on moderately large arrays;
the data cache can help out here a lot. It also helps when you have to save
all those registers during a context switch. 

What we really need is a machine with 64K 32-bit registers. :-{)
-- 
      -m-------   Carl S. Gutekunst, Software R&D, Pyramid Technology
    ---mmm-----   P.O. Box 7295, Mountain View, CA 94039   415/965-7200
  -----mmmmm---   UUCP: {allegra,decwrl,nsc,shasta,sun,topaz!pyrnj}!pyramid!csg
-------mmmmmmm-   ARPA: pyramid!csg@sri-unix.ARPA

davet@oakhill.UUCP (Dave Trissel) (09/06/85)

In article <48@intelca.UUCP> kds@intelca.UUCP (Ken Shoemaker) writes:

>> BTW, the 10 to 15 percent cache hit rate is nothing to sneeze at when you look
>> at real performance gain.  Take a hit rate of 10 percent.  That 10 percent
>
>This assumes, of course, that there is no miss penalty...

There is no penalty since the instruction address is sent both to the cache
and the bus control unit at the same time.  A hit causes the bus controller
to avoid the bus cycle, and as I understand it, without any loss of external
bus efficiency if another operation is queued.

  --  Dave Trissel
      Motorola Semiconductor Inc.  {ihnp4,seismo}!ut-sally!oakhill!davet
      Austin, Texas

jer@peora.UUCP (J. Eric Roskos) (09/09/85)

> With this ultimate machine/compiler combination it seems intuitively
> that a data cache would then be a *bad* idea, since having a cache
> can't be faster than an uncached memory reference ...

See, what you have here is a matter of working at a problem from two different
ends.  The example of the compiler that used a large set of registers and
thus produced mostly cache misses is really a compiler that knows about
cache and manages it itself.  The registers are just a cache that's closest
to the processor.

On the other hand, most caches are intended to improve performance on the
assumption that the compilers aren't going to do much of that themselves,
either because they can't, given the machine's architecture, or because the
same program is to run on a number of machines with different memory-hierarchy
organizations -- i.e., the compiler that generated good code for a machine
with 250 registers might not work so well (or at all) on a machine with
16 registers; but for cost reasons it might be desirable for a manufacturer
to produce a machine with cache and a machine without it, whereas he has
to keep the 250 registers no matter what, if there's code out there that
uses all of them.
-- 
Shyy-Anzr:  J. Eric Roskos
UUCP:       ..!{decvax,ucbvax,ihnp4}!vax135!petsd!peora!jer
US Mail:    MS 795; Perkin-Elmer SDC;
	    2486 Sand Lake Road, Orlando, FL 32809-7642

franka@mmintl.UUCP (Frank Adams) (09/10/85)

In article <455@mtxinu.UUCP> ed@mtxinu.UUCP (Ed Gould) writes:
>In article <170@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
>
>>                                                 Consider the ultimate
>>case: a smart compiler and a machine with many registers, such that
>>most code sequences fetch a variable just once, so that most data references
>>are cache misses.  Passing arguments in registers also drives the hit
>>rate down.
>
>With this ultimate machine/compiler combination it seems intuitively
>that a data cache would then be a *bad* idea, since having a cache
>can't be faster than an uncached memory reference (for what would
>be a miss) and is often slower.  We can then use the real estate saved
>for even more registers!

Not necessarily.  When a context switch takes place, you would have to
save all the registers.  The cache can, at worst (best?) be dumped.

patc@tekcrl.UUCP (Pat Caudill) (09/16/85)

>In article <170@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
>
>>                                                 Consider the ultimate
>>case: a smart compiler and a machine with many registers, such that
>>most code sequences fetch a variable just once, so that most data references
>>are cache misses.  Passing arguments in registers also drives the hit
>>rate down.

	If you have read the article on the IBM 801 project this was just
what they did. The cache was the register set (which was medium large -
32 registers). But there was a very very smart compiler which optimized
register usage even across subroutine calls. Go look at the article; it
was published in SIGPLAN proceedings several years ago. (It was by the compiler
writer)
			Pat Caudill

boston@celerity.UUCP (Boston Office) (09/18/85)

In article <645@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>
>In article <455@mtxinu.UUCP> ed@mtxinu.UUCP (Ed Gould) writes:
>>In article <170@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
>>
>>>                                                 Consider the ultimate
>>>case: a smart compiler and a machine with many registers, such that
>>>most code sequences fetch a variable just once, so that most data references
>>>are cache misses.  Passing arguments in registers also drives the hit
>>>rate down.
>>
>>With this ultimate machine/compiler combination it seems intuitively
>>that a data cache would then be a *bad* idea, since having a cache
>>can't be faster than an uncached memory reference (for what would
>>be a miss) and is often slower.  We can then use the real estate saved
>>for even more registers!
>
>Not necessarily.  When a context switch takes place, you would have to
>save all the registers.  The cache can, at worst (best?) be dumped.

However: if you have enough registers, and manage them in banks,
they only need be saved if you run out of BANKS of registers.

mash@mips.UUCP (John Mashey) (09/20/85)

Pat Caudill writes:
> 	If you have read the article on the IBM 801 project this was just
> what they did. The cache was the register set (which was medium large -
> 32 registers). But there was a very very smart compiler which optimized
> register usage even across subroutine calls. Go look at the article; it
> was published in SIGPLAN proceedings several years ago.
The referenced reference is:
M. Auslander, M. Hopkins, "An Overview of the PL.8 Compiler", Proc. SIGPLAN
Symp. Compiler Construction, ACM, Boston, June 1982, 22-31.
A useful related article is:
F. Chow, J. L. Hennessy, "Register Allocation by Priority-Based Coloring",
Proc. SIGPLAN Symp. Compiler Construction, ACM, Montreal, June  1984, 222-232.
-- 
-john mashey
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	415-960-1200
USPS: 	MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043