[comp.arch] Smart I-cache? And separate R/W D-cache

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (11/01/90)

  In the discussion of relative cache sizes, I remembered something
which may have been discussed here several years ago. Perhaps someone
can point me to any recent developments.

  The feature is intelligent I-cache, which only stores instructions
which are the target of jumps. The basis of this is that pipelines make
cache less effective for long inline runs of code. In fact, in most
cases there is nothing to be gained by caching these instructions,
unless the memory bandwidth is really overloaded, such as by many DMA
devices or multiple CPUs.

  The implementation was to OR the 'pipeline empty' signal (or a wait
signal, or something similar) with the 'not in cache' signal, and use
the result to decide whether the fetched instructions should be cached.

  This is particularly a gain when the size of a loop or nested loops is
larger than the cache, when procedures are being called, etc. It can
eliminate the memory latency on calls to common procedures, and make
large loops run as if they were completely in cache.

  If anyone has any info on recent work (if any) I'd like to hear it. If
there are any good papers I should look up I'd like to see them, too.
Obviously this must either be harder to do than I think, or provide less
benefit, or everyone would be doing it.

  I can make the same argument about separate caches for data being
written and read: a large cache for locations being read plus a small
write-back cache for data being written would seem (without analysis,
which I haven't done) to offer serious reductions in the number of
times the CPU waits for memory. Feel free to correct me; this just came
to me while thinking about the smart I-cache.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
      The Twin Peaks Halloween costume: stark naked in a body bag

jesup@cbmvax.commodore.com (Randell Jesup) (11/21/90)

In article <2823@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>  The feature is intelligent I-cache, which only stores instructions
>which are the target of jumps. The basis of this is that pipelines make
>cache less effective for long inline runs of code. In fact, in most
>cases there is nothing to be gained by caching these instructions,
>unless the memory bandwidth is really overloaded, such as by many DMA
>devices or multiple CPUs.

	This is known as a "Branch Target Cache" or BTC.  Both the RPM-40
and the AMD 29000 have BTCs (you can probably dig up some of the RPM-40
design team around CR&D - Try Dave Nagy, Janet Moseley, or Dave McGonagle
(who has a company at the RPI incubator center nowadays)).

	If you can feed instructions into the CPU from memory (or second-
level cache) at full speed, then all you need to cover is the branch
latency to restart the pipeline from memory, and a BTC covers this nicely.

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.commodore.com  BIX: rjesup  
Thus spake the Master Ninjei: "If your application does not run correctly,
do not blame the operating system."  (From "The Zen of Programming")  ;-)