[comp.arch] A dumb idea for making data caches more useful

david@daisy.UUCP (David Schachter) (02/25/89)

The claim has been made recently that I caches are good but
D caches are hard to use, because there is little locality
of reference in data access.

Um, how about constructing the D cache such that a program
can give it a hint?  "Hey, D cache, I intend to access data
starting from location x, w bytes at a time, stride factor
y, for z times."

The D cache could then intelligently look ahead.  How about
it?

Also, is there any use in splitting your D cache into, say,
supervisor/user areas?  I doubt it...

		-- David "Oh yeah?  Yeah.  Oh." Schachter


                       / ...!ucbvax!imagen!atari--\
david@daisy.uucp  OR  +  ...!uunet-----------------!daisy!david
                       \ ...!pyramid--------------/

w-colinp@microsoft.UUCP (Colin Plumb) (02/27/89)

david@daisy.UUCP (David Schachter) wrote:
> Um, how about constructing the D cache such that a program
> can give it a hint?  "Hey, D cache, I intend to access data
> starting from location x, w bytes at a time, stride factor
> y, for z times."
> 
> The D cache could then intelligently look ahead.  How about
> it?

See: "The WM Computer Architecture," Wm. A. Wulf, Computer Architecture News
(SIGARCH newsletter) Vol. 16, No. 1, March 1988.

From what I've heard of the CDC 6600, it uses a similar idea.  To load or
store, an instruction computes the address, and an access to r0 uses or
supplies the data.  For a load, you must specify an address before reading
r0, but for a store, you can do things in either order.  FIFOs actually
allow you to get a few words ahead on the address or data side.  I suppose
that filling one FIFO while the other is empty causes a trap.

The nifty thing is the streaming instructions, which allow you to specify
a series of loads or stores as a (base, stride, number) triple.  In the
paper, two registers are provided for this purpose, but the concept can
apply to any number.  I think it's a great idea for a vector pipe, given
the memory latency/CPU speed ratios we have these days.  And it doesn't
constrain the processor to produce or eat values at any fixed rate.

In general, it's a fun paper.  I disagree with some of the ideas (there's
no provision for things like division, which are popular but take multiple
cycles), but there are a lot of good ones in there.  He has two ALUs, and
instructions are of the form dest = src3 op (src2 op src1).  Any comments
on that one?
-- 
	-Colin (uunet!microsoft!w-colinp)

"Don't listen to me.  I never do."

jesup@cbmvax.UUCP (Randell Jesup) (02/28/89)

In article <2756@daisy.UUCP> david@daisy.UUCP (David Schachter) writes:
>Um, how about constructing the D cache such that a program
>can give it a hint?  "Hey, D cache, I intend to access data
>starting from location x, w bytes at a time, stride factor
>y, for z times."
>
>The D cache could then intelligently look ahead.  How about
>it?

	Hit the nail #1 on the head.  Who's up for nail #2?
(Actually, there's no need to tell it how many times you're going to look;
just tell it where to start, the offset between accesses, and maybe how many
bytes are accessed at each spot.  Note that there are other optimizations you
can do IF the processor lets you get at some useful bits of internal state.)

-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup