[comp.arch] Cache Line Fills -- Critical Word First

mmm@cup.portal.com (Mark Robert Thorson) (09/26/90)

When some caches miss, they just go out to memory and snort up a line in
sequential order.  I believe the MIPS R6000 is an example of this.

But with most designs, such as the 486, SPARC, 68040, etc., the first
item to be returned to the CPU is the item which caused the miss.  This
is called "critical word first".

One difference between the 486 and everybody else is that the 486 will
always read the other word of an aligned 2-word block before reading the
two words needed to complete the cache line.

For example, if the miss is on address xxx1, the 486 will fill the cache line
in the order 1-0-3-2.  On SPARC or the 68040, the read continues to the end
of the block and wraps around, i.e. 1-2-3-0.

Why did Intel do it this way?  To optimize the performance of 64-bit
memory systems.  Using the conventional order, the CPU might read one word
from one 64-bit block, read two words from a second block, then hop back to the
first block for another word.  Intel's method always reads one block before
reading the other.
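
To make the two orders concrete, here's a quick C sketch (my own
illustration, not from any databook).  The 486 order works out to the
miss address XORed with the beat number; the conventional order is
addition modulo the line size.

    #include <stdio.h>

    #define WORDS 4   /* words per cache line */

    /* 486-style burst: miss word XOR beat number */
    static int intel_word(int miss, int beat) { return miss ^ beat; }

    /* conventional wrap-around: miss word plus beat, mod line size */
    static int wrap_word(int miss, int beat) { return (miss + beat) % WORDS; }

    int main(void)
    {
        for (int miss = 0; miss < WORDS; miss++) {
            printf("miss on word %d:  486 order ", miss);
            for (int b = 0; b < WORDS; b++)
                printf("%d", intel_word(miss, b));
            printf("   wrap order ");
            for (int b = 0; b < WORDS; b++)
                printf("%d", wrap_word(miss, b));
            printf("\n");
        }
        return 0;
    }

Because XOR never carries between bits, the 486 sequence always finishes
one aligned 2-word half before touching the other -- which is the whole
point.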

Does this little design tweak pay off?  Just recently, Motorola has provided
proof!  They've introduced two static RAM's with on-chip burst control logic
optimized for use as cache memories.  The MCM62940 is compatible with
conventional burst ordering.  The MCM62486 uses Intel-style burst ordering.
The MCM62940 is offered in the following speed grades: 15, 20, and 25 ns.
(Max. access time.)  The MCM62486 is offered in:  14, 19, and 24 ns.  Intel
wins by a nanosecond!

ram@shukra.Eng.Sun.COM (Renu Raman) (09/27/90)

In article <34275@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes:
>sequential order.  I believe the MIPS R6000 is an example of this.
>in the order 1-0-3-2.  On SPARC or the 68040, the read continues to the end
>of the block and wraps around, i.e. 1-2-3-0.

On a SPARC!!! Let's be specific, since there are various SPARC
processors and different systems built out of those processors.

To date, there aren't any systems built with SPARC processors whose
caches do wrap-around fill, that I know of...

mmm@cup.portal.com (Mark Robert Thorson) (09/28/90)

ram@shukra.Eng.Sun.COM (Renu Raman) says:

> On a SPARC!!! Let's be specific, since there are various SPARC
> processors and different systems built out of those processors.
> 
> To date, there aren't any systems built with SPARC processors whose
> caches do wrap-around fill, that I know of...

In the specification of the SBus (the bus on the SPARCstation 1), the order for
a burst transfer is to read to the end of the aligned block, then wraparound
to the beginning.  I don't have the MBus spec -- perhaps you could tell
us whether that requires the same burst order.

If the architecture of SPARC or its standard buses require this burst order,
then it is probably accurate to say that all present and future SPARC
implementations are non-optimal with regard to their burst order -- a problem
solved successfully in the Intel 486  :-)

This is symptomatic of the area in which Sun is weakest -- lack of attention
to details.  I could say some choice words about the %&*#@ optical mouse
on the Sun, but that would be getting off the topic ....

aglew@crhc.uiuc.edu (Andy Glew) (09/29/90)

..> Critical word first, wrap around fill (1230) and other orders (1032)

Trying to generalize:

Perhaps a bus specification should not include the size of the
wrap-around region; rather, the control for the burst should specify
which address bits get complemented at which cycle of the burst?
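
Something like this, say (just a sketch of the idea, not any real bus
spec): the burst controller carries a table of XOR masks, one per cycle,
applied to the miss address.

    #include <stdio.h>

    /* a burst described by per-cycle XOR masks on the miss address;
       masks {0,1,2,3} reproduce the 486's order */
    static void burst(int miss, const int *mask, int beats)
    {
        for (int b = 0; b < beats; b++)
            printf("%d ", miss ^ mask[b]);
        printf("\n");
    }

    int main(void)
    {
        const int i486_masks[4] = {0, 1, 2, 3};
        burst(1, i486_masks, 4);   /* prints: 1 0 3 2 */
        return 0;
    }

One catch: true wrap-around (1-2-3-0) can't be expressed with
start-independent masks, because addition carries between bits, so a
wrapping burst still needs an incrementer on the low address bits.
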
--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]

guy@auspex.auspex.com (Guy Harris) (10/02/90)

>In the specification of the SBus (the bus on the SPARCstation 1), the order for
>a burst transfer is to read to the end of the aligned block, then wraparound
>to the beginning.  I don't have the MBus spec -- perhaps you could tell
>us whether that requires the same burst order.

I asked somebody here, and he said:

  I quote from the Mbus Spec:

  "Mbus supports a feature called wrapping. During the address phase, MAD[2:0]
  is a don't care, and MAD[35:3] defines the 32 bytes to be returned. As the
  data is returned, the MAD[4:3] is incremented by the memory controller,
  producing a wrap-around effect."
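
If I read that right, the controller just increments the doubleword
index mod 4 within the aligned 32-byte block -- something like this
sketch (my interpretation of the quote, with a hypothetical miss
address):

    #include <stdio.h>

    /* one beat of an MBus-style wrapping burst: addr is the requested
       doubleword address; MAD[4:3] is incremented mod 4, so the four
       8-byte beats wrap within the aligned 32-byte block */
    static unsigned long beat_addr(unsigned long addr, int beat)
    {
        unsigned long block = addr & ~0x1fUL;     /* 32-byte block */
        unsigned long dw    = (addr >> 3) & 0x3;  /* MAD[4:3] */
        return block | (((dw + beat) & 0x3) << 3);
    }

    int main(void)
    {
        /* miss on doubleword 1: beats 0x1008 0x1010 0x1018 0x1000 */
        for (int b = 0; b < 4; b++)
            printf("beat %d: 0x%lx\n", b, beat_addr(0x1008UL, b));
        return 0;
    }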

jesup@cbmvax.commodore.com (Randell Jesup) (10/02/90)

In article <34275@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes:
>For example, if the miss is on address xxx1, the 486 will fill the cache line
>in the order 1-0-3-2.  On SPARC or the 68040, the read continues to the end
>of the block and wraps around, i.e. 1-2-3-0.

>Does this little design tweak pay off?  Just recently, Motorola has provided
>proof!  They've introduced two static RAM's with on-chip burst control logic
>optimized for use as cache memories.  The MCM62940 is compatible with
>conventional burst ordering.  The MCM62486 uses Intel-style burst ordering.
>The MCM62940 is offered in the following speed grades: 15, 20, and 25 ns.
>(Max. access time.)  The MCM62486 is offered in:  14, 19, and 24 ns.  Intel
>wins by a nanosecond!

	However, does having to wait until the last word of the burst for the
next sequential word (#2) cost you anything?  For instructions at least, there
is a definite tendency towards consecutive access (and even somewhat in data,
when operating on blocks such as strings or copies).

-- 
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup  
Common phrase heard at Amiga Devcon '89: "It's in there!"

mash@mips.COM (John Mashey) (10/02/90)

In article <14780@cbmvax.commodore.com> jesup@cbmvax.commodore.com (Randell Jesup) writes:
>In article <34275@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes:
>>For example, if the miss is on address xxx1, the 486 will fill the cache line
>>in the order 1-0-3-2.  On SPARC or the 68040, the read continues to the end
>>of the block and wraps around, i.e. 1-2-3-0.
>
>>Does this little design tweak pay off?  Just recently, Motorola has provided
>>proof!  They've introduced two static RAM's with on-chip burst control logic
>>optimized for use as cache memories.  The MCM62940 is compatible with
>>conventional burst ordering.  The MCM62486 uses Intel-style burst ordering.
>>The MCM62940 is offered in the following speed grades: 15, 20, and 25 ns.
>>(Max. access time.)  The MCM62486 is offered in:  14, 19, and 24 ns.  Intel
>>wins by a nanosecond!
>
>	However, does having to wait for the last word of the burst for the
>next entry (#2) cost you anything?  For instructions at least, there is
>a definite tendency towards consecutive access (and even somewhat in data,
>when operating on blocks such as strings or copies).

This is discussed in Patterson & Hennessy, near end of Chapter 8.
They note it doesn't gain as much as you'd think.

How much this is worth depends a lot on the following attributes, at least:
a) I-cache versus D-cache
b) the pipeline structure
	instruction latencies
	issue parallelism
c) the memory system
	cache structure & buffering
	main memory latency & number of pipes to it

For the I-cache, when you have a cache miss:
	I1) You can stall the machine, fetch the entire cache block,
	then restart.  This is clearly the simplest.
	I2) You can do "early restart", where you begin executing as soon
	as the requested word is available.  This is sometimes called
	"Instruction Streaming" (in the MIPS R3000), i.e.,  when you
	cache miss:
		start fetching at word 0 of the block
		stall until the needed word is fetched, then stream
		if you branch elsewhere before the end of the block,
		stop streaming, stall the pipeline until block filled
		also, if a load/store causes a cache miss, complete
		the I-cache refill, then handle the D-cache miss
	I3) You can do "out-of-order fetch" in addition to early restart,
	and then do "wrapped fetch", so that you wrap-around to complete.

Cache miss penalty = access time + transfer time.
In all of these cases, the cache miss penalty is the same, but I2 and I3
try to "cover" some of the transfer time by converting stalls into
issues. (It's hard to do much about access time for I-cache misses :-)
I2 and I3 are the same, except that I3 gets to issue
instructions earlier.  This doesn't do much for short-latency operations,
but can help those with longer latency, especially if there are multiple
independent units.  In practice, this means that:
	- The approach probably helps floating point codes more than integer
	- It helps pipelined-FP units more than low-latency scalar ones
	- It can REALLY help the class of program that consists of
	a giant loop filled with sequential code larger than the cache
	size.  (Now you see the real reason why such features are included,
	as this class of program happens to be very important to chip designers,
	and they take care of themselves :-)
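
As a back-of-the-envelope model (mine, not MIPS simulation data): with
access time A cycles, M words per block, one word transferred per cycle,
a miss on word n, and straight-line code consuming one word per cycle,
the best-case stalls come out as:

    #include <stdio.h>

    /* best-case stall cycles per scheme; see assumptions above */
    static int i1(int A, int M, int n) { (void)n; return A + M;          }
    static int i2(int A, int M, int n) { (void)M; return A + n + 1;      }
    static int i3(int A, int M, int n) { (void)M; (void)n; return A + 1; }

    int main(void)
    {
        int A = 4, M = 4;
        for (int n = 0; n < M; n++)
            printf("miss on word %d: I1=%d  I2=%d  I3=%d\n",
                   n, i1(A, M, n), i2(A, M, n), i3(A, M, n));
        return 0;
    }

Note that I2 and I3 coincide exactly when the miss is on word 0, which
matters given the next point.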

For the I-cache, it is known that the first word of a cache block accounts
for an unusually high share of the misses, due to fall-thru. I think this
is true across a wide range of programs.  Hence, there is even less
difference between I2 and I3 than you'd expect, since they behave
identically in this case.  I can't recall the numbers, but I think the
simulations done when the R3000 was being designed showed relatively little
difference, but note that the R3000's FP operations are fairly low
latency also, hence would be helped less than some others designed then.

For the D-cache, it's more of the same, except with more options,
and more complexity, as there are many more options for D-cache
implementation.  Here are a few of the relevant items:

- Unlike the I-cache, where word 0 of a block is more likely to cause a
cache miss, and this is true of almost every program, the effect is much more
program-dependent for the D-cache.  Some programs make many sequential
references likely to cause cache misses; others, much less so.

- Unlike the I-cache, where successive instructions continue accessing
the same block, loads and stores may well access different blocks,
and this implies extra complexity or buffering.  For example, if you have:
	load	r1,a
	load	r2,a+4
	load	r3,a+8
	etc
then using read-data streaming works really well, as does:
	load	r1,a
	load	r2,a+4
	div	r1,r3
	etc
where you get to start the divide early.

On the other hand,
	load	r1,a
	load	r2,b		b in different cache block
	store	r1,c		c in third block
gets awkward.  Although a load of a+4 is probably referencing an
internal cache-line buffer, the load of b is referencing the cache,
and would do so at the same time as the cache is being refilled.
Should the refill have priority, or the load of b?
Some of your choices depend on how many memory pipes you have.
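
To see why it gets awkward, here's a toy sketch (my own, modeling no
particular machine) of a fill in flight with per-word valid bits, and
the decision a load to a different block forces under a refill-first
policy:

    #include <stdbool.h>
    #include <stdio.h>

    #define WORDS 4

    struct fill {                  /* refill in flight */
        unsigned block;            /* block being filled */
        bool     valid[WORDS];     /* words that have arrived */
        int      words_left;       /* beats still to come */
        bool     active;
    };

    /* extra stall cycles charged to a load (pessimistic toy: a load to
       a not-yet-arrived word, or to any other block, waits out the
       rest of the fill) */
    static int load_stall(const struct fill *f, unsigned block, int word)
    {
        if (!f->active)
            return 0;                   /* plain cache access */
        if (block == f->block && f->valid[word])
            return 0;                   /* streamed from the fill buffer */
        return f->words_left;           /* wait for the fill to finish */
    }

    int main(void)
    {
        struct fill f = { 10, {true, true, false, false}, 2, true };
        printf("load a+4 (word 1, same block): %d stalls\n",
               load_stall(&f, 10, 1));
        printf("load b   (other block):        %d stalls\n",
               load_stall(&f, 11, 0));
        return 0;
    }

A load-first policy would make the second load free but delay the refill
instead; either way somebody waits, and a second memory pipe is what
actually relieves the pressure.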

Big machines, with long memory latencies (in terms of cycle counts),
heavy pipelining for FP, and multiple memory pipes, can contemplate
more parallelism than makes sense for a micro with 1 memory pipe,
where most attempts to "get ahead" quickly run into the constriction
of the single pipe.

All of this, especially the D-cache choices, is likely to be very
benchmark-dependent.  I'm sure there are programs where you can get
2X better by wrap-around with early restart, and others where you
get little or nothing.

All of this is just a long way of saying there is no simple answer
to the question "is this worth it?"  and that finding the answer
takes detailed simulation of many, big programs running on
simulators that know every detail of the pipeline, cache, etc.
(I've often commented in public talks that we had something like 400-500
VAX-mips used for the design of some future chip.  The designers
beat me up, saying it's more like 2X that these days...)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

jgh@root.co.uk (Jeremy G Harris) (10/03/90)

In article <41856@mips.mips.COM> mash@mips.COM (John Mashey) writes
yet another long and informative exposition....
>[...]
>For the I-cache, when you have a cache miss:
>	I1) You can stall the machine, fetch the entire cache block,
>	then restart.  This is clearly the simplest.
>	I2) You can do "early restart", where you begin executing as soon
>	as the requested word is available.  [...]
>	I3) You can do "out-of-order fetch" in addition to early restart,
>	and then do "wrapped fetch", so that you wrap-around to complete.

Does anyone do:

I4) Out-of-order fetch with early restart but without wrapped fetch?
Extra hardware is needed to mark unavailable words within the line;
would checking these marks be too slow?

(orthogonal possibility)
XX) Prefetch of next line on access of last word of line?


Disclaimer: I've never designed a memory system.
-- 
Jeremy Harris			jgh@root.co.uk			+44 71 315 6600

tim@proton.amd.com (Tim Olson) (10/03/90)

In article <41856@mips.mips.COM> mash@mips.COM (John Mashey) writes:
| For the I-cache, when you have a cache miss:
| 	I1) You can stall the machine, fetch the entire cache block,
| 	then restart.  This is clearly the simplest.
| 	I2) You can do "early restart", where you begin executing as soon
| 	as the requested word is available.  This is sometimes called
| 	"Instruction Streaming" (in the MIPS R3000), i.e.,  when you
| 	cache miss:
| 		start fetching at word 0 of the block
| 		stall until the needed word is fetched, then stream
| 		if you branch elsewhere before the end of the block,
| 		stop streaming, stall the pipeline until block filled
| 		also, if a load/store causes a cache miss, complete
| 		the I-cache refill, then handle the D-cache miss
| 	I3) You can do "out-of-order fetch" in addition to early restart,
| 	and then do "wrapped fetch", so that you wrap-around to complete.

There are also other possibilities, such as:

	I4) Have a valid bit per word in the cache block, and fetch
	    the missed instruction first, then burst reload continuing
	    from that instruction into subsequent blocks, rather than
	    wrapping around to complete the missed block.

This tends to match instruction fetch patterns better than the other
solutions, but with the added expense of extra valid bits and more
complexity.
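
The check itself is cheap -- roughly this (a sketch, no particular
part):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define WORDS_PER_LINE 4

    struct line {
        uint32_t tag;
        uint8_t  valid;                  /* one valid bit per word */
        uint32_t word[WORDS_PER_LINE];
    };

    /* hit iff the tag matches AND that word's valid bit is set */
    static bool hit(const struct line *l, uint32_t tag, int w)
    {
        return l->tag == tag && (l->valid & (1u << w)) != 0;
    }

    int main(void)
    {
        struct line l = { 0x123, 0x3 /* words 0,1 valid */, {0} };
        printf("%d %d\n", hit(&l, 0x123, 1), hit(&l, 0x123, 2));  /* 1 0 */
        return 0;
    }

The marginal cost over a single valid bit is one mask test; the real
expense is the extra state and the fill/invalidate logic around it.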


	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

mash@mips.COM (John Mashey) (10/03/90)

In article <1990Oct3.140725.3931@mozart.amd.com> tim@amd.com (Tim Olson) writes:
>In article <41856@mips.mips.COM> mash@mips.COM (John Mashey) writes:
...
>There are also other possibilities, such as:

>	I4) Have a valid bit per word in the cache block, and fetch
>	    the missed instruction first, then burst reload continuing
>	    from that instruction into subsequent blocks, rather than
>	    wrapping around to complete the missed block.

>This tends to match instruction fetch patterns better than the other
>solutions, but with the added expense of extra valid bits and more
>complexity.

Recall that the R3000 uses block refill, but has 1 valid bit
per word of cache, so that, for example, stores can store 1 word
thru the cache, and only affect that word.  (The story might be
different if they were write-back caches, of course.)

Hence, it would have been fairly easy to implement the proposed I4.
Of course, we simulated that one, a long time ago.

Unfortunately....
	a) It turns out that if you I-cache-miss in word N of an M-word cache
	block, there is a fairly high probability that you will want
	word(s) earlier in the cache block fairly soon.  (For instance,
	a conditional branch inside a loop may take the branch the first
	time, then fall thru the next.)

	b) So, the cache miss penalty for I2 or I3 =
	access time + M

	c) The cache miss penalty for I4 =
	access time + N 
	+ penalty for refetches that I2/I3 don't do.

	This last term is nontrivial, i.e., with this scheme, every time
	you have a cache miss, you fetch from there to the end of the block,
	so it depends on what order things hit in.  Let's try the simplest
	case, and the most likely, which is to assume that the next miss
	is to word 0.  Hence, the I4 penalty would be:
	access time + N
	+ Pr(touch word 0 "soon") * (access time + M)

	A gross rule of thumb is that access time and refill time should
	be about the same (this isn't exactly right, but close enough for this).
	If you assume N = M/2 on the average:
	I4: A + M/2  + Pr(0) * (A + M)
	I2/I3: A + M 

	Difference (how much worse) between I4 & I2/I3:
	diff = A + M/2 + Pr(0)*(A+M) - A - M
	     = Pr(0)*(A+M) - M/2
	assuming A == M, we get:
	diff = Pr(0)*(M+M) - M/2
	diff is > 0 (i.e., I4 is worse) if Pr(0) > 25%,

	that is, if you jump into the middle of a block, and come back
	and hit the beginning of the block again, before I2/I3 would
	have replaced the block, and you do this >25% of the time,
	then the proposed I4 is worse.
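
Plugging in numbers (same simplifications, A == M):

    #include <stdio.h>

    /* diff = Pr(0)*(A+M) - M/2;  diff > 0 means I4 loses to I2/I3 */
    int main(void)
    {
        double M = 8.0, A = M;
        for (double p = 0.0; p <= 0.51; p += 0.125) {
            double diff = p * (A + M) - M / 2.0;
            printf("Pr(0) = %5.3f   diff = %+5.2f cycles  (%s)\n",
                   p, diff, diff > 0 ? "I4 worse" : "I4 no worse");
        }
        return 0;
    }

The crossover sits right at Pr(0) = 25%, independent of M once A == M.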

This is, of course, an over-simplified analysis, as the only way to really
do it is to simulate it heavily, and it may well work in some environments.
I don't know what the percentage was, but it was high enough that I4
came out worse than I2 for us.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

aglew@crhc.uiuc.edu (Andy Glew) (10/04/90)

..> Critical word first,
..> and starting the processor as soon as the missing instruction is fetched.

>In article <41856@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>| For the I-cache, when you have a cache miss:
>| 	I1) You can stall the machine, fetch the entire cache block,
>| 	then restart.  This is clearly the simplest.
>| 	I2) You can do "early restart", where you begin executing as soon
>| 	as the requested word is available.  This is sometimes called
>| 	"Instruction Streaming" (in the MIPS R3000), i.e.,  when you
>| 	cache miss:
>| 		start fetching at word 0 of the block
>| 		stall until the needed word is fetched, then stream
>| 		if you branch elsewhere before the end of the block,
>| 		stop streaming, stall the pipeline until block filled
>| 		also, if a load/store causes a cache miss, complete
>| 		the I-cache refill, then handle the D-cache miss
>| 	I3) You can do "out-of-order fetch" in addition to early restart,
>| 	and then do "wrapped fetch", so that you wrap-around to complete.
>
>[tim@proton.amd.com (Tim Olson)]:
>There are also other possibilities, such as:
>
>	I4) Have a valid bit per word in the cache block, and fetch
>	    the missed instruction first, then burst reload continuing
>	    from that instruction into subsequent blocks, rather than
>	    wrapping around to complete the missed block.
>
>This tends to match instruction fetch patterns better than the other
>solutions, but with the added expense of extra valid bits and more
>complexity.


One of the advantages of the IBM RS/6000's Cache Reload Buffers seems
to be that they function as an extra cache line with a valid bit for
every bus-width unit.

When the burst is finished, the CRB is transferred to a regular cache
line that has fewer valid bits.[*]

[*] Actually, I'm not so sure if the RS/6000 has fewer valid bits in the
    normal cache line - this might be one of my extrapolations.
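
In code, the idea I'm imagining looks roughly like this (again my
extrapolation, not IBM's actual design):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define BEATS 4   /* bus-width units per line (assumed) */

    struct reload_buf {            /* tracks validity per bus beat */
        uint32_t tag;
        bool     beat_valid[BEATS];
        uint64_t data[BEATS];
    };

    struct cache_line {            /* one valid bit once committed */
        uint32_t tag;
        bool     valid;
        uint64_t data[BEATS];
    };

    /* burst done: copy the buffer in, collapse to a single valid bit */
    static void commit(struct cache_line *l, const struct reload_buf *b)
    {
        memcpy(l->data, b->data, sizeof l->data);
        l->tag   = b->tag;
        l->valid = true;
    }

    int main(void)
    {
        struct reload_buf b = { 7, {true, true, true, true}, {1, 2, 3, 4} };
        struct cache_line l = {0};
        commit(&l, &b);
        return l.valid ? 0 : 1;
    }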


--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]