[comp.arch] R4000 data cache stalls

krste@ICSI.Berkeley.EDU ( Krste Asanovic) (03/13/91)

Looking back at an old posting about the R4000 I noticed the following:

In article 21694 of comp.arch John Mashey wrote:
    >
    >How do stores work?  I can think of lots of possibilities ...
    >Is the pipelining of word stores and partial word stores the same?
    They work straightforwardly: you can do back-to-back stores, loads,
    loads/stores without any stalls, assuming cache hits, of course,
    and regardless of byte/half/word/double type.

The uP Report article states that the on-chip data cache is
write-back, with a block size of 16 or 32 bytes. Would anyone at MIPS
care to enlighten me as to how they managed to avoid stalls in
back-to-back runs of cache accesses?

Write-back implies that the cache may have the only valid copy, and so
stores must wait to check tags before stomping on the data in cache.
Doesn't this mean that stores access the cache data RAM at a different
pipestage (WB?) than loads (DS?), and hence that a store followed a
couple cycles later by a load will cause a stall? I doubt the cache
RAM would be dual-ported.

Thanks,
Krste

mash@mips.com (John Mashey) (03/16/91)

In article <1991Mar13.104148.25097@agate.berkeley.edu> krste@ICSI.Berkeley.EDU ( Krste Asanovic) writes:
>Looking back at an old posting about the R4000 I noticed the following:
>In article 21694 of comp.arch John Mashey wrote:
>    >How do stores work?  I can think of lots of possibilities ...
>    >Is the pipelining of word stores and partial word stores the same?

>    They work straightforwardly: you can do back-to-back stores, loads,
>    loads/stores without any stalls, assuming cache hits, of course,
>    and regardless of byte/half/word/double type.
>
>The uP Report article states that the on-chip data cache is
>write-back, with a block size of 16 or 32 bytes. Would anyone at MIPS
>care to enlighten me as to how they managed to avoid stalls in
>back-to-back runs of cache accesses?
>
>Write-back implies that the cache may have the only valid copy, and so
>stores must wait to check tags before stomping on the data in cache.
>Doesn't this mean that stores access the cache data RAM at a different
>pipestage (WB?) than loads (DS?), and hence that a store followed a
>couple cycles later by a load will cause a stall? I doubt the cache
>RAM would be dual-ported.

Sure.  I over-simplified a little too much in a quick posting.

Summary:
	There's a 2-deep (by 64-bit) store buffer between the CPU and D-cache.
	Assuming cache hits & no multi-processor cache-coherency state
	changes, the only stalls are when a load accesses a doubleword
	currently held in the store buffers, at which point the store
	buffer(s) unload into the cache, and the load continues.
	(Obviously, given optimizing compilers with reasonable number of
	registers, loading something you just stored doesn't happen too often.
	Most likely, it's because you store a value in one part of a 64-bit
	area, and load another variable that happens to fall in the same
	64-bit storage.)
	Loads have priority over stores for cache access;
	stores unload into the cache whenever they get a chance.
	It is important to recall that a store needs immediate access
	to the tags, and the actual store of data can happen sometime later.
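
If it helps, the aliasing condition in that summary can be written down
directly: two accesses conflict in the store buffer when they land in the
same aligned 64-bit doubleword, even if the bytes themselves don't overlap.
A toy check in Python (my illustration, not MIPS's hardware):

```python
def same_doubleword(addr_a: int, addr_b: int) -> bool:
    """True if both byte addresses fall in the same aligned 8-byte area."""
    return (addr_a >> 3) == (addr_b >> 3)

# A store to 0x1000 followed by a load from 0x1004: different bytes,
# but the same doubleword, so the load waits for the buffer to unload.
print(same_doubleword(0x1000, 0x1004))  # True  -> load would stall
print(same_doubleword(0x1000, 0x1008))  # False -> no conflict
```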

It goes like this:

1) Recall that the R4000's pipeline has 8 internal stages (4 external
clock ticks).  For a load, it takes 2 stages (DF/DS) for
the data+tag to arrive.

	IF IS RF EX DF DS TC WB

The tag goes into the Tag Check stage, and the WB only occurs if
the TC succeeds, although the data is forwarded in the meantime, i.e.:
	lw	r1,x
	A
	B
	C
If A or B uses r1, the machine stalls; if C uses it, it's OK.
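
So, reading off that example, the first safe consumer is 3 instruction
slots behind the load. A trivial sketch of that interlock rule (my
reading of the example above, not a hardware description):

```python
# The lw's result is forwarded after DS, so a consumer fewer than
# 3 slots behind the load must stall (A and B stall, C is fine).
LOAD_USE_DISTANCE = 3

def stalls(distance_after_load: int) -> bool:
    """True if an instruction this many slots after a lw that uses
    its result would stall."""
    return distance_after_load < LOAD_USE_DISTANCE

for d, name in [(1, "A"), (2, "B"), (3, "C")]:
    print(name, "stalls" if stalls(d) else "ok")
```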

2) When you do a store, you need to fetch the tag (as in a load),
but do NOT need to access the data side (see later).
You check the tag in the TC stage.
If the tag doesn't match, or if there are coherency issues, you need
to go take care of that.
If the tag matches:
	save into the store buffer:
		cache index
		byte-mask (or equivalent) to say which bytes are being written.
		data
3) The store buffer empties when it can, which is any time an instruction
does NOT need to access the data part of the data cache.  Only loads
access the data part of the D-cache. (Again, ignoring cache coherency
issues.)  Note that stores themselves do NOT need to access the data,
which means that each store itself creates a slot for some previous
store to finally complete.
	lw, lw, lw.... never stalls
	sw, sw, sw ... never stalls
	sw, sw, sw, lw, lw, lw ... never stalls (although there end up
		being a couple pending stores that have to be gotten rid of)
	In some sense, this is the only place the stalls really show
	up, i.e., if you need to do something and need to get rid of the
	pending stores first.
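
Those three sequences can be checked against a toy model of the buffer
(the model is my guess at the mechanism, built only from the rules above:
loads own the data port in their slot; a store never needs it, so each
store's slot lets one pending store drain; a load stalls only to unload
the buffer when it touches a doubleword with a pending store):

```python
from collections import deque

def run(seq):
    """seq: list of ('lw'|'sw', byte_address). Returns stall-slot count."""
    buf = deque()  # pending stores, kept as doubleword indices
    stalls = 0
    for op, addr in seq:
        dw = addr >> 3
        if op == 'sw':
            if len(buf) == 2:
                buf.popleft()       # this slot's free data port drains one
            buf.append(dw)
        else:                       # lw owns the data port; nothing drains
            if dw in buf:
                stalls += len(buf)  # unload buffer, then the load continues
                buf.clear()
    return stalls

print(run([('lw', 8 * i) for i in range(4)]))   # 0: lw, lw, lw never stalls
print(run([('sw', 8 * i) for i in range(4)]))   # 0: sw, sw, sw never stalls
print(run([('sw', 0), ('lw', 4)]))              # 1: load hits pending store
```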

4) So consider this as a write-back cache with enough store-buffering
to mask what would otherwise be a read-modify-write sequence for
partial-word operations.  This is especially important since the
cache data paths are 8-bytes wide, and typical integer code (in 32-bit
mode) would ALL act like partial word operations.
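
The byte-mask in the buffer entry is what makes this work: the partial-word
data can be merged into the 64-bit cache word whenever the store finally
drains, with no separate read-modify-write sequence at store time. A sketch
of the merge step (names and framing mine, not from the R4000 documentation):

```python
def merge(old_doubleword: int, data: int, byte_mask: int) -> int:
    """Replace the bytes of old_doubleword selected by byte_mask
    (bit i covers byte i) with the corresponding bytes of data."""
    expanded = 0
    for i in range(8):
        if byte_mask & (1 << i):
            expanded |= 0xFF << (8 * i)
    return (old_doubleword & ~expanded) | (data & expanded)

# Store the byte 0xAB into byte 1 of a doubleword of all-ones:
print(hex(merge(0xFFFFFFFFFFFFFFFF, 0xAB00, 0b00000010)))
```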

Anyway, the only real stalls thus come from:
	a) Reads to doublewords just written
	b) Pending stores that are left over when one needs to do
	something else.

Anyway, that's my best understanding of what's going on...
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086