krste@ICSI.Berkeley.EDU ( Krste Asanovic) (03/13/91)
Looking back at an old posting about the R4000 I noticed the following:

In article 21694 of comp.arch John Mashey wrote:
> >How do stores work? I can think of lots of possibilities ...
> >Is the pipelining of word stores and partial word stores the same?
> They work straightforwardly: you can do back-to-back stores, loads,
> loads/stores without any stalls, assuming cache hits, of course,
> and regardless of byte/half/word/double type.

The uP Report article states that the on-chip data cache is write-back,
with a block size of 16 or 32 bytes. Would anyone at MIPS care to
enlighten me as to how they managed to avoid stalls in back-to-back
runs of cache accesses?

Write-back implies that the cache may have the only valid copy, and so
stores must wait to check tags before stomping on the data in the cache.
Doesn't this mean that stores access the cache data RAM at a different
pipestage (WB?) than loads (DS?), and hence that a store followed a
couple of cycles later by a load will cause a stall? I doubt the cache
RAM would be dual-ported.

Thanks,
Krste
mash@mips.com (John Mashey) (03/16/91)
In article <1991Mar13.104148.25097@agate.berkeley.edu> krste@ICSI.Berkeley.EDU (Krste Asanovic) writes:
>Looking back at an old posting about the R4000 I noticed the following:
>In article 21694 of comp.arch John Mashey wrote:
> >How do stores work? I can think of lots of possibilities ...
> >Is the pipelining of word stores and partial word stores the same?
> They work straightforwardly: you can do back-to-back stores, loads,
> loads/stores without any stalls, assuming cache hits, of course,
> and regardless of byte/half/word/double type.
>
>The uP Report article states that the on-chip data cache is
>write-back, with a block size of 16 or 32 bytes. Would anyone at MIPS
>care to enlighten me as to how they managed to avoid stalls in
>back-to-back runs of cache accesses?
>
>Write-back implies that the cache may have the only valid copy, and so
>stores must wait to check tags before stomping on the data in cache.
>Doesn't this mean that stores access the cache data RAM at a different
>pipestage (WB?) than loads (DS?), and hence that a store followed a
>couple cycles later by a load will cause a stall? I doubt the cache
>RAM would be dual-ported.

Sure. I over-simplified a little too much in a quick posting.

Summary: there's a 2-deep (by 64-bit) store buffer between the CPU and
the D-cache. Assuming cache hits and no multiprocessor cache-coherency
state changes, the only stalls occur when a load accesses a doubleword
currently held in the store buffers, at which point the store buffer(s)
unload into the cache and the load continues. (Obviously, given
optimizing compilers with a reasonable number of registers, loading
something you just stored doesn't happen too often. Most likely it
happens because you store a value into one part of a 64-bit area and
then load another variable that happens to fall in the same 64-bit
storage.) Loads have priority over stores for cache access; stores
unload into the cache whenever they get a chance.
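[A toy sketch of the behavior described above, for readers who think better in code. This is my own illustration, not MIPS's actual hardware: a 2-deep buffer of pending 64-bit stores, where a load whose doubleword matches a pending store forces the buffers to drain into the cache before the load completes.]

```python
# Toy model (NOT the real R4000 datapath) of a 2-deep store buffer.
# Addresses are byte addresses; conflicts are detected at doubleword
# (8-byte) granularity, as in the posting above.

class StoreBuffer:
    DEPTH = 2

    def __init__(self):
        self.pending = []        # queued (doubleword_index, data) entries
        self.cache = {}          # stand-in for the D-cache data array
        self.stall_cycles = 0    # cycles lost draining on a load conflict

    def store(self, addr, data):
        dw = addr // 8
        if len(self.pending) == self.DEPTH:
            # Buffer full: the oldest entry retires into the cache
            # (a store's own cycle leaves the data port free for this).
            old_dw, old_data = self.pending.pop(0)
            self.cache[old_dw] = old_data
        self.pending.append((dw, data))

    def load(self, addr):
        dw = addr // 8
        if any(p_dw == dw for p_dw, _ in self.pending):
            # Load hits a buffered doubleword: drain everything first,
            # then the load proceeds -- the one real stall case.
            self.stall_cycles += len(self.pending)
            self.drain()
        return self.cache.get(dw)

    def drain(self):
        for p_dw, p_data in self.pending:
            self.cache[p_dw] = p_data
        self.pending.clear()
```

So `store(0, x); store(8, y); load(16)` costs nothing extra, while `load(0)` right after those stores charges a drain penalty before returning `x`.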
It is important to recall that a store needs immediate access to the
tags, but the actual store of data can happen sometime later. It goes
like this:

1) Recall that the R4000's pipeline has 8 internal stages (4 external
clock ticks). For a load, it takes 2 stages (DF/DS) for the data+tag
to arrive:

	IF IS RF EX DF DS TC WB

The tag goes into the Tag Check stage, and the WB only occurs if the
TC succeeds, although the data is forwarded, i.e.:

	lw	r1,x
	A
	B
	C

If A or B uses r1, the machine stalls; if C uses it, it's OK.

2) When you do a store, you need to fetch the tag (as for a load), but
you do NOT need to access the data side (see later). You check the tag
in the TC stage. If the tag doesn't match, or if there are coherency
issues, you need to go take care of that. If the tag matches, you save
into the store buffer:

	cache index
	byte-mask (or equivalent) to say which bytes are being written
	data

3) The store buffer empties when it can, which is any time an
instruction does NOT need to access the data part of the data cache.
Only loads access the data part of the D-cache (again, ignoring
cache-coherency issues). Note that stores themselves do NOT need to
access the data, which means that each store itself creates a slot for
some previous store to finally complete.

	lw, lw, lw ...		never stalls
	sw, sw, sw ...		never stalls
	sw, sw, sw, lw, lw, lw	never stalls (although there end up being
				a couple of pending stores that have to
				be gotten rid of)

In some sense, this is the only place the stalls really show up, i.e.,
when you need to do something and need to get rid of the pending
stores first.

4) So consider this a write-back cache with enough store-buffering to
mask what would otherwise be a read-modify-write sequence for
partial-word operations. This is especially important since the cache
data paths are 8 bytes wide, and typical integer code (in 32-bit mode)
would ALL act like partial-word operations.
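[Another illustrative sketch, again mine rather than anything from MIPS: a cycle counter implementing just the scheduling rule in point 3. Only loads use the D-cache data port, so each store's cycle is a free slot in which one buffered store can retire, and the three instruction sequences above complete without any stalls.]

```python
# Toy cycle counter for the rule: only loads use the D-cache data
# port, so a store's own cycle lets one buffered store retire.
# Doubleword load/store conflicts (the real stall case) are ignored
# here; this models only the no-stall sequences from the posting.

def run(seq, depth=2):
    pending = 0      # stores sitting in the store buffer
    stalls = 0       # never incremented: these sequences don't stall
    for op in seq:
        if op == "sw":
            if pending == depth:
                pending -= 1   # this store's free data-port cycle
                               # retires one buffered store
            pending += 1       # the new store enters the buffer
        elif op == "lw":
            pass               # the load owns the data port this cycle
    return stalls, pending

print(run(["lw"] * 3))                            # (0, 0)
print(run(["sw"] * 3))                            # (0, 2)
print(run(["sw", "sw", "sw", "lw", "lw", "lw"]))  # (0, 2)
```

Note how the last two runs finish with two stores still pending, matching "a couple of pending stores that have to be gotten rid of" before anything that needs the drained cache can proceed.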
Anyway, the only real stalls thus come from:
a) Reads of doublewords just written.
b) Pending stores left over when one needs to do something else.

Anyway, that's the best that I understand what's going on...
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086