mmm@cup.portal.com (Mark Robert Thorson) (09/26/90)
When some caches miss, they just go out to memory and snort up a line in sequential order. I believe the MIPS R6000 is an example of this. But with most designs, such as the 486, SPARC, 68040, etc., the first item returned to the CPU is the item which caused the miss. This is called "critical word first".

One difference between the 486 and everybody else is that the 486 will always read the next word from the other half of an aligned 2-word block before reading the other two words needed to fill a cache line. For example, if the miss is on address xxx1, the 486 will fill the cache line in the order 1-0-3-2. On SPARC or the 68040, the read continues to the end of the block and wraps around, i.e. 1-2-3-0.

Why did Intel do it this way? To optimize the performance of 64-bit memory systems. Using the conventional order, the CPU might read one word from one 64-bit block, read two words from a second block, then hop back to the first block for another word. Intel's method always reads one block completely before reading the other.

Does this little design tweak pay off? Just recently, Motorola has provided proof! They've introduced two static RAMs with on-chip burst control logic optimized for use as cache memories. The MCM62940 is compatible with conventional burst ordering; the MCM62486 uses Intel-style burst ordering. The MCM62940 is offered in the following speed grades: 15, 20, and 25 ns (max. access time). The MCM62486 is offered in 14, 19, and 24 ns. Intel wins by a nanosecond!
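The two fill orders can be written down as index sequences. A minimal sketch (the function names and the XOR formulation are my own framing, not from the post; 4-word line assumed):

```python
def intel_burst(start, words=4):
    """486-style burst order: XOR the starting word index with 0, 1, 2, ...
    This finishes one aligned 2-word half of the line before touching the other."""
    return [start ^ i for i in range(words)]

def wraparound_burst(start, words=4):
    """Conventional order: run to the end of the block, then wrap around."""
    return [(start + i) % words for i in range(words)]
```

For a miss on word 1, `intel_burst(1)` gives [1, 0, 3, 2] and `wraparound_burst(1)` gives [1, 2, 3, 0], matching the post's example.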
ram@shukra.Eng.Sun.COM (Renu Raman) (09/27/90)
In article <34275@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes:
>sequential order. I believe the MIPS R6000 is an example of this.
>in the order 1-0-3-2. On SPARC or the 68040, the read continues to the end
>of the block and wraps around, i.e. 1-2-3-0.

On a SPARC!!! Let's be specific, since there are various SPARC processors and different systems built out of these various SPARC processors.

To date, there aren't any systems built with SPARC processors that have caches which do wrap-around fill, that I know of...
mmm@cup.portal.com (Mark Robert Thorson) (09/28/90)
ram@shukra.Eng.Sun.COM (Renu Raman) says:
> On a SPARC!!! Let's be specific, since there are various SPARC
> processors and different systems built out of these various SPARC processors.
>
> To date, there aren't any systems built with SPARC processors that have
> caches which do wrap-around fill, that I know of...

In the specification of the SBus (the bus on the SPARCstation 1), the order for a burst transfer is to read to the end of the aligned block, then wrap around to the beginning. I don't have the MBus spec -- perhaps you could tell us whether it requires the same burst order.

If the architecture of SPARC or its standard buses requires this burst order, then it is probably accurate to say that all present and future SPARC implementations are non-optimal with regard to their burst order -- a problem solved successfully in the Intel 486 :-)

This is symptomatic of the area in which Sun is weakest -- lack of attention to detail. I could say some choice words about the %&*#@ optical mouse on the Sun, but that would be getting off the topic ....
aglew@crhc.uiuc.edu (Andy Glew) (09/29/90)
..> Critical word first, wrap-around fill (1230), and other orders (1032)

Trying to generalize: perhaps a bus specification should not include the size of the wrap-around region; rather, the control for the burst should specify which address bits get complemented at which cycle of the burst?
--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]
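Glew's suggestion can be sketched as a burst controller driven by a per-cycle toggle mask instead of a fixed wrap size. The function below is my own illustration of the idea, not anything from an actual bus spec:

```python
def programmable_burst(start, toggle_masks):
    """Generate a burst order by complementing (XORing) the address bits
    named in toggle_masks[i] at cycle i of the burst."""
    addr = start
    order = [addr]
    for mask in toggle_masks:
        addr ^= mask
        order.append(addr)
    return order
```

With masks [1, 3, 1] this reproduces the 486 order from any starting word (e.g. start 1 gives [1, 0, 3, 2], start 2 gives [2, 3, 0, 1]). The wrap-around order, by contrast, needs different masks depending on the starting word, which is one way to see that XOR-style ordering is the natural fit for this control scheme.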
guy@auspex.auspex.com (Guy Harris) (10/02/90)
>In the specification of the SBus (the bus on the SPARCstation 1), the order for
>a burst transfer is to read to the end of the aligned block, then wrap around
>to the beginning. I don't have the MBus spec -- perhaps you could tell
>us whether that requires the same burst order.

I asked somebody here, and he said:

    I quote from the MBus spec: "Mbus supports a feature called wrapping.
    During the address phase, MAD[2:0] is a don't care, and MAD[35:3]
    defines the 32 bytes to be returned. As the data is returned,
    MAD[4:3] is incremented by the memory controller, producing a
    wrap-around effect."
jesup@cbmvax.commodore.com (Randell Jesup) (10/02/90)
In article <34275@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes:
>For example, if the miss is on address xxx1, the 486 will fill the cache line
>in the order 1-0-3-2. On SPARC or the 68040, the read continues to the end
>of the block and wraps around, i.e. 1-2-3-0.
>
>Does this little design tweak pay off? Just recently, Motorola has provided
>proof! They've introduced two static RAMs with on-chip burst control logic
>optimized for use as cache memories. The MCM62940 is compatible with
>conventional burst ordering. The MCM62486 uses Intel-style burst ordering.
>The MCM62940 is offered in the following speed grades: 15, 20, and 25 ns.
>(Max. access time.) The MCM62486 is offered in: 14, 19, and 24 ns. Intel
>wins by a nanosecond!

However, does having to wait for the last word of the burst for the next entry (#2) cost you anything? For instructions at least, there is a definite tendency toward consecutive access (and even somewhat in data, when operating on blocks such as strings or copies).
--
Randell Jesup, Keeper of AmigaDos, Commodore Engineering.
{uunet|rutgers}!cbmvax!jesup, jesup@cbmvax.cbm.commodore.com  BIX: rjesup
Common phrase heard at Amiga Devcon '89: "It's in there!"
mash@mips.COM (John Mashey) (10/02/90)
In article <14780@cbmvax.commodore.com> jesup@cbmvax.commodore.com (Randell Jesup) writes:
>In article <34275@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes:
>>For example, if the miss is on address xxx1, the 486 will fill the cache line
>>in the order 1-0-3-2. On SPARC or the 68040, the read continues to the end
>>of the block and wraps around, i.e. 1-2-3-0.
>[...]
> However, does having to wait for the last word of the burst for the
>next entry (#2) cost you anything? For instructions at least, there is
>a definite tendency toward consecutive access (and even somewhat in data,
>when operating on blocks such as strings or copies).

This is discussed in Patterson & Hennessy, near the end of Chapter 8. They note it doesn't gain as much as you'd think. How much this is worth depends a lot on at least the following attributes:
  a) I-cache versus D-cache
  b) the pipeline structure: instruction latencies, issue parallelism
  c) the memory system: cache structure & buffering, main memory
     latency & number of pipes to it

For the I-cache, when you have a cache miss:

  I1) You can stall the machine, fetch the entire cache block, then
      restart. This is clearly the simplest.

  I2) You can do "early restart", where you begin executing as soon as
      the requested word is available. This is sometimes called
      "Instruction Streaming" (in the MIPS R3000), i.e., when you
      cache miss:
        - start fetching at word 0 of the block
        - stall until the needed word is fetched, then stream
        - if you branch elsewhere before the end of the block, stop
          streaming and stall the pipeline until the block is filled
        - also, if a load/store causes a cache miss, complete the
          I-cache refill, then handle the D-cache miss

  I3) You can do "out-of-order fetch" in addition to early restart,
      and then do "wrapped fetch", so that you wrap around to complete
      the block.

Cache miss penalty = access time + transfer time. In all of these cases, the cache miss penalty is the same, but I2 and I3 try to "cover" some of the transfer time by converting stalls into issues. (It's hard to do much about access time for I-cache misses :-)

I2 and I3 are the same, except that I3 gets to issue instructions earlier. This doesn't do much for short-latency operations, but can help those with longer latency, especially if there are multiple independent units. In practice, this means that:
  - the approach probably helps floating-point codes more than integer
  - it helps pipelined FP units more than low-latency scalar ones
  - it can REALLY help the class of program that consists of a giant
    loop filled with sequential code larger than the cache size. (Now
    you see the real reason why such features are included, as this
    class of program happens to be very important to chip designers,
    and they take care of themselves :-)

For the I-cache, it is known that the first word of a cache block accounts for an unusually high share of the misses, due to fall-through. I think this is true across a wide range of programs. Hence, there is even less difference between I2 and I3 than you'd expect, since they behave identically in this case.
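The three I-cache schemes can be compared with a toy stall-cycle model. This is my own sketch, not MIPS's simulator: it assumes a fixed access latency, one cycle per word transferred, and it ignores the issue overlap that streaming buys after restart.

```python
def stall_cycles(miss_word, block=4, access=4, scheme="I1"):
    """Cycles until the missed instruction can issue, for one I-cache miss.
    Toy model: fixed access latency, one cycle per word transferred."""
    if scheme == "I1":    # fetch the entire block, then restart
        return access + block
    if scheme == "I2":    # fill from word 0; restart at the missed word
        return access + miss_word + 1
    if scheme == "I3":    # out-of-order fetch: critical word first
        return access + 1
    raise ValueError(scheme)
```

Note that for a miss on word 0 the model gives I2 and I3 the same answer, which is exactly why the high share of word-0 misses narrows the gap between them.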
I can't recall the numbers, but I think the simulations done when the R3000 was being designed showed relatively little difference; note, though, that the R3000's FP operations are fairly low latency, and hence it would have been helped less than some other designs of the time.

For the D-cache, it's more of the same, except with more options and more complexity, as there are many more options for D-cache implementation. Here are a few of the relevant items:

  - Unlike the I-cache, where word 0 of a block is more likely to cause
    a cache miss (and this is true of almost every program), the effect
    is much more program-dependent for the D-cache. Some programs make
    many sequential references likely to cause cache misses, others
    much less so.

  - Unlike the I-cache, where successive instructions continue
    accessing the same block, loads and stores may well access
    different blocks, and this implies extra complexity or buffering.
    For example, if you have:
        load  r1,a
        load  r2,a+4
        load  r3,a+8
        etc
    then using read-data streaming works really well, as does:
        load  r1,a
        load  r2,a+4
        div   r1,r3
        etc
    where you get to start the divide early. On the other hand,
        load  r1,a
        load  r2,b      (b in a different cache block)
        store r1,c      (c in a third block)
    gets awkward. Although a load of a+4 is probably referencing an
    internal cache-line buffer, the load of b is referencing the cache,
    and would do so at the same time as the cache is being refilled.
    Should the refill have priority, or the load of b?

Some of your choices depend on how many memory pipes you have. Big machines, with long memory latencies (in terms of cycle counts), heavy pipelining for FP, and multiple memory pipes, can contemplate more parallelism than makes sense for a micro with one memory pipe, where most attempts to "get ahead" quickly run into the constriction of the single pipe.

All of this, especially the D-cache choices, is likely to be very benchmark-dependent. I'm sure there are programs where you can get 2X better by wrap-around with early restart, and others where you get little or nothing.

All of this is just a long way of saying there is no simple answer to the question "is this worth it?" and that finding the answer takes detailed simulation of many big programs, running on simulators that know every detail of the pipeline, cache, etc. (I've often commented in public talks that we had something like 400-500 VAX-mips used for the design of some future chip. The designers beat me up, saying it's more like 2X that these days...)
--
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:     408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:    MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
jgh@root.co.uk (Jeremy G Harris) (10/03/90)
In article <41856@mips.mips.COM> mash@mips.COM (John Mashey) writes yet another long and informative exposition....
>[...]
>For the I-cache, when you have a cache miss:
>  I1) You can stall the machine, fetch the entire cache block,
>      then restart. This is clearly the simplest.
>  I2) You can do "early restart", where you begin executing as soon
>      as the requested word is available. [...]
>  I3) You can do "out-of-order fetch" in addition to early restart,
>      and then do "wrapped fetch", so that you wrap around to complete.

Does anyone do:

  I4) Out-of-order fetch with early restart, but without wrapped fetch?
      Extra hardware is needed to mark unavailable words within the
      line; would checking these marks be too slow?

(orthogonal possibility)

  XX) Prefetch of the next line on access of the last word of a line?

Disclaimer: I've never designed a memory system.
--
Jeremy Harris    jgh@root.co.uk    +44 71 315 6600
tim@proton.amd.com (Tim Olson) (10/03/90)
In article <41856@mips.mips.COM> mash@mips.COM (John Mashey) writes:
| For the I-cache, when you have a cache miss:
|   I1) You can stall the machine, fetch the entire cache block,
|       then restart. This is clearly the simplest.
|   I2) You can do "early restart", where you begin executing as soon
|       as the requested word is available. This is sometimes called
|       "Instruction Streaming" (in the MIPS R3000), i.e., when you
|       cache miss:
|         start fetching at word 0 of the block
|         stall until the needed word is fetched, then stream
|         if you branch elsewhere before the end of the block,
|         stop streaming, stall the pipeline until block filled
|         also, if a load/store causes a cache miss, complete
|         the I-cache refill, then handle the D-cache miss
|   I3) You can do "out-of-order fetch" in addition to early restart,
|       and then do "wrapped fetch", so that you wrap around to complete.

There are also other possibilities, such as:

  I4) Have a valid bit per word in the cache block, and fetch the
      missed instruction first, then burst reload continuing from that
      instruction into subsequent blocks, rather than wrapping around
      to complete the missed block.

This tends to match instruction fetch patterns better than the other solutions, but with the added expense of extra valid bits and more complexity.
--
Tim Olson
Advanced Micro Devices
(tim@amd.com)
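The I4 scheme can be sketched as a cache line with one valid bit per word. The class below is my own toy illustration of the refill behavior described here, not AMD's implementation:

```python
class I4CacheLine:
    """Cache line with one valid bit per word (the I4 scheme)."""
    def __init__(self, words=4):
        self.valid = [False] * words
        self.data = [None] * words

    def refill_from(self, miss_word, memory_block):
        # Burst from the missed word to the end of the block only;
        # earlier words are NOT fetched (no wrap-around).
        for i in range(miss_word, len(self.data)):
            self.data[i] = memory_block[i]
            self.valid[i] = True

    def lookup(self, word):
        # Returns the word, or None on a (re-)miss.
        return self.data[word] if self.valid[word] else None
```

After `refill_from(2, ...)`, words 2 and 3 hit but word 0 still misses and would need a second burst -- that refetch cost is the crux of whether I4 pays off.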
mash@mips.COM (John Mashey) (10/03/90)
In article <1990Oct3.140725.3931@mozart.amd.com> tim@amd.com (Tim Olson) writes:
>In article <41856@mips.mips.COM> mash@mips.COM (John Mashey) writes:
...
>There are also other possibilities, such as:
>  I4) Have a valid bit per word in the cache block, and fetch
>      the missed instruction first, then burst reload continuing
>      from that instruction into subsequent blocks, rather than
>      wrapping around to complete the missed block.
>This tends to match instruction fetch patterns better than the other
>solutions, but with the added expense of extra valid bits and more
>complexity.

Recall that the R3000 uses block refill, but has 1 valid bit per word of cache, so that, for example, stores can store 1 word through the cache and affect only that word. (The story might be different if they were write-back caches, of course.) Hence, it would have been fairly easy to implement the proposed I4. Of course, we simulated that one, a long time ago. Unfortunately....

  a) It turns out that if you I-cache-miss in word N of an M-word cache
     block, there is a fairly high probability that you will want
     word(s) earlier in the cache block fairly soon. (For instance, a
     conditional branch inside a loop may take the branch the first
     time, then fall through the next.)
  b) So, the cache miss penalty for I2 or I3 = access time + M.
  c) The cache miss penalty for I4 = access time + N + penalty for
     refetches that I2/I3 don't do. This last term is nontrivial, i.e.,
     with this scheme, every time you have a cache miss you fetch from
     there to the end of the block, so it depends on what order things
     hit in.

Let's try the simplest case, and the most likely, which is to assume that the next miss is to word 0. Hence, the I4 penalty would be:

  access time + N + Pr(touch word 0 "soon") * (access time + M)

A gross rule of thumb is that access time and refill time should be about the same (this isn't exactly right, but close enough for this).
If you assume N = M/2 on the average:

  I4:     A + M/2 + Pr(0) * (A + M)
  I2/I3:  A + M

The difference (how much worse I4 is than I2/I3):

  diff = A + M/2 + Pr(0)*(A + M) - (A + M)
       = Pr(0)*(A + M) - M/2

Assuming A == M, we get:

  diff = Pr(0)*(M + M) - M/2

diff is > 0 (i.e., I4 is worse) if Pr(0) > 25%. That is, if you jump into the middle of a block, and come back and hit the beginning of the block again before I2/I3 would have replaced the block, and you do this more than 25% of the time, then the proposed I4 is worse.

This is, of course, an over-simplified analysis, as the only way to really do it is to simulate it heavily, and it may well work in some environments. I don't know what the percentage was, but it was high enough for us to make I4 less good than I2.
--
-john mashey    DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP:    mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:     408-524-7015, 524-8253 or (main number) 408-720-1700
USPS:    MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
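The break-even algebra above can be checked numerically. A small sketch under the stated assumptions (N = M/2 on average; A = access time, M = block size in words; the function name is mine):

```python
def i4_excess_penalty(A, M, p0):
    """How much worse I4 is than I2/I3 per miss, per the analysis above.
    A = access time, M = block size in words,
    p0 = Pr(touch word 0 "soon"), which triggers an A + M refetch."""
    i2_i3 = A + M
    i4 = A + M / 2 + p0 * (A + M)
    return i4 - i2_i3
```

With A == M the expression reduces to Pr(0)*2M - M/2, which crosses zero at Pr(0) = 25%: `i4_excess_penalty(8, 8, 0.25)` gives 0.0, and any larger Pr(0) makes I4 strictly worse.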
aglew@crhc.uiuc.edu (Andy Glew) (10/04/90)
..> Critical word first, and starting the processor as soon as the
..> missing instruction is fetched.

In article <41856@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>| For the I-cache, when you have a cache miss:
>|   I1) You can stall the machine, fetch the entire cache block,
>|       then restart. This is clearly the simplest.
>|   I2) You can do "early restart", where you begin executing as soon
>|       as the requested word is available. [...]
>|   I3) You can do "out-of-order fetch" in addition to early restart,
>|       and then do "wrapped fetch", so that you wrap around to complete.
>
>[tim@proton.amd.com (Tim Olson)]:
>There are also other possibilities, such as:
>
>  I4) Have a valid bit per word in the cache block, and fetch
>      the missed instruction first, then burst reload continuing
>      from that instruction into subsequent blocks, rather than
>      wrapping around to complete the missed block.
>
>This tends to match instruction fetch patterns better than the other
>solutions, but with the added expense of extra valid bits and more
>complexity.

One of the advantages of the IBM RS/6000's Cache Reload Buffer seems to be that it functions as an extra cache line that has a valid bit on every bus width. When the burst is finished, the CRB is transferred to a regular cache line that has fewer valid bits.[*]

[*] Actually, I'm not so sure the RS/6000 has fewer valid bits in the normal cache line -- this might be one of my extrapolations.
--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]