SFurber@acorn.co.uk (06/20/89)
Daniel Stodolsky writes:
> When memory latency (I/O) dominates in an application, it seems like
> write-back caches should be a big win, particularly on single processor
> machines where you don't have to worry about cache coherency.
>
> So why don't we see more write-back D-caches?

A write buffer fixes the memory latency problem; a write-back cache is
needed only if memory bandwidth is also an issue.

At first sight a write buffer looks a lot simpler to build than a
write-back cache, because of the flushing issues involved in context
switching or paging with the latter.  However Jouppi (Proceedings of the
16th International Symposium on Computer Architecture, p. 287) states
that for similar performance "A write-back cache is a simpler design...".

Steve Furber (sfurber@acorn.uucp)
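To see the write-buffer idea in concrete form, here is a rough C sketch
(the buffer depth, memory size, and function names are all invented for
illustration): stores go into a small FIFO and drain to memory later, so
the CPU stalls only when the buffer is full.  Latency is hidden, but
every store still consumes memory bandwidth, which is why a write-back
cache matters when bandwidth itself is the bottleneck.

    /* Toy write buffer: stores are queued and retired when the bus is idle. */
    #include <stdio.h>

    #define WBUF_SLOTS 4
    #define MEM_WORDS  1024

    struct wbuf_entry { unsigned addr, data; };

    static struct wbuf_entry wbuf[WBUF_SLOTS];
    static int wbuf_count = 0;
    static unsigned memory[MEM_WORDS];

    /* Retire the oldest pending write; called when the memory bus is idle. */
    static void wbuf_drain_one(void)
    {
        if (wbuf_count == 0)
            return;
        memory[wbuf[0].addr % MEM_WORDS] = wbuf[0].data;
        for (int i = 1; i < wbuf_count; i++)
            wbuf[i - 1] = wbuf[i];
        wbuf_count--;
    }

    /* CPU-side store: returns the number of stall cycles it cost. */
    static int cpu_store(unsigned addr, unsigned data)
    {
        int stalls = 0;
        if (wbuf_count == WBUF_SLOTS) {   /* buffer full: wait for one write */
            wbuf_drain_one();
            stalls = 1;                   /* pretend one memory write = one stall */
        }
        wbuf[wbuf_count].addr = addr;
        wbuf[wbuf_count].data = data;
        wbuf_count++;
        return stalls;
    }

    int main(void)
    {
        int stalls = 0;
        for (unsigned i = 0; i < 16; i++)
            stalls += cpu_store(i, i * i);   /* a burst of 16 stores */
        while (wbuf_count > 0)
            wbuf_drain_one();                /* idle cycles drain the rest for free */
        printf("stores: 16, stall cycles: %d\n", stalls);
        return 0;
    }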
dtynan@altos86.Altos.COM (Dermot Tynan) (06/21/89)
In article <799@acorn.co.uk>, SFurber@acorn.co.uk writes:
> At first sight a write buffer looks a lot simpler to build than a write-back
> cache, because of the flushing issues involved in context switching or
> paging with the latter.  However Jouppi (Proceedings of 16th International
> Symposium on Computer Architecture, p. 287) states that for similar
> performance "A write-back cache is a simpler design...".
>
> Steve Furber (sfurber@acorn.uucp)

That's nice.  How about a little proof, for those of us who don't happen
to have the proceedings near at hand...  What did he base his arguments on?
						- Der
--
dtynan@altos86.Altos.COM			(408) 946-6700 x4237
Dermot Tynan, Altos Computer Systems, San Jose, CA  95134
"Far and few, far and few, are the lands where the Jumblies live..."
slackey@bbn.com (Stan Lackey) (06/22/89)
>In article <799@acorn.co.uk>, SFurber@acorn.co.uk writes:
>> At first sight a write buffer looks a lot simpler to build than a write-back
>> cache, because of the flushing issues involved in context switching or
>> ...

Part of the implementation difficulty with writeback caches arises in
architectures that support DMA which does not go through the cache.
This includes bus-based systems, where the cpu/cache is packaged as a
bus device, and memory and I/O controllers are also bus devices.  In
this case, DMA cannot simply assume it can read memory, as the cache may
contain the up-to-date data.  To solve this problem, two mechanisms over
and above a writethrough cache are needed: (1) a bus watcher in the
cache that looks for reads on the bus, and accesses the cache for every
bus transaction to see if it contains hot data; (2) some way for the
cache to substitute its data in place of the memory's, or to tell the
device to retry the transaction after the cache has written the hot data
back to memory, or something similar.

Now, any cache must be aware of DMA activity and either invalidate or
write the new data into the cache when a device writes to memory.  This
mechanism is typically extended to add (1) above.  In fact, some
systems, to reduce cache contention, implement the tag store as
dual-ported (either as a dual-ported RAM, or as two copies of the same
data).  Further, (2) is handled by adding more kinds of bus
transactions, and sequences that the cache must perform.

Whether writeback or writethrough is chosen depends upon the application
of the product, the preferences of the designers, and often on things
even less tangible than that, like the way previous generations of the
product were done.  It is not safe to assume that writeback is always
faster than writethrough.  In some cases, when there are large data sets
(the data set is more than a couple of times larger than the cache and
there is not much locality of data usage) and data is usually written
once (as in a matrix transpose), there is an interaction between
writeback and a large cache line that causes it to be slower than
writethrough.  This happens with OS's that clear virtual memory before
giving it to a process, for example.

Looking at the RISC trend, it seems natural to assume that the next step
is to have a writeback cache with no "snooping" (as it has been called)
for either I/O reads OR writes, and to solve the problem in software.

-Stan
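To make mechanisms (1) and (2) above more concrete, here is a rough C
sketch of a bus watcher for a write-back cache.  The direct-mapped
geometry, the names, and the choice to supply data rather than force a
retry are all assumptions made for the sketch, not a description of any
particular machine.

    /* Bus watcher: every bus read is checked against the cache tags. */
    #include <stdbool.h>
    #include <stdio.h>

    #define LINES      64
    #define LINE_WORDS 4

    struct line {
        bool     valid, dirty;
        unsigned tag;
        unsigned data[LINE_WORDS];
    };

    static struct line cache[LINES];
    static unsigned memory[4096];

    enum snoop_result { SNOOP_MISS, SNOOP_SUPPLY, SNOOP_RETRY };

    /* Called for every read seen on the bus, e.g. a DMA device reading memory. */
    static enum snoop_result snoop_bus_read(unsigned addr, unsigned *word_out)
    {
        unsigned index = (addr / LINE_WORDS) % LINES;
        unsigned tag   = addr / (LINE_WORDS * LINES);
        struct line *l = &cache[index];

        if (!l->valid || l->tag != tag)
            return SNOOP_MISS;           /* not cached: memory has the data */
        if (!l->dirty)
            return SNOOP_MISS;           /* clean copy: memory is up to date */

        /* The cache must intervene: either supply the word in place of
         * memory (done here), or write the line back and make the device
         * retry the transaction (SNOOP_RETRY, not modeled in this sketch). */
        *word_out = l->data[addr % LINE_WORDS];
        return SNOOP_SUPPLY;
    }

    int main(void)
    {
        unsigned w = 0;
        cache[1].valid   = true;
        cache[1].dirty   = true;
        cache[1].tag     = 0;
        cache[1].data[2] = 0xbeef;
        memory[6] = 0xdead;              /* stale value still in memory */

        if (snoop_bus_read(6, &w) == SNOOP_SUPPLY)
            printf("cache intervened, supplied 0x%x\n", w);
        return 0;
    }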
frazier@oahu.cs.ucla.edu (Greg Frazier) (06/22/89)
In article <41770@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>Looking at the RISC trend, it seems natural to assume that the next
>step is to have a writeback cache with no "snooping" (as it has been
>called) for either I/O reads OR writes, and solve the problem in
>software.
>
>-Stan

I really don't want to start a "I know what RISC _really_ is" sort of
argument, but the RISC philosophy would only put the cache consistency
functions in software if that made the system faster.  The basic idea of
RISC is hardware minimization -> speed, not hardware minimization for
the sake of minimization.  Since one of the keys to high-speed computing
is keeping the memory "close" to the processor, I doubt moving the
caching functions to software would ever be a win.

&&&&&&&&&&&&&#######################((((((((((((((((((((((
Greg Frazier	    o	    Internet: frazier@CS.UCLA.EDU
CS dept., UCLA	   /\	    UUCP: ...!{ucbvax,rutgers}!ucla-cs!frazier
	       ----^/----
		   /
henry@utzoo.uucp (Henry Spencer) (06/22/89)
In article <25114@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
>>Looking at the RISC trend, it seems natural to assume that the next
>>step is to have a writeback cache with no "snooping" (as it has been
>>called) for either I/O reads OR writes, and solve the problem in
>>software.
>
>... the RISC philosophy would only put the
>cache consistency functions in software if that made the system
>faster.  The basic idea of RISC is hardware minimization -> speed,
>not hardware minimization for the sake of minimization...

Making things easier to build generally makes them faster, because more
effort can be invested in making them fast and there are fewer
constraints that have to be observed.  It is verifiably true that
unsnoopy caches are easier to build than snoopy ones.  The real question
is, how much penalty is incurred by dealing with the issue in software,
and do the benefits exceed the costs?

If you read the IBM 360/370 Principles of Operation book, you will find
elaborate wording defining what you can and cannot get away with in
self-modifying instruction sequences.  The consensus today is that it is
better not to try to solve that problem in hardware, i.e. that snoopy
instruction prefetchers are not worthwhile.  Which way the tradeoff goes
for caches, I'm not sure.
--
NASA is to spaceflight as the |  Henry Spencer at U of Toronto Zoology
US government is to freedom.  |  uunet!attcan!utzoo!henry henry@zoo.toronto.edu
mash@mips.COM (John Mashey) (06/25/89)
In article <25114@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
>In article <41770@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>>Looking at the RISC trend, it seems natural to assume that the next
>>step is to have a writeback cache with no "snooping" (as it has been
>>called) for either I/O reads OR writes, and solve the problem in
>>software.
>>
>>-Stan
>
>I really don't want to start a "I know what RISC _really_ is"
>sort of argument, but the RISC philosophy would only put the
>cache consistency functions in software if that made the system
>faster.  The basic idea of RISC is hardware minimization -> speed,
>not hardware minimization for the sake of minimization.  Since
>one of the keys to high-speed computing is keeping the memory
>"close" to the processor, I doubt moving the caching functions
>to software would ever be a win.

You might be surprised; in some cases, it's a perfectly reasonable
tradeoff.  For example, although R3000s permit external invalidation of
the data cache (to allow I/O input coherency, for example), about the
only systems that use it are multiprocessors, which want it for other
reasons.  Of course, the primary data cache is write-thru, which means
you don't have to flush it out to memory.

Suppose you had a system with a 1-level cache, and the choice of
snooping the I/O bus, or not:

Snoop: extra hardware watches I/O for memory writes, and either stalls
the CPU to snoop, or snoops in an (extra) set of duplicate tags.
	Hardware cost: some
	Performance cost: whatever degradation of the CPU happens when it
	wants to access the data cache and the snooper is in there doing
	something.

No snoop:
	Hardware cost: none
	Performance cost: the operating system needs to flush a cache
	page either before or after doing a read, or it's got to use
	uncached accesses to retrieve the data (which is usually slower,
	for linesize > 1 word, but also doesn't pollute the cache), or it
	can get sneaky, like using timestamp algorithms and occasionally
	flushing the cache to "clean" the entire freelist, which can then
	absorb DMA inputs without requiring further flushes.

Depending on the kind of system you're doing, you can probably justify
either answer.  However, I think you'll find that the extra hardware can
often be hard to justify, at least in smaller systems.
[Numbers-running left as exercise for the reader.  10 points :-)]
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
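As a rough illustration of the "no snoop" alternative (coherence managed
by software), the following C sketch invalidates the cache lines that
cover a DMA input buffer.  The cache geometry and function names are
invented for the sketch; a real kernel would use its machine's own
cache-control operations rather than this toy model.

    /* Software coherence around DMA input: invalidate the buffer's lines. */
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define LINE_BYTES 16
    #define LINES      256

    struct line { bool valid; unsigned long tag; };
    static struct line dcache[LINES];

    /* Invalidate every cache line that could cover [addr, addr+len). */
    static void dcache_invalidate_range(uintptr_t addr, size_t len)
    {
        uintptr_t first = addr / LINE_BYTES;
        uintptr_t last  = (addr + len - 1) / LINE_BYTES;
        for (uintptr_t la = first; la <= last; la++)
            dcache[la % LINES].valid = false;   /* direct-mapped for the sketch */
    }

    /* Pretend DMA: the device writes straight to memory behind the cache. */
    static void fake_dma_input(char *buf, size_t len)
    {
        memset(buf, 'X', len);
    }

    int main(void)
    {
        static char buf[4096];

        dcache[0].valid = true;              /* pretend parts of buf are cached */
        fake_dma_input(buf, sizeof buf);
        /* Before the CPU looks at the buffer (or before starting the
         * transfer), invalidate its lines so later loads see the DMA data
         * rather than stale cached copies. */
        dcache_invalidate_range((uintptr_t)buf, sizeof buf);

        printf("line 0 valid after flush: %d, first byte: %c\n",
               dcache[0].valid, buf[0]);
        return 0;
    }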
rec@dg.dg.com (Robert Cousins) (06/27/89)
In article <95@altos86.Altos.COM> dtynan@altos86.Altos.COM (Dermot Tynan) writes:
>In article <799@acorn.co.uk>, SFurber@acorn.co.uk writes:
>> At first sight a write buffer looks a lot simpler to build than a write-back
>> cache, because of the flushing issues involved in context switching or
>> paging with the latter.  However Jouppi (Proceedings of 16th International
>> Symposium on Computer Architecture, p. 287) states that for similar
>> performance "A write-back cache is a simpler design...".
>> Steve Furber (sfurber@acorn.uucp)
>That's nice.  How about a little proof, for those of us who don't happen to
>have the proceedings near at hand...  What did he base his arguments on?
>	- Der

I can offer a hand-waving argument for this simply:

The logic in a write-back cache must handle a few cases:

	1  write with dirty writeback
	2  write without writeback
	3  read with miss and dirty writeback
	4  read with miss but no writeback
	5  read with hit

A write-through cache must handle a more limited number of cases:

	6  write with update to cache [write to resident line]
	7  write with invalidate [write to non-resident line]
	8  read with miss
	9  read with hit

(Note: the two write cases may be considered identical in many
implementations.)

At this point, a write-through cache seems simpler.  However, if the
line size is greater than a single word, the number of states increases
substantially.  Specifically, the case in (7) becomes:

	cache line read,
	word write to cache, (may take place simultaneously with below)
	word write to memory

which is substantially more complex and is a subset of the operations
required for a write-back cache in the same situation.  In fact, when
viewed in more detail, the line-length issue makes the complexity
approximately equal.

A second point to ponder is that caches can be viewed as two-level
beasts: a CPU interface and a bus interface.  The CPU interface is
responsible for handling CPU requests, judging if a hit has occurred,
and similar things.  The bus interface listens to the CPU interface and
waits to be told to fetch a new line.  The fetch operation involves
writing the old line to memory if dirty and fetching the new line.  This
fetch operation will take place for all misses -- read or write.
Compare this with the write-through approach: the CPU interface appears
to be about the same, but the bus interface has greater complexity since
it must handle not only the line fetch, but also buffered writes.
Potentially there can be multiple outstanding writes.

While I tend to view caches as a necessary evil, a truly optimized
caching scheme is a complex beast.

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.
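The five write-back cases above can be compressed into a few dozen lines
of C for a direct-mapped cache with one-word lines.  The sketch below is
only an illustration of the state handling (its geometry and names are
made up), but it shows where the dirty-writeback path enters both the
read-miss and write-miss cases.

    /* Toy write-back cache covering cases 1-5 from the list above. */
    #include <stdbool.h>
    #include <stdio.h>

    #define LINES 16

    struct line { bool valid, dirty; unsigned tag, data; };
    static struct line cache[LINES];
    static unsigned memory[1024];

    static unsigned access(unsigned addr, bool is_write, unsigned wdata)
    {
        struct line *l = &cache[addr % LINES];
        unsigned tag = addr / LINES;
        bool hit = l->valid && l->tag == tag;

        if (!hit) {
            if (l->valid && l->dirty)    /* cases 1 and 3: dirty writeback */
                memory[l->tag * LINES + (addr % LINES)] = l->data;
            l->data  = memory[addr];     /* allocate the new line */
            l->tag   = tag;
            l->valid = true;
            l->dirty = false;
        }
        if (is_write) {                  /* cases 1 and 2: write hits cache only */
            l->data  = wdata;
            l->dirty = true;
            return wdata;
        }
        return l->data;                  /* cases 3, 4 and 5: read */
    }

    int main(void)
    {
        access(3, true, 42);             /* write miss, no writeback (case 2) */
        access(3 + LINES, true, 99);     /* conflicting write: dirty writeback (case 1) */
        printf("memory[3] = %u\n", memory[3]);       /* 42 was written back */
        printf("read = %u\n", access(3, false, 0));  /* read miss, dirty writeback (case 3) */
        return 0;
    }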
rodman@mfci.UUCP (Paul Rodman) (06/28/89)
In article <195@dg.dg.com> rec@dg.UUCP (Robert Cousins) writes:
>7  write with invalidate [write to non-resident line]
>
>At this point, a write-through cache seems simpler.  However, if the
>line size is greater than a single word, the number of states increases
>substantially.  Specifically, the case in (7) becomes:
>
>	cache line read,
>	word write to cache, (may take place simultaneously with below)
>	word write to memory

I.e. a read-modify-write to the cache, along with the memory write.  Not
all write-thru caches will require an RMW even if the line size is
larger than the word size.  Some will store a separate copy of the tag
(or cache index) with each word.  If you want a byte-writable cache, you
probably _will_ do RMWs.  :-)

-------

Another point that I haven't seen yet: write-back caches are vulnerable
to SRAM soft errors.  Many write-thru systems will force a miss if a
parity error on a cache cell is detected, in an attempt to refresh the
contents from memory.  With a write-back cache, if a parity error is
detected and the dirty bit is set, you've lost the only copy of the
data.  Not a big deal, but worth noticing.

Paul K. Rodman
rodman@mfci.uucp
__... ...__   _.. .   _._ ._ .____ __.. ._
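A tiny C sketch of the forced-miss recovery trick for write-thru caches
(the parity flag here is a stand-in for real parity checking, and the
geometry is invented): a load that hits a line with bad parity is simply
treated as a miss and refetched from memory, something a write-back
cache cannot do once the line is dirty.

    /* Force a miss on parity error and refetch the good copy from memory. */
    #include <stdbool.h>
    #include <stdio.h>

    #define LINES 16

    struct line { bool valid; unsigned tag, data; bool parity_ok; };
    static struct line cache[LINES];
    static unsigned memory[1024];

    static unsigned load(unsigned addr)
    {
        struct line *l = &cache[addr % LINES];
        unsigned tag = addr / LINES;
        bool hit = l->valid && l->tag == tag;

        if (hit && !l->parity_ok)
            hit = false;                 /* soft error: treat as a miss */

        if (!hit) {                      /* refetch the good copy from memory */
            l->data = memory[addr];
            l->tag = tag;
            l->valid = true;
            l->parity_ok = true;
        }
        return l->data;
    }

    int main(void)
    {
        memory[9] = 123;
        load(9);                         /* fill the line */
        cache[9].parity_ok = false;      /* simulate an SRAM soft error */
        printf("reload after parity error: %u\n", load(9));
        return 0;
    }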
hankd@pur-ee.UUCP (Hank Dietz) (06/28/89)
In article <195@dg.dg.com> rec@dg.UUCP (Robert Cousins) writes:
>In article <95@altos86.Altos.COM> dtynan@altos86.Altos.COM (Dermot Tynan) writes:
>>In article <799@acorn.co.uk>, SFurber@acorn.co.uk writes:
>>> At first sight a write buffer looks a lot simpler to build than a write-back
>>> cache, because of the flushing issues involved in context switching or
>>> paging with the latter.  However Jouppi (Proceedings of 16th International ...
>>[Comment wanting proof of lower complexity]
...
>[Comments giving arguments for lower complexity]

Hmmm.  It seems to me that there are at least three, not two,
alternatives: write-through, write-back, and lazy-write.  Using the
lazy-write idea, you have the cache watch for free memory bus cycles and
execute a "pending" entry write whenever it finds a free cycle... of
course, if there are lots of free cycles this is equivalent to
write-through, whereas it is equivalent to write-back if there are no
free cycles.  If you think about it, lazy-write can be a really big win
with only a little more circuitry.  Yes, lazy-write is MORE circuitry
;-).

>While I tend to view caches as a necessary evil, a truly optimized
>caching scheme is a complex beast.

Not necessarily complex in hardware.  Making the cache HW be more
explicitly controlled by the compiler is a big win in reducing HW
complexity while reaping big performance gains.  This sounds a lot like
the argument for RISC processors, doesn't it?  The most complete picture
is C-H Chi's PhD thesis, "Compiler-Driven Cache Management Using A State
Level Transition Model," Purdue University School of EE, May 1989...
Chi and I have a number of papers published on this sort of thing.

-hankd@ee.ecn.purdue.edu
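Here is a rough C sketch of the lazy-write behaviour described above
(names and geometry invented for the sketch): dirty lines sit as
"pending" writes, and each free bus cycle retires one of them, so an
idle bus makes the cache behave like write-through and a saturated bus
makes it behave like write-back.

    /* Lazy-write: drain dirty lines opportunistically on free bus cycles. */
    #include <stdbool.h>
    #include <stdio.h>

    #define LINES 16

    struct line { bool valid, dirty; unsigned tag, data; };
    static struct line cache[LINES];
    static unsigned memory[1024];

    static void store(unsigned addr, unsigned data)
    {
        /* Conflict handling omitted: a real cache would first deal with a
         * different dirty line already occupying this slot. */
        struct line *l = &cache[addr % LINES];
        l->tag   = addr / LINES;
        l->data  = data;
        l->valid = true;
        l->dirty = true;                 /* mark as pending; don't touch memory yet */
    }

    /* Called once per cycle; 'bus_free' says whether memory is otherwise idle. */
    static void lazy_write_cycle(bool bus_free)
    {
        if (!bus_free)
            return;
        for (int i = 0; i < LINES; i++) {
            if (cache[i].valid && cache[i].dirty) {
                memory[cache[i].tag * LINES + i] = cache[i].data;
                cache[i].dirty = false;  /* clean now: no writeback needed later */
                return;                  /* one write per free bus cycle */
            }
        }
    }

    int main(void)
    {
        store(4, 11);
        store(5, 22);
        lazy_write_cycle(false);         /* busy bus: nothing drains */
        lazy_write_cycle(true);          /* free cycle: one dirty line goes out */
        lazy_write_cycle(true);
        printf("memory[4]=%u memory[5]=%u\n", memory[4], memory[5]);
        return 0;
    }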
slackey@bbn.com (Stan Lackey) (06/28/89)
In article <918@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes:
>In article <195@dg.dg.com> rec@dg.UUCP (Robert Cousins) writes:
>>At this point, a write-through cache seems simpler.  However, if the
>>line size is greater than a single word, the number of states increases
>>[case assuming write-allocate is implemented]
>Not all write-thru caches will require an RMW even if the line size is larger
>than the word size.  Some will store a separate copy of the tag (or cache
>index) with each word.

If the cache does not allocate-on-write, and the memory supports
byte-write, the RMW in the cache is not necessary, i.e. the bytes
written on a miss don't write the cache, just memory.  This just moves
the RMW to the memory, you may say?  Lots of systems have RMW in the
memory anyway so that I/O can do writes with data sizes smaller than the
line size.

>Write-back caches are vulnerable to SRAM soft errors.

I've heard of mainframes with ERCC in the cache.

Stan
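A small C sketch of the no-allocate-on-write policy mentioned above
(geometry and names invented for illustration): a write that misses the
write-through cache updates memory only, so no line fetch and no
read-modify-write of the cache is needed.

    /* Write-through, no-allocate-on-write: write misses bypass the cache. */
    #include <stdbool.h>
    #include <stdio.h>

    #define LINES 16

    struct line { bool valid; unsigned tag, data; };
    static struct line cache[LINES];
    static unsigned memory[1024];

    static void store(unsigned addr, unsigned data)
    {
        struct line *l = &cache[addr % LINES];
        unsigned tag = addr / LINES;

        memory[addr] = data;             /* write-through: memory always updated */
        if (l->valid && l->tag == tag)
            l->data = data;              /* hit: keep the cached copy current */
        /* miss: do nothing -- no allocation, no RMW of the line */
    }

    int main(void)
    {
        store(5, 7);                     /* write miss: goes to memory only */
        printf("memory[5]=%u, line valid=%d\n", memory[5], cache[5].valid);
        return 0;
    }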
roelof@idca.tds.PHILIPS.nl (R. Vuurboom) (06/29/89)
In article <12070@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>Hmmm.  It seems to me that there are at least three, not two, alternatives:
>write-through, write-back, and lazy-write.  Using the lazy-write idea, you
>have the cache watch for free memory bus cycles and execute a "pending"
>entry write whenever it finds a free cycle... of course, if there are lots

The idea sounds good.  I'm just wondering about the terminology.  A lazy
operation (scheme) is generally defined as a scheme where a particular
operation is carried out only when it can be delayed no longer.  In
other words, the general definition of what lazy-write would be
describes exactly what write-back currently means.  What you describe
with lazy-write sounds something like a "background write".

Adding to the semantic confusion is of course the observation that in
order to write through to memory one has to write back to memory :-).

If we all could start over again we might have had:

	- write through    (was write through)
	- lazy write       (was write back)
	- background write (was lazy write)

Am I write or am I wrong? :-)  (Sorry, couldn't resist)
--
Roelof Vuurboom  SSP/V3  Philips TDS Apeldoorn, The Netherlands  +31 55 432226
domain: roelof@idca.tds.philips.nl          uucp: ...!mcvax!philapd!roelof
4rst@unmvax.cs.unm.edu (Forrest Black) (01/22/91)
Can anyone provide me with information on the size/description of
data/inst caches on the following machines?

	uVax II, III, VS2000, II GPX, etc.
	SparcStation 1, 1+, 2
	Sun 3/50, 60, 160
	DecStation 5000/200 (MIPS)
	SGI 4D/340 VGX

Please send e-mail only.  I'll post a summary to comp.arch if requested.
Any information greatly appreciated.  Thanks in advance.

4rst@turing.cs.unm.edu (Forrest Black)
University of New Mexico, CS Department