[comp.arch] Caches

SFurber@acorn.co.uk (06/20/89)

Daniel Stodolsky writes:

> When memory latency (I/O) dominates in an application, it seems like
> write-back caches should be a big win, particularly on single processor
> machines where you don't have to worry about cache coherency.
>
> So why don't we see more write-back D-caches? 

A write buffer fixes the memory latency problem; a write-back cache is
needed only if memory bandwidth is also an issue.
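
[To put some invented numbers on that:  suppose 15% of instructions
are one-word stores.  A write-through cache then puts 0.15 words per
instruction of write traffic on the bus no matter what; a write buffer
hides the latency of each of those writes, but not their bandwidth.
A write-back cache with 4-word lines, a miss every 20 instructions,
and half of all victim lines dirty writes back only
0.05 x 0.5 x 4 = 0.10 words per instruction.]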

At first sight a write buffer looks a lot simpler to build than a write-back
cache, because of the flushing issues involved in context switching or
paging with the latter. However Jouppi (Proceedings of 16th International
Symposium on Computer Architecture, p. 287) states that for similar
performance "A write-back cache is a simpler design...".

Steve Furber (sfurber@acorn.uucp)

dtynan@altos86.Altos.COM (Dermot Tynan) (06/21/89)

In article <799@acorn.co.uk>, SFurber@acorn.co.uk writes:
> 
> At first sight a write buffer looks a lot simpler to build than a write-back
> cache, because of the flushing issues involved in context switching or
> paging with the latter. However Jouppi (Proceedings of 16th International
> Symposium on Computer Architecture, p. 287) states that for similar
> performance "A write-back cache is a simpler design...".
> 
> Steve Furber (sfurber@acorn.uucp)

That's nice.  How about a little proof, for those of us who don't happen to
have the proceedings near at hand...  What did he base his arguments on?
							- Der

-- 
	dtynan@altos86.Altos.COM		(408) 946-6700 x4237
	Dermot Tynan,  Altos Computer Systems,  San Jose, CA   95134

    "Far and few, far and few, are the lands where the Jumblies live..."

slackey@bbn.com (Stan Lackey) (06/22/89)

>In article <799@acorn.co.uk>, SFurber@acorn.co.uk writes:
>> At first sight a write buffer looks a lot simpler to build than a write-back
>> cache, because of the flushing issues involved in context switching or
>> ...

Part of the implementation difficulty with writeback caches arises in
architectures that support DMA which does not go through the cache.
This includes bus-based systems, where the CPU/cache is packaged as a
bus device, and the memory and I/O controllers are also bus devices.
In this case, DMA cannot simply assume it can read memory, as the
cache may contain the only up-to-date copy of the data.

To solve this problem, two mechanisms over and above a writethrough
cache are needed.  (1) A bus watcher in the cache that looks for reads
on the bus, and accesses the cache on every bus transaction to see if
it contains hot data.  (2) Some way for the cache to substitute its
data in place of the memory's, or to tell the device to retry the
transaction after the cache has written the hot data back to memory,
or some similar thing.
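
For illustration only, here are (1) and (2) in C-as-pseudo-hardware,
for a direct-mapped write-back cache; all the names and sizes are made
up, and a real design does this in logic, not software:

	#include <stdbool.h>
	#include <stdint.h>

	#define NLINES     1024
	#define LINE_WORDS 4

	struct line {
		uint32_t tag;
		bool     valid, dirty;
		uint32_t data[LINE_WORDS];
	};
	static struct line cache[NLINES];

	/* Mechanism (1): called for every read seen on the bus.  If the
	 * cache holds a dirty ("hot") copy, mechanism (2) kicks in:
	 * supply the data in place of memory.  A real design might
	 * instead tell the device to retry after a writeback. */
	bool snoop_bus_read(uint32_t word_addr, uint32_t *word_out)
	{
		uint32_t idx = (word_addr / LINE_WORDS) % NLINES;
		uint32_t tag = word_addr / (LINE_WORDS * NLINES);
		struct line *l = &cache[idx];

		if (l->valid && l->dirty && l->tag == tag) {
			*word_out = l->data[word_addr % LINE_WORDS];
			return true;	/* cache intervenes */
		}
		return false;		/* memory's copy is good enough */
	}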

Now, any cache must be aware of DMA activity and either invalidate or
write the new data into the cache when a device writes to memory.
This mechanism is typically extended to provide (1) above.  In fact,
some systems, to reduce cache contention, implement the tag store
dual-ported (either as a dual-ported RAM, or as two copies of the same
data).  Further, (2) is handled by adding more kinds of bus
transactions, and sequences that the cache must perform.

Whether writeback or writethrough is chosen depends upon the intended
application of the product, the preferences of the designers, and
often on things even less tangible than that, like the way previous
generations of the product were done.

It is not safe to assume that writeback is always faster than
writethrough.  In some cases, when there are large data sets (the data
set is more than a couple of times larger than the cache and there is
not much locality of data usage) and data is usually written just once
(as in a matrix transpose), there is an interaction between writeback
and large cache lines that makes it slower than writethrough.  This
happens with OS's that clear virtual memory before giving it to a
process, for example.
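
A toy traffic count makes the write-once case concrete (the sizes and
policies here are assumptions for illustration, nothing more):

	#include <stdio.h>

	int main(void)
	{
		long words = 1L << 20;	/* data set: 1M words, >> cache */
		long line  = 8;		/* words per cache line */

		/* write-back with allocate-on-write: every line written
		 * once is first fetched, then eventually written back. */
		long wb = (words / line) * line * 2;

		/* write-through, no allocate-on-write: each store just
		 * goes to memory; nothing is fetched. */
		long wt = words;

		printf("write-back : %ld words on the bus\n", wb); /* 2097152 */
		printf("write-thru : %ld words on the bus\n", wt); /* 1048576 */
		return 0;
	}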

Looking at the RISC trend, it seems natural to assume that the next
step is to have a writeback cache with no "snooping" (as it has been
called) for either I/O reads OR writes, and solve the problem in
software.

-Stan

frazier@oahu.cs.ucla.edu (Greg Frazier) (06/22/89)

In article <41770@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>Looking at the RISC trend, it seems natural to assume that the next
>step is to have a writeback cache with no "snooping" (as it has been
>called) for either I/O reads OR writes, and solve the problem in
>software.
>
>-Stan

I really don't want to start a "I know what RISC _really_ is"
sort of argument, but the RISC philosophy would only put the
cache consistency functions in software if that made the system
faster.  The basic idea of RISC is hardware minimization -> speed,
not hardware minimization for the sake of minimization.  Since
one of the keys to high-speed computing is keeping the memory
"close" to the processor, I doubt moving the caching functions
to software would ever be a win.
&&&&&&&&&&&&&#######################((((((((((((((((((((((
Greg Frazier	    o	Internet: frazier@CS.UCLA.EDU
CS dept., UCLA	   /\	UUCP: ...!{ucbvax,rutgers}!ucla-cs!frazier
	       ----^/----
		   /

henry@utzoo.uucp (Henry Spencer) (06/22/89)

In article <25114@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
>>Looking at the RISC trend, it seems natural to assume that the next
>>step is to have a writeback cache with no "snooping" (as it has been
>>called) for either I/O reads OR writes, and solve the problem in
>>software.
>
>... the RISC philosophy would only put the
>cache consistency functions in software if that made the system
>faster.  The basic idea of RISC is hardware minimization -> speed,
>not hardware minimization for the sake of minimization...

Making things easier to build generally makes them faster, because more
effort can be invested in making them fast and there are fewer constraints
that have to be observed.  It is verifiably true that unsnoopy caches are
easier to build than snoopy ones.  The real question is, how much penalty
is incurred by dealing with the issue in software, and do the benefits
exceed the costs?  If you read the IBM 360/370 Principles of Operation
book, you will find elaborate wording defining what you can and cannot
get away with in self-modifying instruction sequences.  The consensus
today is that it is better not to try to solve that problem in hardware,
i.e. that snoopy instruction prefetchers are not worthwhile.  Which way
the tradeoff goes for caches, I'm not sure.
-- 
NASA is to spaceflight as the  |     Henry Spencer at U of Toronto Zoology
US government is to freedom.   | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

mash@mips.COM (John Mashey) (06/25/89)

In article <25114@shemp.CS.UCLA.EDU> frazier@cs.ucla.edu (Greg Frazier) writes:
>In article <41770@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>>Looking at the RISC trend, it seems natural to assume that the next
>>step is to have a writeback cache with no "snooping" (as it has been
>>called) for either I/O reads OR writes, and solve the problem in
>>software.
>>
>>-Stan
>
>I really don't want to start a "I know what RISC _really_ is"
>sort of argument, but the RISC philosophy would only put the
>cache consistency functions in software if that made the system
>faster.  The basic idea of RISC is hardware minimization -> speed,
>not hardware minimization for the sake of minimization.  Since
>one of the keys to high-speed computing is keeping the memory
>"close" to the processor, I doubt moving the caching functions
>to software would ever be a win.

You might be surprised; in some cases, it's a perfectly reasonable
tradeoff.

For example, although R3000s permit external invalidation of the 
data cache (to allow I/O input coherency, for example), about the
only systems that use it are multiprocessors, which want it for other
reasons.  Of course, the primary data cache is write-thru, which
means you don't have to flush it out to memory.

Suppose you had a system with a 1-level cache, and the choice of
snooping the I/O bus, or not:
	Snoop:
		extra hardware watches I/O for memory writes,
		and either stalls the CPU to snoop, or snoops
		in an (extra) set of duplicate tags.
		Hardware cost: some
		Performance cost: whatever degradation of the CPU happens
			from the times it wants to access the data cache
			while the snooper is in there doing something.

	No snoop:
		Hardware cost: none
		Performance cost: the operating system needs to flush a
			cache page either before or after doing a read,
			or it's got to use uncached accesses to retrieve
			the data (which is usually slower, for linesize > 1
			word, but also doesn't pollute the cache), or
			it can get sneaky, like using timestamp algorithms
			and occasionally flushing the cache to "clean"
			the entire freelist, which can then absorb DMA
			inputs without requiring further flushes.  (A
			sketch of the flush-around-DMA case appears below.)
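
For concreteness, the flush-around-DMA case might look like this from
the driver's side; cache_wbflush_page(), dma_start_read() and
dma_wait() are invented names, not any particular OS's:

	/* Write back and invalidate the page, then let the device
	 * write memory directly; afterwards every CPU reference
	 * misses and refills with the fresh data. */
	extern void cache_wbflush_page(void *page);
	extern void dma_start_read(void *dst, long nbytes);
	extern void dma_wait(void);

	void read_into(void *page, long nbytes)
	{
		cache_wbflush_page(page);	/* no stale or dirty lines left */
		dma_start_read(page, nbytes);	/* device -> memory, no snoop */
		dma_wait();
	}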

Depending on the kind of system you're doing, you can probably justify
either answer.  However, I think you'll find that the extra hardware
can often be hard to justify, at least in smaller systems.
[Numbers-running left as exercise for the reader. 10 points :-)]
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

rec@dg.dg.com (Robert Cousins) (06/27/89)

In article <95@altos86.Altos.COM> dtynan@altos86.Altos.COM (Dermot Tynan) writes:
>In article <799@acorn.co.uk>, SFurber@acorn.co.uk writes:
>> At first sight a write buffer looks a lot simpler to build than a write-back
>> cache, because of the flushing issues involved in context switching or
>> paging with the latter. However Jouppi (Proceedings of 16th International
>> Symposium on Computer Architecture, p. 287) states that for similar
>> performance "A write-back cache is a simpler design...".
>> Steve Furber (sfurber@acorn.uucp)
>That's nice.  How about a little proof, for those of us who don't happen to
>have the proceedings near at hand...  What did he base his arguments on?
>							- Der

I can offer a simple hand-waving argument.  The logic in a write-back
cache must handle a few cases:

1	write with dirty writeback
2	write without writeback
3	read with miss and dirty writeback
4	read with miss but no writeback
5	read with hit

A write through cache must handle a more limited number of cases:

6	write with update to cache [write to resident line]
7	write with invalidate [write to non-resident line] 
8	read with miss
9	read with hit
	(note:  the two write cases may be considered identical in many
	implementations.)

At this point, a write through cache seems simpler.  However, if the
line size is greater than a single word, the number of states increases
substantially.  Specifically, the case in (7) becomes:

	cache line read,
	word write to cache, (may take place simultaneously with below)
	word write to memory

which is substantially more complex, and is a subset of the operations
required for a write-back cache in a similar situation.  In fact, when
viewed in more detail, the line-length issue makes the two designs
approximately equal in complexity.
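
As a sanity check on the case analysis, here are cases 1-5 folded into
one controller routine, in illustrative C (direct-mapped, whole-line
transfers; fetch_line() and writeback_line() are stand-ins for the bus
interface):

	#include <stdbool.h>
	#include <stdint.h>

	struct line { uint32_t tag; bool valid, dirty; };

	extern void fetch_line(struct line *l, uint32_t tag);
	extern void writeback_line(struct line *l);

	void wb_access(struct line *l, uint32_t tag, bool is_write)
	{
		bool hit = l->valid && l->tag == tag;

		if (!hit) {
			if (l->valid && l->dirty)
				writeback_line(l);  /* the dirty-writeback cases */
			fetch_line(l, tag);         /* allocate on any miss */
		}
		if (is_write)
			l->dirty = true;            /* hits and misses alike */
	}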

A second point to ponder is that caches can be viewed as two-level
beasts:  a CPU interface and a bus interface.  The CPU interface is
responsible for handling CPU requests, judging whether a hit has
occurred, and the like.  The bus interface listens to the CPU
interface and waits to be told to fetch a new line.  The fetch
operation involves writing the old line to memory if it is dirty and
fetching the new line.  This fetch operation takes place for all
misses -- read or write.

Compare this with the writethrough approach:  the CPU interface appears
to be about the same, but the bus interface has greater complexity,
since it must handle not only the line fetch but also buffered writes.
Potentially there can be multiple outstanding writes.
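
A sketch of that buffered-write machinery, with invented names and an
arbitrary depth of four; the CPU side pushes, the bus side drains:

	#include <stdbool.h>
	#include <stdint.h>

	#define WBUF_DEPTH 4

	struct wentry { uint32_t addr, data; };
	static struct wentry wbuf[WBUF_DEPTH];
	static int head, tail, count;

	extern void mem_write(uint32_t addr, uint32_t data);

	bool wbuf_push(uint32_t addr, uint32_t data)	/* CPU interface */
	{
		if (count == WBUF_DEPTH)
			return false;		/* full: the CPU must stall */
		wbuf[tail].addr = addr;
		wbuf[tail].data = data;
		tail = (tail + 1) % WBUF_DEPTH;
		count++;
		return true;
	}

	void wbuf_drain_one(void)			/* bus interface */
	{
		if (count > 0) {
			mem_write(wbuf[head].addr, wbuf[head].data);
			head = (head + 1) % WBUF_DEPTH;
			count--;
		}
	}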

While I tend to view caches as a necessary evil, a truly optimized caching
scheme is a complex beast.  

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.

rodman@mfci.UUCP (Paul Rodman) (06/28/89)

In article <195@dg.dg.com> rec@dg.UUCP (Robert Cousins) writes:

>7	write with invalidate [write to non-resident line] 
>
>At this point, a write through cache seems simpler.  However, if the
>line size is greater than a single word, the number of states increases
>substantially.  Specifically, the case in (7) becomes:
>
>	cache line read,
>	word write to cache, (may take place simultaneously with below)
>	word write to memory
>

I.e., a read-modify-write to the cache, along with the memory write.

Not all write-thru caches will require an RMW even if the line size is
larger than the word size.  Some will store a separate copy of the tag
(or cache index) with each word.

If you want a byte-writable cache, you probably _will_ do RMWs. :-)
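
The per-word-tag trick in illustrative C (organization and names are
assumptions here, not any shipping cache):

	#include <stdbool.h>
	#include <stdint.h>

	#define NWORDS 4096	/* tags kept per word, not per line */

	struct wslot { uint32_t tag; bool valid; uint32_t data; };
	static struct wslot slot[NWORDS];

	extern void mem_write(uint32_t addr, uint32_t data);

	/* A word write to a non-resident line just claims its slot:
	 * no line fill, hence no read-modify-write. */
	void wt_word_write(uint32_t addr, uint32_t data)
	{
		struct wslot *s = &slot[addr % NWORDS];

		s->tag   = addr / NWORDS;
		s->valid = true;
		s->data  = data;
		mem_write(addr, data);	/* and write through to memory */
	}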

-------

Another point that I haven't seen yet:

Write-back caches are vulnerable to SRAM soft errors.  Many write-thru
systems will force a miss if a parity error on a cache cell is
detected, in an attempt to refresh the contents from memory.  With a
write-back cache, if a parity error is detected and the dirty bit is
set, you've lost the only copy of the data.

Not a big deal, but worth noticing. 


    Paul K. Rodman 
    rodman@mfci.uucp
    __... ...__    _.. .   _._ ._ .____ __.. ._
    

hankd@pur-ee.UUCP (Hank Dietz) (06/28/89)

In article <195@dg.dg.com> rec@dg.UUCP (Robert Cousins) writes:
>In article <95@altos86.Altos.COM> dtynan@altos86.Altos.COM (Dermot Tynan) writes:
>>In article <799@acorn.co.uk>, SFurber@acorn.co.uk writes:
>>> At first sight a write buffer looks a lot simpler to build than a write-back
>>> cache, because of the flushing issues involved in context switching or
>>> paging with the latter. However Jouppi (Proceedings of 16th International
...
>>[Comment wanting proof of lower complexity]
...
>[Comments giving arguments for lower complexity]

Hmmm.  It seems to me that there are at least three, not two, alternatives:
write-through, write-back, and lazy-write.  Using the lazy-write idea, you
have the cache watch for free memory bus cycles and execute a "pending"
entry write whenever it finds a free cycle...  of course, if there are lots
of free cycles this is equivalent to write-through, whereas it is equivalent
to write-back if there are no free cycles.  If you think about it,
lazy-write can be a really big win with only a little more circuitry.  Yes,
lazy-write is MORE circuitry ;-).
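
A rough sketch of the policy, with invented hooks (bus_idle(),
next_pending()) standing in for the real hardware conditions:

	#include <stdbool.h>
	#include <stdint.h>

	struct line { uint32_t tag; bool valid, dirty; };

	extern bool bus_idle(void);		/* a free memory cycle? */
	extern struct line *next_pending(void);	/* some dirty line, or NULL */
	extern void writeback_line(struct line *l);

	void lazy_write_tick(void)	/* imagine this every bus cycle */
	{
		struct line *l;

		if (bus_idle() && (l = next_pending()) != NULL) {
			writeback_line(l);
			l->dirty = false;
		}
		/* lots of free cycles -> behaves like write-through;
		 * no free cycles -> dirty lines go out only on
		 * eviction, i.e. plain write-back. */
	}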

>While I tend to view caches as a necessary evil, a truly optimized caching
>scheme is a complex beast.  

Not necessarily complex in hardware.  Making the cache hardware more
explicitly controlled by the compiler is a big win in reducing hardware
complexity while reaping big performance gains.  This sounds a lot like
the argument for RISC processors, doesn't it?

The most complete picture is C-H Chi's PhD thesis, "Compiler-Driven Cache
Management Using A State Level Transition Model," Purdue University School
of EE, May 1989...  Chi and I have a number of papers published on this sort
of thing.

						-hankd@ee.ecn.purdue.edu

slackey@bbn.com (Stan Lackey) (06/28/89)

In article <918@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes:
>In article <195@dg.dg.com> rec@dg.UUCP (Robert Cousins) writes:
>>At this point, a write through cache seems simpler.  However, if the
>>line size is greater than a single word, the number of states increases
>>[case assuming write-allocate is implemented]
>Not all write-thru caches will require a RMW even if the line size is larger 
>than the word size. Some will store a seperate copy of the tag (or cache 
>index) with each word. 

If the cache does not allocate on write, and the memory supports
byte writes, the RMW in the cache is not necessary; i.e., on a miss
the bytes written go to memory only, not into the cache.  This just
moves the RMW to the memory, you may say?  Lots of systems have RMW
in the memory anyway, so that I/O can do writes with data sizes
smaller than the line size.
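
In illustrative C (invented names, 16-byte lines assumed), the
no-allocate policy comes out to very little:

	#include <stdbool.h>
	#include <stdint.h>

	#define NLINES     1024
	#define LINE_BYTES 16

	struct tagent { uint32_t tag; bool valid; };  /* write-thru: no dirty bit */
	static struct tagent tags[NLINES];

	extern void cache_write_byte(uint32_t addr, uint8_t b);
	extern void mem_write_byte(uint32_t addr, uint8_t b); /* memory does any RMW */

	void wt_byte_write(uint32_t addr, uint8_t b)
	{
		struct tagent *t = &tags[(addr / LINE_BYTES) % NLINES];

		if (t->valid && t->tag == addr / (LINE_BYTES * NLINES))
			cache_write_byte(addr, b);  /* hit: keep cache current */
		/* miss: no allocation, no RMW here */
		mem_write_byte(addr, b);
	}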

>Write-back caches are vulnerable to SRAM soft errors.

I've heard of mainframes with ERCC in the cache.

Stan

roelof@idca.tds.PHILIPS.nl (R. Vuurboom) (06/29/89)

In article <12070@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:

>Hmmm.  It seems to me that there are at least three, not two, alternatives:
>write-through, write-back, and lazy-write.  Using the lazy-write idea, you
>have the cache watch for free memory bus cycles and execute a "pending"
>entry write whenever it finds a free cycle...  of course, if there are lots

The idea sounds good.  I'm just wondering about the terminology.
A lazy operation (scheme) is generally defined as one in which a
particular operation is carried out only when it can be delayed no
longer.

In other words, the general definition of lazy-write describes exactly
what we currently call write-back.  What you describe as lazy-write
sounds more like a "background write".

Adding to the semantic confusion is of course the observation that in
order to write through to memory one has to write back to memory :-).

If we all could start over again we might have had:

- write through (was write through)
- lazy write (was write back)
- background write (was lazy write)

Am I write or am I wrong? :-) (Sorry, couldn't resist)
-- 
Roelof Vuurboom  SSP/V3   Philips TDS Apeldoorn, The Netherlands   +31 55 432226
domain: roelof@idca.tds.philips.nl             uucp:  ...!mcvax!philapd!roelof

4rst@unmvax.cs.unm.edu (Forrest Black) (01/22/91)

Can anyone provide me with information on the size/description of data/inst
caches on the following machines?

	uVax II, III, VS2000, II GPX, etc.
	SparcStation 1, 1+, 2
	Sun 3/50,60,160
	DecStation 5000/200 (MIPS)
	SGI 4D/340 VGX

Please send e-mail only.  I'll post a summary to comp.arch if requested.

Any information greatly appreciated.  Thanks in advance.

4rst@turing.cs.unm.edu	(Forrest Black)
University of New Mexico, CS Department