tbray@watsol.waterloo.edu (Tim Bray) (09/11/89)
In article <1082@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
+You may be running software that has a very low cache hit rate if you
+are doing CAD work or scientific calculations. Take this little loop
+for example:
+
+ SUM = 0.0
+ DO 10 I = 1, 1000000
+ SUM = SUM + VEC(I)
+ 10 CONTINUE
+
+A data cache is *no use at all* for this problem. You will get a
+cache miss on every data access.
Now hold on just a dag-blaggin' minute. I'm a software weenie who's never
built a cache, but I thought I understood how they work. If this is right
obviously I don't at all. Somebody who knows should either debunk this or
explain what's really going on, because I'm probably not alone in my
ignorance.
I thought caches respected the principle of locality. And this code has
really good locality. In fact, I thought they were block- or page-based. And
when VEC(I) hits a page for the first time, it'll be cached, and then it'll
keep hitting the cache (bar nasty context switches, etc.), until VEC(I) moves
off that page. One cache miss per page; in the worst case, if VEC is DOUBLE
PRECISION and the page size is 512 bytes, you do 64 times as well as hitting
main memory on every loop iteration. Nyet?
+Similarly, copying data from one bit
+of memory to another will be limited by the raw memory speed.
Say what?
Tim Bray, New OED Project, U of Waterloo, Ontario
rang@cs.wisc.edu (Anton Rang) (09/11/89)
In article <16306@watdragon.waterloo.edu> tbray@watsol.waterloo.edu (Tim Bray) writes:
>In article <1082@cernvax.UUCP> hjm@cernvax.UUCP (Hubert Matthews) writes:
>+You may be running software that has a very low cache hit rate if you
>+are doing CAD work or scientific calculations. Take this little loop
>+for example:
>+
>+ SUM = 0.0
>+ DO 10 I = 1, 1000000
>+ SUM = SUM + VEC(I)
>+ 10 CONTINUE
>+
>+A data cache is *no use at all* for this problem. You will get a
>+cache miss on every data access.
>
>I thought caches respected the principle of locality. And this code has
>really good locality.

The instruction fetches here have good locality; the data fetches
don't (1 MW of data is being accessed). The problem here is that 1
million words (assuming 1 word per vector element) of memory is being
accessed. No matter HOW your cache is arranged, you are going to have
to go to main memory for every one of those words (conceivably barring
a few in the cache when you start).

In fact, a (data) cache can hurt in this kind of code, because you
have 1 million data fetch times, plus the overhead of the cache
(small, but significant for very large problems).

> And when VEC(I) hits a page for the first time, it'll be cached, [ ... ]
                                                  ^^^^^^^^^^^^^^^

The problem is that caching something is not without cost. Loading
128 words into the cache from main memory is generally just as
expensive as fetching 128 words from main memory directly. The cache
only helps if you're doing multiple fetches of the same memory
location before it's overwritten with other data.

>+Similarly, copying data from one bit
>+of memory to another will be limited by the raw memory speed.
>
>Say what?

A := B, where A and B are 1 MW arrays, will require at least 2 million
memory cycle times (1 million reads, 1 million writes).

+----------------------------------+------------------+
| Anton Rang (grad student)        | "VMS Forever!"   |
| University of Wisconsin--Madison | rang@cs.wisc.edu |
+----------------------------------+------------------+
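[To put rough numbers on Anton's point -- and on the rebuttals that
follow -- here is a back-of-envelope model in C. The timings and the
line size are round-number assumptions, not measurements of any machine.]

#include <stdio.h>

int main(void)
{
    double t_mem = 10.0;  /* cycles per word from main memory (assumed) */
    double t_hit =  1.0;  /* cycles per cache hit (assumed)             */
    int    wpl   =  8;    /* words per cache line (assumed)             */

    /* Anton's case: the line is refilled one word at a time, so a
       sequential sweep can never average better than raw memory speed. */
    double slow_fill = wpl * t_mem;
    printf("word-at-a-time refill: %.2f cycles/element\n",
           (slow_fill + (wpl - 1) * t_hit) / wpl);

    /* If the refill runs at, say, twice the normal cycle rate (static
       column mode, wide buses, or interleaving -- see the follow-ups
       below), the sweep does beat raw memory speed.                    */
    double fast_fill = t_mem + (wpl - 1) * t_mem / 2.0;
    printf("fast refill:           %.2f cycles/element\n",
           (fast_fill + (wpl - 1) * t_hit) / wpl);
    return 0;
}

[With these numbers the word-at-a-time case averages 10.88 cycles per
element against 10 for going straight to memory, while the fast-refill
case averages 6.50 -- which is exactly the disagreement played out in
the next few postings.]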
elm@chilli.Berkeley.EDU (ethan miller) (09/11/89)
In article <RANG.89Sep10184900@derby.cs.wisc.edu> rang@cs.wisc.edu (Anton Rang) writes:
% The problem is that caching something is not without cost. Loading
%128 words into the cache from main memory is generally just as
%expensive as fetching 128 words from main memory directly. The cache
%only helps if you're doing multiple fetches of the same memory
%location before it's overwritten with other data.
It is often faster to load a line of 128 words into the cache in one
sequential burst than to fetch those words from memory individually. If
you can use static column mode on your RAMs, for example, you can get
words from memory about twice as fast as in normal mode. Unless you have
smart hardware that knows successive accesses refer to the same bank and
treats the accesses as static column, it'll be faster to read 128 words
at once. However, the cache overhead might cut into that advantage.
ethan
=================================
ethan miller--cs grad student elm@ginger.berkeley.edu
#include <std/disclaimer.h> {...}!ucbvax!ginger!elm
"I like the Austrian way better." -- Dr. Henry Jones, Jr.
swarren@eugene.uucp (Steve Warren) (09/11/89)
In article <RANG.89Sep10184900@derby.cs.wisc.edu> rang@cs.wisc.edu (Anton Rang) writes:
> The problem is that caching something is not without cost. Loading
>128 words into the cache from main memory is generally just as
>expensive as fetching 128 words from main memory directly. The cache
>only helps if you're doing multiple fetches of the same memory
>location before it's overwritten with other data.

Not if main memory is pipelined and interleaved. In that case the cache
may be filled fairly rapidly since consecutive locations are accessed.

--Steve
-------------------------------------------------------------------------
{uunet,sun}!convex!swarren; swarren@convex.COM
roy@phri.UUCP (Roy Smith) (09/11/89)
> SUM = 0.0
> DO 10 I = 1, 1000000
> SUM = SUM + VEC(I)
> 10 CONTINUE
>
> A data cache is *no use at all* for this problem. You will get a
> cache miss on every data access.

	Well, not quite. True, you're going to miss on all the VEC(I)
references, but you should hit on all the I and SUM references. Rewrite
the loop as follows (and no, I don't care if I got the test-at-the-top or
test-at-the-bottom backwards)

10	IF (I .GT. 1000000) GOTO 20	# fetch I
	I = I + 1			# fetch I, store I
	SUM = SUM + VEC(I)		# fetch I, fetch SUM, store SUM
	GOTO 10
20	CONTINUE

and you can count 4 data fetches (as pseudo-commented above), all of which
should be cache hits. This far outnumbers the 1 cache miss per loop due to
the main data array accesses. Some FORTRAN compilers have even been known
to put constants in memory and refer to them indirectly; if that is true
here, you get another two cache hits for the 1 and the 1000000. You could
argue, of course, that I and SUM might be in registers, but that's a whole
other argument.
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
{att,philabs,cmcl2,rutgers,hombre}!phri!roy -or- roy@alanine.phri.nyu.edu
"The connector is the network"
ttl@astroatc.UUCP (Tony Laundrie) (09/11/89)
In article <31224@ucbvax.BERKELEY.EDU> elm@chilli.Berkeley.EDU (ethan miller) writes:
>In article <RANG.89Sep10184900@derby.cs.wisc.edu> rang@cs.wisc.edu (Anton Rang) writes:
>% The problem is that caching something is not without cost. Loading
>%128 words into the cache from main memory is generally just as
>%expensive as fetching 128 words from main memory directly. The cache
>
>It is often faster to load a line of 128 words into the cache in one
>sequential burst than to fetch those words from memory individually. If
>you can use static column mode on

Loading 128 words into the cache from memory is much faster than loading
them one at a time because the path between cache and memory is usually
many words wide (like 16).
dik@cwi.nl (Dik T. Winter) (09/12/89)
In article <RANG.89Sep10184900@derby.cs.wisc.edu> rang@cs.wisc.edu (Anton Rang) writes:
> In article <16306@watdragon.waterloo.edu> tbray@watsol.waterloo.edu (Tim Bray) writes:
> >+ SUM = 0.0
> >+ DO 10 I = 1, 1000000
> >+ SUM = SUM + VEC(I)
> >+ 10 CONTINUE
> >+
> >+A data cache is *no use at all* for this problem. You will get a
> >+cache miss on every data access.
> >
> >I thought caches respected the principle of locality. And this code has
> >really good locality.
>
> The instruction fetches here have good locality; the data fetches
> don't (1 MW of data is being accessed). The problem here is that 1
> million words (assuming 1 word per vector element) of memory is being
> accessed. No matter HOW your cache is arranged, you are going to have
> to go to main memory for every one of those words (conceivably barring
> a few in the cache when you start).
>
> In fact, a (data) cache can hurt in this kind of code, because you
> have 1 million data fetch times, plus the overhead of the cache
> (small, but significant for very large problems).

This is not entirely true. There are systems where a cache acts more or
less like the paging of VM. Take for instance a system where a cache
line consists of 32 bytes and the width of the bus between memory and
cache is 32 bytes (i.e. in memory a word is 256 (8*32) bits, but for the
processor it is still 16 or 32 bits). In that case fetching a single
operand in the processor will cache a memory word (or a number of
processor words). So, assuming the elements of array VEC are 4 bytes,
there will be a cache miss every eighth element, or a cache hit rate of
87.5%. Changing the index in the loop to (33*I) mod 1000000 will reduce
the hit rate to 0%. Because the bus between the cache and memory is wide
enough, no speed penalty is involved (except for cache overhead).

This is also true (although a bit less so) if the bus between memory and
cache is narrower but the cache is refilled in lines of multiple words,
because the refill of the cache goes in general much faster than you
would see with individual memory accesses for each word (typically
something like one word per cycle, or faster).
--
dik t. winter, cwi, amsterdam, nederland
INTERNET : dik@cwi.nl
BITNET/EARN: dik@mcvax
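[Dik's 87.5%-versus-0% figures are easy to reproduce with a toy
direct-mapped cache simulation. The sketch below is illustrative; the
cache size of 2048 lines is an assumption, while the 32-byte lines and
4-byte elements match his example.]

#include <stdio.h>

#define ELEM   4          /* bytes per array element           */
#define LINE   32         /* bytes per cache line              */
#define NLINES 2048       /* 64KB direct-mapped cache, assumed */
#define N      1000000L

static long tags[NLINES];

static double hitrate(long (*index_fn)(long))
{
    long i, hits = 0;
    for (i = 0; i < NLINES; i++) tags[i] = -1;
    for (i = 0; i < N; i++) {
        long addr = index_fn(i) * ELEM;
        long line = addr / LINE;
        long set  = line % NLINES;
        if (tags[set] == line) hits++;
        else tags[set] = line;       /* miss: refill the line */
    }
    return (double)hits / N;
}

static long seq(long i)     { return i; }
static long strided(long i) { return (33 * i) % N; }

int main(void)
{
    printf("sequential sweep:   %.1f%% hits\n", 100.0 * hitrate(seq));
    printf("(33*I) mod 1000000: %.1f%% hits\n", 100.0 * hitrate(strided));
    return 0;
}

[The sequential sweep hits 87.5% of the time; the stride-33 sweep
essentially never, since consecutive references land 132 bytes apart and
each line is evicted long before it is revisited.]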
roy@phri.UUCP (Roy Smith) (09/12/89)
	Here's a (possibly crazy) idea for cache design. The current EUD
(Example Under Debate) shows that caches just don't work for sequential
access, but we knew that already. You can argue about how doing very wide
memory fetches will preload the cache for sequential accesses, but that's
certainly a second order effect.

	What if you segmented the virtual memory space (Oh no! Not
segmented address spaces again! Shades of Intel!) so that the top bit was
a hint to the cache on probable access patterns? Variables which were
expected to hit the cache a lot (SUM and I in the EUD) would be put in the
"normal" part of the address space. Variables which were expected to be
sequential access and thus never hit (VEC in the EUD) would be put in the
other half of the address space. The cache would know not to bother doing
a tag match on this kind of access. The advantages would be faster access
time (a memory fetch should be faster than a cache miss followed by a
memory fetch) but, more important, it wouldn't cause bogus cache flushes.

	As with every crazy idea, you can expand on this in all sorts of
ways. You might use more than one high order bit to provide lots of
different sorts of hints to the cache. You might want to be able to turn
the hinting on and off, or even make it programmable. Of course, every
bell and whistle adds cost and complexity, and reduces the chance that you
will use the fancy features well, or at all.

	So what do you think? Has this been done before?
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
{att,philabs,cmcl2,rutgers,hombre}!phri!roy -or- roy@alanine.phri.nyu.edu
"The connector is the network"
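[For concreteness, here is a toy model of Roy's hint bit in C -- an
invented sketch with made-up addresses and sizes, not a real design.
References with the top bit set skip the tag match and leave the cache
untouched.]

#include <stdio.h>

#define NOCACHE   0x80000000UL   /* Roy's hint bit (top of the address) */
#define NLINES    256            /* direct-mapped, 32-byte lines (assumed) */
#define LINESHIFT 5

static unsigned long tag[NLINES];
static long misses, bypasses;

static void touch(unsigned long vaddr)
{
    if (vaddr & NOCACHE) {       /* sequential data: no tag match, no  */
        bypasses++;              /* refill, nothing useful displaced   */
        return;
    }
    unsigned long line = vaddr >> LINESHIFT;
    unsigned long set  = line % NLINES;
    if (tag[set] != line) { tag[set] = line; misses++; }
}

int main(void)
{
    unsigned long sum = 0x1000UL;            /* SUM: "normal" space    */
    unsigned long vec = 0x10000UL | NOCACHE; /* VEC: hinted sequential */
    unsigned long i;
    for (i = 0; i < 1000000UL; i++) {
        touch(vec + 8 * i);
        touch(sum);
    }
    printf("cache misses: %ld   bypasses: %ld\n", misses, bypasses);
    return 0;
}

[Run against the EUD's access pattern, the model takes a single miss for
SUM, and none of the million VEC references displaces anything.]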
mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) (09/12/89)
In message <3989@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:

>Here's a (possibly crazy) idea for cache design. The current EUD
>(Example Under Debate) shows that caches just don't work for sequential
>access, but we knew that already.

[ Roy describes a system for which only a portion of the address space
  is run through the cache ]

> So what do you think? Has this been done before?

Along a similar line, the Convex machines cache accesses going to the
scalar unit only. The load/store unit for the vector unit operates
directly from main memory to the vector registers. This is pretty close
to what Roy describes, but it is not controlled by any tags, merely by
which functional unit executes the load instruction.
--
John D. McCalpin - mccalpin@masig1.ocean.fsu.edu
mccalpin@scri1.scri.fsu.edu
mccalpin@delocn.udel.edu
mat@uts.amdahl.com (Mike Taylor) (09/13/89)
In article <3985@phri.UUCP>, roy@phri.UUCP (Roy Smith) writes:
> > SUM = 0.0
> > DO 10 I = 1, 1000000
> > SUM = SUM + VEC(I)
> > 10 CONTINUE
> >
> > A data cache is *no use at all* for this problem. You will get a
> > cache miss on every data access.
>
> Well, not quite. True, you're going to miss on all the VEC(I)
> references, but you should hit on all the I and SUM references. Rewrite
> the loop as follows (and no, I don't care if I got the test-at-the-top or
> test-at-the-bottom backwards)
>
> 10	IF (I .GT. 1000000) GOTO 20	# fetch I
> 	I = I + 1			# fetch I, store I
> 	SUM = SUM + VEC(I)		# fetch I, fetch SUM, store SUM
> 	GOTO 10
> 20	CONTINUE
>
> and you can count 4 data fetches (as pseudo-commented above), all of which
> should be cache hits. This far outnumbers the 1 cache miss per loop due to
> the main data array accesses. Some FORTRAN compilers have even been known
> to put constants in memory and refer to them indirectly; if that is true
> here, you get another two cache hits for the 1 and the 1000000. You could
> argue, of course, that I and SUM might be in registers, but that's a whole
> other argument.

All perfectly right, but the cache miss rate also depends on the cache
line size - how many words are fetched per miss - which might be (say)
128 bytes or 16 words. In this case, you'd get a miss rate of 1/16 on
the VEC(I) references. Also, some mainframe caches are smart enough to
look ahead and prefetch the next sequential line (usually only on
I-fetch, but sometimes on data as well). This could result in an actual
miss rate of zero for the example. Even without this, with five operand
references per iteration and one miss every 16 iterations, you have an
operand miss rate of 1/80, or .0125. Quite respectable.
--
Mike Taylor ...!{hplabs,amdcad,sun}!amdahl!mat
[ This may not reflect my opinion, let alone anyone else's. ]
hankd@pur-ee.UUCP (Hank Dietz) (09/13/89)
In article <3989@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
> Here's a (possibly crazy) idea for cache design. The current EUD ...
> What if you segmented the virtual memory space (Oh no! Not
>segmented address spaces again! Shades of Intel!) so that the top bit was
>a hint to the cache on probable access patterns. Variables which were
>expected to hit the cache a lot (SUM and I in the EUD) would be put in the
>"normal" part of the address space. Variables which were expected to be
>sequential access and thus never hit (VEC in the EUD) would be put in the
>other half of the address space. The cache would know not to bother doing
>a tag match on this kind of access. The advantages would be faster access
>time (a memory fetch should be faster than a cache miss followed by a memory
>fetch) but more important it wouldn't cause bogus cache flushes.

Look at Chi's PhD thesis (see my last posting). The idea of using a bit
to control mapping is "older than dirt" and the idea of using it in
sequential machines to control cacheability is part of what Chi's thesis
suggests. The catch is that you don't divide the address space -- one
simply has the compiler tag individual REFERENCES with the "don't cache
me" bit. The reason you just ignore the high bit for addressing, rather
than use it with creative data layout, is that the variable X might be
cached along one program flow path, but not cached along another flow
path. Chi also gives a compiler algorithm with O(n) typical performance
for setting the "don't cache me" bits... your "sequential access" rule
is a VERY crude approximation. You get a BIG performance gain this
way... results also in Chi's thesis.

-hankd@ecn.purdue.edu

PS: Chi and I have come to call this mechanism "Cache Bypass." Parts of
this work have appeared in a bunch of papers as well as in his PhD
thesis... I think the first was at HICSS 87; the most recent was at the
SIGPLAN 89 conf. on programming lang. design & implementation.
retrac@titan.rice.edu (John Carter) (09/13/89)
In article <3989@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>
> Here's a (possibly crazy) idea for cache design. [Stuff deleted.]
>
> What if you segmented the virtual memory space (Oh no! Not
>segmented address spaces again! Shades of Intel!) so that the top bit was
>a hint to the cache on probable access patterns. Variables which were
>expected to hit the cache a lot (SUM and I in the EUD) would be put in the
>"normal" part of the address space. Variables which were expected to be
>sequential access and thus never hit (VEC in the EUD) would be put in the
>other half of the address space. The cache would know not to bother doing
>a tag match on this kind of access. The advantages would be faster access
>time (a memory fetch should be faster than a cache miss followed by a memory
>fetch) but more important it wouldn't cause bogus cache flushes.
>
> As with every crazy idea, you can expand on this in all sorts of
>ways. You might use more than one high order bit to provide lots of
>different sorts of hints to the cache. You might want to be able to turn
>the hinting on and off, or even make it programmable. Of course, every bell
>and whistle adds cost and complexity, and reduces the chance that you will
>use the fancy features well, or at all.
>
> So what do you think? Has this been done before?

I don't know of anybody else doing exactly this, but I/we have been
examining something similar here at Rice. Our motivation for examining
what I call adaptive caching schemes (adaptive because the caching
mechanism adapts to the way the shared data is being accessed) is the
design of a `highly efficient' distributed shared memory system (I put
`highly efficient' in quotes because we're just starting to move out
from paper design, and while efficiency is a major goal, I don't know
how much we'll succeed until we actually get some hard numbers).

Implementing distributed shared memory is quite similar to implementing
a multiprocessor cache (ref. work by Kai Li among others), but has a few
important differences, including much longer latency to `main memory'
and lower bus bandwidth (which may or may not support efficient
broadcast, depending on your target distributed memory multiprocessor).
The high cost of `page faults' in such a system motivated us to study
ways to significantly reduce the number of `page faults' by taking
advantage of semantic information either derived automatically by the
compiler or specified by the user. Since distributed shared memory
systems are essentially software-implemented caches, you can implement a
more complicated caching mechanism than can reasonably be implemented in
hardware.

I don't have the time right now to go into details (both Sigmetrics and
PPoPP have immediate deadlines that we're shooting for), but our basic
idea is that you can characterize different common `types' of shared
memory objects (data items), provide multiple cache coherence mechanisms
designed to efficiently support these different `types', and have each
shared object be handled with a caching mechanism tuned to its needs.
The current thrust of the work has been two-fold: characterizing the way
shared memory is accessed (somewhat related to earlier work by Weber &
Gupta, Eggers & Katz, and Agarwal et al.) and developing the basic
design of a distributed shared memory system that can take advantage of
this knowledge (Munin -- if you want to know the etymology of the name,
you'll have to ask :-).

If anybody is interested, we should have a few tech reports available
soon (next week). One will describe our work on characterizing shared
memory access patterns and the other will discuss the preliminary design
of Munin.

Hope you found this interesting.

John Carter                Internet: retrac@rice.edu
Dept of Computer Science   UUCP: {internet node or backbone}!rice!retrac
Rice University
Houston, TX

"Badgers?! We don't need no stinking badgers!" - from UHF
mash@mips.COM (John Mashey) (09/13/89)
In article <1174@brazos.Rice.edu> retrac@titan.rice.edu (John Carter) writes:
>In article <3989@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>> Here's a (possibly crazy) idea for cache design. [Stuff deleted.]
>> What if you segmented the virtual memory space (Oh no! Not
>>segmented address spaces again! Shades of Intel!) so that the top bit was
>>a hint to the cache on probable access patterns. Variables which were
>>expected to hit the cache a lot (SUM and I in the EUD) would be put in the
>>"normal" part of the address space......
>> So what do you think? Has this been done before?

The issue is mostly a software issue. Most microprocessors designed
with caching in mind from day one, and many implementations of earlier
architectures, include MMUs that offer at least a cache/nocache bit per
MMU map entry. Any of the high-performance RISCs, and most of the CISC
chips, do this. In any of the MMU designs that permit double-mapping,
the same physical page can be double-mapped with 2 MMU entries, one
cached, the other uncached, so if the software can figure out what to
do, and keep reasonable consistency, all might be well.

MIPS R3000s use one of the address bits to distinguish between
cached/nocached access for the "unmapped" part of the kernel address
space, and actually use this in various ways. For example, you can call
a function, cached, in the usual way. If you call it by jumping to the
same address | 0x20000000, when you get there, you're running
uncached..... this feature is actually used, by the way.

RISC/OS also takes advantage of uncached address space in those
configurations (i.e., most R2000-based machines) where it is faster to
read disk blocks into buffers and copy from them uncached, thus avoiding
trashing the cache unnecessarily. (This is not so clear-cut on R3000s
that do multi-word refill; another strategy might be better.) However,
to my knowledge, we do not make use of this for ordinary user programs
of the form implied by the original question; rather it is restricted to
manual choice where people know a priori that this is what they want.

Understanding the useability of software-controlled caching is a good
area for research over the next few years; the answers are not yet
clear.
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD: 	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
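[John's address | 0x20000000 trick looks like this in C. Illustration
only: it is meaningful in R3000 kernel space, where kseg0 (cached) and
kseg1 (uncached) alias the same physical memory and differ in exactly
that address bit.]

extern void routine(void);   /* hypothetical function, linked in kseg0 */

void call_uncached(void)
{
    /* Same instructions, same physical memory; only the fetch path
       differs. ORing in 0x20000000 moves a kseg0 (cached) address
       into kseg1 (uncached).                                        */
    void (*f)(void) =
        (void (*)(void))((unsigned long)routine | 0x20000000UL);
    f();
}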
amos@taux01.UUCP (Amos Shapir) (09/13/89)
This idea is already in use by some MMU systems - it's called
uncacheable (or memory-locked) pages. Such schemes let you specify which
pages will bypass the cache. (Specifying a page address is equivalent to
marking a few address bits, which is what you suggested.)
--
Amos Shapir	amos@taux01.nsc.com or amos@nsc.nsc.com
National Semiconductor (Israel) P.O.B. 3007, Herzlia 46104, Israel
Tel. +972 52 522261	TWX: 33691, fax: +972-52-558322
34 48 E / 32 10 N	(My other cpu is a NS32532)
les@unicads.UUCP (Les Milash) (09/13/89)
ok. since we're talkin about "How Caches Work" here's a question.

i realize that cache lines can be read faster than N individual read
cycles, either by

	0.   having the mem-to-cache path wider than the cache-to-
	     processor path.
	1.0  doing burst (page-mode) access to a bank of dram.
	1.1  using interleaved banks.

so my question is: who does #0? i don't think i've seen it done in a
micro (i.e. i don't think i've seen any few-chip CMMU that comprehends
(say) 16-byte wide paths to the dram). (i think i heard that the Lilith
did fetch 8-byte instructions into an 8-byte-wide x 1 deep prefetch
buffer). now i have no problem imagining Mr. Cray and the Fast Crowd
doing wide things, but in the micro world i haven't heard of it (yet--
seems obvious now that big pin-count pkgs are starting to appear).

also, what other things that i haven't seen prevent this from being an
easy performance-boost?

the gentleman from japan who was asking about integrating cache and dram
on one chip sounded like he was onto a good idea. faster, faster, till
the thrill of gips outweighs the cost of pins!

thanks folks,
Les Milash
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (09/13/89)
In article <MCCALPIN.89Sep12111256@masig3.ocean.fsu.edu> mccalpin@masig3.ocean.fsu.edu (John D. McCalpin) writes:

>Along a similar line, the Convex machines cache accesses going to the
>scalar unit only. The load/store unit for the vector unit operates
>directly from main memory to the vector registers. This is pretty
>close to what Roy describes, but it is not controlled by any tags,
>merely by which functional unit executes the load instruction.

This is one of the advantages of having a vector instruction set,
because this is *exactly* what is needed for programs of this type.
This point is not always appreciated by systems people who are used to
dealing with kernel code and utilities, so it is worth noting. You do
*not* want to flush the cache by routing vector reads and writes through
it. A typical code getting 10-20 average MFLOPS on a machine with ~100
MBytes of real memory may easily chop through 25 MBytes on each time
step, much bigger than *any* current cache. If you flush the cache, you
slow down the intermixed scalar parts of the code. Instead, the vector
reads/writes can go directly to memory, with a bus watcher on the cache
to update anything which happens to be in the cache.

It may seem like a kludge to have the instruction used determine whether
something is put in the cache or not, but there is actually a lot of
information in the fact that the data is being accessed with a vector
instruction. Unfortunately, it won't save you from flushing the cache
occasionally when you have to do certain operations which won't
vectorize. There is a considerable art to optimizing the CPU/memory
interaction, just as there is to optimizing branch strategies, etc.

Hugh LaMaster, m/s 233-9,	UUCP ames!lamaster
NASA Ames Research Center	ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035		Phone: (415)694-6117
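[A toy model of the bus watcher Hugh describes -- an invented sketch
with made-up sizes, not Convex's actual design. Vector stores bypass the
cache on their way to memory; the watcher updates a resident line in
place instead of flushing it.]

#include <stdio.h>

#define WORDS  4096
#define NLINES 64
#define LINEW  8                    /* words per line */

static double mem[WORDS];
static double cache[NLINES][LINEW];
static long   tag[NLINES];          /* resident line number, -1 = empty */

static void vector_store(long addr, double v)
{
    mem[addr] = v;                  /* straight to memory, no refill */
    long line = addr / LINEW, set = line % NLINES;
    if (tag[set] == line)           /* bus watcher: update in place  */
        cache[set][addr % LINEW] = v;
}

static double scalar_load(long addr)
{
    long line = addr / LINEW, set = line % NLINES;
    if (tag[set] != line) {         /* miss: refill the whole line   */
        long w;
        tag[set] = line;
        for (w = 0; w < LINEW; w++)
            cache[set][w] = mem[line * LINEW + w];
    }
    return cache[set][addr % LINEW];
}

int main(void)
{
    long i;
    for (i = 0; i < NLINES; i++) tag[i] = -1;
    double before = scalar_load(17);  /* line now resident            */
    vector_store(17, 3.14);           /* memory and cached copy agree */
    printf("before %g, after %g\n", before, scalar_load(17));
    return 0;
}

[The scalar working set stays cached and consistent no matter how much
data the vector unit streams past it.]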
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (09/13/89)
In article <27445@winchester.mips.COM> mash@mips.COM (John Mashey) writes:

>MIPS R3000s use one of the address bits to distinguish between
>cached/nocached access for the "unmapped" part of the kernel address
>space, and actually use this in various ways. For example, you can

It would appear that a "general" form of this would allow arbitrary
memory sections to be cache/nocache. This is an interesting alternative
which is similar to the option on the CDC 2xx architecture of having
user-selectable page sizes. It could be a very useful feature to have on
a "superscalar" type machine, because you generally don't want array
accesses to be cached. (On the CDC machines, the OS let you specify at
load time which arrays would go on large pages...)

>Understanding the useability of software-controlled caching is
>a good area for research over the next few years; the answers are not
>yet clear.

Since there are times when you *do* want array accesses to be cached,
this might be a case for having two different load instructions, one
cached and one not cached. That would allow the compiler and/or user to
choose at any moment what the optimal choice is for that load.

Hugh LaMaster, m/s 233-9,	UUCP ames!lamaster
NASA Ames Research Center	ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035		Phone: (415)694-6117
mbutts@mentor.com (Mike Butts) (09/13/89)
From article <MCCALPIN.89Sep12111256@masig3.ocean.fsu.edu>, by mccalpin@masig3.ocean.fsu.edu (John D. McCalpin):
> In message <3989@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>
>>Here's a (possibly crazy) idea for cache design. The current EUD
>>(Example Under Debate) shows that caches just don't work for sequential
>>access, but we knew that already.
>
> [ Roy describes a system for which only a portion of the
>   address space is run through the cache ]
>
>> So what do you think? Has this been done before?
>
> Along a similar line, the Convex machines cache accesses going to the
> scalar unit only. The load/store unit for the vector unit operates
> directly from main memory to the vector registers. This is pretty
> close to what Roy describes, but it is not controlled by any tags,
> merely by which functional unit executes the load instruction.

It's not terribly hard to build a vector cache, or vector buffer, call
it what you will, into which vector elements are prefetched according to
an address and stride detected by the compiler for a predictable loop.
--
Michael Butts, Research Engineer       KC7IT           503-626-1302
Mentor Graphics Corp., 8500 SW Creekside Place, Beaverton, OR 97005
...!{sequent,tessi,apollo}!mntgfx!mbutts OR mbutts@pdx.MENTOR.COM
Opinions are my own, not necessarily those of Mentor Graphics Corp.
colin@watcsc.waterloo.edu (Colin Plumb) (09/14/89)
Not caching certain accesses is something the i860 does... there are
normal loads that the data cache tries to hit on, and pipelined loads,
with programmer-visible delay, whose values are *not* loaded into the
cache. They have some example code for matrix multiplication that loads
the rows into cache and uses pipelined loads on the columns to avoid
memory delays (it's pipelined with the multiply/accumulates) despite the
awful locality of reference.
--
	-Colin
preston@titan.rice.edu (Preston Briggs) (09/14/89)
In article <31815@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>In article <27445@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>
>>MIPS R3000s use one of the address bits to distinguish between
>>cached/nocached access for the "unmapped" part of the kernel address
>>space, and actually use this in various ways. For example, you can
>
>It would appear that a "general" form of this would allow arbitrary
>memory sections to be cache/nocache. This is an interesting
>Since there are times when you *do* want array accesses to be cached,
>this might be a case for having two different load instructions, one
>cached and one not cached. That would allow the compiler and/or user
>to choose at any moment what the optimal choice is for that load.

I think you're right. The i860 allows both cached and non-cached loads.
It would sometimes be nice if they allowed a variable line size (small,
to prevent loading extra unneeded data, or large, to preload data that
will be needed soon). A fetch-block-to-cache instruction might actually
be what's wanted.

I expect compiler people will want more and more control over the cache
in the future. Current hardware management will continue to be demanded
for un-cache-optimized code.

An analogy might be stack machines vs. register machines. Without good
register allocation, stack machines with top-of-stack caching are pretty
good. With good register allocation, we can do better by explicitly
controlling the registers.

Preston Briggs
preston@titan.rice.edu
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (09/14/89)
In article <1184@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>In article <31815@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>
>>Since there are times when you *do* want array accesses to be cached,
>>this might be a case for having two different load instructions, one
>>cached and one not cached. That would allow the compiler and/or user
>
>I think you're right. The i860 allows both cached and non-cached loads.

You are right. If I had bothered to read the Aug. 89 IEEE Micro before
touching my keyboard, I would have known that the i860 has exactly this
feature. Good for Intel! Aug. 89 also has a comparison of RISC
architectures, although for some reason it ignores MIPSCo...

Hugh LaMaster, m/s 233-9,	UUCP ames!lamaster
NASA Ames Research Center	ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035		Phone: (415)694-6117
rodman@mfci.UUCP (Paul Rodman) (09/14/89)
In article <31814@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>>Along a similar line, the Convex machines cache accesses going to the
>>scalar unit only. The load/store unit for the vector unit operates
>>directly from main memory to the vector registers. This is pretty
>
>This is one of the advantages of having
>a vector instruction set, because this is *exactly* what is needed for
>programs of this type.

What you should have said is: "This is one of the advantages of having
control over whether or not to bypass the cache". A vector instruction
set gives a pretty coarse method of making the choice.

Another point worth noting is that many vector machines only do the
cache bypass if the stride = 1; other strides may go to the cache. It
depends, of course, on how good the memory system is.

    pkr
david@daisy.UUCP (David Schachter) (09/14/89)
In article <639@unicads.UUCP> les@unicads.UUCP (Les Milash) asks about
microcomputers with a cache line size greater than the memory word size,
and wonders about the absence of such machines.
The Everex Step 386/20 uses a cache line size that depends on the amount of
cache installed. 64KB -> 4 byte line. 128KB -> 8 byte line. 256KB -> 16 byte
line. The machine is/was the fastest 20 MHz 386 PC clone around. The Step
386/25 uses the same approach and gets the same results. Everex calls this
"AMMA", which stands for some marketing drivel. I don't recall the details on
the Step 386/33.
The cache is a direct-mapped write-back cache, so they clearly brute-forced
the lower hit rate of direct mapping by putting in a lot of cache, and spent
the engineering resources that could have gone to multi-way associativity on
write-back logic instead. In my opinion, they made a good design decision.
The design doesn't use ASICs or custom VLSI; it does use some PALs.
-- David Schachter
"With friends like you, who needs hallucinations?" -- "Night Court" (NBC)
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (09/15/89)
In article <1029@m3.mfci.UUCP> rodman@mfci.UUCP (Paul Rodman) writes: >In article <31814@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes: >>a vector instruction set, because this is *exactly* what is needed for >What you should have said is: "This is one of the advantages of having >control over whether or not to bypass the cache". A vector instruction >set gives a pretty coarse method of making the choice. You are right. I should have noticed that the terms of the debate have changed during the last year from: "Nobody runs applications like that" to "What is the optimal way to support calculations involving large arrays of floating point variables?" Needless to say I am pleased. I am now hoping that fast multiport memory subsystems will become the next thing which we didn't need, but which we now can't live without. Hugh LaMaster, m/s 233-9, UUCP ames!lamaster NASA Ames Research Center ARPA lamaster@ames.arc.nasa.gov Moffett Field, CA 94035 Phone: (415)694-6117