lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (02/26/90)
In article <WAYNE.90Feb23091735@dsndata.uucp> wayne@dsndata.uucp (Wayne Schlitt) writes: >i am sure there _are_ real programs that can fit entirely in the 68030's >massive (:-) 256 byte data and instruction caches. i am sure there >are a lot more real programs that can fit entirely in a 8k cache. i >am sure that a lot of real programs will fit nicely in 32k-128k >caches. Programs certainly do vary. The transaction people at Concurrent say that their idea of a cache is a megabyte. There may also be programs with essentially no locality: this has been claimed for some network communications software. (The point was that the packets would be processed exactly once, and then passed along, never to be seen again.) -- Don D.C.Lindsay Carnegie Mellon Computer Science
wilson@carcoar.Stanford.EDU (Paul Wilson) (02/26/90)
In article <8168@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes: >In article <WAYNE.90Feb23091735@dsndata.uucp> wayne@dsndata.uucp > (Wayne Schlitt) writes: >>i am sure there _are_ real programs that can fit entirely in the 68030's >>massive (:-) 256 byte data and instruction caches. i am sure there >>are a lot more real programs that can fit entirely in a 8k cache. i >>am sure that a lot of real programs will fit nicely in 32k-128k >>caches. > >Programs certainly do vary. The transaction people at Concurrent say >that their idea of a cache is a megabyte. There may also be programs >with essentially no locality: this has been claimed for some network >communications software. (The point was that the packets would be >processed exactly once, and then passed along, never to be seen >again.) Garbage collected languages have similar characteristics. If you use a generational garbage collector, you may be able to fit the youngest generation entirely in the cache. But for it to be worthwhile, you want a fairly big cache. The problem is that you allocated up through a region of memory, allocating mostly very-short-lived objects that you touch a few times and then never reference again. A while later, you copy the minority of live objects to some other region, and allocate upward through the same area again. If you can keep this whole area in cache, you don't repeatedly miss on it. This *may* require a *big* cache. Ben Zorn's thesis from Berkeley, (Dec '89) shows that misses only drop off slowly for a large direct-mapped cache like SPUR's, so that you might need 1-2MB caches to get really good hit rates. I have a theory that you might do much better with set-associative caches, if they're around 200KB or more. But the numbers aren't in yet. Note that if your cache is not at least a bit bigger than your youngest generation, you don't win -- you tend to evict blocks before you reuse them -- LRU is your enemy here. So the miss rate is fairly insensitive to cache size until the cache gets to be as large as the youngest generation. A smarter replacement policy, which knew about this, could probably cut misses significantly for caches too small to hold the whole allocation area. (But is it worth it? Maybe. It might do a good job for the above-mentioned networking software as well, and some other things.) >-- >Don D.C.Lindsay Carnegie Mellon Computer Science -- Paul Paul R. Wilson Software Systems Laboratory lab ph.: (312) 996-9216 U. of Illin. at C. EECS Dept. (M/C 154) wilson@bert.eecs.uic.edu Box 4348 Chicago,IL 60680 Paul R. Wilson Software Systems Laboratory lab ph.: (312) 996-9216 U. of Illin. at C. EECS Dept. (M/C 154) wilson@bert.eecs.uic.edu Box 4348 Chicago,IL 60680
koopman@a.gp.cs.cmu.edu (Philip Koopman) (02/27/90)
In article <1990Feb26.022057.28461@Neon.Stanford.EDU>, wilson@carcoar.Stanford.EDU (Paul Wilson) writes: > Garbage collected languages have similar characteristics. If you > use a generational garbage collector, you may be able to fit the > youngest generation entirely in the cache. But for it to be > worthwhile, you want a fairly big cache. > ... My thesis work showed that functional programming languages that use combinator graph reduction have many unexpected characteristics when using cache. They all seem to make excellent use of even very small caches of a few Kbytes. On the other hand, scientific/engineering number-crunching is notorious for breaking caches. Any time the numeric arrays you are operating on get substantially bigger than the cache size, you end up heading towards a 0% hit rate. And, for any cache size you can *always* find someone who wants to use an array bigger than you can handle. Yes, block-manipulation techniques can help, but they are a hack to get around this basic problem -- and are just about as effective with smaller, cheaper vector register files. So, that's why most supercomputers seem to use vector register files instead of caches for their vector units. Phil Koopman koopman@greyhound.ece.cmu.edu Arpanet 2525A Wexford Run Rd. Wexford, PA 15090 Senior Scientist at Harris Semiconductor, adjunct professor at CMU. I don't speak for them, and they don't speak for me.
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (02/28/90)
In article <8189@pt.cs.cmu.edu> koopman@a.gp.cs.cmu.edu (Philip Koopman) writes: >In article <1990Feb26.022057.28461@Neon.Stanford.EDU>, wilson@carcoar.Stanford.EDU (Paul Wilson) writes: >So, that's why most supercomputers seem to use vector register >files instead of caches for their vector units. Or, at least semi-bypass the scalar data cache for vector reads and writes (coherency can become a performance issue, though, if you have a real cache). On the Cray machines, scalar data "caches" have always been explicitly programmed as "registers" anyway, so the bypassing is easy. Hugh LaMaster, m/s 233-9, UUCP ames!lamaster NASA Ames Research Center ARPA lamaster@ames.arc.nasa.gov Moffett Field, CA 94035 Phone: (415)604-6117
dik@cwi.nl (Dik T. Winter) (02/28/90)
In article <8189@pt.cs.cmu.edu> koopman@a.gp.cs.cmu.edu (Philip Koopman) writes: > So, that's why most supercomputers seem to use vector register > files instead of caches for their vector units. > Well, no (depends on your definition of supercomputer of course; let us assume vector processor). There are systems without cache that use vector registers (Cray, NEC), or have memory to memory operations (Cyber 205). And there are processors with cache. Possibilities are: 1. No vector registers, bypass cache (Cyber 995). 2. Vector registers, bypass cache (i know none). 3. No vector registers, through cache (again, i know none). 4. Vector registers, through cache (IBM 3090, Convex, Alliant, Gould). So, no, vector registers are not a replacement for cache. -- dik t. winter, cwi, amsterdam, nederland dik@cwi.nl
ps@fps.com (Patricia Shanahan) (03/01/90)
In article <8848@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes: >In article <8189@pt.cs.cmu.edu> koopman@a.gp.cs.cmu.edu (Philip Koopman) writes: > > So, that's why most supercomputers seem to use vector register > > files instead of caches for their vector units. > > >Well, no (depends on your definition of supercomputer of course; let us >assume vector processor). There are systems without cache that use vector >registers (Cray, NEC), or have memory to memory operations (Cyber 205). >And there are processors with cache. Possibilities are: >1. No vector registers, bypass cache (Cyber 995). >2. Vector registers, bypass cache (i know none). The FPS Model 500 is in this category. The scalar processors use caches for instructions and scalar data references. The vector processors are connected to the System Memory Bus and fetch data from memory without going through cache. The only interaction is that the data caches snoop on the bus, and stores by the vector processors cause corresponding purges in the data caches. >3. No vector registers, through cache (again, i know none). >4. Vector registers, through cache (IBM 3090, Convex, Alliant, Gould). I would add the VAX 9000 to this list, but question whether the Convex C2 belongs here. I think it is a vector registers and bypass cache system. >So, no, vector registers are not a replacement for cache. >-- >dik t. winter, cwi, amsterdam, nederland >dik@cwi.nl Vector registers and cache both perform the same basic function, of holding data that is likely to be used in future calculations close to the arithmetic unit that is going to do those calculations. They also serve to buffer between the arithmetic unit's need for individual numbers, and the memory preference for highly parallel processing of larger blocks. Caches are very good when reference patterns are basically unpredictable and recent past reference to a data item or something close to it in memory is positively correlated with probability of near future reference. They work by fetching a block surrounding a referenced item (because of spacial locality) and keeping it around for a while to exploit temporal locality. Vector memory references are not like that. The reference pattern is set at compile time and known to the compiler. During a strided fetch there is no spacial locality. There is frequently a strong negative correlation between recent past reference and near future reference because of the large scale sequential characteristics of vector processing. Any cache substantially smaller than main memory can be flooded by some vector jobs. The very big manufacturers, especially IBM, are something of an exception because they can persuade third party software vendors to do a lot of special tuning for their systems. It is usually possible, with care, to arrange a vector job so that it will fit well with a particular cache size. -- Patricia Shanahan ps@fps.com uucp : {decvax!ucbvax || ihnp4 || philabs}!ucsd!celerity!ps phone: (619) 271-9940
patrick@convex.com (Patrick F. McGehearty) (03/01/90)
In article <8848@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes: >In article <8189@pt.cs.cmu.edu> koopman@a.gp.cs.cmu.edu (Philip Koopman) writes: >1. No vector registers, bypass cache (Cyber 995). >2. Vector registers, bypass cache (i know none). >3. No vector registers, through cache (again, i know none). >4. Vector registers, through cache (IBM 3090, Convex, Alliant, Gould). Actually, in the Convex C2 series, vector loads and stores bypass the cache. Scalar operations use the cache. One argument for this design is that the large amount of data used in vector operations would quickly invalidate most scalar data in the cache, such as local variables, loop limits, etc.
mccalpin@vax1.acs.udel.EDU (John D Mccalpin) (03/01/90)
In article <8848@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes: >In article <8189@pt.cs.cmu.edu> koopman@a.gp.cs.cmu.edu (Philip Koopman) writes: > > So, that's why most supercomputers seem to use vector register > > files instead of caches for their vector units. > > >Well, no (depends on your definition of supercomputer of course; let us >assume vector processor). There are systems without cache that use vector >registers (Cray, NEC), or have memory to memory operations (Cyber 205). >And there are processors with cache. Possibilities are: >1. No vector registers, bypass cache (Cyber 995). >2. Vector registers, bypass cache (i know none). >3. No vector registers, through cache (again, i know none). >4. Vector registers, through cache (IBM 3090, Convex, Alliant, Gould). >So, no, vector registers are not a replacement for cache. >-- >dik t. winter, cwi, amsterdam, nederland >dik@cwi.nl I believe that the Convex C-2 series bypass the cache to load/store the vector registers, so it should be in category 2, not category 4. The ETA-10 is a special case different than any of the above categories. It had no vector registers (being a memory-to-memory machine), but it had a set of cache-like registers (the "short-stop" registers) that were used to cache only short vectors. I never got a definitive answer from any of my ETA friends about exactly how big this register set was used or exactly how the hardware decided to use it. I do know from direct observation that repeated access to short vectors showed a lower start-up cost than the equivalent long-vector operations. The impression that I got was that this use is different than category 3, since the short-stop registers were only used for vector operands. I think that short vectors were stuffed into these registers in parallel with their first use, and then they could be re-loaded directly from the short- stop registers if needed again soon. The output of the vector unit seems also to be loaded into the short-stop registers (if short enough) to make back-to-back vector operations run with slightly less overhead --- this is in the spirit of chaining. Corrections from informed sources welcome....
bdg@tetons.UUCP (Blaine Gaither) (03/02/90)
>In article <8848@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes: >>In article <8189@pt.cs.cmu.edu> koopman@a.gp.cs.cmu.edu (Philip Koopman) >>writes: >>1. No vector registers, bypass cache (Cyber 995). >>2. Vector registers, bypass cache (i know none). >>3. No vector registers, through cache (again, i know none). >>4. Vector registers, through cache (IBM 3090, Convex, Alliant, Gould). > >Actually, in the Convex C2 series, vector loads and stores bypass the cache. >Scalar operations use the cache. One argument for this design is that >the large amount of data used in vector operations would quickly invalidate >most scalar data in the cache, such as local variables, loop limits, etc. I have to agree with the arguments states by Pat. The Gould NP1 was (still is) a cache based vector mini-super. All vector operation went through a store-through cache. The cache size was small (16KB data). It turned out to perform rather poorly on some scientific codes. The basic machine was about 2x its competition at scalar, and about equal on vector codes if their data fit in the data cache. If an algorithm didn't fit in the cache, the result was to run at .3 to .5 the in-cache speed. There were several other factors of the hw/sw design that were not optimal for the scientific market, but the cache based design did not help. One problem is that the smaller the cache, the greater the emphasis that must be placed upon algorithms to make better use of the memory hierarchy. If user and 3rd party software vendors use vendor libraries, they can get good performance (if the vendor bothers to do the requisite algorithm work). If they don't then is an uphill battle for compiler writers to recognize poor algorithms and substitute better ones. These tools must be ready before 3rd party ports start. Cache based vector machines can in general be assumed to get the same number of cache faults as a machine executing in scalar (with the same algorithm) would get. The big problem is vector machines get them over a shorter period of time, thus it becomes a relatively more serious problem than it would in a typical scalar machine. Another usability problem with a cache based approach is the discontinuity it causes in the performance when plotted vs problem size. This is very disconcerting. If when you do a matrix multiply of 2 nxn matrices and then multiply 2 (N+5)x(n+5) matrices you can experience a startling slowdown (m sec vs 3m sec). This causes some very intersting measurement problems in scientific benchmarks which attempted to find [N sub 1/2]. i.e. You have to find the peak vector performance (and it ain't on 1000x1000 matricies). So measurement techniques that expect monotonically increasing (or worse continuous) values of performance vs problem size get wrong answers. Well 16K is a very small cache for scientific problems. What about larger caches. Well larger is better. It moves the discontinuity so it affects fewer problems. At some point you might not care. But if the reason you are buying the architecture (as well as the implementation) is because you see ever increasing problem size, then even if it works fast enough on todays problems, will it be fast enough for tomorrows problem sizes. What can a cache based vector machine do? Use a dual ported cache with parallel prefetch. Use a very large cache. Use a very very large secondary cache to reduce miss penalty.