wall@fortune.UUCP (Jim Wall) (07/18/85)
I have a feeling that this topic may once again start the religious wars of cache types and implementations, but just maybe I can get the info that I'm looking for. Given a multiuser UNIX environment, and given a cache that is split so that user and supervisor/kernel are separate, what kind of performance improvement will the cache yield? More information: the cache is a simple direct-mapped implementation; as the CPU fetches something from RAM (I/O is not cached, obviously), it is stored in the cache. If that physical address is ever used again, the data will come from the cache.

So where this will help is in program loops and often-used routines. The cache is 16K bytes, split into 8K for user and 8K for supervisor. THE QUESTION IS: in UNIX, how much improvement can the cache make? Let's face it, you won't be making a 70% - 90% hit rate.

O.K., an extra credit part to the problem: if the CPU were a 68020 with the internal 256-byte direct-mapped instruction cache, is there a need for the external cache?

-Jim Wall
...amd!fortune!wall

P.S. Both Altos and Charles River Data had both the 020 cache and an external 8K cache; Dual did not. Anyone know why? Not that I'm trying to have someone else do my design work. I believe that for the external cache to be really useful, the system architecture must be designed for the cache, such as 64-bit-wide memory, or block transfers on cache misses. Does CRD or Altos do any of these?
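Jim's direct-mapped scheme can be sketched in a few lines. This is a toy model under assumed parameters (16-byte lines, one 8K half of the cache), not a description of any particular board:

```python
# Toy model of a direct-mapped cache: address -> (tag, line index, offset).
# Sizes are assumptions for illustration: 16-byte lines, one 8K half.
LINE_SIZE = 16
NUM_LINES = 8 * 1024 // LINE_SIZE          # 512 lines in an 8K half

class DirectMappedCache:
    def __init__(self):
        self.tags = [None] * NUM_LINES     # one tag per line, empty at reset

    def access(self, addr):
        """Return True on a hit; on a miss, fill the line as a CPU fetch would."""
        index = (addr // LINE_SIZE) % NUM_LINES
        tag = addr // (LINE_SIZE * NUM_LINES)
        if self.tags[index] == tag:
            return True                    # data comes from the cache
        self.tags[index] = tag             # fetched from RAM, stored in cache
        return False
```

A loop over 256 bytes misses once per line on the first pass and hits every time after, which is exactly the "program loops" payoff; but two addresses 8K apart map to the same line and evict each other, which is the direct-mapped weakness.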
brad@gcc-bill.ARPA (Brad Parker) (07/22/85)
Jim Wall's question seems like a good one. Unfortunately, it only leads me to ask another... (please excuse the horrible spelling)

Could someone who has a decent understanding of memory management systems give me a short discourse on the following? I'd like to compare and contrast the difference in performance between a simple single-level paged memory manager using a RAM (a la Sage 68000) and a system like the IBM DAT box, where the page tables are stored in main memory and cached in hardware. The point being that switching context is MUCH faster if you only need to change the pointer to the page tables, rather than copy 8K of paging information into the page table RAM. It is assumed that the cache used to speed up the main memory page table accesses is sufficiently large to get a good hit rate (whatever that may be).

Elaboration: the simple hardware system is a RAM which uses as its address the upper part of the access address, with the contents of the RAM concatenated to the lower part of the access address -- additionally you'd need some flag RAM used to mark pages swapped out (no doubt generating a page fault interrupt). The main memory system would need a state machine to access the page tables from main memory and some sort of nifty cache to keep the most recently accessed translations around -- this is more or less similar to the hardware version except the page tables are in main memory, not in dedicated RAM. The goal is to allow for simple hardware without a huge overhead in context switching. Any ideas?

--
J Bradford Parker
uucp: seismo!harvard!gcc-bill!brad

"She said you know how to spell AUDACIOUSLY? I could tell I was in love... You want to go to heaven? or would you rather not be saved?" - Lloyd Coal
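The mapping-RAM scheme Brad describes can be modeled roughly as follows. The 4K page size and table size are assumptions for illustration, not Sage 68000 figures:

```python
# Toy model of the mapping-RAM scheme: the upper address bits index a
# dedicated RAM whose contents become the upper physical bits; the low
# bits pass through. Page size (4K) is an assumption, not a Sage figure.
PAGE_SHIFT = 12
PAGE_MASK = (1 << PAGE_SHIFT) - 1

class MappingRam:
    def __init__(self, num_pages):
        self.frame = [0] * num_pages           # translation RAM contents
        self.present = [False] * num_pages     # flag RAM: swapped-out mark

    def translate(self, vaddr):
        page = vaddr >> PAGE_SHIFT
        if not self.present[page]:
            raise RuntimeError("page fault")   # would raise an interrupt
        return (self.frame[page] << PAGE_SHIFT) | (vaddr & PAGE_MASK)
```

Brad's point is visible in the data structure: on a context switch, every entry of `frame` and `present` must be reloaded for the new process, whereas with tables in main memory only a base pointer changes.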
richardt@orstcs.UUCP (richardt) (07/23/85)
I wouldn't know about UNIX, but a cache of the type you're describing certainly does speed things up, if you can run it fast enough. The '020 internal cache is nice, but there are a number of things it has problems with: any loop which relies on a lookup table; any loop which uses a constant which is in data space because it is constant throughout a run but changes from run to run; and generally any loop which uses fixed tables in data space. A larger cache could also handle entire control loops instead of merely operation loops. For example, the Forth main interpreter loop or editor loop aren't very long; they are just barely long enough to fall outside the '020 cache. With an 8K cache, you could probably make an interpreter or any other reasonably complex interactive program run with a very high hit ratio. You'll note I haven't given any figures. That's because I haven't gotten my own design completed or up and running yet.

orstcs!richardt
"Is there an Assembly Language Programmer in the house?"
herbie@watdcsu.UUCP (Herb Chong [DCS]) (07/25/85)
In article <268@gcc-bill.ARPA> brad@gcc-bill.UUCP (Brad Parker) writes:
>I'd like to compare and contrast the difference in performance between a
>simple single-level paged memory manager using a RAM (a la Sage 68000) and
>a system like the IBM DAT box, where the page tables are stored in main memory
>and cached in hardware. The point being that switching context is MUCH
>faster if you only need to change the pointer to the page tables, rather than
>copy 8K of paging information into the page table RAM. It is assumed that
>the cache used to speed up the main memory page table accesses is sufficiently
>large to get a good hit rate (whatever that may be).

i've been doing a lot of reading lately on storage management at the kernel (or as IBM prefers to call it, the nucleus) level of 370 and 370-XA machines because i may be working on kernel code for those machines soon. anyway, i should point out that there are two sets of tables used by DAT: there are segment tables in addition to page tables. segments are 1Mbyte and pages are 4Kbytes. each address space (which can contain many processes, but all owned by the same user) has its own segment table entries which point to page tables for that user. all the processes in a single address space occupy various sections of virtual memory and operate as co-routines, so that only one process can ever be running at one time in an address space, and control is transferred between processes by explicit calls by the co-routines. because all processes in an address space share the same virtual memory, each can see all the others if it wants to, unlike unix processes, which are isolated from each other in terms of storage. when a context switch is performed by the CPU, the hardware saves away status in some block of storage and changes a segment table pointer before loading new status of the next address space to execute. i believe the actual size of information moved is on the order of 128 bytes, but i'm not completely sure.
the DAT hardware maintains a cache of segment and page table entries (called the Translation Lookaside Buffer, TLB) which improves overall performance because all storage references, whether by instruction fetch or operand access, require information in the segment and page tables. the hardware maintains this cache, although there are instructions provided for manipulating the entries. the net result is a much more complex CPU and memory manager. it would be very interesting to compare a 68000 system to a single chip (or even dozen chip) implementation of the full 370 hardware. there is also provision for prefix control, where multiple CPUs can refer to the same real address but the memory manager uses the CPU prefix to decide where the real block of storage is in real memory. this only happens for page 0 of memory. you get the idea.

Herb Chong...

I'm user-friendly -- I don't byte, I nybble....

UUCP: {decvax|utzoo|ihnp4|allegra|clyde}!watmath!water!watdcsu!herbie
CSNET: herbie%watdcsu@waterloo.csnet
ARPA: herbie%watdcsu%waterloo.csnet@csnet-relay.arpa
NETNORTH, BITNET, EARN: herbie@watdcs, herbie@watdcsu
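The two-level walk described above can be sketched as follows. The table layout is illustrative only, not the real 370 entry format; the sizes follow the 1M-segment / 4K-page figures from the post:

```python
# Illustrative two-level translation with a TLB; not the real 370 format.
# Sizes follow the post: 1M segments (2^20), 4K pages (2^12).
SEG_SHIFT, PAGE_SHIFT = 20, 12
PAGES_PER_SEG = 1 << (SEG_SHIFT - PAGE_SHIFT)  # 256 page entries per segment

def translate(vaddr, segment_table, tlb):
    page = vaddr >> PAGE_SHIFT
    offset = vaddr & ((1 << PAGE_SHIFT) - 1)
    if page in tlb:                            # TLB hit: no table references
        return (tlb[page] << PAGE_SHIFT) | offset
    seg = vaddr >> SEG_SHIFT
    page_table = segment_table[seg]            # first storage reference
    frame = page_table[page % PAGES_PER_SEG]   # second storage reference
    tlb[page] = frame                          # hardware-maintained cache
    return (frame << PAGE_SHIFT) | offset
```

The key cost the TLB hides is the two extra storage references per translation; without it, every instruction fetch and operand access would pay them.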
mat@amdahl.UUCP (Mike Taylor) (07/25/85)
> Could someone who has a decent understanding of memory management systems
> give me a short discourse on the following?

The fact that I make a comment does not imply any pretensions of a decent understanding.

> I'd like to compare and contrast the difference in performance between a
> simple single-level paged memory manager using a RAM (a la Sage 68000) and
> a system like the IBM DAT box, where the page tables are stored in main memory
> and cached in hardware. The point being that switching context is MUCH
> faster if you only need to change the pointer to the page tables, rather than
> copy 8K of paging information into the page table RAM. It is assumed that
> the cache used to speed up the main memory page table accesses is sufficiently
> large to get a good hit rate (whatever that may be).

In fact, the context switch in S/370 does not require any massive copies. A CPU control register contains the address of the segment tables associated with the current address space. This is called the Segment Table Origin (STO). A cached list contains some (implementation-dependent) set of these values, and maps them to a small number, the STO ID. Translations are cached in a buffer called the Translation Lookaside Buffer (TLB). Each translation in the TLB is associated with a particular STO ID, or else is marked as being common to all address spaces (Common Segment). Therefore, many translations for the same virtual address may reside in the TLB, each associated with a different address space by means of the STO ID. Instructions are provided to selectively or completely invalidate entries in the TLB. The reason for caching the entries relates to the cycle time objectives for the machine. If you use the simple hardware, then main storage access time is factored into the cycle time for address translation. In our implementation of S/370, this would mean substituting (say) 200 ns. main storage for the 7.5 ns. RAMs used.
The difference would add directly to cycle time (simplistically, at least), which would result in making the machine run about 9 times slower, ignoring the effects of TLB misses, which are very closely related to cache misses in our machine. The reason for the relation is that we use a virtually addressed cache and therefore include the TLB information in the cache tag. The effects of TLB misses, however, are generally quite small in high-end systems. This dramatic difference relates directly to the performance difference between the cache RAM and main storage, related to the machine cycle time (23.25 ns. - 43 MHz.). -- Mike Taylor ...!{ihnp4,hplabs,amd,sun}!amdahl!mat [ This may not reflect my opinion, let alone anyone else's. ]
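The "about 9 times slower" figure appears to follow from adding the storage-access difference straight onto the cycle time. This is my reconstruction of that arithmetic, simplistic in exactly the way the post concedes:

```python
# Reconstruction of the "about 9 times slower" arithmetic (one reading of
# the numbers given, simplistic as stated): the slower translation access
# adds its extra latency straight onto every cycle.
cycle_ns = 23.25                 # machine cycle (about 43 MHz)
fast_ram_ns = 7.5                # dedicated translation RAM
main_store_ns = 200.0            # main storage used instead

slow_cycle_ns = cycle_ns + (main_store_ns - fast_ram_ns)
slowdown = slow_cycle_ns / cycle_ns
print(round(slowdown, 1))        # about 9.3x, ignoring TLB-miss effects
```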
jvz@loral.UUCP (John Van Zandt) (07/26/85)
In article <5374@fortune.UUCP> wall@fortune.UUCP (Jim Wall) writes:
> I have a feeling that this topic may once again start the
>religious wars of cache types and implementations, but just maybe
>I can get the info that I'm looking for. Given a multiuser
>UNIX environment, and given a cache that is split so that user
>and supervisor/kernel are separate, what kind of performance
>improvement will the cache yield? More information: the cache is
>a simple direct-mapped implementation; as the CPU fetches something
>from RAM (I/O is not cached, obviously), it is stored in the cache.
>If that physical address is ever used again, the data will come from
>the cache.
>
> So where this will help is in program loops and often-used
>routines. The cache is 16K bytes, split into 8K for user and 8K for
>supervisor. THE QUESTION IS: in UNIX, how much improvement can the
>cache make? Let's face it, you won't be making a 70% - 90% hit rate.

Caches are strange creatures; exactly how they are designed can impact performance significantly. Your idea of separate caches for supervisor and user is a good one. However, you haven't given enough information to determine the performance improvement. All that a cache can do is to improve the memory performance for the system. The standard formula for computing the speedup looks at the percentage of cache hits times the cache speed plus the percentage of cache misses times the memory speed. And there are tricks that affect performance, such as the number of cache sets, whether the cache has a 'dirty' bit, etc. I haven't checked the M68000 lately, but if lines coming from the chip can signal whether an instruction or data fetch is occurring, then another speedup would be to have separate instruction and data caches.

> O.K., an extra credit part to the problem: if the CPU were a
>68020 with the internal 256-byte direct-mapped instruction cache, is
>there a need for the external cache?
Again, it depends on the hits for the internal cache; but my guess is that the internal cache does not take into account the different spaces (supervisor and user), therefore the external cache would give better performance on context switches.

>P.S. Both Altos and Charles River Data had both the 020 cache and
>an external 8K cache; Dual did not. Anyone know why? Not that I'm
>trying to have someone else do my design work. I believe that for the
>external cache to be really useful, the system architecture must be
>designed for the cache, such as 64-bit-wide memory, or block transfers
>on cache misses. Does CRD or Altos do any of these?

I don't agree with the 'designed for the cache' statement, but agree that you can get slightly better performance if you have designed the memory system to work with a cache... but the percentage improvement I suspect would be quite small, and only worthwhile in very high performance systems.

John Van Zandt
Loral Instrumentation  (619) 560-5888
uucp: ucbvax!sdcsvax!jvz
arpa: jvz@UCSD

P.S. Of course, the above are my opinions alone and not necessarily those of my employer.
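The standard speedup formula mentioned above can be written out directly. The timings and hit rates below are invented for illustration, not measurements of any machine:

```python
# The standard formula: effective access time is the hit-rate-weighted
# mix of cache and memory speeds. All numbers here are invented.
def effective_access(hit_rate, cache_ns, memory_ns):
    return hit_rate * cache_ns + (1.0 - hit_rate) * memory_ns

CACHE_NS, MEMORY_NS = 100.0, 400.0
for hit_rate in (0.5, 0.7, 0.9):
    t = effective_access(hit_rate, CACHE_NS, MEMORY_NS)
    print(f"hit rate {hit_rate:.0%}: {t:.0f} ns, {MEMORY_NS / t:.2f}x speedup")
```

Note how nonlinear the payoff is: going from a 50% to a 90% hit rate roughly doubles the speedup again, which is why the hit-rate arguments in this thread matter so much.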
alan@sun.uucp (Alan Marr, Sun Graphics) (07/28/85)
Is there any architectural reason why Motorola at some future date could not issue a 6802x with a larger instruction cache?
rbt@sftig.UUCP (R.Thomas) (08/02/85)
> I'd like to compare and contrast the difference in performance between a
> simple single-level paged memory manager using a RAM (a la Sage 68000) and
> a system like the IBM DAT box, where the page tables are stored in main memory
> and cached in hardware. The point being that switching context is MUCH
> faster if you only need to change the pointer to the page tables, rather than
> copy 8K of paging information into the page table RAM. It is assumed that
> the cache used to speed up the main memory page table accesses is sufficiently
> large to get a good hit rate (whatever that may be).

You pay to load the page table when you context switch with either system. In one system, you load the page table registers explicitly, all at once, during the actual context switch; in the other, you pay to load it in a more leisurely fashion as the DAT box cache faults it in. The reference to main memory to load the DAT box cache line costs just as much as the reference to main memory to load the page table entry. Mitigating circumstances: with the DAT box, you only pay to load the ones you actually use; with the page table in its own register file, you have to load each register in the file whether you intend to use that page or not. But if the cache is too small (and it *always* is too small -- there is no economic incentive to make it too big!), you may have to load each cache line several times. If you only care about interrupt response time, then the DAT-box/cache is a win. But you take the same throughput hit either way.

Rick Thomas
gnu@sun.uucp (John Gilmore) (08/09/85)
John Van Zandt of Loral Instrumentation (ucbvax!sdcsvax!jvz) said:
> Caches are strange creatures; exactly how they are designed can impact
> performance significantly. Your idea of separate caches for supervisor
> and user is a good one.

I believe (uninformed opinion) that this makes it perform worse, all other things equal. It means that at any given time, half the cache is unusable; thus if you spend 90% of your time in user state you only have a half-size cache. (Ditto if you are doing system state stuff.)

> I haven't checked the
> M68000 lately, but if lines coming from the chip can signal whether
> an instruction or data fetch is occurring, then another speedup would be
> to have separate instruction and data caches.

The 68000 signals supervisor/user as well as instruction/data. Again, this creates an artificial split. If indeed the CPU is spending 50% of its time on instruction fetches and 50% on data cycles, this could be OK, but it won't adapt dynamically as the instruction/data mix changes.

> > If the CPU were a
> >68020 with the internal 256 byte direct mapped instruction cache, is
> >there a need for the external cache?
> Again, it depends on the hits for the internal cache; but my guess is that the
> internal cache does not take into account the different spaces (supervisor
> and user), therefore the external cache would give better performance on
> context switches.

The internal instruction cache on the 68020 definitely takes supervisor/user mode into account. It does NOT take context switches into account, thus it must be flushed on context switch. I believe the hit rate on the icache is something like 50%. The reason for external caches is probably to speed up data accesses, which otherwise would go at main memory speeds.

> >P.S. Both Altos and Charles River Data had both the 020 cache and
> >an external 8K cache; Dual did not. Anyone know why?

Dual probably wanted to build a cheaper system.
Depending on the hit rate and timings, a cache system may LOSE performance because it takes longer to do a cache miss in a cached system than it takes to do a memory access in a noncache system. Remember, performance with a cache = (hitrate*hitspeed) + ((1-hitrate)*missspeed). What you buy in hitspeed may not reclaim all that you lose in missspeed.
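The caution above can be made concrete: when a miss costs more than a plain uncached reference, there is a break-even hit rate below which the cache is a net loss. The three timings here are hypothetical:

```python
# If a miss costs more than a plain uncached reference, solve
#   h*hit + (1-h)*miss = uncached
# for the hit rate h at which the cache breaks even. Timings invented.
def break_even_hit_rate(hit_ns, miss_ns, uncached_ns):
    return (miss_ns - uncached_ns) / (miss_ns - hit_ns)

h = break_even_hit_rate(hit_ns=100.0, miss_ns=500.0, uncached_ns=400.0)
print(h)    # 0.25: below a 25% hit rate, this cache is a net loss
```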
wall@fortune.UUCP (Jim Wall) (08/13/85)
Someone replying to my original article on caches said that the hit rate on the internal cache in the 68020 is about 50%. Anyone care to agree with that? Anyone care to tell me what reasonable application or operating system spends 50% of its time in loops that are smaller than 256 bytes??

The numbers that are claimed for the hit rates on caches are nothing short of incredible. I think the CPU manufacturers are the instigators, and nobody bothers to question them. But, hey, I could be wrong. It's happened before. So let's hear it. Anyone who claims high cache hit rates on normal applications, let's hear the justification for them.

-Jim Wall
...amd!fortune!wall
blarson@oberon.UUCP (Bob Larson) (08/14/85)
In article <5459@fortune.UUCP> wall@fortune.UUCP (Jim Wall) writes:
> The numbers that are claimed for the hit rates on caches are
>nothing short of incredible. I think the CPU manufacturers are the
>instigators, and nobody bothers to question them.

I know of one computer manufacturer that has one set of numbers for cache hit rate given out by the salespeople and another by the technical people (95% vs 98%). The manuals list the higher figure. They also just got around to revising all the manuals to say that a word is 32 bits rather than 16. Since the basic addressing unit of the machine is 16 bits, this makes talking about the machine awkward. (The basic addressing unit of a VAX is 8 bits.)

Bob Larson
Arpa: Blarson@Usc-Ecl.Arpa
Uucp: {ihnp4,hplabs,...}!sdcrdcf!uscvax!oberon!blarson
john@frog.UUCP (John Woods) (08/15/85)
>Someone...said that the hit rate on the internal cache in the 68020 is about
> 50%. Anyone care to agree with that? Anyone care to tell me what
> reasonable application or operating system spends 50% of its time
> in loops that are smaller than 256 bytes??
> ...
> But, hey, I could be wrong. It's happened before. So let's hear it. Anyone
> who claims high cache hit rates on normal applications, let's hear the
> justification for them.

We recently measured the cache-hit performance of our prototype 68020 board on a number of programs, some useful, some benchmark-type programs (including the good old Knight's tour). Results were: the '20 I-cache had a hit rate of between 30% and 58%. Our 8Kb external cache had a hit rate of 70-83%; between them (with both turned on), they had a hit rate of 76-89%. The 68000 board (our current product, and from which the current 68020 board was devised [roughly]) had a cache hit rate (only an 8Kb external cache, of course) of between 86% and 93% on the same programs. And indeed, we found that few reasonable loops are small enough to fit into the I-cache (especially since the Greenhills C compiler tries to be really clever about loop unrolling and re-ordering of code).

-- John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw%mit-ccc@MIT-XX.ARPA
davet@oakhill.UUCP (Dave Trissel) (08/16/85)
In article <5459@fortune.UUCP> wall@fortune.UUCP (Jim Wall) writes:
> Someone replying to my original article on caches said that
>the hit rate on the internal cache in the 68020 is about 50%.
>Anyone care to agree with that? Anyone care to tell me what
>reasonable application or operating system spends 50% of its time
>in loops that are smaller than 256 bytes??

The problem is that cache hit values are so variable that it really doesn't make sense to talk about an average figure. The lowest I've seen for the '020 is a range of 10 to 15 percent, which was taken from a monitoring of the Unix operating system. (Sorry, don't remember which version.) One would suspect that operating systems would be among the worst performers. On the other hand, we have lots of reports ranging from 30 to 65 percent for measured applications.

Yet another problem measuring cache hits specifically on the '020 is the fact that since the chip always does a 32-bit longword instruction fetch from the bus, if only the first word is used (e.g. it finishes an earlier instruction or is itself a 16-bit instruction) then the other word is treated as a cache hit. This tends to boost cache hit rate values, depending on just what you define a cache hit to be.

BTW, the 10 to 15 percent cache hit rate is nothing to sneeze at when you look at real performance gain. Take a hit rate of 10 percent. That 10 percent amounts to a much higher realized performance improvement when you consider that the cached reads would take 1.5 to 2 times longer to do on the external bus, and that the 15 to 20 percent less activity on that bus can then be used for simultaneous data reads and writes by the processor. So your real improvement could be anywhere up to 30 percent.

-- Dave Trissel  Motorola Semiconductor  Austin, Texas
{ihnp4,seismo}!ut-sally!oakhill!davet
chris@umcp-cs.UUCP (Chris Torek) (08/16/85)
>And indeed, we found that few reasonable loops are small enough to >fit into the I-cache (especially since the Greenhills C compiler tries to >be really clever about loop unrolling and re-ordering of code). Oh no! A pessimizing compiler! :-) (Of course the unrolled code might still run faster; I just thought this was a neat example of optimization backfiring.) -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 4251) UUCP: seismo!umcp-cs!chris CSNet: chris@umcp-cs ARPA: chris@maryland
hes@ecsvax.UUCP (Henry Schaffer) (08/19/85)
> Someone replying to my original article on caches said that
> the hit rate on the internal cache in the 68020 is about 50%.
> Anyone care to agree with that? Anyone care to tell me what
> reasonable application or operating system spends 50% of its time
> in loops that are smaller than 256 bytes??

The hit rate depends on the total cache size and on the amount loaded for each miss, as well as program and data locality. Tight loops do give programs a high hit rate, and so can other things like character I/O. Vectors and matrices are common in scientific computation, and systematic processing of these data structures also tends to give data locality and so produces a high hit rate. (This is why scientific programmers are usually concerned with whether matrices are stored row- or column-wise. Cache performance, as well as address calculation, is assisted by accessing sequential addresses.) On mainframes with large caches a hit rate of >80% can be achieved. This is often verified using a hardware monitor, not by trusting the vendor's opinion.

--henry schaffer  n c state univ
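The row- vs column-wise point can be shown in miniature for a row-major layout (the convention assumed here; FORTRAN is column-major, so the roles reverse):

```python
# Row- vs column-wise traversal of a row-major matrix: the element
# offsets visited show why one order is sequential and the other is not.
ROWS, COLS = 4, 6

def offset(r, c):
    return r * COLS + c                  # row-major layout

row_order = [offset(r, c) for r in range(ROWS) for c in range(COLS)]
col_order = [offset(r, c) for c in range(COLS) for r in range(ROWS)]
print(row_order[:4])   # [0, 1, 2, 3]   stride 1: cache-friendly
print(col_order[:4])   # [0, 6, 12, 18] stride COLS: poor locality
```

With a multi-word cache line, the sequential order gets several hits per line loaded; the strided order can miss on every reference once COLS times the element size exceeds the line size.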
edhall@randvax.UUCP (Ed Hall) (08/21/85)
Concerning the 68020's cache: I can think of a lot of places where a loop would fit in a 256-byte cache, especially in string-processing applications. Remember, in many applications a lot of time is spent simply copying memory, making searches, and so forth. This isn't just limited to strings: matrix operations usually include small inner loops where the bulk of computer time is spent. The same is true of bit-map graphics. And it is true for a lot of other CPU-hungry applications.

So something close to a 50% hit rate wouldn't surprise me for a fairly large class of programs, though there is probably a larger class of programs that wouldn't do nearly that well. If Motorola were claiming it as an *average* I'd wonder who they thought they were fooling, but I don't believe they are doing so.

-Ed Hall
decvax!randvax!edhall
mash@mips.UUCP (John Mashey) (08/21/85)
This is a response to a question from huguet@LOCUS.UCLA.EDU [sorry, mail kept bouncing] about an earlier assertion of mine:
> f) Use of optimizing compilers that put things in registers, often driving
> the hit rate down [yes, down], although the speed is improved and there are
> fewer total memory references.

I don't know of any published numbers to back this up. The effect has been seen in [unpublished] simulations; might be a good topic for research. It does make sense, at least for data cache (instruction cache effects may vary wildly). The better an optimizer is, the more likely it is to put frequently-used variables in registers, thus reducing the number of references that are likely to be cache hits. Consider the ultimate case: a smart compiler and a machine with many registers, such that most code sequences fetch a variable just once, so that most data references are cache misses. Passing arguments in registers also drives the hit rate down.

-- -john mashey
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash  DDD: 415-960-1200
USPS: MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043
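The effect described above can be shown with invented reference counts: register-allocating a variable removes exactly the references that would have been hits, so the hit rate falls even though total memory traffic falls too:

```python
# Invented counts: a variable referenced 10 times from memory gives one
# compulsory miss and nine hits; register-allocated, it gives one load
# (a miss) and no further memory references at all.
def hit_rate(misses, hits):
    return hits / (hits + misses)

no_reg = hit_rate(misses=1, hits=9)      # 0.9 hit rate, 10 references
with_reg = hit_rate(misses=1, hits=0)    # 0.0 hit rate, 1 reference
print(no_reg, with_reg)
```

The program got faster (one reference instead of ten), while the measured hit rate collapsed; the moral is that hit rate alone is a poor figure of merit.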
kds@intelca.UUCP (Ken Shoemaker) (08/29/85)
> BTW, the 10 to 15 percent cache hit rate is nothing to sneeze at when you look > at real performance gain. Take a hit rate of 10 percent. That 10 percent This assumes, of course, that there is no miss penalty... -- ...and I'm sure it wouldn't interest anybody outside of a small circle of friends... Ken Shoemaker, Microprocessor Design for a large, Silicon Valley firm {pur-ee,hplabs,amd,scgvaxd,dual,qantel}!intelca!kds ---the above views are personal. They may not represent those of the employer of its submitter.
ed@mtxinu.UUCP (Ed Gould) (08/31/85)
In article <170@mips.UUCP> mash@mips.UUCP (John Mashey) writes: > Consider the ultimate >case: a smart compiler and a machine with many registers, such that >most code sequences fetch a variable just once, so that most data references >are cache misses. Passing arguments in registers also drives the hit >rate down. With this ultimate machine/compiler combination it seems intuitively that a data cache would then be a *bad* idea, since having a cache can't be faster than an uncached memory reference (for what would be a miss) and is often slower. We can then use the real estate saved for even more registers! -- Ed Gould mt Xinu, 2910 Seventh St., Berkeley, CA 94710 USA {ucbvax,decvax}!mtxinu!ed +1 415 644 0146 "A man of quality is not threatened by a woman of equality."
csg@pyramid.UUCP (Carl S. Gutekunst) (09/06/85)
>>Consider the ultimate >>case: a smart compiler and a machine with many registers, such that >>most code sequences fetch a variable just once, so that most data references >>are cache misses. Passing arguments in registers also drives the hit >>rate down. > >With this ultimate machine/compiler combination it seems intuitively >that a data cache would then be a *bad* idea, since having a cache >can't be faster than an uncached memory reference (for what would >be a miss) and is often slower. We can then use the real estate saved >for even more registers! In fact, a data cache greatly improves throughput on real large-register-set machines, like the Pyramid. Many operations in program development (e.g. compiling) require repetitive searching/sorting on moderately large arrays; the data cache can help out here a lot. It also helps when you have to save all those registers during a context switch. What we really need is a machine with 64K 32-bit registers. :-{) -- -m------- Carl S. Gutekunst, Software R&D, Pyramid Technology ---mmm----- P.O. Box 7295, Mountain View, CA 94039 415/965-7200 -----mmmmm--- UUCP: {allegra,decwrl,nsc,shasta,sun,topaz!pyrnj}!pyramid!csg -------mmmmmmm- ARPA: pyramid!csg@sri-unix.ARPA
davet@oakhill.UUCP (Dave Trissel) (09/06/85)
In article <48@intelca.UUCP> kds@intelca.UUCP (Ken Shoemaker) writes:
>> BTW, the 10 to 15 percent cache hit rate is nothing to sneeze at when you look
>> at real performance gain. Take a hit rate of 10 percent. That 10 percent
>
>This assumes, of course, that there is no miss penalty...

There is no penalty, since the instruction address is sent both to the cache and the bus control unit at the same time. A hit causes the bus controller to avoid the bus cycle, and as I understand it, without any loss of external bus efficiency if another operation is queued.

-- Dave Trissel  Motorola Semiconductor Inc.
{ihnp4,seismo}!ut-sally!oakhill!davet  Austin, Texas
jer@peora.UUCP (J. Eric Roskos) (09/09/85)
> With this ultimate machine/compiler combination it seems intuitively
> that a data cache would then be a *bad* idea, since having a cache
> can't be faster than an uncached memory reference ...

See, what you have here is a matter of working at a problem from two different ends. The example of the compiler that used a large set of registers, and thus produced mostly cache misses, is really a compiler that knows about cache and manages it itself. The registers are just a cache that's closest to the processor. On the other hand, most caches are intended to improve performance on the assumption that the compilers aren't going to do much of that themselves -- either because they can't, due to the machine's architecture, or because the same program is to run on a number of machines with different memory-hierarchy organizations. I.e., the compiler that generated good code for a machine with 250 registers might not work so well (or at all) on a machine with 16 registers; but for cost reasons it might be desirable for a manufacturer to produce a machine with cache and a machine without it, whereas he has to keep the 250 registers no matter what, if there's code out there that uses all of them.

-- Shyy-Anzr: J. Eric Roskos
UUCP: ..!{decvax,ucbvax,ihnp4}!vax135!petsd!peora!jer
US Mail: MS 795; Perkin-Elmer SDC; 2486 Sand Lake Road, Orlando, FL 32809-7642
franka@mmintl.UUCP (Frank Adams) (09/10/85)
In article <455@mtxinu.UUCP> ed@mtxinu.UUCP (Ed Gould) writes: >In article <170@mips.UUCP> mash@mips.UUCP (John Mashey) writes: > >> Consider the ultimate >>case: a smart compiler and a machine with many registers, such that >>most code sequences fetch a variable just once, so that most data references >>are cache misses. Passing arguments in registers also drives the hit >>rate down. > >With this ultimate machine/compiler combination it seems intuitively >that a data cache would then be a *bad* idea, since having a cache >can't be faster than an uncached memory reference (for what would >be a miss) and is often slower. We can then use the real estate saved >for even more registers! Not necessarily. When a context switch takes place, you would have to save all the registers. The cache can, at worst (best?) be dumped.
patc@tekcrl.UUCP (Pat Caudill) (09/16/85)
>In article <170@mips.UUCP> mash@mips.UUCP (John Mashey) writes:
>> Consider the ultimate
>>case: a smart compiler and a machine with many registers, such that
>>most code sequences fetch a variable just once, so that most data references
>>are cache misses. Passing arguments in registers also drives the hit
>>rate down.

If you have read the article on the IBM 801 project, this was just what they did. The cache was the register set (which was medium large -- 32 registers). But there was a very, very smart compiler which optimized register usage even across subroutine calls. Go look at the article; it was published in a SIGPLAN proceedings several years ago. (It was by the compiler writer.)

Pat Caudill
boston@celerity.UUCP (Boston Office) (09/18/85)
In article <645@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes: > >In article <455@mtxinu.UUCP> ed@mtxinu.UUCP (Ed Gould) writes: >>In article <170@mips.UUCP> mash@mips.UUCP (John Mashey) writes: >> >>> Consider the ultimate >>>case: a smart compiler and a machine with many registers, such that >>>most code sequences fetch a variable just once, so that most data references >>>are cache misses. Passing arguments in registers also drives the hit >>>rate down. >> >>With this ultimate machine/compiler combination it seems intuitively >>that a data cache would then be a *bad* idea, since having a cache >>can't be faster than an uncached memory reference (for what would >>be a miss) and is often slower. We can then use the real estate saved >>for even more registers! > >Not necessarily. When a context switch takes place, you would have to >save all the registers. The cache can, at worst (best?) be dumped. However: if you have enough registers, and manage them in banks, they only need be saved if you run out of BANKS of registers.
mash@mips.UUCP (John Mashey) (09/20/85)
Pat Caudill writes: > If you have read the article on the IBM 801 project this was just > what they did. The cache was the register set (which was medium large - > 32 registers). But there was a very very smart compiler which optimized > register usage even across subroutine calls. Go look at the article it > was published in a SIGPLAN several years ago. The referenced reference is: M. Auslander, M. Hopkins, "An Overview of the PL.8 Compiler", Proc. SIGPLAN Symp. Compiler Construction, ACM, Boston, June 1982, 22-31. A useful related article is: F. Chow, J. L. Hennessy, "Register Allocation by Priority-Based Coloring", Proc. SIGPLAN Symp. Compiler Construction, ACM, Montreal, June 1984, 222-232. -- -john mashey UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash DDD: 415-960-1200 USPS: MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043