steve@gondor.UUCP (Stephen J. Williams) (04/19/86)
In article <909@umd5.UUCP> zben@umd5.UUCP (Ben Cranston) writes:
>In article <6581@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
>>In article ?? somebody writes:
>
>>> .... But, to maintain multiple
>>> cache consistency where there are multiple virtual address spaces, caches
>>> have to keep (and be able to associate on) physical addresses.
>
>>A simpler approach is to have the "address" in the cache include a few bits
>>of process number.  This is increasingly common.  It amounts to putting
>>the processes in a common address space for caching purposes without giving
>>the individual processes the ability to address each other's data.
>
>But, this means that shared writable memory becomes tricky.  There are now
>several cache keys under which a given memory word might be stored, one for
>each process that might be accessing the memory.

Personally, I have no trouble with the caches that I plan on.  They are
part of the I/O system, and as such, are kept in REAL memory.  It is the
kernel's job to decide how to use these caches, and it is the kernel's job
to multiplex them amongst the processes.  Note that the data within the
cache has to be given to one process or another, no matter how many
address spaces you have.

Shared writable memory is a different thing altogether.  Shared memory is
a form of communication between processes.  The way that I see this
implemented is by putting one page into the page tables of 2 or more
processes.  (I'm speaking in terms of separate address spaces, of course.)
This is not tricky.

--Scal
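[To make the page-table trick Scal describes concrete, here is a minimal
sketch in C.  The structure name, field widths, and table sizes are all
invented for illustration, not taken from any particular machine: two
per-process page tables whose entries simply name the same physical frame.]

    /* Two per-process page tables whose entries name the same physical
       frame.  Field widths and names are invented for illustration. */
    #include <stdio.h>

    typedef struct {
        unsigned valid    : 1;
        unsigned writable : 1;
        unsigned frame    : 20;    /* physical frame number */
    } pte_t;

    #define PAGES_PER_PROCESS 1024

    int main(void)
    {
        static pte_t ptable_a[PAGES_PER_PROCESS];  /* process A's page table */
        static pte_t ptable_b[PAGES_PER_PROCESS];  /* process B's page table */

        /* Virtual page 5 of A and virtual page 17 of B both map onto
           physical frame 42: the two processes now share that page, and
           neither needs to know the other's virtual address for it. */
        ptable_a[5]  = (pte_t){ .valid = 1, .writable = 1, .frame = 42 };
        ptable_b[17] = (pte_t){ .valid = 1, .writable = 1, .frame = 42 };

        printf("A page 5 -> frame %u, B page 17 -> frame %u\n",
               (unsigned)ptable_a[5].frame, (unsigned)ptable_b[17].frame);
        return 0;
    }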
aglew@ccvaxa.UUCP (04/21/86)
I'm afraid this is another long post.  I go into painful detail about the
advantages of one virtual address space for multiple cache systems;
essentially, the advantages are not qualitative, only quantitative, in
that they permit you to run a multiple cache system a bit faster with less
hardware.

>/* Written 2:52 pm Apr 18, 1986 by petolino@chronon.UUCP */
>Sharing writable memory among address spaces in a cached system is indeed
>tricky.  As the above postings suggest, problems of consistency arise
>not only between copies of the same physical memory location residing in
>different caches in the system, but also among copies of a physical memory
>location residing in different places in a single cache (by 'consistency'
>we mean making sure that all cached copies of a physical memory location
>accurately reflect any writes that have been done to that location - the
>terms 'cache coherence' and 'data integrity' have also been used for this
>concept).
>
>Many solutions to these problems have been implemented and/or proposed.

(Paraphrased): (1) No shared memory; (2) don't cache shared memory;
(3) single cache location (doesn't help inter-cache consistency);
(4) temporarily `own' cached shared memory locations.

I don't think you mentioned (unless that was what you meant by number 4)
the widely used multiple cache synchronization technique of write-through,
either with invalidate, or with update.  Add to this the Snoopy cache
techniques from Xerox, which attempt to avoid unnecessary write-throughs.

The write-through techniques all involve having the caches listen to the
system bus (at the same time as they are responding to their own
processors' requests) and updating their own entries that correspond to
traffic they see on the bus.  Yes, this requires multiporting.  Update
according to bus traffic requires two full write ports; invalidate means
that you only have to write one bit, which the processor's cache port only
reads, so it's considerably less expensive.  Whatever, you have to be
associating on the addresses that you see at both ports.

Consider these configurations:

    Processor A                          Processor B
         |                                    |
    virtual address                      virtual address
         |                                    |
    physical address                     physical address
         |                                    |
   |----------------Physical Memory-------------------|

There are two places you can put caches for any particular processor:
before the virtual address gets translated, or after.

  Processor --virtual address--> :cache: --physical address--> Physical memory

  Processor --virtual address--> --physical address--> :cache: Physical memory

If you are only dealing with one processor that doesn't even have to worry
about things like DMA, and if your virtual mapping doesn't change very
often (depends on your application mix), then there is an advantage in
caching on virtual addresses rather than physical, since you can simply
start the cache at the same time as you start your translation (using your
TLB or whatever) and you can reduce the cycle time for cache hits by that
much.  I.e., caching on virtual addresses can let you go a bit faster.
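[Here is a minimal sketch in C of the write-through-with-invalidate scheme
mentioned above, modeling in software what would really be hardware.  All
names, the cache geometry, and the toy bus model are my own assumptions:
each cache watches the bus and, on seeing a write to an address it holds,
simply clears that line's valid bit.]

    /* Toy direct-mapped cache with a snoop-invalidate port.
       Sizes and names are invented for illustration. */
    #include <stdint.h>
    #include <stdio.h>

    #define NLINES 256                /* direct-mapped, one word per line */

    typedef struct {
        uint32_t tag;
        uint32_t data;
        int      valid;
    } line_t;

    typedef struct { line_t lines[NLINES]; } cache_t;

    static unsigned index_of(uint32_t addr) { return (addr / 4) % NLINES; }
    static uint32_t tag_of(uint32_t addr)   { return (addr / 4) / NLINES; }

    /* Processor-side write: update our copy, then write through to the
       bus (the bus traffic itself is implicit in this toy model). */
    static void cpu_write(cache_t *c, uint32_t addr, uint32_t data)
    {
        line_t *l = &c->lines[index_of(addr)];
        l->tag = tag_of(addr);
        l->data = data;
        l->valid = 1;
    }

    /* Snoop port: another cache's write-through seen on the bus.  Note
       that invalidation writes only one bit -- no second full data port. */
    static void snoop_invalidate(cache_t *c, uint32_t addr)
    {
        line_t *l = &c->lines[index_of(addr)];
        if (l->valid && l->tag == tag_of(addr))
            l->valid = 0;    /* the next CPU read misses and refetches */
    }

    int main(void)
    {
        static cache_t a, b;

        cpu_write(&a, 0x1000, 7);        /* A caches the word at 0x1000   */
        cpu_write(&b, 0x1000, 9);        /* B writes through to the bus...*/
        snoop_invalidate(&a, 0x1000);    /* ...and A's snoop port sees it */

        printf("A's copy still valid? %d\n",
               a.lines[index_of(0x1000)].valid);
        return 0;
    }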
When you have multiple caches, if you want to do bus monitoring, you have
three possible configurations:

(1) PP: caching on your processor's physical addresses and listening to
    physical addresses on the bus:

  Processor --virtual address--> --physical address--> :cache: Physical memory
                                                        \___/
                                                        /   \
  Processor --virtual address--> --physical address--> :cache: Physical memory

(2) VP: caching on your processor's virtual addresses, listening to
    physical addresses on the bus:

  Processor --virtual address--> :cache: --physical address--> Physical memory
                                  \_____/
                                  /     \
  Processor --virtual address--> :cache: --physical address--> Physical memory

(3) VV: caching on your processor's virtual addresses and listening to
    other processors' virtual addresses:

  Processor --virtual address--> :cache: --physical address--> Physical memory
                                  \_____/
                                  /     \
  Processor --virtual address--> :cache: --physical address--> Physical memory

(The fourth possibility has no advantages.)

PP is what is usually done, but it has the same `problem' (whether it is a
problem depends on how much you are worried about speed) that the cache
doesn't get started until the address has been translated.

VP is alright, in that it means that you can start responding to your own
processor before translating the address, but it doubles the complexity of
your cache in that you have to be able to associate on both physical and
virtual addresses.

VV gives you faster response to your own processor by starting the cache
lookup before address translation begins.  It works like this:

    Give virtual address to cache
    If a miss then
           put virtual address on cache synchronization bus
        || put physical address on physical memory bus
        || if a write then put data on data bus

where || means operations performed in parallel.  The speed advantage
isn't carried over to the bus - the advantage over VP is only that both
cache ports associate on the same thing, the virtual address - but it only
works if all processors share the same virtual address space.

---

Please, don't anyone say `who cares about the little increment of speed
you get from virtual caching?'.  Caching is purely about speed - if you're
not worried about speed, don't do caching at all.  No, sorry, that's not
quite fair - some people may be happy with the increment gained from PP
caching, and not need to worry about getting that little bit faster
(particularly since a good cache is faster than many of the
microprocessors it might get attached to).  But if your goals are speed,
speed, and more speed, then the little bit of gain from V* caching is
tempting, but the extra size necessary for VP caching turns you off.

Andy "Krazy" Glew.  Gould CSD-Urbana.    USEnet:  ihnp4!uiucdcs!ccvaxa!aglew
1101 E. University, Urbana, IL 61801     ARPAnet: aglew@gswd-vms
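[A rough way to see the VP-versus-VV hardware cost Glew is describing, as
a C sketch.  The struct layouts are purely illustrative, my own invention:
a VP line must carry, and be associatively matched on, two tags, one per
port, while a VV line needs only one.]

    /* Toy per-line layouts for VP and VV caches; illustrative only. */
    #include <stdint.h>
    #include <stdio.h>

    /* VP: the processor port matches virtual tags while the snoop port
       matches physical tags, so every line stores both and the cache
       must be able to associate on either. */
    typedef struct {
        uint32_t vtag;   /* matched against the local processor's address */
        uint32_t ptag;   /* matched against physical addresses on the bus */
        uint32_t data;
        unsigned valid : 1;
    } vp_line_t;

    /* VV: both ports present virtual addresses, so one tag serves both
       -- but only if all processors share one virtual address space. */
    typedef struct {
        uint32_t vtag;
        uint32_t data;
        unsigned valid : 1;
    } vv_line_t;

    int main(void)
    {
        printf("VP line: %zu bytes, VV line: %zu bytes\n",
               sizeof(vp_line_t), sizeof(vv_line_t));
        return 0;
    }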
stewart@magic.UUCP (04/23/86)
A physical address cache need not suffer in performance because one must
somehow do the virtual to physical translation "first."  The early
Stanford Sun boards, for example, applied the high order bits of the
virtual address to the TLB and the low order bits to the cache tag memory,
and the physical page number came out of the TLB just in time to be
compared with the cache tag, giving a hit or miss indication.

The particular implementation had some oddities like the page size
equalling the cache size, but having them be different doesn't hurt too
much.  For example, if the page size is smaller than the cache size
(likely, these days) then it might appear that you need some bits of the
physical page number to address the cache tag memory, but that can be
solved by requiring that a few low order bits of the physical page number
match the low order bits of the virtual page number.  Any real memory
allocator worth its salt should be able to deal with that kind of
restriction.

If the page number and the cache index overlap by, say, two bits, then one
could also look up the four possible tags and then select, based on the
TLB output.  Kind of a set associative trick....  Let your imagination
roam.

-Larry
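[A software model of the overlap Larry describes, written in C.  All the
parameters are invented for illustration (4 KB pages, a 4 KB direct-mapped
cache, 16-byte lines), and the TLB is a stub: because the cache index
comes entirely from page-offset bits, the tag RAM read and the TLB lookup
can proceed in parallel, with the physical tag comparison at the end.]

    /* Overlapped TLB and cache-tag lookup, modeled sequentially in C.
       In hardware the two lookups below run in parallel. */
    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define PAGE_SHIFT 12                                /* 4 KB pages    */
    #define LINE_SHIFT 4                                 /* 16-byte lines */
    #define NLINES     (1 << (PAGE_SHIFT - LINE_SHIFT))  /* 256 lines     */

    typedef struct { uint32_t ptag; bool valid; } tag_entry_t;

    static tag_entry_t tag_ram[NLINES];

    /* Stub standing in for the TLB; identity-maps pages for the demo. */
    static uint32_t tlb_translate(uint32_t vpage) { return vpage; }

    static bool cache_hit(uint32_t vaddr)
    {
        /* The index uses only page-offset bits, so it needs no
           translation and the tag RAM read can start immediately... */
        unsigned idx = (vaddr >> LINE_SHIFT) & (NLINES - 1);

        /* ...while the TLB translates the page-number bits. */
        uint32_t pframe = tlb_translate(vaddr >> PAGE_SHIFT);

        /* The physical frame number arrives "just in time" to be
           compared with the cache tag, as on the early Sun boards. */
        return tag_ram[idx].valid && tag_ram[idx].ptag == pframe;
    }

    int main(void)
    {
        uint32_t va = 0x00003230;
        tag_ram[(va >> LINE_SHIFT) & (NLINES - 1)] =
            (tag_entry_t){ .ptag = va >> PAGE_SHIFT, .valid = true };
        printf("hit? %d\n", cache_hit(va));   /* prints: hit? 1 */
        return 0;
    }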
eric@chronon.UUCP (Eric Black) (04/23/86)
In article <7711@watdaisy.UUCP> dneudoerffer@watdaisy.UUCP (Dave Neudoerffer) writes:
>I think we need some clarification on what cache we're talking about here.
>I agree with Henry that a process identification tag on addresses
>in the memory management translation cache (TLB) is a bonus and saves
>flushes of this cache.
>Also if a data cache is used between processor and memory management,
>ie caching virtual addresses, then this tag may also be useful.  However,
>as Ben points out, you can get physical address aliasing if two
>processes are accessing the same memory page.
>However, if a data cache is put in the system after the memory management
>hardware, there is no aliasing problem since only physical addresses
>are being cached.  I believe this last setup is the one found in most
>systems with a data cache.
>

Ah, but putting the cache AFTER the memory management hardware puts the
address translation in the critical path for ALL memory accesses, whether
the reference hits the cache or not.  Too bad if TLB lookup (or whatever)
is more than an insignificant fraction of cache access time.  It is easier
to implement, though.

Of course, if the processor is slow enough that time(TLB+cache) is still
quick from the processor's point of view, it doesn't matter.  Nobody would
ever want to buy a faster processor, would they?  :-)

-- 
Eric Black   "Garbage In, Gospel Out"
UUCP:  {sun,pyramid,hplabs,amdcad}!chronon!eric
WELL:  eblack
BIX:   eblack
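[To put rough numbers on this critical-path argument, here is a
back-of-envelope comparison in C.  Every latency and the hit ratio below
are invented for illustration, not measurements of any machine.]

    /* Effective access time: translation in series vs. translation
       overlapped with the cache lookup.  All numbers are made up. */
    #include <stdio.h>

    int main(void)
    {
        double t_tlb   = 40.0;     /* ns: TLB lookup              */
        double t_cache = 80.0;     /* ns: cache access            */
        double t_mem   = 400.0;    /* ns: main memory on a miss   */
        double h       = 0.95;     /* cache hit ratio             */

        /* Cache after the MMU: every reference, hit or miss, pays
           for the TLB lookup first. */
        double serial = t_tlb + t_cache + (1.0 - h) * t_mem;

        /* Translation overlapped with the cache lookup (virtual
           caching, or the index-with-offset-bits trick): a hit costs
           only the slower of the two lookups. */
        double t_lookup = t_tlb > t_cache ? t_tlb : t_cache;
        double overlap  = t_lookup + (1.0 - h) * t_mem;

        printf("translation in series:  %.0f ns/access\n", serial);  /* 140 */
        printf("translation overlapped: %.0f ns/access\n", overlap); /* 100 */
        return 0;
    }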
jer%peora@peora.UUCP (04/29/86)
> Seems to me that you've worked around to one single virtual address space.
> You've got a fixed number of address bits for your address within a process,
> plus a fixed number of address bits for the process number => one bigger
> virtual address with a fixed number of bits.
>
> What you've proposed is segments, with a one-to-one correspondence between
> segments and processes.

It was my intent to work around to a single virtual address space!
Because that was what you were asking about.  However, note that this
address space has one unusual property, viz., that if the "process number"
bits are all zero (or some other unique value, but let's say zero WLOG)
then you get your "own" process's address space, just as if you'd filled
in your process number there.  I *think* this provides what you concluded
you require, viz., a way for the code to tell "what processor it's running
on" (by which I think you meant "what process it's running on behalf of"
-- clearly in a symmetrical multiprocessor the code doesn't have to know
what processor it's running on at all, except for the small amount of code
that actually schedules processes onto processors).

The issue of "where to put the caches" also applies to "where to put the
address translation hardware".  Since such hardware tends to be slow, you
get better performance if you put it with the processors, so that you have
a number of them working in parallel.  On the other hand, you then have
your basic consistency problem (a favorite topic of mine, since my
research back in graduate school involved a model of memory where data
objects, rather than memory locations, had names, in order to avoid this
problem), i.e., keeping multiple copies of what are really the same object
consistent.  You can eliminate this problem by putting the translation
hardware out at the memory (which I believe is what was done by Gottlieb
et al. in their supercomputer project, along with also putting some adders
and so on out there), but then you only have one of them, which means it
has to be very fast to avoid a bottleneck.  I recall reading a comment by
Gottlieb about that fairly recently, where he was saying he wished he'd
put his memory management at the processors instead.

Regarding the "segments", I have some difficulty with that term because it
seems to constrain a lot of thinking.  What I am actually proposing is
that you have multiple translation tables for the lower-order bits of your
address, one such table per process, and you select the table based on the
high-order bits of the address.  The processor would also have a register
identifying the current process number, which it would use to fill in
these high-order bits whenever they would otherwise be zero.

-- 
E. Roskos
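[A small C sketch of the address formation Roskos describes.  The 8/24 bit
split and all names are my own assumptions, chosen purely for
illustration: when the process-number field of an address is zero, the
contents of the current-process register are substituted before a
translation table is selected.]

    /* Hypothetical 32-bit address: high 8 bits = process number,
       low 24 bits = offset within that process's space. */
    #include <stdint.h>
    #include <stdio.h>

    #define PROC_SHIFT 24
    #define PROC_MASK  0xFFu

    static uint32_t current_process = 7;  /* the processor's "current
                                             process number" register */

    /* Form the effective single-space virtual address. */
    static uint32_t effective_address(uint32_t addr)
    {
        uint32_t proc = (addr >> PROC_SHIFT) & PROC_MASK;
        if (proc == 0)                    /* zero means "my own space" */
            proc = current_process;
        return (proc << PROC_SHIFT) | (addr & ((1u << PROC_SHIFT) - 1));
    }

    int main(void)
    {
        /* Offset 0x001234 in "my own" space resolves to process 7. */
        printf("0x%08X\n", effective_address(0x00001234));  /* 0x07001234 */
        /* An explicit process number (3) passes through unchanged. */
        printf("0x%08X\n", effective_address(0x03001234));  /* 0x03001234 */
        return 0;
    }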