lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (02/22/90)
In article <43367@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>1) Several posters have mentioned that there is some unspecified but obvious
>(to them) major problem with using inverted page tables together with
>memory mapped files.  I wonder if someone could enlighten us regarding
>this problem, since apparently it isn't obvious to every system architect :-)

The big problem is with sharing.

The word "inverted" is used because the table is a map from physical
addresses to virtual addresses.  More precisely, you use an array, and
use physical page numbers as subscripts.  Each table entry tells you
which virtual page is mapped there.

SO: only one virtual page can be mapped there.  Unless, of course, the
OS goes through a fair bit of grief to make the machine work right in
spite of its design.
-- 
Don    D.C.Lindsay    Carnegie Mellon Computer Science
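A minimal sketch, in C, of the structure Lindsay describes; all names and
sizes are hypothetical, and real hardware hashes instead of scanning.  The
point is only that each physical frame records at most one virtual mapping:

/* Inverted page table sketch: one entry per physical frame, indexed by
 * physical frame number.  Hypothetical names; real designs add a hash
 * anchor table, protection bits, reference/dirty bits, and so on. */
#include <stdint.h>
#include <stddef.h>

#define NFRAMES 4096                 /* physical frames in the machine */

struct ipt_entry {
    uint32_t asid;                   /* address space (process) id     */
    uint32_t vpn;                    /* virtual page mapped to frame   */
    int      valid;
};

static struct ipt_entry ipt[NFRAMES];   /* size is O(physical memory)  */

/* Translate (asid, vpn) by searching the table.  Because each frame has
 * exactly one slot, a second virtual page cannot map the same frame. */
long translate(uint32_t asid, uint32_t vpn)
{
    for (size_t pfn = 0; pfn < NFRAMES; pfn++)
        if (ipt[pfn].valid && ipt[pfn].asid == asid && ipt[pfn].vpn == vpn)
            return (long)pfn;
    return -1;                       /* not resident: page fault       */
}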
aglew@oberon.csg.uiuc.edu (Andy Glew) (02/22/90)
>>1) Several posters have mentioned that there is some unspecified but obvious
>>(to them) major problem with using inverted page tables together with
>>memory mapped files.  I wonder if someone could enlighten us regarding
>>this problem, since apparently it isn't obvious to every system architect :-)
>
>The big problem is with sharing.
>
>The word "inverted" is used because the table is a map from physical
>addresses to virtual addresses.  More precisely, you use an array, and
>use physical page numbers as subscripts.  Each table entry tells you
>which virtual page is mapped there.
>
>SO: only one virtual page can be mapped there.  Unless, of course, the
>OS goes through a fair bit of grief to make the machine work right in
>spite of its design.

Now, let's see.  Physical page addresses map to virtual?  Via a hash
function?  What ever happened to hash chaining? :-)

Especially if the TLB miss is handled in software (like on MIPS).
--
Andy Glew, aglew@uiuc.edu
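A sketch of the hash-anchor-table arrangement Glew is alluding to, as a
software TLB-miss handler might see it.  Names, sizes, and the hash are
hypothetical, not MIPS or any real refill code:

/* Hash-chained inverted page table: hash (asid, vpn) into an anchor
 * table, then follow the collision chain threaded through the IPT.
 * Anchors and 'next' fields must be initialized to CHAIN_END. */
#include <stdint.h>

#define NFRAMES   4096
#define NANCHORS  8192
#define CHAIN_END (-1)

struct ipt_entry {
    uint32_t asid, vpn;              /* the one virtual page in this frame */
    int32_t  next;                   /* next frame on this hash chain      */
};

static struct ipt_entry ipt[NFRAMES];
static int32_t anchor[NANCHORS];     /* head frame of each hash bucket     */

static uint32_t vhash(uint32_t asid, uint32_t vpn)
{
    return (asid ^ (vpn * 2654435761u)) % NANCHORS;
}

/* Software TLB refill: returns the physical frame, or -1 for a page fault. */
int32_t tlb_refill(uint32_t asid, uint32_t vpn)
{
    for (int32_t pfn = anchor[vhash(asid, vpn)]; pfn != CHAIN_END;
         pfn = ipt[pfn].next)
        if (ipt[pfn].asid == asid && ipt[pfn].vpn == vpn)
            return pfn;              /* caller loads (vpn -> pfn) into TLB  */
    return -1;
}

Note that the chain only resolves hash collisions: each frame still records
a single (asid, vpn), which is where the sharing/aliasing problem Lindsay
raised comes from.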
johnl@esegue.segue.boston.ma.us (John R. Levine) (02/22/90)
In article <8106@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>In article <43367@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>>1) Several posters have mentioned that there is some unspecified but obvious
>>(to them) major problem with using inverted page tables together with
>>memory mapped files. ...
>
>The big problem is with sharing.

The ROMP (actually its MMU chip, which is named Rosetta) handles this pretty
well despite having a reverse-map page table with no provision for aliasing.
The high four bits of a memory address are looked up in a tiny fast memory
to get a segment number.  These segment numbers are global to the entire
system.  The file mapping we put into AIX maps a file to a segment, so if you
map a file into several processes there is no aliasing problem; they use the
same segment.

If you map a file copy-on-write, the aliasing problem does exist, and they
handled that at the VRM level via soft page faults.  The same problem occurs
after a fork, since the pages of the stack and data segments are physically
but not logically shared.  In that case, they made them copy-on-touch,
evidently because by the time they'd handled the soft page fault, the extra
effort involved in making a copy of the page wasn't all that great.
-- 
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650
johnl@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl
"Now, we are all jelly doughnuts."
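A sketch of the segment-register step Levine describes, with illustrative
field widths and names rather than the real ROMP layout: the top four bits
of an effective address select a system-wide segment id, and the inverted
table is keyed on (segment id, page within segment), so two processes that
load the same segment id share its pages with no aliasing.

/* Hypothetical ROMP/Rosetta-style address formation.  The returned value
 * is the system-wide virtual page number an inverted table would be
 * searched with. */
#include <stdint.h>

struct seg_mmu {
    uint32_t segreg[16];             /* per-process: global segment ids */
};

uint64_t global_vpn(const struct seg_mmu *mmu, uint32_t eaddr,
                    unsigned page_shift)
{
    uint32_t segid  = mmu->segreg[eaddr >> 28];    /* top 4 bits        */
    uint32_t offset = eaddr & 0x0fffffffu;         /* low 28 bits       */
    return ((uint64_t)segid << (28 - page_shift)) | (offset >> page_shift);
}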
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (02/23/90)
In article <1990Feb22.034517.8676@esegue.segue.boston.ma.us> johnl@esegue.segue.boston.ma.us (John R. Levine) writes:
>The ROMP (actually its MMU chip, which is named Rosetta) handles this pretty
>well despite having a reverse-map page table with no provision for aliasing.
>The high four bits of a memory address are looked up in a tiny fast memory
>to get a segment number.  These segment numbers are global to the entire
>system.  The file mapping we put into AIX maps a file to a segment, so if you
>map a file into several processes there is no aliasing problem; they use the
>same segment.

The segment idea could have been built on top of any viable paging
scheme, and doesn't require the inverted scheme that was used.

The inverted table (+ hash table and chains) mostly gets in the way
when the software wants aliasing.  Since the hardware designers knew
that AIX does aliasing, could you explain why they still went with
the inverted approach?  I don't understand the net benefit, or any
benefit.
-- 
Don    D.C.Lindsay    Carnegie Mellon Computer Science
johnl@esegue.segue.boston.ma.us (John R. Levine) (02/23/90)
In article <8115@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>The inverted table (+ hash table and chains) mostly gets in the way
>when the software wants aliasing.  Since the hardware designers knew
>that AIX does aliasing, could you explain why they still went with
>the inverted approach?

The ROMP hardware was designed some years before any AIX work began.
When I first started to talk to IBM about the software work that evolved
into AIX, the ROMP and Rosetta were already being fabricated.  It is my
impression, not backed up by any hard facts, that the original intention
was for the ROMP's operating system to be more like the System/38's with
a single level store, and they went with Unix when it became clear that
the original plan was too hard and had too limited a market.

This is straying from architectural issues.  Followups elsewhere.
-- 
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650
johnl@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl
"Now, we are all jelly doughnuts."
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (02/27/90)
In article <37774@cornell.UUCP> huff@cs.cornell.edu (Richard Huff) writes:
>their problems.  The true issue is supporting Copy-On-Write, which

But suppose that Copy-On-Write *must* be supported.  Maybe IPT still
works well enough.

>The Unix fork() policy of logically copying the stack is brain dead.  If
>we had a general create_a_process() facility that allowed you to specify

This was argued a while back.  It does seem that a lot of cases could be
covered by a general exec function, but many people argued that in
practice every fork() is unhappy in its own way...

>Can anyone name a use for copy-on-write that wasn't designed to fix one
>of the above two problems (or to support the brain dead policies of
>fork(), as mentioned above) ?

I guess we are stuck with supporting an efficient fork() with
Copy-On-Write, whether it is "necessary" or not.  Does that make IPT
impossible?  I wonder if every system designer knows? :-)

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035    Phone:  (415)604-6117
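A sketch of the copy-on-write write-fault path, with hypothetical names
(not any particular kernel's code), showing why COW strains a strict
inverted table: until the copy is made, parent and child both want the same
frame, but each IPT slot can describe only one (asid, vpn).  One dodge is to
keep only one mapping resident and take soft faults for the other.

/* Hypothetical COW write-fault handler.  alloc_frame(), frame_addr(),
 * page_size, and the ipt[] layout are all assumptions for illustration. */
#include <stdint.h>
#include <string.h>

struct ipt_entry { uint32_t asid, vpn; int writable; };

extern struct ipt_entry ipt[];             /* one slot per frame         */
extern uint32_t page_size;
extern uint32_t alloc_frame(void);         /* grab a free physical frame */
extern void    *frame_addr(uint32_t pfn);  /* kernel view of a frame     */

/* Write fault on a COW page: give the faulting process its own copy. */
uint32_t cow_write_fault(uint32_t asid, uint32_t vpn, uint32_t shared_pfn)
{
    uint32_t new_pfn = alloc_frame();
    memcpy(frame_addr(new_pfn), frame_addr(shared_pfn), page_size);

    /* Only now can the inverted table describe this mapping: the new
     * frame's single slot belongs to the faulting (asid, vpn). */
    ipt[new_pfn] = (struct ipt_entry){ .asid = asid, .vpn = vpn,
                                       .writable = 1 };
    return new_pfn;
}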
huff@svax.cs.cornell.edu (Richard Huff) (02/28/90)
Someone suggested that Mach handles large sparse address spaces
efficiently.  But Mach's architecture-independent pmap module simply
implements an inverted page table in software, although their IPTs are
on a per-process basis, rather than using a single IPT for the entire
machine.  So I still see IPTs as the only way to manage huge virtual
address spaces.

Now here's an interesting research problem:  What is the most efficient
way to manage large virtual address spaces on NUMA (non-uniform memory
access) shared memory MIMD machines?  Can we extend the IPT approach in
a clean way?

I'd like to handle a TLB fault to a local physically present page
without referencing non-local memory.  OK, I could do that with a local
IPT approach.  But what if the page is physically present on another
processor's node?  How can I determine which node to talk to, WITHOUT
utilizing a local data structure that grows with the number of nodes in
the system?  Remember, I want a SCALABLE solution.

And we presumably don't want a virtual address to always be associated
with a fixed node, so the processor number can't be encoded in the
address itself.  Besides, I might want to let the OS move pages around
between nodes for maximum locality.  In this case, we might view local
memory as simply a page cache for the larger global address space.

Will TLB faults be rare enough for me to simply maintain a master page
directory for the entire machine, distributed across all of the nodes,
that only gets used when the local IPT comes up empty?  Should we use
multiple distributed IPTs, say, one per "cluster" of nodes?

What is currently being done, or being proposed, for such large NUMA
MIMD machines?  How does the Butterfly II, Ultracomputer, or RP3 do
virtual to physical address translation?  Do they employ a separate
virtual address space per process?  Is it 32 bits, or larger?

Is anyone out there considering building a NUMA MIMD shared memory
machine with a single, machine-wide, 64-bit virtual address space?

Richard Huff
huff@svax.cs.cornell.edu
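One possible shape of the "local IPT plus distributed master directory"
idea Huff floats, sketched with entirely hypothetical names; the only point
is that per-node state stays proportional to local memory, and a local miss
costs one message to a home node chosen by hashing the page number, so
nothing grows with the node count.

/* Hypothetical two-level NUMA translation; not any real machine's code. */
#include <stdint.h>

#define NNODES 64

extern uint32_t this_node;                 /* id of the local node        */
extern int local_ipt_lookup(uint64_t gvpn, uint32_t *pfn_out);
extern int ask_home_node(uint32_t home, uint64_t gvpn,
                         uint32_t *owner_node, uint32_t *pfn_out);

static uint32_t home_node(uint64_t gvpn)   /* O(1); independent of NNODES */
{
    return (uint32_t)((gvpn * 0x9e3779b97f4a7c15ull) >> 32) % NNODES;
}

/* Returns 0 and fills *node/*pfn on success, -1 for a true page fault. */
int numa_translate(uint64_t gvpn, uint32_t *node, uint32_t *pfn)
{
    if (local_ipt_lookup(gvpn, pfn)) {     /* hit: no remote references   */
        *node = this_node;
        return 0;
    }
    /* Local miss: the home node tracks the current owner even if the OS
     * has migrated the page since it was created. */
    return ask_home_node(home_node(gvpn), gvpn, node, pfn) ? 0 : -1;
}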
lkaplan@bbn.com (Larry Kaplan) (03/03/90)
In article <37877@cornell.UUCP> huff@cs.cornell.edu (Richard Huff) writes:
>Someone suggested that Mach handles large sparse address spaces
>efficiently.  But Mach's architecture-independent pmap module simply
>implements an inverted page table in software, although their IPTs are
>on a per-process basis, rather than using a single IPT for the entire
>machine.  So I still see IPTs as the only way to manage huge virtual
>address spaces.

I'm not sure the efficiency refers to the data structures you mention.
The VM system uses a doubly linked list of structures to define regions
of user (and kernel) virtual address space.  These regions need not be
contiguous.  This is probably the efficiency referred to.  BBN has
rewritten the Mach VM system from scratch for the TC2000 (Butterfly II)
and has enhanced this data structure by also storing it as a balanced
binary tree.  This speeds lookups into the virtual memory map.

>Now here's an interesting research problem:  What is the most efficient
>way to manage large virtual address spaces on NUMA (non-uniform memory
>access) shared memory MIMD machines? ...
>
>What is currently being done, or being proposed, for such large NUMA
>MIMD machines?  How does the Butterfly II, Ultracomputer, or RP3 do
>virtual to physical address translation?  Do they employ a separate
>virtual address space per process?  Is it 32 bits, or larger?

Again, for the adding of mappings of virtual addresses to physical
memory, I believe the Mach style is fine.  The problem comes when
removing or reclaiming a physical page for other use.  This is where the
inverted page tables come into play, as you must notify all processes
that map the page that their mapping is no longer valid.

For each physical page in the system, there is a software page
descriptor (residing on the same node as the physical page) that
contains a list of all {process, virtual address} pairs that map the
page.  This list can grow to be fairly long and in that sense is not
efficiently scalable.  Note, however, that the invalidation required is
fairly cheap except for the possible need for a TLB shoot to a remote
node.  BBN's TLB shoot code uses a clock-based algorithm that eliminates
the need for most of these TLB shoots.

The TC2000 does indeed have a separate 32-bit virtual address space per
process as, I believe, do most UNIX VM implementations.

For more information about BBN's new VM and pmap implementations, see
the paper "Highly Parallel Virtual Memory Management in Large-Scale
Shared Memory Multiprocessors" by D. Barach et al.  This paper has been
submitted to ICPP 1990.

>Richard Huff
>huff@svax.cs.cornell.edu

#include <std_disclaimer>
_______________________________________________________________________________
                      ____ \ / ____
Laurence S. Kaplan    | \ 0 / |     BBN Advanced Computers     lkaplan@bbn.com
                      \____|||____/ 10 Fawcett St.
(617) 873-2431        /__/ | \__\   Cambridge, MA 02238
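A sketch of the per-page descriptor and invalidation walk Kaplan describes,
with hypothetical names (not BBN's actual code): each physical frame keeps,
on its own node, a list of every {process, virtual address} that maps it,
and reclaiming the frame means walking that list and shooting down any
remote TLB entries.

/* Hypothetical reclaim path; the list walk is O(number of mappers),
 * which is the scalability concern mentioned above. */
#include <stdint.h>
#include <stdlib.h>

struct mapping {
    uint32_t  pid;                   /* process holding the mapping      */
    uintptr_t va;                    /* virtual address in that process  */
    uint32_t  node;                  /* node whose TLB may cache it      */
    struct mapping *next;
};

struct page_desc {                   /* one per frame, on the frame's node */
    struct mapping *mappers;
};

extern uint32_t this_node;
extern void invalidate_pte(uint32_t pid, uintptr_t va);
extern void remote_tlb_shoot(uint32_t node, uint32_t pid, uintptr_t va);

void reclaim_frame(struct page_desc *pd)
{
    for (struct mapping *m = pd->mappers; m != NULL; ) {
        struct mapping *next = m->next;
        invalidate_pte(m->pid, m->va);
        if (m->node != this_node)
            remote_tlb_shoot(m->node, m->pid, m->va);  /* the costly case */
        free(m);
        m = next;
    }
    pd->mappers = NULL;
}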
edler@jan.ultra.nyu.edu (Jan Edler) (03/03/90)
In article <37877@cornell.UUCP> huff@cs.cornell.edu (Richard Huff) writes:
> What is currently being done, or being proposed, for such large NUMA
> MIMD machines?  How does the Butterfly II, Ultracomputer, or RP3 do
> virtual to physical address translation?  Do they employ a separate
> virtual address space per process?  Is it 32 bits, or larger?
>
> Is anyone out there considering building a NUMA MIMD shared memory
> machine with a single, machine-wide, 64-bit virtual address space?

There are a couple of different issues here.

First is one of terminology and taxonomy:  The acronyms UMA and NUMA are
somewhat slippery, at least for some machines (like the NYU
Ultracomputer).  In the "generic" Ultracomputer design, all memory is
equally far from all processors, thus making it a UMA.  Of course we add
caches (possibly with only software-controlled cacheability to enforce
coherency), making the UMA designation less appropriate.  If we add
local memory, some would call the design NUMA.  Yet we don't consider
such modifications to be significant enough to warrant a
reclassification of the machine, i.e. we still think of the machine in
much the same way, with or without local memory.

Consider the RP3, where each processor enjoys fast access to a
co-located memory module (there is no memory equally far from all
processors):  Is it UMA or NUMA?  Virtual address ranges can be
interleaved across the memory modules or sequential within a memory
module, as controlled by the MMU.  Sequential placement is good for
private (or mostly-private) memory; interleaved is good for shared.
There is a spectrum of possibilities, with two extremes:

 - all "sequential": the machine appears to be NUMA
 - all "interleaved": looks UMA ((n-1)/n of the refs are uniform nonlocal)

It depends on how you plan to use (or think about) the machine.

As for how address translation is performed on the Ultracomputer, there
is really no single good answer.  The "generic" Ultracomputer design
doesn't really address the issue, merely assuming that some sort of
address translation hardware is present to support a general-purpose
operating system.  The issues of word and address size are also not very
relevant to the generic design, except that the word size determines the
largest object that can be atomically accessed with a single load or
store instruction.

When considering an implementation of the Ultracomputer, things become
quite different, and of course it really matters how things are done.
All the specific designs we've considered at NYU have had 32-bit word
sizes and 32-bit addresses.  The operating system design supports a
separate address space for each process (although support for
lightweight processes within a shared address space is under
consideration).  To date we've only considered hardware designs with
fairly conventional MMUs:

 - buddy-system, TLB-only (with mc68451 MMU),
 - multi-level segment/page tables (mc68030),
 - TLB-only fixed-size pages (am29000).

In all cases, we've considered page tables (or their equivalent) to
reside in shared memory (this was also the case with our OS design study
for RP3, which was to support sequential memory as well as interleaved).

Other factors to consider are cache control and TLB coherence.  Up to
now, our designs have relied entirely on software for cache coherency,
and so we assume the MMU can indicate the cacheability of each
reference.
Hardware cache or TLB coherence schemes can impact the MMU in various
ways, some of them favoring a globally shared address space (but not
necessarily with a flat addressing scheme).

Jan Edler
NYU Ultracomputer Project
edler@nyu.edu
(212) 998-3353
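A toy sketch of the interleaved-versus-sequential placement choice Edler
describes above for the RP3, with hypothetical names and sizes (not the
real MMU): an interleaved range scatters consecutive pages round-robin
across the modules, so roughly (n-1)/n of its references are remote for
everyone, while a sequential range keeps its pages on one module, which is
fast for the co-located processor and clearly NUMA for the rest.

/* Which memory module holds virtual page 'vpn' of a range starting at
 * page 'base'?  Sequential ranges live entirely on module 'home'. */
#include <stdint.h>

#define NMODULES 64u

enum placement { INTERLEAVED, SEQUENTIAL };

uint32_t module_of(enum placement p, uint64_t vpn, uint64_t base,
                   uint32_t home)
{
    if (p == INTERLEAVED)
        return (uint32_t)((vpn - base) % NMODULES);  /* round-robin pages */
    return home;                                     /* all on one module */
}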