[comp.arch] Inverted Page Tables

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (02/22/90)

In article <43367@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>1) Several posters have mentioned that there is some unspecified but obvious
>(to them) major problem with using inverted page tables together with
>memory mapped files.  I wonder if someone could enlighten us regarding
>this problem, since apparently it isn't obvious to every system architect :-)

The big problem is with sharing.

The word "inverted" is used, because the table is a map from physical
addresses to virtual addresses. More precisely, you use an array, and
use physical page numbers as subscripts. Each table entry tells you
what virtual page is mapped to there.

SO: only one page can be mapped to there. Unless, of course, the OS
goes through a fair bit of grief to make the machine work right in
spite of its design.



-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science

aglew@oberon.csg.uiuc.edu (Andy Glew) (02/22/90)

>>1) Several posters have mentioned that there is some unspecified but obvious
>>(to them) major problem with using inverted page tables together with
>>memory mapped files.  I wonder if someone could enlighten us regarding
>>this problem, since apparently it isn't obvious to every system architect :-)
>
>The big problem is with sharing.
>
>The word "inverted" is used, because the table is a map from physical
>addresses to virtual addresses. More precisely, you use an array, and
>use physical page numbers as subscripts. Each table entry tells you
>what virtual page is mapped to there.
>
>SO: only one page can be mapped to there. Unless, of course, the OS
>goes through a fair bit of grief to make the machine work right in
>spite of its design.

Now, let's see.  Physical page addresses map to virtual?  Via a hash
function? What ever happened to hash chaining? :-) Especially if the
TLB miss is handled in software (like on MIPS).
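
For what it's worth, a hashed IPT with chaining might look roughly like
this.  It is purely a sketch with made-up names, not any real machine's
layout, with the chain walk standing in for a software TLB-refill
handler:

/* Hashed inverted page table with chaining.  The hash anchor table maps
 * hash(owner, vpn) to the head of a chain of frame entries; collisions
 * are just extra links on the chain. */
#include <stdint.h>

#define NFRAMES   4096
#define HASH_SIZE 8192

struct pframe {
    uint32_t owner, vpn;           /* virtual name currently mapped here */
    int      next;                 /* next frame on this hash chain, or -1 */
};

static struct pframe frame[NFRAMES];
static int hash_anchor[HASH_SIZE]; /* head frame of each bucket, -1 if empty */

static void ipt_init(void)
{
    for (int i = 0; i < HASH_SIZE; i++)
        hash_anchor[i] = -1;
}

static unsigned vhash(uint32_t owner, uint32_t vpn)
{
    return (owner * 2654435761u ^ vpn) % HASH_SIZE;
}

/* Software TLB-miss handler: walk the chain, return the frame or -1. */
static int tlb_refill(uint32_t owner, uint32_t vpn)
{
    for (int pfn = hash_anchor[vhash(owner, vpn)]; pfn != -1;
         pfn = frame[pfn].next)
        if (frame[pfn].owner == owner && frame[pfn].vpn == vpn)
            return pfn;
    return -1;                     /* genuine page fault */
}

Chaining disposes of hash collisions nicely; the catch is that as long
as the chain links are the per-frame entries themselves, each frame
still carries exactly one virtual name, which is the restriction Don is
pointing at for sharing.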

--
Andy Glew, aglew@uiuc.edu

johnl@esegue.segue.boston.ma.us (John R. Levine) (02/22/90)

In article <8106@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>In article <43367@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>>1) Several posters have mentioned that there is some unspecified but obvious
>>(to them) major problem with using inverted page tables together with
>>memory mapped files.  ...
>
>The big problem is with sharing.

The ROMP (actually its MMU chip, which is named Rosetta) handles this pretty
well despite having a reverse-map page table with no provision for aliasing.
The high four bits of a memory address are looked up in a tiny fast memory
to get a segment number.  These segment numbers are global to the entire
system.  The file mapping we put into AIX maps a file to a segment, so if you
map a file into several processes there is no aliasing problem; they use the
same segment.
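
For readers who haven't seen the machine, the segment step described
above looks roughly like the sketch below.  The 4-bit selector and the
global segment numbers come from the description; the 12-bit
segment-number width is my assumption, not something stated here.

/* Effective-to-global translation: the top 4 bits of a 32-bit effective
 * address select one of 16 per-process segment registers; the global
 * segment number found there replaces those bits, and the resulting
 * long virtual address is what the inverted table is searched with. */
#include <stdint.h>

#define NSEGREGS 16

static uint16_t segreg[NSEGREGS];  /* global segment numbers (12 bits assumed) */

static uint64_t to_global_vaddr(uint32_t eff_addr)
{
    unsigned sel    = eff_addr >> 28;         /* high four bits */
    uint32_t offset = eff_addr & 0x0FFFFFFF;  /* 28-bit offset in segment */
    return ((uint64_t)segreg[sel] << 28) | offset;
}

Two processes mapping the same file simply load the same segment number,
so the inverted table only ever sees one virtual name for each of the
file's pages and the no-aliasing restriction doesn't bite.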

If you map a file copy-on-write, the aliasing problem does exist, and they
handled that at the VRM level via soft page faults.  The same problem occurs
after a fork, since the pages of the stack and data segments are physically
but not logically shared.  In that case, they made them copy-on-touch,
evidently because by the time they'd handled the soft page fault, the extra
effort involved in making a copy of the page wasn't all that great.
-- 
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650
johnl@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl
"Now, we are all jelly doughnuts."

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (02/23/90)

In article <1990Feb22.034517.8676@esegue.segue.boston.ma.us> johnl@esegue.segue.boston.ma.us (John R. Levine) writes:

>The ROMP (actually its MMU chip, which is named Rosetta) handles this pretty
>well despite having a reverse-map page table with no provision for aliasing.
>The high four bits of a memory address are looked up in a tiny fast memory
>to get a segment number.  These segment numbers are global to the entire
>system.  The file mapping we put into AIX maps a file to a segment, so if you
>map a file into several processes there is no aliasing problem; they use the
>same segment.

The segment idea could have been built on top of any viable paging
scheme, and doesn't require the inverted scheme that was used.

The inverted table (+ hash table and chains) mostly gets in the way
when the software wants aliasing. Since the hardware designers knew
that AIX does aliasing, could you explain why they still went with
the inverted approach? I don't understand the net benefit, or any
benefit.
-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science

johnl@esegue.segue.boston.ma.us (John R. Levine) (02/23/90)

In article <8115@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>The inverted table (+ hash table and chains) mostly gets in the way
>when the software wants aliasing. Since the hardware designers knew
>that AIX does aliasing, could you explain why they still went with
>the inverted approach? 

The ROMP hardware was designed some years before any AIX work began.  When I
first started to talk to IBM about the software work that evolved into AIX,
the ROMP and Rosetta were already being fabricated.  It is my impression, not
backed up by any hard facts, that the original intention was for the ROMP's
operating system to be more like the System/38's with a single level store,
and they went with Unix when it became clear that the original plan was too
hard and had too limited a market.

This is straying from architectural issues.  Followups elsewhere.
-- 
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650
johnl@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl
"Now, we are all jelly doughnuts."

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (02/27/90)

In article <37774@cornell.UUCP> huff@cs.cornell.edu (Richard Huff) writes:
>their problems.  The true issue is supporting Copy-On-Write, which 

But suppose that Copy-On-Write *must* be supported.  Maybe IPT still works
well enough.  


>The Unix fork() policy of logically copying the stack is brain dead.  If 
>we had a general create_a_process() facility that allowed you to specify 

This was argued a while back.  It does seem that a lot of cases could be
covered by a general exec function, but many people argued that in practice
every fork() is unhappy in its own way...

>Can anyone name a use for copy-on-write that wasn't designed to fix one 
>of the above two problems (or to support the brain dead policies of 
>fork(), as mentioned above) ?

I guess we are stuck with supporting an efficient fork() with Copy-On-Write,
whether it is "necessary" or not.  Does that make IPT impossible?  I wonder
if every system designer knows?   :-)
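
For concreteness, here is one way an OS might cope under the
one-name-per-frame constraint.  This is only a sketch (essentially the
copy-on-touch behavior John Levine described), not a claim about how any
real system does it: the frame sits in the IPT under a single
(owner, vpn), the other sharer's first reference faults, and the handler
makes a copy under the faulter's name.

/* Copy-on-touch fault handling under an inverted page table. */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096

struct ipt_entry { uint32_t owner, vpn; int cow_shared; };

extern struct ipt_entry ipt[];            /* one entry per frame */
extern uint8_t physmem[][PAGE_SIZE];      /* physical frames */
extern int alloc_frame(void);             /* hypothetical frame allocator */

/* `owner` faulted on virtual page `vpn`, which is backed by frame `pfn`
 * but entered in the IPT under the other sharer's name. */
int cow_fault(uint32_t owner, uint32_t vpn, int pfn)
{
    int copy = alloc_frame();
    memcpy(physmem[copy], physmem[pfn], PAGE_SIZE);  /* private copy */

    ipt[copy].owner      = owner;         /* the copy gets the faulter's name */
    ipt[copy].vpn        = vpn;
    ipt[copy].cow_shared = 0;

    ipt[pfn].cow_shared  = 0;             /* original now belongs solely to
                                             the process already named in it */
    return copy;                          /* frame to load into the TLB */
}

So an IPT doesn't seem to make copy-on-write impossible; it just turns
every first touch by the unnamed process into a fault, write or not,
which is the soft-page-fault cost John described.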



  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)604-6117       

huff@svax.cs.cornell.edu (Richard Huff) (02/28/90)

Someone suggested that Mach handles large sparse address spaces
efficiently.  But Mach's architecture-independent pmap module simply
implements an inverted page table in software, although their IPTs are
on a per-process basis rather than using a single IPT for the entire
machine.  So I still see IPTs as the only way to manage huge virtual
address spaces.

Now here's an interesting research problem:  What is the most efficient
way to manage large virtual address spaces on NUMA (non uniform memory
access) shared memory MIMD machines?  Can we extend the IPT approach in
a clean way?  I'd like to handle a TLB fault to a local physically
present page without referencing non local memory.  Ok, I could do that
with a local IPT approach.  But what if the page is physically present
on another processor's node?  How can I determine which node to talk to,
WITHOUT utilizing a local data structure that grows with the number of
nodes in the system?  Remember, I want a SCALABLE solution.  And we
presumably don't want a virtual address to always be associated with a
fixed node; so the processor number can't be encoded in the address
itself.  Besides, I might want to let the OS move pages around between
nodes for maximum locality.  In this case, we might view local memory as
simply a page cache for the larger global address space.

Will TLB faults be rare enough for me to simply maintain a master page
directory for the entire machine, distributed across all of the nodes,
that only gets used when the local IPT comes up empty?  Should we use
multiple distributed IPT's, say, one per "cluster" of nodes?
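
One obvious way to flesh out the distributed-master-directory idea is to
give every virtual page a fixed "directory home" node computed by
hashing the page name, and keep only that node's slice of the directory
locally.  The sketch below uses invented names and a placeholder remote
lookup; it isn't a description of any existing system.

/* Distributed page directory: a local IPT miss asks the page's
 * directory-home node which node (if any) currently holds the frame. */
#include <stdint.h>

#define DIR_SLOTS 4096

struct dir_entry {
    uint32_t owner, vpn;
    int      holder_node;                 /* -1 if not resident anywhere */
    uint32_t holder_pfn;
};

extern int this_node, num_nodes;
extern struct dir_entry local_directory[DIR_SLOTS];   /* this node's slice */
extern struct dir_entry remote_dir_lookup(int node, uint32_t owner,
                                          uint32_t vpn);  /* placeholder RPC */

static int directory_home(uint32_t owner, uint32_t vpn)
{
    return (int)((owner * 2654435761u ^ vpn) % (uint32_t)num_nodes);
}

struct dir_entry find_page(uint32_t owner, uint32_t vpn)
{
    int home = directory_home(owner, vpn);
    if (home != this_node)
        return remote_dir_lookup(home, owner, vpn);

    for (int i = 0; i < DIR_SLOTS; i++)   /* a real slice would hash, not scan */
        if (local_directory[i].holder_node >= 0 &&
            local_directory[i].owner == owner &&
            local_directory[i].vpn   == vpn)
            return local_directory[i];

    struct dir_entry none = { owner, vpn, -1, 0 };
    return none;                          /* not resident: start a page-in */
}

Per-node state stays constant in the number of nodes (only the hash
depends on it), and migrating a page only means updating its fixed home
entry, at the price of one extra hop whenever the local IPT misses.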

What is currently being done, or being proposed, for such large NUMA
MIMD machines?  How does the Butterfly II, Ultracomputer, or RP3 do
virtual to physical address translation?  Do they employ a separate
virtual address space per process?  Is it 32-bits, or larger?

Is anyone out there considering building a NUMA MIMD shared memory
machine with a single, machine wide, 64 bit virtual address space?


Richard Huff
huff@svax.cs.cornell.edu

lkaplan@bbn.com (Larry Kaplan) (03/03/90)

In article <37877@cornell.UUCP> huff@cs.cornell.edu (Richard Huff) writes:
>Someone suggested that Mach handles large sparse address spaces
>efficiently.  But Mach's architecture-independent pmap module simply
>implements an inverted page table in software, although their IPTs are
>on a per-process basis rather than using a single IPT for the entire
>machine.  So I still see IPTs as the only way to manage huge virtual
>address spaces.

I'm not sure the efficiency refers to the data structures you mention.
The VM system uses a doubly linked list of structures to define regions
of user (and kernel) virtual address space.  These regions need not be 
contiguous.  This is probably the efficiency referred to.  BBN has rewritten 
the Mach VM system from scratch for the TC2000 (Butterfly II) and has 
enhanced this data structure by also storing it as a balanced binary tree.  
This speeds lookups into the virtual memory map.
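
The lookup the tree speeds up is presumably something like the sketch
below; the structure and field names are illustrative, not Mach's or
BBN's actual declarations.

/* Virtual-address region map kept as a binary search tree keyed on the
 * region start address (regions don't overlap), so a fault handler can
 * find the covering region in O(log n) rather than walking a list. */
#include <stdint.h>
#include <stddef.h>

struct vm_region {
    uint32_t start, end;               /* covers [start, end) */
    struct vm_region *left, *right;    /* tree links */
    /* ... protection, backing object, inheritance, etc. ... */
};

/* Return the region containing addr, or NULL if the address is unmapped. */
struct vm_region *region_lookup(struct vm_region *root, uint32_t addr)
{
    while (root != NULL) {
        if (addr < root->start)
            root = root->left;
        else if (addr >= root->end)
            root = root->right;
        else
            return root;               /* start <= addr < end */
    }
    return NULL;
}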

>Now here's an interesting research problem:  What is the most efficient
>way to manage large virtual address spaces on NUMA (non uniform memory
>access) shared memory MIMD machines?  ...
>
>What is currently being done, or being proposed, for such large NUMA
>MIMD machines?  How does the Butterfly II, Ultracomputer, or RP3 do
>virtual to physical address translation?  Do they employ a separate
>virtual address space per process?  Is it 32-bits, or larger?

Again, for adding mappings of virtual addresses to physical memory,
I believe the Mach style is fine.  The problem comes when removing or
reclaiming a physical page for other use.  This is where the inverted
page tables come into play, as you must notify all processes that map
the page that their mapping is no longer valid.  For each physical page in
the system, there is a software page descriptor (residing on the same node as
the physical page) that contains a list of all {process, virtual address} pairs
that map the page.  This list can grow to be fairly long and in that sense
is not efficiently scalable.  Note, however, that the invalidation required
is fairly cheap except for the possible need for a TLB shoot to a remote node.
BBN's TLB shoot code uses a clock-based algorithm that eliminates the need
for most of these TLB shoots.
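
As a sketch (field names are mine, and the invalidation and shoot
routines are placeholders), the per-frame descriptor and the
reclaim-time walk described above might look like:

/* Software page descriptor: one per physical frame, kept on the frame's
 * own node, listing every {process, virtual address} that maps it. */
#include <stddef.h>
#include <stdint.h>

struct mapping {
    uint32_t process;
    uint32_t vaddr;
    int      node;                  /* node whose TLB may cache the entry */
    struct mapping *next;
};

struct page_desc {
    struct mapping *mappings;       /* all current mappers of this frame */
};

extern int  this_node;
extern void invalidate_pte(uint32_t process, uint32_t vaddr);  /* placeholder */
extern void tlb_shoot(int node, uint32_t vaddr);               /* placeholder */

/* Reclaiming the frame means invalidating every mapper first.  The list
 * walk is the part that doesn't scale if a page is very widely shared;
 * the remote TLB shoots are the expensive part. */
void reclaim_page(struct page_desc *pd)
{
    for (struct mapping *m = pd->mappings; m != NULL; m = m->next) {
        invalidate_pte(m->process, m->vaddr);
        if (m->node != this_node)
            tlb_shoot(m->node, m->vaddr);
    }
    pd->mappings = NULL;            /* (the mapping records themselves would
                                       be freed or recycled here) */
}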

The TC2000 does indeed have a separate 32-bit virtual address space per
process, as, I believe, do most UNIX VM implementations.

For more information about BBN's new VM and pmap implementations, see the paper
"Highly Parallel Virtual Memory Management in Large-Scale Shared Memory
Multiprocessors" by D. Barach et al.  This paper has been submitted to
ICPP 1990.

>
>Richard Huff
>huff@svax.cs.cornell.edu

#include <std_disclaimer>
_______________________________________________________________________________
				 ____ \ / ____
Laurence S. Kaplan		|    \ 0 /    |		BBN Advanced Computers
lkaplan@bbn.com			 \____|||____/		10 Fawcett St.
(617) 873-2431			  /__/ | \__\		Cambridge, MA  02238

edler@jan.ultra.nyu.edu (Jan Edler) (03/03/90)

In article <37877@cornell.UUCP> huff@cs.cornell.edu (Richard Huff) writes:
> What is currently being done, or being proposed, for such large NUMA
> MIMD machines?  How does the Butterfly II, Ultracomputer, or RP3 do
> virtual to physical address translation?  Do they employ a separate
> virtual address space per process?  Is it 32-bits, or larger?
> 
> Is anyone out there considering building a NUMA MIMD shared memory
> machine with a single, machine wide, 64 bit virtual address space?

There are a couple different issues here.  First is one of terminology
and taxonomy: The acronyms UMA and NUMA are somewhat slippery, at
least for some machines (like the NYU Ultracomputer).  In the
"generic" Ultracomputer design, all memory is equally far from all
processors, thus making it a UMA.  Of course we add caches (possibly
with only software-controlled cacheability to enforce coherency),
making the UMA designation less appropriate.  If we add local memory,
some would call the design NUMA.  Yet we don't consider such
modifications to be significant enough to warrant a reclassification
of the machine, i.e. we still think of the machine in much the same
way, with or without local memory.

Consider the RP3, where each processor enjoys fast access to a
co-located memory module (there is no memory equally far from all
processors): Is it UMA or NUMA?  Virtual address ranges can be
interleaved across the memory modules or sequential within a memory
module, as controlled by the MMU.  Sequential placement is good for
private (or mostly-private) memory; interleaved is good for shared.
There is a spectrum of possibilities, with two extremes:
  - all "sequential": the machine appears to be NUMA
  - all "interleaved": looks UMA ((n-1)/n of the refs are uniformly nonlocal)
It depends on how you plan to use (or think about) the machine.
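
The two extremes can be pictured as two ways of turning an address into
a memory-module number; the constants and names below are illustrative,
not RP3's actual parameters.

/* Sequential vs. interleaved placement across n memory modules. */
#include <stdint.h>

#define NMODULES  64
#define UNIT_BITS 5       /* 32-byte interleave unit (assumed) */

/* Sequential: the module is chosen by the high-order bits, so a whole
 * contiguous range lives on one module (good for private data). */
static int module_sequential(uint32_t addr, uint32_t module_bytes)
{
    return (int)(addr / module_bytes);
}

/* Interleaved: the module is chosen by low-order bits just above the
 * interleave unit, spreading a range evenly over all modules (good for
 * shared data); about (n-1)/n of the references then leave the node. */
static int module_interleaved(uint32_t addr)
{
    return (int)((addr >> UNIT_BITS) % NMODULES);
}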

As for how address translation is performed on the Ultracomputer, there
is really no single good answer.  The "generic" Ultracomputer design
doesn't really address the issue, merely assuming that some sort of
address translation hardware is present to support a general-purpose
operating system.  The issues of word and address size are also not
very relevant to the generic design, except that the word size
determines the largest object that can be atomically accessed with a
single load or store instruction.

When considering an implementation of the Ultracomputer, things become
quite different, and of course it really matters how things are done.
All the specific designs we've considered at NYU have had 32-bit word
sizes and 32-bit addresses.  The operating system design supports a
separate address space for each process (although support for
lightweight processes within a shared address space is under
consideration).  To date we've only considered hardware designs with
fairly conventional MMUs:
	- buddy-system, TLB-only (with mc68451 MMU),
	- multi-level segment/page tables (mc68030),
	- TLB-only fixed-size pages (am29000).
In all cases, we've considered page tables (or their equivalent) to
reside in shared memory (this was also the case with our OS design
study for RP3, which was to support sequential memory as well as
interleaved).

Other factors to consider are cache control and TLB coherence.  Up to
now, our designs have relied entirely on software for cache
coherency, and so we assume the MMU can indicate the cacheability of
each reference.  Hardware cache or TLB coherence schemes can impact
the MMU in various ways, some of them favoring a globally shared
address space (but not necessarily with a flat addressing scheme).

Jan Edler
NYU Ultracomputer Project
edler@nyu.edu
(212) 998-3353