[comp.arch] Paging page tables

march@m.cs.uiuc.edu (06/28/90)

While reading Hennessy and Patterson (p. 437), they mention the fact
that page table entries are often paged themselves (with the operative
word being often).  Now to me, paging your means of address translation
makes no sense.  As they continue to point out, the cost of this is 
appreciable because one has to swap the PTEs back in and then do the 
translation.  Given this, just how "often" is this used?  I suppose
given a translation-lookaside buffer, the huge cost of swapping page
table entries back in could be offset by some translations being very
fast (via the TLB).

Anyone ... ?

-Steve

===============================================================================
Steve March                            (H) (217)328-5176/328-5230  (W) 333-7408
Domain: march@cs.uiuc.edu             Path: {uunet|convex|pur-ee}!uiucdcs!march
"Time and space are modes by which we think and not conditions in which
 we live."  - Albert Einstein

amos@taux01.nsc.com (Amos Shapir) (06/28/90)

In article <3300142@m.cs.uiuc.edu> march@m.cs.uiuc.edu writes:
>
>While reading Hennessy and Patterson (p. 437), they mention the fact
>that page table entries are often paged themselves (with the operative
>word being often).  Now to me, paging your means of address translation
>makes no sense.  As they continue to point out, the cost of this is 
>appreciable because one has to swap the PTEs back in and then do the 
>translation.  Given this, just how "often" is this used?

A simple calculation: 32-bit addresses cover 4GB; when using pages
of 4KB (which are rather large as paging systems go) you have
to keep 1M PTE's, which occupy 4MB.  The need to page these is
of course proportional to the amount of virtual memory used vs. the
amount of physical memory that can be tied down in PTE's.

So "often" means - as often as more than a few MB of virtual space
are used.
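
For concreteness, the same arithmetic in a few lines of C.  The 4-byte PTE
size is an assumption (the post only gives the 4 MB total), and only the
ratios matter:

#include <stdio.h>

/* Back-of-the-envelope page-table sizing for a full 32-bit address
 * space, as in the calculation above. */
int main(void)
{
    unsigned long vspace_kb = 4UL * 1024 * 1024;  /* 4 GB of virtual space, in KB */
    unsigned long page_kb   = 4;                  /* 4 KB pages                   */
    unsigned long pte_bytes = 4;                  /* assumed PTE size             */

    unsigned long nptes     = vspace_kb / page_kb;        /* 1M PTEs            */
    unsigned long table_kb  = nptes * pte_bytes / 1024;   /* 4 MB of page table */

    printf("%lu PTEs, %lu KB of page table\n", nptes, table_kb);
    return 0;
}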

-- 
	Amos Shapir		amos@taux01.nsc.com, amos@nsc.nsc.com
National Semiconductor (Israel) P.O.B. 3007, Herzlia 46104, Israel
Tel. +972 52 522408  TWX: 33691, fax: +972-52-558322 GEO: 34 48 E / 32 10 N

henry@zoo.toronto.edu (Henry Spencer) (06/29/90)

In article <3300142@m.cs.uiuc.edu> march@m.cs.uiuc.edu writes:
>While reading Hennessy and Patterson (p. 437), they mention the fact
>that page table entries are often paged themselves (with the operative
>word being often).  Now to me, paging your means of address translation
>makes no sense...

Unfortunately, it makes a great deal of sense, because the PTEs for a
very large process can chew up a lot of memory all by themselves.  All
it means is that you need a smaller set of PTEs to describe where the
main PTEs are, layering one address-translation mechanism on top of
another.  This is done pretty routinely.
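
A minimal sketch of that layering in C, with illustrative sizes (32-bit
addresses, 4 KB pages, 1024-entry tables) rather than any particular
machine's format; either level can be marked absent and paged out:

#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12
#define LEVEL_BITS 10
#define LEVEL_MASK ((1u << LEVEL_BITS) - 1)

typedef struct {
    uint32_t present;   /* 0 => paged out or never allocated */
    uint32_t pfn;       /* physical frame number             */
} pte_t;

typedef struct {
    pte_t *l2[1u << LEVEL_BITS];   /* top level: NULL means the whole
                                      second-level table is absent */
} pagetable_t;

/* Translate a virtual address; return the physical address, or -1 to
 * signal a fault at either level. */
long long translate(const pagetable_t *pt, uint32_t va)
{
    uint32_t i1 = (va >> (PAGE_SHIFT + LEVEL_BITS)) & LEVEL_MASK;
    uint32_t i2 = (va >> PAGE_SHIFT) & LEVEL_MASK;

    const pte_t *l2 = pt->l2[i1];
    if (l2 == NULL || !l2[i2].present)       /* either level can fault */
        return -1;
    return ((long long)l2[i2].pfn << PAGE_SHIFT) | (va & ((1u << PAGE_SHIFT) - 1));
}
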
-- 
"Either NFS must be scrapped or NFS    | Henry Spencer at U of Toronto Zoology
must be changed."      -John Osterhout |  henry@zoo.toronto.edu   utzoo!henry

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (06/29/90)

In article <3300142@m.cs.uiuc.edu> march@m.cs.uiuc.edu writes:
>While reading Hennessy and Patterson (p. 437), they mention the fact
>that page table entries are often paged themselves (with the operative
>word being often).  Now to me, paging your means of address translation
>makes no sense.  As they continue to point out, the cost of this is 
>appreciable because one has to swap the PTEs back in and then do the 
>translation.  Given this, just how "often" is this used?  I suppose
>given a translation-lookaside buffer, the huge cost of swapping page
>table entries back in could be offset by some translations being very
>fast (via the TLB).

This cost is incurred only if the OS software decides to make the
page tables swappable.  The cost of a swap may not be known at the
chip-designer level, because a general purpose design could be
suitable for quite varying amounts of memory, and varying speeds of
swap devices, and of course varying demands upon that memory. 

The OS designer balances memory demands and  swap speed against swap
frequency.  Note that a 16 byte table entry per 512 byte page is only
a 3% overhead. Most designs run leaner.

The chip designer does control the cost of a simple TLB refill.  For
example, the 88000 and 68020 (well, 68851) have table-walk hardware,
whereas the R3000 designers punted TLB refill to an interrupt handler
and some special instructions.

The chip designer also controls the replacement algorithm of the TLB.
(That is, when an item is loaded into a TLB, some slot has to be
selected and overwritten.)  The 68020 uses a pseudo-LRU algorithm for
this.  The 88000 uses FIFO - entries, valid or not, are just shifted
off the end into the bit bucket. The R3000 hardware selects victims
with a pseudo-random algorithm.  (Very pseudo.)  Are these equivalent?  Not
hardly!  It's always been assumed that it just doesn't matter.

Discussions on this topic usually move on to the issue of Process ID
tags, which are variously a few bits (MIPS) or one bit (88000).
(Does the i860 have PIDs?)  From there one gets into task switch
issues, which is to say, flushing.
-- 
Don		D.C.Lindsay

JONESD@kcgl1.eng.ohio-state.edu (David Jones) (06/29/90)

In article <3300142@m.cs.uiuc.edu>, march@m.cs.uiuc.edu (Steve March) writes:

> While reading Hennessy and Patterson (p. 437), they mention the fact
> that page table entries are often paged themselves (with the operative
> word being often).  Now to me, paging your means of address translation
> makes no sense.  As they continue to point out, the cost of this is
> appreciable because one has to swap the PTEs back in and then do the
> translation.  Given this, just how "often" is this used?  I suppose
> given a translation-lookaside buffer, the huge cost of swapping page
> table entries back in could be offset by some translations being very
> fast (via the TLB).

The VAX architecture uses pageable page tables for user code.  A TLB, even
a small one, does a very effective job of reducing the overhead of paged
page tables.  I once intentionally wrote a pathological program that thrashed
the TLB on a MicroVax and it ran 3 times slower than a non-thrashing version.

Note that to thrash the TLB entries for the page table pages, you have to
separate your memory references by enough to go to another page in the
table (64KB in the case of the VAX).  Most normal programs would thrash the
program page TLB entries more than an order of magnitude more than the page
table page TLB entries, so the cost over that of a non-paged page table
is small.
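
A sketch of the kind of pathological access pattern being described; the
stride and buffer size are illustrative, not the original test program:

#include <stdlib.h>

/* Touching one byte every 64 KB (the span mapped by one VAX page-table
 * page) forces a miss on the page-table-page TLB entry at almost every
 * reference, while an ordinary sequential scan would not. */

#define STRIDE (64L * 1024)
#define NBYTES (16L * 1024 * 1024)

static long touch_strided(volatile char *buf)
{
    long sum = 0;
    long i;
    for (i = 0; i < NBYTES; i += STRIDE)   /* one reference per 64 KB region */
        sum += buf[i];
    return sum;
}

int main(void)
{
    char *buf = calloc(NBYTES, 1);
    if (buf == NULL)
        return 1;
    return (int)(touch_strided(buf) & 1);
}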

David L. Jones               |      Phone:    (614) 292-6929
Ohio State University        |      Internet:
1971 Neil Ave. Rm. 406       |               jonesd@kcgl1.eng.ohio-state.edu
Columbus, OH 43210           |               jones-d@eng.ohio-state.edu

Disclaimer: A repudiation of a claim.

littauer@uts.amdahl.com (Tom Littauer) (06/29/90)

In article <1990Jun28.182303.8352@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <3300142@m.cs.uiuc.edu> march@m.cs.uiuc.edu writes:
>>While reading Hennessy and Patterson (p. 437), they mention the fact
>>that page table entries are often paged themselves (with the operative
>>word being often).  Now to me, paging your means of address translation
>>makes no sense...
>
>Unfortunately, it makes a great deal of sense, because the PTEs for a
>very large process can chew up a lot of memory all by themselves.  All
>it means is that you need a smaller set of PTEs to describe where the
>main PTEs are, layering one address-translation mechanism on top of
>another.  This is done pretty routinely.

Correct. A couple of examples where this is desirable and efficient:

Large but sparse arrays. The programmer needn't worry about special hacks,
but can get on with solving the original problem.

Extremely large arrays combined with programs with good locality of reference.

-- 
UUCP:  littauer@amdahl.amdahl.com
  or:  {sun,decwrl,hplabs,pyramid,ames,uunet}!amdahl!littauer
DDD:   (408) 737-5056
USPS:  Amdahl Corp.  M/S 278,  1250 E. Arques Av,  Sunnyvale, CA 94086

I'll tell you when I'm giving you the party line. The rest of the time
it's my very own ravings (accept no substitutes).

bob@tera.com (Bob Alverson) (06/29/90)

In article <3300142@m.cs.uiuc.edu> march@m.cs.uiuc.edu writes:
>
>While reading Hennessy and Patterson (p. 437), they mention the fact
>that page table entries are often paged themselves (with the operative
>word being often).  Now to me, paging your means of address translation
>makes no sense.  As they continue to point out, the cost of this is 
>appreciable because one has to swap the PTEs back in and then do the 
>translation.  Given this, just how "often" is this used?  I suppose
>given a translation-lookaside buffer, the huge cost of swapping page
>table entries back in could be offset by some translations being very
>fast (via the TLB).
>
The main justification I see for paging page tables is to keep their
overhead a function of physical memory size, not virtual memory size.
Consider a UNIX process with heap at low addresses going up and stack
at high addresses going down.  Without a paged page table, all the
space in between must have page table entries sitting in physical memory.
That doesn't seem like a smart way to use memory.  To keep things sane,
you probably want to lock in the page containing the page table entries
for the virtual address space containing the page table.  With a 4 MB address
space and 512-byte pages, you need 8K entries taking (say) 32 KB.  You need 64 page
table entries to cover the page table (sounds recursive), which fits
in one page.
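
The same sizing worked out explicitly in C; the 4-byte PTE size is an
assumption, chosen because it makes the 8K entries come to the 32 KB quoted
above:

#include <stdio.h>

int main(void)
{
    unsigned long vspace  = 4UL * 1024 * 1024;  /* 4 MB address space        */
    unsigned long pagesz  = 512;                /* 512-byte pages            */
    unsigned long ptesz   = 4;                  /* assumed PTE size          */

    unsigned long nptes   = vspace / pagesz;    /* 8K entries                */
    unsigned long ptbytes = nptes * ptesz;      /* 32 KB of page table       */
    unsigned long ptpages = ptbytes / pagesz;   /* 64 pages of page table    */

    /* 64 PTEs are needed to map the page table itself; at 4 bytes each
     * that is 256 bytes, which indeed fits in a single 512-byte page. */
    printf("%lu PTEs (%lu KB), mapped by %lu PTEs (%lu bytes)\n",
           nptes, ptbytes / 1024, ptpages, ptpages * ptesz);
    return 0;
}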

Bob

martins@ivory.SanDiego.NCR.COM (Martin Sandman) (06/29/90)

>that page table entries are often paged themselves (with the operative
>word being often).  Now to me, paging your means of address translation
>makes no sense. 

As described in H&P I don't understand how this could work either.
However, I am familiar with a combination segment/page scheme where it works
very well.  The eight high order bits of the address select one of 256
page tables.  Each page table can have up to 128 entries.  The segment table
entry specifies the number of valid entries in each page table and whether
the page table is in memory or not.  If the page table is not in real memory,
a special fault is taken when the segment table entry is referenced.  (A
segment table _is_ nailed to real memory.)  When this special segment table
fault occurs the paging supervisor builds a page table "on the fly" in real
memory and restarts the instruction.  This time it results in an "ordinary"
page fault because the page table is in memory but the page itself is not.
This technique clearly saves lots of real memory that could be potentially
lost to seldom used page tables.  On the other hand, it _seems_ to require
an additional memory access to resolve a virtual to real translation.  It does,
but not very often: 99+% of all v-to-r translations happen in the translation
lookaside buffer, so the two-stage memory access does not really happen often.

Since the page tables are built on the fly, really just allocated, they don't
have to be swapped in.
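
A sketch of that fault sequence with hypothetical structures (the names and
sizes are illustrative, not any particular machine's): the segment table
stays wired, and a page table is only allocated when its segment entry
first faults:

#include <stdlib.h>

#define NSEGS   256     /* selected by the eight high-order address bits */
#define NPAGES  128     /* maximum entries per page table                */

typedef struct { unsigned frame; unsigned valid; } pte_t;

typedef struct {
    pte_t   *ptable;    /* NULL => page table not in real memory  */
    unsigned nvalid;    /* number of valid entries in this table  */
} seg_entry_t;

static seg_entry_t segtab[NSEGS];   /* the segment table itself is wired */

/* Handler for the "special fault" taken when a segment's page table is
 * absent: allocate an empty table and restart the instruction; the retry
 * then takes an ordinary page fault on the missing page itself. */
void segment_fault(unsigned seg)
{
    segtab[seg].ptable = calloc(NPAGES, sizeof(pte_t));
    segtab[seg].nvalid = NPAGES;
}

int main(void)
{
    segment_fault(42);                  /* first touch of segment 42 */
    return segtab[42].ptable == NULL;
}
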
     ______________________________________________________________________
     = UUCP:   sdcsvax!ncr-sd!fiddler!martins      Martin D. Sandman      =
     = NCR:    Martin.Sandman@SanDiego.NCR.COM     NCR Corporation        =
     = Phone:  (619) 485-3554                      16550 West Bernardo Dr.=

dwc@cbnewsh.att.com (Mordecai the Fowl) (06/30/90)

In article <4137@taux01.nsc.com>, amos@taux01.nsc.com (Amos Shapir) writes:
> In article <3300142@m.cs.uiuc.edu> march@m.cs.uiuc.edu writes:
> 
> A simple calculation: 32-bit addresses cover 4GB; when using pages
> of 4KB (which are rather large as paging systems go) you have
> to keep 1M PTE's, which occupy 4MB.  The need to page these is
> of course proportional to the amount of virtual memory used vs. the
> amount of physical memory that can be tied down in PTE's.
> 
note that with a combination of segmentation and paging, you don't
need a fully allocated set of page tables for your entire address
space.  the page tables themselves can be sparsely allocated.

it also seems that the only information you need in the page table
upon process swapout is whether a particular page was part of the
swapout since the page tables can be rebuilt on swapin.  you probably
want to keep other things such as modified bits but don't need to
keep the reference bit unless you are a perfectionist.  what this means
is that since you don't need to keep the mapped frame number (which is
no longer a valid page frame number anyway), the page table representation
can be much more compact when a process is swapped.  whether you still want
to swap this compacted representation of the page table (or whether you
actually want to compact the page table instead of just swapping it)
is another design issue.

also, i believe that as long as you keep a record of the size of
the swapped image, and in the swapped image you keep information
to distinguish the page tables from the actual pages, you can still
rebuild all the information in a single (logical) I/O.  so swapping
the page tables out doesn't cost you much incrementally.

all this assumes we are dealing with private pages.  don't ask me
how to deal with shared pages in a general way.

danny chen
att!hocus!dwc

aglew@bach.crhc.uiuc.edu (Andy Glew) (06/30/90)

>The chip designer does control the cost of a simple TLB refill.  For
>example, the 88000 and 68020 (well, 68851) have table-walk hardware,
>whereas the R3000 designers punted TLB refill to an interrupt handler
>and some special instructions.

I hope that "punted" isn't derogatory.

First off, the hardware complexity/performance tradeoffs involved in
the MIPS software TLB miss handling decision were described in a paper
somewhere, and seemed reasonable. It could go either way.

Second, though, from the point of view of someone who aspires to
architect HW/OS/SW systems (note that OS is somewhere in the middle
:-) ), SW TLB handling is wonderful for experimentation at least.
    For example, you can do a lot of MACH style remapping to avoid
copying, by letting the user modify some parts of the translation
tables (which therefore get spread over several pages) and making the
TLB miss handler enforce security.  Eg. let the user specify that a PTE
points to another virtual address (not physical - that leads to
security problems). And so on.
    Of course, you can always do this via system calls, but by careful
design you can reduce the number of TLB flushes required (which might
be the only thing you want a syscall for) by 1/2.  More, if you are
going to use this for things like LISP style garbage collection.
    Similarly, many applications (from security minded Europeans, in
my experience) wish to restrict the ability to access certain areas of
memory.  Eg. you have a database that you want to be readable/writable
only from the database code.  You can always use a system call to
change the perms, but, as several customers I was called in to help found
out, that's slow.  By putting some of the perms in a user-modifiable
part of memory, and changing the fault handler, you can always *relax*
the perms (go from non-accessible to accessible) without a system
call.  Going back requires a TLB flush, which you might have to use a
syscall for - but still, 1 syscall as opposed to 2 is pretty good.
    Of course if the TLB miss handling is comparable to a syscall in
slowness you haven't gained much - but isn't MIPS' point that the SW
TLB miss handler is *not* as slow as a syscall?
    NB. I do not propose that the TLB miss handler be purely user
code.  Just that it interact with some (untrusted) user accessible
data structures.

In addition to user interaction with the virtual address mapping, SW
TLB handling also gives you the flexibility to use new data structures
for the virtual-to-physical mapping.  For example, with potentially
*very* large address spaces impending, extent based mapping structures
may be necessary to save space.


--
Andy Glew, aglew@uiuc.edu

boebert@sctc.com (Earl Boebert) (06/30/90)

Check Organick's book on Multics.

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (06/30/90)

There's been a lot of talk about paging page tables.
The basic problem is that unless you do something to prevent it,
the amount of *physical* memory taken up by page tables
is proportional to the amount of *virtual* memory allocated.

One really beautiful idea is to stand the whole thing on its
head and have your page tables map from *physical* pages to
*virtual*, so that the space taken up by page tables is a
fixed fraction of the size of *physical* memory.  This is the
technique used on the IBM RT PC.

(1) Would someone who really understands the technique care to post
    a short description?
(2) When I read the RT PC bumf, I was _sure_ that I had seen the
    technique described before, but I couldn't and still can't
    remember where.  Does anyone know where the "inverted page tables"
    idea was first published/used?
(3) IBM hold the patent.  Is any other manufacturer known to have
    been granted licence to use the technique?
(4) Does the RS/6000 use inverted page tables or paged page tables?

-- 
"private morality" is an oxymoron, like "peaceful war".

jfc@athena.mit.edu (John F Carr) (06/30/90)

In article <3341@goanna.cs.rmit.oz.au>
	ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:

>One really beautiful idea is to stand the whole thing on its
>head and have your page tables map from *physical* pages to
>*virtual*, so that the space taken up by page tables is a
>fixed fraction of the size of *physical* memory.  This is the
>technique used on the IBM RT PC.

>(1) Would someone who really understands the technique care to post
>    a short description?

A 32 bit address on the IBM RT is treated by hardware as a 4 bit segment
number/17 bit virtual page number/11 bit byte address.  When an address is
presented to the memory management unit, the segment number is extracted.
Unless this segment is marked "not present" (for example, segment 15 on the
RT is dedicated to I/O devices; there is a bit to tell the MMU to let another
system handle references to this segment), it is replaced in the virtual
address with a 12 bit segment ID to create a 40 bit virtual address.  The MMU
has a 32 entry TLB; when a virtual address is not found in the TLB, the
hardware needs to consult the page tables.

The page table is called the "HAT/IPT" -- "Hash Anchor Table/Inverted Page
Table".  This refers to the two ways of indexing it to perform virtual to
physical and physical to virtual address translation.  The MMU servicing a
TLB miss starts with a HAT lookup:

	1.  compute an index into the HAT.  This is the exclusive
	    OR of the low bits of the virtual page number and the
	    segment number (number of bits depends on memory size).

	2.  check the "empty" bit; if it is set then page fault

	3.  extract from the table entry a 13 bit index

Then it continues with a search of a chain of IPT entries:

	4.  read the table entry corresponding to the index

		a. if the virtual address matches, load this entry
		   into the TLB; end of processing

		b. if the "last" bit is not set, extract a 13 bit
		   index to the next table entry in the chain; repeat 4

		c. if the "last" bit is set then this is the end of
		   the chain and no match has been found, so page fault

This takes 5 cycles + 3 cycles per IPT entry searched (cycle time is 80ns to
170 ns depending on CPU model).  Ideally, search length should be very close
to 1 (this is a hash table with as many buckets as objects).  I ran some
tests a few days ago and found that in reality the average is between 2 and
3; due to the software implementation most segment numbers and virtual page
numbers are small so the high bits of the hash index are predictable.

Searching for the virtual page mapped to a physical page is constant time --
just index into the table by the physical page number in question.

Each table entry is 16 bytes.  When using 2K pages, the overhead is 16/2048 =
1/128 memory dedicated to page tables.
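
The HAT/IPT search above, rendered as a C sketch; the hash and field widths
are simplified, so treat it as an outline of the algorithm rather than the
actual RT hardware:

#include <stdint.h>

#define IPT_SIZE (1u << 13)          /* 13-bit indices, as in the description */

typedef struct {
    uint64_t vpage;                  /* segment ID + virtual page number  */
    unsigned next;                   /* index of the next entry in chain  */
    unsigned last;                   /* 1 => end of the chain             */
} ipt_entry_t;

typedef struct {
    unsigned empty;                  /* 1 => nothing hashes here          */
    unsigned first;                  /* index of first IPT entry in chain */
} hat_entry_t;

static hat_entry_t hat[IPT_SIZE];
static ipt_entry_t ipt[IPT_SIZE];

/* Look up a virtual page (segment ID + VPN).  The IPT index found is
 * itself the physical page number; -1 means page fault. */
long ipt_lookup(uint64_t vpage)
{
    /* step 1: hash -- XOR of the VPN's low bits with the segment bits
     * (the real index width depends on memory size; simplified here)  */
    unsigned h = (unsigned)((vpage ^ (vpage >> 17)) & (IPT_SIZE - 1));
    unsigned i;

    if (hat[h].empty)                /* step 2: empty bucket => fault    */
        return -1;
    i = hat[h].first;                /* step 3: first entry in the chain */
    for (;;) {                       /* step 4: walk the IPT chain       */
        if (ipt[i].vpage == vpage)
            return (long)i;          /* 4a: match -- would load the TLB  */
        if (ipt[i].last)
            return -1;               /* 4c: end of chain => page fault   */
        i = ipt[i].next;             /* 4b: follow the chain             */
    }
}

int main(void)
{
    uint64_t vpage = 0x1234;
    unsigned h = (unsigned)((vpage ^ (vpage >> 17)) & (IPT_SIZE - 1));
    ipt[7].vpage = vpage;            /* pretend physical page 7 holds it */
    ipt[7].last  = 1;
    hat[h].first = 7;
    return ipt_lookup(vpage) == 7 ? 0 : 1;
}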

>(4) Does the RS/6000 use inverted page tables or paged page tables?

The RS/6000 memory management unit is a slightly newer version of the RT MMU.
It has longer segment IDs to give more virtual address bits (56, I think).

--
    --John Carr (jfc@athena.mit.edu)

mash@mips.COM (John Mashey) (07/01/90)

In article <AGLEW.90Jun29204550@bach.crhc.uiuc.edu> aglew@bach.crhc.uiuc.edu (Andy Glew) writes:
>>The chip designer does control the cost of a simple TLB refill.  For
>>example, the 88000 and 68020 (well, 68851) have table-walk hardware,
>>whereas the R3000 designers punted TLB refill to an interrupt handler
>>and some special instructions.
0) There is a separate exception vector for TLB miss.
1) TLB refill routines are typically 10-15 instructions, depending on
which TLB refill routine you use (they're sometimes different for different
OSs to match different ideas of PTE arrangements).
2) The usual refill uses the "TLB write random" instruction, which writes
a TLB entry into a pseudo-random location in the top 56 entries of the
64-entry TLB.  This instruction does get used elsewhere; the rest of the
instructions in the sequences are nothing special, just a sequence of
user-level instructions plus moves to/from coprocessor 0 (the system
coprocessor).
3) Both HP PA and AMD 29K use software-refilled TLBs, although the
details are rather different.
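
The flavour of such a refill path, as hedged C pseudocode rather than real
R3000 assembly; the PTE layout, page size, and "random" slot selection are
simplified assumptions:

#include <stdint.h>

#define TLB_ENTRIES 64
#define TLB_WIRED    8          /* "random" writes land in the top 56 slots */
#define PAGE_SHIFT  12

typedef struct { uint32_t hi, lo; } tlb_entry_t;   /* VPN+ASID, PFN+flags */

static tlb_entry_t tlb[TLB_ENTRIES];
static uint32_t    page_table[1 << 20];   /* one word per virtual page */

/* Roughly what the dozen-instruction handler does: index the current page
 * table by the faulting virtual page number, write the PTE into a
 * pseudo-random unwired TLB slot, and return from the exception. */
void tlb_refill(uint32_t bad_vaddr, uint32_t asid, unsigned random)
{
    uint32_t vpn  = bad_vaddr >> PAGE_SHIFT;
    unsigned slot = TLB_WIRED + random % (TLB_ENTRIES - TLB_WIRED);

    tlb[slot].hi = (vpn << PAGE_SHIFT) | asid;   /* tag: page number + ASID   */
    tlb[slot].lo = page_table[vpn];              /* PFN plus valid/dirty bits */
    /* ...then rfe/eret back to the faulting instruction */
}

int main(void)
{
    tlb_refill(0x00403000, 5, 17);   /* fake miss on VPN 0x403, ASID 5 */
    return 0;
}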

>First off, the hardware complexity/performance tradeoffs involved in
>the MIPS software TLB miss handling decision were described in a paper
>somewhere, and seemed reasonable. It could go either way.
Proc. 1986 IEEE CompCon, San Francisco, March 1986.
Paper by DeMoney, Moore, and Mashey.
>    Of course if the TLB miss handling is comparable to a syscall in
>slowness you haven't gained much - but isn't MIPS' point that the SW
>TLB miss handler is *not* as slow as a syscall?
Yes, not as slow, which is why it has its own exception vector.
>    NB. I do not propose that the TLB miss handler be purely user
>code.  Just that it interact with some (untrusted) user accessible
>data structures.
This is quite feasible, although I don't know anyone who's done it.

As usual, whether you do this or not depends on:
	a) The problem domain you wish to cover with the chip.
	b) The cost/complexity of TLBmiss handling as it integrates
	with the pipeline.  This is especially "interesting" as pipelines
	get more complex.
	c) The cost of the silicon space for the MMU and its control.
	d) The cycle count cost of doing software refill versus hardware.
	e) How much cycle count degradation actually results from TLB
	processing, versus cycle-time degradation (if any) from various
	approaches.  Anyone who does serious analysis of all this finds
	that TLB processing is on the order of 1%, except for big
	array processing, or certain other cases that have awful locality,
	in which case TLBs (either hardware or software) suffer.

Note that it is very easy to get surprised in all this.  For example, some
chips cannot do PTE table-walking inside their caches, especially in
multi-pocessing environments, and it is quite possible that regardless
of hardware of software refill, the time may well be dominated by
the number of uncached accesses to main memory.

In the usual marketing wars, sometimes people claim "hardware TLB
refill is faster than software refill."  This may be true, or it may
not be, in SPECIFIC comparisons.
However, anyone making the claim IN GENERAL is almost certainly
a) A marketeer extolling the virtues of a product that has hardware-refill,
OR
b) Someone unlikely to be able to answer the following questions:
	"It's faster?"  Compared to which chips?  How much faster is it?
	What percentage of total CPU time is that?
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

pcg@cs.aber.ac.uk (Piercarlo Grandi) (07/01/90)

In article <9758@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU
(Donald Lindsay) writes:

	[ ... on paging page tables ... ]

   The OS designer balances memory demands and  swap speed against swap
   frequency.  Note that a 16 byte table entry per 512 byte page is
   only a 3% overhead. Most designs run leaner.

Actually this 3% overhead can be mostly obviated, as can the need to page
page tables.  The clear win is to have reverse page map tables for the
paging itself, whose size is only proportional to the size of real
memory, and segment map tables for the description of valid stretches
of address space. In this way you have the best of all worlds, because
you can have a very large address space with only locally dense data
space mappings, and a very low overhead.

I am quite sure that Don Lindsay could expound on the advantages of
this approach from direct observation, as the Mach portable virtual
memory code uses exactly this approach (well, the MACH people actually
bitch about the reverse map MMU in the IBM RISCs, because they abuse
copy-on-write, and reverse map makes it difficult, but this is just
because they use the wrong virtual memory model, i.e. copy-on-write).

   The chip designer does control the cost of a simple TLB refill.  For
   example, the 88000 and 68020 (well, 68851) have table-walk hardware,
   whereas the R3000 designers punted TLB refill to an interrupt
   handler and some special instructions.

Refilling the TLB in software is arguably a win -- most obviously
because it allows collection of a lot of interesting statistics, and
this can help the page replacement policy. It also gets the hardware
designer out of some very sticky points. I think that the very best
references on the subject are Ibbett and Morris, "The MU5 Computer
System", Macmillan, and some MU6 papers and reports by Knowles, Woods,
Edwards, etc... from Manchester University. They are *the* virtual
memory people, incidentally.

   Discussions on this topic usually move on to the issue of Process
   ID tags, which are variously a few bits (MIPS) or one bit (88000).
   (Does the i860 have PIDs?)  From there one gets into task switch
   issues, which is to say, flushing.

:-). IMNHO TLBs without address space (*not* process, please) id tags
are ridiculous. Some IBM guy measured the hit from cold restarting
translation of address space maps because of TLB flush, and it is, ehm,
arresting.

I find it incredible that system architects in this time and age still
design MMUs without address space tags in the TLBs. But then somebody
*did* design the SUN-3 MMU, and the address space tag issue looks
minuscule compared to what has been perpetrated there. :-> :-/.

Hey, after all you win the benchmark wars on MIPS (the four letter
word) ratings and other CPU-oriented or, at best, single-user-oriented
ratings.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

pcg@cs.aber.ac.uk (Piercarlo Grandi) (07/03/90)

In article <3341@goanna.cs.rmit.oz.au> ok@goanna.cs.rmit.oz.au (Richard
A. O'Keefe) writes:

   One really beautiful idea is to stand the whole thing on its
   head and have your page tables map from *physical* pages to
   *virtual*, so that the space taken up by page tables is a
   fixed fraction of the size of *physical* memory.  This is the
   technique used on the IBM RT PC.

And on virtually all the Manchester University machines of the last few
decades (the major exception being the MU5).

   (1) Would someone who really understands the technique care to post
       a short description?

The table entry for each physical page contains the segment and page number
of the virtual page mapped to it.  When a translation is needed, all physical
pages are looked up to see which one contains the given virtual page.  The
lookup can be sped up in various ways: firstly by caching a few translations,
secondly by using a hash index or parallel search hardware.

It is fairly obvious from this description that shared memory is not
easy with a reverse map MMU, but it can be done.

   (2) When I read the RT PC bumf, I was _sure_ that I had seen the
       technique described before, but I couldn't and still can't
       remember where.  Does anyone know where the "inverted page tables"
       idea was first published/used?

The very first virtual memory machine, the Manchester Atlas (MU4), used
inverted page tables. A very good description of their latest scheme,
the MU6 MMU, is contained in a thesis available on request from them.

   (3) IBM hold the patent.  Is any other manufacturer known to have
       been granted licence to use the technique?

IBM has invented virtual memory, IBM has patented reverse map MMUs, what
is good for IBM is good for IBM :-(. This subject has already been
discussed; I have had in a drawer a paper on a very simple reverse map
MMU design (two ROMs and one RAM) for many years now, and when I get the
nerve to publish it... Well, after all IBM never sued Manchester
University for having dared copy their inventions and patents a few
years early :-).

In another thread the abuses of the patent system are discussed. Well,
every time I think of IBM and virtual memory...
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

lewine@dg.dg.com (Don Lewine) (07/03/90)

In article <3300142@m.cs.uiuc.edu> march@m.cs.uiuc.edu writes:
>
>While reading Hennessy and Patterson (p. 437), they mention the fact
>that page table entries are often paged themselves (with the operative
>word being often).  Now to me, paging your means of address translation
>makes no sense.  As they continue to point out, the cost of this is 
>appreciable because one has to swap the PTEs back in and then do the 
>translation.  Given this, just how "often" is this used?  

First, read ahead through page 445 to understand VAX memory management.  Pay
close attention to Figure 8.27.  The key concept is that user (P0 & P1)
page tables are kept in system space.  System space is paged.

When I was working on the MicroVAX architecture in 1982-83, it seemed that
we could save some gates by making all page tables physical.  This was done
on other computer architectures.  We did some experiments to determine how
often page tables were paged.  I have forgotten the exact numbers, but the
answer was "a great deal."  In fact, the amount of page table paging was high
enough to convince us that we needed to implement full-VAX memory management.

We assumed that VMS could be made smarter and do less paging of page tables;
however, even at that, the cost of keeping all page tables in memory was way
too high.  That is, even if VMS only had to read the page tables for the
current process, there would be a great deal of wasted memory.

firth@sei.cmu.edu (Robert Firth) (07/04/90)

In article <3341@goanna.cs.rmit.oz.au> ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:

>One really beautiful idea is to stand the whole thing on its
>head and have your page tables map from *physical* pages to
>*virtual*, so that the space taken up by page tables is a
>fixed fraction of the size of *physical* memory.

Am I being an idiot here, or doesn't this idea have a killer
flaw: it precludes having two different virtual addresses
mapped onto the same physical address.  I can see a lot of
problems with such a restriction.

richard@aiai.ed.ac.uk (Richard Tobin) (07/06/90)

In article <1990Jun29.154940.22762@tera.com> bob@colossus.tera.com (Bob Alverson) writes:
>Consider a UNIX process with heap at low addresses going up and stack
>at high addresses going down.  Without a paged page table, all the
>space in between must have page table entries sitting in physical memory.

Or you must have two page tables.

Do operating systems that allow sparse address spaces (eg SunOS 4) use
paged page tables to achieve this?  Is the fact that BSD doesn't use
the Vax's paged page tables the reason why it doesn't support sparse
address spaces?

-- Richard
-- 
Richard Tobin,                       JANET: R.Tobin@uk.ac.ed             
AI Applications Institute,           ARPA:  R.Tobin%uk.ac.ed@nsfnet-relay.ac.uk
Edinburgh University.                UUCP:  ...!ukc!ed.ac.uk!R.Tobin

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (07/06/90)

In article <2936@skye.ed.ac.uk> richard@aiai.UUCP (Richard Tobin) writes:
>Do operating systems that allow sparse address spaces (eg SunOS 4) use
>paged page tables to achieve this?  Is the fact that BSD doesn't use
>the Vax's paged page tables the reason why it doesn't support sparse
>address spaces?

The answer is no. The VAX hardware supports multilevel page tables,
so the "usual" sparse address spaces can be described by a fairly
small page table.

For example: the 88000 uses a two level table.  The first level is a
single 4 KB page, containing 1024 one-word entries.  The second level
is N pages, where N is less-than-or-equal-to 1024. Each second level
page can access 1024 data pages. Conveniently, 1024 * 1024 * 4 KB is
4 GB, which is what you need when virtual addresses are 32 bits.

In the example you gave, there was a code region, and a stack.  If
each of these is smaller than 4 MB, then the 88000's page table for
that fits into three pages. In the first level table, there are 1022
null entries, and two valid entries.

If (say) a particular compile had a 10 MB heap, then the 88000's page
table would occupy 3 pages, +1 for the stack, +1 for the first level.
That's 5 pages (20 KB), or an overhead of one fifth of one percent.
For smaller programs, the overhead is relatively higher, but that's
not a major problem.
-- 
Don		D.C.Lindsay

lkaplan@bbn.com (Larry Kaplan) (07/06/90)

In article <PCG.90Jun30230310@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>In article <9758@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU
>>(Donald Lindsay) writes:
>>	[ ... on paging page tables ... ]
>>   The OS designer balances memory demands and  swap speed against swap
>>   frequency.  Note that a 16 byte table entry per 512 byte page is
>>   only a 3% overhead. Most designs run leaner.

I believe the 88k is about 4 bytes per 4k page.

>Actually this 3% overhead can be mostly obviated, as the need of paging
>page tables. The clear win is to have reverse page map tables for the
>paging itself, whose size is only proportional to the size of real
>memory, and segment map tables for the description of valid stretches
>of address space. In this way you have the best of all worlds, because
>you can have a very large address space with only locally dense data
>space mappings, and a very low overhead.

This reverse scheme fails to save you much on machines with large physical
memories, which are becoming more and more common.  We build UNIX machines
that regularly have 1 or more gigabytes of real physical memory.
Allocating segment tables on demand is certainly helpful, though.

>I am quite sure that Don Lindsay could expound on the advantages of
>this approach from direct observation, as the Mach portable virtual
>memory code uses exactly this approach (well, the MACH people actually
>bitch about the reverse map MMU in the IBM RISCs, because they abuse
>copy-on-write, and reverse map makes it difficult, but this is just
>because they use the wrong virtual memory model, i.e. copy-on-write).

Wrong virtual model?  This sounds like a religious comment.  My understanding
is that UNIX has wanted to have true copy-on-write semantics for a while but
just never got around to it.  Witness the vfork() hack in BSD which was
supposed to disappear when "proper sharing semantics" were implemented.
The system we build has MACH ancestry (though a totally rewritten VM system)
and therefore allowed us to make vfork() the same as fork().

As far as paging the page tables, our page fault handler is always prepared 
to allocate page table space.  The only problem would be to clean the system's
TLBs since we put page tables into the kernel's virtual address space.

In our multiprocessor systems, my feeling is that it is a win to be able to 
page your page tables to prevent some process from consuming lots of local
page table memory in order to reference a lot of system-wide memory.  Other
small local processes (or even user allocations from the same process) would 
then get very starved for memory on the local node.

#include <std_disclaimer>
_______________________________________________________________________________
				 ____ \ / ____
Laurence S. Kaplan		|    \ 0 /    |		BBN Advanced Computers
lkaplan@bbn.com			 \____|||____/		10 Fawcett St.
(617) 873-2431			  /__/ | \__\		Cambridge, MA  02238

johnl@esegue.segue.boston.ma.us (John R. Levine) (07/06/90)

I don't get what this argument is about.  Every machine I've ever seen that
has regular (as opposed to inverted) page tables pages the page tables.
Some, like the Vax, do it explicitly by embedding the page table in another
pagable address space.  Most others, such as the IBM 370 and the Intel 386,
get exactly the same effect with two-level page tables.  In both cases, the
page table is broken up into chunks (which always seem to have a size of one
page) and each chunk can be marked resident or non-resident independently.

Except for machines like the PDP-11/70 which has a tiny virtual address
space, the page table is too big to lock down into memory.  Even if you have
the memory, having to allocate a large contiguous chunk of physical memory
is painful, so paged page tables allow scattered allocation of page tables
just like any other memory.

-- 
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650
johnl@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl
Marlon Brando and Doris Day were born on the same day.

mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (07/06/90)

In article <58001@bbn.BBN.COM>
	lkaplan@BBN.COM (Larry Kaplan) writes:

>Wrong virtual model?  This sounds like a religious comment.  My understanding
>is that UNIX has wanted to have true copy-on-write semantics for a while but
>just never got around to it.  Witness the vfork() hack in BSD which was
>supposed to disappear when "proper sharing semantics" were implemented.
>The system we build has MACH ancestry (though a totally rewritten VM system)
>and therefore allowed us to make vfork() the same as fork().

If you implement vfork() the same as fork(), your system is broken.

If a large process wants to fork(), it requires a large amount of virtual
memory space (twice the size of the process). So, if the process has almost
used up the virtual memory space, it cannot fork(). Copy-on-write
won't help.

If a large process wants to vfork(), no extra virtual memory space
is necessary.

This is a serious problem when large scale computing is necessary.

						Masataka Ohta

lkaplan@bbn.com (Larry Kaplan) (07/06/90)

I wrote:
>
>>Wrong virtual model?  This sounds like a religious comment.  My understanding
>>is that UNIX has wanted to have true copy-on-write semantics for a while but
>>just never got around to it.  Witness the vfork() hack in BSD which was
>>supposed to disappear when "proper sharing semantics" were implemented.
>>The system we build has MACH ancestry (though a totally rewritten VM system)
>>and therefore allowed us to make vfork() the same as fork().

In article <5796@titcce.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:
>
>If you implement vfork() the same as fork(), your system is broken.
>
>If a large process wants to fork(), it requires a large amount of virtual
>memory space (twice the size of the process). So, if the process has almost
>used up the virtual memory space, it cannot fork(). Copy-on-write
>won't help.
>
>If a large process wants to vfork(), no extra virtual memory space
>is necessary.
>
>This is a serious problem when large scale computing is necessary.
>
>						Masataka Ohta

This shows some serious misunderstanding on your part.  The idea behind 
copy-on-write is that a proper fork() requires NO EXTRA physical memory for 
the child process (except that required for kernel data structures and page 
tables).  You probably end up needing one stack page pretty soon.  Virtual 
memory is VIRTUAL.  Who cares if you were running out in one process?  You get 
an entire new address space in the next.  This implementation of fork() is
what fork() was always intended to be, as far as I can tell.  By doing fork()
correctly, the need for a separate vfork() disappears (as stated in the BSD
man pages).
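
A toy user-level model of that point: fork() shares every frame and
allocates nothing, and a frame is copied only when a write fault finds it
still shared.  This is only bookkeeping, not kernel code:

#include <stdlib.h>
#include <string.h>

#define PAGE    4096
#define NPAGES  16

typedef struct { char *frame; int refs; } page_t;      /* a shared frame    */
typedef struct { page_t *map[NPAGES]; } aspace_t;      /* a tiny "process"  */

/* fork: the child references the parent's frames; no memory is consumed */
aspace_t cow_fork(const aspace_t *parent)
{
    aspace_t child = *parent;
    int i;
    for (i = 0; i < NPAGES; i++)
        if (child.map[i])
            child.map[i]->refs++;
    return child;
}

/* write fault: only now, and only if the frame is still shared, copy it */
void cow_write_fault(aspace_t *as, int vpn)
{
    page_t *p = as->map[vpn];
    if (p->refs > 1) {
        page_t *copy = malloc(sizeof *copy);
        copy->frame  = malloc(PAGE);
        memcpy(copy->frame, p->frame, PAGE);
        copy->refs = 1;
        p->refs--;
        as->map[vpn] = copy;
    }
}

int main(void)
{
    page_t page0 = { calloc(PAGE, 1), 1 };
    aspace_t parent = { { &page0 } };          /* one mapped page           */
    aspace_t child  = cow_fork(&parent);       /* costs nothing             */
    cow_write_fault(&child, 0);                /* first write pays the copy */
    return child.map[0] == parent.map[0];      /* 0: they now diverge       */
}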

#include <std_disclaimer>
_______________________________________________________________________________
				 ____ \ / ____
Laurence S. Kaplan		|    \ 0 /    |		BBN Advanced Computers
lkaplan@bbn.com			 \____|||____/		10 Fawcett St.
(617) 873-2431			  /__/ | \__\		Cambridge, MA  02238

henry@zoo.toronto.edu (Henry Spencer) (07/06/90)

In article <5796@titcce.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:
>If a large process wants to fork(), it requires a large amount of virtual
>memory space (twice the size of the process). So, if the process has almost
>used up the virtual memory space, it cannot fork(). Copy-on-write
>won't help.

Nonsense.  There need not be any concept of "using up virtual memory space".
Virtual memory is *virtual*; only when those pages need to become real do
they consume resources.  Forking large processes without needing lots more
memory/disk just requires an intelligent implementation.  Unfortunately,
intelligent implementors are scarce.

Vfork is an abomination invented by lazy people; the sooner it goes,
the better.  (Please don't tell me that hardware problems prevented them
from implementing copy-on-write; copy-on-access they *could* have done,
and it would have been almost as good.)
-- 
"Either NFS must be scrapped or NFS    | Henry Spencer at U of Toronto Zoology
must be changed."  -John K. Ousterhout |  henry@zoo.toronto.edu   utzoo!henry

pcg@cs.aber.ac.uk (Piercarlo Grandi) (07/07/90)

In article <1990Jul6.160004.896@zoo.toronto.edu> henry@zoo.toronto.edu
(Henry Spencer) writes:

   In article <5796@titcce.cc.titech.ac.jp>
   mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:

   >If a large process wants to fork(), it requires a large amount of virtual
   >memory space (twice the size of the process). So, if the process has almost
   >used up the virtual memory space, it cannot fork(). Copy-on-write
   >won't help.

   Nonsense.  There need not be any concept of "using up virtual memory space".
   Virtual memory is *virtual*; only when those pages need to become real do
   they consume resources.  Forking large processes without needing lots more
   memory/disk just requires an intelligent implementation.  Unfortunately,
   intelligent implementors are scarce.

What Ohta probably means here is that in naive implementations that
reserve swap space as soon as the corresponding virtual address range is
defined, if one process has consumed more than half swap space,
copy-on-write will fail, while vfork will not... I do agree with Spencer
that this is merely a fixup to a wrong implementation decision, though.

I do not believe that copy-on-write and the Unix fork()/exec() semantics
are in general useful. IMNHO the right thing is that address space
creation and thread creation should be separate operations, and threads
of control and memory segments should be movable from one address space
to another, and there should be no shared memory at all (except for
irrevocably read only segments), either copy-on-write or not.

To get something like a fork you would do:

	create new address space, with no thread or segment in it
	create and map into it the required code and data segments
	create a new thread in this address space
	jump this new thread to the other address space

This requires the ability to distinguish, and manipulate independently,
threads, address spaces, and data spaces; unfortunately the three concepts
are inextricably merged in almost all the OS designs that are popular
(e.g. the only OS I know of that separates the concept of data space from
that of address space is MUSS, and precious few, e.g. the CAP monitor or
some recent Rochester ones, allow you to manage threads and address spaces
independently).
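
A sketch of the four-step sequence above against a purely hypothetical API;
none of these calls exist in any real system, they merely give names to the
decoupled operations being argued for:

/* Hypothetical primitives: address spaces, segments, and threads are
 * independent objects, per the four steps above. */
typedef int aspace_t, segment_t, thread_t;

extern aspace_t  aspace_create(void);                        /* empty space       */
extern void      segment_map(aspace_t, segment_t, unsigned long va);
extern thread_t  thread_create(void);                        /* in caller's space */
extern void      thread_jump(thread_t, aspace_t, unsigned long entry);

#define TEXT_BASE 0x00010000UL
#define DATA_BASE 0x00400000UL

/* "Something like a fork", built from the decoupled primitives */
thread_t spawn(segment_t text, segment_t data, unsigned long entry)
{
    aspace_t as = aspace_create();        /* 1: new address space, empty        */
    segment_map(as, text, TEXT_BASE);     /* 2: map code and data into it       */
    segment_map(as, data, DATA_BASE);
    thread_t t = thread_create();         /* 3: a new thread                    */
    thread_jump(t, as, entry);            /* 4: jump it into the new space      */
    return t;
}
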
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (07/09/90)

In article <1990Jul6.160004.896@zoo.toronto.edu>
	henry@zoo.toronto.edu (Henry Spencer) writes:

>In article <5796@titcce.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:
>>If a large process wants to fork(), it requires a large amount of virtual
>>memory space (twice the size of the process). So, if the process has almost
>>used up the virtual memory space, it cannot fork(). Copy-on-write
>>won't help.

>Nonsense.  There need not be any concept of "using up virtual memory space".
>Virtual memory is *virtual*;

Sigh..., It seems to me that I must explain AGAIN in this newsgroup
why vfork() is essential.

Virtual memory is NOT nothing.

>only when those pages need to become real do
>they consume resources.

OK. You are correct here, though the terminology is confusing ("real" here
does not mean you need real memory, but means you need swap space).

So, perhaps, you know when a page needs to be allocated
swap space. With a copy-on-write scheme, a page needs swap space
when something is written to it.

>Forking large processes without needing lots more
>memory/disk just requires an intelligent implementation.  Unfortunately,
>intelligent implementors are scarce.

I admit that that is an elaborate implementation. But, in general,
elaboration should be considered a vice in UNIX.

In this case, the INTELLIGENT implementation is much worse than a
simple-minded one (though copy-on-write is already too elaborate,
and thus a vice). Fortunately, INTELLIGENT implementors
are scarce.

If you think virtual memory is free and allow forking without reserving
actual swap space, then when swap space is required, it is often the case
that there is no free swap space anymore.

That is, if a process (which may not be a large process; it may even
be a system process) writes to a copy-on-write page, it can't. The situation
is a deadlock, and to resolve it, some process must be killed.

In some INTELLIGENT implementations, swap space for read-only text
pages will also be allocated on request, so even an instruction read
(not a write) can cause deadlock.

The problem is serious when you do a large scale computation.

>Vfork is an abomination invented by lazy people; the sooner it goes,
>the better.

You should understand that laziness is a virtue in UNIX.

>(Please don't tell me that hardware problems prevented them
>from implementing copy-on-write; copy-on-access they *could* have done,
>and it would have been almost as good.)

But, the reason why vfork survives in 4.4BSD is, as I heard from
a person in Berkeley, that vfork is faster than fork. Even with
the elaborate shared-memory mechanism of 4.4BSD, it takes time
to spawn a large completely new process.

						Masataka Ohta

jkrueger@dgis.dtic.dla.mil (Jon) (07/10/90)

mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes:

>With a copy-on-write scheme, a page needs swap space
>when something is written to it.

But not until.  Page and swap file space allocation is as postponable
as the memory copy.  Again, the cost is not paid at process creation
and quite commonly not ever.  In UNIX systems the most common sequence
of events after a fork is an exec.  Why allocate swap space at process
creation when the most common immediately following event will be
execution of a different sized program?

>If you think virtual memory is free and allow forking without reserving
>actual swap space, then when swap space is required, it is often the case
>that there is no free swap space anymore.

That way lies MVS, friends;  can't do anything without allocating all
resources you might ever use, paying for them whether you ever use them
or not, denying them to others, including your own other processes.  If
you want MVS you know where to find it.  The rest of us have found more
flexible resource management schemes worthwhile.  We love ownership but
also find option-to-buy an attractive proposition.

-- Jon
-- 
Jonathan Krueger    jkrueger@dtic.dla.mil   uunet!dgis!jkrueger
Drop in next time you're in the tri-planet area!

pcg@cs.aber.ac.uk (Piercarlo Grandi) (07/10/90)

In article <58001@bbn.BBN.COM> lkaplan@bbn.com (Larry Kaplan) writes:

   [ I had written: ]
   >In this way you have the best of all worlds, because
   >you can have a very large address space with only locally dense data
   >space mappings, and a very low overhead.

   This reverse scheme fails to save you much on machines with large physical
   memories which are becomming more and more common.  We build UNIX machines
   that regularly have 1 or more gigabytes of real physical memory.

Yes, but as somebody else has already remarked, reverse maps mean that
the overhead is proportional to physical size, i.e. for a given memory
configuration it is constant, while direct maps are proportional to
virtual size. Of course, if your workload underutilizes real memory (that
is, the combined size of all virtual address spaces is less than the
size of physical memory) you lose, but then you are wasting much more
space in data space than in translation tables.

   Allocating segment tables on demand is certainly helpful, though.

The only way to get around the fundamentally poor encoding of direct
maps and their ensuing unsuitability for large sparse locally dense
address spaces is to use many levels of page tables; three levels (region,
segment, page) is not uncommon.  This has problems of its own, typically
not just latency (TLBs help here as well), but complicated fault
handling and restartability.

   >I am quite sure that Don Lindsay could expound on the advantages of
   >this approach from direct observation, as the Mach portable virtual
   >memory code uses exactly this approach (well, the MACH people actually
   >bitch about the reverse map MMU in the IBM RISCs, because they abuse
   >copy-on-write, and reverse map makes it difficult, but this is just
   >because they use the wrong virtual memory model, i.e. copy-on-write).

   Wrong virtual model?  This sounds like a religious comment.

Religious maybe, but the evils of data space aliasing are there for
everybody to see... I at least do see them :-).


   My understanding is that UNIX has wanted to have true copy-on-write
   semantics for a while but just never got around to it.

I still think that fork() has the wrong semantics and a more decoupled
design would be neater and more efficient. If you stay within the fork()
based model of address space coupled to control thread, creating a new
control thread requires creating a new address space, and copy-on-write
is just a fairly tolerable implementation technique. On a PDP-11 it was
not really terribly necessary, with its 64KB limit on data space.

--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

chip@tct.uucp (Chip Salzenberg) (07/10/90)

According to mohta@necom830.cc.titech.ac.jp (Masataka Ohta):
>Sigh..., It seems to me that I must explain AGAIN in this newsgroup
>why vfork() is essential.

"Essential" is pretty strong language.  At best, you explain
why YOU think vfork() is *valuable*.

>So, perhaps, you know when a page need to be allocated
>swap space. With copy-on-write scheme, a page need swap space
>when the page is written something.

This behavior is entirely consistent with other Unix resource
management behavior.  Programs do not begin by reserving swap space
for all the storage they might need.  Rather, they ask for memory as
necessary.  Likewise, during a fork(), a Unix process doesn't reserve
swap space; it uses it as necessary.

Perhaps you don't like this philosophy.  Well, then, it seems you're
using the wrong OS for your tastes.

>>Forking large processes without needing lots more memory/disk just
>>requires an intelligent implementation.  Unfortunately, intelligent
>>implementors are scarce.
>
>I admit that that is an elaborated implementation. But, in general,
>elabolation should be considered vice in UNIX.

"Intelligent" and "elaborate" are not synonyms.
-- 
Chip Salzenberg at ComDev/TCT     <chip@tct.uucp>, <uunet!ateng!tct!chip>

mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (07/12/90)

In article <2699E08D.117A@tct.uucp> chip@tct.uucp (Chip Salzenberg) writes:

>This behavior is entirely consistent with other Unix resource
>management behavior.  Programs do not begin by reserving swap space
>for all the storage they might need.  Rather, they ask for memory as
>necessary.  Likewise, during a fork(), a Unix process doesn't reserve
>swap space; it uses it as necessary.
>
>Perhaps you don't like this philosophy.  Well, then, it seems you're
>using the wrong OS for your tastes.

I don't know what variation of UNIX you know. So please tell me
what happens when there is not enough swap space with your
favorite UNIX.

With vfork, such a situation never occurs (except for the stack segment),
because the fork is denied.

Without vfork, it is a serious problem. So, please tell me
what happens.

>"Intelligent" and "elaborate" are not synonyms.

No, of course.

					Masataka Ohta