[comp.os.research] Life with TLB and no PT

darrell@sdcsvax.UUCP (04/22/87)

Prodded by something Avie Tevanian of the MACH research group said,
I have been considering life with a translation lookaside buffer (TLB)
but without hardware page tables (PT).  This message is intended to
spark some discussion of a) what such a system would be like, and
b) how existing architectures and operating systems can be instrumented
to get some empirical data to guide TLB design and sizing.  Since it
involves tradeoffs between hardware and software, I cross-posted to
comp.os.research and comp.arch.
  
Given a good TLB design which doesn't require frequent flushing of
entries, it is possible to do without hardware PT entirely.  I don't
know of anybody working on a machine with TLB and no PT, but I wouldn't
be at all surprised to see one in the next few years.  The trick is to
increase the TLB hit rate to the point where you can afford to provide
translation information entirely through software.  If you do that, you
can keep any amount of "PT" information and implement all kinds of nice
things like shared segments of variable size, sparse address spaces, etc.
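
To make the idea concrete, here is a minimal sketch of a software-refill
scheme: the TLB-miss trap handler probes an OS-chosen structure (a hash
table keyed on context tag and virtual page number in this example) and
writes the result into the TLB, falling into the ordinary page-fault path
when nothing is found.  All of the names, and the tlb_write()/page_fault()
primitives, are invented for illustration; they are not any particular
machine's interface.

    /* Hypothetical software translation map probed on a TLB miss. */
    #include <stddef.h>
    #include <stdint.h>

    #define MAP_BUCKETS 1024

    struct sw_mapping {
        uint32_t asid;              /* context tag */
        uint32_t vpn;               /* virtual page number */
        uint32_t pfn;               /* physical frame number */
        uint32_t prot;              /* protection bits, OS-defined */
        struct sw_mapping *next;    /* hash chain */
    };

    static struct sw_mapping *sw_map[MAP_BUCKETS];

    /* Assumed hardware primitive: install one entry in the TLB. */
    extern void tlb_write(uint32_t asid, uint32_t vpn, uint32_t pfn,
                          uint32_t prot);
    /* Assumed entry point into the normal page-fault path. */
    extern void page_fault(uint32_t asid, uint32_t vpn);

    static size_t hash(uint32_t asid, uint32_t vpn)
    {
        return (asid * 31 + vpn) % MAP_BUCKETS;
    }

    /* Called from the TLB-miss trap: either refill or fall into the
       full page-fault handler. */
    void tlb_miss(uint32_t asid, uint32_t vpn)
    {
        for (struct sw_mapping *m = sw_map[hash(asid, vpn)]; m; m = m->next) {
            if (m->asid == asid && m->vpn == vpn) {
                tlb_write(asid, vpn, m->pfn, m->prot);
                return;
            }
        }
        page_fault(asid, vpn);      /* not mapped or not resident */
    }

The point is that the structure behind tlb_miss() can be anything the OS
likes: per-segment lists, an inverted table, whatever supports sparse and
shared address spaces cheaply.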

Another benefit is reduced hardware/microcode complexity and therefore
reduced cost.  I imagine that a TLB of a given design can be made lots
bigger for less money than it would take to (re)implement hardware/microcode
address translation.  Large address spaces that require paging to support the
PT, like the VAX (two levels) or the 432 (is it really SEVEN levels?),
get translated in inexpensive, flexible software rather than expensive,
hard-to-change hardware/microcode.

Is a TLB/no-PT architecture at all feasible?  If you look at the VAX
TLB studies published in TOCS, it seems clear that the VAX TLB design
can reasonably be sized to reduce misses to just under one percent.
This is good, but not good enough.  The optimum would be to reduce
misses to just those addresses which are not in physical memory;  at
that point the software has to get into the act to process a page fault
anyway.  On the other hand, even if there's a TLB entry for every
physical page frame, if there's sharing between contexts, you still
don't get complete TLB coverage of the physically resident pages.
(This suggests segmented address spaces (MULTICS-style, not
80?86-style), and sharing segments rather than pages, but I will assume
linear address spaces.) So one question to answer is: how high a TLB
miss rate can be tolerated?  A better way to state it is:  how much
time can be devoted to address translation?  Then the tolerable TLB
miss rate is a function of the cost and complexity of doing an address
translation, which is a function of the operating system (virtual
machine) rather than the machine architecture (real machine).

One thing that is clearly needed for a PT-less TLB system, or even a
PT-full TLB system with good performance, is a set of context tags
(process identifiers) so that TLB entries do not need to be flushed on
every context switch.  Some VAX TLB implementations distinguish system
and user contexts.  The memory management hardware in the Sun-2
architecture has 8 contexts, but no TLB.  There is a lot of room
for experimentation and improvement here.  How many tags should there
be?  Which contexts (processes) should have tags at a given moment?
The first question is one of architecture, the other of operating systems.
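
For concreteness, "context tags" just means that the hit condition in the
TLB involves the current context as well as the page number, along the
lines of this sketch (field widths are arbitrary here):

    #include <stdbool.h>
    #include <stdint.h>

    struct tlb_entry {
        uint32_t vpn;      /* virtual page number */
        uint32_t pfn;      /* physical frame number */
        uint8_t  tag;      /* context tag; 6 or 8 bits in real designs */
        bool     valid;
    };

    /* An entry matches only if both the page number and the context tag
       match, so entries from other contexts survive a context switch. */
    static bool tlb_hit(const struct tlb_entry *e, uint32_t vpn,
                        uint8_t cur_tag)
    {
        return e->valid && e->vpn == vpn && e->tag == cur_tag;
    }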

Consider software for a moment.  One of the things everybody preaches
is to move utility services out of the machine kernel and into
"user-level" server processes.  One of the things nobody does is, yes,
just that.  The cost in context switching is too high for high
performance of system services.  From this observation, it is clear
that you want enough tags to handle, not just the processes ready to
run, but also all those servers which are in some queue waiting for
requests.  On the other hand, you can't afford unbounded TLB hardware.
Where is the point of diminishing returns?  Obviously the answer
depends on many factors, and even a detailed analysis of the factors
would be useful.

You also have the classic problem of allocating a resource (TLB
entries) among different users (contexts).  Extending the virtual
address to include the context tag amounts to statically allocating
each context a fixed share of the TLB.  I can't imagine that this
leads to effective use of the TLB for even 8 contexts.  On the other
hand, what are the costs of deciding which entries, from which
contexts, to discard?  Working set theory indicates that at some point
it is better to throw away entries from the current context, rather
than another context.

Rather than speculate about just where the tradeoffs are, I would like
to spark some discussion of how existing systems could be instrumented
to empirically discover how many tags would pay off, similarly to the
way systems have been instrumented to measure how much a given size or
associativity of TLB would pay off.  It's not obvious that measurements
of, say, BSD UNIX, would really be applicable to a system with the same
services, but IPC, file system, etc, etc, stripped out of the kernel
into user processes.  Still, some empirical data is better than none.
I am interested in what events and situations to look for on a generic
modern OS, rather than details about what code to poke in FooOS.
That's boring without some principled idea what to look for...

Suggestions?  Ideas?  Experiences???

Stu Friedberg  {seismo, allegra}!rochester!stuart  stuart@cs.rochester.edu

darrell@sdcsvax.UUCP (04/22/87)

First a question: isn't this what an AMD29000 is like?  It has a 32-line,
2-way set-associative TLB.  On a TLB miss, it takes a trap and lets you do
whatever you wish.  Theoretically, traps are extremely fast.  The processor
doesn't save any context for you, and it has enough registers that you can
dedicate a few of them to the TLB trap handler.  The TLB entries are
also tagged with an 8 bit Task ID which must match the PID in the
configuration register for a TLB hit. I don't think the TLB is big enough
though. I think that TLB reload from a two level page table could be done in
15-20 cycles or so assuming all pages are in memory. I think somebody
mentioned that the MIPS chip does at least part of its TLB reload in
software; anybody care to comment?
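
The 15-20 cycle estimate is plausible because a refill from a two-level
table is only a couple of indexed loads plus a validity check.  A sketch of
such a walk, assuming 4K pages, 10-bit indices at each level, and invented
pde_base()/tlb_write() primitives (this is not the 29000's actual
interface):

    #include <stdint.h>

    /* Invented primitives; not any real machine's refill interface. */
    extern uint32_t *pde_base(uint32_t task_id);  /* top-level table */
    extern void tlb_write(uint32_t vpn, uint32_t pte);
    extern void page_fault(uint32_t vaddr);

    #define PTE_VALID 0x1u

    /* Two-level walk: top 10 bits index the directory, next 10 bits index
       the leaf table, low 12 bits are the page offset (4K pages). */
    void tlb_refill(uint32_t task_id, uint32_t vaddr)
    {
        uint32_t dir_idx = vaddr >> 22;
        uint32_t tbl_idx = (vaddr >> 12) & 0x3ffu;

        uint32_t pde = pde_base(task_id)[dir_idx];
        if (!(pde & PTE_VALID)) {
            page_fault(vaddr);          /* leaf table not present */
            return;
        }
        uint32_t *leaf = (uint32_t *)(uintptr_t)(pde & ~0xfffu);
        uint32_t pte = leaf[tbl_idx];
        if (!(pte & PTE_VALID)) {
            page_fault(vaddr);          /* page not resident */
            return;
        }
        tlb_write(vaddr >> 12, pte);    /* a handful of loads and shifts */
    }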

My interest in this subject is a lightweight process model operating system
for single user workstations. Being able to design your own protection
mechanisms into the paging scheme makes doing this efficiently a lot easier.
There is also a lot of flexibility to be gained in shared memory and paging
algorithms. You aren't satisfied with the reference bits in the page table?
Go add some more...
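
For instance, since the "page table" is now just an OS data structure,
nothing limits you to one reference bit and one dirty bit per page; a
purely hypothetical software pte might look like:

    #include <stdint.h>

    /* A purely software pte: the hardware never sees this structure, so
       extra policy state costs nothing but memory. */
    struct soft_pte {
        uint32_t pfn;          /* physical frame number */
        uint8_t  prot;         /* protection bits */
        uint8_t  ref_history;  /* e.g. 8 sampled reference bits, aged */
        uint8_t  dirty;
        uint8_t  sharers;      /* how many address spaces map this page */
    };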

I hadn't thought much about servers and their effects on the TLB. I am more
interested in what happens when a user has 10 different interactive processes
going at once and the mouse tracks through five of them on its way from the
editor to the shell... Of course, in such a system there should be enough
sharing that most of the same code is being called. The same applies to
server code. With lots of sharing, not that many TLB entries would be needed.
(Of the N megs of code in memory on my machine, how much of it is redundant?)

We are already starting to see useful systems for this kind of research. The
VICE filesystem operates outside of the kernel. MACH supposedly allows user
specified paging code. Perhaps this could be used to record paging activity
in detail.

Sincerely,
Zalman Stern
Information Technology Center
Carnegie Mellon University
Arpanet: zs01#@andrew.cmu.edu

mash@mips.uucp (John Mashey) (04/23/87)

In article <3027@sdcsvax.UCSD.EDU> stuart@CS.ROCHESTER.EDU (Stuart Friedberg) writes:
>Prodded by something Avie Tevanian of the MACH research group said,
>I have been considering life with a translation lookaside buffer (TLB)
>but without hardware page tables (PT)...
>  
>Given a good TLB design which doesn't require frequent flushing of
>entries, it is possible to do without hardware PT entirely.  I don't
>know of anybody working on a machine with TLB and no PT, but I wouldn't
>be at all surprised to see one in the next few years......

The following machines, which shipped in 1986 or earlier do exactly this:
Celerity 12xx (I think)
HP840 (and other Spectrum RISC machines) (for sure)
MIPS R2000 (for sure)
	(See "Operating Support on a RISC", in Proc. IEEE COMPCON, March
	1986, for rationale behind doing TLBs this way.)

The recently-announced AMD29000 does the same.
>
>Another benefit is reduced hardware/microcode complexity and therefore
>reduced cost....
Yes. Bus interface units are often pretty random.  TLB cells are at
least regular.

>
>Is a TLB/no-PT architecture at all feasible? .....
Yes.  With 4K pages, 64 fully-associative on-chip TLB entries, and an
unmapped region of the address space that bypasses the TLB (which greatly
lessens the usual TLB-crunching from the nonlocality of kernels), we usually
see 1% of the user CPU time (or less) in TLB-miss processing.  [Simulations +
measurements/counting.]  Once you get it down that far, you stop worrying
about it: lots of other things are MUCH more important for performance.
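
As a rough sanity check on that figure (my own assumed inputs, not MIPS's
measured numbers): the overhead is about (misses per reference) x
(references per instruction) x (refill cycles) / (cycles per instruction).

    #include <stdio.h>

    int main(void)
    {
        /* Assumed, illustrative inputs; not measured figures. */
        double refill_cycles = 16.0;     /* cost of one software refill     */
        double miss_rate     = 0.0005;   /* TLB misses per memory reference */
        double refs_per_inst = 1.3;      /* instruction fetch plus data     */
        double cpi           = 1.2;      /* base cycles per instruction     */

        double overhead = miss_rate * refs_per_inst * refill_cycles / cpi;
        printf("TLB refill overhead: %.2f%% of CPU time\n", overhead * 100.0);
        return 0;
    }

With those inputs the overhead works out to a bit under 1%, consistent with
the claim above.
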
>
>One thing that is clearly needed for a PT-less TLB system, or even a
>PT-full TLB system with good performance, is a set of context tags
>(process identifiers) so that TLB entries do not need to be flushed on
>every context switch. 

We use 6-bit PIDs in our TLB; worst case context-switch overhead for
TLB-flushing = 1 microsecond [on an 8MHz machine] / context switch on the
average, assuming that you have >64 processes, and you roundrobin them without
rescheduling. [This isn't the way real systems work: usually you switch
amongst a smaller number of more active processes, even on large machines.]

Anyway, TLBs refilled by software are pretty easy to measure: you just
put extra counters in there.  However, as noted: it stops being an
interesting problem when the performance hit is so low.
At least in our case, both 4.3BSD and System V.3 run [with slightly
different refill code, for convenience] on this, and we've written
half a dozen variants to fit various prospects' existing or preferred
environments.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

darrell@sdcsvax.UUCP (04/23/87)

The discussion about TLB's with no hardware PT brings up the following
remark/suggestion:

It is possible (and has been done) for the TLB to allow multiple
simultaneous page sizes.  This has a lot of attraction for
engineering/scientific users (us folks who use floating point :-)
because typical code segments are much smaller than data segments.
This allows code to live on small pages, data to live on large pages,
and the page tables to be kept small even when the machine memory gets
very, very big.  Try it; your users will like it.  If you don't do
something like this, you will need a very large (expensive in all
relevant senses) TLB to handle simple cases like accessing big arrays
along the second or third dimension.  A big TLB  may also increase
context switch time excessively.
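
To make the array case concrete, consider summing one column of a large
row-major matrix; with small pages nearly every element lands on a
different page, so the loop needs roughly as many translations as the
matrix has rows (the sizes here are illustrative, not any particular
machine's):

    #define N 2048                 /* 2048 x 2048 doubles = 32 megabytes */

    /* Summing one column of a row-major matrix touches N distinct rows.
       With 4K pages, each 16K row lies on its own pages, so the loop
       needs about 2048 translations to read only 16K of data; with
       1-megabyte pages the same loop needs about 32. */
    double column_sum(double (*a)[N], int col)
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            s += a[i][col];
        return s;
    }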

Another question:  The minimum requirements for support of Capabilities
seem to be:

1) Virtualizable instruction set architecture
2) Hardware control to permit read-only access.  

Do any of the new machines in question qualify on both counts?
(e.g. am29000, MIPS R-2000, Celerity . . .)


  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov

"In order to promise genuine progress, the acronym RISC should stand 
for REGULAR (not reduced) instruction set computer." - Wirth

("Any opinions expressed herein are solely the responsibility of the
author and do not represent the opinions of NASA or the U.S. Government")

hansen@mips.UUCP (Craig Hansen) (04/23/87)

In article <3027@sdcsvax.UCSD.EDU>, stuart@CS.ROCHESTER.EDU (Stuart Friedberg) writes:
> Prodded by something Avie Tevanian of the MACH research group said,
> I have been considering life with a translation lookaside buffer (TLB)
> but without hardware page tables (PT).  This message is intended to
> spark some discussion of a) what such a system would be like, and
> b) how existing architectures and operating systems can be instrumented
> to get some empirical data to guide TLB design and sizing.  Since it
> involves tradeoffs between hardware and software, I cross-posted to
> comp.os.research and comp.arch.

The MIPS R2000 processor already does it, and with a 4k page size, a
64-entry, fully-associative TLB, and 15-19 cycles to refill the TLB in
software from a simple page table, large benchmarks typically spend about 1%
of their execution time in the software TLB refill handler. There are 6-bit
process identifiers associated with each entry, so that the TLB doesn't need
to be flushed on context switches, and the permanently memory-resident
portion of the kernel doesn't use the TLB at all, which reduces the overall
size requirement of the TLB by a large margin.

An important benefit of the software TLB handler is that it is easily
changeable.  We use two different handlers for the two different UNIX ports we
have running on the machine (BSD 4.3 and V.3), because there are two
different page table formats in use.  We've been able to easily add code to
count TLB misses to verify our simulation data under real system conditions.
We have also used it to code around hardware problems in one revision of our
processor chip.

-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen

darrell@sdcsvax.UUCP (04/24/87)

In article <3031@sdcsvax.UCSD.EDU> lamaster@pioneer.arpa (Hugh LaMaster) writes:
+-----
| Another question:  The minimum requirements for support of Capabilities
| seems to be:
| 
| 1) Virtualizable instruction set architecture
| 2) Hardware control to permit read-only access.  
| 
| Do any of the new machines in question qualify on both counts?
| (e.g. am29000, MIPS R-2000, Celerity . . .)
+-----
The Am29000 processor is fully "virtualizable"; all privileged
instructions are trapped with a protection violation in such a way that
the privileged instruction can be examined and executed for the "virtual
machine" by the "real machine", and the virtual machine's instruction
stream can be restarted correctly.

The protection bits provided in the TLB entries are:

	SR	-- supervisor read permission
	SW	-- supervisor write permission
	SE	-- supervisor execute permission

	UR	-- user read permission
	UW	-- user write permission
	UE	-- user execute permission

So the answer is "yes" on both counts.
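
For what it's worth, with those six bits the access check at translation
time reduces to selecting one bit by processor mode and access type,
something like the following sketch (the bit positions are made up; only
the six permissions come from the description above):

    #include <stdbool.h>
    #include <stdint.h>

    enum access { ACC_READ, ACC_WRITE, ACC_EXEC };

    /* Illustrative bit assignments; the real TLB entry layout differs. */
    #define SR (1u << 0)
    #define SW (1u << 1)
    #define SE (1u << 2)
    #define UR (1u << 3)
    #define UW (1u << 4)
    #define UE (1u << 5)

    static bool access_ok(uint32_t prot, bool supervisor, enum access a)
    {
        static const uint32_t need[2][3] = {
            { UR, UW, UE },     /* user mode */
            { SR, SW, SE },     /* supervisor mode */
        };
        return (prot & need[supervisor ? 1 : 0][a]) != 0;
    }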


	-- Tim Olson
	Advanced Micro Devices
	Processor Strategic Development
	(tim@amdcad.AMD.COM)

mcvoy@crys.WISC.EDU (Larry McVoy) (04/25/87)

hansen@mips.UUCP (Craig Hansen) writes:
>of their execution time in the software TLB refill handler. There are 6-bit
>process identifiers associated with each entry, so that the TLB doesn't need

Does that mean you wrap process id's around at 64?  If not, how do you know
if you have a hit or not?
-- 
Larry McVoy 	        mcvoy@rsch.wisc.edu  or  uwvax!mcvoy

"It's a joke, son! I say, I say, a Joke!!"  --Foghorn Leghorn

mash@mips.UUCP (John Mashey) (04/26/87)

In article <3040@sdcsvax.UCSD.EDU> mcvoy@crys.WISC.EDU (Larry McVoy) writes:
>hansen@mips.UUCP (Craig Hansen) writes:
>>of their execution time in the software TLB refill handler. There are 6-bit
>>process identifiers associated with each entry, so that the TLB doesn't need
>
>Does that mean you wrap process id's around at 64?  If not, how do you know
>if you have a hit or not?

The 6-bit identifiers are treated as a resource onto which 16-bit real PIDs
are mapped dynamically, there being no longer-term association.

When you run out, you flush the whole TLB, then give each process a
new PID as you dispatch it.  In the worst case, where you had >64 processes
being round-robbined thru the complete cycle, the overhead for this
costs 1 microsecond / context switch [on a 5Mips machine].
Real systems don't work this way, since you often have much more
back-and-forth switching.  If you understand the amount of work that
UNIX does on a context switch, this is completely in the noise.
Note that 68K designs that use SRAM MMUs usually have the same sort of
algorithm, except they must load up the SRAMs, and there are usually
less than 64 contexts [such as 8 on Sun3s], i.e. not a problem.
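
A sketch of that recycling scheme, with 64 hardware tags, an invented
tlb_flush_all() primitive, and the process's previous tag passed in by the
dispatcher (this is just the shape of the algorithm, not MIPS's actual
code; it also assumes real PIDs are nonzero):

    #include <stdint.h>

    #define NTAGS 64

    extern void tlb_flush_all(void);   /* assumed hardware primitive */

    static uint8_t  next_tag = 1;      /* tag 0 reserved for "no tag" */
    static uint16_t tag_owner[NTAGS];  /* real PID holding each tag */

    /* Called at dispatch: reuse the process's old hardware tag if it is
       still assigned to it, otherwise hand out a fresh one.  When the
       tags run out, flush the whole TLB and start handing them out again. */
    uint8_t assign_tag(uint16_t pid, uint8_t cached_tag)
    {
        if (cached_tag != 0 && tag_owner[cached_tag] == pid)
            return cached_tag;         /* its TLB entries are still usable */

        if (next_tag == NTAGS) {       /* out of tags: recycle them all */
            tlb_flush_all();
            for (int i = 0; i < NTAGS; i++)
                tag_owner[i] = 0;
            next_tag = 1;
        }
        tag_owner[next_tag] = pid;
        return next_tag++;
    }

The dispatcher would call this with the tag it remembered for the process
last time and cache whatever value comes back.
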
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

avie@wb1.cs.cmu.edu (Avadis Tevanian) (05/01/87)

In article <3027@sdcsvax.UCSD.EDU> stuart@CS.ROCHESTER.EDU (Stuart Friedberg) writes:
>Prodded by something Avie Tevanian of the MACH research group said,
>I have been considering life with a translation lookaside buffer (TLB)
>but without hardware page tables (PT).

As many have pointed out, building hardware with only a TLB is not a
problem.  What I find interesting, however, is that there has been virtually
no discussion of advantages of a non-PT system other than the obvious
simplification of the hardware.  Sure, you delete some hardware and add a
bit of software that emulates what the hardware would have done.

What is more interesting (from my point of view) is the new opportunities
created by a hardware design that frees you from page table maintenance
without suffering performance degradation due to page faults.  That is,
given that we want to make more flexible use of virtual memory (at least we
do in Mach) for shared libraries, mapped files, shared memory, truly sparse
addressing, etc..., does a hardware design with only a TLB help?

I think the answer is yes.  Consider, what you really want for the features
I mentioned is essentially an associative memory that maps arbitrary virtual
addresses (tagged by some ID) to physical addresses.  Page table designs, in
general, have the problem that they are, well, tables.  Tables are not an
efficient data representation for highly sparse structures, or even for
somewhat sparse structures (for example, an address space that was
fragmented due to mapping and unmapping many files of varying sizes).  Most
page table designs have tried to solve this problem using multi-level
tree-like tables... but this only complicates the software that must manage
those tables.

As it turns out, it is relatively easy to design high-level data structures
that describe complex address spaces compactly... we have written several
papers about this in the context of the Mach VM system.  However, since
these are fairly high-level data structures, it is not clear that
virtual-to-physical translation based on these structures is appropriate for
hardware.  That's where the TLB comes in.  If your TLB is efficient enough
(high enough hit ratio), it becomes feasible to allow the full-blown page
fault handler to be run at page fault time.  Of course, you can play
arbitrarily complex games to cache translations in addition to those
maintained by the TLB to eliminate some of the full-blown page faults.
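
One obvious compact representation is a sorted list (or tree) of entries,
one per contiguous region of the address space, consulted only at fault
time.  A sketch of the lookup; the names are illustrative, not Mach's
actual structures:

    #include <stddef.h>
    #include <stdint.h>

    /* One entry per contiguous region: a sparse 256-megabyte allocation
       is one entry here, not half a million page table entries. */
    struct map_entry {
        uint32_t start;            /* first page of the region */
        uint32_t end;              /* one past the last page */
        uint32_t prot;
        struct map_entry *next;    /* kept sorted by start */
    };

    /* At fault time, find the region covering the faulting page, if any. */
    struct map_entry *map_lookup(struct map_entry *map, uint32_t vpn)
    {
        for (struct map_entry *e = map; e != NULL; e = e->next) {
            if (vpn < e->start)
                break;             /* sorted: no later entry can match */
            if (vpn < e->end)
                return e;
        }
        return NULL;               /* address not mapped at all */
    }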

	Avie

darrell@sdcsvax.UUCP (05/05/87)

> ... The memory management hardware in the Sun-2
> architecture has 8 contexts, but no TLB...

[We should note that BAD things happen on a SUN if there are > 8  contexts -DL]

Actually, an alternate view of the Sun MMUs is that they are very large
software-managed TLBs done in a slightly unusual way.  The first stage
of address translation (segment map) is the equivalent of the associative
lookup of a more orthodox TLB.  The second stage (page map) is identical
to normal TLB address translation.  The correspondence is actually quite
close, to the point where one can envision semi-portable memory-management
code that treats the Sun MMU as if it were a big TLB.  The large size of
the maps is a way around the relatively high overhead of doing TLB update
in software with processors not designed for it.  The 8-contexts business
minimizes TLB flushes, which are expensive with a TLB that big.
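
Seen that way, a Sun translation is just two indexed fetches: context plus
high address bits select a segment-map entry, which names a group of
page-map entries, and the scarcity of those groups is exactly what makes
the OS manage them like TLB entries.  A sketch with illustrative (not
exact Sun-2) sizes:

    #include <stdint.h>

    /* Illustrative sizes only; not the actual Sun-2 map dimensions. */
    #define NCONTEXTS   8
    #define NSEGS       512      /* segment-map entries per context */
    #define PGS_PER_SEG 16       /* page-map entries per segment */
    #define NPMEGS      256      /* scarce; OS shuffles these like TLB entries */

    static uint16_t segmap[NCONTEXTS][NSEGS];      /* stage 1: -> pmeg number */
    static uint32_t pagemap[NPMEGS][PGS_PER_SEG];  /* stage 2: the ptes proper */

    /* The segment map plays the role of the TLB's associative lookup;
       the page map is ordinary page translation. */
    uint32_t translate(uint32_t context, uint32_t seg, uint32_t page_in_seg)
    {
        uint16_t pmeg = segmap[context][seg];
        return pagemap[pmeg][page_in_seg];
    }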

On the more general topic of TLBs with no PTs, note also Cheriton's recent
MMUless design, in which a software-updated virtual-address cache eliminates
mapping hardware altogether.

				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry



-- 
Darrell Long
Department of Computer Science & Engineering, UC San Diego, La Jolla CA 92093
ARPA: Darrell@Beowulf.UCSD.EDU  UUCP: darrell@sdcsvax.uucp
Operating Systems submissions to: mod-os@sdcsvax.uucp

darrell@sdcsvax.UUCP (05/07/87)

>Actually, an alternate view of the Sun MMUs is that they are very large
>software-managed TLBs done in a slightly unusual way.  The first stage
>of address translation (segment map) is the equivalent of the associative
>lookup of a more orthodox TLB.  The second stage (page map) is identical
>to normal TLB address translation.  The correspondence is actually quite
>close, to the point where one can envision semi-portable memory-management
>code that treats the Sun MMU as if it were a big TLB.
	...
>				Henry Spencer @ U of Toronto Zoology
>				{allegra,ihnp4,decvax,pyramid}!utzoo!henry

The Mach virtual memory implementation has been ported to many widely
varying architectures.  It makes no assumptions about a hardware MMU and in
essence treats whatever MMU may exist as a TLB (think of it as a multi-level
TLB: when a real TLB miss occurs, the second-level TLB (traditional page table,
IPT, or whatever) is consulted; if a miss occurs there, the fault handler is
called).  Management of an MMU is performed by a single module via a
machine-independent interface.  To date, Mach has been ported to the VAX,
RT/PC, Sun-3, and NS32032 (Encore Multimax and Sequent Balance 21000, both
with the 32032 MMU counterpart).  Those are the machines I can talk about;
it's been ported to others as well.  To accommodate these ports, exactly 3
lines of machine-independent code were modified:

	o  The RT/PC compiler had a bug autoincrementing bit-fields; 2 lines
	   of code were rewritten not to autoincrement.  [I just love
	   debugging a virtual memory system using a buggy compiler :-)]

	o  On the Multimax, the console prints out diagnostic information
	   at a point in the boot sequence where the VM implementation was
	   also printing out some computed virtual memory parameters.  The
	   output ended up garbled, so we deleted the printf (1 line of
	   code).

Now, back to the main point... since Mach treats an MMU as a TLB it becomes
possible to start doing interesting things.  For example, I just wrote a
program that sparsely used 256 megabytes of VM and ran it on my MicroVAX
which has only 6 megabytes of physical memory.  [Try *that* on Unix].  The
reason this is no problem is that since the VAX page tables are just a cache
of virtual to physical mappings ... the module that manages the VAX page
tables simply throws away mappings at will.  In fact, the way we have
implemented it is to allocate a fixed pool of "page table memory" which is
managed in an LRU-like fashion (typically there is always enough available
so that we never need to reclaim these pages though).  When a page needs to
be mapped, and there is no corresponding "page table page" in the page table
we just allocate one from the page table memory pool.  Since the only
information to even go into a page table entry is a mapping (which can be
recomputed by the fault handler), we can throw away (reclaim) these
mappings at will.
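
A sketch of what "page tables as a cache" can look like in the
machine-dependent layer: a fixed pool of page-table pages, where the
allocator is always free to steal the least-recently-used one because
every mapping can be rebuilt by the fault handler.  The names are
invented, not the real Mach pmap interface:

    #include <stdint.h>

    #define POOL_PAGES 64              /* fixed "page table memory" pool */
    #define PTES_PER_PAGE 128

    struct pt_page {
        uint32_t owner_vseg;           /* virtual segment mapped; 0 = free
                                          (segments numbered from 1 here) */
        uint64_t last_used;            /* for LRU-like replacement */
        uint32_t ptes[PTES_PER_PAGE];  /* hardware-format entries */
    };

    static struct pt_page pool[POOL_PAGES];
    static uint64_t now;

    /* Get the page-table page for a virtual segment, stealing the least
       recently used one if necessary.  Stealing is safe: the discarded
       translations are only a cache, rebuilt by the fault handler. */
    struct pt_page *pt_page_for(uint32_t vseg)
    {
        struct pt_page *victim = &pool[0];
        for (int i = 0; i < POOL_PAGES; i++) {
            if (pool[i].owner_vseg == vseg) {
                pool[i].last_used = ++now;
                return &pool[i];       /* already have it */
            }
            if (pool[i].last_used < victim->last_used)
                victim = &pool[i];
        }
        victim->owner_vseg = vseg;     /* steal the LRU page */
        victim->last_used  = ++now;
        /* caller clears the old ptes and fills new ones as pages fault in */
        return victim;
    }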

	Avie

darrell@sdcsvax.UUCP (05/07/87)

In article <3100@sdcsvax.UCSD.EDU> avie@wb1.cs.cmu.edu (Avadis Tevanian) writes:

>
>Now, back to the main point... since Mach treats an MMU as a TLB it becomes
>possible to start doing interesting things.  For example, I just wrote a
>program that sparsely used 256 megabytes of VM and ran it on my MicroVAX
>which has only 6 megabytes of physical memory.  [Try *that* on Unix].  The
>reason this is no problem is that since the VAX page tables are just a cache
>of virtual to physical mappings ... the module that manages the VAX page
>tables simply throws away mappings at will.  In fact, the way we have
>implemented it is to allocate a fixed pool of "page table memory" which is

I don't follow this part.  Are you saying that pages within a segment can be
mapped sparsely?  Doesn't 4.x BSD allow this?  I confess my ignorance here,
but MANY virtual memory operating systems allow this...




  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov

"In order to promise genuine progress, the acronym RISC should stand 
for REGULAR (not reduced) instruction set computer." - Wirth

("Any opinions expressed herein are solely the responsibility of the
author and do not represent the opinions of NASA or the U.S. Government")

rrr@leo.UUCP ( Robert R Ramos) (05/09/87)

In article <3106@sdcsvax.UCSD.EDU>, lamaster@pioneer.arpa (Hugh LaMaster) writes:
> In article <3100@sdcsvax.UCSD.EDU> avie@wb1.cs.cmu.edu (Avadis Tevanian) writes:
> 
> > For example, I just wrote a
> >program that sparsely used 256 megabytes of VM and ran it on my MicroVAX
> >which has only 6 megabytes of physical memory.  [Try *that* on Unix].  The

> mapped sparsely?  Doesn't 4.x BSD allow this?  I confess my ignorance here,
> but MANY virtual memory operating systems allow this...
> 
Of course UNIX can handle large virtual address spaces; we regularly run
programs that are over 200MB, doing logical and circuit simulations.  It is
only the default virtual space limits in UNIX systems that are small.  In
general they are a configuration parameter, not dynamically allocated.
-- 
"I would rather suffer a cruel and horrible death" Young Sherlock Holmes
	Robert R. Ramos	(rrr)
{allegra!hplabs!felix, ihnp4!trwrb!felix, seismo!rlgvax}!ccicpg!leo!rrr

avie@wb1.cs.cmu.edu (Avadis Tevanian) (05/09/87)

In article <3106@sdcsvax.UCSD.EDU> lamaster@pioneer.arpa (Hugh LaMaster) writes:
>In article <3100@sdcsvax.UCSD.EDU> I write:
>
>>
>>Now, back to the main point... since Mach treats an MMU as a TLB it becomes
>>possible to start doing interesting things.  For example, I just wrote a
>>program that sparsely used 256 megabytes of VM and ran it on my MicroVAX
>>which has only 6 megabytes of physical memory.  [Try *that* on Unix].  The
>>reason this is no problem is that since the VAX page tables are just a cache
>>of virtual to physical mappings ... the module that manages the VAX page
>>tables simply throws away mappings at will.  In fact, the way we have
>>implemented it is to allocate a fixed pool of "page table memory" which is
>
>I don't follow this part.  Are you saying that pages within a segment can be
>mapped sparsely?  Doesn't 4.x BSD allow this?  I confess my ignorance here,
>but MANY virtual memory operating systems allow this...
>

I don't think I completely understand this question.  4.x BSD stores
important information in VAX page table entries.  For example, a page that
is to be zero-filled when touched is marked as such in its pte.  What this
means is that if I wanted to allocate 256 meg of VM,
a stock BSD system would grab 2 meg of page tables and place the zero fill
indication in each of the 1/2 million ptes.  [ Actually, it wouldn't have
even got this far unless there was 256 meg of swap space available
somewhere.]  In Mach, if I allocate 256 meg of VM, the kernel makes a simple
notation of this - no page tables are created/destroyed/modified.  When I
actually start touching memory, the machine independent part of the kernel
zero fills pages and requests the machine dependent part to validate the
appropriate addresses.  On the VAX, we dynamically allocate virtual address
space out of the kernel's address space for the user page tables.  Then we
back the user page tables with physical memory out of the page table memory
pool.  If I start mapping so many pages that the page table memory pool
becomes depleted, we just remove pages in use in an LRU fashion.  Now,
consider that on a 6 meg system, if there is no sharing of memory, you can
have at most 6 meg of resident memory, hence you need to be able to map 6
meg of memory.  On a Vax, each page table byte maps 128 bytes (4 bytes per
512 byte page).  So, to map all of memory takes a little less than 1% of
physical memory.  We use 2% in the page table memory pool to allow for
sharing and rarely need to forcibly reclaim that memory.
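
Restating the arithmetic: a VAX pte is 4 bytes and maps a 512-byte page,
so page tables cost 1/128 of whatever they map.  A quick check of the
figures quoted above:

    #include <stdio.h>

    int main(void)
    {
        long pte_bytes  = 4;                  /* one VAX page table entry */
        long page_bytes = 512;                /* one VAX page             */
        long phys       = 6L * 1024 * 1024;   /* 6 meg MicroVAX           */
        long alloc      = 256L * 1024 * 1024; /* the sparse allocation    */

        /* Stock BSD: a pte for every page of the 256 meg region. */
        printf("BSD page tables for 256 meg: %ld KB\n",
               alloc / page_bytes * pte_bytes / 1024);      /* 2048 KB */

        /* Mach: enough ptes to map what is actually resident. */
        printf("ptes to map all of physical memory: %ld KB (%.2f%% of it)\n",
               phys / page_bytes * pte_bytes / 1024,        /* 48 KB   */
               100.0 * pte_bytes / page_bytes);             /* 0.78%   */
        return 0;
    }

That gives 2048K of ptes (half a million entries) for the stock-BSD 256-meg
case, and about 48K, a little under 1% of memory, to map everything
resident on a 6-meg machine, matching the numbers in the text.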

Do you have some specific examples of virtual memory operating systems that
do this (especially examples that are not closely tied to a particular MMU)?
Note that in my example, I am allocating (virtually) every page in the 256
meg range and can touch any page in that range.  I do not depend on segment
registers or any other MMU-specific feature to handle the sparseness.

More food for thought:  the Common Lisp hackers around here like to use the
high address bits as tags.  Their RT version of Common Lisp (running under
Mach) allocates 2.38 GIGABYTES.  Runs fine so long as they have enough
physical memory for the normal working set (6-10 meg) and enough free disk
space to hold the infrequently used (paged out) Lisp things (a few more meg).

	Avie