march@m.cs.uiuc.edu (06/28/90)
While reading Hennessy and Patterson (p. 437), they mention that page
table entries are often paged themselves (with the operative word being
"often"). To me, paging your means of address translation makes no
sense. As they go on to point out, the cost of this is appreciable,
because one has to swap the PTEs back in and then do the translation.
Given this, just how "often" is this used? I suppose that, given a
translation-lookaside buffer, the huge cost of swapping page table
entries back in could be offset by some translations being very fast
(via the TLB). Anyone ... ?

-Steve
===============================================================================
Steve March    (H) (217)328-5176/328-5230   (W) 333-7408
Domain: march@cs.uiuc.edu   Path: {uunet|convex|pur-ee}!uiucdcs!march
"Time and space are modes by which we think and not conditions in which
we live."  - Albert Einstein
amos@taux01.nsc.com (Amos Shapir) (06/28/90)
In article <3300142@m.cs.uiuc.edu> march@m.cs.uiuc.edu writes:
>While reading Hennessy and Patterson (p. 437), they mention that page
>table entries are often paged themselves (with the operative word
>being "often"). To me, paging your means of address translation makes
>no sense. As they go on to point out, the cost of this is appreciable,
>because one has to swap the PTEs back in and then do the translation.
>Given this, just how "often" is this used?

A simple calculation: 32-bit addresses cover 4GB; when using pages of
4KB (which are rather large as paging systems go) you have to keep 1M
PTEs, which occupy 4MB. The need to page these is of course
proportional to the amount of virtual memory used vs. the amount of
physical memory that can be tied down in PTEs. So "often" means: as
often as more than a few MB of virtual space are used.
--
Amos Shapir    amos@taux01.nsc.com, amos@nsc.nsc.com
National Semiconductor (Israel)  P.O.B. 3007, Herzlia 46104, Israel
Tel. +972 52 522408   TWX: 33691, fax: +972-52-558322
GEO: 34 48 E / 32 10 N
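[Editor's note: Amos's arithmetic can be checked directly. A quick back-of-the-envelope sketch in Python, using the figures from the post; the 4-byte PTE size is an assumption (the post only gives the 4 MB total, which implies it):

```python
# Flat-page-table overhead for a full 32-bit address space.
# One PTE per virtual page; 4-byte PTEs are assumed.
ADDR_BITS = 32
PAGE_SIZE = 4 * 1024           # 4 KB pages
PTE_SIZE = 4                   # bytes per page table entry (assumed)

num_ptes = (1 << ADDR_BITS) // PAGE_SIZE   # pages in the address space
table_bytes = num_ptes * PTE_SIZE          # size of the flat table

print(num_ptes)      # 1048576  -> the "1M PTEs"
print(table_bytes)   # 4194304  -> the "4 MB"
```

So a flat table already costs 4 MB per process, which is why the thread keeps returning to multilevel and inverted schemes.]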
henry@zoo.toronto.edu (Henry Spencer) (06/29/90)
In article <3300142@m.cs.uiuc.edu> march@m.cs.uiuc.edu writes:
>While reading Hennessy and Patterson (p. 437), they mention that page
>table entries are often paged themselves (with the operative word
>being "often"). To me, paging your means of address translation makes
>no sense...

Unfortunately, it makes a great deal of sense, because the PTEs for a
very large process can chew up a lot of memory all by themselves. All
it means is that you need a smaller set of PTEs to describe where the
main PTEs are, layering one address-translation mechanism on top of
another. This is done pretty routinely.
--
"Either NFS must be scrapped or NFS  | Henry Spencer at U of Toronto Zoology
must be changed." -John Ousterhout   | henry@zoo.toronto.edu   utzoo!henry
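[Editor's note: the layering Henry describes can be sketched in a few lines. This is a toy model, not any particular machine's format; the field widths (4 KB pages, 1024 PTEs per table page) are assumed for illustration:

```python
# Two-level translation: a small resident top-level table locates
# pages full of PTEs, which may themselves be paged out (None here).
PAGE_SHIFT = 12                 # 4 KB pages (assumed)
PTES_PER_PAGE = 1024            # entries per page of PTEs (assumed)

def translate(vaddr, top_level, pte_pages):
    vpn = vaddr >> PAGE_SHIFT
    top_idx, low_idx = divmod(vpn, PTES_PER_PAGE)
    pte_page = top_level[top_idx]       # which page of PTEs to consult
    if pte_page is None:
        raise LookupError("PTE page not resident: fault, swap it in")
    frame = pte_pages[pte_page][low_idx]
    return (frame << PAGE_SHIFT) | (vaddr & ((1 << PAGE_SHIFT) - 1))

# toy mapping: virtual page 5 -> physical frame 9
top = [0] + [None] * 1023       # only one PTE page is resident
ptes = {0: {5: 9}}
print(hex(translate(0x5123, top, ptes)))   # 0x9123
```

Only `top` must stay resident; the 1023 `None` slots stand for PTE pages that were never allocated or were paged out.]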
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (06/29/90)
In article <3300142@m.cs.uiuc.edu> march@m.cs.uiuc.edu writes:
>While reading Hennessy and Patterson (p. 437), they mention that page
>table entries are often paged themselves (with the operative word
>being "often"). To me, paging your means of address translation makes
>no sense. As they go on to point out, the cost of this is appreciable,
>because one has to swap the PTEs back in and then do the translation.
>Given this, just how "often" is this used? I suppose that, given a
>translation-lookaside buffer, the huge cost of swapping page table
>entries back in could be offset by some translations being very fast
>(via the TLB).

This cost is incurred only if the OS software decides to make the page
tables swappable. The cost of a swap may not be known at the
chip-designer level, because a general-purpose design could be suitable
for quite varying amounts of memory, varying speeds of swap devices,
and of course varying demands upon that memory. The OS designer
balances memory demands and swap speed against swap frequency. Note
that a 16-byte table entry per 512-byte page is only a 3% overhead.
Most designs run leaner.

The chip designer does control the cost of a simple TLB refill. For
example, the 88000 and 68020 (well, 68851) have table-walk hardware,
whereas the R3000 designers punted TLB refill to an interrupt handler
and some special instructions.

The chip designer also controls the replacement algorithm of the TLB.
(That is, when an item is loaded into a TLB, some slot has to be
selected and overwritten.) The 68020 uses a pseudo-LRU algorithm for
this. The 88000 uses FIFO - entries, valid or not, are just shifted off
the end into the bit bucket. The R3000 hardware selects victims
pseudo-randomly. (Very pseudo.) Are these equivalent? Not hardly! It's
always been assumed that it just doesn't matter.

Discussions on this topic usually move on to the issue of Process ID
tags, which are variously a few bits (MIPS) or one bit (88000). (Does
the i860 have PIDs?) From there one gets into task-switch issues, which
is to say, flushing.
--
Don		D.C.Lindsay
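[Editor's note: the three replacement policies differ only in how the victim slot is chosen. A toy model of the 88000-style FIFO policy Don describes, with a hypothetical slot count:

```python
# FIFO TLB: on refill, the oldest entry is shifted off the end,
# valid or not -- no usage information is consulted.
from collections import OrderedDict

class FifoTLB:
    def __init__(self, slots=8):
        self.slots = slots
        self.map = OrderedDict()       # vpn -> pfn; insertion order = age

    def lookup(self, vpn):
        return self.map.get(vpn)       # None means TLB miss

    def refill(self, vpn, pfn):
        if len(self.map) >= self.slots:
            self.map.popitem(last=False)   # evict the oldest entry
        self.map[vpn] = pfn

tlb = FifoTLB(slots=2)
tlb.refill(1, 10); tlb.refill(2, 20)
tlb.refill(3, 30)                      # evicts vpn 1, the oldest
print(tlb.lookup(1), tlb.lookup(3))    # None 30
```

Pseudo-LRU would pick the (approximately) least recently used slot instead, and the R3000 picks pseudo-randomly; as the post notes, in practice the hit-rate difference is assumed to be negligible.]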
JONESD@kcgl1.eng.ohio-state.edu (David Jones) (06/29/90)
In article <3300142@m.cs.uiuc.edu>, march@m.cs.uiuc.edu (Steve March) writes:
> While reading Hennessy and Patterson (p. 437), they mention that page
> table entries are often paged themselves (with the operative word
> being "often"). To me, paging your means of address translation makes
> no sense. As they go on to point out, the cost of this is appreciable,
> because one has to swap the PTEs back in and then do the translation.
> Given this, just how "often" is this used? I suppose that, given a
> translation-lookaside buffer, the huge cost of swapping page table
> entries back in could be offset by some translations being very fast
> (via the TLB).

The VAX architecture uses pageable page tables for user code. A TLB,
even a small one, does a very effective job of reducing the overhead of
paged page tables. I once intentionally wrote a pathological program
that thrashed the TLB on a MicroVAX and it ran 3 times slower than a
non-thrashing version. Note that to thrash the TLB entries for the
page-table pages, you have to separate your memory references by enough
to go to another page in the table (64KB in the case of the VAX). Most
normal programs would thrash the program-page TLB entries more than an
order of magnitude more than the page-table-page TLB entries, so the
cost over that of a non-paged page table is small.

David L. Jones              | Phone:    (614) 292-6929
Ohio State University       | Internet:
1971 Neil Ave. Rm. 406      | jonesd@kcgl1.eng.ohio-state.edu
Columbus, OH 43210          | jones-d@eng.ohio-state.edu

Disclaimer: A repudiation of a claim.
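[Editor's note: the 64KB figure follows from the VAX's 512-byte pages, assuming 4-byte PTEs. One page of page table holds 128 PTEs, so it maps a 64KB span of virtual memory; references strided wider than that touch a fresh page-table page every time:

```python
# How far apart must references be to hit a new page-table page?
PAGE_SIZE = 512                          # VAX page size
PTE_SIZE = 4                             # bytes per PTE (assumed)

ptes_per_pt_page = PAGE_SIZE // PTE_SIZE          # 128 PTEs per PT page
stride = ptes_per_pt_page * PAGE_SIZE             # virtual span they map

print(stride)   # 65536 -> stride by 64 KB to thrash the PT-page TLB entries
```
]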
littauer@uts.amdahl.com (Tom Littauer) (06/29/90)
In article <1990Jun28.182303.8352@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes:
>In article <3300142@m.cs.uiuc.edu> march@m.cs.uiuc.edu writes:
>>While reading Hennessy and Patterson (p. 437), they mention that page
>>table entries are often paged themselves (with the operative word
>>being "often"). To me, paging your means of address translation makes
>>no sense...
>
>Unfortunately, it makes a great deal of sense, because the PTEs for a
>very large process can chew up a lot of memory all by themselves. All
>it means is that you need a smaller set of PTEs to describe where the
>main PTEs are, layering one address-translation mechanism on top of
>another. This is done pretty routinely.

Correct. A couple of examples where this is desirable and efficient:

Large but sparse arrays. The programmer needn't worry about special
hacks, but can get on with solving the original problem.

Extremely large arrays combined with programs with good locality of
reference.
--
UUCP:  littauer@amdahl.amdahl.com
  or:  {sun,decwrl,hplabs,pyramid,ames,uunet}!amdahl!littauer
DDD:   (408) 737-5056
USPS:  Amdahl Corp. M/S 278, 1250 E. Arques Av, Sunnyvale, CA 94086

I'll tell you when I'm giving you the party line. The rest of the time
it's my very own ravings (accept no substitutes).
bob@tera.com (Bob Alverson) (06/29/90)
In article <3300142@m.cs.uiuc.edu> march@m.cs.uiuc.edu writes:
>While reading Hennessy and Patterson (p. 437), they mention that page
>table entries are often paged themselves (with the operative word
>being "often"). To me, paging your means of address translation makes
>no sense. As they go on to point out, the cost of this is appreciable,
>because one has to swap the PTEs back in and then do the translation.
>Given this, just how "often" is this used? I suppose that, given a
>translation-lookaside buffer, the huge cost of swapping page table
>entries back in could be offset by some translations being very fast
>(via the TLB).

The main justification I see for paging page tables is to keep their
overhead a function of physical memory size, not virtual memory size.
Consider a UNIX process with heap at low addresses going up and stack
at high addresses going down. Without a paged page table, all the space
in between must have page table entries sitting in physical memory.
That doesn't seem like a smart way to use memory.

To keep things sane, you probably want to lock in the page containing
the page table entries for the virtual address space containing the
page table. With a 4MB address space and 512-byte pages, you need 8K
entries taking (say) 32KB. You need 64 page table entries to cover the
page table (sounds recursive), which fits in one page.

Bob
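[Editor's note: Bob's recursive arithmetic checks out; a sketch with his numbers, taking the "(say) 32KB" to imply 4-byte PTEs:

```python
# Page table that maps itself: 4 MB address space, 512-byte pages.
ADDR_SPACE = 4 * 1024 * 1024
PAGE = 512
PTE = 4                               # bytes per entry (implied by 32KB)

entries = ADDR_SPACE // PAGE          # PTEs needed for the whole space
table_bytes = entries * PTE           # size of the page table itself
ptes_for_table = table_bytes // PAGE  # PTEs that map the page table

print(entries, table_bytes, ptes_for_table)   # 8192 32768 64
```

The 64 entries mapping the table occupy 256 bytes, which indeed fits in a single 512-byte page that can be wired down.]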
martins@ivory.SanDiego.NCR.COM (Martin Sandman) (06/29/90)
>... page table entries are often paged themselves (with the operative
>word being "often"). To me, paging your means of address translation
>makes no sense.

As described in H&P, I don't understand how this could work either.
However, I am familiar with a combination segment/page scheme where it
works very well.

The eight high-order bits of the address select one of 256 page tables.
Each page table can have up to 128 entries. The segment table entry
specifies the number of valid entries in each page table and whether
the page table is in memory or not. If the page table is not in real
memory, a special fault is taken when the segment table entry is
referenced. (A segment table _is_ nailed to real memory.) When this
special segment table fault occurs, the paging supervisor builds a page
table "on the fly" in real memory and restarts the instruction. This
time it results in an "ordinary" page fault, because the page table is
in memory but the page itself is not.

This technique clearly saves lots of real memory that could otherwise
be lost to seldom-used page tables. On the other hand, it _seems_ to
require an additional memory access to resolve a virtual-to-real
translation. It does, but not very often: 99+% of all v-to-r
translations happen in the translation lookaside buffer, so the
two-stage memory access does not really happen often. Since the page
tables are built on the fly (really just allocated), they don't have to
be swapped in.
______________________________________________________________________
= UUCP: sdcsvax!ncr-sd!fiddler!martins    Martin D. Sandman         =
= NCR: Martin.Sandman@SanDiego.NCR.COM    NCR Corporation           =
= Phone: (619) 485-3554                   16550 West Bernardo Dr.   =
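[Editor's note: the post gives 8 segment bits and up to 128 page-table entries per segment; for a 32-bit address that leaves a 7-bit page number and a 17-bit offset. That exact split is inferred here, not stated in the post:

```python
# Hypothetical field split for the segment/page scheme described above:
# 8-bit segment | 7-bit page number | 17-bit byte offset (inferred).
def split(addr):
    seg = (addr >> 24) & 0xFF       # selects one of 256 page tables
    page = (addr >> 17) & 0x7F      # one of up to 128 entries
    off = addr & 0x1FFFF            # byte within the page
    return seg, page, off

print(split(0x01020000))   # (1, 1, 0)
```
]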
dwc@cbnewsh.att.com (Mordecai the Fowl) (06/30/90)
In article <4137@taux01.nsc.com>, amos@taux01.nsc.com (Amos Shapir) writes:
> In article <3300142@m.cs.uiuc.edu> march@m.cs.uiuc.edu writes:
>
> A simple calculation: 32-bit addresses cover 4GB; when using pages
> of 4KB (which are rather large as paging systems go) you have
> to keep 1M PTE's, which occupy 4MB. The need to page these is
> of course proportional to the amount of virtual memory used vs. the
> amount of physical memory that can be tied down in PTE's.

note that with a combination of segmentation and paging, you don't need
a fully allocated set of page tables for your entire address space. the
page tables themselves can be sparsely allocated.

it also seems that the only information you need in the page table upon
process swapout is whether a particular page was part of the swapout,
since the page tables can be rebuilt on swapin. you probably want to
keep other things such as modified bits, but don't need to keep the
reference bit unless you are a perfectionist. what this means is that
since you don't need to keep the mapped frame number (which is no
longer a valid page frame number anyway), the page table representation
can be much more compact when a process is swapped. whether you still
want to swap this compacted representation of the page table (or
whether you actually want to compact the page table instead of just
swapping it) is another design issue.

also, i believe that as long as you keep a record of the size of the
swapped image, and in the swapped image you keep information to
distinguish the page tables from the actual pages, you can still
rebuild all the information in a single (logical) I/O. so swapping the
page tables out doesn't cost you much incrementally.

all this assumes we are dealing with private pages. don't ask me how to
deal with shared pages in a general way.

danny chen
att!hocus!dwc
aglew@bach.crhc.uiuc.edu (Andy Glew) (06/30/90)
>The chip designer does control the cost of a simple TLB refill. For
>example, the 88000 and 68020 (well, 68851) have table-walk hardware,
>whereas the R3000 designers punted TLB refill to an interrupt handler
>and some special instructions.

I hope that "punted" isn't derogatory. First off, the hardware
complexity/performance tradeoffs involved in the MIPS software TLB miss
handling decision were described in a paper somewhere, and seemed
reasonable. It could go either way.

Second, though, from the point of view of someone who aspires to
architect HW/OS/SW systems (note that OS is somewhere in the middle :-) ),
SW TLB handling is wonderful for experimentation at least. For example,
you can do a lot of Mach-style remapping to avoid copying, by letting
the user modify some parts of the translation tables (which therefore
get spread over several pages) and making the TLB miss handler enforce
security. E.g. let the user specify that a PTE points to another
virtual address (not physical - that leads to security problems). And
so on. Of course, you can always do this via system calls, but by
careful design you can reduce the number of TLB flushes required (which
might be the only thing you want a syscall for) by 1/2. More, if you
are going to use this for things like LISP-style garbage collection.

Similarly, many applications (from security-minded Europeans, in my
experience) wish to restrict the ability to access certain areas of
memory. E.g. you have a database that you want to be readable/writable
only from the database code. You can always use a system call to change
the perms, but, as several customers I was called in to help found out,
that's slow. By putting some of the perms in a user-modifiable part of
memory, and changing the fault handler, you can always *relax* the
perms (go from non-accessible to accessible) without a system call.
Going back requires a TLB flush, which you might have to use a syscall
for - but still, 1 syscall as opposed to 2 is pretty good.

Of course, if the TLB miss handling is comparable to a syscall in
slowness you haven't gained much - but isn't MIPS' point that the SW
TLB miss handler is *not* as slow as a syscall?

NB. I do not propose that the TLB miss handler be purely user code.
Just that it interact with some (untrusted) user-accessible data
structures.

In addition to user interaction with the virtual address mapping, SW
TLB handling also gives you the flexibility to use new data structures
for the virtual-to-physical mapping. For example, with potentially
*very* large address spaces impending, extent-based mapping structures
may be necessary to save space.
--
Andy Glew, aglew@uiuc.edu
boebert@sctc.com (Earl Boebert) (06/30/90)
Check Organick's book on Multics.
ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (06/30/90)
There's been a lot of talk about paging page tables. The basic problem
is that unless you do something to prevent it, the amount of *physical*
memory taken up by page tables is proportional to the amount of
*virtual* memory allocated. One really beautiful idea is to stand the
whole thing on its head and have your page tables map from *physical*
pages to *virtual*, so that the space taken up by page tables is a
fixed fraction of the size of *physical* memory. This is the technique
used on the IBM RT PC.

(1) Would someone who really understands the technique care to post
    a short description?
(2) When I read the RT PC bumf, I was _sure_ that I had seen the
    technique described before, but I couldn't and still can't remember
    where. Does anyone know where the "inverted page tables" idea was
    first published/used?
(3) IBM hold the patent. Is any other manufacturer known to have been
    granted a licence to use the technique?
(4) Does the RS/6000 use inverted page tables or paged page tables?
--
"private morality" is an oxymoron, like "peaceful war".
jfc@athena.mit.edu (John F Carr) (06/30/90)
In article <3341@goanna.cs.rmit.oz.au> ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:
>One really beautiful idea is to stand the whole thing on its
>head and have your page tables map from *physical* pages to
>*virtual*, so that the space taken up by page tables is a
>fixed fraction of the size of *physical* memory. This is the
>technique used on the IBM RT PC.
>(1) Would someone who really understands the technique care to post
>    a short description?

A 32-bit address on the IBM RT is treated by hardware as a 4-bit
segment number / 17-bit virtual page number / 11-bit byte address. When
an address is presented to the memory management unit, the segment
number is extracted. Unless this segment is marked "not present" (for
example, segment 15 on the RT is dedicated to I/O devices; there is a
bit to tell the MMU to let another system handle references to this
segment), it is replaced in the virtual address with a 12-bit segment
ID to create a 40-bit virtual address.

The MMU has a 32-entry TLB; when a virtual address is not found in the
TLB, the hardware needs to consult the page tables. The page table is
called the "HAT/IPT" -- "Hash Anchor Table/Inverted Page Table". This
refers to the two ways of indexing it to perform virtual-to-physical
and physical-to-virtual address translation.

The MMU servicing a TLB miss starts with a HAT lookup:

1. Compute an index into the HAT. This is the exclusive OR of the low
   bits of the virtual page number and the segment number (the number
   of bits depends on memory size).
2. Check the "empty" bit; if it is set, then page fault.
3. Extract from the table entry a 13-bit index.

Then it continues with a search of a chain of IPT entries:

4. Read the table entry corresponding to the index.
   a. If the virtual address matches, load this entry into the TLB;
      end of processing.
   b. If the "last" bit is not set, extract a 13-bit index to the next
      table entry in the chain; repeat 4.
   c. If the "last" bit is set, then this is the end of the chain and
      no match has been found, so page fault.

This takes 5 cycles + 3 cycles per IPT entry searched (cycle time is
80ns to 170ns depending on CPU model). Ideally, search length should be
very close to 1 (this is a hash table with as many buckets as objects).
I ran some tests a few days ago and found that in reality the average
is between 2 and 3; due to the software implementation most segment
numbers and virtual page numbers are small, so the high bits of the
hash index are predictable.

Searching for the virtual page mapped to a physical page is constant
time -- just index into the table by the physical page number in
question.

Each table entry is 16 bytes. When using 2K pages, the overhead is
16/2048 = 1/128 of memory dedicated to page tables.

>(4) Does the RS/6000 use inverted page tables or paged page tables?

The RS/6000 memory management unit is a slightly newer version of the
RT MMU. It has longer segment IDs to give more virtual address bits
(56, I think).
--
--John Carr (jfc@athena.mit.edu)
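[Editor's note: the HAT/IPT walk John describes can be sketched as a plain chained hash table whose index *is* the physical frame number. The sizes, field names, and hash below are illustrative stand-ins, not the RT's real formats:

```python
# Toy HAT/IPT lookup: hash (segment, vpn), follow the anchor into a
# chain of inverted-page-table entries until one matches; the matching
# entry's index is the physical frame number.
def ipt_lookup(seg, vpn, hat, ipt):
    idx = hat.get((vpn ^ seg) % len(ipt))   # steps 1-3: hash, read anchor
    while idx is not None:                  # step 4: walk the chain
        entry = ipt[idx]
        if entry["seg"] == seg and entry["vpn"] == vpn:
            return idx                      # 4a: hit -> frame number
        idx = entry["next"]                 # 4b: follow chain link
    raise LookupError("page fault")         # 4c / empty anchor

# one IPT entry per physical frame; frame 0 maps segment 2, page 7
ipt = [
    {"seg": 2, "vpn": 7, "next": None},
    {"seg": 0, "vpn": 0, "next": None},
    {"seg": 0, "vpn": 1, "next": None},
    {"seg": 0, "vpn": 2, "next": None},
]
hat = {(7 ^ 2) % len(ipt): 0}               # anchor for (seg=2, vpn=7)
print(ipt_lookup(2, 7, hat, ipt))           # 0
```

Physical-to-virtual lookup is the trivial direction: `ipt[frame]` directly, which is the constant-time case John mentions.]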
mash@mips.COM (John Mashey) (07/01/90)
In article <AGLEW.90Jun29204550@bach.crhc.uiuc.edu> aglew@bach.crhc.uiuc.edu (Andy Glew) writes:
>>The chip designer does control the cost of a simple TLB refill. For
>>example, the 88000 and 68020 (well, 68851) have table-walk hardware,
>>whereas the R3000 designers punted TLB refill to an interrupt handler
>>and some special instructions.

0) There is a separate exception vector for TLB miss.
1) TLB refill routines are typically 10-15 instructions, depending on
   which TLB refill routine you use (they're sometimes different for
   different OSs, to match different ideas of PTE arrangements).
2) The usual refill uses the "TLB write random" instruction, which
   writes a TLB entry into a pseudo-random location in the top 56
   entries of the 64-entry TLB. This instruction does get used
   elsewhere; the rest of the instructions in the sequences are nothing
   special, just a sequence of user-level instructions plus moves
   to/from coprocessor 0 (the system coprocessor).
3) Both HP PA and AMD 29K use software-refilled TLBs, although the
   details are rather different.

>First off, the hardware complexity/performance tradeoffs involved in
>the MIPS software TLB miss handling decision were described in a paper
>somewhere, and seemed reasonable. It could go either way.

Proc. 1986 IEEE CompCon, San Francisco, March 1986. Paper by DeMoney,
Moore, and Mashey.

> Of course if the TLB miss handling is comparable to a syscall in
>slowness you haven't gained much - but isn't MIPS' point that the SW
>TLB miss handler is *not* as slow as a syscall?

Yes, not as slow, which is why it has its own exception vector.

> NB. I do not propose that the TLB miss handler be purely user
>code. Just that it interact with some (untrusted) user accessible
>data structures.

This is quite feasible, although I don't know anyone who's done it. As
usual, whether you do this or not depends on:
a) The problem domain you wish to cover with the chip.
b) The cost/complexity of TLB-miss handling as it integrates with the
   pipeline. This is especially "interesting" as pipelines get more
   complex.
c) The cost of the silicon space for the MMU and its control.
d) The cycle-count cost of doing software refill versus hardware.
e) How much cycle-count degradation actually results from TLB
   processing, versus cycle-time degradation (if any) from various
   approaches.

Anyone who does serious analysis of all this finds that TLB processing
is on the order of 1%, except for big array processing, or certain
other cases that have awful locality, in which case TLBs (either
hardware or software) suffer. Note that it is very easy to get
surprised in all this. For example, some chips cannot do PTE
table-walking inside their caches, especially in multiprocessing
environments, and it is quite possible that, regardless of hardware or
software refill, the time may well be dominated by the number of
uncached accesses to main memory.

In the usual marketing wars, sometimes people claim "hardware TLB
refill is faster than software refill." This may be true, or it may
not be, in SPECIFIC comparisons. However, anyone making the claim IN
GENERAL is almost certainly:
a) A marketeer extolling the virtues of a product that has hardware
   refill, OR
b) Someone unlikely to be able to answer the following questions:
   "It's faster?" Compared to which chips? How much faster is it? What
   percentage of total CPU time is that?
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
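[Editor's note: the refill path John outlines (index the PTE array, then "TLB write random" into the top 56 of 64 slots) can be sketched schematically. This is a model of the control flow only, not the 10-15 MIPS instructions themselves; the page-table layout is a stand-in:

```python
# Schematic software TLB refill: one PTE load, then a write into a
# pseudo-random slot among the non-wired entries ("TLB write random").
import random

TLB_SLOTS = 64
WIRED = 8            # bottom 8 entries are never chosen as victims

def tlb_miss(vpn, page_table, tlb):
    pte = page_table[vpn]                        # the PTE fetch
    slot = random.randrange(WIRED, TLB_SLOTS)    # random slot in top 56
    tlb[slot] = (vpn, pte)                       # write the new mapping

tlb = [None] * TLB_SLOTS
tlb_miss(5, {5: 0x1234}, tlb)
print(any(e == (5, 0x1234) for e in tlb[WIRED:]))   # True
```

The wired slots model why the random write targets only the "top 56 entries": the OS keeps a few critical mappings pinned below them.]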
pcg@cs.aber.ac.uk (Piercarlo Grandi) (07/01/90)
In article <9758@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:

[ ... on paging page tables ... ]

   The OS designer balances memory demands and swap speed against swap
   frequency. Note that a 16 byte table entry per 512 byte page is
   only a 3% overhead. Most designs run leaner.

Actually this 3% overhead can be mostly obviated, as can the need for
paging page tables. The clear win is to have reverse page map tables
for the paging itself, whose size is only proportional to the size of
real memory, and segment map tables for the description of valid
stretches of address space. In this way you have the best of all
worlds, because you can have a very large address space with only
locally dense data space mappings, and a very low overhead.

I am quite sure that Don Lindsay could expound on the advantages of
this approach from direct observation, as the Mach portable virtual
memory code uses exactly this approach (well, the Mach people actually
bitch about the reverse-map MMU in the IBM RISCs, because they abuse
copy-on-write, and reverse map makes it difficult, but this is just
because they use the wrong virtual memory model, i.e. copy-on-write).

   The chip designer does control the cost of a simple TLB refill. For
   example, the 88000 and 68020 (well, 68851) have table-walk hardware,
   whereas the R3000 designers punted TLB refill to an interrupt
   handler and some special instructions.

Refilling the TLB in software is arguably a win -- most obviously
because it allows collection of a lot of interesting statistics, and
this can help the page replacement policy. It also gets the hardware
designer out of some very sticky points. I think that the very best
references on the subject are Ibbett and Morris, "The MU5 Computer
System", Macmillan, and some MU6 papers and reports by Knowles, Woods,
Edwards, etc. from Manchester University. They are *the* virtual memory
people, incidentally.

   Discussions on this topic usually move on to the issue of Process
   ID tags, which are variously a few bits (MIPS) or one bit (88000).
   (Does the i860 have PIDs?) From there one gets into task switch
   issues, which is to say, flushing.

:-). IMNHO, TLBs without address-space (*not* process, please) id tags
are ridiculous. Some IBM guy measured the hit from cold-restarting
translation of address space maps because of a TLB flush, and it is,
ehm, arresting. I find it incredible that system architects in this day
and age still design MMUs without address-space tags in the TLBs.

But then somebody *did* design the SUN-3 MMU, and the address-space tag
issue looks minuscule compared to what has been perpetrated there.
:-> :-/. Hey, after all, you win the benchmark wars on MIPS (the
four-letter word) ratings and other CPU-oriented, or at best
single-user-oriented, ratings.
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
pcg@cs.aber.ac.uk (Piercarlo Grandi) (07/03/90)
In article <3341@goanna.cs.rmit.oz.au> ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:

   One really beautiful idea is to stand the whole thing on its head
   and have your page tables map from *physical* pages to *virtual*,
   so that the space taken up by page tables is a fixed fraction of
   the size of *physical* memory. This is the technique used on the
   IBM RT PC.

And on virtually all the Manchester University machines for the last
few dozen years (the major exception being the MU5).

   (1) Would someone who really understands the technique care to post
       a short description?

The entry for each physical page contains the segment and page number
of the virtual page mapped to it. When a translation is needed, all
physical pages are looked up to see which one contains the given
virtual page. The lookup can be sped up in various ways; firstly by
caching a few translations, secondly by using a hash index or parallel
search hardware. It is fairly obvious from this description that shared
memory is not easy with a reverse-map MMU, but it can be done.

   (2) When I read the RT PC bumf, I was _sure_ that I had seen the
       technique described before, but I couldn't and still can't
       remember where. Does anyone know where the "inverted page
       tables" idea was first published/used?

The very first virtual memory machine, the Manchester Atlas (MU4), used
inverted page tables. A very good description of their latest scheme,
the MU6 MMU, is contained in a thesis available on request from them.

   (3) IBM hold the patent. Is any other manufacturer known to have
       been granted a licence to use the technique?

IBM has invented virtual memory, IBM has patented reverse-map MMUs,
what is good for IBM is good for IBM :-(. This subject has already been
discussed; I have had in a drawer a paper on a very simple reverse-map
MMU design (two ROMs and one RAM) for many years now, and when I get
the nerve to publish it... Well, after all, IBM never sued Manchester
University for having dared copy their inventions and patents a few
years early :-). In another thread the abuses of the patent system are
discussed. Well, every time I think of IBM and virtual memory...
--
Piercarlo "Peter" Grandi           | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
lewine@dg.dg.com (Don Lewine) (07/03/90)
In article <3300142@m.cs.uiuc.edu> march@m.cs.uiuc.edu writes:
>While reading Hennessy and Patterson (p. 437), they mention that page
>table entries are often paged themselves (with the operative word
>being "often"). To me, paging your means of address translation makes
>no sense. As they go on to point out, the cost of this is appreciable,
>because one has to swap the PTEs back in and then do the translation.
>Given this, just how "often" is this used?

First, read ahead through page 445 to understand VAX memory management.
Pay close attention to Figure 8.27. The key concept is that user (P0 &
P1) page tables are kept in system space. System space is paged.

When I was working on the MicroVAX architecture in 1982-83, it seemed
that we could save some gates by making all page tables physical. This
was done on other computer architectures. We did some experiments to
determine how often page tables were paged. I have forgotten the exact
numbers, but the answer was "a great deal." In fact, the amount of
page-table paging was high enough to convince us that we needed to
implement full VAX memory management.

We assumed that VMS could be made smarter and do less paging of page
tables; however, even at that, the cost of keeping all page tables in
memory was way too high. That is, even if VMS only had to read the page
tables for the current process, there would be a great deal of wasted
memory.
firth@sei.cmu.edu (Robert Firth) (07/04/90)
In article <3341@goanna.cs.rmit.oz.au> ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:
>One really beautiful idea is to stand the whole thing on its
>head and have your page tables map from *physical* pages to
>*virtual*, so that the space taken up by page tables is a
>fixed fraction of the size of *physical* memory.

Am I being an idiot here, or doesn't this idea have a killer flaw: it
precludes having two different virtual addresses mapped onto the same
physical address. I can see a lot of problems with such a restriction.
richard@aiai.ed.ac.uk (Richard Tobin) (07/06/90)
In article <1990Jun29.154940.22762@tera.com> bob@colossus.tera.com (Bob Alverson) writes:
>Consider a UNIX process with heap at low addresses going up and stack
>at high addresses going down. Without a paged page table, all the
>space in between must have page table entries sitting in physical memory.

Or you must have two page tables.

Do operating systems that allow sparse address spaces (eg SunOS 4) use
paged page tables to achieve this? Is the fact that BSD doesn't use the
Vax's paged page tables the reason why it doesn't support sparse
address spaces?

-- Richard
--
Richard Tobin,                JANET: R.Tobin@uk.ac.ed
AI Applications Institute,    ARPA:  R.Tobin%uk.ac.ed@nsfnet-relay.ac.uk
Edinburgh University.         UUCP:  ...!ukc!ed.ac.uk!R.Tobin
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (07/06/90)
In article <2936@skye.ed.ac.uk> richard@aiai.UUCP (Richard Tobin) writes:
>Do operating systems that allow sparse address spaces (eg SunOS 4) use
>paged page tables to achieve this? Is the fact that BSD doesn't use
>the Vax's paged page tables the reason why it doesn't support sparse
>address spaces?

The answer is no. The VAX hardware supports multilevel page tables, so
the "usual" sparse address spaces can be described by a fairly small
page table.

For example: the 88000 uses a two-level table. The first level is a
single 4 KB page, containing 1024 one-word entries. The second level is
N pages, where N is less than or equal to 1024. Each second-level page
can access 1024 data pages. Conveniently, 1024 * 1024 * 4 KB is 4 GB,
which is what you need when virtual addresses are 32 bits.

In the example you gave, there was a code region and a stack. If each
of these is smaller than 4 MB, then the 88000's page table for that
fits into three pages. In the first-level table, there are 1022 null
entries and two valid entries. If (say) a particular compile had a
10 MB heap, then the 88000's page table would occupy 3 pages, +1 for
the stack, +1 for the first level. That's 5 pages (20 KB), or an
overhead of one fifth of one percent. For smaller programs, the
overhead is relatively higher, but that's not a major problem.
--
Don		D.C.Lindsay
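[Editor's note: Don's 88000 arithmetic can be verified directly with the numbers from his post (4 KB pages, 1024 four-byte entries per table page):

```python
# Two-level table coverage on the 88000, per Don's figures.
PAGE = 4 * 1024
ENTRIES = 1024

covered_per_l2_page = ENTRIES * PAGE        # 4 MB mapped per second-level page
total_span = ENTRIES * covered_per_l2_page  # first level reaches the full 4 GB
heap = 10 * 1024 * 1024                     # the 10 MB heap in his example
heap_l2_pages = -(-heap // covered_per_l2_page)   # ceiling division

print(covered_per_l2_page, total_span == 4 << 30, heap_l2_pages)
# 4194304 True 3
```

With 3 second-level pages for the heap, 1 for the stack, and 1 first-level page, that is the 5-page (20 KB) total he quotes, i.e. 20 KB / 10 MB = 0.2% overhead.]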
lkaplan@bbn.com (Larry Kaplan) (07/06/90)
In article <PCG.90Jun30230310@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes: >In article <9758@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU >>(Donald Lindsay) writes: >> [ ... on paging page tables ... ] >> The OS designer balances memory demands and swap speed against swap >> frequency. Note that a 16 byte table entry per 512 byte page is >> only a 3% overhead. Most designs run leaner. I believe the 88k is about 4 bytes per 4k page. >Actually this 3% overhead can be mostly obviated, as the need of paging >page tables. The clear win is to have reverse page map tables for the >paging itself, whose size is only proportional to the size of real >memory, and segment map tables for the description of valid stretches >of address space. In this way you have the best of all worlds, because >you can have a very large address space with only locally dense data >space mappings, and a very low overhead. This reverse scheme fails to save you much on machines with large physical memories, which are becoming more and more common. We build UNIX machines that regularly have 1 or more gigabytes of real physical memory. Allocating segment tables on demand is certainly helpful, though. >I am quite sure that Don Lindsay could expound on the advantages of >this approach from direct observation, as the Mach portable virtual >memory code uses exactly this approach (well, the MACH people actually >bitch about the reverse map MMU in the IBM RISCs, because they abuse >copy-on-write, and reverse map makes it difficult, but this is just >because they use the wrong virtual memory model, i.e. copy-on-write). Wrong virtual model? This sounds like a religious comment. My understanding is that UNIX has wanted to have true copy-on-write semantics for a while but just never got around to it. Witness the vfork() hack in BSD, which was supposed to disappear when "proper sharing semantics" were implemented. 
The system we build has MACH ancestry (though a totally rewritten VM system) and therefore allowed us to make vfork() the same as fork(). As far as paging the page tables goes, our page fault handler is always prepared to allocate page table space. The only problem would be to clean the system's TLBs since we put page tables into the kernel's virtual address space. In our multiprocessor systems, my feeling is that it is a win to be able to page your page tables to prevent some process from consuming lots of local page table memory in order to reference a lot of system-wide memory. Other small local processes (or even user allocations from the same process) would then get very starved for memory on the local node. #include <std_disclaimer> _______________________________________________________________________________ ____ \ / ____ Laurence S. Kaplan | \ 0 / | BBN Advanced Computers lkaplan@bbn.com \____|||____/ 10 Fawcett St. (617) 873-2431 /__/ | \__\ Cambridge, MA 02238
johnl@esegue.segue.boston.ma.us (John R. Levine) (07/06/90)
I don't get what this argument is about. Every machine I've ever seen that has regular (as opposed to inverted) page tables pages the page tables. Some, like the Vax, do it explicitly by embedding the page table in another pageable address space. Most others, such as the IBM 370 and the Intel 386, get exactly the same effect with two-level page tables. In both cases, the page table is broken up into chunks (which always seem to have a size of one page) and each chunk can be marked resident or non-resident independently. Except for machines like the PDP-11/70, which has a tiny virtual address space, the page table is too big to lock down into memory. Even if you have the memory, having to allocate a large contiguous chunk of physical memory is painful, so paged page tables allow scattered allocation of page tables just like any other memory. -- John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650 johnl@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl Marlon Brando and Doris Day were born on the same day.
mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (07/06/90)
In article <58001@bbn.BBN.COM> lkaplan@BBN.COM (Larry Kaplan) writes: >Wrong virtual model? This sounds like a religious comment. My understanding >is that UNIX has wanted to have true copy-on-write semantics for a while but >just never got around to it. Witness the vfork() hack in BSD which was >suppose to disappear when "proper sharing semantics" were implemented. >The system we build has MACH ancestry (though a totally rewritten VM system) >and therefore allowed us to make vfork() the same as fork(). If you implement vfork() the same as fork(), your system is broken. If a large process wants to fork(), it requires a large amount of virtual memory space (twice the size of the process). So, if the process has almost used up its virtual memory space, the process cannot fork(). Copy-on-write won't help. If a large process wants to vfork(), no extra virtual memory space is necessary. This is a serious problem when large-scale computing is necessary. Masataka Ohta
lkaplan@bbn.com (Larry Kaplan) (07/06/90)
I wrote: > >>Wrong virtual model? This sounds like a religious comment. My understanding >>is that UNIX has wanted to have true copy-on-write semantics for a while but >>just never got around to it. Witness the vfork() hack in BSD which was >>suppose to disappear when "proper sharing semantics" were implemented. >>The system we build has MACH ancestry (though a totally rewritten VM system) >>and therefore allowed us to make vfork() the same as fork(). In article <5796@titcce.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes: > >If you implement vfork() the same as fork(), your system is broken. > >If a large process want to fork(), it requires a large amount of virtual >memory space (twice the size of the process). So, if the process almost >used up virtual memory space, the process can not fork(). Copy-on-write >won't help. > >If a large process want to vfork(), no extra virtual memory space >is necessary. > >This is a serious problem when large scale computing is necessary. > > Masataka Ohta This shows some serious misunderstanding on your part. The idea behind copy-on-write is that a proper fork() requires NO EXTRA physical memory for the child process (except that required for kernel data structures and page tables). You probably end up needing one stack page pretty soon. Virtual memory is VIRTUAL. Who cares if you were running out in one process? You get an entire new address space in the next. This implementation of fork() is what fork() was always intended to be, as far as I can tell. By doing fork() correctly, the need for a separate vfork() disappears (as stated in the BSD man pages). #include <std_disclaimer> _______________________________________________________________________________ ____ \ / ____ Laurence S. Kaplan | \ 0 / | BBN Advanced Computers lkaplan@bbn.com \____|||____/ 10 Fawcett St. (617) 873-2431 /__/ | \__\ Cambridge, MA 02238
henry@zoo.toronto.edu (Henry Spencer) (07/06/90)
In article <5796@titcce.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes: >If a large process want to fork(), it requires a large amount of virtual >memory space (twice the size of the process). So, if the process almost >used up virtual memory space, the process can not fork(). Copy-on-write >won't help. Nonsense. There need not be any concept of "using up virtual memory space". Virtual memory is *virtual*; only when those pages need to become real do they consume resources. Forking large processes without needing lots more memory/disk just requires an intelligent implementation. Unfortunately, intelligent implementors are scarce. Vfork is an abomination invented by lazy people; the sooner it goes, the better. (Please don't tell me that hardware problems prevented them from implementing copy-on-write; copy-on-access they *could* have done, and it would have been almost as good.) -- "Either NFS must be scrapped or NFS | Henry Spencer at U of Toronto Zoology must be changed." -John K. Ousterhout | henry@zoo.toronto.edu utzoo!henry
pcg@cs.aber.ac.uk (Piercarlo Grandi) (07/07/90)
In article <1990Jul6.160004.896@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes: In article <5796@titcce.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes: >If a large process want to fork(), it requires a large amount of virtual >memory space (twice the size of the process). So, if the process almost >used up virtual memory space, the process can not fork(). Copy-on-write >won't help. Nonsense. There need not be any concept of "using up virtual memory space". Virtual memory is *virtual*; only when those pages need to become real do they consume resources. Forking large processes without needing lots more memory/disk just requires an intelligent implementation. Unfortunately, intelligent implementors are scarce. What Ohta probably means here is that in such naive implementations that reserve swap space as soon as the corresponding virtual address range is defined, if one process has consumed more than half swap space, copy-on-write will fail, while vfork will not... I do agree with Spencer that this is merely a fixup to a wrong implementation decision, though. I do not believe that copy-on-write and the Unix fork()/exec() semantics are in general useful. IMNHO the right thing is that address space creation and thread creation should be separate operations, and threads of control and memory segments should be movable from one address space to another, and there should be no shared memory at all (except for irrevocably read only segments), either copy-on-write or not. 
To get something like a fork you would do:

    create a new address space, with no thread or segment in it
    create and map into it the required code and data segments
    create a new thread in this address space
    jump this new thread to the other address space

This requires the ability to distinguish, and manipulate independently, threads, address spaces, and data spaces; unfortunately the three concepts are inextricably merged in almost all the OS designs that are popular (e.g. the only OS I know of that separates the concept of data space from that of address space is MUSS, and precious few, e.g. the CAP monitor or some recent Rochester ones, allow you to manage threads and address spaces independently). -- Piercarlo "Peter" Grandi | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (07/09/90)
In article <1990Jul6.160004.896@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes: >In article <5796@titcce.cc.titech.ac.jp> mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes: >>If a large process want to fork(), it requires a large amount of virtual >>memory space (twice the size of the process). So, if the process almost >>used up virtual memory space, the process can not fork(). Copy-on-write >>won't help. >Nonsense. There need not be any concept of "using up virtual memory space". >Virtual memory is *virtual*; Sigh... It seems to me that I must explain AGAIN in this newsgroup why vfork() is essential. Virtual memory is NOT nothing. >only when those pages need to become real do >they consume resources. OK. You are correct here, though the terminology is confusing (real here does not mean you need real memory, but that you need swap space). So, perhaps, you know when a page needs to be allocated swap space. With a copy-on-write scheme, a page needs swap space when something is written to it. >Forking large processes without needing lots more >memory/disk just requires an intelligent implementation. Unfortunately, >intelligent implementors are scarce. I admit that that is an elaborate implementation. But, in general, elaboration should be considered a vice in UNIX. In this case, the INTELLIGENT implementation is much worse than a simple-minded implementation (though copy-on-write is already too elaborate, and thus a vice). Fortunately, INTELLIGENT implementors are scarce. If you think virtual memory is free and allow forking without reserving actual swap space, when swap space is required, it is often the case that there is no free swap space anymore. That is, if a process (which may not be a large process, it may even be a system process) writes to a copy-on-write page, it cannot. The situation is a deadlock, and to resolve it some process must be killed. 
In some INTELLIGENT implementations, swap space for read-only text pages will also be allocated on demand, so even an instruction read (not a write) can cause deadlock. The problem is serious when you do a large-scale computation. >Vfork is an abomination invented by lazy people; the sooner it goes, >the better. You should understand that laziness is a virtue in UNIX. >(Please don't tell me that hardware problems prevented them >from implementing copy-on-write; copy-on-access they *could* have done, >and it would have been almost as good.) But the reason why vfork survives in 4.4BSD is, as I heard from a person in Berkeley, that vfork is faster than fork. Even with the elaborate shared-memory mechanism of 4.4BSD, it takes time to spawn a large, completely new process. Masataka Ohta
jkrueger@dgis.dtic.dla.mil (Jon) (07/10/90)
mohta@necom830.cc.titech.ac.jp (Masataka Ohta) writes: >With copy-on-write scheme, a page need swap space >when the page is written something. But not until. Page and swap file space allocation is as postponeable as the memory copy. Again, the cost is not paid at process creation and quite commonly not ever. In UNIX systems the most common sequence of events after a fork is an exec. Why allocate swap space at process creation when the most common immediately following event will be execution of a different sized program? >If you think virtual memory is free and allow forking without reserving >actual swap space, when swap space is required, it is often the case >that, there is no free swap space, anymore. That way lies MVS, friends; can't do anything without allocating all resources you might ever use, paying for them whether you ever use them or not, denying them to others, including your own other processes. If you want MVS you know where to find it. The rest of us have found more flexible resource management schemes worthwhile. We love ownership but also find option-to-buy an attractive proposition. -- Jon -- Jonathan Krueger jkrueger@dtic.dla.mil uunet!dgis!jkrueger Drop in next time you're in the tri-planet area!
pcg@cs.aber.ac.uk (Piercarlo Grandi) (07/10/90)
In article <58001@bbn.BBN.COM> lkaplan@bbn.com (Larry Kaplan) writes: [ I had written: ] >In this way you have the best of all worlds, because >you can have a very large address space with only locally dense data >space mappings, and a very low overhead. This reverse scheme fails to save you much on machines with large physical memories, which are becoming more and more common. We build UNIX machines that regularly have 1 or more gigabytes of real physical memory. Yes, but as somebody else has already remarked, reverse maps mean that the overhead is proportional to physical size, i.e. for a given memory configuration it is constant, while direct maps are proportional to virtual size. Of course, if your workload underutilizes real memory (that is, the combined sizes of all virtual address spaces are less than the size of physical memory) you lose, but then you are wasting much more space in data space than in translation tables. Allocating segment tables on demand is certainly helpful, though. The only way to get around the fundamentally poor encoding of direct maps and their ensuing unsuitability for large sparse locally dense address spaces is to use many levels of page tables; three levels is not common (region, segment, page). This has problems of its own, typically not just latency (TLBs help here as well), but complicated fault handling and restartability. >I am quite sure that Don Lindsay could expound on the advantages of >this approach from direct observation, as the Mach portable virtual >memory code uses exactly this approach (well, the MACH people actually >bitch about the reverse map MMU in the IBM RISCs, because they abuse >copy-on-write, and reverse map makes it difficult, but this is just >because they use the wrong virtual memory model, i.e. copy-on-write). Wrong virtual model? This sounds like a religious comment. Religious maybe, but the evils of data space aliasing are there for everybody to see... I at least do see them :-). 
My understanding is that UNIX has wanted to have true copy-on-write semantics for a while but just never got around to it. I still think that fork() has the wrong semantics and a more decoupled design would be neater and more efficient. If you stay within the fork() based model of address space coupled to control thread, creating a new control thread requires creating a new address space, and copy-on-write is just a fairly tolerable implementation technique. On a PDP-11 it was not really terribly necessary, with its 64KB limit on data space. -- Piercarlo "Peter" Grandi | ARPA: pcg%cs.aber.ac.uk@nsfnet-relay.ac.uk Dept of CS, UCW Aberystwyth | UUCP: ...!mcsun!ukc!aber-cs!pcg Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
chip@tct.uucp (Chip Salzenberg) (07/10/90)
According to mohta@necom830.cc.titech.ac.jp (Masataka Ohta): >Sigh..., It seems to me that I must explain AGAIN in this newsgroup >why vfork() is essential. "Essential" is pretty strong language. At best, you explain why YOU think vfork() is *valuable*. >So, perhaps, you know when a page need to be allocated >swap space. With copy-on-write scheme, a page need swap space >when the page is written something. This behavior is entirely consistent with other Unix resource management behavior. Programs do not begin by reserving swap space for all the storage they might need. Rather, they ask for memory as necessary. Likewise, during a fork(), a Unix process doesn't reserve swap space; it uses it as necessary. Perhaps you don't like this philosophy. Well, then, it seems you're using the wrong OS for your tastes. >>Forking large processes without needing lots more memory/disk just >>requires an intelligent implementation. Unfortunately, intelligent >>implementors are scarce. > >I admit that that is an elaborated implementation. But, in general, >elabolation should be considered vice in UNIX. "Intelligent" and "elaborate" are not synonyms. -- Chip Salzenberg at ComDev/TCT <chip@tct.uucp>, <uunet!ateng!tct!chip>
mohta@necom830.cc.titech.ac.jp (Masataka Ohta) (07/12/90)
In article <2699E08D.117A@tct.uucp> chip@tct.uucp (Chip Salzenberg) writes: >This behavior is entirely consistent with other Unix resource >management behavior. Programs do not begin by reserving swap space >for all the storage they might need. Rather, they ask for memory as >necessary. Likewise, during a fork(), a Unix process doesn't reserve >swap space; it uses it as necessary. > >Perhaps you don't like this philosophy. Well, then, it seems you're >using the wrong OS for your tastes. I don't know what variation of UNIX you know. So please tell me what happens when there is not enough swap space with your favorite UNIX. With vfork, such a situation never occurs (except for the stack segment), because the fork is denied. Without vfork, it is a serious problem. So, please tell me what happens. >"Intelligent" and "elaborate" are not synonyms. No, of course not. Masataka Ohta