darrell@sdcsvax.UUCP (04/22/87)
Prodded by something Avie Tevanian of the MACH research group said, I have been considering life with a translation lookaside buffer (TLB) but without hardware page tables (PT). This message is intended to spark some discussion of a) what such a system would be like, and b) how existing architectures and operating systems can be instrumented to get some empirical data to guide TLB design and sizing. Since it involves tradeoffs between hardware and software, I cross-posted to comp.os.research and comp.arch.

Given a good TLB design which doesn't require frequent flushing of entries, it is possible to do without hardware PT entirely. I don't know of anybody working on a machine with TLB and no PT, but I wouldn't be at all surprised to see one in the next few years. The trick is to increase the TLB hit rate to the point where you can afford to provide translation information entirely through software. If you do that, you can keep any amount of "PT" information and implement all kinds of nice things like shared segments of variable size, sparse address spaces, etc.

Another benefit is reduced hardware/microcode complexity and therefore reduced cost. I imagine that a TLB of a given design can be made lots bigger for less money than it would take to (re)implement hardware/microcode address translation. Large address spaces that require paging to support the PT, like the VAX (two levels) or the 432 (is it really SEVEN levels?), get translated in inexpensive, flexible software rather than expensive, hard-to-change hardware/microcode.

Is a TLB/no-PT architecture at all feasible? If you look at the VAX TLB studies published in TOCS, it seems clear that the VAX TLB design can reasonably be sized to reduce misses to just under one percent. This is good, but not good enough. The optimum would be to reduce misses to just those addresses which are not in physical memory; at that point the software has to get into the act to process a page fault anyway.
On the other hand, even if there's a TLB entry for every physical page frame, if there's sharing between contexts, you still don't get complete TLB coverage of the physically resident pages. (This suggests segmented address spaces (MULTICS-style, not 80x86-style), and sharing segments rather than pages, but I will assume linear address spaces.) So one question to answer is: how high a TLB miss rate can be tolerated? A better way to state it is: how much time can be devoted to address translation? Then the tolerable TLB miss rate is a function of the cost and complexity of doing an address translation, which is a function of the operating system (virtual machine) rather than the machine architecture (real machine).

One thing that is clearly needed for a PT-less TLB system, or even a PT-full TLB system with good performance, is a set of context tags (process identifiers) so that TLB entries do not need to be flushed on every context switch. Some VAX TLB implementations distinguish system and user contexts. The memory management hardware in the Sun-2 architecture has 8 contexts, but no TLB. There is a lot of room for experimentation and improvement here. How many tags should there be? Which contexts (processes) should have tags at a given moment? The first question is architecture, the other is operating systems.

Consider software for a moment. One of the things everybody preaches is to move utility services out of the machine kernel and into "user-level" server processes. One of the things nobody does is, yes, just that. The cost in context switching is too high for high performance of system services. From this observation, it is clear that you want enough tags to handle not just the processes ready to run, but also all those servers which are in some queue waiting for requests. On the other hand, you can't afford unbounded TLB hardware. Where is the point of diminishing returns?
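The context-tag idea can be made concrete with a small sketch: a fully associative lookup in which a hit requires matching both the virtual page number and the current context tag, so nothing need be flushed on a context switch. The entry layout, sizes, and names below are illustrative, not any particular machine's:

```c
#include <stdint.h>
#include <stddef.h>

#define TLB_ENTRIES 64
#define PAGE_SHIFT  12            /* 4K pages */

/* One TLB entry: a virtual page number tagged with a context id. */
struct tlb_entry {
    uint32_t vpn;     /* virtual page number */
    uint32_t pfn;     /* physical frame number */
    uint8_t  asid;    /* context tag: which process owns this entry */
    uint8_t  valid;
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Fully associative lookup: hit only if VPN *and* context tag match. */
int tlb_lookup(uint32_t vaddr, uint8_t cur_asid, uint32_t *paddr)
{
    uint32_t vpn = vaddr >> PAGE_SHIFT;
    for (size_t i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].vpn == vpn && tlb[i].asid == cur_asid) {
            *paddr = (tlb[i].pfn << PAGE_SHIFT)
                   | (vaddr & ((1u << PAGE_SHIFT) - 1));
            return 1;   /* hit */
        }
    }
    return 0;           /* miss: trap to the software refill handler */
}
```

The point of the tag comparison is that two contexts can hold entries for the same virtual page simultaneously; without tags, a switch would force a full flush.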
Obviously the answer depends on many factors, and even a detailed analysis of the factors would be useful. You also have the classic problem of allocating a resource (TLB entries) among different users (contexts). Extending the virtual address to include the context tag amounts to statically allocating each context a fixed share of the TLB. I can't imagine that this leads to effective use of the TLB for even 8 contexts. On the other hand, what are the costs of deciding which entries, from which contexts, to discard? Working set theory indicates that at some point it is better to throw away entries from the current context, rather than another context.

Rather than speculate about just where the tradeoffs are, I would like to spark some discussion of how existing systems could be instrumented to empirically discover how many tags would pay off, similarly to the way systems have been instrumented to measure how much a given size or associativity of TLB would pay off. It's not obvious that measurements of, say, BSD UNIX, would really be applicable to a system with the same services, but IPC, file system, etc, etc, stripped out of the kernel into user processes. Still, some empirical data is better than none.

I am interested in what events and situations to look for on a generic modern OS, rather than details about what code to poke in FooOS. That's boring without some principled idea what to look for... Suggestions? Ideas? Experiences???

Stu Friedberg	{seismo, allegra}!rochester!stuart
		stuart@cs.rochester.edu
darrell@sdcsvax.UUCP (04/22/87)
First a question: isn't this what an AMD29000 is like? It has a 32-line, 2-way set-associative TLB. On a TLB miss, it takes a trap and lets you do whatever you wish. Theoretically, traps are extremely fast. The processor doesn't save any context for you, and it has enough registers that you can dedicate a few of them to the TLB trap handler. The TLB entries are also tagged with an 8-bit Task ID which must match the PID in the configuration register for a TLB hit. I don't think the TLB is big enough though.

I think that TLB reload from a two-level page table could be done in 15-20 cycles or so, assuming all pages are in memory. I think somebody mentioned that the MIPS chip does at least part of its TLB reload in software; anybody care to comment?

My interest in this subject is a lightweight process model operating system for single-user workstations. Being able to design your own protection mechanisms into the paging scheme makes doing this efficiently a lot easier. There is also a lot of flexibility to be gained in shared memory and paging algorithms. You aren't satisfied with the reference bits in the page table? Go add some more...

I hadn't thought much about servers and their effects on the TLB. I am more interested in what happens when a user has 10 different interactive processes going at once and the mouse tracks through five of them on its way from the editor to the shell... Of course, in such a system there should be enough sharing that most of the same code is being called. The same applies to server code. With lots of sharing, not that many TLB entries would be needed. (Of the N megs of code in memory on my machine, how much of it is redundant?)

We are already starting to see useful systems for this kind of research. The VICE filesystem operates outside of the kernel. MACH supposedly allows user-specified paging code. Perhaps this could be used to record paging activity in detail.
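For reference, the index/tag split that a 32-line, 2-way set-associative TLB implies can be sketched like this. The set count (16 sets of 2 ways) follows from the line count, but the exact field layout here is my guess, not the actual Am29000 arrangement:

```c
#include <stdint.h>

#define SETS       16    /* 32 lines, 2-way: 16 sets of 2 ways */
#define WAYS       2
#define PAGE_SHIFT 12    /* illustrative page size */

/* Low bits of the virtual page number select the set... */
uint32_t tlb_set(uint32_t vaddr) { return (vaddr >> PAGE_SHIFT) & (SETS - 1); }

/* ...and the remaining high bits form the tag compared within the set
 * (along with the Task ID, not shown). */
uint32_t tlb_tag(uint32_t vaddr) { return vaddr >> (PAGE_SHIFT + 4); }
```

With only two ways per set, two hot pages that collide in the same set already cause thrashing, which is one reason to doubt that 32 lines is enough.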
Sincerely,

Zalman Stern
Information Technology Center
Carnegie Mellon University
Arpanet: zs01#@andrew.cmu.edu
mash@mips.uucp (John Mashey) (04/23/87)
In article <3027@sdcsvax.UCSD.EDU> stuart@CS.ROCHESTER.EDU (Stuart Friedberg) writes:
>Prodded by something Avie Tevanian of the MACH research group said,
>I have been considering life with a translation lookaside buffer (TLB)
>but without hardware page tables (PT)...
>
>Given a good TLB design which doesn't require frequent flushing of
>entries, it is possible to do without hardware PT entirely. I don't
>know of anybody working on a machine with TLB and no PT, but I wouldn't
>be at all surprised to see one in the next few years......

The following machines, which shipped in 1986 or earlier, do exactly this:
	Celerity 12xx (I think)
	HP840 (and other Spectrum RISC machines) (for sure)
	MIPS R2000 (for sure)
(See "Operating Support on a RISC", in Proc. IEEE COMPCON, March 1986, for rationale behind doing TLBs this way.) The recently-announced AMD29000 does the same.

>Another benefit is reduced hardware/microcode complexity and therefore
>reduced cost....

Yes. Bus interface units are often pretty random. TLB cells are at least regular.

>Is a TLB/no-PT architecture at all feasible? .....

Yes. With 4K pages, 64 fully-associative on-chip TLB entries, and an unmapped hole in the address space that bypasses the TLB (to greatly lessen the usual TLB-crunching from the nonlocality of kernels), we usually see 1% of the user CPU time (or less) in TLB-miss processing. [Simulations + measurements/counting.] Once you get it down that far, you stop worrying about it: lots of other things are MUCH more important for performance.

>One thing that is clearly needed for a PT-less TLB system, or even a
>PT-full TLB system with good performance, is a set of context tags
>(process identifiers) so that TLB entries do not need to be flushed on
>every context switch.

We use 6-bit PIDs in our TLB; worst-case context-switch overhead for TLB-flushing = 1 microsecond [on an 8MHz machine] / context switch on the average, assuming that you have >64 processes, and you round-robin them without rescheduling.
[This isn't the way real systems work: usually you switch amongst a smaller number of more active processes, even on large machines.]

Anyway, TLBs refilled by software are pretty easy to measure: you just put extra counters in there. However, as noted, it stops being an interesting problem when the performance hit is so low. At least in our case, 4.3BSD and System V.3 both run on this [with slightly different refill code, for convenience], and we've written half a dozen variants to fit various prospects' existing or preferred environments.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
darrell@sdcsvax.UUCP (04/23/87)
The discussion about TLBs with no hardware PT brings up the following remark/suggestion: It is possible (and has been done) for the TLB to allow multiple simultaneous page sizes. This has a lot of attraction for engineering/scientific users (us folks who use floating point :-) because typical code segments are much smaller than data segments. This allows code to live on small pages, data to live on large pages, and the page tables to be kept small even when the machine memory gets very, very big. Try it; your users will like it. If you don't do something like this, you will need a very large (expensive in all relevant senses) TLB to handle simple cases like accessing big arrays along the second or third dimension. A big TLB may also increase context switch time excessively.

Another question: The minimum requirements for support of Capabilities seem to be:

	1) Virtualizable instruction set architecture
	2) Hardware control to permit read-only access.

Do any of the new machines in question qualify on both counts? (e.g. am29000, MIPS R-2000, Celerity . . .)

Hugh LaMaster, m/s 233-9,	UUCP {seismo,topaz,lll-crg,ucbvax}!
NASA Ames Research Center	     ames!pioneer!lamaster
Moffett Field, CA 94035		ARPA lamaster@ames-pioneer.arpa
Phone: (415)694-6117		ARPA lamaster@pioneer.arc.nasa.gov

"In order to promise genuine progress, the acronym RISC should stand for REGULAR (not reduced) instruction set computer." - Wirth

("Any opinions expressed herein are solely the responsibility of the author and do not represent the opinions of NASA or the U.S. Government")
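The big-array point is easy to quantify with a back-of-the-envelope function. With illustrative numbers: traversing one column of a row-major 1024 x 1024 array of doubles (8192-byte rows) touches 1024 distinct 4K pages, swamping a 64-entry TLB, while 4M data pages would cover the same traversal with just 2 entries:

```c
/* Approximate count of distinct pages touched by walking one column of
 * a row-major rows x cols array.  If a row is at least a page long,
 * every element of the column lands on a different page; otherwise
 * consecutive rows share pages.  Illustrative arithmetic, not a model
 * of any particular machine. */
long pages_touched_by_column(long rows, long cols, long elem_bytes,
                             long page_bytes)
{
    long row_bytes = cols * elem_bytes;
    if (row_bytes >= page_bytes)
        return rows;                         /* one new page per element */
    long total = rows * row_bytes;           /* span covered by the column */
    return (total + page_bytes - 1) / page_bytes;   /* approx. ceiling */
}
```

The same arithmetic shows why mixing small code pages with large data pages keeps both the TLB footprint and the page tables manageable.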
hansen@mips.UUCP (Craig Hansen) (04/23/87)
In article <3027@sdcsvax.UCSD.EDU>, stuart@CS.ROCHESTER.EDU (Stuart Friedberg) writes:
> Prodded by something Avie Tevanian of the MACH research group said,
> I have been considering life with a translation lookaside buffer (TLB)
> but without hardware page tables (PT). This message is intended to
> spark some discussion of a) what such a system would be like, and
> b) how existing architectures and operating systems can be instrumented
> to get some empirical data to guide TLB design and sizing. Since it
> involves tradeoffs between hardware and software, I cross-posted to
> comp.os.research and comp.arch.

The MIPS R2000 processor already does it, and with a 4K page size, a 64-entry, fully-associative TLB, and 15-19 cycles to refill the TLB in software from a simple page table, large benchmarks typically spend about 1% of their execution time in the software TLB refill handler. There are 6-bit process identifiers associated with each entry, so that the TLB doesn't need to be flushed on context switches, and the permanently memory-resident portion of the kernel doesn't use the TLB at all, which reduces the overall size requirement of the TLB by a large margin.

An important benefit of the software TLB handler is that it is easily changeable. We use two different handlers for the two different UNIX ports we have running on the machine (BSD 4.3 and V.3), because there are two different page table formats in use. We've been able to easily add code to count TLB misses to verify our simulation data under real system conditions. We have also used it to code around hardware problems in one revision of our processor chip.
-- 
Craig Hansen
Manager, Architecture Development
MIPS Computer Systems, Inc.
...decwrl!mips!hansen
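A software refill handler of this kind amounts to a short page-table walk done in the miss trap. A minimal C sketch of a two-level walk is below; the field widths, the PTE format, and the names are illustrative assumptions, not the actual R2000 handler or page-table layout:

```c
#include <stdint.h>
#include <stddef.h>

#define PAGE_SHIFT 12    /* 4K pages */
#define LVL_BITS   10    /* 10 + 10 + 12 = 32-bit address, illustrative */

/* On a TLB miss, walk a two-level table rooted at `root` and return the
 * PTE to load into the TLB, or 0 for an invalid mapping (i.e. a real
 * page fault that the kernel proper must handle). */
uint32_t refill_walk(uint32_t **root, uint32_t vaddr)
{
    uint32_t idx1 = vaddr >> (PAGE_SHIFT + LVL_BITS);          /* top bits */
    uint32_t idx2 = (vaddr >> PAGE_SHIFT) & ((1u << LVL_BITS) - 1);

    uint32_t *second = root[idx1];       /* second-level table, if any */
    if (second == NULL)
        return 0;                        /* unmapped region: page fault */
    return second[idx2];                 /* PTE: frame + protection bits */
}
```

Because this runs as ordinary code, the same hardware can support different table formats simply by swapping the handler, which is exactly the flexibility claimed above for the two UNIX ports.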
darrell@sdcsvax.UUCP (04/24/87)
In article <3031@sdcsvax.UCSD.EDU> lamaster@pioneer.arpa (Hugh LaMaster) writes:
+-----
| Another question: The minimum requirements for support of Capabilities
| seems to be:
|
| 1) Virtualizable instruction set architecture
| 2) Hardware control to permit read-only access.
|
| Do any of the new machines in question qualify on both counts?
| (e.g. am29000, MIPS R-2000, Celerity . . .)
+-----

The Am29000 processor is fully "virtualizable"; all privileged instructions are trapped with a protection violation in such a way that the privileged instruction can be examined and executed for the "virtual machine" by the "real machine", and the virtual machine's instruction stream can be restarted correctly.

The protection bits provided in the TLB entries are:

	SR -- supervisor read permission
	SW -- supervisor write permission
	SE -- supervisor execute permission
	UR -- user read permission
	UW -- user write permission
	UE -- user execute permission

So the answer is "yes" on both counts.
-- 
Tim Olson
Advanced Micro Devices
Processor Strategic Development
(tim@amdcad.AMD.COM)
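A sketch of how those six bits could gate an access: read-only mappings are simply entries with a read bit set and the write bit clear for the relevant mode. The bit encoding and the check below are illustrative (the Am29000 does this in hardware):

```c
#include <stdint.h>

/* Per-entry protection bits, named after the posting above;
 * the numeric encoding is an illustrative assumption. */
enum {
    SR = 1 << 0, SW = 1 << 1, SE = 1 << 2,   /* supervisor read/write/exec */
    UR = 1 << 3, UW = 1 << 4, UE = 1 << 5    /* user read/write/exec */
};

/* access: 0 = read, 1 = write, 2 = execute */
int access_allowed(uint8_t prot, int access, int supervisor)
{
    uint8_t need = (uint8_t)(1u << access);   /* pick SR/SW/SE bit */
    if (!supervisor)
        need <<= 3;                           /* shift up to UR/UW/UE */
    return (prot & need) != 0;
}
```

Granting SR|SW|UR on a page, for example, gives the supervisor full access while making the page read-only to user mode, which is the second of the two Capability requirements.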
mcvoy@crys.WISC.EDU (Larry McVoy) (04/25/87)
hansen@mips.UUCP (Craig Hansen) writes:
>of their execution time in the software TLB refill handler. There are 6-bit
>process identifiers associated with each entry, so that the TLB doesn't need

Does that mean you wrap process id's around at 64? If not, how do you know if you have a hit or not?
-- 
Larry McVoy	mcvoy@rsch.wisc.edu or uwvax!mcvoy
"It's a joke, son! I say, I say, a Joke!!" --Foghorn Leghorn
mash@mips.UUCP (John Mashey) (04/26/87)
In article <3040@sdcsvax.UCSD.EDU> mcvoy@crys.WISC.EDU (Larry McVoy) writes:
>hansen@mips.UUCP (Craig Hansen) writes:
>>of their execution time in the software TLB refill handler. There are 6-bit
>>process identifiers associated with each entry, so that the TLB doesn't need
>
>Does that mean you wrap process id's around at 64? If not, how do you know
>if you have a hit or not?

The 6-bit identifiers are treated as a resource onto which 16-bit real PIDs are mapped dynamically, there being no longer-term association. When you run out, you flush the whole TLB, then give each process a new PID as you dispatch it. In the worst case, where you had >64 processes being round-robined through the complete cycle, the overhead for this is 1 microsecond / context switch [on a 5-MIPS machine]. Real systems don't work this way, since you often have much more back-and-forth switching. If you understand the amount of work that UNIX does on a context switch, this is completely in the noise.

Note that 68K designs that use SRAM MMUs usually have the same sort of algorithm, except they must load up the SRAMs, and there are usually fewer than 64 contexts [such as 8 on Sun-3s], i.e. not a problem.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD: 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
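The recycling scheme described above fits in a few lines: hand out hardware tags sequentially, and when the 64-tag pool is exhausted, flush the whole TLB and start over. The policy is as stated in the posting; the code details (names, the flush counter) are illustrative:

```c
#include <stdint.h>

#define NUM_ASIDS 64              /* the 6-bit hardware PID space */

static uint8_t next_asid = 0;
static int     flushes   = 0;     /* counts whole-TLB flushes */

static void tlb_flush_all(void)
{
    flushes++;                    /* stand-in for invalidating all entries */
}

/* Called at dispatch when a process has no valid hardware tag.  On
 * wraparound, every process's cached tag becomes stale, so each one
 * re-requests a tag the next time it is dispatched. */
uint8_t asid_assign(void)
{
    if (next_asid == NUM_ASIDS) {
        tlb_flush_all();
        next_asid = 0;
    }
    return next_asid++;
}
```

This makes the claimed cost visible: one full flush per 64 tag assignments, which amortizes to negligible overhead unless more than 64 processes are genuinely cycling.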