mash@mips.UUCP (04/29/87)
In article <93@ksr.UUCP> ned@ksr.UUCP (Kittlitz) writes:
>
>How does MIPS TLB load deal with used/modified bits?  I assume that
>part of the path always executed determines that the bits are in the
>correct state.  i.e. some of the instructions included in your count
>are for is-used-set?, is-this-a-write-well-then-is-modified-set?.

No.  See the sequence below.  Used bits are done a la VAX.

>Assuming that these are coded with branch-not-taken being common
>(used/modified already has correct value), a multiprocessor still has
>to do some kind of exclusivity locking when it diddles the bits.
>Got any multi implementation or plans you can talk about?  e.g.
>how long would THAT sequence be?

Yes, and no... see the tail-end of this posting.

>Can you describe any special hardware characteristics of the TLB in
>greater detail (or is it published)?  How about the way you use the TLB
>in your UN*X implementation.

It was published: IEEE COMPCON, March 1986, San Francisco, 126-137
[3 articles; the "Operating System Support on a RISC" one gives the
register layouts of all the MMU control, etc, plus gory detail on how
they work.]  I won't replicate it here.  We also presented the actual
code at the conference.  Here's the BSD version [the SYS V version is
1 instruction longer, since it was more convenient for memory
allocation]:

NESTED(utlbmiss, 0, k1)		# UMIPS-BSD version
	.set	noreorder
	.set	noat
	mfc0	k0,C0_CTXT	# fetch context = (ptebase | badvpn)
	mfc0	k1,C0_EPC	# fetch epc of faulting instr
	lw	k0,0(k0)	# fetch pte
	nop			# nothing to do
	mtc0	k0,C0_TLBLO	# copy pte to tlb-entry low
	nop			# nothing to do
	c0	C0_WRITER	# pseudo-random write into tlb from hi/lo
	j	k1		# go back where you came from...
	c0	C0_RFE		# and restore state before you land
	.set	at
	.set	reorder
	END(utlbmiss)

Basically, this is 9 instructions, plus a likely cache-miss on the
load-word.  It takes about 2 cycles to get here on the average.  If
it's cache-resident [which it is if it matters, i.e., high miss rates
on cold-start], this ends up being about 12-16 cycles.  Note that k0
and k1 are 2 registers that the kernel is allowed to trash any time
it feels like.

Note there are no branches: we looked at checking for validity bits,
or seeing if the reference was a modify going into a non-writable
page, but did it the way shown above instead.  Why?  ANS: the usual
reason we have: the statistics said it wasn't worth it.  It's usually
a win to make frequent ops fast, at the expense of longer path lengths
for rare events.  If you put an invalid entry into the TLB, or an
unwritable one, as soon as you return to the faulting instruction, you
will then take a different kind of fault, to another vector, where you
eventually figure out (a sketch of the dispatch follows the list):

a) This was a reference "just under" the stack-top
	==> extend the stack
b) This was a wild reference outside anything mapped
	==> signal the user with a mem fault
c) This was a write to a truly-unwritable page
	==> signal the user with a mem fault
d) This was a write to a "copy-on-write" page
	==> copy the page, mark the new PTE writable,
	    flush the old one from the TLB
e) This was a page fault to a demand-fill-zero page
	==> give them a zero page
f) This was a page fault to a page that needs to be paged in
	==> page it in
g) This was a write to a writable page that is still clean
	==> mark the PTE writable, replace the unwritable TLB entry
	    with the writable version
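In rough C, the dispatch at that other vector looks something like
the sketch below.  This is illustrative only: vfault, vtopte, the
other helpers, and the PG_* bits are names invented here for the
sketch, not actual UMIPS source.

#include <signal.h>		/* SIGSEGV */

typedef unsigned long pte_t;

#define PG_V	0x01		/* pte valid */
#define PG_M	0x02		/* dirty -- set by software, never by the TLB */
#define PG_RO	0x04		/* truly read-only */
#define PG_COW	0x08		/* copy-on-write */

struct proc;			/* opaque here */

extern pte_t *vtopte(struct proc *, unsigned);	/* 0 if nothing mapped */
extern int  just_below_stack(struct proc *, unsigned);
extern void grow_stack(struct proc *, unsigned);
extern void psignal(struct proc *, int);
extern int  fill_on_demand(struct proc *, unsigned);
extern void zero_fill(struct proc *, unsigned);
extern void pagein(struct proc *, unsigned);
extern void copy_on_write(struct proc *, unsigned);
extern void tlb_update(struct proc *, unsigned, pte_t);

void
vfault(struct proc *p, unsigned v, int iswrite)
{
	pte_t *pte = vtopte(p, v);

	if (pte == 0) {				/* nothing mapped at v */
		if (just_below_stack(p, v))
			grow_stack(p, v);	/* case a) */
		else
			psignal(p, SIGSEGV);	/* case b) */
	} else if ((*pte & PG_V) == 0) {
		if (fill_on_demand(p, v))
			zero_fill(p, v);	/* case e) */
		else
			pagein(p, v);		/* case f) */
	} else if (iswrite && (*pte & PG_RO)) {
		psignal(p, SIGSEGV);		/* case c) */
	} else if (iswrite && (*pte & PG_COW)) {
		copy_on_write(p, v);		/* case d) */
	} else if (iswrite) {
		*pte |= PG_M;			/* case g): set dirty bit in */
		tlb_update(p, v, *pte);		/* software, reload writable entry */
	}
	/* a plain read of a valid page never gets here: the
	 * utlbmiss refill above already handled it */
}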
It is interesting that b) and c) are incredibly rare events, and
you're usually about to blow a process away, or do something else
significant, so taking an extra exception is no big deal.  Likewise,
a), d), e), and f) involve substantial kernel activity, and f) has
disk activity.  Any UNIX that does copy-on-write will go around any
hardware-set modify bits that the architecture provides.  That leaves
g) as the ONLY case where a machine really uses hardware-set dirty
bits.

Here's a quiz: what's the order of magnitude of event g)?
	1: Memory references
	2: TLB misses
	3: Paging I/Os
	4: Page-ins

ANS: 4: < page-ins, because:
a) Text is read-only and stays that way.
b) When you read in a data page, you already know whether or not the
   access was a write (if you care), and you can mark the page dirty
   right off the bat.
c) Only if you fault a page in for a read, then later write it, or if
   you fault in a page on a write, but then page it out to swap space,
   do you take an extra exception.  Note that most kernels try hard to
   get free space by taking read-only pages that need not be written.

[Thus: I looked at one of our M/800s, doing about 50-60 blocks
paged-in per sec, and usually 0 paged-out, running about 200
context-switches/sec, 0% idle time, i.e., cranking right along.]
I'd guess that in this environment, handling the extra overhead
[and not having tuned for this case at all] costs about
600 microsecs/second, or about .06%....

To summarize: the transition from writable-but-clean to dirty is a
rare event, and for many such transitions, the OS wants to take action
anyway.  Hence, if there is any hardware cost to having hardware-set
dirty bits [and there usually is: in speed, or in complexity, or in
loss of flexibility, because you make the hardware know where page
tables are and what they look like], then having them is somewhat
redundant.  Note: this is one I don't have exhaustive stats on: we
knew early that this was far enough down in the noise not to worry
about.  For further info, see Clark & Emer, "Performance of the
VAX-11/780 Translation Buffer: Simulation and Measurement," ACM Trans.
on Comp. Sys. 3(1), 31-62 (Feb 1985), a fine paper that was helpful to
us in designing the R2000 2 years ago [very timely!].

Multi-processors: here's what I can say: the designs are not generally
worse than for other CPUs.  Sometimes they're actually simpler, or
waste less hardware that one has to work around.  The basic principles
[shared memory, UNIX multis] are:

a) Our TLB hardware NEVER modifies a TLB entry once it gets there.
   Anything that would, normally calls an exception vector, which is
   exactly what's desirable in an MP box.
b) If you do an exhaustive analysis of what goes on [in the vein of
   the list above], what you find is:
	1) The normal cases work fast.
	2) It is possible for TLBs and main PTEs to get out of synch.
	   If you go thru all of the transitions, it turns out that
	   the out-of-synch cases are either:
		a) Innocent, such as another CPU marks a page Dirty,
		   and I still have a non-writable TLB entry.  [So
		   what: I'll take an extra trap if I reference the
		   page before I've had to reload that TLB entry
		   anyway.]
		OR
		b) Potentially dangerous, but really rare, and the
		   transition is trappable.  For example, the page
		   daemon running on another CPU invalidates a PTE
		   that I still have a TLB entry of.
	   This is tricky, and longer than I'm willing to describe
	   here, but one observes that in most cases, such pages are
	   going onto the reclaim list, and there's plenty of time to
	   catch up with them and fix them, AS LONG AS YOU TRAP
	   MODIFIES.
	3) In general, you statistically never need mass broadcast
	   TLB-invalidates, via judicious use of "lazy evaluation as
	   applied to operating systems".  This principle says that
	   you never do costly, massive operations that intrude into
	   critical paths, if you can defer the operation until later,
	   and if you will frequently find later that a (likely)
	   sequence of events will have occurred in between that helps
	   you avoid having to do the massive operations at all.  For
	   example, you COULD do a broadcast TLB-flush on all
	   processors any time you change anything, but this would be
	   awful.  On the other hand, if you use timestamps on
	   TLB-flushing events, then when you dispatch a process, you
	   can tell whether or not it might have some inconsistent
	   state; you can THEN flush the TLB if you have to, but
	   usually you won't have to, because you'll discover that
	   you've flushed the TLB sometime in the meantime (a C sketch
	   follows).
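Concretely, the timestamp trick can be sketched in C as below.  Again,
this is illustrative only: tlb_gen, p_ptegen, c_flushgen, and
tlb_flush are names invented for the sketch, and all locking is
omitted.

extern void tlb_flush(void);		/* flush this CPU's entire TLB */

unsigned long tlb_gen;			/* bumped on each TLB-flushing event */

struct proc {
	unsigned long p_ptegen;		/* tlb_gen when p's PTEs last changed */
};

struct cpu {
	unsigned long c_flushgen;	/* tlb_gen at this CPU's last flush */
};

/* Someone (the page daemon, say) just invalidated one of p's PTEs:
 * merely stamp the process; broadcast nothing. */
void
pte_changed(struct proc *p)
{
	p->p_ptegen = ++tlb_gen;
}

/* At dispatch: flush only if p's mappings changed since our last
 * flush.  Usually they haven't, or we've flushed for some other
 * reason in the meantime. */
void
dispatch_check(struct cpu *c, struct proc *p)
{
	if (p->p_ptegen > c->c_flushgen) {
		tlb_flush();
		c->c_flushgen = tlb_gen;
	}
}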
Well: that's enough.  Again, all this depends on knowing the relative
statistics of events, plus knowing what operating systems really do.
[SOAPBOX] Good engineering is knowing what can safely be left out.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
ned@john.UUCP (05/01/87)

I found the Compcon article.  I don't understand how EntryHi (the VPN
part of the TLB entry) gets set, based on guess-translation of the
opcodes in your sample.  I get an implication that k0/k1 are somehow
not in the general register set, but I didn't see that in the article
(or do you just not preserve two registers?)

I don't know Unix, so I don't know whether the reclaim list is a
standard thing, or exactly how it is used.  I am familiar with several
systems using "clock" algorithms, which operate as follows (a C sketch
appears below):

- is PTE used off?  This page hasn't been used for a while.
  - is modified on?  time to purify the page
  - else modified is off; this page can be re-used
- else PTE used is on; turn it off.

Each manipulation which turns off a bit in the PTE has to ensure that
it is off in all TLBs.
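In C, one visit of the clock hand would look roughly like this
(PG_USED, PG_MOD, purify, and reuse are invented names, just for
illustration):

typedef unsigned long pte_t;

#define PG_USED	0x1		/* "used" (reference) bit */
#define PG_MOD	0x2		/* "modified" bit */

extern void purify(pte_t *);	/* schedule write-back of a dirty page */
extern void reuse(pte_t *);	/* reclaim a clean, idle page */

void
clock_visit(pte_t *pte)
{
	if ((*pte & PG_USED) == 0) {	/* idle since the last revolution */
		if (*pte & PG_MOD)
			purify(pte);	/* dirty: time to purify it */
		else
			reuse(pte);	/* clean and idle: re-use it */
	} else
		*pte &= ~PG_USED;	/* used: clear the bit for next lap */
	/* any bit turned OFF here must also be forced off in all TLBs */
}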
In article <351@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>Multi-processors: here's what I can say:
>	2) It is possible for TLBs and main PTEs to get out of synch.
>	   If you go thru all of the transitions, it turns out that
>	   the out-of-synch cases are either:
>		a) Innocent, such as another CPU marks a page Dirty,
>		   and I still have a non-writable TLB entry.  [So
>		   what: I'll take an extra trap if I reference the
>		   page before I've had to reload that TLB entry
>		   anyway.]
>		OR
>		b) Potentially dangerous, but really rare, and the
>		   transition is trappable.  For example, the page
>		   daemon running on another CPU invalidates a PTE
>		   that I still have a TLB entry of.
>	   This is tricky, and longer than I'm willing to describe
>	   here, but one observes that in most cases, such pages are
>	   going onto the reclaim list, and there's plenty of time to
>	   catch up with them and fix them, AS LONG AS YOU TRAP
>	   MODIFIES.

Re b), I don't know how you can trap modifies if write-enable is on in
some TLB.  To avoid instantaneous clearing of TLBs, it seems to me
that you need a reference count associated with each PTE (number of
TLBs that have this cached).  The TLB replacement algorithm would
manipulate the counts of the victim and new entry.  I expect there is
some other method involving periodic scanning of a list of changed
PTEs, followed by updating your TLB.
-----
disclaimer: generic disclaimer *

* - have you seen advertisements with '*'s, and no footnote?  It's the
same with this disclaimer...