[comp.arch] AM29000 MMU (really MIPS TLB) (long)

mash@mips.UUCP (04/29/87)

In article <93@ksr.UUCP> ned@ksr.UUCP (Kittlitz) writes:
>
>How does MIPS TLB load deal with used/modified bits? I assume that
>part of the path always executed determines that the bits are in the
>correct state. i.e. some of the instructions included in your count
>are for is-used-set?, is-this-a-write-well-then-is-modified-set?.
No.  See the sequence below. Used bits are done a la VAX.
>
>Assuming that these are coded with branch-not-taken being common
>(used/modified already has correct value), a multiprocessor still has
>to do some kind of exclusivity locking when it diddles the bits.
>Got any multi implementation or plans you can talk about? e.g.
>how long would THAT sequence be?
Yes, and no... see the tail-end of this posting.
>
>Can you describe any special hardware characteristics of the TLB in
>greater detail (or is it published)? How about the way you use the TLB
>in your UN*X implementation.

It was published: IEEE COMPCON, March 1986, San Francisco, pp. 126-137
[3 articles; the "Operating System Support on a RISC" one gives the
register layouts of all the MMU control registers, etc., plus gory detail
on how they work.]  I won't replicate it here.  We also presented the
actual code at the conference.  Here's the BSD version [the SYS V version
is 1 instruction longer, since it was more convenient for memory allocation]:

NESTED(utlbmiss, 0, k1)		# UMIPS-BSD version
	.set	noreorder
 	.set	noat
	mfc0	k0,C0_CTXT	# fetch context = (ptebase | badvpn)
	mfc0	k1,C0_EPC	# fetch epc of faulting instr
	lw	k0,0(k0)	# fetch pte
	nop			# load-delay slot: pte not ready yet
	mtc0	k0,C0_TLBLO	# copy pte to tbl-entry low
	nop			# hazard slot: let mtc0 settle before the write
	c0	C0_WRITER	# pseudo-random write into tlb from hi/lo
	j	k1		# go back where you came from...
	c0	C0_RFE		# and restore state before you land
 	.set	at
	.set	reorder
	END(utlbmiss)
Basically, this is 9 instructions, plus a likely cache-miss on the load-word.
It takes about 2 cycles to get here on average.  If it's cache-resident
[which it is if it matters, i.e., under high miss rates on cold-start], this
ends up being about 12-16 cycles.  Note that k0 and k1 are 2 registers
that the kernel is allowed to trash any time it feels like.
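
The CONTEXT register trick is what keeps this path so short.  As a
hedged illustration in C [assumed layout: 4KB pages and 4-byte PTEs,
per the R2000; the names are mine, not the kernel's], here is how
CONTEXT = (ptebase | badvpn) turns a miss into a single load:

	/* Sketch only: ptebase must be aligned so that OR-ing in the
	 * shifted VPN yields the address of the faulting page's PTE. */
	#define PAGE_SHIFT 12		/* assumed 4KB pages */

	unsigned long
	context_of(unsigned long ptebase, unsigned long badvaddr)
	{
		unsigned long vpn = badvaddr >> PAGE_SHIFT; /* faulting page # */
		return ptebase | (vpn << 2);		    /* index 4-byte PTEs */
	}

	/* The MMU computes this for you on every miss; the handler just
	 * does the lw through it and hands the PTE to the TLB. */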

Note there are no branches: we looked at checking for validity bits,
or checking whether the reference was a write going into a non-writable page,
but did it the way shown above instead.
Why? 

ANS: our usual reason: the statistics said it wasn't worth it.
It's usually a win to make frequent ops fast, at the expense of
longer path lengths for rare events.

If you put an invalid entry into the TLB, or an unwritable one,
as soon as you return to the faulting instruction, you will then
take a different kind of fault, to another vector, where you eventually
figure out one of the following [a C sketch of such a dispatcher follows
the list]:
	a) This was a reference "just under" the stack-top ==>
		extend the stack
	b) This was a wild reference outside anything mapped ==>
		signal the user with a mem fault
	c) This was a write to a truly-unwritable page ==>
		signal the user with a mem fault
	d) This was a write to a "copy-on-write" page ==>
		copy the page, mark the new PTE writable, flush the
		old one from the TLB
	e) This was a page fault to a demand-fill-zero page ==>
		give them a zero page.
	f) This was a page fault to a page that needs to be paged in ==>
		page it in
	g) This was a write to a writable page that is still clean ==>
		mark the PTE writable, replace the unwritable TLB entry
		with the writable version.
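
Here is that hedged C skeleton of the dispatcher [all names invented;
the real vectoring is machine- and kernel-specific]:

	enum fault_kind {
		STACK_GROW,	/* a) just under the stack-top */
		WILD_REF,	/* b) outside anything mapped */
		WRITE_PROT,	/* c) truly unwritable page */
		COW_WRITE,	/* d) copy-on-write page */
		FILL_ZERO,	/* e) demand-fill-zero page */
		PAGE_IN,	/* f) needs to be paged in */
		FIRST_WRITE	/* g) clean, writable page written at last */
	};

	void slowpath(enum fault_kind k)
	{
		switch (k) {
		case STACK_GROW:  /* extend the stack */		break;
		case WILD_REF:    /* signal user with a mem fault */	break;
		case WRITE_PROT:  /* signal user with a mem fault */	break;
		case COW_WRITE:   /* copy page, mark new PTE writable,
				   * flush old TLB entry */		break;
		case FILL_ZERO:   /* hand out a zero page */		break;
		case PAGE_IN:     /* start the disk I/O */		break;
		case FIRST_WRITE: /* mark PTE dirty, redo TLB entry */	break;
		}
	}
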
It is interesting that b) and c) are incredibly rare events, and you're
usually about to blow a process away, or do something else significant,
so taking an extra exception is no big deal.
Likewise, a), d), e), and f) involve substantial kernel activity,
and f) has disk activity.  Any UNIX that does copy-on-write will bypass
any hardware-set modify bits that the architecture provides.
That leaves g) as the ONLY case where a machine really uses hardware-set
dirty bits.
Here's a quiz: what's the order of magnitude of event g)?
1: Memory references
2: TLB misses
3: Paging I/Os
4: Page-ins

ANS: 4: even less frequent than page-ins, because:
	a) Text is read-only and stays that way
	b) When you read in a data page, you already know whether or not the
	access was a write (if you care), and you can mark the page dirty
	right off the bat.
	c) Only if you fault a page for a read, then later write it,
	or if you fault in a page on a write, but then page it out to
	swap space, do you take an extra exception.  Note that most kernels
	try hard to get free space by taking read-only pages that need not
	be written back. [Thus: I looked at one of our M/800s, doing about
	50-60 blocks paged-in per sec, and usually 0 paged-out, running
	about 200 context-switches/sec, 0% idle time, i.e., cranking
	right along.]  I'd guess that, in this environment, handling
	the extra overhead [and not having tuned for this case at all]
	costs about 600 microsecs/second, or about .06%....

To summarize: the transition from writable, but clean, to dirty is
a rare event.  For many such transitions, the OS wants to take action
anyway.  Hence, if there is any hardware cost to having hardware-set
dirty bits [and there usually is: in speed, or in complexity, or
in loss of flexibility because you make the hardware know where
page tables are and what they look like], then having them is
somewhat redundant.

Note: this is one I don't have exhaustive stats on: we knew early
that this was far enough down in the noise not to worry about.
For further info, see Clark & Emer, "Performance of the VAX-11/780
Translation Buffer: Simulation and Measurement," ACM Trans. on Comp. Sys.
3(1), 31-62 (Feb. 1985), a fine paper that was helpful to us in
designing the R2000 2 years ago [very timely!].

Multi-processors: here's what I can say:
The designs are not generally worse than for other CPUs.  Sometimes they're
actually simpler, or have less wasted hardware that one has to work around.
The basic principles [shared memory, UNIX multis] are:

a) Our TLB hardware NEVER modifies a TLB entry once it gets there.
Anything that would modify one takes an exception instead, which is exactly
what's desirable in an MP box.

b) If you do an exhaustive analysis of what goes on [in the vein of
the list above], what you find is:
	1) The normal cases work fast.
	2) It is possible for TLBs and main PTEs to get out of synch.
	If you go thru all of the transitions, it turns out that
	the out-of-synch cases are either:
		a) Innocent, such as another CPU marks a page Dirty,
		and I still have a non-writable TLB entry. [so what:
		I'll take an extra trap if I reference the page before
		I've had to reload that TLB entry anyway].
		OR
		b) Potentially dangerous, but really rare, and the transition
		is trappable.  For example, the page daemon running on
		another CPU invalidates a PTE that I still have a TLB entry of.
		This is tricky, and longer than I'm willing to describe here,
		but one observes that in most cases, such pages are going onto
		the reclaim list, and there's plenty of time to catch up with
		them and fix them, AS LONG AS YOU TRAP MODIFIES.
	3) In general, you statistically never need mass broadcast TLB-
	invalidates, via judicious use of "lazy evaluation as applied
	to operating systems".  This principle says that you never do
	costly, massive operations that intrude into critical paths,
	if you can defer the operation until later, and if you will
	frequently find later that a (likely) sequence of events will
	have occurred in between that helps you avoid having to do the
	massive operations at all.  For example, you COULD do a broadcast
	TLB-flush on all processors any time you change anything, but
	this would be awful.  On the other hand, if you put timestamps
	on TLB-flushing events, then when you dispatch a process you
	can tell whether or not it might have some inconsistent state,
	and flush the TLB THEN if you have to; but usually
	you won't have to, because you'll discover that you've flushed
	the TLB sometime in the meantime [see the sketch below].
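
A minimal sketch of that scheme in C [names invented; one TLB per CPU;
locking and the machine-dependent flush itself are omitted]:

	static unsigned long map_change_time;	/* bumped on any mapping change */

	struct cpu  { unsigned long last_flush; };  /* when this TLB last flushed */
	struct proc { unsigned long map_changed; }; /* when its mappings changed */

	void pte_changed(struct proc *p)
	{
		p->map_changed = ++map_change_time; /* no broadcast, no interrupt */
	}

	void dispatch(struct cpu *c, struct proc *p)
	{
		if (c->last_flush < p->map_changed) {
			/* stale entries possible: flush this CPU's TLB now */
			c->last_flush = map_change_time;
		}
		/* usually the test fails: some flush happened in between */
	}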

Well: that's enough.  Again, all this depends on knowing the relative
statistics of events, plus knowing what operating systems really do.

[SOAPBOX] good engineering is knowing what can safely be left out.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

ned@john.UUCP (05/01/87)

I found the Compcon article. I don't understand how EntryHi (the VPN
part of the TLB entry) gets set, based on guess-translation of the
opcodes in your sample.  I get an implication that k0/k1 are somehow
not in the general register set, but I didn't see that in the article
(or do you just not preserve two registers?)

I don't know Unix, so I don't know whether the reclaim list is a standard
thing, or exactly how it is used. I am familiar with several
systems using "clock" algorithms, which operate as follows [sketched
in C below the list]:
   - is the PTE used bit off? This page hasn't been used for a while.
     - is modified on? time to purify (write back) the page
     - else modified is off; this page can be re-used
   - else the PTE used bit is on; turn it off.
Each manipulation which turns off a bit in the PTE has to ensure that
it is off in all TLBs.
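
Sketched in C [all names invented; the write-back scheduling and the MP
locking are elided], the scan looks something like:

	#define NPAGES 1024

	struct swpte { int used, modified; };
	static struct swpte pte[NPAGES];
	static int hand;			/* the clock hand */

	int
	clock_scan(void)			/* a reclaimable page, or -1 */
	{
		int i;
		for (i = 0; i < 2 * NPAGES; i++) {
			struct swpte *p = &pte[hand];
			hand = (hand + 1) % NPAGES;
			if (p->used) {
				p->used = 0;	/* second chance; purge TLBs too */
			} else if (p->modified) {
				/* time to purify: schedule a write-back */
			} else {
				return p - pte;	/* unused and clean: re-use it */
			}
		}
		return -1;
	}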

In article <351@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>Multi-processors: here's what I can say:
>	2) It is possible for TLBs and main PTEs to get out of synch.
>	If you go thru all of the transitions, it turns out that
>	the out-of-synch cases are either:
>		a) Innocent, such as another CPU marks a page Dirty,
>		and I still have a non-writable TLB entry. [so what:
>		I'll take an extra trap if I reference the page before
>		I've had to reload that TLB entry anyway].
>		OR
>		b) Potentially dangerous, but really rare, and the transition
>		is trappable.  For example, the page daemon running on
>		another CPU invalidates a PTE that I still have a TLB entry of.
>		This is tricky, and longer than I'm willing to describe here,
>		but one observes that in most cases, such pages are going onto
>		the reclaim list, and there's plenty of time to catch up with
>		them and fix them, AS LONG AS YOU TRAP MODIFIES.


Re b), I don't know how you can trap modifies if write-enable is on in
some TLB.

To avoid instantaneous clearing of TLBs, it seems to me that you need
a reference count associated with each PTE (the number of TLBs that have
it cached).  The TLB replacement algorithm would manipulate the
counts of the victim and the new entry [a sketch below]. I expect there
is some other method involving periodic scanning of a list of changed
PTEs, followed by updating your TLB.
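
Here is that reference-count idea as a C sketch [structures invented;
real MP code would need locked or atomic count updates, and note that
the R2000's pseudo-random replacement gives software no eviction
notice, so tlb_evict() would have nothing to hang off of there]:

	struct mppte {
		unsigned long pte;	/* the PTE proper */
		int tlb_refs;		/* # of TLBs that may cache it */
	};

	void tlb_load(struct mppte *p)  { p->tlb_refs++; }  /* on a refill */
	void tlb_evict(struct mppte *p) { p->tlb_refs--; }  /* on replacement */

	int
	pte_invalidate(struct mppte *p)	/* 1 if no TLB can still hold it */
	{
		p->pte = 0;		/* clear the valid bit in memory */
		return p->tlb_refs == 0;
	}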

-----
disclaimer: generic disclaimer *

* - have you seen advertisements with '*'s, and no footnote?
It's the same with this disclaimer...