[comp.arch] hardware support of reference and change bits

mark@hubcap.UUCP (Mark Smotherman) (04/21/88)

In reviewing hardware support for page replacement policies, I see that
IBM mainframes allocate the reference(/accessed) bit and the change(/dirty/
modified) bit in the storage key associated with the physical page frame.
On the other hand, modern microprocessors (e.g. 80386, NS32082 MMU) allocate
these bits in the page table entry.  The tradeoff seems to be extra opcodes
on the IBM versus slightly larger page table entries and some increase in
memory traffic (and invalidate signals) for the microprocessors.

On the IBM (XA), three instructions operate on the reference and change bits:
RRBE - reset reference bit extended, SSKE - set storage key extended, and
ISKE - insert storage key extended.  RRBE does the obvious, SSKE can be used
to reset the change bit, and ISKE places a copy of the storage key and
reference and change bits in a specified register.  The reference and change
bits are also set by any channel operations.
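
As a rough illustration of the three instructions just described, the
storage key can be modeled as one byte per physical page frame.  This is a
minimal sketch only: the class name StorageKeys, the method names, and the
bit positions are assumptions made for the example, and returning the old
bits from rrbe() is a stand-in for whatever the real instruction reports.

```python
# Simplified model of per-frame storage keys.  Bit positions are assumed
# for illustration, not quoted from the Principles of Operation.
REF_BIT = 0x04     # reference bit (assumed position)
CHG_BIT = 0x02     # change bit (assumed position)


class StorageKeys:
    def __init__(self, num_frames):
        self.keys = [0] * num_frames      # one key byte per physical frame

    def touch(self, frame, is_write):
        """What the hardware (CPU or channel) does on any access."""
        self.keys[frame] |= REF_BIT | (CHG_BIT if is_write else 0)

    def iske(self, frame):
        """Insert storage key: return a copy of the key, R and C included."""
        return self.keys[frame]

    def sske(self, frame, value):
        """Set storage key: replace the whole key, so it can clear C."""
        self.keys[frame] = value & 0xFF

    def rrbe(self, frame):
        """Reset reference bit: clear R, report the old (R, C) pair."""
        old = self.keys[frame]
        self.keys[frame] = old & ~REF_BIT
        return bool(old & REF_BIT), bool(old & CHG_BIT)
```

Note that the bits hang off the frame number, not the virtual address, so
no page table entry is touched when they change.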

I don't have any hardware manuals available for the 386 or 32082 that give
full descriptions, but I assume they work in the following way.

1. Given the absence of the page table entry from the TLB:
  * Upon a reference, the page table entry is brought into the TLB and
    the reference bit is inspected.  If the reference bit is zero, then
    it is set to one.  This update of the entry must be done in a store-
    through manner.  That is, not only should the TLB copy of the page
    table entry be updated, but the copy of the entry in the cache should
    also be updated.  (Of course, a designer could eliminate the inspection
    of the reference bit during a critical path by performing the store-
    through each time.)  An additional store-through to main memory and
    cache invalidate signal would be needed in a multiprocessor.  There
    would not be a need for a TLB invalidate signal.
  * Upon a change, the page table entry is processed as above, only with
    both bits set to one.  The condition causing memory traffic is if
    either bit is zero (00, 01, or 10).

2. Given that the entry is currently in the TLB:
  * A reference has no effect, since to be in the TLB the reference bit
    must already be set.
  * Upon a change, the update of the page table entry must be done in a
    store-through manner only if the change bit in the TLB is not already
    set to one.  (Eliminating this inspection would seem to be relatively
    expensive in terms of the additional memory traffic.  Aren't writes
    about 30% of memory operations, as a rule of thumb?)
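
Cases 1 and 2 can be condensed into one small decision function.  This is
only a sketch of the scheme as assumed in this post, not of what the 80386
or NS32082 actually do; the function name and the returned tuple are made
up for illustration.

```python
def access(ref, chg, is_write):
    """Return (new_ref, new_chg, store_through) for one memory access.

    On a TLB miss, (ref, chg) are the bits found in the page table entry.
    On a TLB hit, they are the bits in the TLB copy, where ref is already 1.
    A store-through of the entry is assumed to happen only when some bit
    actually goes from 0 to 1.
    """
    new_ref = 1
    new_chg = 1 if is_write else chg
    store_through = (new_ref, new_chg) != (ref, chg)
    return new_ref, new_chg, store_through
```

For example, access(1, 0, True) reports a store-through (first write to a
page already in the TLB), while access(1, 1, True) does not.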

3. Immediately after an instruction changes a page table entry (e.g. reset
the reference or change bits), the TLB must be purged.  For multiprocessors
the cache must also be purged (or the change must have been a store-through)
and invalidate signals sent to the other processors to purge their TLBs and
caches.

For those who know, is this truly how things work?  Do you have any idea
(or better yet, any measurements) of the amount of memory traffic involved
in the setting of the reference and change bits?  Can I/O processors (DMA
or whatever) on these micros affect these bits?


P.S.  The IBM XA Principles of Operation manual gives an unusual disclaimer
on the reference bit:  "The reference bit may be set to one by fetching data
or instructions that are neither designated nor used by the program, and,
under certain conditions, a reference may be made without the reference bit
being set to one.  Under certain unusual circumstances, a reference bit may
be set to zero by other than explicit program action." [p. 3-11, March 1983
edition of SA22-7085-0]  The first case would appear to be caused by
prefetching that crosses page boundaries (e.g. branch target buffers).  The
other two cases elude me.
--

Mark Smotherman, Comp. Sci. Dept., Clemson University, Clemson, SC 29634
INTERNET: mark@hubcap.clemson.edu    UUCP: gatech!hubcap!mark

jamesa%betelgeuse@Sun.COM (James D. Allen) (04/21/88)

In article <1458@hubcap.UUCP>, mark@hubcap.UUCP (Mark Smotherman) writes:
> P.S.  The IBM XA Principles of Operation manual gives an unusual disclaimer
> on the reference bit:  "The reference bit may be set to one by fetching data
> or instructions that are neither designated nor used by the program, and,
> under certain conditions, a reference may be made without the reference bit
> being set to one.  Under certain unusual circumstances, a reference bit may
> be set to zero by other than explicit program action." [p. 3-11, March 1983
> edition of SA22-7085-0]  The first case would appear to be caused by
> prefetching that crosses page boundaries (e.g. branch target buffers).  The
> other two cases elude me.

370 Models 165-II, 168, 3032, 3033 (and 308x, 309x ?) do not set the
reference bit on cache hits.

I don't know about "setting the reference bit to zero by other than explicit
program action" but the 168 family stored three copies of each reference
bit and used "majority rule" instead of parity-checking so the clause may
have been intended as a loophole for hardware errors.
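
For readers unfamiliar with "majority rule" storage, the idea can be
sketched as below.  This illustrates triple redundancy in general and is
not a description of the 168's circuitry; the function name is
hypothetical.

```python
def majority3(a, b, c):
    """Bitwise majority of three copies: each result bit takes the value
    held by at least two of the three inputs."""
    return (a & b) | (a & c) | (b & c)
```

A single corrupted copy is outvoted, but if two copies of a set bit are
lost the vote silently reads zero, which would fit the manual's wording.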

jeff@Alliant.COM (Jeff Collins) (04/23/88)

In article <1458@hubcap.UUCP> mark@hubcap.UUCP (Mark Smotherman) writes:
>
	Removed a discussion about IBM reference and dirty bits.
>
>I don't have any hardware manuals available for the 386 or 32082 that give
>full descriptions, but I assume they work in the following way.
>
>1. Given the absence of the page table entry from the TLB:
>  * Upon a reference, the page table entry is brought into the TLB and
>    the reference bit is inspected.  If the reference bit is zero, then
>    it is set to one.  This update of the entry must be done in a store-
>    through manner.  That is, not only the should the TLB copy of the page
>    table entry be updated, but the copy of the entry in the cache should
>    also be updated.  (Of course, a designer could eliminate the inspection
>    of the reference bit during a critical path by performing the store-
>    through each time.)  An additional store-through to main memory and
>    cache invalidate signal would be needed in a multiprocessor.  There
>    would not be a need for a TLB invalidate signal.
>  * Upon a change, the page table entry is processed as above, only with
>    both bits set to one.  The condition causing memory traffic is if
>    either bit is zero (00, 01, or 10).

	On a multiprocessor the decision to write the PTE back to main memory
	or not is determined by the cache protocol.  If it is write-through, 
	then yes, the PTE must be written back to memory.  If the cache is
	write-back, then it may not be written back to memory.  

	When the hardware sets a reference and/or modified bit in the TLB, the
	operating system does not know that the bit is being set, it is
	automatic.  Given that the software does not know that the bit is set,
	there is no way to tell the other processors to perform an invalidate.

	Instead there are two ways to solve this race condition.  One is to
	not share PTEs.  This means that each process has private copies of
	the hardware page tables.  When an update is made by hardware to the
	TLB, no one else cares because no one else could have the PTE cached
	in the TLB (this assumes the TLB is flushed on context switch).

	If the operating system allows shared PTEs (this would be done to
	allow multiple processes to share memory), then the problem can be
	effectively ignored.  With reference bits it is not very important if
	they become inconsistent.  It only means that you lose a little
	accuracy on your working set calculations.  With modified bits it is
	very important to keep them consistent, or to not care what they are.
	This can be done by always assuming that shared data is dirty, or
	never releasing it - either solution works.  

>
>2. Given that the entry is currently in the TLB:

	Eliminated this text as I had nothing to add.
>
>3. Immediately after an instruction changes a page table entry (e.g. reset
>the reference or change bits), the TLB must be purged.  For multiprocessors
>the cache must also be purged (or the change must have been a store-through)
>and invalidate signals sent to the other processors to purge their TLBs and
>caches.

	This is close.  If the operating system clears a referenced or
	modified bit on a shared PTE, then it must purge its TLB and cause
	all of the other processors that could have the PTE cached in the TLB
	to purge.  Again note this is only a problem with shared PTEs.

	The cache does not need to be purged.  When the operating system
	writes the PTE entry, it writes to the cache/memory system.  The cache
	will contain the correct version after the reset, the TLB contains an
	old version - which is why the TLB entry must be purged.  (by the way
	most of the MMUs allow a single entry to be purged, instead of the
	whole TLB)
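
The ordering described here (write the PTE, then purge the stale TLB
copies) can be sketched in a toy model.  Everything in it is hypothetical:
the Cpu class, the dictionary-based page table, and clear_referenced() are
stand-ins for illustration and do not correspond to any particular MMU or
kernel.

```python
class Cpu:
    def __init__(self, page_table):
        self.page_table = page_table   # shared PTEs: vpn -> {"ref": 0/1, "mod": 0/1}
        self.tlb = {}                  # this CPU's private cached copies of PTEs

    def read(self, vpn):
        if vpn not in self.tlb:        # TLB miss: hardware sets ref, caches the PTE
            self.page_table[vpn]["ref"] = 1
            self.tlb[vpn] = dict(self.page_table[vpn])
        # On a TLB hit nothing is written; the cached copy already has ref=1.

    def purge_entry(self, vpn):
        self.tlb.pop(vpn, None)        # single-entry purge, not a full flush


def clear_referenced(page_table, vpn, cpus):
    page_table[vpn]["ref"] = 0         # the write goes to the cache/memory system
    for cpu in cpus:                   # every CPU that might hold this PTE
        cpu.purge_entry(vpn)
```

Without the purge loop, a later read on a CPU that still holds the entry
would hit in its TLB and the cleared bit in memory would never be set
again.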
>
>For those who know, is this truly how things work?  Do you have any idea
>(or better yet, any measurements) of the amount of memory traffic involved
>in the setting of the reference and change bits?  Can I/O processors (DMA
>or whatever) on these micros affect these bits?
>

	The setting/clearing of the referenced and modified bits is not a big
	deal (i.e., it doesn't cause a lot of bus traffic).  This is because it
	will only cause traffic the first time it is changed, and that is a
	very small percentage of the overall number of processor reads and
	writes.
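
The "first time only" argument can be made concrete with a small counting
sketch.  It assumes a bit is written back only on its 0 -> 1 transition
and that nothing resets the bits during the trace; the function name and
trace format are invented for the example.

```python
def count_pte_updates(accesses):
    """accesses: iterable of (page, is_write) pairs.

    Returns how many accesses had to update a PTE, assuming an update
    happens only when a referenced or modified bit goes from 0 to 1.
    """
    referenced, modified, updates = set(), set(), 0
    for page, is_write in accesses:
        changed = page not in referenced or (is_write and page not in modified)
        referenced.add(page)
        if is_write:
            modified.add(page)
        updates += changed             # True counts as 1
    return updates
```

Under these assumptions a page contributes at most two updates between
resets, however long the trace is.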

	To re-emphasize the multiprocessor issues here - the only trouble is
	with shared user level PTEs.  Note that shared pages do not
	necessarily imply shared PTEs.  It is possible to build virtual memory
	systems that have shared pages and private PTEs - this is what Mach
	and Encore (Umax 4.2) do.  This saves the invalidates and the
	consistency problems.

	I/O processors do not use these bits (they make physical accesses).

mash@mips.COM (John Mashey) (04/26/88)

In article <1647@alliant.Alliant.COM> jeff@alliant.UUCP (Jeff Collins) writes:
....
>	When the hardware sets a reference and/or modified bit in the TLB, the
>	operating system does not know that the bit is being set, it is
>	automatic.  Given that the software does not know that the bit is set,
>	there is no way to tell the other processors to perform an invalidate.
....
>
>	The setting/clearing of the referenced and modified bits are not a big
>	deal (ie. they don't cause a lot of bus traffic).  This is because it
>	will only cause traffic the first time it is changed, and that is a
>	very small percentage of the overall number of processor reads and
>	writes.

At least some current machines, especially several of the RISC systems
(HP Precision, MIPS, AMD 29K) use TLBs that do software refill, and
trap on transitions [such as attempts to set the modify bit].

Current UNIXes often forbid the hardware from writing modify bits directly,
in order to do copy-on-write processing.  In other words, they go around
a hardware feature that often adds substantial complexity to a design,
in order to do what they really want.  Current UNIXes almost always want to
trap the first write to a page, unless the first reference to a page is
a write, not a read, so that the kernel knows that the page should be allocated
as dirty in the first place.

Frequencies of transition between clean-but-writable and dirty vary according
to the UNIX variant.  The cases are as follows:

	a) 1st reference to a data page is a read, so that a copy of the data
	is brought into memory.  Later a write occurs.
	b) 1st reference to a BSS page is a read.  Create a page of zeroes.
	Later, a write occurs.
	c) Fork with copy-on-write is used.  Copy the page tables, mark
	everything read-only, then copy the pages when written.
	d) Use copy-on-write for mapped files, for buffer cache, etc.
	e) User attempts to write to a truly nonwritable data page.

If you look at these cases, you find either that the frequency is low
(on the order of disk event rates), or that there is substantial overhead
(like zeroing a page), or that you're about to kill the process anyway.
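
A toy version of the trap handler implied by these cases might look like
the following.  It is a sketch under stated assumptions, not any UNIX
kernel's code: the PTE is a plain dictionary, and the field names
('writable', 'cow', 'hw_write', 'dirty', 'frame') and the copy_page
callback are all hypothetical.

```python
def handle_write_fault(pte, copy_page):
    """Handle the trap taken on the first write to a page.

    'writable' is what the process is allowed to do, 'cow' marks a page
    shared copy-on-write, 'hw_write' is what the TLB entry currently
    permits, and 'dirty' is the software-maintained modify bit.
    """
    if not pte["writable"]:
        return "signal"                # case (e): truly nonwritable page
    if pte["cow"]:
        pte["frame"] = copy_page(pte["frame"])   # cases (c), (d): private copy
        pte["cow"] = False
    pte["dirty"] = True                # software, not hardware, records the modify
    pte["hw_write"] = True             # later writes to this page no longer trap
    return "retry"
```

The trap is taken once per clean-to-dirty transition, so its cost is paid
at the low rates listed above rather than on every store.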

Although this has nothing particular to do with RISCs, a number of them
(such as the HP Precision, MIPS R2000, and AMD 29K, at least) do TLB
handling in software, and generally trap modifies rather than letting the
hardware do it.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

kds@mipos2.UUCP (05/12/88)

Using software to change modify/use bits, or even to do TLB refills is not
usually an option on cisc type machines because the time to switch into the
service call is usually quite a bit longer than on something like the mips.
In fact, the mips dedicates some of the user visible general purpose registers
just so they don't have to save them when they want to service, for example,
a TLB refill.  Which isn't to say either method is bad, but just that the
same solution isn't necessarily applicable to the same problem in two different
environments.

You don't have to break many eggs to hate omlets -- Ian Shoales

Ken Shoemaker, Microprocessor Design, Intel Corp., Santa Clara, California
uucp: ...{hplabs|decwrl|amdcad|qantel|pur-ee|scgvaxd|oliveb}!intelca!mipos3!kds