[comp.os.research] Anyone for memory management on the AM29000?

darrell@sdcsvax.ucsd.edu (Darrell Long) (04/29/87)

[ Based on the earlier discussion concerning memory management, I thought this ]
[ might be of interest.  -DL						       ]

From: tve@ethz.UUCP (Th. von Eicken)
Subject: Anyone for memory management on the AM29000?

Disclaimer:
We have the advance information data sheet and not (yet) the user's
manual, so our information may be outdated...

When reading the data sheet I noticed that the TLB entries
do not have any "page used" flag nor any "page modified"
flag. Does that mean that the AM29000 memory management is even
more crippled than the VAX's (which doesn't have a "page used" flag)?

On TLB misses, as far as I understand, a software trap is generated.
Are there any figures on typical interrupt routine times for handling
the misses? What is the performance penalty, compared to miss
handling in hardware?

	Thorsten von Eicken	tve@ethz.uucp
	ETH Zuerich		...!seismo!mcvax!cernvax!ethz!tve
	Switzerland
---

From: stuart@rochester.ARPA (Stuart Friedberg)
Subject: AM29000 memory management (was flame)

In article <67@bernina.UUCP>, tve@ethz.UUCP (Th. von Eicken) writes:
> When reading the data sheet I noticed that the TLB entries
> do not have any "page used" flag nor any "page modified"
> flag. Does that mean that the AM29000 memory management is even
> more crippled than the VAX's (which doesn't have a "page used" flag)?

A translation lookaside buffer (TLB) is not the same as page tables (PT).
The TLB serves as a cache of recently used address translations, while
the PT serves as the source of translation information.  Reference
(page used) and dirty (page modified) flags belong in the PT.

Stu Friedberg  {seismo, allegra}!rochester!stuart  stuart@cs.rochester.edu

---

From: tim@amdcad.AMD.COM (Tim Olson)
Subject: Re: AM29000 memory management (was flame)

In article <27207@rochester.ARPA>, stuart@rochester.ARPA (Stuart Friedberg) writes:
> In article <67@bernina.UUCP>, tve@ethz.UUCP (Th. von Eicken) writes:
> > When reading the data sheet I noticed that the TLB entries
> > do not have any "page used" flag nor any "page modified"
> > flag. Does that mean that the AM29000 memory management is even
> > more crippled than the VAX's (which doesn't have a "page used" flag)?
> 
> A translation lookaside buffer (TLB) is not the same as page tables (PT).
> The TLB serves as a cache of recently used address translations, while
> the PT serves as the source of translation information.  Reference
> (page used) and dirty (page modified) flags belong in the PT.
> 
> Stu Friedberg  {seismo, allegra}!rochester!stuart  stuart@cs.rochester.edu

Actually, the referenced and changed bits are for use by the page replacement
algorithm, and are associated with physical addresses.  The PTs, on the other
hand, are (usually) searched using virtual addresses.  Unless you use an
inverted page table (IPT) structure, the R & C bits should be placed in a
separate structure which can be "searched" with physical addresses.

The "best" place for the referenced and changed bits, however, are in an 
external memory array, which "watches" the bus and automatically updates the
R & C bits.  This array can also be read from or written to via I/O space
to read or clear the bits.

Benefits:

	1) Speed -- R & C bits do not need to be explicitly written to memory,
	   nor do they need to be periodically cleared from the TLB.

	2) single copy -- R & C bits exist in only one place and are always
	   up-to-date.  If they were to exist in the TLB, they would have to
	   be "flushed" from the TLB into the correct memory locations every
	   time the page replacement algorithm runs.  This would *really* be
	   a headache for multiprocessor systems which use shared memory.
	   Works with I/O, too.

	3) Fairly inexpensive & easy solution (heck, even the IBM RT-PC does
	   it this way!)

If you *really* want to keep the R & C bits in a software structure in main
memory, it can be done using the standard tricks.  EXERCISE FOR READER:

given a tlb entry which holds the following information:

	VTAG	-- virtual address for this translation
	  UR	-- page has user read permission
	  UW	-- page has user write permission
	  UE	-- page has user execute permission
	 RPN	-- real (physical) page number
	 PGM	-- two user-programmable bits (which also appear on the pins)
	   F	-- a user-programmable bit (doesn't appear on the pins)

devise a method to collect page reference and change statistics.


	-- Tim Olson
	Advanced Micro Devices
	Processor Strategic Development
	(tim@amdcad.AMD.COM)

---

From: shebanow@ji.Berkeley.EDU (Mike Shebanow)
Subject: Re: Branch prediction in the 532

In article <165100008@uiucdcsb> Arch Robison writes:
>
>Are there any good papers on branch prediction?  I've seen two schemes:
>
>	1) Put an extra bit in the instruction indicating which way the branch
>	   will probably go.  This is a compile-time prediction.
>
>	2) Keep a record of past branches, i.e. run-time prediction.   
>
>Though I've heard of examples of machines with each scheme, I've never
>seen any studies of their efficiency.

There have been many good papers on the subject. Here is a partial
bibliography:

	(Note: ISCA = International Symposium on Computer Architecture,
	IBM TDB = IBM Technical Disclosure Bulletin)

[1] S. McFarling and J. Hennessy, "Reducing the Cost of Branches," Proc.
    13th ISCA, June 1986, pp. 396-403.

[2] J. K. Lee and A. J. Smith, "Branch Prediction Strategies and Branch Target
    Buffer Designs," IEEE Computer, vol. 17, #1, Jan 1984.

[3] J. E. Smith, "A Study of Branch Prediction Strategies," 8th ISCA, May 1981,
    pp. 135-148.

[4] J. Losq, "Generalized History Table for Branch Prediction," IBM TDB,
    vol 25, #1, June 1982, pp. 99-101.

[5] A. Liles Jr. and B. E. Wilner, "Branch Prediction Mechanism," IBM TDB,
    vol 22, #7, Dec 1979, pp. 3013-3016.

[6] G. Rao, "Technique for Minimizing Branch Delay Due to Incorrect Branch
    History Table Predictions," IBM TDB, vol 25, #1, June 1982, pp. 97-98.

Based on the results published, the best prediction algorithms use run-time
information. As developed by IBM, a branch target buffer is kept in the
machine (BTB). The BTB is an associative table (direct mapped, n-way set
associative, fully associative, whatever) which uses the address of the
BRANCH as an index. Associated with the branch address are two pieces
of information (usually):

	- target address of the branch (where to go if branch taken -- OR,
	    perhaps the first few instructions at the target address (e.g.,
	    AM29000))
	- some prediction information (see below).

The type of prediction information used depends on the algorithm. A simple
scheme which works well is to use a 2 bit counter. The counter is incremented
(until it hits the value 3) when the branch is taken, and decremented (until
it hits the value 0) when the branch is not taken. When a branch is encountered,
it is predicted taken if the counter value is 2 or 3, and predicted not taken
if the counter value is 0 or 1. Lee and Smith describe a more complex scheme
in their paper which provides even better accuracy. Using these runtime
prediction algorithms, accuracies better than 90% have been achieved. With
the opcode method National is using, accuracies in the 70-80% range
are usual (as they report).
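The two-bit counter scheme described above is easy to sketch in a few lines of C; this is a minimal illustration of the counting rule, not any particular machine's BTB logic:

```c
#include <assert.h>

/* Two-bit saturating counter, as described above: values 0-3,
 * incremented on taken, decremented on not-taken, saturating
 * at the ends.  Predict taken when the value is 2 or 3. */
typedef unsigned char ctr_t;

static ctr_t ctr_update(ctr_t c, int taken)
{
    if (taken)
        return (ctr_t)(c < 3 ? c + 1 : 3);   /* saturate at 3 */
    return (ctr_t)(c > 0 ? c - 1 : 0);       /* saturate at 0 */
}

static int ctr_predict(ctr_t c)
{
    return c >= 2;   /* taken iff counter is 2 or 3 */
}
```

One nice property of the two-bit counter over a single history bit is hysteresis: a loop-closing branch that is almost always taken mispredicts only once per loop exit, not twice.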

I want to point out that accuracy is a poor indicator of performance. A much
better measure (for any particular architecture) is instruction block size
(IBS). IBS is defined to be the average number of instructions executed
between branch prediction mistakes. If we have a pipeline of depth 'd',
then an !APPROXIMATE! measure of the efficiency of this pipe (in the presence
of branch stalls, ignoring other types of stalls) is (IBS/(IBS + d)). For
small pipe depths, IBS tends not to matter. For deep pipes though, a small
value for IBS will kill you. I would be curious if anyone on comp.arch 
has these numbers.
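The approximate efficiency measure above is trivial to compute; the numbers in the note following the code are only to illustrate the point about pipe depth, not figures from any published study:

```c
#include <assert.h>

/* Approximate pipeline efficiency in the presence of branch stalls
 * only, per the formula above: eff = IBS / (IBS + d). */
static double pipe_efficiency(double ibs, int depth)
{
    return ibs / (ibs + (double)depth);
}
```

For example, with IBS = 8, a depth-2 pipe runs at about 80% efficiency while a depth-10 pipe drops to about 44% -- exactly the "deep pipes will kill you" effect.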

Mike Shebanow

---

From: bcase@amdcad.AMD.COM (Brian Case)
Subject: Re: Anyone for memory management on the AM29000?

In article <67@bernina.UUCP> tve@ethz.UUCP (Th. von Eicken) writes:
>When reading the data sheet I noticed that the TLB entries
>do not have any "page used" flag nor any "page modified"
>flag. Does that mean that the AM29000 memory management is even
>more crippled than the VAX's (which doesn't have a "page used" flag)?
>
>On TLB misses, as far as I understand, a software trap is generated.
>Are there any figures on typical interrupt routine times for handling
>the misses? What is the performance penalty, compared to miss
>handling in hardware?

Yeah, questions about the "missing" page referenced and modified bits
in the TLB are always among the first to be asked when people are
presented with the Am29000.  The deal is:  these bits don't belong in
TLB entries, they belong either in the page tables themselves or in
the physical page map (note that for inverted page tables, these
structures are (or can be) the same thing).  The VAX is brain-damaged
because the TLB reload is done by hardware (well, microcode) and it
forgets to take note of some of the information that OS guys would
like to have.  Since the Am29000 TLB reload is done by a software
routine, you not only can decide what the page tables look like, but
you can also decide whether or not to gather referenced and modified
information.

Note that referenced information is available degenerately from the
very fact that the TLB entry is present at all (the fact that the
TLB entry was fetched from the page table means that the page has
been referenced).  Page modified can be gathered in software too, if
you are willing to take the performance hit:  put the TLB entry for
the page into the TLB but set the write-protection bit(s) (one for
supervisor one for user); then, when a write to the page is attempted,
a protection violation trap will be taken; at this point, look in the
page table to make sure that the page is supposed to be read-only; if
not, then change the TLB entry to allow writing and count a page
modification in the page table (or physical page map).
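The write-protect trick just described can be sketched as a trap handler; the structures and field names below are purely illustrative, not the Am29000's actual entry formats:

```c
#include <assert.h>

/* Simplified page-table entry and TLB entry for the sketch. */
struct pte {
    int really_readonly;   /* page is genuinely read-only */
    int modified;          /* software-maintained "dirty" flag */
};

struct tlb_entry {
    int write_enable;      /* write-permission bit(s), collapsed to one */
    struct pte *pt;        /* back pointer to the page-table entry */
};

/* Protection-violation trap handler for a write access.
 * Returns 0 if the faulting write may be retried, -1 if it is
 * a genuine protection violation. */
static int write_fault(struct tlb_entry *e)
{
    if (e->pt->really_readonly)
        return -1;             /* really read-only: report the fault */
    e->pt->modified = 1;       /* count the modification in the PT */
    e->write_enable = 1;       /* later writes proceed without trapping */
    return 0;
}
```

Once the first write to a page has been counted, the TLB entry allows writes, so the performance hit is one trap per page, not one per store.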

But this is not the right way to do it anyway.  The right way is to
have a small RAM-based table in the memory controller keep track of
page modification:  there is very little overhead and the information
is maintained on a per-physical-page basis, just as it should be.
Also, it is probably the best way for multiprocessor systems.

I have written a paper about TLB reload for the Am29000, complete
with page table structures and code examples for two-level and inverted-
page tables.  There is also a discussion of TLB miss processing overhead
for a few of our benchmark programs (nroff, our assembler, puzzle,
etc.).  The overhead, in added cycles per instruction, is typically
less than 0.01 with the max (for the examples given) at 0.27 for the
"rm" command (this atrociously high number is due to the fact that
rm is a very short program so the cold-start penalty is a high
percentage of the total time).  The TLB miss ratios go from 1.50%
(yeech) for rm to < 0.01% for puzzle.  Four of the six programs have
TLB miss ratios < 0.05%, the next highest is 0.12% (nroff), and then
is rm at 1.50%.  Note that the only instructions in the Am29000 set
that can cause a TLB miss are jumps, calls, loads, and stores (well,
there can also be a TLB miss "caused" by the other instructions
when a page boundary is crossed, but the frequency of this event is
extremely low).  For the routines I wrote, the two-level TLB miss
handler takes 42 cycles while the inverted-page miss handler takes
63 cycles on average (both include all overhead and assume single-
cycle burst, two-cycle first access memories).

I can send copies of the paper to those interested.... (It's text
and graphics, so I can't just post.)

    bcase

---

From: lamaster@pioneer.arpa (Hugh LaMaster)
Subject: Re: Anyone for memory management on the AM29000?

In article <16336@amdcad.AMD.COM> bcase@amdcad.UUCP (Brian Case) writes:
>In article <67@bernina.UUCP> tve@ethz.UUCP (Th. von Eicken) writes:
>>When reading the data sheet I noticed that the TLB entries
>>do not have any "page used" flag nor any "page modified"
>>flag.

:

>Yeah, questions about the "missing" page referenced and modified bits
>in the TLB are always among the first to be asked when people are
>presented with the Am29000.  The deal is:  these bits don't belong in
>TLB entries, they belong either in the page tables themselves or in
>the physical page map (note that for inverted page tables, these
>structures are (or can be) the same thing).  The VAX is brain-damaged

:

>
>But this is not the right way to do it anyway.  The right way is to
>have a small RAM-based table in the memory controller keep track of
>page modification:  there is very little overhead and the information
>is maintained on a per-physical-page basis, just as it should be.
>Also, it is probably the best way for multiprocessor systems.
>
:

This may not be the best way to handle it if the architecture supports
multiple page sizes (and even multiple simultaneous page sizes), because the
page size in effect is not (easily) "known" by the memory controller.  

Even with an architecture-fixed page size, you have decreased the complexity
of the processor/TLB at the expense of the memory controller, which is OK for
a multi-chip/board CPU, but probably not the best for a microprocessor
(consider the history of MC68xxx memory management).


  Hugh LaMaster, m/s 233-9,  UUCP {seismo,topaz,lll-crg,ucbvax}!
  NASA Ames Research Center                ames!pioneer!lamaster
  Moffett Field, CA 94035    ARPA lamaster@ames-pioneer.arpa
  Phone:  (415)694-6117      ARPA lamaster@pioneer.arc.nasa.gov

"In order to promise genuine progress, the acronym RISC should stand 
for REGULAR (not reduced) instruction set computer." - Wirth

("Any opinions expressed herein are solely the responsibility of the
author and do not represent the opinions of NASA or the U.S. Government")

---

From: firth@sei.cmu.edu (Robert Firth)
Subject: Re: Branch prediction in the 532

In article <18537@ucbvax.BERKELEY.EDU> shebanow@ji.Berkeley.EDU.UUCP (Mike Shebanow) writes:
...
>Based on the results published, the best prediction algorithms use run-time
>information. As developed by IBM, a branch target buffer is kept in the
>machine (BTB). The BTB is an associative table (direct mapped, n-way set
>associative, fully associative, whatever) which uses the address of the
>BRANCH as an index. Associated with the branch address are two pieces
>of information (usually):
>
>	- target address of the branch (where to go if branch taken -- OR,
>	    perhaps the first few instructions at the target address (e.g.,
>	    AM29000))
>	- some prediction information (see below).

Mike, could you please elaborate on this, because after 2 days I find
myself no nearer to understanding it.  The purpose of the "prediction"
information seems to be to allow the processor to prefetch whichever
of the branch continuations (successor or destination) is more likely
to be needed.

But if you have a branch cache, surely BOTH continuations are available
already - one in the normal pipeline and the other in the cache.  So
who cares which is needed? - and why bother to predict it?

---

From: rich@motsj1.UUCP (Rich Goss)
Subject: Re: AM29000 memory management (was flame)

In article <27207@rochester.ARPA> stuart@rochester.ARPA (Stuart Friedberg) writes:
>In article <67@bernina.UUCP>, tve@ethz.UUCP (Th. von Eicken) writes:
>> When reading the data sheet I noticed that the TLB entries
>> do not have any "page used" flag nor any "page modified"
>> flag. Does that mean that the AM29000 memory management is even
>> more crippled than the VAX's (which doesn't have a "page used" flag)?
>
>A translation lookaside buffer (TLB) is not the same as page tables (PT).
>The TLB serves as a cache of recently used address translations, while
>the PT serves as the source of translation information.  Reference
>(page used) and dirty (page modified) flags belong in the PT.
>
>Stu Friedberg  {seismo, allegra}!rochester!stuart  stuart@cs.rochester.edu

The used and modified flags should be stored along with the page
address in the TLB. Otherwise, on every access to the page being
referenced, the MMU will have to check the page table descriptor in
main memory to see whether these two flags need to be updated.
For example, a page is read accessed for the first time
and the modify bit is brought into the TLB but not set. If the
page is read accessed again the entry for the page in the TLB is
correct and the page table descriptor in main memory need not be referenced.
However, the next access is a write access to the page. The modify
bit in the TLB is checked, then set, then the MMU should go out
to the page table descriptor in main memory to set the modify bit
in the page table descriptor to agree with the copy of the modify
bit in the TLB. The next access is a write to the page. The TLB
indicates the modify bit has already been set. Therefore the page
table descriptor is correct (the modify bit having already been
set) and no further action is required by the MMU. One can see
that if the modify bit was not cached in the TLB, the MMU would
have to go out to main memory every time the page is referenced 
in order to check and/or set the modify bit.
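The memory-traffic argument above can be made concrete with a small sketch; the names are illustrative, and the counter simply tallies trips to the in-memory descriptor:

```c
#include <assert.h>

/* In-memory page-table descriptor and TLB entry (simplified). */
struct descriptor { int modified; };
struct tlb_e {
    int modified_cached;       /* copy of the modify bit in the TLB */
    struct descriptor *d;
};

static int mem_accesses;       /* descriptor accesses to main memory */

/* Access a page whose translation is already in the TLB. */
static void access_page(struct tlb_e *t, int is_write)
{
    if (!is_write)
        return;                /* reads never touch the descriptor */
    if (t->modified_cached)
        return;                /* TLB says the descriptor is already set */
    t->modified_cached = 1;    /* first write: update both copies... */
    t->d->modified = 1;
    mem_accesses++;            /* ...at the cost of one memory trip */
}
```

Without the cached bit, the second test disappears and every write costs a descriptor access; with it, only the first write to each page does.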

I do not know how the 29000 MMU operates but the scenario I have
described is used in many demand paged MMU schemes including the
Motorola 68851 PMMU.

-- Rich Goss,
   Motorola Western Regional Field Applications Engineer for 68000 Family

---

From: henry@utzoo.UUCP (Henry Spencer)
Subject: Re: Anyone for memory management on the AM29000?

> When reading the data sheet I noticed that the TLB entries
> do not have any "page used" flag nor any "page modified"
> flag. Does that mean that the AM29000 memory management is even
> more crippled than the VAX's (which doesn't have a "page used" flag)?

Sounds reasonable.  MIPSCo did this too.  The result isn't "crippled",
because it takes a fair bit of hardware to add those flags and they don't
buy you very much over a software simulation.  For real programs, on a
given page each of those flags is either constantly on or rarely on,
meaning that the software-simulation overhead is quite low.  Notice that
not implementing the flags means that TLB entries never have to be written
back to main memory when they are replaced in the TLB, which makes TLB
management simpler and quicker.  On the whole, it's a win.  I looked at
TLB management a little bit a while ago, and formed a strong suspicion
that software simulation was actually the way to go even when the hardware
*does* support the flags!  (Software simulation is more flexible.)

> On TLB misses, as far as I understand, a software trap is generated.
> Are there any figures on typical interrupt routine times for handling
> the misses? What is the performance penalty, compared to miss
> handling in hardware?

Dunno about the AMD machine, but MIPSCo cited typical TLB-miss-handling
times in software (they do that too) substantially *shorter* than the
VAX 780's hardware times.  Given careful design, there need be little
penalty.
-- 
"If you want PL/I, you know       Henry Spencer @ U of Toronto Zoology
where to find it." -- DMR         {allegra,ihnp4,decvax,pyramid}!utzoo!henry

---

From: mash@mips.UUCP (John Mashey)
Subject: Re: AM29000 memory management (was flame)

In article <121@motsj1.UUCP> rich@motsj1.UUCP (Rich Goss) writes:
...discussions of TLBs that don't have hardware-set modify and access
bits...
>
>The used and modified flags should be stored along with the page
>address in the TLB. Otherwise, the MMU will always have to check
>the page table descriptor in main memory to see if these two
>flags have been updated on every access to the page being
>referenced......
>I do not know how the 29000 MMU operates but the scenario I have
>described is used in many demand paged MMU schemes including the
>Motorola 68851 PMMU.

1) It is true that many do work this way.  That is NOT a reason to
claim that they "should" work this way, or that designs without them
have trouble working.
2) I guarantee you that you don't
need to have hardware-set modify and reference bits in the TLB.
Machines work perfectly well without them.  The one I'm writing this on
works without them, and its performance is fine, and both System V and
4.3BSD ports came up quickly on it.
3) With some operating systems, the LAST thing in the world that
the OS wants is for the hardware to set modify bits anywhere without
trapping to the OS. For example, System V does copy-on-write handling,
and it ends up making pages look read-only that are really writable,
so it can trap writes, copy the page, and then give it a modifiable
page.  I.e., it has to fake out the hardware to get done what it wants.
4) All the statistics say that "change writable, but clean page to
dirty page" is an infrequent event, that happens on the order of
page-in rates, not on the order of memory access rates.  Hence it is
quite reasonable to do it in software.
5) Others have already talked on the rationale for doing TLBs this way.
See also "Operating System Support on a RISC", IEEE COMPCON,
San Francisco, March 1986, pp. 138-143: we talked about this a year ago.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086