mslater@cup.portal.com (06/11/88)
The following article on the 88200 CMMU is excerpted from the June
issue of Microprocessor Report, which is a subscription newsletter (in
print, not normally available electronically) written for designers of
microprocessor-based hardware. Any comments would be appreciated;
please send e-mail, and I'll summarize to the net.

------

One of the strengths of Motorola's 88000 design is that the 88200 CMMU
chips provide the most highly integrated cache memory solution
available. Although the 88100 can be used without the 88200, most
88000 systems will include at least two 88200s--one for instructions
and one for data. Up to four 88200s can be paralleled to increase the
size of the cache and address translation buffers.

The 88200s connect to the 88100 processor via the non-multiplexed
P-bus, and to the memory system via the multiplexed M-bus. There are
separate P-buses for instructions and for data. Most systems use a
common M-bus and common main memory for programs and for data, but it
is possible to keep them separate.

Memory management

The 88200 includes two address translation caches, called the block
address translation cache (BATC) and the page address translation
cache (PATC). (Address translation caches are often called
translation look-aside buffers, or TLBs.)

The PATC holds a subset of the full address translation table, which
is stored in main memory. This tree-structured translation table
consists of a segment table containing up to 1024 segment pointers,
each of which selects a 1024-entry page table. (Actually, there are
two of these trees--one for supervisor accesses, and one for user
accesses.) Each segment is 4 Mbytes, and each page is 4 Kbytes.

The operating system must maintain this table in main memory; the
CMMU chips automatically fetch entries from memory as needed. No
software intervention is required when there is a PATC miss; the CMMU
hardware will automatically search the table in main memory and load
the required descriptor. Once an entry has been fetched by the CMMU,
it remains in the address translation cache until the cache is full;
then, the next time another table entry is needed, the oldest entry in
the translation cache is discarded.

The BATC has 10 entries, and maps blocks of 512 Kbytes. The PATC has
56 entries, each of which maps a 4-Kbyte page. Both translation
caches are fully associative. The BATC is intended for use by the
operating system and other programs using large contiguous areas of
memory; without the BATC, large programs would require many PATC
entries. The relatively small page size of the PATC supports
demand-paged operating systems. The BATC must be explicitly set up by
the system software; unlike the PATC entries, the BATC entries are not
automatically fetched from a table in main memory.

In addition to specifying the physical address, each page descriptor
includes protection and control information. Descriptor flags can be
set on a block, segment, or page level to mark an area as cacheable or
not, global (shared) or non-sharable, and write-through or copy-back.
"Used" and "modified" flags are maintained for each page to support
demand-paged virtual memory.

Data/instruction cache

The data/instruction cache is the largest available on a single
chip--16 Kbytes. Figure 1 shows the block diagram of the 88200. The
cache is 4-way set-associative, so most thrashing problems are
eliminated. When multiple CMMU chips are used in parallel, they
effectively increase the associativity of the cache; two chips in
parallel act as an 8-way set-associative cache.
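To make the two-level walk concrete, here is a rough C sketch of how a
32-bit logical address breaks down under the scheme just described:
10 bits of segment index, 10 bits of page index, and a 12-bit page
offset. The names and descriptor layouts are illustrative stand-ins,
not the 88200's actual formats.

#include <stdint.h>

/* Illustrative two-level walk: a 1024-entry segment table whose
 * entries each point to a 1024-entry page table of 4-Kbyte pages, so
 * each segment covers 4 Mbytes. Real 88200 descriptors also carry
 * protection, cacheability, global, used, and modified bits, and
 * there is one such tree for supervisor and one for user accesses.
 */
#define SEG_SHIFT    22          /* A31..A22: segment index      */
#define PAGE_SHIFT   12          /* A21..A12: page index         */
#define OFFSET_MASK  0x00000FFFu /* A11..A0: untranslated offset */

uint32_t translate(uint32_t va, uint32_t *const seg_table[1024])
{
    uint32_t seg  = va >> SEG_SHIFT;             /* 0..1023 */
    uint32_t page = (va >> PAGE_SHIFT) & 0x3FFu; /* 0..1023 */

    /* The segment descriptor locates one page table; the page
     * descriptor holds a 4-Kbyte-aligned physical frame address. */
    uint32_t *page_table = seg_table[seg];
    uint32_t  frame      = page_table[page] & ~OFFSET_MASK;

    return frame | (va & OFFSET_MASK);
}

On a PATC miss, the CMMU performs this walk in hardware; software only
has to keep the tables in main memory consistent.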
Note that adding CMMUs increases the size of the PATC and BATC as well
as the data/instruction cache.

Each cache line includes a "disable" bit that allows the self-test
software to turn off any cache line that shows errors. Since RAM
cells rarely go bad once they have passed manufacturing testing, this
suggests that Motorola may have plans to increase their effective
yield by selling less-expensive versions in which not all cache lines
are guaranteed to work.

The MMU delay is overlapped with the cache access to provide
zero-wait-state performance even though it is a physical, rather than
a virtual, cache. Since the MMU's page size is 4K, the
least-significant 12 bits of the address are not translated, and thus
are not delayed by the MMU. Eight of these bits (A4--A11) select one
of 256 cache sets at the same time that the MMU is translating the
upper bits. When the MMU translation is complete, the translated
address is compared with the address tags of the four lines in the
set. If any of the four match, then there is a cache hit. Address
bits A2 and A3 then select one of the four words in the line, and the
four byte enable signals from the CPU (which eliminate the need for A0
and A1) enable the appropriate bytes.

When a read miss occurs, the entire cache line is filled. The CMMU
uses a 4-word block transfer on the M-bus to get the data from main
memory. The transfer itself takes a minimum of five clock cycles, one
for the address cycle and four to transfer the data. Another three
clock cycles are used in the CMMU, so the minimum delay on a miss is
eight clock cycles. With real-world DRAMs for main memory, this delay
is extended to at least 10 clock cycles. Because the access is a
four-word burst, page-mode and nibble-mode access techniques can be
used.

One shortcut taken in the 88200 design is that the P-bus access is not
satisfied until the entire line is filled. A more sophisticated
approach is to fill the needed word first, and immediately pass it
along to the CPU; the cache controller can then fill the remainder of
the line.
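As an illustration of the overlapped lookup, here is a simplified C
model of the address split and set search. The arrays and names are
stand-ins for hardware, not actual 88200 internals; in the chip, the
set selection and the MMU translation proceed in parallel.

#include <stdbool.h>
#include <stdint.h>

/* 16 Kbytes organized as 256 sets x 4 ways x 16-byte lines. The set
 * index comes from A11..A4, which lie below the 4K page boundary and
 * so need no translation; the tag compare uses the physical address
 * the MMU produces.
 */
#define WAYS 4
#define SETS 256

struct line {
    uint32_t tag;       /* physical address bits above A11     */
    bool     valid;
    bool     disabled;  /* set by self-test on a failing line  */
    uint32_t words[4];  /* 16-byte line; A3:A2 select the word */
};

static struct line cache[SETS][WAYS];

bool lookup(uint32_t vaddr, uint32_t paddr, uint32_t *word_out)
{
    uint32_t set = (vaddr >> 4) & 0xFFu; /* A11..A4, untranslated */
    uint32_t tag = paddr >> 12;          /* ready after the MMU   */

    for (int way = 0; way < WAYS; way++) {
        struct line *l = &cache[set][way];
        if (l->valid && !l->disabled && l->tag == tag) {
            *word_out = l->words[(vaddr >> 2) & 0x3u]; /* A3:A2 */
            return true;    /* hit: zero wait states */
        }
    }
    return false;           /* miss: fill the whole 4-word line */
}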
Snooping maintains cache coherency

Either a write-through or copy-back memory update approach can be
used, and is selected by a control bit in the page descriptor.
Copy-back provides higher performance for multiprocessor systems, but
requires that read snooping be implemented to guarantee cache
coherency. The 88200 includes the snooping logic.

Even when copy-back is selected, the first write to a cached location
is written through. This updates the main memory and invalidates any
other copies of the data that may be present in other caches, to
ensure that no more than one cache contains a modified version of the
data. Successive writes to that memory location are then not written
through to memory. When the cache line (or the entire cache) is
flushed, it is written back to main memory if it is "dirty"--that is,
if it has been modified since it was last written to main memory.

One of the page and segment descriptor control bits indicates whether
or not that page or segment is global, meaning that it can be shared
among processors. This information is output on the M-bus during each
access. All CMMUs monitor M-bus transactions, and any access to a
global location causes the snooping logic to be activated. The
address is compared to the cache address tags to see if the addressed
memory location is present in the cache, and if it has been modified.
If so, the cache chip asserts the M-bus reply code signals to abort
the transaction.

The cache chip that made the original request must give up the bus,
and wait at least one clock cycle before re-requesting it. During
this one-clock delay, the cache that detected the snooping hit takes
control of the M-bus and updates the main memory. When it releases
the M-bus, the requesting cache retries its read cycle, and gets the
updated data from main memory. Another approach to this problem is
for the cache to provide the data directly to the requesting device,
rather than aborting the cycle and requiring that the requestor retry
the access after main memory has been updated. This approach offers
slightly higher performance, but is more complex to implement.

Because the snooping logic requires access to the cache tags whenever
an M-bus access to a global location occurs, normal operation of the
cache is affected. If a P-bus request occurs while the snooping logic
is checking the tags, a wait state is generated. This causes some
drop in performance in multiprocessor systems if there are many M-bus
accesses to global memory locations. Eliminating this problem
requires a duplicate set of tag bits for the snooping logic, and
Motorola's designers felt that the silicon area required was not
justified. The CMMU provides additional tag state outputs to allow an
external set of tags to be maintained, so high-performance
multiprocessor systems can implement snooping externally to the CMMU
and eliminate this performance degradation.

Note that snooping is not needed on the instruction cache, since
programs should not be modified while they are executing.

If the cache chip's snooping logic detects an M-bus write to a cached
location, it invalidates that cache line. Note that the other aspects
of the cache coherency protocol guarantee that the invalidated data is
merely a copy of data in main memory.

------

Michael Slater, Editor and Publisher, Microprocessor Report
550 California Avenue, Suite 320, Palo Alto, CA 94306;
mslater@cup.portal.com
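For readers who want the snoop exchange spelled out, here is a rough C
model of the abort-and-retry sequence described in the article. Every
name here is invented for illustration; the real mechanism is a
hardware handshake on the M-bus, not software.

#include <stdbool.h>
#include <stdint.h>

enum reply { REPLY_OK, REPLY_ABORT }; /* simplified M-bus reply codes */

/* Hypothetical primitives standing in for hardware operations. */
extern bool       cache_has_dirty_line(uint32_t paddr); /* tag check */
extern void       write_line_back(uint32_t paddr);
extern enum reply mbus_read(uint32_t paddr, uint32_t line[4]);
extern void       wait_one_clock(void);

/* Snooping CMMU: tag check on another cache's global M-bus read. */
enum reply snoop_global_read(uint32_t paddr)
{
    if (cache_has_dirty_line(paddr)) {
        /* Abort the transaction; in hardware, this write-back
         * occupies the M-bus during the requester's one-clock
         * back-off window. */
        write_line_back(paddr);
        return REPLY_ABORT;
    }
    return REPLY_OK;
}

/* Requesting CMMU: give up the bus on an abort, then retry. */
void read_global_line(uint32_t paddr, uint32_t line[4])
{
    while (mbus_read(paddr, line) == REPLY_ABORT)
        wait_one_clock();  /* the retry sees the updated memory */
}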
andrew@frip.gwd.tek.com (Andrew Klossner) (06/12/88)
[] "The cache is 4-way set-associative, so most thrashing problems are eliminated. When multiple CMMU chips are used in parallel, they effectively increase the associativity of the cache; two chips in parallel act as an 8-way set-associative cache." No, it's still 4-way set-associative. Since exactly one CMMU will see any one page reference, a set hit will still flush one of the 4 existing set elements. "Even when copy-back is selected, the first write to a cached location is written through. This updates the main memory and invalidates any other copies of the data that may be present in other caches, to ensure that no more than one cache contains a modified version of the data." It writes back the entire 16-byte cache line, regardless of how much of it was modified. And it does so even if the page is marked "not global," meaning that software guarantees that the cache line is not present in any other cache. Ouch! "Because the snooping logic requires access to the cache tags whenever an M-bus access to a global location occurs, normal operation of the cache is affected. If a P-bus request occurs while the snooping logic is checking the tags, a wait state is generated. This causes some drop in performance in multiprocessor systems... " Every snooped access causes all CMMUs to stop servicing the CPU(s) while they do the tag check. Snooping is *very* expensive. You only use it for small pieces of memory shared among CPUs. "Note that adding Cmmus increases the size of the PATC and BATC as well as the data/instruction cache." Adding CMMUs creates some interesting problems. With more than one CMMU on a memory port (instruction or data), you pretty much have to use part of the memory address as a chip selector. (We use A12 and A13 to select one of four CMMUs.) But this memory address is *virtual*, so suddenly we have a (somewhat) virtual cache with aliasing problems. For example, if physical page 12 is mapped to virtual page 16, it is serviced by CMMU0; if the kernel remaps it to virtual page 17, it is serviced by CMMU1. The aliasing problems can be solved by snooping all of memory, but this is prohibitively expensive. The kernel can flush a page from cache when freeing it, but a cache page flush takes a minimum of 256 cycles. Two solutions that we're looking at are: 1) make the "software page cluster size" be 4 pages; that is, always allocate 4 contiguous physical pages together. This turns it back into a physical cache. On the downside, this makes for higher internal fragmentation and wastes three-fourths of the PATC. 2) Maintain four separate lists of free pages, one for each of the four values of <A13:A12>, and allocate physical pages so that a page is always serviced by the same CMMU. When there are no free pages in the right list to back a new virtual page, allocate a page from some other list, and flush it from the old CMMU. Credit for this idea goes to the Motorola Unix kernel group in Tempe, Arizona. Note that, when you want to enlarge a cache, you end up buying multiple MMUs to go with your additional RAM. This is pretty pricey, but it can provide well scheduled software with additional opportunities for parallelism: since memory loads and stores are pipelined, a load from one CMMU can wait on a page table walk while a load from a second CMMU can be serviced from a cache hit. On the whole, it's a neat part. -=- Andrew Klossner (decvax!tektronix!tekecs!andrew) [UUCP] (andrew%tekecs.tek.com@relay.cs.net) [ARPA]
aglew@urbsdc.Urbana.Gould.COM (06/14/88)
> "Note that adding Cmmus increases the size of the PATC and BATC > as well as the data/instruction cache." > >Adding CMMUs creates some interesting problems. With more than one >CMMU on a memory port (instruction or data), you pretty much have to >use part of the memory address as a chip selector. (We use A12 and A13 >to select one of four CMMUs.) But this memory address is *virtual*, so >suddenly we have a (somewhat) virtual cache with aliasing problems. >For example, if physical page 12 is mapped to virtual page 16, it is >serviced by CMMU0; if the kernel remaps it to virtual page 17, it is >serviced by CMMU1. The aliasing problems can be solved by snooping all >of memory, but this is prohibitively expensive. The kernel can flush a Couldn't you use part of the address that is untranslated to do the select?
smv@necis.UUCP (Steve Valentine) (06/15/88)
In article <10067@tekecs.TEK.COM> andrew@frip.gwd.tek.com (Andrew Klossner) writes:
>Adding CMMUs creates some interesting problems. With more than one
>CMMU on a memory port (instruction or data), you pretty much have to
>use part of the memory address as a chip selector. (We use A12 and A13
>to select one of four CMMUs.) But this memory address is *virtual*, so
>suddenly we have a (somewhat) virtual cache with aliasing problems.
...
[ Deleted suggestions for how to avoid the problem by flushing or using
separate free lists. ]
> -=- Andrew Klossner (decvax!tektronix!tekecs!andrew) [UUCP]
>     (andrew%tekecs.tek.com@relay.cs.net) [ARPA]

Why do you care about this aliasing? If you pull a page from the
freelist, it is generally going to fall into one of two categories:

Page reclaim on a page fault, in which case it will go back to the
same vaddr.
	-or-
Fresh page to be zeroed or filled from swap or a.out. In this case
you don't care about the former contents, as long as they don't
linger. As long as only the new CMMU is now servicing the page, the
old cached data shouldn't be hit.

Am I missing something here?

And now a new question: In a configuration with multiple CMMUs as
described in the previous article, do you have to have separate page
tables for each CMMU, or do they somehow know how to cooperate in
searching a single table? (For a given region of a virtual address
space, that is.) Maintaining multiple page tables may be a problem if
a process is permitted to migrate from one processor with, say, 4
CMMUs to one with only 2 or 1.
--
Steve Valentine - smv@necis.nec.com
NEC Information Systems 1300 Massachusetts Ave., Boxborough, MA 01719
This signature line is blank when you're not looking at it.