mslater@cup.portal.com (06/11/88)
The following article on the 88200 CMMU is excerpted from the June
issue of Microprocessor Report, which is a subscription newsletter (in
print, not normally available electronically) written for designers of
microprocessor-based hardware. Any comments would be appreciated;
please send e-mail, and I'll summarize to the net.

------

One of the strengths of Motorola's 88000 design is that the 88200 CMMU
chips provide the most highly integrated cache memory solution
available. Although the 88100 can be used without the 88200, most
88000 systems will include at least two 88200s--one for instructions
and one for data. Up to four 88200s can be paralleled to increase the
size of the cache and address translation buffers.

The 88200s connect to the 88100 processor via the non-multiplexed
P-bus, and to the memory system via the multiplexed M-bus. There are
separate P-buses for instructions and for data. Most systems use a
common M-bus and common main memory for programs and for data, but it
is possible to keep them separate.

Memory management

The 88200 includes two address translation caches, called the block
address translation cache (BATC) and the page address translation
cache (PATC). (Address translation caches are often called
translation look-aside buffers, or TLBs.)

The PATC holds a subset of the full address translation table, which
is stored in main memory. This tree-structured translation table
consists of a segment table containing up to 1024 segment pointers,
each of which selects a 1024-entry page table. (Actually, there are
two of these trees--one for supervisor accesses, and one for user
accesses.) Each segment is 4 Mbytes, and each page is 4 Kbytes.

The operating system must maintain this table in main memory; the
CMMU chips automatically fetch entries from memory as needed. No
software intervention is required when there is a PATC miss; the CMMU
hardware will automatically search the table in main memory and load
the required descriptor. Once an entry has been fetched by the CMMU,
it remains in the address translation cache until the cache is full;
then, the next time another table entry is needed, the oldest entry in
the translation cache is discarded.

The BATC has 10 entries, and maps blocks of 512 Kbytes. The PATC has
56 entries, each of which maps a 4-Kbyte page. Both translation
caches are fully associative. The BATC is intended for use by the
operating system and other programs using large contiguous areas of
memory; without the BATC, large programs would require many PATC
entries. The relatively small page size of the PATC supports
demand-paged operating systems. The BATC must be explicitly set up by
the system software; unlike the PATC entries, the BATC entries are not
automatically fetched from a table in main memory.

In addition to specifying the physical address, each page descriptor
includes protection and control information. Descriptor flags can be
set on a block, segment, or page level to mark an area as cacheable or
not, global (shared) or non-sharable, and write-through or copy-back.
"Used" and "modified" flags are maintained for each page to support
demand-paged virtual memory.

Data/instruction cache

The data/instruction cache is the largest available on a single
chip--16 Kbytes. Figure 1 shows the block diagram of the 88200. The
cache is 4-way set-associative, so most thrashing problems are
eliminated. When multiple CMMU chips are used in parallel, they
effectively increase the associativity of the cache; two chips in
parallel act as an 8-way set-associative cache.
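To make the two-level walk concrete, here is a rough C sketch of how a
32-bit logical address breaks down under the scheme just described:
10 bits of segment index, 10 bits of page index, and a 12-bit page
offset. The names and descriptor layouts are illustrative stand-ins,
not the 88200's actual formats.

#include <stdint.h>

/* Illustrative two-level walk: a 1024-entry segment table whose
 * entries each point to a 1024-entry page table of 4-Kbyte pages, so
 * each segment covers 4 Mbytes. Real 88200 descriptors also carry
 * protection, cacheability, global, used, and modified bits, and
 * there is one such tree for supervisor and one for user accesses.
 */
#define SEG_SHIFT    22          /* A31..A22: segment index      */
#define PAGE_SHIFT   12          /* A21..A12: page index         */
#define OFFSET_MASK  0x00000FFFu /* A11..A0: untranslated offset */

uint32_t translate(uint32_t va, uint32_t *const seg_table[1024])
{
    uint32_t seg  = va >> SEG_SHIFT;             /* 0..1023 */
    uint32_t page = (va >> PAGE_SHIFT) & 0x3FFu; /* 0..1023 */

    /* The segment descriptor locates one page table; the page
     * descriptor holds a 4-Kbyte-aligned physical frame address. */
    uint32_t *page_table = seg_table[seg];
    uint32_t  frame      = page_table[page] & ~OFFSET_MASK;

    return frame | (va & OFFSET_MASK);
}

On a PATC miss, the CMMU performs this walk in hardware; software only
has to keep the tables in main memory consistent.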
Note that adding CMMUs increases the size of the PATC and BATC as well
as the data/instruction cache.

Each cache line includes a "disable" bit that allows the self-test
software to turn off any cache line that shows errors. Since RAM
cells rarely go bad once they have passed manufacturing testing, this
suggests that Motorola may have plans to increase their effective
yield by selling less-expensive versions in which not all cache lines
are guaranteed to work.

The MMU delay is overlapped with the cache access to provide
zero-wait-state performance even though it is a physical, rather than
a virtual, cache. Since the MMU's page size is 4K, the
least-significant 12 bits of the address are not translated, and thus
are not delayed by the MMU. Eight of these bits (A4--A11) select one
of 256 cache sets at the same time that the MMU is translating the
upper bits. When the MMU translation is complete, the translated
address is compared with the address tags of the four lines in the
set. If any of the four match, then there is a cache hit. Address
bits A2 and A3 then select one of the four words in the line, and the
four byte enable signals from the CPU (which eliminate the need for A0
and A1) enable the appropriate bytes.

When a read miss occurs, the entire cache line is filled. The CMMU
uses a 4-word block transfer on the M-bus to get the data from main
memory. The transfer itself takes a minimum of five clock cycles, one
for the address cycle and four to transfer the data. Another three
clock cycles are used in the CMMU, so the minimum delay on a miss is
eight clock cycles. With real-world DRAMs for main memory, this delay
is extended to at least 10 clock cycles. Because the access is a
four-word burst, page-mode and nibble-mode access techniques can be
used.

One shortcut taken in the 88200 design is that the P-bus access is not
satisfied until the entire line is filled. A more sophisticated
approach is to fill the needed word first, and immediately pass it
along to the CPU; the cache controller can then fill the remainder of
the line.
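As an illustration of the overlapped lookup, here is a simplified C
model of the address split and set search. The arrays and names are
stand-ins for hardware, not actual 88200 internals; in the chip, the
set selection and the MMU translation proceed in parallel.

#include <stdbool.h>
#include <stdint.h>

/* 16 Kbytes organized as 256 sets x 4 ways x 16-byte lines. The set
 * index comes from A11..A4, which lie below the 4K page boundary and
 * so need no translation; the tag compare uses the physical address
 * the MMU produces.
 */
#define WAYS 4
#define SETS 256

struct line {
    uint32_t tag;       /* physical address bits above A11     */
    bool     valid;
    bool     disabled;  /* set by self-test on a failing line  */
    uint32_t words[4];  /* 16-byte line; A3:A2 select the word */
};

static struct line cache[SETS][WAYS];

bool lookup(uint32_t vaddr, uint32_t paddr, uint32_t *word_out)
{
    uint32_t set = (vaddr >> 4) & 0xFFu; /* A11..A4, untranslated */
    uint32_t tag = paddr >> 12;          /* ready after the MMU   */

    for (int way = 0; way < WAYS; way++) {
        struct line *l = &cache[set][way];
        if (l->valid && !l->disabled && l->tag == tag) {
            *word_out = l->words[(vaddr >> 2) & 0x3u]; /* A3:A2 */
            return true;    /* hit: zero wait states */
        }
    }
    return false;           /* miss: fill the whole 4-word line */
}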
Snooping maintains cache coherency

Either a write-through or copy-back memory update approach can be
used, and is selected by a control bit in the page descriptor.
Copy-back provides higher performance for multiprocessor systems, but
requires that read snooping be implemented to guarantee cache
coherency. The 88200 includes the snooping logic.

Even when copy-back is selected, the first write to a cached location
is written through. This updates the main memory and invalidates any
other copies of the data that may be present in other caches, to
ensure that no more than one cache contains a modified version of the
data. Successive writes to that memory location are then not written
through to memory. When the cache line (or the entire cache) is
flushed, it is written back to main memory if it is "dirty"--that is,
if it has been modified since it was last written to main memory.

One of the page and segment descriptor control bits indicates whether
or not that page or segment is global, meaning that it can be shared
among processors. This information is output on the M-bus during each
access. All CMMUs monitor M-bus transactions, and any access to a
global location causes the snooping logic to be activated. The
address is compared to the cache address tags to see if the addressed
memory location is present in the cache, and if it has been modified.
If so, the cache chip asserts the M-bus reply code signals to abort
the transaction.

The cache chip that made the original request must give up the bus,
and wait at least one clock cycle before re-requesting it. During
this one-clock delay, the cache that detected the snooping hit takes
control of the M-bus and updates the main memory. When it releases
the M-bus, the requesting cache retries its read cycle, and gets the
updated data from main memory. Another approach to this problem is
for the cache to provide the data directly to the requesting device,
rather than aborting the cycle and requiring that the requestor retry
the access after main memory has been updated. This approach offers
slightly higher performance, but is more complex to implement.

Because the snooping logic requires access to the cache tags whenever
an M-bus access to a global location occurs, normal operation of the
cache is affected. If a P-bus request occurs while the snooping logic
is checking the tags, a wait state is generated. This causes some
drop in performance in multiprocessor systems if there are many M-bus
accesses to global memory locations. Eliminating this problem
requires a duplicate set of tag bits for the snooping logic, and
Motorola's designers felt that the silicon area required was not
justified. The CMMU provides additional tag state outputs to allow an
external set of tags to be maintained, so high-performance
multiprocessor systems can implement snooping externally to the CMMU
and eliminate this performance degradation.

Note that snooping is not needed on the instruction cache, since
programs should not be modified while they are executing.

If the cache chip's snooping logic detects an M-bus write to a cached
location, it invalidates that cache line. Note that the other aspects
of the cache coherency protocol guarantee that the invalidated data is
merely a copy of data in main memory.

------

Michael Slater, Editor and Publisher, Microprocessor Report
550 California Avenue, Suite 320, Palo Alto, CA 94306;
mslater@cup.portal.com
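For readers who want the snoop exchange spelled out, here is a rough C
model of the abort-and-retry sequence described in the article. Every
name here is invented for illustration; the real mechanism is a
hardware handshake on the M-bus, not software.

#include <stdbool.h>
#include <stdint.h>

enum reply { REPLY_OK, REPLY_ABORT }; /* simplified M-bus reply codes */

/* Hypothetical primitives standing in for hardware operations. */
extern bool       cache_has_dirty_line(uint32_t paddr); /* tag check */
extern void       write_line_back(uint32_t paddr);
extern enum reply mbus_read(uint32_t paddr, uint32_t line[4]);
extern void       wait_one_clock(void);

/* Snooping CMMU: tag check on another cache's global M-bus read. */
enum reply snoop_global_read(uint32_t paddr)
{
    if (cache_has_dirty_line(paddr)) {
        /* Abort the transaction; in hardware, this write-back
         * occupies the M-bus during the requester's one-clock
         * back-off window. */
        write_line_back(paddr);
        return REPLY_ABORT;
    }
    return REPLY_OK;
}

/* Requesting CMMU: give up the bus on an abort, then retry. */
void read_global_line(uint32_t paddr, uint32_t line[4])
{
    while (mbus_read(paddr, line) == REPLY_ABORT)
        wait_one_clock();  /* the retry sees the updated memory */
}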
andrew@frip.gwd.tek.com (Andrew Klossner) (06/12/88)
[] "The cache is 4-way set-associative, so most thrashing problems are eliminated. When multiple CMMU chips are used in parallel, they effectively increase the associativity of the cache; two chips in parallel act as an 8-way set-associative cache." No, it's still 4-way set-associative. Since exactly one CMMU will see any one page reference, a set hit will still flush one of the 4 existing set elements. "Even when copy-back is selected, the first write to a cached location is written through. This updates the main memory and invalidates any other copies of the data that may be present in other caches, to ensure that no more than one cache contains a modified version of the data." It writes back the entire 16-byte cache line, regardless of how much of it was modified. And it does so even if the page is marked "not global," meaning that software guarantees that the cache line is not present in any other cache. Ouch! "Because the snooping logic requires access to the cache tags whenever an M-bus access to a global location occurs, normal operation of the cache is affected. If a P-bus request occurs while the snooping logic is checking the tags, a wait state is generated. This causes some drop in performance in multiprocessor systems... " Every snooped access causes all CMMUs to stop servicing the CPU(s) while they do the tag check. Snooping is *very* expensive. You only use it for small pieces of memory shared among CPUs. "Note that adding Cmmus increases the size of the PATC and BATC as well as the data/instruction cache." Adding CMMUs creates some interesting problems. With more than one CMMU on a memory port (instruction or data), you pretty much have to use part of the memory address as a chip selector. (We use A12 and A13 to select one of four CMMUs.) But this memory address is *virtual*, so suddenly we have a (somewhat) virtual cache with aliasing problems. For example, if physical page 12 is mapped to virtual page 16, it is serviced by CMMU0; if the kernel remaps it to virtual page 17, it is serviced by CMMU1. The aliasing problems can be solved by snooping all of memory, but this is prohibitively expensive. The kernel can flush a page from cache when freeing it, but a cache page flush takes a minimum of 256 cycles. Two solutions that we're looking at are: 1) make the "software page cluster size" be 4 pages; that is, always allocate 4 contiguous physical pages together. This turns it back into a physical cache. On the downside, this makes for higher internal fragmentation and wastes three-fourths of the PATC. 2) Maintain four separate lists of free pages, one for each of the four values of <A13:A12>, and allocate physical pages so that a page is always serviced by the same CMMU. When there are no free pages in the right list to back a new virtual page, allocate a page from some other list, and flush it from the old CMMU. Credit for this idea goes to the Motorola Unix kernel group in Tempe, Arizona. Note that, when you want to enlarge a cache, you end up buying multiple MMUs to go with your additional RAM. This is pretty pricey, but it can provide well scheduled software with additional opportunities for parallelism: since memory loads and stores are pipelined, a load from one CMMU can wait on a page table walk while a load from a second CMMU can be serviced from a cache hit. On the whole, it's a neat part. -=- Andrew Klossner (decvax!tektronix!tekecs!andrew) [UUCP] (andrew%tekecs.tek.com@relay.cs.net) [ARPA]
aglew@urbsdc.Urbana.Gould.COM (06/14/88)
> "Note that adding Cmmus increases the size of the PATC and BATC > as well as the data/instruction cache." > >Adding CMMUs creates some interesting problems. With more than one >CMMU on a memory port (instruction or data), you pretty much have to >use part of the memory address as a chip selector. (We use A12 and A13 >to select one of four CMMUs.) But this memory address is *virtual*, so >suddenly we have a (somewhat) virtual cache with aliasing problems. >For example, if physical page 12 is mapped to virtual page 16, it is >serviced by CMMU0; if the kernel remaps it to virtual page 17, it is >serviced by CMMU1. The aliasing problems can be solved by snooping all >of memory, but this is prohibitively expensive. The kernel can flush a Couldn't you use part of the address that is untranslated to do the select?
smv@necis.UUCP (Steve Valentine) (06/15/88)
In article <10067@tekecs.TEK.COM> andrew@frip.gwd.tek.com (Andrew Klossner) writes:
>Adding CMMUs creates some interesting problems. With more than one
>CMMU on a memory port (instruction or data), you pretty much have to
>use part of the memory address as a chip selector. (We use A12 and A13
>to select one of four CMMUs.) But this memory address is *virtual*, so
>suddenly we have a (somewhat) virtual cache with aliasing problems.
...
[ Deleted suggestions for how to avoid the problem by flushing or using
separate free lists. ]
> -=- Andrew Klossner (decvax!tektronix!tekecs!andrew) [UUCP]
>     (andrew%tekecs.tek.com@relay.cs.net) [ARPA]

Why do you care about this aliasing? If you pull a page from the
freelist, it is generally going to fall into one of two categories:

Page reclaim on a page fault, in which case it will go back to the
same vaddr.
	-or-
Fresh page to be zeroed or filled from swap or a.out. In this case
you don't care about the former contents, as long as they don't
linger. As long as only the new CMMU is now servicing the page, the
old cached data shouldn't be hit.

Am I missing something here?

And now a new question: In a configuration with multiple CMMUs as
described in the previous article, do you have to have separate page
tables for each CMMU, or do they somehow know how to cooperate in
searching a single table? (For a given region of a virtual address
space, that is.) Maintaining multiple page tables may be a problem if
a process is permitted to migrate from one processor with, say, 4
CMMUs to one with only 2 or 1.
--
Steve Valentine - smv@necis.nec.com
NEC Information Systems 1300 Massachusetts Ave., Boxborough, MA 01719
This signature line is blank when you're not looking at it.