dbjag@io.UUCP (David Benjamin) (12/07/90)
Does anyone know of a good method for forcing a memory cache flush on a
Data General Aviion 300 series workstation? Specifically, I would like
to request a specific line flush from the m88200 cache handling
instructions. I'm not too keen on a full flush of both caches, as that
would take too much time.

The reason is kind of hairy, but if you must know, it involves
self-modifying code which seems to fail on the Aviion when the caches
get out of sync. There, glad you asked?

This brings up another question. The code is apparently failing because
the two caches contain different values for the same address. Why wasn't
this state prevented by the "M-bus snooping" of the 88200's? Perhaps my
understanding of their function is warped.

Thanks in advance.

--
- Dave Benjamin -
- Interleaf -
- ...!eddie.mit.EDU!ileaf!dbjag -
lewine@cheshirecat.webo.dg.com (Donald Lewine) (12/07/90)
In article <2308@io.UUCP>, dbjag@io.UUCP (David Benjamin) writes:
|> Does anyone know of a good method for forcing a memory cache flush on
|> a Data General Aviion 300 series workstation. Specifically, I would like
|> to request a specific line flush from the m88200 cache handling instructions.
|> I'm not too keen on a full flush of both caches, as that would take too much
|> time.
|>
|> The reason is kind of hairy, but if you must know, it involves self-modifying
|> code which seems to fail on the Aviion when the caches get out of sync.
|> There, glad you asked?
|>
|> This brings up another question. The code is apparently failing because
|> the two caches contain different values for the same address.
|> Why wasn't this state prevented by the "M-bus snooping" of the 88200's?
|> Perhaps my understanding of their function is warped.

The memctl() function is used to support self-modifying code,
compile-and-go languages, and the like. It lets you mark a region of
memory as writable or executable; it may not be both executable and
writable at the same time. See the DG/UX documentation or the 88open
Binary Compatibility Standard for details of memctl().

The 88200 (by default) does not do M-bus snooping for the instruction
cache. It assumes that there will be no writes to the instruction
stream, and turning off snooping gives higher performance. The memctl()
function may be used to override the default.

Another solution is to chase the data out of the cache. You can make
some memory references that will conflict with the instructions you want
to flush and thus chase them out of the cache. NOTE: This is specific to
the 88200. If Motorola were to make a larger cache chip, this code may
stop working.

--------------------------------------------------------------------
Donald A. Lewine                (508) 870-9008 Voice
Data General Corporation        (508) 366-0750 FAX
4400 Computer Drive. MS D112A
Westboro, MA 01580  U.S.A.
uucp: uunet!dg!lewine   Internet: lewine@cheshirecat.webo.dg.com
andrew@frip.WV.TEK.COM (Andrew Klossner) (12/08/90)
[]

"I would like to request a specific line flush from the m88200 cache
handling instructions. I'm not too keen on a full flush of both caches,
as that would take too much time."

In the Motorola kernel, the time it takes to flush the cache is small
compared to the time it takes to get into and out of the system call
handler. I'd be surprised if Data General has significantly smaller
syscall overhead.

"The code is apparently failing because the two caches contain different
values for the same address. Why wasn't this state prevented by the
"M-bus snooping" of the 88200's?"

Again, in the Motorola kernel, only the data caches are told to
participate in snooping. Performance would be seriously degraded if the
instruction caches were to join in. On a snoop cycle, *every* snooping
88200 must stop what it's doing for one cycle and listen in. Even if the
CPU is fetching instructions from cache, it's held up.

  -=- Andrew Klossner   (uunet!tektronix!frip.WV.TEK!andrew)      [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)     [ARPA]
stevec@bubbyb.Solbourne.COM (Steve Cox) (12/11/90)
In article <2308@io.UUCP> dbjag@io.UUCP (David Benjamin) writes:
>The reason is kind of hairy, but if you must know, it involves self-modifying
>code which seems to fail on the Aviion when the caches get out of sync.
>There, glad you asked?
>
>This brings up another question. The code is apparently failing because
>the two caches contain different values for the same address.
>Why wasn't this state prevented by the "M-bus snooping" of the 88200's?
>Perhaps my understanding of their function is warped.

yes, i believe your understanding is correct (that the caches are
supposed to be coherent). but perhaps the code you are trying to modify
AND THEN EXECUTE is not marked as global by the DG operating system (?).
i believe that coherency is not maintained for data that isn't marked as
global.

it doesn't seem reasonable that this is a problem (or "state" as you
say) in the cache coherency protocol (i.e. there'd be all sorts of
problems in any multi-88200 system if such a bug existed).

- stevec

--
steve cox                      stevec@solbourne.com
solbourne computer, inc.       i've got the need...
1900 pike, longmont, co        the need...
(303)772-3400                  for speed!
andrew@frip.WV.TEK.COM (Andrew Klossner) (12/11/90)
[]

"Another solution is to chase the data out of the cache. You can make
some memory references that will conflict with the instructions you want
to flush and thus chase them out of the cache."

But the problem is stale data in the instruction cache. Memory
references won't help here; you have to do instruction fetches to flush
that data. That would be a worst case of 16 fetches of different
instructions, each located at the indicated offset within a page. I
suppose it could be done, but it sounds pretty fragile.

  -=- Andrew Klossner   (uunet!tektronix!frip.WV.TEK!andrew)      [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)     [ARPA]
jdarcy@encore.com (Jeff d'Arcy) (12/11/90)
stevec@bubbyb.Solbourne.COM (Steve Cox) writes:
>>Why wasn't this state prevented by the "M-bus snooping" of the 88200's?
>>Perhaps my understanding of their function is warped.
>
>yes, i believe your understanding is correct (that the caches are supposed
>to be coherent). but, perhaps the code you are trying to modify AND THEN
>EXECUTE is not marked as global by the DG operating system.(?)

I came in late on this discussion, so if someone's already covered this
ground please forgive me. In order for snooping to occur, the Global bit
in the writer's page descriptor must be set, and the Snoop Enable bit
must also be set in the reader's System Control Register. It is quite
believable that this bit will *not* be set for the Instruction CMMU,
therefore snooping will not occur, and therefore...

You could approach this problem in several ways. Setting Snoop Enable in
the I-CMMU might work, as would a flush prior to executing the modified
code. My favored solution would be *not* to execute "dirty" code, but I
recognize that others don't share my viewpoint there.

--
Jeff d'Arcy, Generic Software Engineer - jdarcy@encore.com
  Non sunt multiplicanda entia praeter necessitatem!
jimk@oakhill.UUCP (Jim Klingshirn) (12/11/90)
In article <2308@io.UUCP>, dbjag@io.UUCP (David Benjamin) writes:
|> Does anyone know of a good method for forcing a memory cache flush on
|> a Data General Aviion 300 series workstation. Specifically, I would like
|> to request a specific line flush from the m88200 cache handling instructions.
|> I'm not too keen on a full flush of both caches, as that would take too much
|> time.
|>
|> The reason is kind of hairy, but if you must know, it involves self-modifying
|> code which seems to fail on the Aviion when the caches get out of sync.
|> There, glad you asked?
|>
|> This brings up another question. The code is apparently failing because
|> the two caches contain different values for the same address.
|> Why wasn't this state prevented by the "M-bus snooping" of the 88200's?
|> Perhaps my understanding of their function is warped.

Don Lewine suggested two ways to flush the cache: the first is to use
the memctl() system call, which was designed for this type of thing; the
second was to chase the data out of the cache.

Using memctl() is the only way to ensure that the code will be portable
at all. Since it's a system call, it allows the operating system to
clean up all caches, pipelines, buffers, etc. It can be customized for
the particular version of the 88000 that your application is running on,
and can work correctly on any 88000 BCS compliant system. In fact it may
even be fairly efficient, since the 88200 can flush or invalidate a line
at a time.

On the other hand, there is no way for application software to reliably
chase data out of a cache. To define an algorithm that would clear out
any size cache, any number of caches, any cache associativity, and any
type of cache is impossible, especially when you consider that you may
need to clear instruction prefetch buffers, instruction issue pipelines,
and branch caches in addition to clearing the instruction cache(s).

We periodically receive requests to add user-level instructions to
generate cache control operations. The justification generally is
centered around data caches - for instance, how can you force data out
so that it is guaranteed to get out to the graphics frame buffer when
the cache is in copyback mode. To date, the only justification I've
heard for instruction cache control is to support self-modifying code
(including breakpoints). Assuming there are alternate ways to support
breakpoints - should we worry about instruction cache control
operations?

Jim Klingshirn,
Motorola 88000 Design
pierson@encore.com (Dan L. Pierson) (12/12/90)
In article <4322@photon.oakhill.UUCP> jimk@oakhill.UUCP (Jim Klingshirn) writes:
   We periodically receive requests to add user level instructions
   to generate cache control operations. The justification generally
   is centered around data caches - for instance how can you force data
   out so that it is guaranteed to get out to the graphics frame buffer
   when the cache is in copyback mode. To date, the only justification
   I've heard for instruction cache control is to support self
   modifying code (including breakpoints). Assuming there are alternate
   ways to support breakpoints - should we worry about instruction cache
   control operations?
Yes, any language system that incrementally compiles or dynamically
loads code then executes it needs to flush caches. Lisp is a good
example of the worst case here. Many Lisp systems:
1. Put compiled code (incrementally or loaded) into dynamic space.
2. Manage dynamic space with a copying garbage collector. This
means that code may move between calls; however, the old code
location will only be reused after the entire semi-space it's
part of has been freed for reuse (there are less friendly GC
algorithms...).
3. Loading a Lisp compiled file can consist of an arbitrary
sequence of: load a bit, execute some of what you just loaded,
load some more, etc. The i cache has to be cleaned as
efficiently as possible in here.
Lisp is not the only language with this type of requirement, just a
good example.
--
dan

In real life: Dan Pierson, Encore Computer Corporation, Research
UUCP: {talcott,linus,necis,decvax}!encore!pierson
Internet: pierson@encore.com
meissner@osf.org (Michael Meissner) (12/12/90)
In article <PIERSON.90Dec11173050@xenna.encore.com> pierson@encore.com (Dan L. Pierson) writes:
| 2. Manage dynamic space with a copying garbage collector. This
|    means that code may move between calls however the old code
|    location will only be reused after the entire semi-space it's
|    part of has been freed for reuse (there are less friendly GC
|    algorithms...).

Beware that compilers for non-GC languages (like C) may cache a
function's address in a register between calls.

--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Considering the flames and intolerance, shouldn't USENET be spelled
ABUSENET?
dfields@urbana.mcd.mot.com (David Fields) (12/13/90)
In article <PIERSON.90Dec11173050@xenna.encore.com> pierson@encore.com (Dan L. Pierson) writes:
>In article <4322@photon.oakhill.UUCP> jimk@oakhill.UUCP (Jim Klingshirn) writes:
>> We periodically receive requests to add user level instructions
>> to generate cache control operations. The justification generally
>> is centered around data caches - for instance how can you force data
>> out so that it is guaranteed to get out to the graphics frame buffer
>> when the cache is in copyback mode. To date, the only justification
>> I've heard for instruction cache control is to support self
>> modifying code (including breakpoints). Assuming there are alternate
>> ways to support breakpoints - should we worry about instruction cache
>> control operations?
>
>Yes, any language system that incrementally compiles or dynamically
>loads code then executes it needs to flush caches. Lisp is a good
>example of the worst case here. Many Lisp systems:
>
><lisp scenario described>
>
>Lisp is not the only language with this type of requirements, just a
>good example.

First, I'm not a chip designer, nor do I work with dynamic compilation
systems, so ... I've got a few questions. I'll attempt to answer them as
best as I can and see if anyone else can do a better job. I'm not going
to consider having the instruction and data caches be coherent, because
then you would have to address the instruction pipeline, and I believe
that most of us would find that entirely too expensive.

1) What are the reasons to make cache flush operations privileged? In a
   multi-processor system, a user could slow response times of other
   users a little easier than he could before. Are there any other
   reasons? Are there any security (i.e. Orange Book) issues? For
   example, does this make an unlimitable high bandwidth covert channel?

2) What's the cost of the user-level cache flush instruction? It seems
   as though it shouldn't affect cycle time, since you have to provide
   this in supervisor mode anyway. Is this true? Are there any other
   costs?

3) What's the real performance penalty of not having one? It's more
   difficult for the implementors of the dynamic compilation system,
   but from what I've heard many get around the problem by compiling
   and flushing in larger chunks to amortize the cost. Just like
   garbage collection. I seem to remember Andrew Klossner posting about
   someone doing this at Tek. Does anyone have any performance info
   that they could post?

--
Dave Fields // Motorola MCD // uiucuxc!udc!dfields // dfields@urbana.mcd.mot.com
andrew@frip.WV.TEK.COM (Andrew Klossner) (12/13/90)
[]

"Using memctl() is the only way to ensure that the code will be portable
at all ... In fact it may even be fairly efficient since the 88200 can
flush or invalidate a line at a time."

In the Motorola kernel, the memctl implementation is moby inefficient.
It has to be, because there's no memctl option that says "just flush
cache"; instead, the code must run through segment and page descriptors
flipping the writable state.

This was enough of a problem that Tektronix implemented a "just flush
the cache" system call in our kernel, for a customer who did incremental
compilation and didn't need to be BCS compliant.

  -=- Andrew Klossner   (uunet!tektronix!frip.WV.TEK!andrew)      [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)     [ARPA]
pierson@encore.com (Dan L. Pierson) (12/13/90)
In article <MEISSNER.90Dec11221308@curley.osf.org> meissner@osf.org (Michael Meissner) writes:
>In article <PIERSON.90Dec11173050@xenna.encore.com> pierson@encore.com (Dan L. Pierson) writes:
>| 2. Manage dynamic space with a copying garbage collector. This
>|    means that code may move between calls however the old code
>|    location will only be reused after the entire semi-space it's
>|    part of has been freed for reuse (there are less friendly GC
>|    algorithms...).
>
>Beware that compilers for non-GC languages (like C), may cache a
>function's address in a register between calls.

Obviously; that's a standard constraint when interfacing such languages
to C and friends. It's also one of the reasons why technically inferior
GC algorithms such as mark and sweep may actually work better for
multi-language applications.

This is rapidly becoming a digression from comp.sys.m88k.

--
dan

In real life: Dan Pierson, Encore Computer Corporation, Research
UUCP: {talcott,linus,necis,decvax}!encore!pierson
Internet: pierson@encore.com
df@phx.mcd.mot.com (Dale Farnsworth) (12/13/90)
> "Using memctl() is the only way to insure that the code will
> be portable at all ... In fact it may even be fairly efficient
> since the 88200 can flush or invalidate a line at a time."
>
> In the Motorola kernel, the memctl implementation is moby inefficient.
> It has to be, because there's no memctl option that says "just flush
> cache"; instead, the code must run through segment and page descriptors
> flipping the writable state.
>
> This was enough of a problem that Tektronix implemented a "just flush
> the cache" system call in our kernel, for a customer who did
> incremental compilation and didn't need to be BCS compliant.

Just to clear up some misinformation ... The above quote must refer to
another vendor. There is nothing in the BCS specification of memctl
which requires flipping the writable state of page descriptors, and no
Motorola kernel has ever done so.

Normally, our implementation of memctl validates arguments, sets a
couple of flags, and flushes the caches. There is a single exception to
this: the first (and only the first) time a shared text region is made
modifiable by memctl, it is changed into a copy-on-write region, which
incurs additional overhead. I can think of no reason that a memctl
implementation needs to have significantly more overhead than a system
call to flush the caches.

-Dale

--
Dale Farnsworth		Motorola Computer Group
alan@encore.encore.COM (Alan Langerman) (12/14/90)
In article <4322@photon.oakhill.UUCP>, jimk@oakhill.UUCP (Jim Klingshirn) writes:
|> We periodically receive requests to add user level instructions
|> to generate cache control operations. The justification generally
|> is centered around data caches - for instance how can you force data
|> out so that it is guaranteed to get out to the graphics frame buffer
|> when the cache is in copyback mode. To date, the only justification
|> I've heard for instruction cache control is to support self
|> modifying code (including breakpoints). Assuming there are alternate
|> ways to support breakpoints - should we worry about instruction cache
|> control operations?
|>
|> Jim Klingshirn,
|> Motorola 88000 Design

Not only self-modified code, but other-modified code. A process can have
its text mapped read/write by a second process, which in turn may alter
that text, possibly while the first process is running. (We do this on
our MP systems, on which we have built a sophisticated shared-memory
debugger.) So we need to be able to flush not only the current
processor's I-cache and pipe but also some OTHER processor's I-cache and
pipe. (I'm not directly involved in this work, so I'm not sure how we
currently solve this problem.)

-----
Alan Langerman (alan@encore.com)
pcg@cs.aber.ac.uk (Piercarlo Grandi) (12/14/90)
I have crossposted to comp.arch, and redirected followups there too,
because this is a very general architectural question, not just limited
to the 88k.

On 11 Dec 90 01:37:22 GMT, jimk@oakhill.UUCP (Jim Klingshirn) said:

jimk> We periodically receive requests to add user level instructions to
jimk> generate cache control operations. The justification generally is
jimk> centered around data caches - for instance how can you force data
jimk> out so that it is guaranteed to get out to the graphics frame
jimk> buffer when the cache is in copyback mode. To date, the only
jimk> justification I've heard for instruction cache control is to
jimk> support self modifying code (including breakpoints).

jimk> Assuming there are alternate ways to support breakpoints - should
jimk> we worry about instruction cache control operations?

Oh yes, definitely. Self-modifying code is of the essence. It is *vital*
as a technique to support efficiently, especially on a RISC machine,
many advanced programming language constructs. If you want to do just
Fortran and C, no problem, but probably you don't want to.

Basically, it is extremely important to be able to generate or modify
small code sequences ("thunks") on the fly in many OO languages, in some
older languages, and in many non-OO languages as well. Consider Self, as
just a small example; also, virtual functions in C++ can be efficiently
implemented with thunks, and one 386 compiler already uses them. Modern
Smalltalk implementations generate code on the fly as well, and so on.

Also, it is the easiest way to support languages that run in an
environment, such as many Lisps and AI languages, which have an embedded
compiler and generate compiled code in the workspace. Other applications
may be graphic algorithms; in many cases one can produce thunks, or
modify existing canned ones, so that certain repetitive drawing
operations can be done by high-speed code customized for the precise
purpose. I guess this is why a guy from Interleaf is interested in
self-modifying code.

Dynamic code generation, which is indistinguishable from self-modifying
code of old, is also especially suited to RISC machines to give a sort
of dynamic microprogramming; you synthesize higher-level operations
dynamically out of the simple RISC ones, without having to consider in
advance all the cases. After all, consider that self-modifying code was
used most often in old machines to simulate index registers :-).

--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
hamilton@siberia.rtp.dg.com (Eric Hamilton) (12/18/90)
Says Jim Klingshirn:
|> We periodically receive requests to add user level instructions
|> to generate cache control operations. The justification generally
|> is centered around data caches - for instance how can you force data
|> out so that it is guaranteed to get out to the graphics frame buffer
|> when the cache is in copyback mode. To date, the only justification
|> I've heard for instruction cache control is to support self
|> modifying code (including breakpoints). Assuming there are alternate
|> ways to support breakpoints - should we worry about instruction cache
|> control operations?

[Attributions below are based on other postings in this thread]

Yes, there is a need for instruction cache control operations, and the
need extends far beyond breakpoints (Piercarlo Grandi, Dan Pierson, Dave
Benjamin, and others). Furthermore, the current memctl() call is not a
good answer, for two reasons. First, it is slow, at least in some
implementations. Second, as John Foderaro points out, it has the wrong
semantics for some applications, such as Lisp garbage collection.

On multi-processor systems it is generally necessary to apply the cache
operation to all processors in the system (Alan Langerman, with the
approval of MP OS folks everywhere). A user-level instruction is a very
bad way to implement such a feature, because it would require a path
that allows an instruction executed on one processor to affect the
caches of another processor in a way that can be safely used by multiple
processors simultaneously from user space, without any locking or
coordination between each other or the OS. A hardware implementation of
this is completely unreasonable (think about how two processors would
simultaneously issue page flush operations on different pages, and wait
for the flushes to complete, from user space, without deadlocking or
tromping over each other's commands). And if some software support is
required, then the obvious way to get this support is to trap to the
operating system, which can make the right thing happen across all
processors in the system.

Thus, the problem is not how to get the right support for user cache
operations into the hardware. The problem is to devise an OS trap which
is fast enough and has the right semantics. Additional support from the
88000 hardware will be required only if we cannot implement an OS trap
with acceptable functionality and performance. This suggests two
questions:

	1) What is "acceptable functionality"?
	2) What is "acceptable performance"?

Functionality:

I believe that John Foderaro's proposal to extend memctl() with an
option to bring the data and instruction caches into coherence is
excellent, but it doesn't go quite far enough. We shouldn't use memctl()
because:

	1) It requires that the length be a multiple of the page size,
	   and I'd like to be able to use the line-granular cache
	   operations when possible; and

	2) It's a system call, system call entry/exit overhead is
	   appreciable, and I'd like the trap to be as fast as possible.

How about a new trap:

	r2 contains the base address
	r3 contains the length

	tb0 0,r0,<CacheSynchronizationTrap>

This will cause the data and instruction caches for the specified region
(between r2 and r2+r3-1, byte granular, no minimum length) to come into
coherence, so that that region can be safely executed. If any byte
within a four-byte word in this region is written, the subsequent
execution of that word is undefined until another
CacheSynchronizationTrap that covers that word is issued. A length of
zero is interpreted to mean all memory.

No error checking is necessary. If the region contains invalid
addresses, nothing bad happens; the copyback/invalidate just becomes
moot.

Performance:

The execution time for this trap will, of course, vary according to the
details of the system. It would not be surprising to discover that it
takes twice as long on a system with two 88200s per Pbus, for example.
Copyback times obviously will vary according to the number of dirty
lines that must be copied back. Invalidating parts of the instruction
cache has a performance impact that goes beyond the time required to do
the invalidation.

We cannot control this time, but we can control the overhead required to
get into and out of the cache control operation. How do people feel
about a target of 10 clock cycles overhead? How about 100? 200? 2000? It
is my belief that this trap can be made blindingly fast and that the
overhead will be small compared with the actual cost of doing the cache
manipulation.

How do people feel about this approach? Is it promising enough to
justify the work of drafting a proposal and submitting it to 88Open?