hamilton@siberia.rtp.dg.com (Eric Hamilton) (12/22/90)
In a previous posting, Jim Klingshirn asked about the need for user
instructions to generate cache control operations. I argued that a
better and more implementable solution is to supply a user trap, fielded
by the operating system, which will apply the appropriate cache
operations.

I did not discuss the details of how such a trap would be used, nor what
the implications for virtual memory and multiple processors would be. In
other words, I assiduously avoided the problem of code that is
dynamically generated or moved, the problem that Sean Foderaro refers to
as "read-write-execute" data. This posting proposes a solution to that
problem. The reader should refer to Foderaro's posting for a discussion
of why the existing memctl() function is not a solution to this problem.

[In this context, an "Icache" is an instruction CMMU and a "Dcache" is a
data CMMU.]

The problem:

In general, 88000 icaches do not snoop and instruction fetches are not
marked global. Note the phrase "in general"; some vendors may build
systems in which either or both of these statements are not true, but in
general we shouldn't require that all current and future 88000s support
these capabilities, which are not free. Thus, the icaches are not
generally coherent when instructions in memory are changed.

There are two relevant ways in which instructions can change. First,
there is the normal activity of the kernel virtual memory manager, which
pages code in as necessary, loads code from disk or across the net, and
generally moves instructions around in page-sized chunks. Second, user
programs may want to modify a region of memory and then try to execute
it. This is a legitimate functional requirement; see the postings by
Sean Foderaro, Piercarlo Grandi, David Benjamin, and others for
examples.

Because the icaches are not coherent, whenever any executable memory is
changed, it is necessary to invalidate (some portion of) the icache and
force (some portion of) the dcache to copyback modified data.
The invalidate eliminates stale data from the instruction cache; the
copyback ensures that non-global instruction cache fills will not fetch
stale data from memory.

Multiprocessor examples:

In a multiprocessor system it is generally necessary to
invalidate/copyback all processors' caches. This is because the
offending data may be stale in any processor's cache. For example:

    Process A on processor 1 pages the instructions a1 into page p1.
    Process A executes a1 on processor 1.
    Process A is rescheduled onto processor 2 and executes a1.
    Process A is rescheduled onto processor 3 and executes a1.
    Process A terminates.
    Process B on processor 2 page faults on non-resident instruction b1.
    The virtual memory manager decides to fill page p1 with b1.
    A network demon decides to fetch b1 from a remote executable.
    The network demon is scheduled onto processor 1.
    The network demon starts copying b1 into p1.
    The network demon is rescheduled onto processor 3.
    The network demon is rescheduled onto processor 2.
    The network demon completes the pagein.
    Process B now resumes execution.

At this point, the instruction caches of processors 1, 2, and 3
incoherently believe that page p1 contains a1, as does main memory. The
correct value, b1, can be found in the data cache of one or more of the
processors on which the network demon executed. It should be obvious
that every icache must invalidate and every dcache must copyback page p1
before B can safely resume execution.

Similar examples can be constructed for user-modified code, even if only
one process is involved. Indeed, the problem is the same: the offending
writes, punctuated by reschedules, may have occurred on several
processors, instruction fetches may have occurred on several processors,
all of the instruction caches may be more or less incoherent and the
correct data may be scattered through several data caches.

What this means for rwx pages:

There are two conclusions that follow from the discussion above.
1) The operating system must know, at page replacement time, that a
   particular page is potentially executable, and if it is, must issue
   to every processor a dcache copyback and an icache invalidate for
   that page.

2) If a user tries to execute data, a dcache copyback and an icache
   invalidate must be issued to every processor for the data area in
   question, after the data is modified and before it is executed. This
   is exactly the same operation that the operating system must perform
   during page replacement, so it is trivially true that the necessary
   hardware support is present.

This is not a problem for the operating system, which has more or less
direct access to the cache hardware and controls page replacement. It is
a problem for user rwx pages, for two reasons. First, user code has no
direct access to the cache hardware. Second, the OS virtual memory
manager must somehow be notified that a data page is potentially
executable, so that it can page it in correctly.

If these two problems can be overcome, there is no reason why
read-write-execute pages cannot be made to work on any current or future
88000 processors, in uni-processor or multi-processor systems.

Proposal:

I propose that we support read-write-execute pages by defining
mechanisms that user applications may invoke to identify potentially
executable data and to provoke cache writebacks and invalidates as
necessary.

I have already proposed a cache manipulation operation in a previous
posting to comp.sys.m88k:

>
>     r2 contains the base address
>     r3 contains the length
>
>     tb0 0,r0,<CacheSynchronizationTrap>
>
> Will cause the data and instruction caches for the specified region (between
> r2 and r2+r3-1, byte granular, no minimum length) to come into coherence,
> so that that region can be safely executed.
> If any byte within a four-byte word in this region is written,
> the subsequent execution of that word is
> undefined until another CacheSynchronizationTrap that covers that word is
> issued.
> A length of zero is interpreted to mean all memory.
>

We also need some way to notify the kernel that a piece of storage is
potentially executable. The following mechanisms come to mind:

  - Add a MCT_RWX (state 4) argument to memctl(). When an area
    is memctl'd to MCT_RWX the operating system must treat it as
    potentially executable for paging purposes. This is probably
    the solution of choice in the BCS world.

  - Use mprotect() in the V.4 world for the same purpose.

  - Add bits to the executable format to indicate that stack
    extensions and/or sbrk() extensions should be treated as
    potentially executable. This would be done as well as, not
    instead of, the memctl/mprotect thing.

Note that the MCT_RWX memctl operation has exactly the interface, but
not the semantics, proposed by Foderaro. It does not necessarily do any
cache manipulation at all; it merely notifies the virtual memory manager
that some pageins will in the future require special treatment.

For example, a LISP interpreter might choose to use the MCT_RWX memctl()
option to mark its entire heap as read-write-execute. This would be done
once. Whenever code was dynamically compiled into the heap, and whenever
code was moved by the garbage collector, the CacheSynchronizationTrap
would be issued by the application to bring the instruction caches back
into coherence. Whenever the virtual memory manager paged any part of
the heap, it would recognize the read-write-execute state and properly
invalidate the instruction caches.
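
The page-replacement scenario from the multiprocessor example, and the
effect of a synchronization step before process B resumes, can be
replayed in a toy model. This is only a sketch under loud assumptions:
page p1 is shrunk to a single word, the data caches are assumed to keep
a single dirty owner among themselves, and every name here (Machine,
store, fetch, cache_sync, replay) is hypothetical and not part of any
88000 interface.

```c
#include <assert.h>
#include <stdbool.h>

#define NCPU 4
enum { A1 = 0xA1, B1 = 0xB1 };  /* stand-ins for instructions a1 and b1 */

typedef struct {
    int  memory;          /* the word as main memory sees it */
    int  dcache[NCPU];    /* per-CPU data cache copy */
    bool d_dirty[NCPU];   /* copy is modified and not yet copied back */
    int  icache[NCPU];    /* per-CPU instruction cache copy */
    bool i_valid[NCPU];   /* instruction cache copy is present */
} Machine;

/* Start with `value` in memory and every cache empty. */
void machine_init(Machine *m, int value) {
    m->memory = value;
    for (int c = 0; c < NCPU; c++) {
        m->dcache[c] = 0; m->d_dirty[c] = false;
        m->icache[c] = 0; m->i_valid[c] = false;
    }
}

/* Data store on `cpu`: write-back, so memory is NOT updated here.
   Model assumption: only one dirty owner exists at a time. */
void store(Machine *m, int cpu, int value) {
    assert(cpu >= 0 && cpu < NCPU);
    for (int c = 0; c < NCPU; c++) m->d_dirty[c] = false;
    m->dcache[cpu] = value;
    m->d_dirty[cpu] = true;
}

/* Instruction fetch on `cpu`: a hit returns the cached copy; a miss
   fills from memory without snooping any data cache. */
int fetch(Machine *m, int cpu) {
    assert(cpu >= 0 && cpu < NCPU);
    if (!m->i_valid[cpu]) {
        m->icache[cpu] = m->memory;
        m->i_valid[cpu] = true;
    }
    return m->icache[cpu];
}

/* Model of the synchronization: on every CPU, copy back dirty data
   and invalidate the instruction cache. */
void cache_sync(Machine *m) {
    for (int c = 0; c < NCPU; c++) {
        if (m->d_dirty[c]) {
            m->memory = m->dcache[c];
            m->d_dirty[c] = false;
        }
        m->i_valid[c] = false;
    }
}

/* Replay the example; returns what process B fetches on processor 2. */
int replay(bool sync_before_resume) {
    Machine m;
    machine_init(&m, A1);
    fetch(&m, 1); fetch(&m, 2); fetch(&m, 3);  /* process A executes a1 */
    store(&m, 1, B1);                          /* demon copies b1 into p1 */
    if (sync_before_resume) cache_sync(&m);
    return fetch(&m, 2);
}
```

In this model replay(false) returns A1, the stale instruction, while
replay(true) returns B1. The loop over every CPU in cache_sync is the
point of the example: synchronizing only the processor that B happens to
resume on would not find the dirty copy held by processor 1.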
rfg@NCD.COM (Ron Guilmette) (12/22/90)
In article <1990Dec21.201522.16487@dg-rtp.dg.com> hamilton@siberia.rtp.dg.com (Eric Hamilton) writes:
+
+1) The operating system must know, at page replacement time, that a
+ particular page is potentially executable...
...
+I propose that we support read-write-execute pages by defining mechanisms
+that user applications may invoke to identify potentially executable
+data and to provoke cache writebacks and invalidates as necessary.
...
+We also need some way to notify the kernel that a piece of storage is
+potentially executable. The following mechanisms come to mind:
+
+ - Add a MCT_RWX (state 4) argument to memctl(). When an area
+ is memctl'd to MCT_RWX the operating system must treat it as
+ potentially executable for paging purposes. This is probably
+ the solution of choice in the BCS world.
+ - Use mprotect() in the V.4 world for the same purpose.
I frankly am having a hard time understanding what exactly this
discussion is all about.
I understand that it would be nice to have a "standard" way of telling
the OS that some part of the virtual address space is executable. So
what? As noted, in V.4 you will be able to use mprotect (or mmap) to
do this.
I see people talking about the BCS/OCS. That's V.3 stuff!!! Won't
the V.4 ABI make that all obsolete (and also give you mprotect)
anyway? If so, what's the big deal? Is it really worth it at this
stage to be fretting about what the OCS/BCS does (or does not) say?
Obviously, it *is* worthwhile to make sure that the precise semantics
of mprotect() are suitable to meet a variety of needs, but why should
anybody be haggling (at this late date) about OCS/BCS changes?
--
// Ron Guilmette - C++ Entomologist
// Internet: rfg@ncd.com uucp: ...uunet!lupine!rfg
// Motto: If it sticks, force it. If it breaks, it needed replacing anyway.
hamilton@siberia.rtp.dg.com (Eric Hamilton) (12/24/90)
In article <3072@lupine.NCD.COM>, rfg@NCD.COM (Ron Guilmette) writes:
|> In article <1990Dec21.201522.16487@dg-rtp.dg.com> hamilton@siberia.rtp.dg.com (Eric Hamilton) writes:
|> +
|> +1) The operating system must know, at page replacement time, that a
|> + particular page is potentially executable...
|> ...
|> +I propose that we support read-write-execute pages by defining mechanisms
|> +that user applications may invoke to identify potentially executable
|> +data and to provoke cache writebacks and invalidates as necessary.
|> ...
|> +We also need some way to notify the kernel that a piece of storage is
|> +potentially executable. The following mechanisms come to mind:
|> +
|> + - Add a MCT_RWX (state 4) argument to memctl(). When an area
|> + is memctl'd to MCT_RWX the operating system must treat it as
|> + potentially executable for paging purposes. This is probably
|> + the solution of choice in the BCS world.
|>
|> + - Use mprotect() in the V.4 world for the same purpose.
|>
|> I frankly am having a hard time understanding what exactly this
|> discussion is all about.
|>

The discussion is about how to have read-write-execute semantics in
multiprocessor systems without requiring that the hardware support,
without software intervention, coherency between the instruction
cache(s) and the data cache(s).

This requires two things. One is a way of notifying the OS that a given
area of memory is both writable and executable. The second is a way of
bringing the instruction caches into instantaneous coherence so that an
application can write an instruction into an rwx region, do the thing
that brings the caches into coherence, and then execute the newly
written instruction.

|> I understand that it would be nice to have a "standard" way of telling
|> the OS that some part of the virtual address space is executable. So
|> what? As noted, in V.4 you will be able to use mprotect (or mmap) to
|> do this.
|>

Not "nice" but "necessary", at least in the context of comp.sys.m88k; it
may be exactly the other way around in comp.arch.....

The mprotect() call gives us a standard answer to the first requirement
(telling the OS that a given area is executable) but not the second
(bringing the caches into instantaneous coherence). A fast trap for this
purpose is at least discussably useful in both the V.3 and the V.4
worlds.

|> I see people talking about the BCS/OCS. That's V.3 stuff!!! Won't
|> the V.4 ABI make that all obsolete (and also give you mprotect)
|> anyway? If so, what's the big deal? Is it really worth it at this
|> stage to be fretting about what the OCS/BCS does (or does not) say?
|>
|> Obviously, it *is* worthwhile to make sure that the precise semantics
|> of mprotect() are suitable to meet a variety of needs, but why should
|> anybody be haggling (at this late date) about OCS/BCS changes?
|>

The bulk of the discussion is relevant to both BCS and ABI. It is true
that the ABI is closer to a complete solution because it has mprotect(),
and that a BCS solution is less interesting at this date. A full ABI
solution can be delivered with mprotect() plus a little bit more, and a
full BCS solution can be delivered by augmenting memctl() to deliver the
relevant mprotect() functionality, plus exactly the same little bit
more.

Thus, it may be reasonable not to worry about the BCS, but it is not the
case that the V.4 ABI will render the whole discussion obsolete, nor
that the advent of mprotect() alone will necessarily solve the problem
or end the discussion.

----------------------------------------------------------------------
Eric Hamilton                        +1 919 248 6172
Data General Corporation             hamilton@dg-rtp.dg.com
62 Alexander Drive                   ...!mcnc!rti!xyzzy!hamilton
Research Triangle Park, NC 27709, USA
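
The split between the two requirements can be illustrated with calls
that exist on present-day POSIX systems. This is a hedged sketch, not
the 88000 mechanism under discussion: mmap() and mprotect() stand in for
"telling the OS", and the GCC/Clang builtin __builtin___clear_cache
stands in for the proposed synchronization trap. The function name
publish_code is hypothetical, and the bytes are never executed, so their
content does not matter here.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#ifndef MAP_ANONYMOUS
#define MAP_ANONYMOUS MAP_ANON
#endif

/* Copy `len` bytes of generated code into a fresh page and make the
   page executable. Returns the page, or NULL on failure. */
void *publish_code(const unsigned char *code, size_t len) {
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0 || len > (size_t)page) return NULL;

    /* Obtain one writable, not yet executable, page. */
    char *buf = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) return NULL;

    /* The writes land in the data cache first. */
    memcpy(buf, code, len);

    /* Requirement 1: tell the OS the region is executable. */
    if (mprotect(buf, (size_t)page, PROT_READ | PROT_EXEC) != 0) {
        munmap(buf, (size_t)page);
        return NULL;
    }

    /* Requirement 2: an explicit, separate step that brings the
       instruction and data caches into coherence for the region. */
    __builtin___clear_cache(buf, buf + len);
    return buf;
}
```

The point of keeping the last two steps apart is the one made above: the
protection change says nothing, by itself, about cache contents. Whether
a given system folds the synchronization into mprotect() is an
implementation choice that portable code cannot rely on.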
rfg@NCD.COM (Ron Guilmette) (12/25/90)
In article <1990Dec23.222149.9473@dg-rtp.dg.com> hamilton@siberia.rtp.dg.com (Eric Hamilton) writes:
>
>The mprotect() call gives us a standard answer to the first requirement
>(telling the OS that a given area is executable) but not the second (bringing
>the caches into instantaneous coherence). A fast trap for this purpose
>is at least discussably useful in both the V.3 and the V.4 worlds.

Ignoring the V.3 world for the moment, let me just ask some innocent
(and naive?) questions and see what pops up.

First question: When you say "the caches" are we talking about the I and
D caches on a single processor system, or are we talking about more than
that?

Let me assume for the moment that the problem that **most** (but
admittedly not all) folks are concerned about at the moment is the I/D
coherency for a uniprocessor.

Now please don't jump all over me if I've got my facts all confused, but
let me just toss out an off-the-cuff, top-of-the-head idea and see what
(if any) merit it might have.

Before I begin, let me say that it seems to me that (depending upon the
application and the frequency with which you are going to be using these
tricks) it might be acceptable to alternatively (a) write some
executable code into an area, then (b) call mprotect() to set the
permissions on the area to include EXECUTE, then (c) execute the code,
then (d) call mprotect again to set the area back to just read-write,
then (e) start the cycle all over.

If this would work, then I would imagine that it would be pretty easy to
get vendors to sync the I/D caches at the point of each
mprotect(EXECUTE) call.

Anyway, assuming that the frequency at which this stuff has to happen is
too high to allow that (simple?) solution, how about this instead?

You have an area which is mprotected to allow write & execute... when it
is first set up that way, the OS maps those pages into the D address
space of the process (using mapping tables referenced by the DATA cmmu)
but sets those same pages (at the same logical addresses) as "unmapped"
in the I address space (using the mapping tables referenced by the
INSTRUCTION cmmu).

Now I can write stuff in there and as soon as I try to execute any of it
I'll catch a page fault, right? At that instant, the OS could swap the
mappings (i.e. map the page IN TO the I space and OUT OF the D space)
and sync the caches. I could now proceed until I tried (later on) to
again treat the area as data space, at which time I would again catch a
page fault and the OS could again swap the mappings back again.

Whatdaya think?

>Thus, it may be reasonable not to worry about the BCS, but it is not the case
>that the V.4 ABI will render the whole discussion obsolete, nor that the advent
>of mprotect() alone will necessarily solve the problem or end the discussion.

Agreed, however we are getting into some rather low-level semantics
here. Is it possible to descend below the level addressed by the ABI and
to arrive at a level which is so low that its issues can only be
described as "quality of implementation" issues?

--
// Ron Guilmette - C++ Entomologist
// Internet: rfg@ncd.com uucp: ...uunet!lupine!rfg
// Motto: If it sticks, force it. If it breaks, it needed replacing anyway.
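
The mapping-swap idea above can be sketched as a two-state machine, which
also makes its cost visible. This is a toy model under stated
assumptions: one page, a uniprocessor, and a cache sync performed only
on the data-to-instruction swap (the posting does not say whether the
swap back needs one). All names (RwxPage, data_access, instr_fetch,
alternate) are hypothetical.

```c
#include <assert.h>

/* Which CMMU's tables currently map the page (hypothetical model). */
typedef enum { MAPPED_DATA, MAPPED_INSTR } Mapping;

typedef struct {
    Mapping mapping;  /* current mapping of the rwx page */
    int faults;       /* page faults taken to flip the mapping */
    int syncs;        /* cache synchronizations performed */
} RwxPage;

/* A freshly set-up rwx page is mapped for data access only. */
void page_init(RwxPage *p) {
    p->mapping = MAPPED_DATA;
    p->faults = 0;
    p->syncs = 0;
}

/* Load or store: faults if the page is currently mapped for execution;
   the handler swaps the mapping back to the data side. */
void data_access(RwxPage *p) {
    if (p->mapping != MAPPED_DATA) {
        p->faults++;
        p->mapping = MAPPED_DATA;
    }
}

/* Instruction fetch: faults if the page is mapped for data; the
   handler swaps the mapping and synchronizes the caches. */
void instr_fetch(RwxPage *p) {
    if (p->mapping != MAPPED_INSTR) {
        p->faults++;
        p->syncs++;
        p->mapping = MAPPED_INSTR;
    }
}

/* Alternate `rounds` times between writing code and executing it;
   returns the total number of faults taken so far. */
int alternate(RwxPage *p, int rounds) {
    for (int i = 0; i < rounds; i++) {
        data_access(p);   /* generate or patch code */
        instr_fetch(p);   /* run it */
    }
    return p->faults;
}
```

Starting from a fresh page, alternate(&page, 3) takes five faults and
three syncs in this model: repeated accesses of the same kind are free,
but every change of direction after the first write costs a trap. That
is the trade against an explicit synchronization call, which costs one
trap per batch of modifications regardless of how the page is used in
between.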
pardo@cs.washington.edu (David Keppel) (01/03/91)
>[Ongoing discussion about instruction-space modification]
To add to the fire, I have a paper on a portable interface for
instruction-space modification. The interface would be implemented
using (one or more of) the schemes that have been discussed here. A
PostScript copy of the paper is available via anonymous ftp from
`june.cs.washington.edu' (128.95.1.4) in `pub/pardo/fly.ps.Z'.
;-D on ( The king of runtime ) Pardo