execute Proposal

hamilton@siberia.rtp.dg.com (Eric Hamilton) (12/22/90)

In a previous posting, Jim Klingshirn asked about the need for
user instructions to generate cache control operations.  I
argued that a better and more implementable solution is to
supply a user trap, fielded by the operating system, which
will apply the appropriate cache operations.  I did not discuss
the details of how such a trap would be used, nor what the
implications for virtual memory and multiple
processors would be.  In other words, I assiduously avoided the
problem of code that is dynamically generated or moved, the problem
that Sean Foderaro refers to as "read-write-execute" data.
This posting proposes a solution to that problem.  The reader
should refer to Foderaro's posting for a discussion of why the
existing memctl() function is not a solution to this problem.

[In this context, an "Icache" is an instruction CMMU and a "Dcache"
 is a data CMMU.]

The problem:
In general, 88000 icaches do not snoop and instruction fetches are not
marked global.  Note the phrase "in general"; some vendors may build
systems in which either or both of these statements are not true, but in
general we shouldn't require that all current and future 88000s support
these capabilities, which are not free.  Thus, the icaches are not
generally coherent when instructions in memory are changed.

There are two relevant ways in which instructions can change.
First, there is the normal activity of the kernel virtual memory manager,
which pages code in as necessary, loads code from disk or across the
net, and generally moves instructions around in page-sized chunks.
Second, user programs may want to modify a region of memory and then try to
execute it.  This is a legitimate functional requirement; see the postings
by Sean Foderaro, Piercarlo Grandi, David Benjamin, and others for examples.

Because the icaches are not coherent, whenever any executable memory is
changed, it is necessary to invalidate (some portion of) the icache
and force (some portion of) the dcache to copyback modified data.  The
invalidate eliminates stale data from the instruction cache; the copyback
ensures that non-global instruction cache fills will not fetch stale
data from memory.


Multiprocessor examples:
In a multiprocessor system it is generally necessary to invalidate/copyback
all processors' caches.  This is because the offending data may be stale
in any processor's cache.  For example:

	Process A on processor 1 pages the instructions a1 into page p1.
	Process A executes a1 on processor 1.
	Process A is rescheduled onto processor 2 and executes a1.
	Process A is rescheduled onto processor 3 and executes a1.
	Process A terminates.
	Process B on processor 2 page faults on non-resident instruction b1.
	The virtual memory manager decides to fill page p1 with b1.
	A network demon decides to fetch b1 from a remote executable.
	The network demon is scheduled onto processor 1.
	The network demon starts copying b1 into p1.
	The network demon is rescheduled onto processor 3.
	The network demon is rescheduled onto processor 2.
	The network demon completes the pagein.
	Process B now resumes execution.

At this point, the instruction caches of processors 1, 2, and 3 incoherently
believe that page p1 contains a1, as does main memory.  The correct value,
b1, can be found in the data cache of one or more of the processors on which
the network demon executed.  It should be obvious that every icache must
invalidate and every dcache must copyback page p1 before B can safely
resume execution.
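
[A toy replay of the example above, as a minimal sketch in C.  It assumes
 that page p1 can be reduced to a single value, that the dcaches snoop one
 another, and that icache fills read main memory only.  The names
 data_write, ifetch, sync_all and replay are invented for illustration and
 model no real kernel interface.]

```c
#include <assert.h>
#include <stdbool.h>

#define NCPU 4  /* processors 0..3; the example uses 1..3 */

/* Page p1, reduced to one value, as seen by one processor's CMMUs. */
struct cpu {
    bool icache_valid; char icache_val;  /* instruction CMMU copy */
    bool dcache_dirty; char dcache_val;  /* data CMMU copy, if modified */
};

struct cpu cpus[NCPU];
char memory_p1;  /* what main memory holds for p1 */

/* Data write: dcaches snoop each other, so only the writer keeps a
 * dirty copy.  Main memory is not updated until a copyback. */
void data_write(int cpu, char val)
{
    assert(cpu >= 0 && cpu < NCPU);
    for (int i = 0; i < NCPU; i++)
        cpus[i].dcache_dirty = false;
    cpus[cpu].dcache_dirty = true;
    cpus[cpu].dcache_val = val;
}

/* Instruction fetch: hit in the local icache if valid, otherwise a
 * non-global fill from main memory that ignores every dcache. */
char ifetch(int cpu)
{
    assert(cpu >= 0 && cpu < NCPU);
    if (!cpus[cpu].icache_valid) {
        cpus[cpu].icache_val = memory_p1;
        cpus[cpu].icache_valid = true;
    }
    return cpus[cpu].icache_val;
}

/* The fix: copy back every dcache, invalidate every icache. */
void sync_all(void)
{
    for (int i = 0; i < NCPU; i++) {
        if (cpus[i].dcache_dirty) {
            memory_p1 = cpus[i].dcache_val;
            cpus[i].dcache_dirty = false;
        }
        cpus[i].icache_valid = false;
    }
}

/* Replays the example and returns what B fetches on processor 2. */
char replay(bool sync_before_resume)
{
    for (int i = 0; i < NCPU; i++)
        cpus[i] = (struct cpu){0};
    memory_p1 = 0;

    data_write(1, 'a');                /* a1 is paged into p1 ...        */
    sync_all();                        /* ... and made coherent          */
    ifetch(1); ifetch(2); ifetch(3);   /* A executes a1 on 1, 2 and 3    */
    data_write(1, 'b');                /* the demon copies b1 into p1,   */
    data_write(3, 'b');                /* migrating from processor 1     */
    data_write(2, 'b');                /* to 3 to 2 as it goes           */
    if (sync_before_resume)
        sync_all();
    return ifetch(2);                  /* B resumes on processor 2       */
}
```

[In this model replay(false) hands B the stale a1 from processor 2's
 icache, while replay(true) hands it b1.]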

Similar examples can be constructed for user-modified code, even if
only one process is involved.  Indeed, the problem is the
same: the offending writes, punctuated by reschedules, may have occurred
on several processors, instruction fetches may have occurred on
several processors, all of the instruction caches may be more or
less incoherent and the correct data may be scattered through
several data caches.


What this means for rwx pages:
There are two conclusions that follow from the discussion above.

1) The operating system must know, at page replacement time, that a
   particular page is potentially executable, and if it is, must
   issue to every processor a dcache copyback and an icache invalidate
   for that page.

2) If a user tries to execute data, a dcache copyback and an icache
   invalidate must be issued to every processor for the data area in
   question, after the data is modified and before it is executed.
   This is exactly the same operation that the operating system
   must perform during page replacement, so it is trivially true that the
   necessary hardware support is present.

This is not a problem for the operating system, which has more or less
direct access to the cache hardware and controls page replacement.
It is a problem for user rwx pages, for two reasons.  First,
user code has no direct access to the cache hardware.  Second, the
OS virtual memory manager must somehow be notified that a data page
is potentially executable, so that it can page it in correctly.

If these two problems can be overcome, there is no reason why
read-write-execute pages cannot be made to work on any current or
future 88000 processors, in uni-processor or multi-processor systems.

Proposal:
I propose that we support read-write-execute pages by defining mechanisms
that user applications may invoke to identify potentially executable
data and to provoke cache writebacks and invalidates as necessary.

I have already proposed a cache manipulation operation in a previous
posting to comp.sys.m88k:
>
>	r2 contains the base address
>	r3 contains the length
>
>	tb0 0,r0,<CacheSynchronizationTrap>
>
> Will cause the data and instruction caches for the specified region (between
> r2 and r2+r3-1, byte granular, no minimum length) to come into coherence,
> so that that region can be safely executed.
> If any byte within a four-byte word in this region is written,
> the subsequent execution of that word is
> undefined until another CacheSynchronizationTrap that covers that word is
> issued.  A length of zero is interpreted to mean all memory.
>
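
[A minimal sketch, in C, of how a trap handler might turn the byte-granular
 r2/r3 pair into the four-byte words it has to cover.  The type word_range
 and the function sync_range are hypothetical, and a 32-bit address space
 is assumed; a real handler would presumably round further, to whatever
 unit the cache hardware operates on.]

```c
#include <assert.h>
#include <stdint.h>

#define WORD_MASK 0x3u  /* low address bits within a four-byte word */

/* Inclusive range of word addresses covered by one sync request. */
struct word_range {
    uint32_t first;
    uint32_t last;
};

/* base is the value passed in r2, len the value passed in r3. */
struct word_range sync_range(uint32_t base, uint32_t len)
{
    struct word_range r;

    if (len == 0) {                  /* zero length means all memory */
        r.first = 0;
        r.last = UINT32_MAX & ~WORD_MASK;
        return r;
    }

    uint32_t last_byte = base + (len - 1);
    assert(last_byte >= base);       /* the region must not wrap around */

    r.first = base & ~WORD_MASK;     /* word holding the first byte */
    r.last = last_byte & ~WORD_MASK; /* word holding the last byte  */
    return r;
}
```

[For example, base 0x1002 with length 3 covers bytes 0x1002 through 0x1004,
 so both the word at 0x1000 and the word at 0x1004 must be synchronized.]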

We also need some way to notify the kernel that a piece of storage is
potentially executable.  The following mechanisms come to mind:

	- Add a MCT_RWX (state 4) argument to memctl().  When an area
	  is memctl'd to MCT_RWX the operating system must treat it as
	  potentially executable for paging purposes.  This is probably
	  the solution of choice in the BCS world.
	- Use mprotect() in the V.4 world for the same purpose.
	- Add bits to the executable format to indicate that stack extensions
	  and/or sbrk() extensions should be treated as potentially
	  executable.  This would be done as well as, not instead of, the
	  memctl/mprotect thing.

Note that the MCT_RWX memctl operation has exactly the interface, but not
the semantics, proposed by Foderaro.  It does not necessarily do any
cache manipulation at all; it merely notifies the virtual memory manager
that some pageins will in the future require special treatment.

For example, a LISP interpreter might choose to use the MCT_RWX memctl()
option to mark its entire heap as read-write-execute.  This would be
done once.  Whenever code was dynamically compiled into the heap,
and whenever code was moved by the garbage collector, the
CacheSynchronizationTrap would be issued by the application to bring
the instruction caches back into coherence.  Whenever the virtual
memory manager paged any part of the heap, it would recognize the
read-write-execute state and properly invalidate the instruction caches.
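
[The division of labor in the LISP example might look roughly like this in
 C.  This is a sketch only: mark_rwx stands in for the proposed
 memctl(MCT_RWX) or mprotect() call and cache_sync stands in for the
 CacheSynchronizationTrap; neither is a real interface, and here they do
 nothing but count calls.  install_code is likewise an invented name.]

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins; they only count calls in this sketch. */
int rwx_marks;    /* times the heap was declared read-write-execute */
int cache_syncs;  /* times the synchronization trap was issued      */

/* Stand-in for memctl(..., MCT_RWX) or the mprotect() equivalent. */
void mark_rwx(void *base, size_t len)
{
    (void)base;
    (void)len;
    rwx_marks++;
}

/* Stand-in for loading r2 and r3 and issuing the trap. */
void cache_sync(void *base, size_t len)
{
    (void)base;
    (void)len;
    cache_syncs++;
}

/* Used by both the dynamic compiler and the garbage collector:
 * write the instructions, then synchronize exactly the bytes written. */
void *install_code(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
    memmove(dst, src, len);  /* memmove, since the collector may slide code */
    cache_sync(dst, len);
    return dst;              /* safe to jump to only after the sync */
}
```

[The heap is marked once at startup; every later compile or move goes
 through install_code, so no path writes instructions without a sync.]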

rfg@NCD.COM (Ron Guilmette) (12/22/90)

In article <1990Dec21.201522.16487@dg-rtp.dg.com> hamilton@siberia.rtp.dg.com (Eric Hamilton) writes:
+
+1) The operating system must know, at page replacement time, that a
+   particular page is potentially executable...
...
+I propose that we support read-write-execute pages by defining mechanisms
+that user applications may invoke to identify potentially executable
+data and to provoke cache writebacks and invalidates as necessary.
...
+We also need some way to notify the kernel that a piece of storage is
+potentially executable.  The following mechanisms come to mind:
+
+	- Add a MCT_RWX (state 4) argument to memctl().  When an area
+	  is memctl'd to MCT_RWX the operating system must treat it as
+	  potentially executable for paging purposes.  This is probably
+	  the solution of choice in the BCS world.

+	- Use mprotect() in the V.4 world for the same purpose.

I frankly am having a hard time understanding what exactly this
discussion is all about.

I understand that it would be nice to have a "standard" way of telling
the OS that some part of the virtual address space is executable.  So
what?  As noted, in V.4 you will be able to use mprotect (or mmap) to
do this.

I see people talking about the BCS/OCS.  That's V.3 stuff!!!  Won't
the V.4 ABI make that all obsolete (and also give you mprotect)
anyway?  If so, what's the big deal?  Is it really worth it at this
stage to be fretting about what the OCS/BCS does (or does not) say?

Obviously, it *is* worthwhile to make sure that the precise semantics
of mprotect() are suitable to meet a variety of needs, but why should
anybody be haggling (at this late date) about OCS/BCS changes?

-- 

// Ron Guilmette  -  C++ Entomologist
// Internet: rfg@ncd.com      uucp: ...uunet!lupine!rfg
// Motto:  If it sticks, force it.  If it breaks, it needed replacing anyway.

hamilton@siberia.rtp.dg.com (Eric Hamilton) (12/24/90)

In article <3072@lupine.NCD.COM>, rfg@NCD.COM (Ron Guilmette) writes:
|> In article <1990Dec21.201522.16487@dg-rtp.dg.com> hamilton@siberia.rtp.dg.com (Eric Hamilton) writes:
|> +
|> +1) The operating system must know, at page replacement time, that a
|> +   particular page is potentially executable...
|> ...
|> +I propose that we support read-write-execute pages by defining mechanisms
|> +that user applications may invoke to identify potentially executable
|> +data and to provoke cache writebacks and invalidates as necessary.
|> ...
|> +We also need some way to notify the kernel that a piece of storage is
|> +potentially executable.  The following mechanisms come to mind:
|> +
|> +	- Add a MCT_RWX (state 4) argument to memctl().  When an area
|> +	  is memctl'd to MCT_RWX the operating system must treat it as
|> +	  potentially executable for paging purposes.  This is probably
|> +	  the solution of choice in the BCS world.
|> 
|> +	- Use mprotect() in the V.4 world for the same purpose.
|> 
|> I frankly am having a hard time understanding what exactly this
|> discussion is all about.
|>
The discussion is about how to have read-write-execute semantics
in multiprocessor systems without requiring that the hardware support,
without software intervention, coherency between the instruction cache(s)
and the data cache(s).

This requires two things.  One is a way of notifying the OS that a given
area of memory is both writable and executable.  The second is a way of
bringing the instruction caches into instantaneous coherence so that
an application can write an instruction into an rwx region, do the
thing that brings the caches into coherence, and then execute the newly
written instruction.
 
|> I understand that it would be nice to have a "standard" way of telling
|> the OS that some part of the virtual address space is executable.  So
|> what?  As noted, in V.4 you will be able to use mprotect (or mmap) to
|> do this.
|>
Not "nice" but "necessary", at least in the context of comp.sys.m88k;
it may be exactly the other way around in comp.arch.....


The mprotect() call gives us a standard answer to the first requirement
(telling the OS that a given area is executable) but not the second (bringing
the caches into instantaneous coherence).  A fast trap for this purpose
is at least discussably useful in both the V.3 and the V.4 worlds.
  
|> I see people talking about the BCS/OCS.  That's V.3 stuff!!!  Won't
|> the V.4 ABI make that all obsolete (and also give you mprotect)
|> anyway?  If so, what's the big deal?  Is it really worth it at this
|> stage to be fretting about what the OCS/BCS does (or does not) say?
|> 
|> Obviously, it *is* worthwhile to make sure that the precise semantics
|> of mprotect() are suitable to meet a variety of needs, but why should
|> anybody be haggling (at this late date) about OCS/BCS changes?
|>
The bulk of the discussion is relevant to both BCS and ABI.  It is true
that the ABI is closer to a complete solution because it has mprotect(),
and that a BCS solution is less interesting at this date.
A full ABI solution can be delivered with mprotect() plus a little bit more,
and a full BCS solution can be delivered by augmenting memctl() to
deliver the relevant mprotect() functionality, plus exactly the same little bit
more.

Thus, it may be reasonable not to worry about the BCS, but it is not the case
that the V.4 ABI will render the whole discussion obsolete, nor that the advent
of mprotect() alone will necessarily solve the problem or end the discussion. 

----------------------------------------------------------------------
Eric Hamilton				+1 919 248 6172
Data General Corporation		hamilton@dg-rtp.dg.com
62 Alexander Drive			...!mcnc!rti!xyzzy!hamilton
Research Triangle Park, NC  27709, USA

rfg@NCD.COM (Ron Guilmette) (12/25/90)

In article <1990Dec23.222149.9473@dg-rtp.dg.com> hamilton@siberia.rtp.dg.com (Eric Hamilton) writes:
>
>The mprotect() call gives us a standard answer to the first requirement
>(telling the OS that a given area is executable) but not the second (bringing
>the caches into instantaneous coherence).  A fast trap for this purpose
>is at least discussably useful in both the V.3 and the V.4 worlds.

Ignoring the V.3 world for the moment, let me just ask some innocent
(and naive?) questions and see what pops up.

First question:  When you say "the caches" are we talking about the
I and D caches on a single processor system, or are we talking about
more than that?

Let me assume for the moment that the problem that **most** (but admittedly
not all) folks are concerned about at the moment is the I/D coherency
for a uniprocessor.

Now please don't jump all over me if I've got my facts all confused, but
let me just toss out an off-the-cuff, top-of-the-head idea and see what
(if any) merit it might have.

Before I begin, let me say that it seems to me (depending upon
the application and the frequency with which you are going to be
using these tricks) that it might be acceptable to alternately
(a) write some executable code into an area, then (b) call mprotect()
to set the permissions on the area to include EXECUTE, then (c)
execute the code, then (d) call mprotect() again to set the area back
to just read-write, then (e) start the cycle all over.

If this would work, then I would imagine that it would be pretty easy
to get vendors to sync the I/D caches at the point of each mprotect(EXECUTE)
call.

Anyway, assuming that the frequency at which this stuff has to happen
is too high to allow that (simple?) solution, how about this instead?
You have an area which is mprotected to allow write & execute... when it
is first setup that way, the OS maps those pages into the D address space
of the process (using mapping tables referenced by the DATA cmmu) but sets
those same pages (at the same logical addresses) as "unmapped" in the I
address space (using the mapping tables referenced by the INSTRUCTION cmmu).

Now I can write stuff in there and as soon as I try to execute any of
it I'll catch a page fault, right?

At that instant, the OS could swap the mappings (i.e. map the page IN TO
the I space and OUT OF the D space) and sync the caches.

I could now proceed until I tried (later on) to again treat the area as
data space, at which time I would again catch a page fault and the OS
could again swap the mappings back again.
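
[The fault-driven scheme just described amounts to a two-state machine per
 page.  A minimal sketch in C, assuming the page starts out mapped for data
 and that the caches are synced only on the swap into I space, as described
 above.  The names rwx_page and touch are invented for illustration, and
 nothing here touches real mapping tables.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum space  { D_SPACE, I_SPACE };              /* where the page is mapped */
enum access { DATA_ACCESS, INSTRUCTION_FETCH };

struct rwx_page {
    enum space mapped_in;
    int faults;  /* page faults taken so far           */
    int syncs;   /* cache syncs performed by the "OS"  */
};

/* Models one access to the page.  Returns true if the access faulted
 * and the mappings had to be swapped. */
bool touch(struct rwx_page *pg, enum access kind)
{
    assert(pg != NULL);
    enum space needed = (kind == INSTRUCTION_FETCH) ? I_SPACE : D_SPACE;

    if (pg->mapped_in == needed)
        return false;        /* already mapped the right way: no fault */

    pg->faults++;
    if (needed == I_SPACE)
        pg->syncs++;         /* sync caches before the page is executed */
    pg->mapped_in = needed;  /* map into one space, out of the other */
    return true;
}
```

[A page that is written, executed, written again and executed again takes
 three faults and two syncs in this model, so the cost of the scheme
 depends on how often a program alternates between the two uses.]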

Whatdaya think?

>Thus, it may be reasonable not to worry about the BCS, but it is not the case
>that the V.4 ABI will render the whole discussion obsolete, nor that the advent
>of mprotect() alone will necessarily solve the problem or end the discussion. 

Agreed, however we are getting into some rather low-level semantics here.
Is it possible to descend below the level addressed by the ABI and to
arrive at a level which is so low that its issues can only be described
as "quality of implementation" issues?

-- 

// Ron Guilmette  -  C++ Entomologist
// Internet: rfg@ncd.com      uucp: ...uunet!lupine!rfg
// Motto:  If it sticks, force it.  If it breaks, it needed replacing anyway.

pardo@cs.washington.edu (David Keppel) (01/03/91)

>[Ongoing discussion about instruction-space modification]

To add to the fire, I have a paper on a portable interface for
instruction-space modification.  The interface would be implemented
using (one or more of) the schemes that have been discussed here.  A
PostScript copy of the paper is available via anonymous ftp from
`june.cs.washington.edu' (128.95.1.4) in `pub/pardo/fly.ps.Z'.

	;-D on  ( The king of runtime )  Pardo