dbjag@io.UUCP (David Benjamin) (12/07/90)
Does anyone know of a good method for forcing a memory cache flush on a
Data General Aviion 300 series workstation? Specifically, I would like
to request a specific line flush from the m88200 cache handling
instructions. I'm not too keen on a full flush of both caches, as that
would take too much time.

The reason is kind of hairy, but if you must know, it involves
self-modifying code which seems to fail on the Aviion when the caches
get out of sync. There, glad you asked?

This brings up another question. The code is apparently failing because
the two caches contain different values for the same address. Why wasn't
this state prevented by the "M-bus snooping" of the 88200's? Perhaps my
understanding of their function is warped.

Thanks in advance.

--
- Dave Benjamin -
- Interleaf -
- ...!eddie.mit.EDU!ileaf!dbjag -
lewine@cheshirecat.webo.dg.com (Donald Lewine) (12/07/90)
In article <2308@io.UUCP>, dbjag@io.UUCP (David Benjamin) writes:
|> Does anyone know of a good method for forcing a memory cache flush on
|> a Data General Aviion 300 series workstation. Specifically, I would like
|> to request a specific line flush from the m88200 cache handling instructions.
|> I'm not too keen on a full flush of both caches, as that would take too much
|> time.
|>
|> The reason is kind of hairy, but if you must know, it involves self-modifying
|> code which seems to fail on the Aviion when the caches get out of sync.
|> There, glad you asked?
|>
|> This brings up another question. The code is apparently failing because
|> the two caches contain different values for the same address.
|> Why wasn't this state prevented by the "M-bus snooping" of the 88200's?
|> Perhaps my understanding of their function is warped.

The memctl() function is used to support self-modifying code,
compile-and-go languages, and the like. It lets you mark a region of
memory as writable or executable; it may not be both executable and
writable at the same time. See the DG/UX documentation or the 88open
Binary Compatibility Standard for details of memctl().

The 88200 (by default) does not do M-bus snooping for the instruction
cache. It assumes that there will be no writes to the instruction
stream, and turning off snooping gives higher performance. The memctl()
function may be used to override the default.

Another solution is to chase the data out of the cache. You can make
some memory references that will conflict with the instructions you want
to flush and thus chase them out of the cache. NOTE: This is specific to
the 88200. If Motorola were to make a larger cache chip, this code may
stop working.

--------------------------------------------------------------------
Donald A. Lewine                (508) 870-9008 Voice
Data General Corporation        (508) 366-0750 FAX
4400 Computer Drive. MS D112A
Westboro, MA 01580  U.S.A.
uucp: uunet!dg!lewine   Internet: lewine@cheshirecat.webo.dg.com
andrew@frip.WV.TEK.COM (Andrew Klossner) (12/08/90)
[]

"I would like to request a specific line flush from the m88200 cache
handling instructions. I'm not too keen on a full flush of both caches,
as that would take too much time."

In the Motorola kernel, the time it takes to flush the cache is small
compared to the time it takes to get into and out of the system call
handler. I'd be surprised if Data General has significantly smaller
syscall overhead.

"The code is apparently failing because the two caches contain different
values for the same address. Why wasn't this state prevented by the
"M-bus snooping" of the 88200's?"

Again, in the Motorola kernel, only the data caches are told to
participate in snooping. Performance would be seriously degraded if the
instruction caches were to join in. On a snoop cycle, *every* snooping
88200 must stop what it's doing for one cycle and listen in. Even if the
CPU is fetching instructions from cache, it's held up.

  -=- Andrew Klossner   (uunet!tektronix!frip.WV.TEK!andrew)      [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)     [ARPA]
stevec@bubbyb.Solbourne.COM (Steve Cox) (12/11/90)
In article <2308@io.UUCP> dbjag@io.UUCP (David Benjamin) writes:
>The reason is kind of hairy, but if you must know, it involves self-modifying
>code which seems to fail on the Aviion when the caches get out of sync.
>There, glad you asked?
>
>This brings up another question. The code is apparently failing because
>the two caches contain different values for the same address.
>Why wasn't this state prevented by the "M-bus snooping" of the 88200's?
>Perhaps my understanding of their function is warped.

yes, i believe your understanding is correct (that the caches are
supposed to be coherent). but perhaps the code you are trying to modify
AND THEN EXECUTE is not marked as global by the DG operating system (?).
i believe that coherency is not maintained for data that isn't marked as
global.

it doesn't seem reasonable that this is a problem (or "state" as you
say) in the cache coherency protocol (i.e. there'd be all sorts of
problems in any multi-88200 system if such a bug existed).

- stevec

--
steve cox                      stevec@solbourne.com
solbourne computer, inc.       i've got the need...
1900 pike, longmont, co        the need...
(303)772-3400                  for speed!
andrew@frip.WV.TEK.COM (Andrew Klossner) (12/11/90)
[]

"Another solution is to chase the data out of the cache. You can make
some memory references that will conflict with the instructions you want
to flush and thus chase them out of the cache."

But the problem is stale data in the instruction cache. Memory
references won't help here; you have to do instruction fetches to flush
that data. That would be a worst case of 16 fetches of different
instructions, each located at the indicated offset within a page. I
suppose it could be done, but it sounds pretty fragile.

  -=- Andrew Klossner   (uunet!tektronix!frip.WV.TEK!andrew)      [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)     [ARPA]
jdarcy@encore.com (Jeff d'Arcy) (12/11/90)
stevec@bubbyb.Solbourne.COM (Steve Cox) writes:
>>Why wasn't this state prevented by the "M-bus snooping" of the 88200's?
>>Perhaps my understanding of their function is warped.
>
>yes, i believe your understanding is correct (that the caches are supposed
>to be coherent). but, perhaps the code you are trying to modify AND THEN
>EXECUTE is not marked as global by the DG operating system.(?)

I came in late on this discussion, so if someone's already covered this
ground please forgive me. In order for snooping to occur, the Global bit
in the writer's page descriptor must be set, and the Snoop Enable bit
must also be set in the reader's System Control Register. It is quite
believable that this bit will *not* be set for the Instruction CMMU,
therefore snooping will not occur, and therefore...

You could approach this problem in several ways. Setting Snoop Enable in
the I-CMMU might work, as would a flush prior to executing the modified
code. My favored solution would be *not* to execute "dirty" code, but I
recognize that others don't share my viewpoint there.

--
Jeff d'Arcy, Generic Software Engineer - jdarcy@encore.com
  Non sunt multiplicanda entia praeter necessitatem!
jimk@oakhill.UUCP (Jim Klingshirn) (12/11/90)
In article <2308@io.UUCP>, dbjag@io.UUCP (David Benjamin) writes:
|> Does anyone know of a good method for forcing a memory cache flush on
|> a Data General Aviion 300 series workstation. Specifically, I would like
|> to request a specific line flush from the m88200 cache handling instructions.
|> I'm not too keen on a full flush of both caches, as that would take too much
|> time.
|>
|> The reason is kind of hairy, but if you must know, it involves self-modifying
|> code which seems to fail on the Aviion when the caches get out of sync.
|> There, glad you asked?
|>
|> This brings up another question. The code is apparently failing because
|> the two caches contain different values for the same address.
|> Why wasn't this state prevented by the "M-bus snooping" of the 88200's?
|> Perhaps my understanding of their function is warped.

Don Lewine suggested two ways to flush the cache: the first is to use
the memctl() system call, which was designed for this type of thing; the
second was to chase the data out of the cache.

Using memctl() is the only way to ensure that the code will be portable
at all. Since it's a system call, it allows the operating system to
clean up all caches, pipelines, buffers, etc. It can be customized for
the particular version of the 88000 that your application is running on,
and can work correctly on any 88000 BCS compliant system. In fact it may
even be fairly efficient, since the 88200 can flush or invalidate a line
at a time.

On the other hand, there is no way for application software to reliably
chase data out of a cache. To define an algorithm that would clear out
any size cache, any number of caches, any cache associativity, and any
type of cache is impossible, especially when you consider that you may
need to clear instruction prefetch buffers, instruction issue pipelines,
and branch caches in addition to clearing the instruction cache(s).

We periodically receive requests to add user-level instructions to
generate cache control operations. The justification generally is
centered around data caches - for instance, how can you force data out
so that it is guaranteed to get out to the graphics frame buffer when
the cache is in copyback mode. To date, the only justification I've
heard for instruction cache control is to support self-modifying code
(including breakpoints). Assuming there are alternate ways to support
breakpoints - should we worry about instruction cache control
operations?

Jim Klingshirn,
Motorola 88000 Design
pierson@encore.com (Dan L. Pierson) (12/12/90)
In article <4322@photon.oakhill.UUCP> jimk@oakhill.UUCP (Jim Klingshirn) writes:
   We periodically receive requests to add user level instructions
   to generate cache control operations. The justification generally
   is centered around data caches - for instance how can you force data
   out so that it is guaranteed to get out to the graphics frame buffer
   when the cache is in copyback mode. To date, the only justification
   I've heard for instruction cache control is to support self
   modifying code (including breakpoints). Assuming there are alternate
   ways to support breakpoints - should we worry about instruction cache
   control operations?
Yes, any language system that incrementally compiles or dynamically
loads code then executes it needs to flush caches. Lisp is a good
example of the worst case here. Many Lisp systems:
1. Put compiled code (incrementally or loaded) into dynamic space.
2. Manage dynamic space with a copying garbage collector. This
means that code may move between calls; however, the old code
location will only be reused after the entire semi-space it's
part of has been freed for reuse (there are less friendly GC
algorithms...).
3. Loading a Lisp compiled file can consist of an arbitrary
sequence of: load a bit, execute some of what you just loaded,
load some more, etc. The i cache has to be cleaned as
efficiently as possible in here.
Lisp is not the only language with this type of requirement, just a
good example.
--
dan

In real life: Dan Pierson, Encore Computer Corporation, Research
UUCP: {talcott,linus,necis,decvax}!encore!pierson
Internet: pierson@encore.com
meissner@osf.org (Michael Meissner) (12/12/90)
In article <PIERSON.90Dec11173050@xenna.encore.com> pierson@encore.com (Dan L. Pierson) writes:
| 2. Manage dynamic space with a copying garbage collector. This
|    means that code may move between calls however the old code
|    location will only be reused after the entire semi-space it's
|    part of has been freed for reuse (there are less friendly GC
|    algorithms...).

Beware that compilers for non-GC languages (like C) may cache a
function's address in a register between calls.

--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Considering the flames and intolerance, shouldn't USENET be spelled
ABUSENET?
dfields@urbana.mcd.mot.com (David Fields) (12/13/90)
In article <PIERSON.90Dec11173050@xenna.encore.com> pierson@encore.com (Dan L. Pierson) writes:
>In article <4322@photon.oakhill.UUCP> jimk@oakhill.UUCP (Jim Klingshirn) writes:
>> We periodically receive requests to add user level instructions
>> to generate cache control operations. The justification generally
>> is centered around data caches - for instance how can you force data
>> out so that it is guaranteed to get out to the graphics frame buffer
>> when the cache is in copyback mode. To date, the only justification
>> I've heard for instruction cache control is to support self
>> modifying code (including breakpoints). Assuming there are alternate
>> ways to support breakpoints - should we worry about instruction cache
>> control operations?
>
>Yes, any language system that incrementally compiles or dynamically
>loads code then executes it needs to flush caches. Lisp is a good
>example of the worst case here. Many Lisp systems:
>
><lisp scenario described>
>
>Lisp is not the only language with this type of requirements, just a
>good example.

First, I'm not a chip designer, nor do I work with dynamic compilation
systems, so ... I've got a few questions. I'll attempt to answer them as
best as I can and see if anyone else can do a better job. I'm not going
to consider having the instruction and data caches be coherent, because
then you would have to address the instruction pipeline, and I believe
that most of us would find that entirely too expensive.

1) What are the reasons to make cache flush operations privileged? In a
   multi-processor system, a user could slow response times of other
   users a little easier than he could before. Are there any other
   reasons? Are there any security (i.e. Orange Book) issues? For
   example, does this make an unlimitable high bandwidth covert channel?

2) What's the cost of the user-level cache flush instruction? It seems
   as though it shouldn't affect cycle time, since you have to provide
   this in supervisor mode anyway. Is this true? Are there any other
   costs?

3) What's the real performance penalty of not having one? It's more
   difficult for the implementors of the dynamic compilation system,
   but from what I've heard many get around the problem by compiling
   and flushing in larger chunks to amortize the cost. Just like
   garbage collection. I seem to remember Andrew Klossner posting about
   someone doing this at Tek. Does anyone have any performance info
   that they could post?

--
Dave Fields // Motorola MCD // uiucuxc!udc!dfields // dfields@urbana.mcd.mot.com
andrew@frip.WV.TEK.COM (Andrew Klossner) (12/13/90)
[]

"Using memctl() is the only way to ensure that the code will be portable
at all ... In fact it may even be fairly efficient since the 88200 can
flush or invalidate a line at a time."

In the Motorola kernel, the memctl implementation is moby inefficient.
It has to be, because there's no memctl option that says "just flush
cache"; instead, the code must run through segment and page descriptors
flipping the writable state.

This was enough of a problem that Tektronix implemented a "just flush
the cache" system call in our kernel, for a customer who did incremental
compilation and didn't need to be BCS compliant.

  -=- Andrew Klossner   (uunet!tektronix!frip.WV.TEK!andrew)      [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)     [ARPA]
pierson@encore.com (Dan L. Pierson) (12/13/90)
In article <MEISSNER.90Dec11221308@curley.osf.org> meissner@osf.org (Michael Meissner) writes:
>In article <PIERSON.90Dec11173050@xenna.encore.com> pierson@encore.com (Dan L. Pierson) writes:
>| 2. Manage dynamic space with a copying garbage collector. This
>|    means that code may move between calls however the old code
>|    location will only be reused after the entire semi-space it's
>|    part of has been freed for reuse (there are less friendly GC
>|    algorithms...).
>
>Beware that compilers for non-GC languages (like C), may cache a
>function's address in a register between calls.

Obviously; that's a standard constraint when interfacing such languages
to C and friends. It's also one of the reasons why technically inferior
GC algorithms such as mark and sweep may actually work better for
multi-language applications.

This is rapidly becoming a digression from comp.sys.m88k.

--
dan

In real life: Dan Pierson, Encore Computer Corporation, Research
UUCP: {talcott,linus,necis,decvax}!encore!pierson
Internet: pierson@encore.com
df@phx.mcd.mot.com (Dale Farnsworth) (12/13/90)
> "Using memctl() is the only way to insure that the code will
> be portable at all ... In fact it may even be fairly efficient
> since the 88200 can flush or invalidate a line at a time."
>
> In the Motorola kernel, the memctl implementation is moby inefficient.
> It has to be, because there's no memctl option that says "just flush
> cache"; instead, the code must run through segment and page descriptors
> flipping the writable state.
>
> This was enough of a problem that Tektronix implemented a "just flush
> the cache" system call in our kernel, for a customer who did
> incremental compilation and didn't need to be BCS compliant.

Just to clear up some misinformation ... The above quote must refer to
another vendor. There is nothing in the BCS specification of memctl
which requires flipping the writable state of page descriptors, and no
Motorola kernel has ever done so.

Normally, our implementation of memctl validates arguments, sets a
couple of flags, and flushes the caches. There is a single exception to
this: the first (and only the first) time a shared text region is made
modifiable by memctl, it is changed into a copy-on-write region, which
incurs additional overhead. I can think of no reason that a memctl
implementation needs to have significantly more overhead than a system
call to flush the caches.

-Dale

--
Dale Farnsworth		Motorola Computer Group
alan@encore.encore.COM (Alan Langerman) (12/14/90)
In article <4322@photon.oakhill.UUCP>, jimk@oakhill.UUCP (Jim Klingshirn) writes:
|> We periodically receive requests to add user level instructions
|> to generate cache control operations. The justification generally
|> is centered around data caches - for instance how can you force data
|> out so that it is guaranteed to get out to the graphics frame buffer
|> when the cache is in copyback mode. To date, the only justification
|> I've heard for instruction cache control is to support self
|> modifying code (including breakpoints). Assuming there are alternate
|> ways to support breakpoints - should we worry about instruction cache
|> control operations?
|>
|> Jim Klingshirn,
|> Motorola 88000 Design

Not only self-modified code, but other-modified code. A process can have
its text mapped read/write by a second process, which in turn may alter
that text, possibly while the first process is running. (We do this on
our MP systems, on which we have built a sophisticated shared-memory
debugger.) So we need to be able to flush not only the current
processor's I-cache and pipe but also some OTHER processor's I-cache and
pipe. (I'm not directly involved in this work, so I'm not sure how we
currently solve this problem.)

-----
Alan Langerman (alan@encore.com)
pcg@cs.aber.ac.uk (Piercarlo Grandi) (12/14/90)
I have crossposted to comp.arch, and redirected followups there too,
because this is a very general architectural question, not just limited
to the 88k.

On 11 Dec 90 01:37:22 GMT, jimk@oakhill.UUCP (Jim Klingshirn) said:

jimk> We periodically receive requests to add user level instructions to
jimk> generate cache control operations. The justification generally is
jimk> centered around data caches - for instance how can you force data
jimk> out so that it is guaranteed to get out to the graphics frame
jimk> buffer when the cache is in copyback mode. To date, the only
jimk> justification I've heard for instruction cache control is to
jimk> support self modifying code (including breakpoints).

jimk> Assuming there are alternate ways to support breakpoints - should
jimk> we worry about instruction cache control operations?

Oh yes, definitely. Self-modifying code is of the essence. It is *vital*
as a technique to support efficiently, especially on a RISC machine,
many advanced programming language constructs. If you want to do just
Fortran and C, no problem, but probably you don't want to.

Basically, it is extremely important to be able to generate or modify
small code sequences ("thunks") on the fly in many OO languages, in some
older languages, and in many non-OO languages as well. Consider Self, as
just a small example; also, virtual functions in C++ can be efficiently
implemented with thunks, and one 386 compiler already uses them. Modern
Smalltalk implementations generate code on the fly as well, and so on.

Also, it is the easiest way to support languages that run in an
environment, such as many Lisps and AI languages, which have an embedded
compiler and generate compiled code in the workspace. Other applications
may be graphic algorithms; in many cases one can produce thunks, or
modify existing canned ones, so that certain repetitive drawing
operations can be done by high-speed code customized for the precise
purpose. I guess this is why a guy from Interleaf is interested in
self-modifying code.

Dynamic code generation, which is indistinguishable from self-modifying
code of old, is also especially suited to RISC machines to give a sort
of dynamic microprogramming; you synthesize higher-level operations
dynamically out of the simple RISC ones, without having to consider in
advance all the cases. After all, consider that self-modifying code was
used most often in old machines to simulate index registers :-).

--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
hamilton@siberia.rtp.dg.com (Eric Hamilton) (12/18/90)
Says Jim Klingshirn:
|> We periodically receive requests to add user level instructions
|> to generate cache control operations. The justification generally
|> is centered around data caches - for instance how can you force data
|> out so that it is guaranteed to get out to the graphics frame buffer
|> when the cache is in copyback mode. To date, the only justification
|> I've heard for instruction cache control is to support self
|> modifying code (including breakpoints). Assuming there are alternate
|> ways to support breakpoints - should we worry about instruction cache
|> control operations?

[Attributions below are based on other postings in this thread]

Yes, there is a need for instruction cache control operations, and the
need extends far beyond breakpoints (Piercarlo Grandi, Dan Pierson, Dave
Benjamin, and others). Furthermore, the current memctl() call is not a
good answer, for two reasons. First, it is slow, at least in some
implementations. Second, as John Foderaro points out, it has the wrong
semantics for some applications, such as Lisp garbage collection.

On multi-processor systems it is generally necessary to apply the cache
operation to all processors in the system (Alan Langerman, with the
approval of MP OS folks everywhere). A user-level instruction is a very
bad way to implement such a feature, because it would require a path
that allows an instruction executed on one processor to affect the
caches of another processor in a way that can be safely used by multiple
processors simultaneously from user space, without any locking or
coordination between each other or the OS. A hardware implementation of
this is completely unreasonable (think about how two processors would
simultaneously issue page flush operations on different pages, and wait
for the flushes to complete, from user space, without deadlocking or
tromping over each other's commands). And if some software support is
required, then the obvious way to get this support is to trap to the
operating system, which can make the right thing happen across all
processors in the system.

Thus, the problem is not how to get the right support for user cache
operations into the hardware. The problem is to devise an OS trap which
is fast enough and has the right semantics. Additional support from the
88000 hardware will be required only if we cannot implement an OS trap
with acceptable functionality and performance. This suggests two
questions:

	1) What is "acceptable functionality"?
	2) What is "acceptable performance"?

Functionality:

I believe that John Foderaro's proposal to extend memctl() with an
option to bring the data and instruction caches into coherence is
excellent, but it doesn't go quite far enough. We shouldn't use memctl()
because:

	1) It requires that the length be a multiple of the page size,
	   and I'd like to be able to use the line-granular cache
	   operations when possible; and

	2) It's a system call, system call entry/exit overhead is
	   appreciable, and I'd like the trap to be as fast as possible.

How about a new trap:

	r2 contains the base address
	r3 contains the length

	tb0 0,r0,<CacheSynchronizationTrap>

This will cause the data and instruction caches for the specified region
(between r2 and r2+r3-1, byte granular, no minimum length) to come into
coherence, so that that region can be safely executed. If any byte
within a four-byte word in this region is written, the subsequent
execution of that word is undefined until another
CacheSynchronizationTrap that covers that word is issued. A length of
zero is interpreted to mean all memory.

No error checking is necessary. If the region contains invalid
addresses, nothing bad happens; the copyback/invalidate just becomes
moot.

Performance:

The execution time for this trap will, of course, vary according to the
details of the system. It would not be surprising to discover that it
takes twice as long on a system with two 88200s per Pbus, for example.
Copyback times obviously will vary according to the number of dirty
lines that must be copied back. Invalidating parts of the instruction
cache has a performance impact that goes beyond the time required to do
the invalidation.

We cannot control this time, but we can control the overhead required to
get into and out of the cache control operation. How do people feel
about a target of 10 clock cycles overhead? How about 100? 200? 2000? It
is my belief that this trap can be made blindingly fast and that the
overhead will be small compared with the actual cost of doing the cache
manipulation.

How do people feel about this approach? Is it promising enough to
justify the work of drafting a proposal and submitting it to 88Open?