[comp.arch] m88200 cache flushes on DG Aviion

pcg@cs.aber.ac.uk (Piercarlo Grandi) (12/14/90)

I have crossposted to comp.arch, and redirected followups there too
because this is a very general architectural question, not just limited
to the 88k.

On 11 Dec 90 01:37:22 GMT, jimk@oakhill.UUCP (Jim Klingshirn) said:

jimk> We periodically receive requests to add user level instructions to
jimk> generate cache control operations.  The justification generally is
jimk> centered around data caches - for instance how can you force data
jimk> out so that it is guaranteed to get out to the graphics frame
jimk> buffer when the cache is in copyback mode.  To date, the only
jimk> justification I've heard for instruction cache control is to
jimk> support self modifying code (including breakpoints).

jimk> Assuming there are alternate ways to support breakpoints - should
jimk> we worry about instruction cache control operations?

Oh yes, definitely. Self modifying code is of the essence. It is *vital*
to support it efficiently, especially on a RISC machine, because many
advanced programming language constructs depend on it. If you want to do
just Fortran and C, no problem, but you probably don't want to.

Basically it is extremely important to be able to generate or modify
small code sequences ("thunks") on the fly. Many OO languages need this,
and so do some older languages, many non-OO languages, and plenty of
other cases besides.

Consider Self, as just a small example; also, virtual functions in C++
can be efficiently implemented with thunks, and one 386 compiler already
uses them. Modern Smalltalk implementations generate code on the fly as
well, and so on.

Also, it is the easiest way to support languages that run in an
environment, such as many Lisps, AI languages, which have an embedded
compiler and generate compiled code in the workspace.

Other applications are graphics algorithms; in many cases one can
produce thunks, or modify existing canned ones, so that certain
repetitive drawing operations can be done by high speed code customized
for the precise purpose. I guess this is why a guy from Interleaf is
interested in self modifying code.

Dynamic code generation, which is indistinguishable from the self
modifying code of old, is also especially suited to RISC machines: it
gives a sort of dynamic microprogramming, where you synthesize higher
level operations dynamically out of the simple RISC ones, without having
to consider all the cases in advance. After all, consider that self
modifying code was most often used on old machines to simulate index
registers :-).
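
[A minimal sketch of the kind of thing being described, written for a
modern x86-64 Unix rather than the 88k; mmap() and the exact byte
sequence are just one way to do it, and the __builtin___clear_cache()
call marks the spot where a split-cache machine needs an explicit
I-cache synchronization (on the x86 it happens to be a no-op):]

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
        /* Get a page that is both writable and executable (W+X only to
           keep the sketch short). */
        unsigned char *buf = mmap(NULL, 4096,
                                  PROT_READ|PROT_WRITE|PROT_EXEC,
                                  MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return 1;

        /* A one-instruction "thunk" in x86-64 machine code:
           mov eax, 42 ; ret */
        static const unsigned char thunk[] = { 0xb8, 42, 0, 0, 0, 0xc3 };
        memcpy(buf, thunk, sizeof thunk);

        /* The step this whole thread is about: make the I-cache see the
           freshly written code.  A no-op on the x86, mandatory on most
           split-cache RISCs. */
        __builtin___clear_cache((char *)buf, (char *)(buf + sizeof thunk));

        int (*fn)(void) = (int (*)(void))buf;  /* data pointer -> code pointer */
        printf("%d\n", fn());                  /* prints 42 */
        return 0;
    }
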
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

mcdonald@aries.scs.uiuc.edu (Doug McDonald) (12/14/90)

In article <PCG.90Dec13214759@odin.cs.aber.ac.uk> pcg@cs.aber.ac.uk (Piercarlo Grandi) writes:
>
>
>jimk> We periodically receive requests to add user level instructions to
>jimk> generate cache control operations. 
>jimk> To date, the only
>jimk> justification I've heard for instruction cache control is to
>jimk> support self modifying code (including breakpoints).
>
>jimk> Assuming there are alternate ways to support breakpoints - should
>jimk> we worry about instruction cache control operations?
>
>Oh yes, definitely. Self modifying code is of the essence. It is *vital*
>to support it efficiently, especially on a RISC machine, because many
>advanced programming language constructs depend on it.
This last is really, absolutely, true.

>[various other excellent reasons]

>Also, it is the easiest way to support languages that run in an
>environment, such as many Lisps, AI languages, which have an embedded
>compiler and generate compiled code in the workspace.
>

Another use for this is simply the incremental compiler. I have two vital
programs that depend on efficient, easy, incremental compilation:
the user inputs various expressions (arithmetic) that are then compiled
and executed as part of the application. For this we need to have the
code generated on the fly inside the application program itself. This
is trivial so long as the OS doesn't simply prevent it and
you can guarantee that stuff you write as data will execute correctly
as code.
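
[On a paged Unix the "OS doesn't simply prevent it" part usually comes
down to one protection-change call; a sketch, assuming start and len are
page-aligned:]

    #include <sys/mman.h>
    #include <stddef.h>

    /* Make an existing data buffer executable so that code generated
       into it can be run.  mprotect() requires start and len to cover
       whole pages; returns 0 on success. */
    static int make_executable(void *start, size_t len)
    {
        return mprotect(start, len, PROT_READ | PROT_EXEC);
    }
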

I agree that any machine that prohibits any of this is terminally
broken.

Doug McDonald

mash@mips.COM (John Mashey) (12/15/90)

In article <1990Dec14.031745.8840@ux1.cso.uiuc.edu> mcdonald@aries.scs.uiuc.edu (Doug McDonald) writes:
....
>Another use for this is simply the incremental compiler. I have two vital
>programs that depend on efficient, easy, incremental compilation:
>the user inputs various expressions (arithmetic) that are then compiled
>and executed as part of the application. For this we need to have the
>code generated on the fly inside the application program itself. This
>is trivial so long as the OS doesn't simply prevent it and
>you can guarantee that stuff you write as data will execute correctly
>as code.

>I agree that any machine that prohibits any of this is terminally
>broken.

I think I agree, in the sense that there always must be some way
to do this.  However, anybody who thinks that there must be a way to
generate instructions, then jump to them, without doing anything
else, is doomed to increasing heartbreak, as:
1) More and more implementations use split Instruction and Data caches,
for rational technical reasons.
2) Most implementations provide no automatic way to keep I-caches
instantaneously coherent with D-cache changes,
for rational technical reasons.
3) Even when it's possible, the OS usually turns such synchronization
off, given the performance hit implied.

Of course, some designs are willing to pay the price, given the need
to run large amounts of existing code that did self-modification.
For example, I think the Amdahl machines use split I & D caches, but
synchronize them.  The 486 uses a single joint cache, and does pretty
well at hiding the overhead by other means.  Current RISC architectures
would pay a much higher price to have automatic synchronization of
split caches, which is why they don't do it.

All of this means that most UNIXes based on these things have some
system call, or other programmatic interface, that at least
causes a specified chunk of memory to become synchronized.
Here's a good discussion topic, and in fact, maybe this is something
where it would be AWFULLY nice to get standardized among people
that do it:

WHAT'S THE PROGRAMMATIC INTERFACE FOR CACHE-FLUSHING (and any other
cache-manipulation operations) on your favorite machine?
ARE THERE ANY STANDARDS FOR SUCH THINGS ACROSS VENDORS? (I can hope :-)

If nobody is working on standardizing the programmatic interface,
we probably should be, as a service to the industry...
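
[For what it's worth, here is the shape of the wrapper that applications
end up writing; the per-system calls underneath are real ones, but from
systems later than this thread (GCC's __builtin___clear_cache, Win32
FlushInstructionCache, the Mac's sys_icache_invalidate), and the wrapper
name itself is made up:]

    #include <stddef.h>

    #if defined(_WIN32)
    #include <windows.h>
    #elif defined(__APPLE__)
    #include <libkern/OSCacheControl.h>
    #endif

    /* flush_code_range() is a made-up name: "I just wrote instructions
       into [addr, addr+len); make sure instruction fetch will see them." */
    static void flush_code_range(void *addr, size_t len)
    {
    #if defined(_WIN32)
        FlushInstructionCache(GetCurrentProcess(), addr, len);
    #elif defined(__APPLE__)
        sys_icache_invalidate(addr, len);
    #elif defined(__GNUC__)
        __builtin___clear_cache((char *)addr, (char *)addr + len);
    #else
    #error "no user-level cache synchronization interface known here"
    #endif
    }
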
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mcdonald@aries.scs.uiuc.edu (Doug McDonald) (12/15/90)

In article <44118@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>
>WHAT'S THE PROGRAMMATIC INTERFACE FOR CACHE-FLUSHING (and any other
>cache-manipulation operations) on your favorite machine?
>ARE THERE ANY STANDARDS FOR SUCH THINGS ACROSS VENDORS? (I can hope :-)
>
>If nobody is working on standardizing the programmatic interface,
>we probably should be, as a service to the industry...
>-- 

This is not a job for the OS (the standardization, that is). It
is a job for the languages. In some, like Fortran 77, of course, you
can't execute data. But in C you can. Almost. The "almost" is 
unequivocally the worst failure of the ANSI C spec: it should have 
included a routine (or macro) that makes a section of data into
a code area and returns a code pointer to it.
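
[Something like the following, say; the name and signature are purely
hypothetical, not anything the committee actually considered:]

    #include <stddef.h>

    /* Hypothetical library addition: take len bytes of machine code that
       the program has built in an ordinary data buffer, do whatever the
       implementation requires (copying, protection changes, cache
       synchronization), and return something callable, or a null
       function pointer if it cannot be done. */
    void (*mkcode(const void *data, size_t len))(void);

    /* usage:  void (*f)(void) = mkcode(buf, n);  if (f) f(); */
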

Doug McDonald

af@spice.cs.cmu.edu (Alessandro Forin) (12/17/90)

I was told on the IBM 6000 box there are instructions to flush (ranges of?)
the cache at the user level.
Seems to me this shows there is no inherent compelling architectural reason
why RISC architectures should not provide a good cache interface to
their users (both OS and applications) if they create the need for one.

A system call is not a "good" cache interface.  No matter how many standards
committee members you put on top of it.

sandro-
-----------------------------------------------------------------------------
 Alessandro Forin / School of Computer Science / Carnegie-Mellon University
 Schenley Park / Pittsburgh, PA 15213 / Ph: (412) 268-6861 / FAX 681-5739
 ARPA: af@cs.cmu.edu

hamilton@siberia.rtp.dg.com (Eric Hamilton) (12/17/90)

In article <11421@pt.cs.cmu.edu>, af@spice.cs.cmu.edu (Alessandro Forin) writes:
|> 
|> I was told on the IBM 6000 box there are instructions to flush (ranges of?)
|> the cache at the user level.
|> Seems to me this shows there is no inherent compelling architectural reason
|> why RISC architectures should not provide a good cache interface to
|> their users (both OS and applications) if they create the need for one.
|> 
|> A system call is not a "good" cache interface.  No matter how many standards
|> committee members you put on top of it.
|> 
Well, a system call *is* an instruction, albeit a very slow one, as far
as the user-visible program architecture is concerned.

The reason for requiring OS assistance/interference (pick one, according
to your prejudices) in user cache operations is multi-processor systems.
Direct hardware support for cache invalidate and writeback commands in
an MP system is possible, but constrains and complicates the hardware
design immensely.  Think about how hardware might implement cache
writeback/invalidate operations in an MP system....  When multiple processors
initiate cache operations at the same time....  When each processor has its
own data and instruction caches....  When the initiating processor cannot be
allowed to proceed until all other processors have completed their
writeback/invalidate....  Any reasonable implementation is going to
require some system software involvement.
The problem is somewhat similar to flushing the address translation cache after
changing a page mapping, and OS designers have been using barrier
synchronization for that purpose for years.

What you should expect (and demand - this is an active thread in comp.sys.m88k)
is that the system call that does the cache manipulation be very fast.  You
ought to be able to think of it as costing more like a trap to emulate
an unimplemented instruction than a full-blown system call.  Of course,
if a substantial amount of the instruction cache must be invalidated or we
have to wait for a substantial amount of the data cache to write back, then
the overhead for starting the cache operation will be the least of your
problems.

I'm not going to get involved in the "Resolved:  Standards are Harmful"
debate.  Suffice it to say that the user cache interface is neither more
nor less amenable/vulnerable to standardization than the rest of the user
visible architecture.  If you are in a standard-sensitive environment then
you have to deal with a standards committee; if you aren't, you don't.

dave@88opensi.88open.ORG (Dave Cline) (12/18/90)

In article <44118@mips.mips.COM>, mash@mips.COM (John Mashey) writes:
> 
> WHAT'S THE PROGRAMMATIC INTERFACE FOR CACHE-FLUSHING (and any other
> cache-manipulation operations) on your favorite machine?
> ARE THERE ANY STANDARDS FOR SUCH THINGS ACROSS VENDORS? (I can hope :-)
> 
> If nobody is working on standardizing the programmatic interface,
> we probably should be, as a service to the industry...

SVR4 defines interfaces for cache flushing.

In SVID3, see msync(KE_OS) or the foundation interface, memcntl(RT_OS).

These are required by the generic ABI.
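
[In application terms, after patching code at addr, that looks roughly
like the sketch below; msync() is the SVID/POSIX call, and the comment
shows the foundation interface underneath it, assuming the usual SVR4
six-argument memcntl() signature; exact flag details vary by release:]

    #include <sys/mman.h>
    #include <stddef.h>

    /* Write dirty data-cache lines back and invalidate stale cached
       copies over [addr, addr+len), so instruction fetch refills from
       the new code.  addr generally has to be page-aligned. */
    static int sync_new_code(void *addr, size_t len)
    {
        /* SVR4 foundation equivalent:
           memcntl(addr, len, MC_SYNC,
                   (caddr_t)(MS_SYNC | MS_INVALIDATE), 0, 0); */
        return msync(addr, len, MS_SYNC | MS_INVALIDATE);
    }
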

Dave Cline                            uucp: ...uunet!88opensi!dave
88open Consortium, Ltd.                     dave@88open.org
100 Homeland Court, Suite 800
San Jose, CA 95110

rouellet@crhc.uiuc.edu (Roland G. Ouellette) (12/18/90)

> The reason for requiring OS assistance/interference (pick one, according
> to your prejudices) in user cache operations is multi-processor systems.
> Direct hardware support for cache invalidate and writeback commands in
> an MP system is possible, but constrains and complicates the hardware
> design immensely.  Think about how hardware might implement cache
> writeback/invalidate operations in an MP system....

Actually it's not all that bad.  REI on VAX (besides doing a ton of
other stuff) flushes the icache.  A separate write-back instruction
cache makes hardly any sense at all (especially when you talk about
using the OS to flush things back to memory...  the instruction stream
seldom changes...  and the architecture can require the user,
compiler, whatever, to do something when it does).  The icache might
take invalidates from other processors, but require a cache flush
instruction upon dynamically generating code.  The hardware support is
about six transistors per cache line in the instruction cache which is
used to clear the valid bit on the line.  In an MP system, having a
large coherent shared backup will make the cache refill penalty
reasonably small.
--
= Roland G. Ouellette			ouellette@tarkin.enet.dec.com	=
= 1203 E. Florida Ave			rouellet@[dwarfs.]crhc.uiuc.edu	=
= Urbana, IL 61801	   "You rescued me; I didn't want to be saved." =
=							- Cyndi Lauper	=

hamilton@siberia.rtp.dg.com (Eric Hamilton) (12/18/90)

In article <ROUELLET.90Dec17155459@pinnacle.crhc.uiuc.edu>, rouellet@crhc.uiuc.edu (Roland G. Ouellette) writes:
|> > The reason for requiring OS assistance/interference (pick one, according
|> > to your prejudices) in user cache operations is multi-processor systems.
|> > Direct hardware support for cache invalidate and writeback commands in
|> > an MP system is possible, but constrains and complicates the hardware
|> > design immensely.  Think about how hardware might implement cache
|> > writeback/invalidate operations in an MP system....
|> 
|> Actually it's not all that bad.  REI on VAX (besides doing a ton of
|> other stuff) flushes the icache.  A separate write-back instruction
|> cache makes hardly any sense at all (especially when you talk about
|> using the OS to flush things back to memory...  the instruction stream
|> seldom changes...  and the architecture can require the user,
|> compiler, whatever, to do something when it does).  The icache might
|> take invalidates from other processors, but require a cache flush
|> instruction upon dynamically generating code.  The hardware support is
|> about six transistors per cache line in the instruction cache which is
|> used to clear the valid bit on the line.
|>
Some context was lost when this discussion outgrew comp.sys.m88k....

The question there was whether it makes sense to supply user-level
non-privileged instructions that will copyback (a range of) the data cache
and invalidate (a range of) the instruction cache.  These operations
are important to the folks doing incremental compilation, dynamic linking,
garbage collection, planting breakpoints/watchpoints, and the like, especially
in a multi-threaded and multi-processor environment.
When the code stream changes, it's necessary to cause all data caches in
an MP system to write back, and then to invalidate all instruction caches
(a Harvard organization in which the instruction caches don't snoop is a
reasonable implementation choice); only after the data caches have
completed their writebacks is it safe to allow any processor to start
refilling its instruction cache.  This is not easy to do in hardware, or
at least not so easy that it should be done there without software
involvement.
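
[The ordering constraint, with the actual cache operations stubbed out
-- the per-CPU function names below are made up -- is exactly the
two-barrier pattern used for TLB shootdown:]

    #include <pthread.h>
    #include <stdio.h>

    #define NCPUS 4
    static pthread_barrier_t bar;

    /* Stand-ins for the real per-processor cache operations. */
    static void dcache_writeback(int cpu)  { printf("cpu %d: D-cache writeback\n", cpu); }
    static void icache_invalidate(int cpu) { printf("cpu %d: I-cache invalidate\n", cpu); }

    /* Every processor runs this when one of them has modified code. */
    static void *code_change_handler(void *arg)
    {
        int cpu = (int)(long)arg;

        dcache_writeback(cpu);       /* push the new code out of the D-cache     */
        pthread_barrier_wait(&bar);  /* nobody refills until ALL writebacks done */
        icache_invalidate(cpu);      /* now it is safe to drop stale I-cache lines */
        pthread_barrier_wait(&bar);  /* and only now may execution continue      */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NCPUS];
        pthread_barrier_init(&bar, NULL, NCPUS);
        for (long i = 0; i < NCPUS; i++)
            pthread_create(&t[i], NULL, code_change_handler, (void *)i);
        for (int i = 0; i < NCPUS; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&bar);
        return 0;
    }
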

Life is somewhat easier on VAX and similar proprietary CISC architectures, because
there is much flexibility to move the implementation from hardware to microcode
to an OS trap handler while preserving the user-level illusion of direct hardware
support for the desired functionality.  But even there, I would expect
that many implementations would implement what appear to be user-level
cache control operations by trapping to kernel software or microcode, which is
not exactly direct hardware support.

Surely the VAX REI instruction doesn't flush all instruction caches in a
multi-processor?
|>
|> In an MP system, having a
|> large coherent shared backup will make the cache refill penalty
|> reasonably small.

Absolutely right.

dricejb@drilex.UUCP (Craig Jackson drilex1) (12/19/90)

In article <1990Dec15.143354.8493@ux1.cso.uiuc.edu> mcdonald@aries.scs.uiuc.edu (Doug McDonald) writes:
>In article <44118@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>WHAT'S THE PROGRAMMATIC INTERFACE FOR CACHE-FLUSHING (and any other
>>cache-manipulation operations) on your favorite machine?
>>ARE THERE ANY STANDARDS FOR SUCH THINGS ACROSS VENDORS? (I can hope :-)
>>
>>If nobody is working on standardizing the programmatic interface,
>>we probably should be, as a service to the industry...
>
>This is not a job for the OS (the standardization, that is). It
>is a job for the languages. In some, like Fortran 77, of course, you
>can't execute data. But in C you can. Almost. The "almost" is 
>unequivocally the worst failure of the ANSI C spec: it should have 
>included a routine (or macro) that makes a section of data into
>a code area and returns a code pointer to it.

Such a routine would have made ANSI C unimplementable (or expensive to
implement) on some machines.  Contrary to Mr. McDonald's beliefs, a
few architectures have been commercially successful over the years without
the ability for an arbitrary program to store instructions in memory
and execute them.  In particular, the architecture of the Burroughs B6700
and its descendants, which is still marketed and used as the Unisys
A-Series, does not have that ability.  This architecture incorporates
one of Per Brinch Hansen's ideas from Concurrent Pascal, several years
before he had it: put the protection in the compilers, not the hardware.
Therefore anything which acts as a compiler is system-security-sensitive.

Actually, such a system call *could* be implemented on the A-Series.  It
would simply have to syntactically vet the supposed 'code', make any necessary
translations/modifications for the target environment, and then bind it
to a function-object in the calling program.  Note that the area
of memory offered need not contain actual machine instructions for
the current machine; it could just as well contain C code, since we're
going to have to parse it in any case.  The set of OS code which performs
this function will be known as a 'compiler'.

(Such a capability has indeed been put in recent versions of the A-Series
operating system.  The language compiled is called 'Scode', and has not
yet been documented outside of Unisys.  I don't think it is callable
from A-Series C.)

This is somewhat far afield of the original cache-purging thread, and
I'm certain the semantics described above aren't what Mr. McDonald
had in mind.  However, if you think about it, they are actually
equivalent semantics to a 'give me a pointer-to-function from this
pointer-to-data' system call.  Which is what would be needed, in general,
in the C language, or in POSIX.  (A simple cache-synchronize isn't 
sufficient; what you have is a pointer-to-data, and only pointers-to-function
are executable.)
-- 
Craig Jackson
dricejb@drilex.dri.mgh.com
{bbn,axiom,redsox,atexnet,ka3ovk}!drilex!{dricej,dricejb}

dricejb@drilex.UUCP (Craig Jackson drilex1) (12/19/90)

To follow up on my own message, I really don't think any new language
features are required.  You simply make the extension of
allowing a data pointer to be cast to a function pointer.  The contents
of the area pointed to by the data pointer are implementation-dependent.
If the pointed-to data cannot be converted into a function of the
required type, (function *)NULL results.

Everything else can get hidden underneath the table, behind the curtains...
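
[On most current Unix C implementations the cast already compiles as
written, though the standard leaves it undefined; a sketch of the usage
under the proposed semantics, where the conversion itself may refuse and
yield a null function pointer:]

    typedef int (*expr_fn)(int);

    /* Under the proposed extension the implementation may inspect (or
       even translate) the bytes at buf and must yield a null function
       pointer if it cannot make them callable; in existing C the cast
       is just a reinterpretation and this check never fires. */
    int call_generated(void *buf, int x)
    {
        expr_fn f = (expr_fn)buf;      /* data pointer -> function pointer */
        if (f == (expr_fn)0)
            return -1;                 /* conversion refused: caller falls back */
        return f(x);
    }
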
-- 
Craig Jackson
dricejb@drilex.dri.mgh.com
{bbn,axiom,redsox,atexnet,ka3ovk}!drilex!{dricej,dricejb}