[comp.arch] Self-modifying code

yuval@taux01.UUCP (Gideon Yuval) (02/18/88)

Is good support for self-modifying code a real  issue?  all  CPUs  support  a
"modify  code  -- invalidate all caches -- execute" cycle, which is enough to
run  (say)  loaders &  debuggers;  i  THINK  this  is  enough  for  all  real
applications, but want to be sure.

Are any exceptions to this (AI or other) known?



-- 
Gideon Yuval, +972-52-522255 (work), -2-690992 (home), yuval@taux01.nsc.com

kers@otter.hple.hp.com (Christopher Dollin) (02/18/88)

"yuval@taux01.UUCP (Gideon Yuval)" says:

> Is good support for self-modifying code a real  issue?  all  CPUs  support  a
> "modify  code  -- invalidate all caches -- execute" cycle, which is enough to
> run  (say)  loaders &  debuggers;  i  THINK  this  is  enough  for  all  real
> applications, but want to be sure.

That depends. In the Poplog environment, for example, user code (Pop11, Prolog,
ML, Common Lisp) is compiled to native code by an incremental compiler. You
could call this "loading", but I suspect that many people will think of a
loader as being something that loads rather larger pieces of code than
individual procedures.

What's more, the Pop11 "partial application" feature, whereby a procedure is
constructed from a base procedure by supplying some of its arguments, also
involves the creation of new procedure records in the store. While the compiler
is not often invoked repeatedly in user code, partial application (and to a 
lesser extent procedure composition) is. This means that code cache flushing
might happen rather often.

Strictly, none of this is "self-modifying" code, since the procedures are
constructed as data-objects before being used. But the data cache has to be
flushed first. 

Of course, some people might say that a mixed-language incremental development
system with integrated editor wasn't a "real" application.....................




Regards,
Kers                                    | "Why Lisp if you can talk Poperly?"

webber@athos.rutgers.edu (Bob Webber) (02/19/88)

In article <486@taux01.UUCP>, yuval@taux01.UUCP (Gideon Yuval) writes:
> Is good support for self-modifying code a real  issue?  all  CPUs  support  a
> "modify  code  -- invalidate all caches -- execute" cycle, which is enough to
> run  (say)  loaders &  debuggers;  i  THINK  this  is  enough  for  all  real
> applications, but want to be sure.
> 
> Are any exceptions to this (AI or other) known?

On most unix systems, the classic example of ``self-modifying'' code is
dynamic loading where you want things like the lisp interpreter to
be able to use compiled code while it is interpreting.  Similarly,
some people have used dynamic-loading by hand to simulate some of the
benefits of shared libraries (i.e., minimizing the size of executables
that are not being executed).

Incremental compilation is another place where one could imagine
self-modifying code being used to good effect.

Some people generate code for floating point operations that modifies
itself depending on the availability of an fpa at execution time.

A number of people interested in object-oriented programming will
present algorithms.  When asked how they expect them to run as fast as
hand-coded C, often they start talking about dynamic code generation
for common cases.  One even hears people contemplating programs
that profile their own execution and optimize ``hotspots''.

------- BOB (webber@athos.rutgers.edu ; rutgers!athos.rutgers.edu!webber)

ok@quintus.UUCP (Richard A. O'Keefe) (02/19/88)

In article <486@taux01.UUCP>, yuval@taux01.UUCP (Gideon Yuval) writes:
> Is good support for self-modifying code a real  issue?  all  CPUs  support  a
> "modify  code  -- invalidate all caches -- execute" cycle, which is enough to
> run  (say)  loaders &  debuggers;  i  THINK  this  is  enough  for  all  real
> applications, but want to be sure.
> 
> Are any exceptions to this (AI or other) known?

Historically, "self-modification" has referred to modifying instructions
which will then be executed again.  For example, walking down an array
on an IBM 650 without index registers required changing address fields
in instructions.

What languages like Pop-2, Lisp, Prolog, SmallTalk, &c &c require is the
ability to 
(a) dynamically add *new* code;  if the operating system provided a
    "load this .obj file whereever you please and protect it as
    executable code, but tell me where you put everything" facility,
    that would be fine.
(b) reclaim old code which will never be used again;  if the operating
    system provided an "unload this chunk that you loaded for me before,
    and if you protect it so that attempts to execute it will trap, so
    much the better" facility, that would be fine.
Note that separate I&D doesn't mean you can't have dynamically loaded
code;  it just means that an ordinary user program can't do it, and
many operating systems vendors never think to provide it.

Some implementations of SmallTalk and Lisp do use a self-modifying scheme.
For example, one technique for handling
	Object msg
in SmallTalk is to have instructions like
	load	Object
	call	usual_handler_for_msg
		^^^^^^^^^^^^^^^^^^^^^
where the handler checks to see if it is the right one, and if it
isn't, does a full method lookup to find the right handler, and
then pokes the address of the right handler back in the call instruction.
A similar technique has been used in some Lisp systems ("snapping links").
Indirection can be used instead, at a price.
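
To make the "indirection" variant concrete, here is a rough C sketch; all of
the type and function names (object, handler, full_method_lookup, and so on)
are invented for illustration, and a real Smalltalk system would patch the
call instruction itself rather than go through a structure like this:

#include <stdio.h>

/* Everything below is invented for illustration; it is not a real system. */
typedef struct { int class; } object;
typedef void (*handler)(object *);

static void print_integer(object *o) { printf("an Integer\n"); }
static void print_string(object *o)  { printf("a String\n"); }

/* The slow, general method lookup. */
static handler full_method_lookup(int class)
{
    return class == 0 ? print_integer : print_string;
}

/* One of these per call site; it plays the role of the patched
   "call usual_handler_for_msg" instruction. */
struct call_site { int cached_class; handler cached_handler; };

static void send(struct call_site *site, object *receiver)
{
    if (site->cached_handler == 0 || receiver->class != site->cached_class) {
        site->cached_handler = full_method_lookup(receiver->class);  /* snap the link */
        site->cached_class   = receiver->class;
    }
    (*site->cached_handler)(receiver);
}

int main()
{
    static struct call_site site;   /* starts out empty */
    object i, s;

    i.class = 0;
    s.class = 1;
    send(&site, &i);   /* miss: full lookup, then cache the handler */
    send(&site, &i);   /* hit: goes straight to the cached handler  */
    send(&site, &s);   /* different class: re-lookup and re-cache   */
    return 0;
}

The price of the indirection is the extra load and compare on every send,
which is exactly what poking the handler address into the call instruction
avoids.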

aglew@ccvaxa.UUCP (02/20/88)

Branch target caching in SOAR, and some 68000 SMALLTALK implementations,
is self modifying.

Smalltalk methods are multiway branches based on possibly complicated
conditions. Instead of evaluating all the conditions, these implementations
branch to the most recently used method, and do a simple validity check,
defaulting to the more general case. The general case patches a branch
back into the code.

Of course, there are other ways to do it...


Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   
    aglew@gould.com     	- preferred, if you have nameserver
    aglew@gswd-vms.gould.com    - if you don't
    aglew@gswd-vms.arpa 	- if you use DoD hosttable
    aglew%mycroft@gswd-vms.arpa - domains are supposed to make things easier?
   
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

freeman@spar.SPAR.SLB.COM (Jay Freeman) (02/20/88)

In article <666@cresswell.quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:

>What languages like Pop-2, Lisp, Prolog, SmallTalk, &c &c require is the
>ability to 
>(a) dynamically add *new* code;  if the operating system provided a
>    "load this .obj file whereever you please and protect it as
>    executable code, but tell me where you put everything" facility,

Not quite:  Occasionally one wants to be able to compile directly to
memory without bothering to write an object file; the loader would not
be called in such a case.  The objective is to be able to write some
bits into memory and then call or jump to what you just wrote.

bzs@bu-cs.BU.EDU (Barry Shein) (02/20/88)

Actually I'm not sure exactly where the confusion between
self-modifying and dynamically loading code is coming in. I've seen at
least one lisp which used self-modifying code to some advantage,
typically to poke in debugger and other evalhooks when desired (into
the main eval loop) or simply remove them when not wanted (for speed's
sake).  It may only be a few tests saved, but consider how many times a
lisp program travels through the eval loop, like every iterative loop
for interpreted code.

	-Barry Shein, Boston University

dhesi@bsu-cs.UUCP (Rahul Dhesi) (02/21/88)

[re: how to overcome the code versus data distinction]

For those implementors who wish to allow data to be safely treated as
code, I suggest a system call that will accept a range of memory
addresses and a request to change it from code to data or from data to
code.  E.g.:

     #include <modech.h>
     if (modech(startadd, endadd, TO_CODE) == -1)
        error ("could not change data into code for execution");

This preserves the advantages if any of the code/data distinction
without being a pain to the implementor.  No doubt there would be some
alignment restrictions on the values of startadd and endadd.
-- 
Rahul Dhesi         UUCP:  <backbones>!{iuvax,pur-ee,uunet}!bsu-cs!dhesi

aglew@ccvaxa.UUCP (02/22/88)

>[re: how to overcome the code versus data distinction]
>
>To allow data to be safely treated as code...
>
>     #include <modech.h>
>     if (modech(startadd, endadd, TO_CODE) == -1)
>        error ("could not change data into code for execution");
>
>Rahul Dhesi         UUCP:  <backbones>!{iuvax,pur-ee,uunet}!bsu-cs!dhesi

That's along the right track, but won't necessarily work on strict Harvard
architectures. How about

     int (*f)();
     f = codealloc(size);
     codecpyin(f, data, size);
     codecpyout(f, data, size);
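
On a machine with a single, unified address space those three calls might
reduce to something like the sketch below; flush_icache() is a made-up
stand-in for whatever cache-control primitive the system actually provides,
and none of these names is an existing API:

#include <stdlib.h>
#include <string.h>

extern void flush_icache(void *addr, unsigned len);     /* assumed primitive */

void *codealloc(unsigned size)
{
    return malloc(size);                  /* code space == data space here */
}

void codecpyin(void *code, void *data, unsigned size)   /* data -> "code" space */
{
    memcpy(code, data, size);
    flush_icache(code, size);             /* make the new instructions visible */
}

void codecpyout(void *code, void *data, unsigned size)  /* "code" space -> data */
{
    memcpy(data, code, size);
}

On a strict Harvard machine the same three calls would have to trap into the
kernel and copy between the two spaces, which is the point of proposing them
as an interface rather than as raw pointer stores.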

I promised Karl Heuer that I'd come up with a standard proposal - one day,
maybe.

ok@quintus.UUCP (Richard A. O'Keefe) (02/22/88)

In article <2169@bsu-cs.UUCP>, dhesi@bsu-cs.UUCP (Rahul Dhesi) writes:
> [re: how to overcome the code versus data distinction]
> For those implementors who wish to allow data to be safely treated as
> code, I suggest a system call that will accept a range of memory
> addresses and a request to change it from code to data or from data to
> code.  E.g.:
> 
>      #include <modech.h>
>      if (modech(startadd, endadd, TO_CODE) == -1)
>         error ("could not change data into code for execution");
> 
It won't work in a separate I&D machine.  To start with, that range of
data addresses may *already* be in use in the code space, or it may not
be valid.  (Imagine a machine where code space is negative and data
space is positive.)  Another thing that can go wrong is that, as on the
B6700, code and data may be different *kinds* of address.  If you can
use the same addresses for code and data as Rahul Dhesi suggests, your
problem is just an unco-operative operating system.  That's why I
suggested "load this .obj file where you like, and tell me where you put
it", because that *does* make sense in genuinely separate I&D machines.
Of course it doesn't have to be a .obj file; the thing you hand to the
kernel might be a chunk of data memory to copy, or for that matter a
Postscript program to set individual bits (:-).  The important thing is
that it has to be a copy, because "separate I&D" means *separate*.

An interesting point of dynamic *extension* of programs is that (if the
code space allocator grain size is a multiple of the instruction cache
entry size) you don't need to flush the instruction cache, because valid
entries are still valid.

rogerh@arizona.edu (Roger Hayes) (02/25/88)

Someone (Rob Pike?) used self-modifying code to do bitblits.  The
general case bitblit code did a case analysis to construct the optimal
program for that particular blit, then jumped to the constructed
program.  Worked quite well, as I recall.

			Roger Hayes
			rogerh@arizona.edu

ohbuchi@unc.cs.unc.edu (Ryutarou Ohbuchi) (02/25/88)

rogerh@arizona.edu (Roger Hayes) writes:
>Someone (Rob Pike?) used self-modifying code to do bitblits.  The
>general case bitblit code did a case analysis to construct the optimal
>program for that particular blit, then jumped to the constructed
>program.  Worked quite well, as I recall.

It was for the Blit bitmap graphics terminal, originally driven by a 68000, and
later by a WE32000.  It is more like a very special compiler than
self-modifying code.  The BitBlt code does case analysis, generates a
small piece of 'code', and passes control to it (well, it is self-modifying
code, in some sense...).  The small code omits the various case analyses
(BitBlt is not so simple, if special cases are considered), and is faster
than the code with all the 'if' statements.
I believe the same technique is used for Microsoft Windows' BitBlt code.
They generate code in (I thought) the stack segment, and branch to it.  It was
fairly nasty code because of the 64K segments on 80X86s, if the frame buffer
size exceeds 64K.

==============================================================================
Any opinion expressed here is my own.
------------------------------------------------------------------------------
Ryutarou Ohbuchi	"Life's rock."   "Climb now, work later." and, now,
			"Life's snow."   "Ski now, work later."
ohbuchi@cs.unc.edu	<on csnet>
Department of Computer Science, University of North Carolina at Chapel Hill
==============================================================================

mash@mips.COM (John Mashey) (02/26/88)

In article <1415@thorin.cs.unc.edu> ohbuchi@unc.UUCP (Ryutarou Ohbuchi) writes:
>rogerh@arizona.edu (Roger Hayes) writes:
>>Someone (Rob Pike?) used self-modifying code to do bitblits....
John Reiser was the author of the code that saw widespread use in the Blits.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

tainter@ihlpg.ATT.COM (Tainter) (02/27/88)

In article <1415@thorin.cs.unc.edu>, ohbuchi@unc.cs.unc.edu (Ryutarou Ohbuchi) writes:
> rogerh@arizona.edu (Roger Hayes) writes:
> >Someone (Rob Pike?) used self-modifying code to do bitblits.  The
> >general case bitblit code did a case analysis to construct the optimal
> >program for that particular blit, then jumped to the constructed
> >program.  Worked quite well, as I recall.

I did precisely this.  It gave me a nice compact and fast routine.

The only alternatives I could come up with were:
--separate code for each operation jumped to through a jump table
	This was optimally fast but huge.
--calling subroutines through jump tables
	This was good on space but relatively slow since it amounted to
	retesting the opcode inside the loop
--a sort of hybrid of the above
	This still came out quite large.

My self modifying code used a template into which it copied instructions
for the various operations.  This gave a near optimal tight loop for each
of the operations in a nice compact form.
 
> Ryutarou Ohbuchi	"Life's rock."   "Climb now, work later." and, now,

--j.a.tainter

leonid@TAURUS.BITNET (03/03/88)

May I add to this discussion that on most UNIX systems with Virtual Memory
(e.g. BSD 4.2/3) there is a lot of self-modifying code.  Except for the
debugger case, the idea is implemented by copying some code onto the
stack, modifying it there, and executing it.  Some examples I can think of
are signal handling and the Floating-Point-hardware-dependent code on SUNs.
Since dynamic loading is not yet implemented on BSD systems, I don't
really know how they're going to do it.

In a word, VM UNIX systems tend to keep the text segment read-only
for the sake of text sharing, swap space saving, etc., with the few
exceptions where the stack segment is used to keep temporary copies
of executable code that must be modified.  Debugging is an exception.

Leonid

pardo@june.cs.washington.edu (David Keppel) (07/11/88)

[ What about that self-modifying code ]
nather@astro.as.utexas.edu (Ed Nather) writes:
> "horrid"
[ hard to write, understand, maintain ]
>
>But it's really needed -- if you have to watch an animated display
>of data on a CRT all night long that blinks and flickers, you learn
>very quickly that techniques to get display refresh faster than the
>human flicker-frequency can save lots of pounding headaches.  If
>they are "forbidden" techniques that require a descent into
>assembly coding, so be it.  Headaches hurt.

Some reasons NOT to write S-M (Self Modifying or Sado-Masochistic) code:
+ It's machine dependent (we already knew that).
+ Read-only (e.g., code) pages may be shared.
+ Some machines have separate I and D spaces.  It's not inconceivable
  that different machines in the same family will have different
  memory models.
+ Many machines use separate instruction and data caches.  Instruction
  caches can be made much simpler (= smaller = faster) by not
  requiring them to pay attention to the data bus.  Writing to a
  memory location is not guaranteed to update the I cache, thus you
  may or may not execute the modified instruction.  Next week's model
  may have a large enough cache that this week's code breaks.

Good enough?

	;-D on  ( Compiler hints won't save you here )  Pardo

nather@ut-sally.UUCP (Ed Nather) (07/11/88)

In article <5254@june.cs.washington.edu>, pardo@june.cs.washington.edu (David Keppel) writes:
> [ What about that self-modifying code ]
> nather@astro.as.utexas.edu (Ed Nather) writes:
> > "horrid"
> 

Tsk tsk -- out of context, like the movie ad that quotes a reviewer who said:
"This movie is a great waste of time -- it's insipid, boring, and stupid.  I've
never seen anything like it."

The ad quoted:   "...great...I've never seen anything like it."

I said the practice "...was formally declared horrid ...", NOT that *I* thought 
it was.  I don't think that.

> Some reasons NOT to write S-M (Self Modifying or Sado-Masochistic) code:
> + It's machine dependent (we already knew that).

Assembly code does tend to be machine dependent, all right -- but C was 
designed to give the programmer the same facility, but with better control and
better discipline.  It works quite well for those things it covers.  It is, in
some reasonable sense, portable.  A good compiler could generate self-modifying
code to suit, I'm sure.  What we lack is the formal discipline for using it
wisely and well.

> + Read-only (e.g., code) pages may be shared.

Only in a time-sharing system.  Most real-time control programs can't tolerate
such an environment for lots of reasons, mostly due to critical timing 
requirements.  Anyway, if generated code is treated as "executable data" this 
whole point becomes irrelevant.

> + Some machines have separate I and D spaces.  It's not inconceivable
>   that different machines in the same family will have different
>   memory models.

Not a problem.  Generated executable code changes, just like any other variable
values, and can be treated the same way.

> + Many machines use separate instruction and data caches.  Instruction
>   caches can be made much simpler (= smaller = faster) by not
>   requiring them to pay attention to the data bus.  Writing to a
>   memory location is not guaranteed to update the I cache, thus you
>   may or may not execute the modified instruction.  Next week's model
>   may have a large enough cache that this week's code breaks.
> 

Again, not a problem.  You don't take the generated code out of the I cache,
but out of the D cache, since it can (by definition) change.

> Good enough?

Sorry, no.  I've heard LOTS of arguments against programs that generate their
own code, and all of them -- so far as I am aware -- assume the "proper" answer
in generating arguments to show the answer is proper.  Be honest: suppose you
*had* to do it this way, but needed a formal discipline to keep it clean and
understandable, and *fast.*  Couldn't you find a reasonable way?

"It's unclean because ... well, because it's so ... so ... UNCLEAN!"

-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin
{backbones}!{noao,ut-sally}!utastro!nather
nather@astro.as.utexas.edu

lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) (07/12/88)

In article <12357@ut-sally.UUCP> nather@ut-sally.UUCP (Ed Nather) writes:
>In article <5254@june.cs.washington.edu>, pardo@june.cs.washington.edu (David Keppel) writes:
>> [ What about that self-modifying code ]
....
>> Some reasons NOT to write S-M (Self Modifying or Sado-Masochistic) code:
....
>> + Read-only (e.g., code) pages may be shared.
>
>Only in a time-sharing system.  Most real-time control programs can't tolerate
>such an environment for lots of reasons, mostly due to critical timing 
>requirements.  Anyway, if generated code is treated as "executable data" this 
>whole point becomes irrelevant.

There are time-sharing systems that can handle critical tasks (e.g.
real-time applications with strict timing requirements) well. These systems
make it possible to give tasks priorities. A critical task can be given
a high priority to enable it to meet its time limits. Such a system may very
well have code that is shared.

Another related issue is that self-modifying code cannot be executed
directly from ROM. Executing programs in ROM is an important memory-saving
technique in small, diskless, special-purpose systems.

>> Good enough?
>
>Sorry, no.  I've heard LOTS of arguments against programs that generate their
>own code, and all of them -- so far as I am aware -- assume the "proper"
>answer in generating arguments to show the answer is proper.  Be honest:
>suppose you *had* to do it this way, but needed a formal discipline to keep
>it clean and understandable, and *fast.*  Couldn't you find a reasonable way?
>
>"It's unclean because ... well, because it's so ... so ... UNCLEAN!"

I guess there are situations with extreme performance requirements where
self-modifying code is justified. It would be interesting to see what a
semantics for self-modifying programs would look like. Regarding the
"uncleanliness" I think there are two main sources for this. The first is
the difficulty for a human mind to trace what's going on in a self-modifying
program. The second is an advantage read-only data (e.g. non-self-modifying
code) has over writable data: read-only data can be accessed asynchronously,
in any order, even concurrently, without any constraints. Therefore
read-only is "clean" in a concurrent environment. An example apart from code
is data bases, where read-only data don't have to be locked during
transactions.

Bjorn Lisper

pardo@june.cs.washington.edu (David Keppel) (07/12/88)

nather@ut-sally.UUCP (Ed Nather) writes:
>pardo@june.cs.washington.edu (David Keppel) writes:
>>[ Out of context quote ]
>[ Funny retort. ]
Sorry.  I do like your rejoinder, though.   Could be about
self-modifying code, even   :-)

> Sorry, no.  I've heard LOTS of arguments against programs that
> generate their own code, and all of them -- so far as I am aware
> -- assume the "proper" answer in generating arguments to show the
> answer is proper.  Be honest: suppose you *had* to do it this way,
> but needed a formal discipline to keep it clean and understandable,
> and *fast.*  Couldn't you find a reasonable way?

Let me rephrase my position.  There is nothing architecturally weird
about programs that generate their own code.  Doing so does not cause
any problems on any machines that I am aware of, although few
OPERATING SYSTEMS support this.

There is a *major* problem with modifying existing code.  Consider a
machine with separate I-cache and D-cache.  If you execute an
instruction at location 0x500, then one at 0x504 that branches back
to the one at 0x500, and the one at 0x500 modifies itself, either:

  + The cache can be built so that it pays attention to the changing
    data (code) in the I-cache.  This requires a "snoopy" cache, which
    is more complicated, less dense, and slower.  If you are
    constrained by price or die space, you may be forced to use a
    slower, smaller (worse hit-ratio) cache.
  + You can execute out of the D-cache.  This is effectively the
    same as having a snoopy I-cache.  If you have BOTH an I-cache and
    a D-cache, you're left with coherency problems that are solved by
    building a snoopy I-cache.
  + You can define this operation to be illegal, build a smaller,
    faster, non-snoopy cache, and pay the price when you need to run
    code that could be expressed "naturally" as self-modifying code.

Assume that you have an algorithm that is, say, 1000 instructions and
that each instruction takes 4 cycles to execute (must be a RISC :-).
One particular part of it, 10 instructions, is written as
self-modifying code.  It can also be written as 100 instructions of
non-self-modifying (static) code.  Further assume that the effect of
having a non-snoopy cache is to make the computer run 10% faster
(e.g., the cache is faster AND has a better hit rate).  Then the
running time of the program on the machine with the snoopy I-cache
is:

    T = 990 * 4.0 + 10 * 4.0 = 4000 cycles

On the machine without a snoopy I-cache (10% faster),

    T = 990 * 3.6 + 100 * 3.6 = 3924 cycles

which is marginally FASTER for NOT using self-modifying code,
even though it takes an additional 90 instructions.


	;-D on  ( Cycles to burn reading flames )  Pardo

mash@mips.COM (John Mashey) (07/12/88)

In article <12357@ut-sally.UUCP> nather@ut-sally.UUCP (Ed Nather) writes:
....
>> + Many machines use separate instruction and data caches.  Instruction
>>   caches can be made much simpler (= smaller = faster) by not
>>   requiring them to pay attention to the data bus.  Writing to a
>>   memory location is not guaranteed to update the I cache, thus you
>>   may or may not execute the modified instruction.  Next week's model
>>   may have a large enough cache that this week's code breaks.
>> 
>
>Again, not a problem.  You don't take the generated code out of the I cache,
>but out of the D cache, since it can (by definition) change.

No, machines that do this don't work that way, i.e., you only execute
code from the I-cache or main memory, not from the D-cache (or the cache
currently acting as the D-cache, in cases where you can swap them).

There is nothing wrong with generating code on the fly.
What is wrong is assuming a model of machines that looks just like
a CPU + DRAM, or with caches that act only like coherent, joint I&D caches.

This is just one instance of a general problem: how do you handle
caches that are visible to the executable code in any way?
What's needed is a good set of cache-control primitives that:
	a) Cover all of the cases
	b) Become widely used enough for people to count on.
	c) Are easy to make nops for environments where they don't matter.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

chasm@killer.UUCP (Charles Marslett) (07/12/88)

In article <33441@yale-celray.yale.UUCP>, lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) writes:
> Another related issue is that self-modifying code cannot be executed
> directly from ROM. Executing programs in ROM is an important memory-saving
> technique in small, diskless, special-purpose systems.

Actually, I always thought that ROM was much more important as a way of
eliminating the need to load the code and as a way of guaranteeing non-
self-modifying-ness.  Usually ROM code is larger (thus memory-consuming)
than RAM-loaded code.  This is because most RAM based systems assume some
form of selfmodification of code, ranging from relocation tables that avoid
the need for self-relative addressing everywhere to file initialization code
that gets overwritten by its own buffer.  On the other hand, some RAM based
systems require 4 Mb to run an editor, so you may be right in those cases
:->} !!

Seriously, newer architectures have reduced the difference (to 0 perhaps),
but the emphasis on RISC these days may resurrect self-modifying code --
a RISC-er technique is not known to mankind!  (:-D}===<<).

> Bjorn Lisper


Charles Marslett

hjm@cernvax.UUCP (hjm) (07/12/88)

If the program is considered to be data for an interpreter which I will for
convenience call a CPU, then does self-modifying code seem any different?  As
long as this data remains consistent, there should be no
difficulty in dealing with it mathematically.  All of the problems of database
consistency have been tackled (but not all solved) some time ago.  Code is data
according to a PATCH program, and the only tricky bits there are to ensure that
any changes fit inside the space that they replace and to patch up the jumps
where necessary.  A bit like a structure linked by pointers, such as a binary
tree, except that code is in general a directed cyclic graph (the directedness
is due to the program counter).

Now, if the routines that ensure the consistency of the code change themselves,
then the world will most likely fall down around one's head, but then things get
to be so recursive that my stack overflows ...

	Hubert Matthews

nather@ut-sally.UUCP (07/12/88)

In article <33441@yale-celray.yale.UUCP>, lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) writes:
> 
> I guess there are situations with extreme performance requirements where
> self-modifying code is justified. It would be interesting to see what a
> semantics for self-modifying programs would look like. 

I agree.  That was the whole point of my original posting -- to see if someone
with a talent for languages could ignore the current prejudice against this
practice, and see what might be done.  If there's really no way to do it in a
reasonable manner, fine -- but just *saying* it's awful doesn't *make* it awful.

> The first [source of uncleanliness] is
> the difficulty for a human mind to trace what's going on in a self-modifying
> program. 

I think this arises because of the two different intellectual "levels" involved:
what the program is doing which generates the instructions (by analogy with the
instruction generator in a compiler) and what those instructions are going to
do when they are executed.  These *must* be kept completely separate or massive
confusion results.  But if the written code is approached with full awareness
of this requirement, the confusion vanishes -- in my experience, anyway.

> The second is an advantage read-only data (e.g. non-self-modifying
> code) has over writable data: read-only data can be accessed asynchronously,
> in any order, even concurrently, without any constraints. Therefore
> read-only is "clean" in a concurrent environment. 

I believe self-modifying code was condemned long before concurrent environments
were even possible.  This may be a disadvantage, but *no* technique exists
without trade-offs.  I think we can live with this one.

> An example apart from code
> is data bases, where read-only data don't have to be locked during
> transactions.

OK, but I'm not sure this is relevant here.  I do, however, appreciate your
keeping "data" plural.  It happens so rarely nowdays ...


-- 
Ed Nather
Astronomy Dept, U of Texas @ Austin
{backbones}!{noao,ut-sally}!utastro!nather
nather@astro.as.utexas.edu

baum@Apple.COM (Allen J. Baum) (07/13/88)

[]
>In article <5262@june.cs.washington.edu> pardo@uw-june.UUCP (David Keppel) writes:
>
>There is a *major* problem with modifying existing code.  Consider a
>machine with seperate I-cache and D-cache.  If you execute an
>instruction at location 0x500, then one at 0x504 that branches back
>to the one at 0x500, and the one at 0x500 modifies itself, either:
>
>  + The cache can be built so that it pays attention to the changing
>    data (code) in the I-cache....
>  + You can execute out of the D-cache....
>  + You can define this operation to be illegal....

or you can recognize the situation and issue a write-back-Dcache-line
instruction, and a flush-Icache-line instruction, and everything works.
Note that a compile-and-go situation is exactly this; you are building
a program in the Dcache, and then you want to execute it. While it may
be unlikely that any remnants are left over after linking et al., this
may be the only way to guarantee it.
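
A sketch of that sequence, where dcache_writeback_line(), icache_flush_line(),
and CACHE_LINE are all made up to stand for whatever line-granularity
primitives and line size the machine really has; the point is only the order
of operations:

#define CACHE_LINE 16   /* assumed line size, in bytes */

extern void dcache_writeback_line(void *addr);   /* hypothetical primitives */
extern void icache_flush_line(void *addr);

/* After generating 'len' bytes of code at 'code', make it safe to execute. */
void make_executable(char *code, unsigned len)
{
    char *p;

    for (p = code; p < code + len; p += CACHE_LINE) {
        dcache_writeback_line(p);   /* push the generated bytes out to memory     */
        icache_flush_line(p);       /* throw away any stale instructions for them */
    }
    /* only now is it safe to do:  (*(void (*)()) code)();  */
}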

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) (07/13/88)

In article <4776@killer.UUCP> chasm@killer.UUCP (Charles Marslett) writes:
>In article <33441@yale-celray.yale.UUCP>, I write:
>> Another related issue is that self-modifying code cannot be executed
>> directly from ROM. Executing programs in ROM is an important memory-saving
>> technique in small, diskless, special-purpose systems.
>
>Actually, I always thought that ROM was much more important as a way of
>eliminating the need to load the code and as a way of guaranteeing non-
>self-modifying-ness.  Usually ROM code is larger (thus memory-consuming)
>than RAM-loaded code.  This is because most RAM based systems assume some
>form of selfmodification of code, ranging from relocation tables that avoid
>the need for self-relative addressing everywhere to file initialization code
>that gets overwritten by its own buffer....

As I wrote, I was actually thinking primarily of diskless systems. In such a
system the program is stored in ROM. If the program is self-modifying, its
code has to be loaded into RAM before it can execute. If not, it can be
executed directly in ROM and only the data areas need to be in RAM. Thus memory
is saved (we don't need the extra RAM to hold the code).

Bjorn Lisper

gillies@p.cs.uiuc.edu (07/13/88)

I think many rules have an exception (e.g. "gotos considered harmful").
There is a difference between self-modifying code and non-reentrant
code.  Today, people are finding new uses for self-modifying code that
were unforeseen 20 years ago:

1.  Simple programs that need OPTIMAL speed, and would be outrageously
expensive on ANY machine.  The major example is BITBLT, a subroutine
with about 10-15 parameters that does a massive amount of linear-time
work.  If the BITBLT subroutine generates the machine code expressly
for the particular arguments, you can save a factor of 3 or more in
speed.  This is very important for interactive graphics.

2.  Peter Deutsch and PARCPLACE systems used a new technique to speed
up Smalltalk on conventional computers.  When a subroutine is
executed, the types are checked and if possible, the subroutine is
compiled for speed & executed.  Then this compiled code is cached for
possible use later, when the arguments have the same types.  This also
gave a very large constant speedup to the smalltalks marketed by
ParcPlace systems.

I think papers on both these techniques appeared in the first OOPSLA
conference.

In light of these new techniques, I believe it's time to reexamine
our opinions about self-modifying code.  

Don Gillies, Dept. of Computer Science, University of Illinois
1304 W. Springfield, Urbana, Ill 61801      
ARPA: gillies@cs.uiuc.edu   UUCP: {uunet,ihnp4,harvard}!uiucdcs!gillies

hjm@cernvax.UUCP (hjm) (07/13/88)

I have been mulling over the idea of self-modifying code (SMC) for a while and
I've come to the conclusion that there is no good definition of SMC.

For example, if treating code as data is the definition, then does passing a
procedure as a parameter in PASCAL, or a pointer to a function in C count?
Probably not.

OK, what about a jump table.  Consider an array of pointers to functions in C.
Does changing the pointers count as SMC?  Again, I don't think so.

So, changing a pointer by assigning to it is not SMC, but putting a new jump
instruction in (e.g. jmp #somewhere_else) in place of an existing instruction
is SMC.  Does one level of indirection really make that much difference?

Of course, if you want to be really horrid in C, you can try something like
this:

char codearray[] = { 0x03, /* one byte instruction for something */
		     0xc9  /* one byte return instruction */
		   };

and then 'call' this function using a cast to turn the pointer codearray into
a pointer to a function.  (Who needs #asm anyway!)  Then you can modify the
code as much as you want.  This _is_ SMC without a doubt, because you can
overwrite code.  So, I propose a very weak definition for SMC as code that
writes over other code.
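
The 'call' itself might look something like this; it is entirely
machine-dependent and only works if the bytes above really are valid
instructions for the CPU at hand and if data space happens to be executable:

int main()
{
    void (*f)();

    f = (void (*)()) codearray;   /* turn the data pointer into a function pointer */
    (*f)();                       /* jump straight into the array                  */
    return 0;
}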

As a final note, why is it 'clean' to alter a jump table and 'unclean' to alter
an inline constant (e.g. jmp @offset(r0) uses a value in memory as the address,
but mov (pc)+,#1234, which loads an immediate, does so too)?  Why the subtle
difference?  Any thoughts on the subject?

	Hubert Matthews

(I don't consider LISP or PROLOG programs that create code on the fly to be
SMC.  Does anyone disagree?)

aglew@urbsdc.Urbana.Gould.COM (07/13/88)

>Today, people are finding new uses for self-modifying code that
>were unforeseen 20 years ago:
>
>1. ... BITBLT ... 
>
>2.  Peter Deutsch and PARCPLACE systems used a new technique to speed
>up Smalltalk on conventional computers...
>
>In light of these new techniques, I believe it's time to reexamine
>our opinions about self-modifying code.  
>
>Don Gillies, Dept. of Computer Science, University of Illinois
>1304 W. Springfield, Urbana, Ill 61801      
>ARPA: gillies@cs.uiuc.edu   UUCP: {uunet,ihnp4,harvard}!uiucdcs!gillies

Two useful techniques. Both of them can be cast in the form of
"compile an entire procedure/function" (although I have seen BITBLTs
that compile code fragments and branch to them, I think those are 
better understood as BITBLTs with streamlined function calling
sequences).

One of my favourite languages was POP-2, where the compiler was
a subroutine that could be used to compile code, returning what
was essentially a pointer to a function.

May I suggest that, perhaps, this is the moral: from the point of view
of programming principles, self-modifying code where the granule is
the procedure is not bad. Smaller granules are dangerous/hard to use.


(I've been running a pretty good streak of stupid mistakes in my
posts recently. Tell me all about them)

Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   
    aglew@gould.com     	- preferred, if you have MX records
    aglew@xenurus.gould.com     - if you don't
    ...!ihnp4!uiucuxc!ccvaxa!aglew  - paths may still be the only way
   
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

hankd@pur-ee.UUCP (Hank Dietz) (07/14/88)

In article <12360@ut-sally.UUCP>, nather@ut-sally.UUCP (Ed Nather) writes:
> In article <33441@yale-celray.yale.UUCP>, lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) writes:
> > 
> > I guess there are situations with extreme performance requirements where
> > self-modifying code is justified. It would be interesting to see what a
> > semantics for self-modifying programs would look like. 
> 
> I agree.  That was the whole point of my original posting -- to see if someone
> with a talent for languages could ignore the current prejudice against this
> practice, and see what might be done.  If there's really no way to do it in a
> reasonable manner, fine -- but just *saying* it's awful doesn't *make* it awful.

The key problem with self-modifying code is that code created at runtime may
REPLACE code that appeared at "compile" time, hence, in general, no STATIC
understanding of the code is possible.  This can make it hard for humans to
understand the code, but more importantly it implies that compilers cannot
perform optimizations, e.g., you cannot perform register allocation / value
propagation / common subexpression elimination / in-line expansion because
the code you remove might be code which is modified at runtime.

The fix need not be the elimination of the generate-and-execute programming
concept, rather, the language merely needs to specify constraints on the code
which is to be generated at runtime.  For example, key constraints would be
that the runtime generated code could not redefine any existing function,
could only be invoked through a statically-specified interface, and could only
access data which either was statically-declared as accessible through that
interface or was created as new data local to the new code.  In other words,
we can permit code to add to itself, but not to redefine what existed at
compile time:  self-extending code.

An outline of fixes to be applied to Lisp (to create "Refined Lisp") appears
in my PhD thesis:  H. Dietz, "The Refined-Language Approach to Compiling for
Parallel Supercomputers," June 1987.  The folks working on Scheme had similar
ideas....

						-hankd

radford@calgary.UUCP (Radford Neal) (07/14/88)

There are interesting cases where on-the-fly generation of code seems
to be essential to get good asymptotic space and/or time complexity.

Consider the following code fragment (version A):

     for (i = 0; i<L; i++)
     { if (c1) p1();
       if (c2) p2();
       ...
       if (cN) pN();
     }

The Boolean variables c1, c2, ... cN are assumed to be loop invariants.
Assuming the pi all take the same constant time and code space, the
time complexity of this code is O(N), and its space complexity is also
O(N). (I'm ignoring the factor of L for number of loop iterations.)

Now consider the following equivalent code, specialized for N=2, but
clearly generalizable to any N (version B):

     if (c1)
     { if (c2) for (i = 0; i<L; i++) { p1(); p2(); }
       else    for (i = 0; i<L; i++) { p1(); }
     }
     else
     { if (c2) for (i = 0; i<L; i++) { p2(); }
       else    /* Nothing */ ;
     }

Code like this will have a time complexity of O(M), where M is the
number of ci that are actually true, and a space complexity of O(2^N).

Let's say that M, the number of true ci, is typically logN. This gives
the following comparison for versions A and B:

     version  space   time

        A     O(N)    O(N)
        B     O(2^N)  O(logN)

Thus you get to choose between time that is exponentially worse than
it might be, or space that is exponentially worse than it might be.

The following lets you get both (version C):

     start generating code;
     if (c1) generate instruction to call p1;
     if (c2) generate instruction to call p2;
     ...
     if (cN) generate instruction to call pN;
     for (i = 0; i<L; i++) execute generated instructions;

This gives you the low time complexity of version B, with the low
space complexity of version A. It's machine dependent, however.

It _is_ possible to get the same form of time complexity without
machine dependent operations, as follows (version D):

     void (*pa[N+1])();
     j = 0;
     if (c1) pa[j++] = p1;
     if (c2) pa[j++] = p2;
     ...
     if (cN) pa[j++] = pN;
     pa[j] = 0;
     for (i = 0; i<L; i++)
     { for (j = 0; pa[j]; j++) (*pa[j])();
     }

This essentially works by generating code in a very specialized
language, and then interpreting that code. This gives the same
form of time complexity as version B, but with a larger constant
factor - perhaps considerably larger if the pi aren't really procedures,
but just short code sequences.

This is the general sort of reason for the bit-blit algorithms that
generate code on the fly, and I've encountered the same sort of 
situation in other contexts (e.g. a "life" program). One could 
imagine a compiler performing the transformation from version A
to the machine-dependent version C, but I've never heard of such.
Even if it did, it is a bit disturbing that a transformation of
such significance for space/time complexity isn't expressed at the 
source level. This would be solved by having the compiler transform
version D to the faster version C, but that's probably even harder
for the compiler-writer to accomplish than the A to C transformation.

    Radford Neal

peter@ficc.UUCP (Peter da Silva) (07/14/88)

I have a question:

	Why are an Icache plus a Dcache better than just
	a big shared cache as big as both?
-- 
Peter da Silva  `-_-'  Ferranti International Controls Corporation.
"Have you hugged  U  your wolf today?" (uunet,tness1)!sugar!ficc!peter.

peter@ficc.UUCP (Peter da Silva) (07/14/88)

In article <76700032@p.cs.uiuc.edu>, gillies@p.cs.uiuc.edu writes:
> 1.  Simple programs that need OPTIMAL speed, and would be outrageously
> expensive on ANY machine.  The major example is BITBLT, a subroutine
> with about 10-15 parameters that does a massive amount of linear-time
> work.  If the BITBLT subroutine generates the machine code expressly
> for the particular arguments, you can save a factor of 3 or more in
> speed.  This is very important for interactive graphics.

You can save an even greater factor by building the BitBlt into the
hardware. This is what the Commodore Amiga does, using a chip called
the Blitter. I'm pretty surprised that a chipset this good that's cheap
enough to be put in a consumer machine hasn't been picked up (or
emulated) by the big boys. It makes a significant difference to the
speed of the machine.

And you could cut that ugly self modifying code out of BitBlt. Instead
you just:
	Wait for the blitter to be free (lock it).
	Set up its parameters (it can take 3 source bitmaps and one
		destination bitmap, with any minterm)
	Blit (at DMA speeds, up to one word per clock).

You can even return to the caller before the blit's finished, if the
locking is in hardware or if the blitter can generate an interrupt to
clear the semaphore.
-- 
Peter da Silva  `-_-'  Ferranti International Controls Corporation.
"Have you hugged  U  your wolf today?" (uunet,tness1)!sugar!ficc!peter.

gregr@edge.UUCP (Greg Rose) (07/15/88)

The discussion of self modifying code, as it has progressed in this newsgroup,
will never reach an end. This is simply because the opponents and proponents
are not talking about the same things at all.

"Self modifying code" such as was used on early computers, for example
early CDC 6600 machines (not flaming them, just the example I know)
was a NECESSARY technique for performing subroutine linkage. There was no
instruction to perform a jump to an arbitrary address (i.e. the return
address of the subroutine) so the subroutine entry code *had* to overwrite
a jump instruction's operand (at a known, per-subroutine address) before
calling. This was usually the instruction immediately at or before the start
of the subroutine. In pseudo code:

	/* caller */			/* callee */
	ld	r0,#ret_addr		subr:	jmp	anywhere /*TBA*/
	st	r0,subr+1			...
	jmp	subr+2				...
ret_addr:					jmp 	subr /* return */
	...

(Assume that it is word addressed and the jmp instruction takes one
word, as does its operand.)

THIS is self modifying code at its worst, gaaakk phut, urk, where is the
bathroom. Ever wondered why the early Fortran standards didn't even
*allow* recursion?

Even earlier machines used similar techniques for array indexing and other
irrelevant stuff.

"Run time generated code" is a whole different kettle of fish, and
seems to be what all of the proponents in the discussion have been referring
to. This is where issues such as separate I/D spaces/caches, operating
system support, and so on, come in.

BITBLT is, as noted, the classic example where run time generated code
can do the job much faster than any parameterised loops.

But load-and-go compilers are also a sensible thing in some environments.
The only difference between generating code in your own address space
and jumping to it, or generating it in a file and 'exec'ing it,
is that the operating system knows how to do the latter correctly.
Nobody really objects to optimising compilers, now do they? So
what is the problem with optimising code generators for some other class
of problem?

I agree with John Mashey that the most needed thing at the moment
is appropriate operating system or hardware support to address the
caching and accessibility issues.

The second most needed thing is to stop calling "run time generated code"
"self modifying code" so people will think before barfing.

Greg Rose, Softway Pty Ltd, Australia
...uunet!munnari!greg@softway.oz
"Assume your favourite disclaimer since I don't think I need one"

jackson@esosun.UUCP (Jerry Jackson) (07/15/88)

In article <752@cernvax.UUCP> hjm@cernvax.UUCP (hjm) writes:

>>   As a final note, why is it 'clean' to alter a jump table and 'unclean' to alter
>>   an inline constant (e.g. jmp @offset(r0) uses a value in memory as the address
>>   but mov (pc)+,#1234 which loads an immediate does so too)?  Why the subtle
>>   difference?  Any thoughts on the subject?

	   Hubert Matthews

   (I don't consider LISP or PROLOG programs that create code on the fly to be
   SMC.  Does anyone disagree?)


The main difference that I see is that the code in one case is not
reentrant.  In systems like unix where it is possible for more than
one process to share a code segment at runtime, the jump table is
local to each process while the code is not.  Imagine if vi had true
self-modifying code in it...  (Many users typically share a code
segment for programs like vi and csh) The example in which you create
an array and then call it as a function does not really fit this
definition of self-modifying, since the code that is changed is
created in data space at runtime and hence cannot interfere with other
programs using the same code segment.

*Real* SMC would be something like:

main()
{
	char func();
	long *longptr;

	longptr = (long *)func;
	*(longptr + 20) = 0xffffffff;  /* shudder!!!!! */
}

+-----------------------------------------------------------------------------+
|   Jerry Jackson                       UUCP:  seismo!esosun!jackson          |
|   Geophysics Division, MS/22          ARPA:  esosun!jackson@seismo.css.gov  |
|   SAIC                                SOUND: (619)458-4924                  |
|   10210 Campus Point Drive                                                  |
|   San Diego, CA  92121                                                      |
+-----------------------------------------------------------------------------+

cjohnson@unisoft.UUCP (Christopher Lyle Johnson) (07/15/88)

As has been already pointed out, one of the problems with self
modifying code stems from separate instruction and data caches.
Even on Harvard architectures, if it is possible to mark an area
of code with a cache inhibit bit, SM code doesn't cause too many
problems.  In kernel mode or on systems where user processes can
access the mmu, SM code may make sense in certain situations.  
I have no major aesthetic complaint if the performance win is
large enough.

Of course, if the code isn't cached, perhaps the win is less
significant.  If we want to toggle the don't cache bit (i.e. set
while modifying the code, clear after the code is written), it
may be difficult to claim a performance win for non-supervisor
processes.  If it takes two system calls per SM operation, the
overhead is pretty high.

(If it is faster to flush the caches, that may be an option,
but there may be very little control over the granularity.  How
would you implement a 'flush_address(base, bounds)' system call
on your favorite mmu/cache system?)

However..., if a user process can't get to the mmu the operating
system has a bit of a problem.  I guess the trick is to write
protect program text and execute protect the data.   Of course,
this causes an execute protect fault on the modified code page(s).
The kernel could then flush the caches and write protect the
page.  Again, this is a lot of overhead.

(Yes, I know some MMUs don't have an execute protect mode.  Workarounds
may exist, e.g. the function code mode on the 030.  Ugh.)

All in all, a person had better have a good reason for writing
self modifying code.  In some cases the performance win from
the code may be lost in overhead.  There are many other facets
to this problem, too.

					Cheers,
					    cj*

=========================================================
This message does not reflect the views of
UniSoft Corp but this sentence does.

baum@Apple.COM (Allen J. Baum) (07/15/88)

[]
>In article <28200177@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
>
>>Today, people are finding new uses for self-modifying code that
>>were unforeseen 20 years ago:
>>
>>1. ... BITBLT ... 
>>
>>2.  Peter Deutsch and PARCPLACE systems used a new technique to speed
>>up Smalltalk on conventional computers...
>>

This was supposed to be a reply to Doug Pardo, but the Brain-Damaged-Mailer
bounced the reply, so I'm posting...

What about the case mentioned by someone for speeding up SmallTalk? In this
case, you're compiling a small routine, instead of the whole program, but
have the same information.
   HP used something like this for a fast APL compiler
long ago in a galaxy far away. They compiled code for adding to objects
according to the object types at compilation time, and the code included a 
check for just those types. If the check failed, it bounced back to the
compiler, which generated code for a more general case; so, for example,
the first time it might compile code for arrays of size i=constant, and if
the size changed, then it would compile slightly more general code for arrays
of size n=variable, then if that also failed, maybe for arrays of rank m=var,
size n=var, etc. It ran much faster than interpreters running on faster
machines so it can be a big win. Again, in all cases, you know which addresses
to flush before you branch to them.
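
The control flow can be sketched in C roughly as below; there is no real code
generation here ("compiling" is faked by picking one of two pre-written
routines), and every name is invented, but it shows the guard check and the
bounce back to the compiler when the guard fails:

#include <stdio.h>

typedef struct { int is_scalar; int scalar; int *vec; int len; } apl_value;
typedef int (*add_fn)(apl_value *, apl_value *);

static add_fn cached_add;                     /* the "compiled" code cache */

/* Specialized routine: valid only for scalar + scalar. */
static int add_scalars(apl_value *a, apl_value *b)
{
    if (!a->is_scalar || !b->is_scalar)
        return -1;                            /* guard failed: back to the compiler */
    printf("%d\n", a->scalar + b->scalar);
    return 0;
}

/* More general routine: also handles vectors of equal length. */
static int add_general(apl_value *a, apl_value *b)
{
    int i;
    if (a->is_scalar && b->is_scalar) {
        printf("%d\n", a->scalar + b->scalar);
    } else {
        for (i = 0; i < a->len; i++)
            printf("%d ", a->vec[i] + b->vec[i]);
        printf("\n");
    }
    return 0;
}

/* The "compiler": picks (in reality, would generate) code for the case seen. */
static add_fn compile_add(apl_value *a, apl_value *b)
{
    return (a->is_scalar && b->is_scalar) ? add_scalars : add_general;
}

static void do_add(apl_value *a, apl_value *b)
{
    if (cached_add == 0 || (*cached_add)(a, b) != 0) {
        cached_add = compile_add(a, b);       /* re-"compile" for the new case */
        (*cached_add)(a, b);
    }
}

int main()
{
    int va[3] = { 1, 2, 3 }, vb[3] = { 10, 20, 30 };
    apl_value x, y, u, v;

    x.is_scalar = 1; x.scalar = 2;
    y.is_scalar = 1; y.scalar = 3;
    u.is_scalar = 0; u.vec = va; u.len = 3;
    v.is_scalar = 0; v.vec = vb; v.len = 3;

    do_add(&x, &y);   /* first call: "compile" the scalar case, prints 5       */
    do_add(&x, &y);   /* guard succeeds: runs the cached specialized routine   */
    do_add(&u, &v);   /* guard fails: fall back, "compile" the general case    */
    return 0;
}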
  
--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

barmar@think.COM (Barry Margolin) (07/15/88)

In article <749@cernvax.UUCP> hjm@cernvax.UUCP () writes:
>If the program is considered to be data for an interpreter which I will for
>convenience call a CPU, then does self-modifying code seem any different?  As
>long as the data in this table remains consistent, then there should be no
>difficulty in dealing with it mathematically.  All of the problems of database
>consistency have been tackled (but not all solved) some time ago.  Code is data
>according to a PATCH program, and the only tricky bits there are to ensure that
>any changes fit inside the space that they replace and to patch up the jumps
>where necessary.  A bit like a structure linked by pointers, such as a binary
>tree, except that code is in general a directed cyclic graph (the directedness
>is due to the program counter).

This view is valid, but simplifies the issues.  If you are going to
think of the program as a database, you must actually consider it as a
distributed database.  This is because the data is replicated in
several places: the physical memory, the swapping device, the CPU
caches, and internal processor registers.  The problems of maintaining
consistency in replicated databases is much less understood than
concurrency control in non-distributed databases.  Additionally, any
form of consistency maintenance must incur some performance penalty,
and hardware designers are under great pressure to make machines go as
fast as possible.  Multiple caches that do not maintain consistency
are one answer to this.  Since self-modifying code has generally been
considered bad style for quite some time, they felt comfortable
implementing a hardware optimization that assumes code will not modify
nearby locations.

Barry Margolin
Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar

bclaus@wright.EDU (Brian Clausing) (07/15/88)

in article <5262@june.cs.washington.edu>, pardo@june.cs.washington.edu (David Keppel) says:
> Xref: wright comp.lang.c:8563 comp.arch:4081
[...discussion about self-modifying code.  It fouls up caches.
It is intellectually unmanageable with current ISPs....]

As to the semantics of self-modifying code, isn't it identical to
a Turing machine's or am I missing something?  Of course, reduction
machines, like Turner's combinatoric implementation of SASL, are
inherently self modifying.
MB Clausing

baum@Apple.COM (Allen J. Baum) (07/15/88)

[]
>In article <1087@ficc.UUCP> peter@ficc.UUCP (Peter da Silva) writes:
>I have a question:
>
>	Why are an Icache plus a Dcache better than just
>	a big shared cache as big as both?

In terms of hit/miss ratios, a unified cache is clearly better. However, in
terms of effective access time, a split I/D cache is better. This is because
both can be accessed simultaneously, instead of one having to wait for the
other to finish. If both are getting accessed simultaneously (which should
happen 40% of the time, if Loads/Stores account for 40% of instructions), then
this more than offsets the increase in miss ratio.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

tim@amdcad.AMD.COM (Tim Olson) (07/16/88)

In article <1087@ficc.UUCP> peter@ficc.UUCP (Peter da Silva) writes:
| I have a question:
| 
| 	Why are an Icache plus a Dcache better than just
| 	a big shared cache as big as both?

When the processor has separate instruction and data buses, so
concurrent accesses can occur to both caches.

-- 
	-- Tim Olson
	Advanced Micro Devices
	(tim@delirun.amd.com)

lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) (07/16/88)

In article <1744@vaxb.calgary.UUCP> radford@calgary.UUCP (Radford Neal) writes:
>
>There are interesting cases where on-the-fly generation of code seems
>to be essential to get good asymptotic space and/or time complexity.
>
>Consider the following code fragment (version A):
>
>     for (i = 0; i<L; i++)
>     { if (c1) p1();
>       if (c2) p2();
>       ...
>       if (cN) pN();
>     }
>
>The Boolean variables c1, c2, ... cN are assumed to be loop invariants.

(...version B deleted...)

>The following lets you get both (version C):
>
>     start generating code;
>     if (c1) generate instruction to call p1;
>     if (c2) generate instruction to call p2;
>     ...
>     if (cN) generate instruction to call pN;
>     for (i = 0; i<L; i++) execute generated instructions;

Interesting. This reminds me a lot of the kind of compile-time manipulation
that optimizing compilers do. The loop

     for (i = 0; i<L; i++) execute generated instructions;

is an optimized version of A *when the ci are known*. If they were known at
compile-time an optimizing compiler could have generated the same code. The
only difference is that the self-modifying program does this *at runtime*,
when we are guaranteed to know the values of the ci's. Maybe we should call
this technique "run-time compilation"?

Bjorn Lisper

pardo@june.cs.washington.edu (David Keppel) (07/16/88)

In article <1087@ficc.UUCP> peter@ficc.UUCP (Peter da Silva) writes:
>[ Why are two caches (I, D) better than one (I+D)?

For a given real-estate, a non-snoopy I-cache will hold more data
bits and be faster than a snoopy I-cache.

+ For a given real-estate, the I-cache hit rate will be better.  This
  makes the average instruction fetch time lower.  This may be a win
  if instruction fetch rate is at all a bottleneck.

+ For a given cache size, the real-estate can be smaller.  This means:
 + The cache may be cheaper.
 + The cache may be small-enough to put closer to the instruction
   register, making it effectively faster.

+ The logic is simpler and more regular.
  + Faster design times.
  + Fewer bugs.
  + Easier to incorporate into VLSI designs (==faster).
  + Less logic to drive => faster, less power.

+ The cache doesn't need to be connected to as many things.
  + More freedom to place the cache => faster, cheaper.
  + Less external logic to drive => smaller, faster, cheaper.

+ Aliasing and data distribution are less of a problem (or no problem at all).
  This lets you build (with much less pain)
  + Hierarchical cache organizations (faster).
  + Virtually-addressed caches (faster).

I-caches are particularly useful on machines that make heavy use
of memory.  This includes:
+ Many high-performance processors.
+ Shared-memory multiprocessors.

The primary problem (that I'm aware of) with non-snoopy I-caches
is that you must manually flush the I-cache every time that an
instruction is changed.  (Strictly speaking, every time you change an
I-space address that has been touched since the last cold start).
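
As a concrete illustration of that modify/flush/execute sequence, here is a
minimal sketch in C.  It assumes a POSIX-style mmap() and the GCC/Clang
builtin __builtin___clear_cache, neither of which the original posters had,
and the machine-code bytes themselves are left to the caller:

    #include <string.h>
    #include <sys/mman.h>

    typedef int (*codefn)(void);

    /* Copy "len" bytes of machine code into an executable buffer, flush the
     * instruction cache over that range, then call it.  Returns -1 if the
     * buffer cannot be mapped.  (Some systems forbid pages that are both
     * writable and executable; this sketch ignores that.) */
    int run_generated(const unsigned char *code, size_t len)
    {
        unsigned char *buf = mmap(NULL, len, PROT_READ|PROT_WRITE|PROT_EXEC,
                                  MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
        if (buf == MAP_FAILED)
            return -1;

        memcpy(buf, code, len);                          /* modify  */
        __builtin___clear_cache((char *)buf,
                                (char *)buf + len);      /* flush   */
        int result = ((codefn)buf)();                    /* execute */

        munmap(buf, len);
        return result;
    }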

Does that answer your question ?-)

	;-D on  ( Faster )  Pardo

petolino%joe@Sun.COM (Joe Petolino) (07/16/88)

>>	Why are an Icache plus a Dcache better than just
>>	a big shared cache as big as both?
>
>In terms of hit/miss ratios, a unified cache is clearly better.

I beg to differ.  We ran a few simulations, comparing split and unified
caches (total cache size in the 32K-512K range), and when the caches were
direct-mapped (i.e. 1-way set-associative) the unified cache performed worse. 
Our guess is that when code and data are forced to co-exist in the same
space, you get a high probability of collision and thrashing.  This effect
went away when we increased the degree of set associativity.  One way to
think about it is that having two (direct-mapped) caches gives you some of
the benefits of set-associativity.  Take this with a grain of NaCl: we only
tried a few test programs, and each was small enough to fit into the cache.

-Joe

tps@chem.ucsd.edu (Tom Stockfisch) (07/16/88)

In article <33652@yale-celray.yale.UUCP> lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) writes:
>In article <1744@vaxb.calgary.UUCP> radford@calgary.UUCP (Radford Neal) writes:
>>There are interesting cases where on-the-fly generation of code seems
>>to be essential to get good asymptotic space and/or time complexity.
>>
>>     for (i = 0; i<L; i++)
>>     { if (c1) p1();
>>       if (c2) p2();
>>       ...
>>       if (cN) pN();
>>     }
>>[SMC solution follows]
>>     start generating code;
>>     if (c1) generate instruction to call p1;
>>     if (c2) generate instruction to call p2;
>>     ...
>>     if (cN) generate instruction to call pN;
>>     for (i = 0; i<L; i++) execute generated instructions;

Wouldn't the following be "nearly" as efficient, and be portable to boot?
At least it has the same asymptotic complexity.

	enum { END, P1, P2, ..., Pn }	code[MAXCODE], *p;

	p = code;
	if (c1)	*p++ =	P1;
	if (c2)	*p++ =	P2;
	...
	if (cn)	*p++ =	Pn;
	*p =	END;

	for ( pixel = 0; pixel < NPIXELS; pixel++ )
		for ( p = code; *p != END; )
			switch (*p++)
			{
			case	P1:
				[p1 instructions]
				break;
			case	P2:
				[p2 instructions]
				break;
			...
			case	Pn:
				[pn instructions]
				break;
			default:
				assert(0);
				/*NOTREACHED*/
			}
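
For comparison, the same structure can also be written as a jump table of
function pointers, trading the switch for an indirect call.  The sketch below
is illustrative only; it reuses p1..pn, c1..cn, MAXCODE, pixel and NPIXELS
from the fragments above:

        void    (*calls[MAXCODE])(), (**q)();
        int     ncalls = 0;

        if (c1) calls[ncalls++] = p1;
        if (c2) calls[ncalls++] = p2;
        ...
        if (cn) calls[ncalls++] = pn;

        for ( pixel = 0; pixel < NPIXELS; pixel++ )
                for ( q = calls; q < calls + ncalls; q++ )
                        (**q)();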
-- 

|| Tom Stockfisch, UCSD Chemistry	tps@chem.ucsd.edu

sean@lfcs.ed.ac.uk (Sean Matthews) (07/16/88)

With this talk about self modifying code, no one has mentioned the
theoretical work that has been done on the subject.

There is a quite well known paper (and quite controversial) on the
subject of programs that can modify themselves and their interpreters
by Brian Smith

Reflection and Semantics in a procedural language
MIT-LCS-TR-272 Mass. Inst. Tech.
January 1982

Reflection and semantics in Lisp
11th ACM symposium on principles of programming languages

Then there is the reply to these papers by
Mitchell Wand and Daniel Friedman

The mystery of the tower revealed:
a non-reflective description of the reflective tower
CACM 1986
(this is as far as I can go since I have a copy of a copy and there
is no publishing information on it)

An extended version of this paper was published in
Meta-level architectures and reflection,
P. Maes and D. Nardi (editors)
Elsevier Science Publishers B.V. (North-Holland) 1988

There is a lot of theory about self modifying systems and
self referential systems but by the time you start looking into it
you are in philosophical logic, not comp.architecture

Se\'an Matthews
arpa: sean%uk.ac.ed.aipna@nss.cs.ucl.ac.uk

mash@mips.COM (John Mashey) (07/16/88)

In article <1087@ficc.UUCP> peter@ficc.UUCP (Peter da Silva) writes:

>	Why are an Icache plus a Dcache better than just
>	a big shared cache as big as both?

1) (Major) with split caches, you can access a word of instruction and
a word of data in the same cycle.  With a joint cache, you get one or
the other.  As a truly gross estimate, expect to lose 10-40% of your
performance, given conflicts between I-refs and loads/stores.

2) Less important, but still useful.  As you make a direct-mapped cache
bigger, each successive 2X improves the hit-rate, but it improves it less than
the last 2X.  At some point, it is hard to make it bigger and keep the same
speed SRAMs, and then the only remaining way to improve the hit rate (without
growing the cache) is to make it set-associative.  Given the same total
cache size, hit rates for many programs (not all, there are exceptions):
	joint, direct-mapped <=
	split, direct-mapped <=
	joint, 2-way set associative
Note: those are Hit Rates, not performance.  There are a bunch of reasons
to keep caches (at least the ones nearest the CPU) direct-mapped when
you can.

A fine paper that analyzes the various tradeoffs in cache design
[rather than just hit rates] is:

	S. Przybylski, M. Horowitz, J. Hennessy, "Performance Tradeoffs
	in Cache Design", Proc. 15th Annual Symposium on Computer
	Architecture, May 30 - June 2, 1988, Honolulu, Hawaii.
	(Computer Architecture News 16, 2 (May 1988), 290-298.)
	
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

bill@proxftl.UUCP (T. William Wells) (07/16/88)

In article <60175@sun.uucp>, petolino%joe@Sun.COM (Joe Petolino) writes:
> >>    Why are an Icache plus a Dcache better than just
> >>    a big shared cache as big as both?
> >
> >In terms of hit/miss ratios, a unified cache is clearly better.
>
> I beg to differ...
>                                  ...Take this with a grain of NaCl: we only
> tried a few test programs, and each was small enough to fit into the cache.

Another relevant point about separate caches is this:
instructions and data have different access patterns.  A cache
designed for the one is not necessarily going to be right for the
other; a cache designed for both may not do as well as the
separate ones.  So this possibility has to be balanced against
the general benefit of having a larger pool to work with.

sl@van-bc.UUCP (pri=-10 Stuart Lynne) (07/17/88)

In article <752@cernvax.UUCP> hjm@cernvax.UUCP () writes:
>I have been mulling over the idea of self-modifying code (SMC) for a while and
>I've come to the conclusion that there is no good definition of SMC.

>For example, if treating code as data is the definition, then does passing a
>procedure as a parameter in PASCAL, or a pointer to a function in C count?
>Probably not.

>OK, what about a jump table.  Consider an array of pointers to functions in C.
>Does changing the pointers count as SMC?  Again, I don't think so.

>So, changing a pointer by assigning to it is not SMC, but putting a new jump
>instruction in (e.g. jmp #somewhere_else) in place of an existing instruction
>is SMC.  Does one level of indirection really make that much difference?


A simple definition might be any program that cannot be compiled and run
in shared text mode (or equivalent for non Unix application environments).

Modifying a jump table in your data space does not affect how the program
will run for other users. 

Modifying a jump instruction in the shared text *will* affect how the
program will run for other users.


I have used SMC in places that were by design not required to be shared and
where a high degree of efficiency was required. For example, in a p-System
type interpreter you need to have a test in the opcode fetch and dispatch
loop for a pseudo interrupt. This takes a lot of time. By simply patching in
a jmp to_int as the first instruction of the loop we eliminate the need for
an explicit test. 

To make the above example work for multi-user systems, we used a jump
indirect through a var which contained either the ifetch address or
interrupt handler address. A little slower but still faster than explicit
test in the ifetch loop.
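
A sketch of that indirect-dispatch variant in C; ifetch, handle_interrupt,
post_interrupt and run are invented names here, and the bodies are stubs:

    static void ifetch(void);
    static void handle_interrupt(void);

    /* All dispatching goes through this pointer; swapping it replaces the
     * explicit "is a pseudo-interrupt pending?" test in the hot loop.    */
    static void (*dispatch)(void) = ifetch;

    void post_interrupt(void)            /* called when an event arrives  */
    {
        dispatch = handle_interrupt;
    }

    void run(void)
    {
        for (;;)
            (*dispatch)();               /* normally just the fetch loop  */
    }

    static void ifetch(void)
    {
        /* fetch and execute one pseudo-opcode */
    }

    static void handle_interrupt(void)
    {
        /* service the pseudo-interrupt, then resume normal dispatch */
        dispatch = ifetch;
    }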

Similar problems can arise when writing high-performance device
drivers. 


-- 
Stuart.Lynne@wimsey.bc.ca {ubc-cs,uunet}!van-bc!sl     Vancouver,BC,604-937-7532

leo@philmds.UUCP (Leo de Wit) (07/17/88)

In article <752@cernvax.UUCP> hjm@cernvax.UUCP () writes:
>I have been mulling over the idea of self-modifying code (SMC) for a while and
>I've come to the conclusion that there is no good definition of SMC.
>
>For example, if treating code as data is the definition, then does passing a
>procedure as a parameter in PASCAL, or a pointer to a function in C count?
>Probably not.

True. A pointer is data.

>OK, what about a jump table.  Consider an array of pointers to functions in C.
>Does changing the pointers count as SMC?  Again, I don't think so.

Again true. Note that it may be possible that the system (O.S.,
processor) checks whether the new pointer value represents a valid
address, or a valid entry point. Whether this is (always) desirable is an
entirely different question. E.g. some architectures will protest if
you try to jump to an odd address. PASCAL will probably leave you less
room to cheat than C.

>So, changing a pointer by assigning to it is not SMC, but putting a new jump
>instruction in (e.g. jmp #somewhere_else) in place of an existing instruction
>is SMC.  Does one level of indirection really make that much difference?

Yes, I think it does make a difference. Maybe not always, but there are
cases where you can't get away with this: think of re-entrant code, or
shared text segments. Now that I'm thinking of recursion, what about putting
the code on the stack 8-) ? No worry about re-entrancy, and the C
program model becomes more complete (we already have global, static and
automatic data, and global and static functions, and now there are
automatic functions...).

>Of course, if you want to be really horrid in C, you can try something like
>this:
>
>char codearray[] = { 0x03, /* one byte instruction for something */
>		     0xc9  /* one byte return instruction */
>		   }

This must be (or was a) Z80 freak! At least I recognize the 0xc9 as:
RET.  On a Z80 you could do other horrible things too, since
instruction sizes vary from 1-4 bytes; by carefully picking your
instructions you could use the same code twice (using a one-off entry
point), with entirely different result. Good (?) old days, when memory
was more expensive than programming effort....

>and then 'call' this function using a cast to turn the pointer codearray into
>a pointer to a function.  (Who needs #asm anyway!)  Then you can modify the
>code as much as you want.  This _is_ SMC without a doubt, because you can
>overwrite code.  So, I propose a very weak definition for SMC as code that
>writes over other code.

Note that not all implementations will allow initialized arrays to be
altered.  I remember a problem we had last year on VMS while passing an
initialized char array to mktemp() (or whatever its name is); the
program stackdumped because mktemp tried to write into the readonly
array (yes, VMS put it in readonly!).

>As a final note, why is it 'clean' to alter a jump table and 'unclean' to alter
>an inline constant (e.g. jmp @offset(r0) uses a value in memory as the address
>but mov (pc)+,#1234 which loads an immediate does so too)?  Why the subtle
>difference?  Any thoughts on the subject?

Try putting your code in ROM and see what happens. Just one example.
Besides, I think jump tables are generally not altered; pointers are.
The jump tables could well be in the text segment, for instance.
A jump table is not an array of function pointers.

>	Hubert Matthews
>
>(I don't consider LISP or PROLOG programs that create code on the fly to be
>SMC.  Does anyone disagree?)

No, unless the code is buggy; such code on a fly could well be
Self Multiplying 8-).

  Leo.

smryan@garth.UUCP (Steven Ryan) (07/17/88)

>As to the semantics of self-modifying code, isn't it identical to
>a Turing machine's or am I missing something?

No. The state mechanism of a Turing machine is fixed and finite.

To show equivalence, it is necessary to show that self modifying code
does not increase the number of states. Usually, people just appeal
to Church-Turing and assume all the states are there, somewhere.

For something like a load and go compiler, since every possible program
is not hidden somewhere in the compiler, it would appear to escape CT.
However the compiled program can be regarded as a transformation of the
input tape into another region of the tape and its apparent execution
as an interpretation of the tape by the compiler/loader/cpu state machine.

Then again, if you believe really hard and clap your hands.....

                            sm ryan

----------------------------------
Hugh: Why does he carry a sword?
Fred: He has delusions of grandeur.
him:  Uh, Fred...
      (SWOOP)
Fred: YEEK
him:  ...before you make a value judgement about a person's delusions...
      (poink)
Fred: AWK! My head feathers!
him:  ...you'd better be absolutely sure they ARE delusions.
      (JAB)
Fred: OUCH!
                    -- Odd Bodkins

david@daisy.UUCP (David Schachter) (07/17/88)

In article <1276@edge.UUCP> gregr@edge.UUCP (Greg Rose) writes:
>The discussion of self modifying code, as it has progressed in this newsgroup,
>will never reach an end. This is simply because the opponents and proponents
>are not talking about the same things at all.

Actually, the discussion will never end because it changes as it goes along.

Oh, and regarding the chap who asked for a definition of SMC: SMC is code that,
when you look at it, you go "Blecch-- this is self-modifying code."

seanf@sco.COM (Sean Fagan) (07/17/88)

In article <1276@edge.UUCP> gregr@edge.UUCP (Greg Rose) writes:
[brief summary of the SM code discussion]
>"Self modifying code" such as was used on early computers, for example
>early CDC 6600 machines (not flaming them, just the example I know)
>was a NECESSARY technique for performing subroutine linkage. There was no
>instruction to perform a jump to an arbitrary address (i.e. the return
>address of the subroutine) so the subroutine entry code *had* to overwrite
>a jump instruction's operand (at a known, per subroutine, address) before
>calling. This was usually the instruction immediately at or before the start
>of the subroutine. In psuedo code:
>
[sample pseudo code]

NO NO NO NO!!!!!! The CDC 6600 (and 7600, and the Crays, as well) had a Jump
to Subroutine instruction:  RJ <addr> (Return Jump) (on the 6600); what this
would do is write a <JP {calling address+1},NOP,NOP> into <addr> (the
machine could pack up to 4 instructions in a word [60 bit words, 15 or 30
bit instructions], so the assembler had to look for labels for RJ
instructions and pad out the rest of the word with NOPs 8-)).  (The machine
did this because it lacked one simple little feature found on many of
today's machines:  a stack.)  Therefore, when you did an RJ, it would perform
the indicated write, and start execution at <addr>+1, and, to return, you
would JP (or do an unconditional branch, which wouldn't clear out the
instruction cache) to the entry point, and boom! you're back where you want
to be.  Because of other features of the machine (such as banks of memory
[i.e., the memory was accessed as one of 8(?) banks of RAM, with each
succeeding address in a different bank.  Accessing any individual address could
be, oh, 8 times slower than the machine cycle, but accessing the next
address wouldn't be.  Therefore, when you did an RJ, it could write to
<addr>, and then get <addr+1> at the next cycle], and the fact that you
could write to a location and execute another instruction [instruction
overlap]), this was *much* faster than having to maintain a stack would be,
and it also didn't tie up any of the registers (it only had 24 of them, one
of which was hardwired to 0, one of which was usually 1, and 7 of which, if
accessed, would either screw up memory or another register).

They were very fast machines, as I've said here before.  If only they didn't
have NOS...

>Greg Rose, Softway Pty Ltd, Australia
>...uunet!munnari!greg@softway.oz

Sean.
---------------
Sean Eric Fagan
seanf@sco
(408) 458-1422, ext. 3561

elg@killer.DALLAS.TX.US (Eric Green) (07/17/88)

in article <4776@killer.UUCP>, chasm@killer.UUCP (Charles Marslett) says:
> In article <33441@yale-celray.yale.UUCP>, lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) writes:
>> Another related issue is that self-modifying code cannot be executed
>> directly from ROM. Executing programs in ROM is an important memory-saving
>> technique in small, diskless, special-purpose systems.
> 
> Actually, I always thought that ROM was much more important as a way of
> eliminating the need to load the code and as a way of guaranteeing non-
> self-modifying-ness.  Usually ROM code is larger (thus memory-consuming)
> than RAM-loaded code. 

Uh, I don't know what systems you're talking about, but for the small
microcontrollers that I've used -- the ROM is assembled to be placed
in the memory map at an absolute location. That is, it takes up the
same amount of space whether it's in ROM or RAM (in fact, one of my
favorite hardware designs is a little SRAM card that I use to spoof
ROM for prototyping purposes).  As for re-using memory buffers etc.,
those are always in RAM, so those are always re-usable -- no matter
where your executable is loaded. 

32K bytes of ROM costs about 1/4th of what 32K bytes of static RAM
costs (not to mention the eliminating-need-to-load). Note that these
are dedicated control systems using 6502's, Z-80's, and 6809's, which
have a total address space of 64K -- so 32K of ROM is a heckuva lot.

> Seriously, newer architectures have reduced the difference (to 0 perhaps),
> but the emphasis on RISC these days may resurrect self-modifying code --
> a RISC-er technique is not known to mankind!  (:-D}===<<).

Actually, RISC may be the stake-in-the-heart of self-modifying code.
RISC technology overcomes the increased memory bandwidth requirement
by using enhanced cache technology, 256-byte-at-a-time prefetch from
page-mode DRAM's, heavy pipelining, and every other trick in the book.
Most of which are quite contrary to the thought of self-modifying
code. Another RISC trick is to use the saved silicon for extra
registers, and have the compilers keep as many values as possible in
the CPU registers. You can always do a one-word memory-indirect load
off of a register much faster than you can do a "memory-write(the
address into the instruction) memory-read(the address part of the
modified instruction) memory-read" triplet (which is at least three
two-word instructions, as vs. a single one-word instruction, on a
machine where one cycle = one word processed).

But, I HAVE done self-modifying code on a 6502-based microcontroller
(out of sheer necessity). I still make an occasional enhancement to
that software, over 3 years later, as new applications arise, and have
had no problems with readability. After all, in assembly on a
microcontroller, you have to save the address somewhere, and the
fact that the "somewhere" happens to be inside an assembly language
instruction makes little difference. The argument against
self-modifying code lies elsewhere besides readability.

--
Eric Lee Green    ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg
          Snail Mail P.O. Box 92191 Lafayette, LA 70509              
       MISFORTUNE, n. The kind of fortune that never misses.

cik@l.cc.purdue.edu (Herman Rubin) (07/17/88)

In article <361@scolex>, seanf@sco.COM (Sean Fagan) writes:
> In article <1276@edge.UUCP> gregr@edge.UUCP (Greg Rose) writes:
> [brief summary of the SM code discussion]

			........

> NO NO NO NO!!!!!! The CDC 6600 (and 7600, and the Crays, as well) had a Jump
> to Subroutine instruction:  RJ <addr> (Return Jump) (on the 6600); what this
> would do is write a <JP {calling address+1},NOP,NOP> into <addr> 

			........

The RJ would set up the return address, but how about the address in the call?
It is not that unusual to have subroutines or functions as arguments of called
subroutines, or computed by the program in some other way.  

I have used a program which let the user decide, in a small subprogram,
which way something should be done.  The main program had many calls to a
dummy procedure in that module, and the dummy changed each call into a call
to the desired routine.  This way the dozens of call sites in the main
program would compile as calls to the dummy, and the first execution would
change the call (and the return) at that time.
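
A portable analogue of that trick uses a function pointer rather than
rewriting the call instruction itself.  In the sketch below, way_a, way_b and
choose_version are invented names; only the first-call rebinding is the point:

    static double way_a(double x) { return x + x; }     /* stand-in bodies */
    static double way_b(double x) { return x * x; }

    static int choose_version(void) { return 1; }       /* user's decision */

    static double first_call(double x);

    /* Every call site in the "main program" calls through this pointer. */
    static double (*do_it)(double) = first_call;

    static double first_call(double x)
    {
        /* The first execution rebinds the pointer; later calls go
         * straight to the chosen routine. */
        do_it = choose_version() ? way_a : way_b;
        return do_it(x);
    }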
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

g-rh@cca.CCA.COM (Richard Harter) (07/17/88)

In article <989@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes:
>>As to the semantics of self-modifying code, isn't it identical to
>>a Turing machine's or am I missing something?

>No. The state mechanism of a Turing machine is fixed and finite.

>To show equivalence, it is necessary to show that self modifying code
>does not increase the number of states. Usually, people just appeal
>to Church-Turing and assume all the states are there, somewhere.

	This needs some qualification.  Strictly speaking no computers
are Turing machines because they do not have an infinite memory or 
equivalent.  However almost all computers would be universal Turing
machines if they somehow did.

	However it depends on what level you are speaking of.  If you
are talking about the hardware, that's a *qualified* Turing machine.
You can talk about a language being a virtual Turing machine; that makes
sense, albeit you want to be more careful with your language than I am
being here.  The language remains fixed, and most languages are universal
Turing machines.  In this case the SMC is data.  If you want to talk
about a specific program as a virtual Turing machine (probably not
universal) then, yes, it can't modify itself.
-- 

In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die.
	Richard Harter, SMDS  Inc.

smryan@garth.UUCP (Steven Ryan) (07/18/88)

>to Subroutine instruction:  RJ <addr> (Return Jump) (on the 6600); what this
>would do is write a <JP {calling address+1},NOP,NOP> into <addr> (the

Technically,
              +    EQ  caller_address+1
              -    VFD 30/0

>bit instructions], so the assembler had to look for labels for RJ

Compass forces upper on all labels.

>instructions and pad out the rest of the word with NOPs 8-)).

And SB0 B0+46000.

>did this because it lacked one simple little feature found on many of
>today's machines:  a stack.)

Actually, what it lacked was hardware support for a stack. I implemented
reentrant and recursive stack frames for doing multitasking in Compass.

>the indicated write, and start execution at <addr>+1, and, to return, you
>would JP (or do an unconditional branch, which wouldn't clear out the
>instruction cache) to the entry point, and boom! you're back where you want

And sneaky code puts exit code above the entry point so that it falls into
jump back to the caller:

           RTN0    exit processing
           RTN     PS            entry point.
                   junk
                   ZR   ...,RTN0
                   more junk

>[i.e., the memory was accessed as one of 8(?) banks of RAM, with each

8, Aye.

>succeeding address in a different bank.  Accessing any individual address could
>be, oh, 8 times slower than the machine cycle, but accessing the next

Actually, I think a major cycle was 10 minor cycles.

>They were very fast machines, as I've said here before.  If only they didn't
>have NOS...

Meaning NOS 1 I hope.

NOS 2 has a lot of nifty features I wish Unix had.

aglew@urbsdc.Urbana.Gould.COM (07/18/88)

>I have a question:
>
>	Why are an Icache plus a Dcache better than just
>	a big shared cache as big as both?
>
>Peter da Silva  `-_-'  Ferranti International Controls Corporation.

The best data I have seen says that a big shared cache as big as both
would be better - all other things being equal.

The other things include stuff like cycle time, simultaneous access,
line sizes.

For example, with split I/D caches you can have simultaneous access
to both the I and D cache in the same cycle; with a shared cache this
is harder to do.  The I cache doesn't need to have expensive bus
watching circuitry for invalidations.  The I and D caches can be 
located physically closer to where they will be used.  The smaller
the cache, the faster.  An I cache can be optimized for instruction
access, which is much more sequential than data access (this is the 
same type of argument that can lead to a separate cache, or no cache, 
for vectors).


Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   
    aglew@gould.com     	- preferred, if you have MX records
    aglew@xenurus.gould.com     - if you don't
    ...!ihnp4!uiucuxc!ccvaxa!aglew  - paths may still be the only way
   
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) (07/18/88)

In article <1394@daisy.UUCP> david@daisy.UUCP (David Schachter) writes:
>In article <1276@edge.UUCP> gregr@edge.UUCP (Greg Rose) writes:
>>The discussion of self modifying code, as it has progressed in this
>>newsgroup, will never reach an end. This is simply because the opponents
>>and proponents are not talking about the same things at all.
>
>Actually, the discussion will never end because it changes as it goes along.

In other words: a self-modifying discussion!  :-)

Bjorn Lisper

conybear@moncsbruce.oz (Roland Conybeare) (07/18/88)

From article <752@cernvax.UUCP>, by hjm@cernvax.UUCP (hjm):
> As a final note, why is it 'clean' to alter a jump table and 'unclean' to 
> alter an inline constant (e.g. jmp @offset(r0) uses a value in memory as the
> address but mov (pc)+,#1234 which loads an immediate does so too)?  Why
> the subtle difference?  Any thoughts on the subject?
> 
> 	Hubert Matthews

I can see several reasons.

* the big, big reason for referring to code via pointers, and getting the effect
  of self-modifying code via such pointers, is that you make your changes
  independent of the size of the code.  Real SMC will only work when the new
  code is no larger than the old code.  I think this is a very restrictive
  assumption.

* when you alter a jump table (in C, at least) you are doing so within the 
  language, and can expect the compiler to understand you.  A language which
  allows you to modify instructions directly would of necessity depend strongly
  on the machine architecture to run these instructions.  Otherwise, why don't
  we all use Universal Assembly Language?

Roland Conybeare
conybear@moncsbruce.oz


daveb@geac.UUCP (David Collier-Brown) (07/18/88)

From article <33441@yale-celray.yale.UUCP>, by lisper-bjorn@CS.YALE.EDU (Bjorn Lisper):
> In article <12357@ut-sally.UUCP> nather@ut-sally.UUCP (Ed Nather) writes:
>>In article <5254@june.cs.washington.edu>, pardo@june.cs.washington.edu
> (David Keppel) writes:

[a long, and generally well-reasoned debate elided]

>>Sorry, no.  I've heard LOTS of arguments against programs that generate their
>>own code, 

  Gentles, could we ***PLEASE*** reserve the term "self-modifying code" for
code which actually modifies itself on the fly (eg, for generating
indexes by instruction modification instead of using index registers)?

  Much of what is being discussed here is part of the well-known and
respectable "generate and execute" paradigm. The only difference
between this and the normal code-generation paradigm is that an
application generates code at run-time, not a compiler at a previous
time.

  Mixing the two is making this discussion "noisy".

--dave (sorry about the pedantry, but it's important) c-b
-- 
 David Collier-Brown.  {mnetor yunexus utgpu}!geac!daveb
 Geac Computers Ltd.,  |  Computer science loses its
 350 Steelcase Road,   |  memory, if not its mind,
 Markham, Ontario.     |  every six months.

peter@ficc.UUCP (Peter da Silva) (07/18/88)

In article <1110@ficc.UUCP>, karl@ficc.UUCP (karl lehenbauer#) writes:
[ about some neat examples of SMC ]

What about the PolyForth scheduler. Each entry in the process table contained
a jump instruction. The next instruction was the save space for the current
PC in a context switch. So a context switch involved jumping to the next
processes process table entry. When a process waited, it put the address of
the next process table entry in that location. This meant that the process
was going to effectively busy-wait, but it's a *very* efficient busy-wait.
For the sort of tight real-time control that PolyForth was designed for, this
was apparently the most efficient way of doing things.
-- 
Peter da Silva  `-_-'  Ferranti International Controls Corporation.
"Have you hugged  U  your wolf today?" (uunet,tness1)!sugar!ficc!peter.

henry@utzoo.uucp (Henry Spencer) (07/19/88)

In article <1090@ficc.UUCP> peter@ficc.UUCP (Peter da Silva) writes:
>You can save an even greater factor by building the BitBlt into the
>hardware....

No.  You can save a large factor over *poor* software implementations of
BitBlt.  Not over good ones.  The dynamic-compilation implementations on
the Blit et al run the memory at nearly full speed for the usual cases
of BitBlt; it is fundamentally impossible to do better.  Commodore et al
find hardware implementations attractive because they don't know how to
write good software.
-- 
Anyone who buys Wisconsin cheese is|  Henry Spencer at U of Toronto Zoology
a traitor to mankind.  --Pournelle |uunet!mnetor!utzoo! henry @zoo.toronto.edu

baum@Apple.COM (Allen J. Baum) (07/19/88)

[]

For those of you interested in on-the-fly generated code, here is a reference
to an interesting article:
Turning interpreters into Compilers
Frank Pagan -C.S.U. Fullerton
June 88 Software- Practice and Experience

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

rcd@ico.ISC.COM (Dick Dunn) (07/19/88)

First, a clarification:  Self-modifying code was *not* necessary on the
6600 CPU.  The call instruction generated a return instruction.  However,
it *was* "necessary" to use self-modifying code in the peripheral (I/O)
processors, because there were certain operations on channels which
required an immediate operand (the channel number)--the only reasonable way
to select a channel at execution time was to modify the instruction.  (A
jump table of instructions with constant channel numbers would have worked,
but would have been far too slow.)

Back to the subroutine call in the CPU, which worked by planting a return
instruction at the target, then jumping one word beyond.  Herman Rubin
asked:  "And how would you call a function which is an argument?"...
> The RJ would set up the return address, but how about the address in the call?
> It is not that unusual to have subroutines or functions as arguments of called
> subroutines, or computed by the program in some other way.  

Of course--even in FORTRAN this is common, and the 6600 was a FORTRAN
machine if ever there was one.  The answer is simple:  You emulate the RJ
instruction.  You form the return instruction (actually this can be done at
compile time to save execution) and plop it into memory just as the RJ
would have done; then you jump to the target routine.  The jump address is
formed from a register plus a constant, so it can be a variable or computed
address with no problem.  This approach roughly doubles the time to call a
routine; the return is unaffected.

[It is one of life's enduring mysteries why the "RJ" (subroutine call)
instruction allowed only a constant operand, while the "JP" (unconditional
branch) allowed register plus constant!  There was room in the RJ
instruction format for a register.]

It's interesting to look at what it costs to do subroutine calls with
return addresses on a stack (instead of plunked into memory) so that you
can have recursive calls.  It's only about 1.5* the time for the native
call/return to constant address, and it's about a break-even for call/
return to variable address.  (It's slightly faster on the 6400, slower on
the 6600.)
-- 
Dick Dunn      UUCP: {ncar,nbires}!ico!rcd           (303)449-2870
   ...Are you making this up as you go along?

hwfe@ur-tut (Harlan Feinstein) (07/19/88)

Reasons for writing self-modifying code have been scarce, and have basically
centered around speed.  But speed is not the only reason someone
would want to write s-m code.  I've seen a program for the IBM PC called
PC-LOCK that prohibits one from accessing the hard disk without providing
a password.  The install and de-install programs are self-modifying to 
make tracing them all but impossible.  

One time last summer I changed my password right before the weekend, and
wouldn't you know it, come Monday I had no clue as to what the password was.
I tried all kinds of disassembly and debug tracing to no avail.  Hence, another
reason for self-modifying code.

--harlan

hjm@cernvax.UUCP (hjm) (07/19/88)

Programmers have become quite accustomed to the destructive assignment of
data ( x := y + 2 ), so why should the destructive changing of code be any
different?  If the practice were more common, then it wouldn't seem quite so
bad, surely.  After all, a Turing machine with a finite length tape is not
*that* difficult ...

	Hubert Matthews

david@daisy.UUCP (David Schachter) (07/20/88)

In article <1988Jul18.231158.19500@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>The dynamic-compilation implementations on
>the Blit et al run the memory at nearly full speed for the usual cases
>of BitBlt; it is fundamentally impossible to do better.  Commodore et al
>find hardware implementations attractive because they don't know how to
>write good software.

I disagree.  It is not necessarily true that good software blit code can
run as fast as hardware supported blit operations.  Sometimes, maybe, but
not always.  Henry's claim "it is fundamentally impossible to do better"
doesn't fly.  What about a CPU with a 16 bit bus competing with a blit
chip with 64 bit access to the frame buffer?  What about a CPU that must
fetch instructions on the same bus used for data?


Henry's statement is over-broad.


					-- David Schachter

#include "disclaimer.legal"
#include "funny.quote"

rcd@ico.ISC.COM (Dick Dunn) (07/20/88)

One other quick note about self-modifying code on the CDC 6000/7000 series:
It was generally a very bad idea to do it because of the instruction
lookahead.  The 7600 looked farther ahead than the 6600, and the 6400 only
had a one-word instruction lookahead.  Thus, depending on the processor,
it might or might not have already fetched an instruction before you
modified it--leading to the superficially baffling phenomenon that the
processor would appear not to have executed the instructions which would
show up in a memory dump.

But it's an ill wind, etc...  The fact that self-modifying code behaved
differently (but predictably so) on different processor models was used to
create a code sequence which could identify the processor at run time and
let code adapt itself accordingly.
-- 
Dick Dunn      UUCP: {ncar,nbires}!ico!rcd           (303)449-2870
   ...Are you making this up as you go along?

darin@laic.UUCP (Darin Johnson) (07/20/88)

In article <1988Jul18.231158.19500@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes:
> In article <1090@ficc.UUCP> peter@ficc.UUCP (Peter da Silva) writes:
> >You can save an even greater factor by building the BitBlt into the
> >hardware....
> 
> No.  You can save a large factor over *poor* software implementations of
> BitBlt.  Not over good ones.  The dynamic-compilation implementations on
> the Blit et al run the memory at nearly full speed for the usual cases
> of BitBlt; it is fundamentally impossible to do better.

However, a hardware BitBlt using DMA can be operating while the CPU is
servicing other processes, with no slowdown in CPU speed.  So even if
you can squeeze your BitBlt routine down to the bare minimum and it
takes x CPU cycles to do the blit, that is x cycles that are wasted.
Granted, if you are using a single-tasking machine, you won't get
much speedup using hardware blits.  Since the Amiga is multi-tasking,
its blitter chip does provide speedup.  Also, the Amiga blitter can do
logical operations (on three inputs) just as fast as a normal memory
copy - It'd be pretty hard to hack some 680x0 code to XOR three inputs
and write to three outputs as fast as it can move memory.  Also, the
overhead is very low compared to the overhead involved in dynamic compilation.

If dynamic compilation is that much faster, then you would have expected
someone to have written a program that uses this technique on the Amiga,
rather than using the blitter.  So far, I haven't seen any.  If one shows up
that is comparable in speed to the hardware, I would bet that the overhead
and memory involved are high enough that the program's usefulness is diminished.

-- 
Darin Johnson (...pyramid.arpa!leadsv!laic!darin)
              (...ucbvax!sun!sunncal!leadsv!laic!darin)
	"All aboard the DOOMED express!"

colwell@mfci.UUCP (Robert Colwell) (07/21/88)

In article <759@cernvax.UUCP> hjm@cernvax.UUCP () writes:
>
>Programmers have become quite accustomed to the destructive assignment of
>data ( x := y + 2 ), so why should the destructive changing of code be any
>different?  If the practice were more common, then it wouldn't seem quite so
>bad, surely.  After all, a Turing machine with a finite length tape is not
>*that* difficult ...

Hubert, I was going to let this whole SMC discussion slide on by, but
your post finally convinced me otherwise.

First, you meant "y" := y + 2, not "x", right?

Second, it seems like only yesterday when we (the royal we) CPU
architects were so concerned with trying to narrow the semantic gap
between what a programmer was trying to express and what the
machine/language was coercing her into.  Languages like Ada and
machine architectures like capability machines were intended to
address this perceived need.  The basic idea is that (oversimplifying
drastically) in a small programming environment (you at your
standalone workstation) anything goes in terms of protection and
language type checking, etc.  

The interesting domain is that of programming-in-the-large, with
large teams of programmers producing hundreds of thousands of lines
of code (SDI/Shuttle).  There, extracting the last nano-mip of
performance is of far less interest than in avoiding bugs and
producing maintainable code, and I still believe that supporting that
environment is a far more important challenge to architects than
worrying about SMC could ever be.  And I don't think (heavy sarcasm
here) that UNIX/C is the ultimate answer to that environment.




Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090

anw@nott-cs.UUCP (07/21/88)

<Reminiscence mode on>

	The Manchester University ATLAS, on which I cut my teeth (violins),
included in its standard (ABL) Assembler a most extraordinary example of
SMC.  Each 48-bit instruction consisted of a 10-bit opcode, two 7-bit
register numbers and a 24-bit address.  If the MSB of the opcode was 1,
the operation was an extracode, essentially one of 512 hard-wired
subroutines [trig fns, complex arith, double arith, tape motions, I/O, etc,
and how 512 such subroutines were squeezed into 4096 words, shared with a
compiler, supervisor, test routines etc is another gory story].  If it was
0, then the next bit was ignored, and the following two bits determined the
type of operation (00 illegal, 01 integer, 10 test, 11 floating).

	ABL included 128 user-settable flags.  Where were these stored?
Where else but in the ignored bits of 128 consecutive instructions within
ABL itself which happened not to be extracodes.  I remember being amazed
by the discovery.  I'm still not sure whether one should be appalled, or
whether one should marvel at the ingenuity.  All those people who were
recently distrustful of an optimising assembler would have been really
dubious about a self-modifying assembler!

	I once, and only once, wrote SMC myself.  Also on ATLAS.  It was a
self-initialising random number generator.  First time through, it
initialised itself, either from the clock or from the user-supplied seed,
then pulled the real code in on top of itself.  Must have saved 3us per call,
may well have amounted to 0.1% of the total run-time in heavy simulations.
I was proud of it at the time.

	ATLAS would be a fine candidate for Bob Webber's collection of
historic computers for which there ought to be simulations.  I still have
a cupboard full of machine-code listings of most of the important software.

<Reminiscence mode off>
-- 
Andy Walker, Maths Dept., Nott'm Univ., UK.
anw@maths.nott.ac.uk

guy@gorodish.Sun.COM (Guy Harris) (07/21/88)

> Second, it seems like only yesterday when we (the royal we) CPU
> architects were so concerned with trying to narrow the semantic gap
> between what a programmer was trying to express and what the
> machine/language was coercing her into.  Languages like Ada and
> machine architectures like capability machines were intended to
> address this perceived need.

A naive (and not rhetorical) question: what evidence is there to indicate the
degree to which "narrowing the semantic gap" with capability machines and the
like would improve the productivity of programmers or the reliability of
programs, and to which other techniques (e.g., making a very fast conventional
machine, possibly but not necessarily using RISC technology, and using that
speed to improve the programming environment with better software) achieve the
same goal?

peter@athena.mit.edu (Peter J Desnoyers) (07/21/88)

In article <60782@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes:
>
>A naive (and not rhetorical) question: what evidence is there to indicate the
>degree to which "narrowing the semantic gap" with capability machines and the
>like would improve the productivity of programmers or the reliability of
>programs, and to which other techniques [fast machines, good software]
>achieve the same goal?

Some protection and tracing features are MUCH slower in software.
These features are also useful in writing wonderful debuggers. You
need a bus snooper or ICE with an 8088 to achieve debugger features
that can be done in software on the 68020 or 80386. And per-task
memory protection is a big win in any environment, but impossible in
software. 

				Peter Desnoyers
				peter@athena.mit.edu

guy@gorodish.Sun.COM (Guy Harris) (07/22/88)

> >A naive (and not rhetorical) question: what evidence is there to indicate
> >the degree to which "narrowing the semantic gap" with capability machines
> >and the like would improve the productivity of programmers or the
> >reliability of programs, and to which other techniques [fast machines, good
> >software] achieve the same goal?
> 
> Some protection and tracing features are MUCH slower in software.
> These features are also useful in writing wonderful debuggers. You
> need a bus snooper or ICE with an 8088 to achieve debugger features
> that can be done in software on the 68020 or 80386. And per-task
> memory protection is a big win in any environment, but impossible in
> software. 

OK, maybe I didn't make myself clear.  I'm not referring to features you find
on many random processors or systems out there, such as memory management units
or "trap on transfer of control/trap on reference to particular location"
breakpoints.  I'm referring to the sort of architectural features you find in
machines that, say, require references to objects to go through a "descriptor"
for the object, to do bounds checking and the like.

One instantiation of the question would be "are you better off doing that or
having a fast machine plus a compiler that can e.g.  generate bounds checking
code but avoid doing it in some cases where it's 'known' that you don't need
it?"  For instance, don't bother doing it in

	double	a[SIZE], b[SIZE], c[SIZE];

	for (i = 0; i < SIZE; i++)
		a[i] = b[i] + c[i];

(or the FORTRAN, or Ada, or COBOL, or... equivalent), since you know "i" will
always be within the bounds of all the arrays.
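
The checks a compiler would otherwise insert (sketched here with an invented
bounds_trap() routine) are provably redundant in this loop, because the loop
condition already guarantees 0 <= i < SIZE for all three arrays:

        for (i = 0; i < SIZE; i++) {
                if (i < 0 || i >= SIZE)   /* redundant given the loop test */
                        bounds_trap();
                a[i] = b[i] + c[i];
        }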

Not "do you think you'd be better off" or "do you have an analysis that makes
it 'intuitively obvious' that you'd be better off", but "has anybody compared
actual implementations of the two approaches, and concluded that, with
everything taken into account, you're better off with one or the other?"  I.e.,
taking into account the fact that microcode and hardware is like software in
that it is possible for it to be buggy (either designed incorrectly or
implemented incorrectly), and taking into account the time it takes to design
that system, and....

henry@utzoo.uucp (Henry Spencer) (07/22/88)

In article <1404@daisy.UUCP> david@daisy.UUCP (David Schachter) writes:
>In article <1988Jul18.231158.19500@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>>The dynamic-compilation implementations on
>>the Blit et al run the memory at nearly full speed for the usual cases
>>of BitBlt; it is fundamentally impossible to do better...
>
>I disagree.  It is not necessarily true that good software blit code can
>run as fast as hardware supported blit operations...

On the Blit hardware?  I doubt it.  In any case, if the implementation
uses the full memory bandwidth (a more precise form of my earlier statement),
then it *is* fundamentally impossible to do better.  The ways around this
generally involve structuring the hardware in such a way that the hardware
has access to memory bandwidth that cannot be used by the software.  Other
things being equal (which they often aren't, of course), one may question
whether this is really such a smart idea.

>...  What about a CPU with a 16 bit bus competing with a blit
>chip with 64 bit access to the frame buffer?

One wonders whether it might be a better investment to put in a wider CPU,
which should speed up *everything*, not just BitBlt.

>What about a CPU that must
>fetch instructions on the same bus used for data?

What about it?  Most modern CPUs have ways around this on at least a small
scale.  The Blit CPU is a lousy old 68000, not even a 68010; by using fun
instructions like move-multiple, one can build code (for the usual BitBlt
cases) that needs instruction fetches relatively rarely.

I agree that it's possible to break the hardware in such a way that a
software BitBlt is inherently slow.  I prefer unbroken hardware myself.
-- 
Anyone who buys Wisconsin cheese is|  Henry Spencer at U of Toronto Zoology
a traitor to mankind.  --Pournelle |uunet!mnetor!utzoo! henry @zoo.toronto.edu

rcd@ico.ISC.COM (Dick Dunn) (07/22/88)

> > >You can save an even greater factor by building the BitBlt into the
> > >hardware....
> > No.  You can save a large factor over *poor* software implementations of
> > BitBlt...
> However, a hardware BitBlt using DMA can be operating while the CPU is
> servicing other processes, with no slowdown in CPU speed...

First, that's only true if either the bitblt is being done to *and* from
memory not accessed by the CPU, or the memory bandwidth is high enough to
handle both the CPU and the bitblt DMA cycles.  I'd guess that's rare; we
seem to be afflicted with memory-bound CPUs in a lot of systems.

But more importantly, in many cases you can't do anything else because
either (a) there's only one process running or (b) you can't afford the
context-switch time.  Unless you get a bitblt operation that takes longer
than *two* context switches (out and back), it costs more.  For the case
where you're bitblting characters (probably the most common case), you will
lose big.

> If dynamic-compilation is that much faster, then you would have expected
> someone to have written a program that does uses this technique on the Amiga,
> rather than using the blitter.  So far, I haven't seen any...

Ummm...but what does this prove?  Not much, I think, until someone actually
tries it and finds out whether it's slower.  Given that the Amiga already
has a useful bitblt (of whatever form), there's no great incentive to write
another.
-- 
Dick Dunn      UUCP: {ncar,nbires}!ico!rcd           (303)449-2870
   ...Are you making this up as you go along?

colwell@mfci.UUCP (Robert Colwell) (07/22/88)

In article <60782@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes:
>> Second, it seems like only yesterday when we (the royal we) CPU
>> architects were so concerned with trying to narrow the semantic gap
>> between what a programmer was trying to express and what the
>> machine/language was coercing her into.  Languages like Ada and
>> machine architectures like capability machines were intended to
>> address this perceived need.
>
>A naive (and not rhetorical) question: what evidence is there to indicate the
>degree to which "narrowing the semantic gap" with capability machines and the
>like would improve the productivity of programmers or the reliability of
>programs, and to which other techniques (e.g., making a very fast conventional
>machine, possibly but not necessarily using RISC technology, and using that
>speed to improve the programming environment with better software) achieve the
>same goal?

As far as I know, there is no evidence that you would necessarily
find compelling, but then I could say that same thing about almost
everything else in this field, too.  There are, on the other hand,
some good reasons to believe that we can do better than imperative
languages running on essentially unprotected architectures.

I know I can't do this topic justice in this forum, but here's my
quick take on the subject.  Think about the different computer
languages you have used, and what was good or bad about each.  Backus
argued (in his famous paper on functional languages) that one of the
reasons that Lisp is a good language is that you don't have to
mentally execute a program (as you do with imperative languages) in
order to convince yourself that it will do what is wanted.  The
language allows you to more closely express what is desired, so you
test correctness by inspection.  And yes, there are cases where it's
easier or more obvious to express something in C than in Lisp.  But both
examples serve to illustrate the point -- you, as a programmer, have
some virtual entity you want to realize (data structure or
algorithm), and the closer you can get to realizing exactly what you
have in mind, the more likely the code is to be correct,
maintainable, and understandable (and the smaller the semantic gap
between what you want and what you can express).

That's partly an allegory.  The capability people argue that the same
thing extends into all areas of computer systems.  Recall the classic
arguments about turning off runtime bounds checking to reclaim that
lost performance -- why should a programmer, in the best of all
possible worlds, have to worry about things like that?  It doesn't
help any of the major productivity metrics.

Say what you want about Ada on the 432 (and I've said plenty
already), as a programming environment (forget the speed problems
temporarily) it was really nice.  Your program had access to only the
data structures you specified, and only in the ways you specified,
and if your code screwed up, the machine could tell you where and in
what way.  To me, the hardest bugs to find are those where you fix
one bug (in someone else's code, of course) and inadvertently break
it in some obscure way such that something completely unrelated is
getting trashed.  For this to even be possible (in my opinion) means
that the programming environment is lacking something crucial,
notwithstanding that about 95% of all programmers on this planet see
just such an environment.  The environment is failing to express what
the programmer wanted, and it's a combined failure of the machine
architecture, the language, and probably the OS too.  The semantic
gap argument says that the more the desired and achieved environments
differ, the larger the quantity of bugs, and the worse their
obscurity will be.

I know you're tempted at this point to say that even if one grants
this as a shortcoming of current architectures/environments, there
have been no successes so far at addressing it.  That's another topic
of conversation, I think; all I'm trying to do here is offer some of
the justification for why a lot of research has gone into other ways
to compute (that don't have sheer instrs-executed-per-unit-time as
their primary figure of merit).

All the strong-type-checking stuff built into some languages has the
same motivation as above.

For me, the bottom line is that, as usual, there aren't any easy
answers (because if there were somebody would've patented them by
now), but we shouldn't lose track of the corresponding questions just
on that basis.  The problem is getting worse, after all -- more, not
fewer, lives currently depend on the correctness of software than
ever before, and that trend will continue (fly-by-computer airplanes,
dynamically-unstable planes, military systems, ...).

I assume that the reason people like you and me are concentrating on
performance is that that's what sells.  I don't think that needs
defending, but I also see few reasons to believe that sheer
performance alone will provide the cures to the problems I tried to
outline above.

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090

henry@utzoo.uucp (Henry Spencer) (07/22/88)

In article <302@laic.UUCP> darin@laic.UUCP (Darin Johnson) writes:
>... a hardware BitBlt using DMA can be operating while the CPU is
>servicing other processes, with no slowdown in CPU speed...

Uh, using what for memory bandwidth?  A good implementation of BitBlt,
whether in software *or* hardware, will run the memory flat out, leaving
nothing for "servicing other processes".  (Caches help a lot on instruction
fetches, but data has this annoying tendency to require real memory fetches.)
If your CPU is sitting there waiting for memory, it might as well be doing
the BitBlt itself.

>...the Amiga blitter can do
>logical operations (on three inputs) just as fast as a normal memory
>copy - It'd be pretty hard to hack some 680x0 code to XOR three inputs
>and write to three outputs as fast as it can move memory.

True, but who cares?  The overwhelmingly common cases of BitBlt are doing
an operation like writing a character, where the software overhead of
deciding what to do totally dominates doing it, and bulk movement of bits,
where a straight copy operation can run memory-bound on a well-built box
with any good implementation.

>Also, the
>overhead is very low compared to the overhead involved in dynamic compilation.

See previous paragraph.  The Blit code does not do dynamic compilation for
the small cases where overhead is dominant anyway (when you're drawing a
character, you are only moving a few words of data, and figuring out which
words to move is much harder than actually doing it, regardless of whether
the actual doing is hardware or software), only for the big ones that can
amortize the overhead well.  BTW, the overhead is smaller than you think.
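
For concreteness, here is a rough C sketch of the amortization argument.
This is NOT the Blit's actual run-time code generation -- just a stand-in,
with made-up names, that only bothers selecting a pre-built unrolled loop
when the blit is big enough to pay back the extra setup:

	typedef void (*blt_fn)(unsigned short *dst, unsigned short *src, int words);

	static void blt_general(unsigned short *dst, unsigned short *src, int words)
	{
		while (words-- > 0)		/* per-word loop overhead; fine for tiny blits */
			*dst++ = *src++;
	}

	static void blt_unrolled(unsigned short *dst, unsigned short *src, int words)
	{
		for (; words >= 8; words -= 8) {	/* overhead amortized over 8 moves */
			dst[0] = src[0]; dst[1] = src[1]; dst[2] = src[2]; dst[3] = src[3];
			dst[4] = src[4]; dst[5] = src[5]; dst[6] = src[6]; dst[7] = src[7];
			dst += 8; src += 8;
		}
		while (words-- > 0)		/* leftover words */
			*dst++ = *src++;
	}

	void blt_copy(unsigned short *dst, unsigned short *src, int words)
	{
		blt_fn f = (words < 16) ? blt_general : blt_unrolled;
		f(dst, src, words);	/* small case: don't bother; big case: amortize */
	}

The real thing generates straight-line code tailored to the particular
alignment and width, but the cost structure is the same.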

>If dynamic-compilation is that much faster, then you would have expected
>someone to have written a program that uses this technique on the Amiga,

Yes.  Hence my harsh words about the competence of the people involved.
Note that doing a really good software BitBlt is A LOT OF WORK -- it's not
the sort of thing an amateur knocks off in an evening.  The techniques are
not that well known (especially in the micro community, which is notorious
for its ignorance of the state of the art -- the people at Commodore most
likely had no idea it was even possible to do a fast software BitBlt).

Note that some of the earlier Suns had quite a bit of hardware BitBlt
assist, and the newer ones don't.  Sun learned.  Commodore will, someday.
-- 
Anyone who buys Wisconsin cheese is|  Henry Spencer at U of Toronto Zoology
a traitor to mankind.  --Pournelle |uunet!mnetor!utzoo! henry @zoo.toronto.edu

colwell@mfci.UUCP (Robert Colwell) (07/22/88)

In article <60859@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes:
>> >A naive (and not rhetorical) question: what evidence is there to indicate
>> >the degree to which "narrowing the semantic gap" with capability machines
>> >and the like would improve the productivity of programmers or the
>> >reliability of programs, and to which other techniques [fast machines, good
>> >software] achieve the same goal?
> ...
>Not "do you think you'd be better off" or "do you have an analysis that makes
>it 'intuitively obvious' that you'd be better off", but "has anybody compared
>actual implementations of the two approaches, and concluded that, with
>everything taken into account, you're better off with one or the other?"  I.e.,
>taking into account the fact that microcode and hardware is like software in
>that it is possible for it to be buggy (either designed incorrectly or
>implemented incorrectly), and taking into account the time it takes to design
>that system, and....

You are asking for way too much, Guy.  Computer system designers
can't even agree on performance metrics, and performance is probably
the easiest thing to measure about a computer system (as compared to,
say, reliability or programmer productivity).

Even the units in which programmer productivity is usually measured
are suspect -- lines-of-debugged-code per person per unit time????

My intuition is that the practical answer is obvious -- if you have a
machine that is 2x the performance of mine, the prospect for my
environment being more "productive" than yours (without even being
able to quantize the difference) isn't going to help me sell very
many machines.  Since this is a user-perception problem, technical
answers won't necessarily help, either.

But my intuition is also that, for the same reasons that we
instinctively reach for the best tools for the job at hand (C,
assembly, Lisp, Ada, etc.), we would do considerably better as
programmers if the "thing" (including architecture, language, & OS)
executing our code matched our expectations as closely as possible,
and with few surprises.  And further, if it made sure we knew the
nature of whatever surprises occurred, rather than (a C example)
having an array-pointer-out-of-bounds happen to trash a scalar value
that was coincidentally declared nearby.  I don't see higher
performance as a very good means of preventing errors like that.
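
A trivial C illustration of the kind of failure I mean; whether the stray
store actually lands in the scalar depends entirely on how the compiler
happens to lay the locals out, which is exactly the problem:

	void example(void)
	{
		int array[4];
		int scalar = 42;

		array[4] = 7;	/* one element past the end: on many implementations
				   this quietly clobbers 'scalar', and neither the
				   language nor the machine says a word about it */
	}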

I think the reason you won't find many academic or formal studies
attempting to quantify the benefits of object-orientation (or other
higher-productivity programming environments) is that such studies
would be fraught with difficulty and expense, and would be unlikely
to yield completely unambiguous results.  Hell, there are still
people who think assembly language programming is better; how would
you prove to them they're wrong?  (If they smoke, too, you'll know
you're in big trouble if all you've got is a rational argument on
your side.)


Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090

elg@killer.DALLAS.TX.US (Eric Green) (07/23/88)

in article <1988Jul18.231158.19500@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) says:
> In article <1090@ficc.UUCP> peter@ficc.UUCP (Peter da Silva) writes:
>>You can save an even greater factor by building the BitBlt into the
>>hardware....
> No.  You can save a large factor over *poor* software implementations of
> BitBlt.  Not over good ones.  The dynamic-compilation implementations on
> the Blit et al run the memory at nearly full speed for the usual cases
> of BitBlt; it is fundamentally impossible to do better.  Commodore et al
> find hardware implementations attractive because they don't know how to
> write good software.

I won't comment on Commodore's software guys (I'm typing this on an
Amiga, so, depending on when I last found a bug in the OS, I'm either
in complete agreement or violent disagreement), but there's one big
advantage of a bitblt chip over even a tight assembly-language
memory-move loop (using DBcc, no less): instruction fetch.

On newer processors such as the 68020 and 68030, instruction fetch
during a loop is no problem, since such a small tight loop is almost
certain to be in the cache. But on older processors such as the 68000
or 68010, you would burn up 3/4ths of your bus bandwidth just fetching
instructions! (no cache).

Also note that even when fetching instructions out of the cache, the
68020 still takes some time to execute them (presumably the 68030's
pipelining will keep instruction fetch and execution overlapped). The
hardware blitter, of course, never has instruction fetches at all...

Given the technology of 1983-1984, a hardware blit chip made a lot of
sense, then. Now.... well, the 68030 costs a bundle, so why pay more
for a blit chip that can't go any faster anyhow?

--
Eric Lee Green    ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg
          Snail Mail P.O. Box 92191 Lafayette, LA 70509              
   PC Pursuit, n., a service specializing in mangling your postings,
     by eating characters so you can't see what you're typing.

elg@killer.UUCP (07/23/88)

in article <7439@ico.ISC.COM>, rcd@ico.ISC.COM (Dick Dunn) says:
> 
>> > >You can save an even greater factor by building the BitBlt into the
>> > >hardware....
>> > No.  You can save a large factor over *poor* software implementations of
>> > BitBlt...
>> However, a hardware BitBlt using DMA can be operating while the CPU is
>> servicing other processes, with no slowdown in CPU speed...
> First, that's only true if either the bitblt is being done to *and* from
> memory not accessed by the CPU, or the memory bandwidth is high enough to
> handle both the CPU and the bitblt DMA cycles.  

That's exactly what the Amiga does. There is 512K of RAM dedicated to
screen operations, while the OS and (with extended RAM) user programs
run out of RAM or ROM on a separate bus inaccessible to the
blitter. Thus, while the blitter runs, user programs are free to run
without bus contention. In addition, the blitter takes advantage of
the fact that the 68000 only uses half the available memory bandwidth,
and even when the 68000 is accessing "chip" RAM (blitter-accessible
RAM), the blitter can still run at half speed until the 68000 goes
somewhere else. All in all, a very useful and elegant design, given
its age ('83-'84 or thereabouts) and the low cost of implementation
($650 for a complete machine, with disk drive, whereas 68020s and
68030s would cost almost the same as the complete A-500 after
production costs and two levels of markups).

> context-switch time.  Unless you get a bitblt operation that takes longer
> than *two* context switches (out and back), it costs more.  For the case
> where you're bitblting characters (probably the most common case), you will
> lose big.

That's one of the major flaws that they found in the Amiga's handling
of text i/o... Charlie Heath of Microsmiths replaced the bitblt with
plain old memory moves, and it is almost twice as fast.

However, I still would not want to use a plain old 68000 machine which
had to use a software loop to scroll a 32K bitmap.... I've seen Macs
and ST's, which have much simpler hardware and software which should
be faster than the Amiga's message-passing OS (which requires a
context switch to send the text to the console.device process , then
another context switch to get back to the user process)... and the
speed of window management and scrolling simply cannot compare. That
hardware makes more difference than you'd think, given all the other
factors involved. 

--
Eric Lee Green    ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg
          Snail Mail P.O. Box 92191 Lafayette, LA 70509              
PC Pursuit: A diabolical conspiracy to cause spelling flames upon the
net, by rendering the typer inable to read what he has written.

henry@utzoo.uucp (Henry Spencer) (07/24/88)

In article <473@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>... The capability people argue that the same
>thing extends into all areas of computer systems.  Recall the classic
>arguments about turning off runtime bounds checking to reclaim that
>lost performance -- why should a programmer, in the best of all
>possible worlds, have to worry about things like that?  ...

My counterargument is that it is almost as much of an imposition on the
programmer to have such checks done at runtime as it is to have them not
done at all.  Given the impossibility of complete testing, there is no
way to guarantee such tests will catch a bug during development.  This
means that the programmer has to go through the same complicated exercise
of trying to assure himself that the problem will never happen.  What is
really wanted is a way to check these properties at COMPILE TIME... in
which case the runtime checks become largely superfluous anyway.

Can it be done?  Well, in one sense the answer is clearly yes, because a
proof of program correctness has to include it, and we know that automating
such proofs is possible (although seldom practical at the moment).  The
question is whether it can be done with practical, affordable machinery
without crippling the language.  My own long-held conjecture is that the
answer is "yes", but I don't have proof of that yet.
-- 
Anyone who buys Wisconsin cheese is|  Henry Spencer at U of Toronto Zoology
a traitor to mankind.  --Pournelle |uunet!mnetor!utzoo! henry @zoo.toronto.edu

henry@utzoo.uucp (Henry Spencer) (07/24/88)

In article <572@tuck.nott-cs.UUCP> anw@maths.nott.ac.uk (Dr A. N. Walker) writes:
>...self-initialising random number generator.  First time through, it
>initialised itself, either from the clock or from the user-supplied seed,
>then pulled the real code in on top of itself...

Let us not forget, also, that self-modifying code wasn't uncommon in
bootstraps, back in the days when ROMs were expensive or unavailable.
The disk bootstrap for a late-model PDP-8 was two instructions, so
brief that nobody ever bothered putting it in ROM, and it was about
as self-modifying as you could get:  power-up-reset happened to leave
the disk controller in just the right state for reading block 0 into
memory starting at 0, so the bootstrap -- toggled in at a specific place
in low core -- was "kick the disk controller; branch to self".  It
got overwritten as block 0 came in, and of course block 0 was carefully
crafted to catch control as the "branch to self" was overwritten!
-- 
Anyone who buys Wisconsin cheese is|  Henry Spencer at U of Toronto Zoology
a traitor to mankind.  --Pournelle |uunet!mnetor!utzoo! henry @zoo.toronto.edu

colwell@mfci.UUCP (Robert Colwell) (07/24/88)

In article <1988Jul23.221743.22169@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <473@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>>... The capability people argue that the same
>>thing extends into all areas of computer systems.  Recall the classic
>>arguments about turning off runtime bounds checking to reclaim that
>>lost performance -- why should a programmer, in the best of all
>>possible worlds, have to worry about things like that?  ...
>
>My counterargument is that it is almost as much of an imposition on the
>programmer to have such checks done at runtime as it is to have them not
>done at all.  Given the impossibility of complete testing, there is no
>way to guarantee such tests will catch a bug during development.  This
>means that the programmer has to go through the same complicated exercise
>of trying to assure himself that the problem will never happen.  What is
>really wanted is a way to check these properties at COMPILE TIME... in
>which case the runtime checks become largely superfluous anyway.

We probably agree that the compiler should check as much as it
possibly can; the earlier an error is detected, the less the
ambiguity about what is wrong and the cheaper the cure.  But there is
an awful lot the compiler can't know about -- programs that use input
data, generate pointers, or do just about anything else that's
interesting, are going to do many operations that are difficult for
the compiler to validate at compile time.

I'm not sure I catch your drift on the imposition of runtime checks.
To me, the best human debuggers have the ability to figure out the
fastest which assumptions are being made that are wrong, thus getting
to the heart of some problem the quickest.  If I have a program bug,
I want the machine to express as directly as possible the intent of
my program so that I can narrow the bug-space down to my code alone.
If the machine allows a change to one variable (an array, say) to
affect some other unrelated variable, then it is not conforming to
the intent of my program.  In fact, it is not conforming to anything
useful at all, since nobody would argue that it is useful programming
practice to ever do such a thing on purpose (I hope?).  Given that,
the fact that the machine can do such a thing, let alone do it as a
default, is what I'd label a shortcoming in the basic architecture.
One we've all long ago learned to live with, to be sure, but it's not
something to be proud of, at very least.

>Can it be done?  Well, in one sense the answer is clearly yes, because a
>proof of program correctness has to include it, and we know that automating
>such proofs is possible (although seldom practical at the moment).  The
>question is whether it can be done with practical, affordable machinery
>without crippling the language.  My own long-held conjecture is that the
>answer is "yes", but I don't have proof of that yet.

I sure hope you're right.  In fact, if progress towards this goal
becomes evident, I'd propose we start discussing ways of turning
architecture towards better ways of supporting this instead of
throughput or programming environments.


Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090

elg@killer.DALLAS.TX.US (Eric Green) (07/25/88)

in article <1988Jul22.164129.5495@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) says:
> In article <302@laic.UUCP> darin@laic.UUCP (Darin Johnson) writes:
>>... a hardware BitBlt using DMA can be operating while the CPU is
>>servicing other processes, with no slowdown in CPU speed...
> 
> Uh, using what for memory bandwidth?  A good implementation of BitBlt,
> whether in software *or* hardware, will run the memory flat out, leaving
> nothing for "servicing other processes".  

But the 68000 (uncached) version would use 3/4ths of that bandwidth 
to fetch INSTRUCTIONS. Also note that the bandwidth of modern 100ns 
memories is much greater than an 8MHz 68000 can take advantage of...
even when the 68000 is on the bus 100% of the time that it can be, you
still have half that memory bandwidth left for blitter operations etc.
(a fact that the designers of the ST use to do their video refresh).

Given the cost constraints, and the state of technology in 1984, it
made a lot of sense to have a hardware blitter. The fact that the Blit
terminal did fast blits without a blitter probably has more to do with
the fact that a) the machine was designed before modern high-speed
DRAMs, when 250ns RAMs were fast, and b) the machine had a much higher
resolution screen than the Amiga, meaning that much more bus bandwidth
was taken up with video refresh (thereby improving the desirability of
caching the loop, since you couldn't do data accesses any faster
anyhow).

> If your CPU is sitting there waiting for memory, it might as well be doing
> the BitBlt itself.

At least with the 68000 and modern DRAMs, the memory spends most of
its time waiting for the CPU!

> deciding what to do totally dominates doing it, and bulk movement of bits,
> where a straight copy operation can run memory-bound on a well-built box
> with any good implementation.

Memory-bound? Yes, if you're using a 68020 or 68030 with a tight loop
that'll fit into their internal cache. But the 68000 would spend half
of that bandwidth simply fetching instructions, not moving bits....


> the sort of thing an amateur knocks off in an evening.  The techniques are
> not that well known (especially in the micro community, which is notorious
> for its ignorance of the state of the art -- the people at Commodore most
> likely had no idea it was even possible to do a fast software BitBlt).

True, the majority of the micro world thinks multitasking is a neat
idea, message passing is what secretaries do, etc. But this is the first
time I've ever heard the designers of the Amiga accused of that (the Amiga,
BTW, was NOT designed by Commodore). Considering that they implemented such
nifty state-of-the-art ideas as a message-passing multi-tasking OS, etc. (only
to ruin it all by commissioning a cruddy filesystem), and considering
that they all had Suns on their desktops when they were designing it,
it sounds like you're being harsh for no good reason, given the
design constraints (512K RAM, under-$1500 price, etc.).

> Note that some of the earlier Suns had quite a bit of hardware BitBlt
> assist, and the newer ones don't.  Sun learned.  Commodore will, someday.

Earlier Suns were based upon 68000/68010 technology, where hardware
bitblts make a lot of sense. Later Suns, based upon 68020s, don't
need the hardware assist (see the beginning of this message). 

Unfortunately, Commodore probably will retain their blitter even when
they move into 68020/68030 machines, for backward-compatibility
reasons (there's a helluva lot of programs out there, mostly games,
directly accessing that blitter).  BTW, games programmers are known for
their dedication to speed... and most of the games programmers that I
know have made extensive use of the blitter whenever it made sense
(when moving large amounts of data, when setup time is no longer the
dominating factor). 

--
Eric Lee Green    ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg
          Snail Mail P.O. Box 92191 Lafayette, LA 70509              
PC Pursuit: Or, How to Eat the Typist's Echo in Three Words or Less.

ge@hobbit.sci.kun.nl (Ge' Weijers) (07/25/88)

From article <1087@ficc.UUCP>, by peter@ficc.UUCP (Peter da Silva):
> I have a question:
> 
> 	Why are an Icache plus a Dcache better than just
> 	a big shared cache as big as both?

The answer is bandwidth. The CPU does not have to stop filling the instruction
pipeline when it accesses/writes data.
-- 
Ge' Weijers, Informatics dept., Nijmegen University, the Netherlands
UUCP: {uunet!,}mcvax!kunivv1!hobbit!ge

peter@ficc.UUCP (Peter da Silva) (07/25/88)

In article <7439@ico.ISC.COM>, rcd@ico.ISC.COM (Dick Dunn) writes:
      Me:
> > > >You can save an even greater factor by building the BitBlt into the
> > > >hardware....
> > > No.  You can save a large factor over *poor* software implementations of
> > > BitBlt...
  Me:
> > However, a hardware BitBlt using DMA can be operating while the CPU is
> > servicing other processes, with no slowdown in CPU speed...

> First, that's only true if either the bitblt is being done to *and* from
> memory not accessed by the CPU, or the memory bandwidth is high enough to
> handle both the CPU and the bitblt DMA cycles.  I'd guess that's rare; we
> seem to be afflicted with memory-bound CPUs in a lot of systems.

Well, for the Amiga the memory bandwidth is high enough, *and* the blitter
only affects display memory. Memory requests from the CPU to memory outside
the display memory area (the first 512K, as it turns out) go over another bus.

> But more importantly, in many cases you can't do anything else because
> either (a) there's only one process running or (b) you can't afford the
> context-switch time.

The Amiga context switch time is on the order of a procedure call... the
Amiga Exec is a realtime operating system... all processes are "lightweight"
in UNIX jargon.

> Unless you get a bitblt operation that takes longer
> than *two* context switches (out and back), it costs more.  For the case
> where you're bitblting characters (probably the most common case), you will
> lose big.

When scrolling a window you just blit the whole thing instead of doing it
a character at a time, and you win big (80 by 24 by 8 by 8 bits == 15,360
bytes). As for text, see below.

> [dynamic compilation]
> Ummm...but what does this prove?  Not much, I think, until someone actually
> tries it and finds out whether it's slower.  Given that the Amiga already
> has a useful bitblt (of whatever form), there's no great incentive to write
> another.

Well, there are already programs out there that SetFunction the Text() routine
to speed up text drawing of fixed 8-by-8 characters (the common case, as you
say), and there is apparently lots of dynamically compiled BitBlt code out
there waiting to be nabbed.
-- 
Peter da Silva  `-_-'  Ferranti International Controls Corporation.
"Have you hugged  U  your wolf today?" (uunet,tness1)!sugar!ficc!peter.

guy@gorodish.Sun.COM (Guy Harris) (07/26/88)

> We probably agree that the compiler should check as much as it
> possibly can; the earlier an error is detected, the less the
> ambiguity about what is wrong and the cheaper the cure.  But there is
> an awful lot the compiler can't know about -- programs that use input
> data, generate pointers, or do just about anything else that's
> interesting, are going to do many operations that are difficult for
> the compiler to validate at compile time.

In the case of programs that use input data, said programs should validate the
input data before using it (unless they have good reason to know the data will
*never* be incorrect); I would prefer to see

	input, line 12: value "12038" is out of range (should be between
	  17 and 137)

than to see

	ILLEGAL ARRAY REFERENCE: "frobozz.q", line 31
		Subscript was 12038; should be between 17 and 137

as the former, at least, tells me that the ultimate problem is in the data, not
the code.  Could a compiler know to remove run-time subscript checks in:

	i = read();
	if (i < 17 || i > 137)
		fail("input, line %d: value %d is out of range (should be between 17 and 137)",
		    lineno, i);	/* "lineno" assumed to track the current input line */
	printf("input value for %d is %d\n", i, array[i]);

If so, it should probably issue warnings when it *isn't* able to remove
run-time subscript checks, to inform programmers that they should probably
put those checks in themselves.

> >Can it be done?  Well, in one sense the answer is clearly yes, because a
> >proof of program correctness has to include it, and we know that automating
> >such proofs is possible (although seldom practical at the moment).  The
> >question is whether it can be done with practical, affordable machinery
> >without crippling the language.  My own long-held conjecture is that the
> >answer is "yes", but I don't have proof of that yet.
> 
> I sure hope you're right.  In fact, if progress towards this goal
> becomes evident, I'd propose we start discussing ways of turning
> architecture towards better ways of supporting this instead of
> throughput or programming environments.

Yes, but does this need "architectural" support, at least in the sense of
"instruction set architecture"?  If a compiler for a "conventional" machine can
do that level of validation, subscript-range checking features in the
instruction set would be seldom, if ever, used.

If "architecture" includes the compiler, then I agree that "architectural"
support for this would be nice.

guy@gorodish.Sun.COM (Guy Harris) (07/26/88)

> You are asking for way too much, Guy.  Computer system designers
> can't even agree on performance metrics, and performance is probably
> the easiest thing to measure about a computer system (as compared to,
> say, reliability or programmer productivity).

Well, I don't necessarily want numbers complete with reliable error bars; I'd
be interested just to hear anecdotal stories such as "it took this team 10
times as long to do this project under XYZZY's Ada on a PC as it did for some
other team to do it on FFGGF's Ada on the FFGGF
Ada-Interpreter-In-Microcode(TM) processor", as long as the two teams were of
comparable quality.

> But my intuition is also that, for the same reasons that we
> instinctively reach for the best tools for the job at hand (C,
> assembly, Lisp, Ada, etc.), we would do considerably better as
> programmers if the "thing" (including architecture, language, & OS)
> executing our code matched our expectations as closely as possible,
> and with few surprises.

Yes, but is there any indication of whether we'd do better by changing the
architecture (meaning the instruction-set architecture) or by changing the
compiler, holding other things pretty much equal (e.g., the language)?

> And further, if it made sure we knew the nature of whatever surprises
> occurred, rather than (a C example) having an array-pointer-out-of-bounds
> happen to trash a scalar value that was coincidentally declared nearby.
> I don't see higher performance as a very good means of preventing errors
> like that.

If the performance is enough greater, you could spend some of the performance
gain by inserting instructions to do run-time array bounds checks.  If the
compiler is good enough to optimize away most of those checks, you may spend
less than you might expect to.
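
A small C sketch of the distinction (get_index() is a made-up stand-in for
"some value the compiler cannot see"):

	extern int get_index(void);	/* hypothetical: value unknown until run time */

	int sum(void)
	{
		static int a[138];
		int total = 0, i, j;

		for (i = 17; i <= 137; i++)	/* bounds provable at compile time, so a
						   run-time subscript check is removable */
			total += a[i];

		j = get_index();		/* not provable: the check, whether
						   compiler-generated or written by the
						   programmer, has to stay */
		return total + a[j];
	}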

smryan@garth.UUCP (Steven Ryan) (07/26/88)

>If the machine allows a change to one variable (an array, say) to
>affect some other unrelated variable, then it is not conforming to
>the intent of my program.  In fact, it is not conforming to anything
>useful at all, since nobody would argue that it is useful programming
>practice to ever do such a thing on purpose (I hope?).

This particular example is done all the time in handling recursive data
structures.

>>Can it be done?  Well, in one sense the answer is clearly yes, because a
>>proof of program correctness has to include it, and we know that automating
>>such proofs is possible (although seldom practical at the moment).

Depends on whether all possible programs (including those of monkey
programmers) or just `practical' programs are considered. Formal proofs
of all possible programs are impossible, flat out, no appeal (the halting
problem guarantees that). Formal proofs of practical programs depend on how
powerful `practical' has to be to remain useful.

david@sun.uucp (David DiGiacomo) (07/26/88)

In article <1988Jul22.164129.5495@utzoo.uucp> henry@utzoo.uucp (Henry
Spencer) writes:
>Note that some of the earlier Suns had quite a bit of hardware BitBlt
>assist, and the newer ones don't.  Sun learned.  Commodore will, someday.

This is just my opinion, but I think what Sun learned is that once you
stuck all that other gunk on the CPU board there was no room left for
anything beyond a minimal frame buffer.

The fastest Sun frame buffers currently available are the "CG3" and "CG5"
VME color boards, which have extensive "hardware BitBlt assist".  Of
course, it's all datapath (the notorious rasterop chips) and no sequencing.
The CPU does all the address generation.

P.S. An Amiga-like autonomous "blitter" would be pretty useless in a
virtual memory workstation unless you gave it its own MMU.

-- 
David DiGiacomo, Sun Microsystems, Mt. View, CA  sun!david david@sun.com

peter@ficc.UUCP (Peter da Silva) (07/26/88)

I suppose that if *all* your memory is on the same bus as your display
drivers, then this statement is true:

In article ... henry@utzoo.uucp (Henry Spencer) writes:
> I agree that it's possible to break the hardware in such a way that a
> software BitBlt is inherently slow.  I prefer unbroken hardware myself.

On the other hand, at a given speed a 68030 can only go so fast. If you
can let it do other things while another processor is doing the display,
then why not? Then the statement becomes:

  "I agree that it's possible to break the hardware in such a way that a
   hardware BitBlt is inherently slow.  I prefer unbroken hardware myself."

I have always been uncomfortable with the idea of having your main CPU
throwing all those display bits around anyway... hell, given the price
of the 68030 it might pay Sun to stick an extra one in there to unload
*all* of the display management from the main CPU.
-- 
Peter da Silva  `-_-'  Ferranti International Controls Corporation.
"Have you hugged  U  your wolf today? when it was pointed out to them by

levy@ttrdc.UUCP (Daniel R. Levy) (07/26/88)

In article <1988Jul23.221743.22169@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes:
> -- 
> Anyone who buys Wisconsin cheese is|  Henry Spencer at U of Toronto Zoology
> a traitor to mankind.  --Pournelle |uunet!mnetor!utzoo! henry @zoo.toronto.edu

First it was NASA, now it's Wisconsin cheese.  What gripe do you (and Pournelle
for that matter) have against Wisconsin cheese?  (The imagination boggles.)
-- 
|------------Dan Levy------------|  THE OPINIONS EXPRESSED HEREIN ARE MINE ONLY
|    AT&T  Data Systems Group    |  AND ARE NOT TO BE IMPUTED TO AT&T.
|        Skokie, Illinois        | 
|-----Path:  att!ttbcad!levy-----|

jesup@cbmvax.UUCP (Randell Jesup) (07/26/88)

In article <4912@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green) writes:
>in article <1988Jul22.164129.5495@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) says:
>> Uh, using what for memory bandwidth?  A good implementation of BitBlt,
>> whether in software *or* hardware, will run the memory flat out, leaving
>> nothing for "servicing other processes".  
>
>But the 68000 (uncached) version would use 3/4ths of that bandwidth 
>to fetch INSTRUCTIONS. Also note that the bandwidth of modern 100ns 
>memories is much greater than an 8MHz 68000 can take advantage of...
>even when the 68000 is on the bus 100% of the time that it can be, you
>still have half that memory bandwidth left for blitter operations etc.
>(a fact that the designers of the ST use to do their video refresh).

	Also, remember that there is more than one memory bus out
there, and even if cycles are tight in video ("chip") memory, the processor
can run flat out from ROM or expansion RAM.

>> If your CPU is sitting there waiting for memory, it might as well be doing
>> the BitBlt itself.

	Don't forget, the 68000 doesn't have the BFEXTU instruction, so
the inner loop of a BitBlit isn't as tight as on a 68020.

>> the sort of thing an amateur knocks off in an evening.  The techniques are
>> not that well known (especially in the micro community, which is notorious
>> for its ignorance of the state of the art -- the people at Commodore most
>> likely had no idea it was even possible to do a fast software BitBlt).

	Excuse me, I have a paper in front of me published in EUUG last fall
by a researcher at AT&T on optimized software BitBlits in C on a 68020.  If
you want to insult people, please do it to the XYZZY-clone makers and the people
who write software for them.

>> Note that some of the earlier Suns had quite a bit of hardware BitBlt
>> assist, and the newer ones don't.  Sun learned.  Commodore will, someday.
>
>Earlier Suns were based upon 68000/68010 techynology, where hardware
>bitblt's make a lot of sense. Later Sun's, based upon 68020s, don't
>need the hardware assist (see the beginning of this message). 

	Hardware BitBlit is still useful in a lightweight process architecture
where you have divided memory spaces (and thus much higher effective
bandwidth).

>Unfortunately, Commodore probably will retain their blitter even when
>they move into 68020/68030 machines, for backward-compatability
>reasons (there's a helluva lot of programs out there, mostly games,
>directly accessing that blitter).BTW, games programmers are known for
>their dedication to speed... and most of the games programmers that I
>know have made extensive use of the blitter whenever it made sense
>(when moving large amounts of data, when setup time is no longer the
>dominating factor). 

	Actually, a number of the assembler hacker-types who write games for
the amiga ignore the blitter, because if you bypass the system software it's
a pain to set up, or because in their precise setup they can do things faster
because of various restrictions.  The blitter is still great for general
purpose stuff.

	Other nice things done by our blitter are things like line draws and
area fills.

	Fast software implementations of BitBlit tend to be 2 sources, 1
destination blits (therefore 16 operations).  If you expand them to 3 sources,
1 destination (256 operations), they tend to take up a LOT of code/ROM
space, or they slow down a lot, or both.  3 sources are VERY useful for
animation, since it lets you do "cookie cutter" blits, with an object that
is masked onto the destination, without destroying the areas in the rectangle
but outside the mask.
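
For the curious, the combining step of such a cookie-cutter blit looks
roughly like this in C (this is only the per-word logic, not the blitter's
actual programming model, and the shifting needed for bit alignment is
left out):

	/* D = (A AND B) OR (D AND NOT B): the source shows through wherever the
	   mask is set; the old destination survives everywhere else. */
	void cookie_cut(unsigned short *dst, const unsigned short *src,
			const unsigned short *mask, int words)
	{
		int i;

		for (i = 0; i < words; i++)
			dst[i] = (src[i] & mask[i]) | (dst[i] & ~mask[i]);
	}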

-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

henry@utzoo.uucp (Henry Spencer) (07/26/88)

In article <4894@killer.DALLAS.TX.US> elg@killer.UUCP writes:
>... In addition, the blitter takes advantage of
>the fact that the 68000 only uses half the available memory bandwidth...

I.e., that the memory system is too fast for the processor to use fully.
Can you say "imbalanced system"?

>However, I still would not want to use a plain old 68000 machine which
>had to use a software loop to scroll a 32K bitmap.... I've seen Macs
>and ST's... the speed of window management and scrolling simply cannot 
>compare...

Have you tried a Blit (aka 5620, aka 630)?  That's a demonstration of
what *can* be done with a plain old 68000 using software BitBlt.  I said
Commodore didn't know how to write a fast software implementation; I
didn't mean to imply that Atari or Apple did better!  One *can* do faster
BitBlt than the Blit does, given enough added hardware, but I really doubt
the cost-effectiveness.  (Remembering, again, that the alternative is not
to leave out the custom goodies, but to use that investment of money and
hardware effort to make the CPU run faster.)
-- 
MSDOS is not dead, it just     |     Henry Spencer at U of Toronto Zoology
smells that way.               | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu

henry@utzoo.uucp (Henry Spencer) (07/26/88)

In article <4912@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green) writes:
>> ... A good implementation of BitBlt,
>> whether in software *or* hardware, will run the memory flat out, leaving
>> nothing for "servicing other processes".  
>
>But the 68000 (uncached) version would use 3/4ths of that bandwidth 
>to fetch INSTRUCTIONS...

Sigh.  Please think before answering.  An artfully-built MOVEM loop does
*not* spend 3/4 of its time fetching instructions.
 
>... the bandwidth of modern 100ns 
>memories is much greather than a 8mhz 68000 can take advantage of...

Suggesting that one should either buy a faster processor or spend less
money on the memory.  There are sillier things than putting 100ns RAMs
on an 8MHz 68000, yes, but I'd have to think for a moment to come up
with specific examples.

>Given the cost constraints, and the state of technology in 1984, it
>made a lot of sense to have a hardware blitter. The fact that the Blit
>terminal did fast blits without a blitter probably has more to do with
>the fact that a) the machine was designed before modern high-speed
>DRAMs, when 250ns RAMs were fast...

The 630, designed rather more recently, does things the same way.  So
do the monochrome Suns.

> and b) the machine had a much higher
>resolution screen than the Amiga, meaning that much more bus bandwidth
>was taken up with video refresh (thereby improving the desirability of
>caching the loop, since you couldn't do data accesses any faster
>anyhow).

I don't follow this.  There is no cache in the Blit.

>> Note that some of the earlier Suns had quite a bit of hardware BitBlt
>> assist, and the newer ones don't.  Sun learned...
>
>Earlier Suns were based upon 68000/68010 techynology, where hardware
>bitblt's make a lot of sense. Later Sun's, based upon 68020s, don't
>need the hardware assist (see the beginning of this message). 

No, they dumped the hardware assist *before* they switched to the 020,
not after, unless I'm much mistaken.  Most of the Sun-2s (68010) had
no hardware assist.

More generally, I will repeat -- more explicitly -- something I pointed
out before:  the fair comparison is not to the same system without BitBlt
hardware, but to a system where the same effort and funding (custom chips
are *not* cheap to design) have been invested in making the main CPU faster
instead.
-- 
MSDOS is not dead, it just     |     Henry Spencer at U of Toronto Zoology
smells that way.               | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu

colwell@mfci.UUCP (Robert Colwell) (07/26/88)

In article <61251@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes:
>In the case of programs that use input data, said programs should validate the
>input data before using it (unless they have good reason to know the data will
>*never* be incorrect); I would prefer to see
>
>	input, line 12: value "12038" is out of range (should be between
>	  17 and 137)
>
>than to see
>
>	ILLEGAL ARRAY REFERENCE: "frobozz.q", line 31
>		Subscript was 12038; should be between 17 and 137
>
>Yes, but does this need "architectural" support, at least in the sense of
>"instruction set architecture"?  If a compiler for a "conventional" machine can
>do that level of validation, subscript-range checking features in the
>instruction set would be seldom, if ever, used.
>
>If "architecture" includes the compiler, then I agree that "architectural"
>support for this would be nice.

But the whole point of capability machines (to name a single example)
is that one cannot cover the space of all interesting
exception-possibilities using only a compiler, no matter how smart.
For one thing, the program could be coercing data types back and
forth such that checking an operation on a type can only be done at
the time the operation is applied.
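
A short C illustration of that kind of coercion; once the type has been
laundered through a generic pointer, only a check made when the operation is
applied (run-time code, or a tagged/capability machine) can catch the misuse:

	int coerce_example(void)
	{
		double d = 3.14;
		void *any = &d;		/* type information discarded here        */
		int *ip = (int *) any;	/* compiles without complaint             */
		return *ip;		/* the bogus "int" read can only be caught
					   at the moment the operation is applied */
	}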

But a more fundamental matter is how one manages the development of a
large software project with dozens of programmers contributing to a
single large end product.  The modules are all separately compiled,
so there is no question of the compiler helping out much.

Given access to all the symbol tables, you could imagine the linker
doing some reasonable checks of consistency (number and types of
args, for instance), but even that fails when pointers are being
passed (pointers to functions, even).

You can catch a lot of the common cases with good programming style,
as you note above.  But you can't catch them all, and the question
that capability machines pose is "how close can we come to an
airtight programming environment, and how much would it cost"?
(simplistic paraphrase, I know; maybe it'll draw some capability
people out of their woodwork abstraction!)

And how about functional languages?  Again, the compiler can only do
so much, and the space it covers is not intuitively obvious to the
programmer.  If it doesn't entirely cover the bug-space, and you
aren't sure what it *does* cover, then the coverage is much less
useful (and may not be useful at all, as Henry Spencer was alluding
to in an earlier post.)

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090

colwell@mfci.UUCP (Robert Colwell) (07/26/88)

In article <1152@ficc.UUCP> peter@ficc.UUCP (Peter da Silva) writes:
>I suppose that if *all* your emmory is on the same bus as your display
>drivers, then this statement is true:
>
>In article ... henry@utzoo.uucp (Henry Spencer) writes:
>> I agree that it's possible to break the hardware in such a way that a
>> software BitBlt is inherently slow.  I prefer unbroken hardware myself.
>
>On the other hand, at a given speed a 68030 can only go so fast. If you
>can let it do other things while another processor is doing the display,
>then why not? Then the statement becomes:
>
>  "I agree that it's possible to break the hardware in such a way that a
>   hardware BitBlt is inherently slow.  I prefer unbroken hardware myself."
>
>I have always been uncomfortable with the idea of having your main CPU
>throwing all those display bits around anyway... hell, given the price
>of the 68030 it might pay Sun to stick an extra one in there to unload
>*all* of the display management from the main CPU.

My blit knowledge is now dated, but when I designed a color version
of the Perq workstation (you remember them, sure you do) I had to
keep the same basic architecture that the Perq already had -- CPU and
display shared main memory.  To get the bandwidth to what I needed I
had to make the display fetch 256 bits at a time from main memory.

Now never mind whether it's a good idea to make CPU and display share
a common memory.  If you *have* 256 bits/fetch, and your rasterop can
handle that, I think it made good sense to provide rasterop hardware.
I couldn't see making the CPU data paths that wide just for pushing
display bits.

I think it comes down to what Henry already said -- if your blit is
already saturating memory bandwidth, you don't need hardware (hard to
believe that the CPU is going to get anything useful done if no
memory bandwidth is available, so the co-processor rationale is a
weak one when memory is shared).  But at least in my case, memory
bandwidth could not possibly be saturated by just the CPU.

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090

daveh@cbmvax.UUCP (Dave Haynie) (07/27/88)

in article <1988Jul26.024039.28579@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) says:
> The 630, designed rather more recently, does things the same way.  So
> do the monochrome Suns.

All the monochrome Sun-2s around here have pretty piss-poor graphics
performance compared to what the Amiga does with its blitter and
some well-written code.  Not to imply that all of the Amiga's video
routines are as fast as they could be; they certainly aren't.  And the
Sun-2s are even running faster than the Amiga.  Certainly you may be
able to get better display performance out of a Sun-3, but with a fast
68020, you'd be pretty hurtin' if you couldn't.  And still, at least
without UNIX software (I did put the "well-written" qualifier in), our
text scrolling is several times that of the Sun-3's I've seen. 

> No, they dumped the hardware assist *before* they switched to the 020,
> not after, unless I'm much mistaken.  Most of the Sun-2s (68010) had
> no hardware assist.

And as I pointed out, it shows.

> MSDOS is not dead, it just     |     Henry Spencer at U of Toronto Zoology
> smells that way.               | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu
-- 
Dave Haynie  "The 32 Bit Guy"     Commodore-Amiga  "The Crew That Never Rests"
   {ihnp4|uunet|rutgers}!cbmvax!daveh      PLINK: D-DAVE H     BIX: hazy
		"I can't relax, 'cause I'm a Boinger!"

guy@gorodish.Sun.COM (Guy Harris) (07/27/88)

> >Yes, but does this need "architectural" support, at least in the sense of
> >"instruction set architecture"?  If a compiler for a "conventional" machine
> >can do that level of validation, subscript-range checking features in the
> >instruction set would be seldom, if ever, used.
> >
> >If "architecture" includes the compiler, then I agree that "architectural"
> >support for this would be nice.
> 
> But the whole point of capability machines (to name a single example)
> is that one cannot cover the space of all interesting
> exception-possibilities using only a compiler, no matter how smart.
> For one thing, the program could be coercing data types back and
> forth such that checking an operation on a type can only be done at
> the time the operation is applied.

You missed the point.  As I said in an earlier posting, you can spend some
increased speed by putting in code to do the appropriate checks on a
"conventional" machine.  If the compiler can eliminate most of those checks,
so much the better; you spend less of the increased speed.

I don't see that, for example, doing these checks in microcode, as opposed to
regular code, is *ipso facto* the only way to do things.  In some ways, that's
what I've been asking: what reason is there to believe that capability
machines and the like are the *only* way, or even just the *best* way, to
achieve the kind of programming environment I suspect most, if not all, of us
would like to see?

> But a more fundamental matter is how one manages the development of a
> large software project with dozens of programmers contributing to a
> single large end product.  The modules are all separately compiled,
> so there is no question of the compiler helping out much.
> 
> Given access to all the symbol tables, you could imagine the linker
> doing some reasonable checks of consistency (number and types of
> args, for instance), but even that fails when pointers are being
> passed (pointers to functions, even).

I'm not so sure of that.  I don't see how (assuming a "reasonable" language and
"reasonable" linkers) doing the check at link time is intrinsically any
different from doing it at compile time, if (as per the assumptions listed) the
linker has all the information about types that was available to the compiler.
I also don't see that pointers are that bad a problem, assuming pointers are
typed (again, assuming a "reasonable" language); in ANSI C, for instance,
"pointer to 'int'-valued function taking an 'int' argument" is a different type
from "pointer to 'int'-valued function taking a 'float' argument".

> You can catch a lot of the common cases with good programming style,
> as you note above.  But you can't catch them all, and the question
> that capability machines pose is "how close can we come to an
> airtight programming environment, and how much would it cost"?
> (simplistic paraphrase, I know; maybe it'll draw some capability
> people out of their woodwork abstraction!)

I agree that you may not be able to catch all examples of incorrect code.
However, if you can avoid doing any checking in those cases where you *can*
catch them at compile time, you may be better off.  The further questions that
capability machines pose are:

	1) How much "safer" are they than "conventional" machines plus
	   compilers that generate run-time checks?

and

	2) If this extra safety is expensive, is it worth the cost?

(Remember, the sort of embedded systems to which you've referred in the past do
have real-time constraints, so there is presumably some minimum performance
required; as the jobs they do get more sophisticated, the minimum performance
required increases.)

smryan@garth.UUCP (Steven Ryan) (07/27/88)

In article <1075@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes:
>>If the machine allows a change to one variable (an array, say) to
>>affect some other unrelated variable, then it is not conforming to
>>the intent of my program.  In fact, it is not conforming to anything
>>useful at all, since nobody would argue that it is useful programming
>>practice to ever do such a thing on purpose (I hope?).
>
>This particular example is done all the time in handling recursive data
>structures.

What do I mean by that? [I do not know how to reply by private mail--student
driver.]

A simple example is reversing a singly linked list and, just to make it
interesting, inserting the list ordinal:

p := nil; q := first_element; n := 1;
while q isnt nil
 do
  r := q.link;     -- r and q.link are variables pointing at the location.
  q.link := p;     -- p and q.link are now pointing at the same location.
  p := q;          -- now p and q.
  p.ordinal := n;  -- as a sideeffect, also modifies q.ordinal.
  q := r;          -- now r and q.
  n +:= 1
 od
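
The same loop in C, for anyone who doesn't read Algol 68 (the node type is
assumed); the comments mark where two names refer to the same cell:

	#include <stddef.h>

	struct node { struct node *link; int ordinal; };

	struct node *reverse_with_ordinals(struct node *first_element)
	{
		struct node *p = NULL, *q = first_element, *r;
		int n = 1;

		while (q != NULL) {
			r = q->link;	/* r and q->link name the same cell      */
			q->link = p;	/* now p and q->link do                  */
			p = q;		/* now p and q point at the same node... */
			p->ordinal = n;	/* ...so this also modifies q->ordinal   */
			q = r;
			n++;
		}
		return p;		/* head of the reversed list */
	}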

elg@killer.DALLAS.TX.US (Eric Green) (07/27/88)

In message <1988Jul26.024039.28579@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) says:
>In article <4912@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green) writes:
>>> ... A good implementation of BitBlt,
>>> whether in software *or* hardware, will run the memory flat out, leaving
>>> nothing for "servicing other processes".  
>>... the bandwidth of modern 100ns 
>>memories is much greater than an 8MHz 68000 can take advantage of...
>
>Suggesting that one should either buy a faster processor or spend less
>money on the memory.  There are sillier things than putting 100ns RAMs
>on an 8MHz 68000, yes, but I'd have to think for a moment to come up
>with specific examples.

The slowest 256kbit DRAM chips that I have ever seen are 150ns. I
believe that the Amiga uses the 120ns parts. The difference in price
is on the order of cents. 

>More generally, I will repeat -- more explicitly -- something I pointed
>out before:  the fair comparison is not to the same system without BitBlt
>hardware, but to a system where the same effort and funding (custom chips
>are *not* cheap to design) have been invested in making the main CPU faster
>instead.

The original design goal, as I understand it, was to produce a
low-cost consumer machine with high-performance graphics. Which
basically means using an off-the-shelf microprocessor with adequate
development tools (the Amiga OS was cross-compiled on Sun workstations
using the Greenhills compiler). The machine was based around 256kbit
DRAM's, which means that the memory bandwidth is going to be greater
than can be used up by an 8mhz 68000. To go faster, they would have
needed a faster 68000. Which in 1985 would have probably added $200 to
the cost of the machine (after going through two levels of markups).
That would have made it miss the $1500 price criterion (versus the
Blit, which, I understand, had 1 megabyte of memory and a price tag of
$10,000).

--
Eric Lee Green    ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg
          Snail Mail P.O. Box 92191 Lafayette, LA 70509              
       MISFORTUNE, n. The kind of fortune that never misses.

ge@hobbit.sci.kun.nl (Ge' Weijers) (07/27/88)

From article <1848@van-bc.UUCP>, by sl@van-bc.UUCP (pri=-10 Stuart Lynne):
[About self-modifying code]
> A simple definition might be any program that cannot be compiled and run
> in shared text mode (or equivalent for non Unix application environments).

What about programs that write code into the data segment and then 
call it:

int myproc[1000];		/* buffer in the data segment */

/* ... code that stores machine instructions into myproc ... */

(*((int (*)())myproc))();	/* cast the buffer to a function and call it */

We need a better definition, I'm afraid.
-- 
Ge' Weijers, Informatics dept., Nijmegen University, the Netherlands
UUCP: {uunet!,}mcvax!kunivv1!hobbit!ge

ge@hobbit.sci.kun.nl (Ge' Weijers) (07/27/88)

From article <1988Jul26.022555.28494@utzoo.uucp>, by henry@utzoo.uucp (Henry Spencer):
) In article <4894@killer.DALLAS.TX.US> elg@killer.UUCP writes:
)>However, I still would not want to use a plain old 68000 machine which
)>had to use a software loop to scroll a 32K bitmap.... I've seen Macs
)>and ST's... the speed of window management and scrolling simply cannot 
)>compare...
) 
) Have you tried a Blit (aka 5620, aka 630)?  That's a demonstration of
) what *can* be done with a plain old 68000 using software BitBlt.  I said
) Commodore didn't know how to write a fast software implementation; I
) didn't mean to imply that Atari or Apple did better!  One *can* do faster
) BitBlt than the Blit does, given enough added hardware, but I really doubt
) the cost-effectiveness.

If you ever get a chance to see the programmers' editor Tempus for the
Atari ST: it scrolls *FAST*, a factor of 10 better than most programs.
In fact it is so fast that the speed is almost useless.
The authors stated in an interview that they'd rather not have to
do the programming again. The advantage of the Amiga is that most
reasonably well written programs have fast scrolling without having to
reinvent the wheel.


-- 
Ge' Weijers, Informatics dept., Nijmegen University, the Netherlands
UUCP: {uunet!,}mcvax!kunivv1!hobbit!ge

henk@ace.nl (Henk Hesselink) (07/27/88)

In article <1152@ficc.UUCP> peter@ficc.UUCP (Peter da Silva) writes:
        
> I have always been uncomfortable with the idea of having your main CPU
> throwing all those display bits around anyway... hell, given the price
> of the 68030 it might pay Sun to stick an extra one in there to unload
> *all* of the display management from the main CPU.

The latest Sony "News" workstation has two '030s for exactly that
reason, and proves that it does pay.  A very nice machine indeed.

Henk Hesselink

darin@nova.laic.uucp (Darin Johnson) (07/28/88)

In article <1988Jul26.022555.28494@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes:
> Have you tried a Blit (aka 5620, aka 630)?  That's a demonstration of
> what *can* be done with a plain old 68000 using software BitBlt.
> One *can* do faster
> BitBlt than the Blit does, given enough added hardware, but I really doubt
> the cost-effectiveness.

As I recall, the Blit (or 5620, ATT told me not to call it a blit), had
very nice speed.  However, it cost quite a lot of money ($8000?) so I wouldn't
really call it that cost effective (most of the price was probably
video and cpu though).

Darin Johnson (...pyramid.arpa!leadsv!laic!darin)
              (...ucbvax!sun!sunncal!leadsv!laic!darin)
	"All aboard the DOOMED express!"

baum@Apple.COM (Allen J. Baum) (07/28/88)

[]
>In article <61459@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes:
>I don't see that, for example, doing these checks in microcode, as opposed to
>regular code, is *ipso facto* the only way to do things....
 Presumably, hardware (done in parallel with fetches) is the way to go
>... the questions that capability machines pose are:
>
>	1) How much "safer" are they than "conventional" machines plus
>	   compilers that generate run-time checks?
>
>and
>
>	2) If this extra safety is expensive, is it worth the cost?

In some sense, the cost is very little, probably not much more than your 
standard segmentation scheme. The hardware cost is usually a tag bit, which
is fast, pretty cheap on chip, but possibly expensive in your memory system.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum		(408)973-3385

colwell@mfci.UUCP (Robert Colwell) (07/28/88)

In article <61459@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes:
>> >Yes, but does this need "architectural" support, at least in the sense of
>> >"instruction set architecture"?  If a compiler for a "conventional" machine
>> >can do that level of validation, subscript-range checking features in the
>> >instruction set would be seldom, if ever, used.
>> >
>> >If "architecture" includes the compiler, then I agree that "architectural"
>> >support for this would be nice.
>> 
>> But the whole point of capability machines (to name a single example)
>> is that one cannot cover the space of all interesting
>> exception-possibilities using only a compiler, no matter how smart.
>> For one thing, the program could be coercing data types back and
>> forth such that checking an operation on a type can only be done at
>> the time the operation is applied.
>
>You missed the point.  As I said in an earlier posting, you can spend some
>increased speed by putting in code to do the appropriate checks on a
>"conventional" machine.  If the compiler can eliminate most of those checks,
>so much the better; you spend less of the increased speed.

Actually, I didn't miss the point, I ignored it because I wanted to
deal with something I consider more important.  Yes you can put in
more checks.  And if you aren't going to let me get anything better
than that for a programming environment, then I'll take it as an
improvement over the C/Unixes that currently exist.

But I have a lot of doubts about this approach.  To me it's like
observing that an earth dam has 300 holes, and proposing to fix them
one-by-one, each hole needing some different fix from the one before.
The whole point of capabilities (at least as the 432 implemented
them) was to provide a seamless programming environment, with exactly
the same abstraction (that of an "object") presented to the
programmer, no matter in which direction she looked (left or right
through the code, or down to the hardware).  You can't get this by
throwing a large number of bandaids at the problem, even if you say
they don't cost much in performance.

>I don't see that, for example, doing these checks in microcode, as opposed to
>regular code, is *ipso facto* the only way to do things.  In some ways, that's
>what I've been asking: what reason is there to believe that capability
>machines and the like are the *only* way, or even just the *best* way, to
>achieve the kind of programming environment I suspect most, if not all, of us
>would like to see?

Actually, I didn't say it was the only way.  In fact, all I've been
trying to argue is that there are reasons to believe that there are
things worth considering from that kind of research and that type of
programming environment.  Sometimes I feel like all we do is debate
the race cars, when it's the production cars that represent all the
money.

[discussion about specific checks elided; this message is too long]

>I agree that you may not be able to catch all examples of incorrect code.
>However, if you can avoid doing any checking in those cases where you *can*
>catch them at compile time, you may be better off.  The further questions that
>capability machines pose are:

You won't catch me disagreeing with this, even for capability
machines.  In fact, that was a major point of our article in the 13th
Computer Arch Symp (I think it was 13 -- the one in Tokyo).

>	1) How much "safer" are they than "conventional" machines plus
>	   compilers that generate run-time checks?
>
>and
>
>	2) If this extra safety is expensive, is it worth the cost?

Absolutely.  I tried to quantify the cost of 432-style object
orientation in my thesis.  In a nutshell, I concluded that that style
of machine would be from 1 to 4 times slower than a conventional
unprotected architecture made of equivalent implementation
technology.  (1 times slower means same speed).  That's an order of
magnitude faster than the 432 really was, but that doesn't count (see
next issue of ACM TOCS to see why).

You could do their transparent multiprocessing and buy that factor of
4 back, but if you got above 6 or 7 you'd saturate the bus
(memory-to-memory architecture).  Things get hard to sort out after
that.

Is that performance hit worth it?  Who knows?  I'd say it's probably
worth it a lot more often than most people currently think.  An awful
lot of code got written and debugged on machines that were a lot slower
than the could-have-been 432.  You have to allow for a natural
tendency (which I think largely fuels this whole industry) to always
want more and to refuse to ever go backwards in anything, speed,
memory, disk, sw complexity...

>(Remember, the sort of embedded systems to which you've referred in the past do
>have real-time constraints, so there is presumably some minimum performance
>required; as the jobs they do get more sophisticated, the minimum performance
>required increases.)

Yeah, well, the Space Shuttle doesn't exactly have blazing hardware.
But it's running some of the most sophisticated software I
know of.  And telephone switchers aren't mini-crays by any means.

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090

guy@gorodish.Sun.COM (Guy Harris) (07/28/88)

> But I have a lot of doubts about this approach.  To me it's like
> observing that an earth dam has 300 holes, and proposing to fix them
> one-by-one, each hole needing some different fix from the one before.
> The whole point of capabilities (at least as the 432 implemented
> them) was to provide a seamless programming environment, with exactly
> the same abstraction (that of an "object") presented to the
> programmer, no matter in which direction she looked (left or right
> through the code, or down to the hardware).

What if they look at the microcode, or directly at the hardware (if the
underlying machine isn't implemented using microcode)?  In order for the
environment to be truly airtight, the microcode (or hardware) has to be
airtight.

It may be that producing a sufficiently airtight compiler that implements the
object-oriented model by producing the "right" code, including checks, is
either impossible or more costly than producing a microcode or hardware
implementation atop a compiler that doesn't have to be so airtight because the
microcode or hardware is.  If people's experience points in this direction,
fine.

However, how is this different from, say, implementing the interpreter in some
"conventional" machine language, or in some language compiled into
"conventional" machine language?  If it's not different, could the interpreter
then try compiling the code to be interpreted into that machine language,
preserving the run-time checks?  Could this still be sufficiently airtight?  If
so, would it be faster than, say, a microcode implementation of the same
interpreter with no compilation?  How about a microcode implementation that
compiles either into some faster-to-interpret code, or directly into microcode?

> Actually, I didn't say it was the only way.  In fact, all I've been
> trying to argue is that there are reasons to believe that there are
> things worth considering from that kind of research and that type of
> programming environment.

I'll certainly agree with that; however...

> Sometimes I feel like all we do is debate the race cars, when it's the
> production cars that represent all the money.

I won't agree with that, because

	1) I don't think that's *all* people have been doing.  While there's
	   been relatively little discussion of it in *this* forum, I get the
	   impression - perhaps erroneously - that people are thinking about
	   how to implement the desired kind of programming environment, even
	   if more in the "atop a 'conventional' machine language" form than in
	   the "machine language that does lots of it directly" form; e.g., the
	   SOAR people, people doing LISP implementations on "conventional"
	   machines, etc..

	2) I don't think the analogy is appropriate.  For one thing, most
	   computers are - for better or worse - "conventional", so a better
	   analogy would be "sometimes it feels like all we do is debate the
	   production cars, when we should be concentrating on 'safety' cars",
	   e.g. cars with significantly new features to increase safety.

The second analogy raises some interesting questions:

	1) How much are people willing to spend to have safer cars?  Are they
	   willing to pay X amount of extra money for airbags?  (How much of a
	   slowdown are people willing to accept for some amount of additional
	   run-time checking?)

	2) How much safer does a particular feature make you?  (If 99% of
	   the errors a particular run-time check detects can be caught at
	   compile-time, how much safer are you if you catch the extra 1%?)

	3) How many problems does a feature have that reduce its desirability -
	   or even its contribution to safety?  For example, if there's a risk
	   of airbags detonating spontaneously, what is the risk of this
	   happening and how bad is it if it does?  (What are the chances that
	   the implementation of the environment, and all its checking, has
	   a bug in it, and how badly are you screwed if it does?)

> Absolutely.  I tried to quantify the cost of 432-style object
> orientation in my thesis.  In a nutshell, I concluded that that style
> of machine would be from 1 to 4 times slower than a conventional
> unprotected architecture made of equivalent implementation
> technology.

Just out of curiosity:

How would this figure change with newer implementation technologies for
"conventional" machines (e.g.  more powerful optimizers, different
architectural style (for which you may read "the stuff that RISC implementors
talk about a lot"), etc.?

How would it change with newer implementations for protected architectures
(including some of the same newer technologies mentioned for "conventional"
machines, and also techniques such as compiling "protected architecture" code
into "conventional" code, or doing "quick" checks in hardware and trapping to
software checking code if they fail)?  (For all I know, some of these may not
be "newer" techniques; if so, how did that change the relative performance?)

How easy is it to apply some of those techniques to instruction-set
architectures such as the 432's, as opposed to applying them to, say, RISC
architectures and using other techniques to implement a "protected" system atop
such a "conventional" architecture?

> You could do their transparent multiprocessing and buy that factor of
> 4 back, but if you got above 6 or 7 you'd saturate the bus
> (memory-to-memory architecture).  Things get hard to sort out after
> that.

Yes, but could you buy an equivalent amount back with multiprocessing on
"conventional" architectures with the protection implemented elsewhere than in
the instruction set, even if the multiprocessing might be less transparent?
Could it be a greater amount if the "conventional" architecture were not so
memory-reference-intensive?

Could you change the "protected" system to make *it* less memory-intensive and
improve *its* multiprocessor performance?  (At this point, one would no longer
expect eyebrows to raise at the appearance of a new register-oriented
machine.... :-))

> Is that performance hit worth it?  Who knows?  I'd say it's probably
> worth it a lot more often than most people currently think.  An awful
> lot of code got written and debugged on machines that were a lot slower
> than the could-have-been 432.  You have to allow for a natural
> tendency (which I think largely fuels this whole industry) to always
> want more and to refuse to ever go backwards in anything, speed,
> memory, disk, sw complexity...

Yes, but there may be an economic justification for that tendency; the issues
may not all be "technical" in the narrow sense.  It may be hard, as you pointed
out, to show improvements in programmer productivity, code reliability, etc.
in a better programming environment, but if the advocates of such environments
want people to spend money on those sorts of improvements rather than straight
performance improvements, they should expect those who would pay the bills to
be more swayed by that kind of evidence.

(Another problem is that it may not even be cost-effective to adopt a
technology that's clearly "better", technically; I suspect a new motor fuel
would have to be quite a bit better before people were willing to write off the
existing investment in the production and distribution of existing fuels.
There are a lot of "conventional" machines out there.)

> Yeah, well, the Space Shuttle doesn't exactly have blazing hardware.
> But it's running some of the most sophisticated software I
> know of.  And telephone switchers aren't mini-crays by any means.

Yeah, well, I don't know that either of those processors would necessarily have
the requisite horsepower for some of the embedded applications somebody might
want to put computers into in the future.  That's why I said "as the jobs they
do get more sophisticated, the minimum performance required increases;" a 4x
performance hit relative to "conventional" computers may rule out a "protected"
processor for some application until you can either come up with a way of
solving the problem faster with existing hardware or make the hardware faster.

I'm certainly willing to believe that there are applications now that could use
a "protected" processor, but I'm not sure I'm willing to believe that there
aren't applications that, at present, couldn't use a "protected" processor if
it were 4x slower than a "conventional" one.  In this case, a useful question
to ask might be "can I get the same level of protection with an adequate level
of performance if I don't put all the protection in the instruction set?"

(BTW, I infer from the paper "Engineering a RISC Compiler System", by some
people at MIPS, that their compiler can optimize register usage across
procedure boundaries even when the calling and called procedures were
separately compiled - they keep Ucode around for multiple compilation units, so
that information for both calling and called procedure is available to the
optimizer - so the claim made in "Fast Object-Oriented Procedure Calls:
Lessons from the Intel 432" that "Clearly, the same approach (interprocedural
register allocation) is inapplicable to separately compiled modules." is
false.  It may be true in some cases - for example, if the modules are bound
together at run time - but it is hardly clear that it is always true.  I
believe David Wall has worked on link-time register allocation as well.)

pf@diab.se (Per Fogelstr|m) (07/28/88)

In article <480@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>I think it comes down to what Henry already said -- if your blit is
>already saturating memory bandwidth, you don't need hardware (hard to
>believe that the CPU is going to get anything useful done if no
>memory bandwidth is available, so the co-processor rationale is a
>weak one when memory is shared).  But at least in my case, memory
>bandwidth could not possibly be saturated by just the CPU.
>
I disagree!  If specialized "hardware" can do a BitBlit at least twice
as fast (and probably even faster) than the CPU, you will always have 50%
or more CPU time left over compared to the CPU doing the job itself.  Note that
a well-designed Blit can saturate the memory bandwidth with no problem.
The design I'm working with right now blits 80-90 Mpixels (depth unlimited)
per second with a 16-bit bus and 100ns dynamic RAMs.  This means it can scroll
a 1024 x 1024 screen in 10ms, less than one display time, and it can move
windows on the screen in real time as the Intel GPC does with hardware
windows.  Do that with the CPU.  Even with external hardware the CPU will not
be able to generate the addresses fast enough.  (You have to generate a new
address every 50-100ns.)

colwell@mfci.UUCP (Robert Colwell) (07/28/88)

In article <61781@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes:
[Guy, how come you always delete the previous correspondent's name?
Are you trying to save me embarrassment?]

[lots of interesting questions and points to ponder removed...I think
this forum has run out of gas on a topic this complex, so in true
Usenet style I'm going to pick nits instead.  I'll try to get back to 
your 432 questions when I get a chance.]

>I'm certainly willing to believe that there are applications now that could use
>a "protected" processor, but I'm not sure I'm willing to believe that there
>aren't applications that, at present, couldn't use a "protected" processor if
>it were 4x slower than a "conventional" one.  In this case, a useful question
>to ask might be "can I get the same level of protection with an adequate level
>of performance if I don't put all the protection in the instruction set?"

If you think of the 432 as merely a "protected processor" you've not
fully grasped what they created there.  It was an *environment*, of
which the central processor was just a component.  You, as a
programmer, did not see the 432 programming environment as a
processor running with a certain amount of memory, a set of
registers, some available OS calls, a certain type of virtual memory,
etc.  You the programmer were given a new universe in which all the
pieces obeyed the same rules, and the rules were simple and few and
not derived from hardware or language artifacts (by and large,
anyway).  I do not believe you can get this environment piecemeal.
(But you can get closer to it than we currently are, and we probably
agree on both the feasibility and desirability of this.)

>(BTW, I infer from the paper "Engineering a RISC Compiler System", by some
>people at MIPS, that their compiler can optimize register usage across
>procedure boundaries even when the calling and called procedures were
>separately compiled - they keep Ucode around for multiple compilation units, so
>that information for both calling and called procedure is available to the
>optimizer - so the claim made in "Fast Object-Oriented Procedure Calls:
>Lessons from the Intel 432" that "Clearly, the same approach (interprocedural
>register allocation) is inapplicable to separately compiled modules." is
>false.  It may be true in some cases - for example, if the modules are bound
>together at run time - but it is hardly clear that it is always true.  I
>believe David Wall has worked on link-time register allocation as well.)

We were talking about procedure calls made within an object-based
programming environment, and I don't think the claim is false.  In
such an environment you do not want to tie your parameter-passing
abstraction, with its expected runtime type-checking, to whatever
register convention happens to be the current fashion.  We didn't say
you couldn't do it -- we meant you don't want to.

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090

henry@utzoo.uucp (Henry Spencer) (07/29/88)

In article <1152@ficc.UUCP> peter@ficc.UUCP (Peter da Silva) writes:
>On the other hand, at a given speed a 68030 can only go so fast. If you
>can let it do other things while another processor is doing the display,
>then why not? ... hell, given the price
>of the 68030 it might pay Sun to stick an extra one in there to unload
>*all* of the display management from the main CPU.

And when it's not doing display management, of course, it can run user
programs.  Congratulations!  You have just re-invented the multiprocessor
system.  The Wheel of Reincarnation strikes again.  The more smarts you
put in your display processor, the more it resembles a somewhat-crippled
main processor.  You get a much more useful system if you break down and
admit that you're building a multiprocessor machine, and make all the
CPUs general-purpose.
-- 
MSDOS is not dead, it just     |     Henry Spencer at U of Toronto Zoology
smells that way.               | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu

henry@utzoo.uucp (Henry Spencer) (07/29/88)

In article <4324@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>	Don't forget, the 68000 doesn't have the BFEXTU instruction, so
>the inner loop of a BitBlit isn't as tight as on a 68020.

The inner loops of normal large-area BitBlts do not usually do bitfield 
extractions.  Look at how the operation is actually used.

>>> ... The techniques are
>>> not that well known (especially in the micro community, which is notorious
>>> for its ignorance of the state of the art -- the people at Commodore most
>>> likely had no idea it was even possible to do a fast software BitBlt).
>
>	Excuse me, I have a paper in front of me published in EUUG last fall
>by a researcher at AT&T on optimized software BitBlits in C on an 68020.  If
>you want to insult people please do it to the XYZZY-clone makers and the people
>who write software for them.

I don't consider a 68020 a micro, myself.  The 68020 machine I am writing
this on is much bigger and faster than the minicomputer I was using a
month ago.  When I say "micro", I mean something that competes with PCs.

As for insulting the wrong people, if you're reading papers like that, I
would consider you a (laudable) rare exception to the (deplorable) rule.

>	Fast software implementations of BitBlit tend to be 2 sources, 1
>destination blits (therefor 16 operations).  If you expand them to 3 sources,
>1 destination (256 operations), they tend to take up a LOT of code/ROM
>space, or they slow down a lot, or both...

This is where dynamic compilation, the original subject of this conversation
(remember back then? :-)) wins big:  you don't need to duplicate all that
code N times, just know how to generate it.
-- 
MSDOS is not dead, it just     |     Henry Spencer at U of Toronto Zoology
smells that way.               | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu

henry@utzoo.uucp (Henry Spencer) (07/29/88)

In article <4929@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green) writes:
>>More generally, I will repeat -- more explicitly -- something I pointed
>>out before:  the fair comparison is not to the same system without BitBlt
>>hardware, but to a system where the same effort and funding (custom chips
>>are *not* cheap to design) have been invested in making the main CPU faster
>>instead.
>
>... To go faster, they would have
>needed a faster 68000. Which in 1985 would have probably added $200 to
>the cost of the machine (after going through two levels of markups).

Which, I would guess, is the same order of magnitude as the cost added
by the custom chips, after the same markups.

>Blit, which, I understand, had 1 megabyte of memory and a price tag of
>$10,000).

Wrong twice.  I don't think even the 5620 cost that much, and the biggest
one had half a meg.
-- 
MSDOS is not dead, it just     |     Henry Spencer at U of Toronto Zoology
smells that way.               | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu

wayneck@tekig5.PEN.TEK.COM (Wayne Knapp) (07/29/88)

In article <4333@cbmvax.UUCP>, daveh@cbmvax.UUCP (Dave Haynie) writes:
> in article <1988Jul26.024039.28579@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) says:
> > The 630, designed rather more recently, does things the same way.  So
> > do the monochrome Suns.
> 
> All the monochrome Suns-2s around here have a pretty piss-poor graphic
> preformance as compared to what the Amiga does with it's blitter and
> some well written code.  Not to imply that all of the Amiga's video
> routines are as fast as they could be, they certainly aren't.  And the
> Sun-2s are even running faster than the Amiga.  Certainly you may be
> able to get better display performance out of a Sun-3, but with a fast
> 68020, you'd be pretty hurtin' if you couldn't.  And still, at least
> with out UNIX software (I did put the "well-written" qualifier in), our
> text scrolling is several time that of the Sun-3's I've seen. 
> 

Yes but there is a valid reason.  The Amiga is only (at least the workbench)
updating a 320x200x2bit plane display.  The sun screen is much, much larger,
(maybe 1000 x 800?).   It is a whole lot easier to move around 16k or so
than up to say 100k on the sun.

This is just like comparing the Amiga to the Mac II.  The Amiga isn't that
speedy if you are using HAM 640x400, which is 192k of memory.  Then 
the Mac II also slows down in its 640x400x8 mode which is 256k of memory.
But I would say having a 256 color palette is much nicer than HAM, in my opinion. 

Things like blitters are nice, but I would much rather have a better CPU.
Now how about a 68000 like CPU with say 128 data reg. and 128 address reg..
That is something I really could get into using.  Things like blitters are
a dime a dozen, I want some real power!

                                   Wayne Knapp 


P.S. The Amiga was pretty hot three years ago, but I just think that in 
general it is no longer that wonderful.  Time for a major upgrade? 

karl@sugar.uu.net (Karl Lehenbauer) (07/29/88)

(Those blit terminals were monochrome, weren't they?  Amiga screens have up
to six bitplanes.)

> I.e., that the memory system is too fast for the processor to use fully.
> Can you say "imbalanced system"?

Henry's remarks are flames in the sense that they're kind of insulting when 
they don't really have to be, I think.  Anytime someone uses Mr. Rogers'
"Can you say 'xxx'?", they're being condescending.  I don't think the Amiga 
is imbalanced either.  The extra clock cycles are used for screen, audio 
and disk DMA.  Is that really imbalanced?  Would slower memory have been
better?

Whether or not blitters are worthwhile, the Amiga has capabilities in terms 
of realtime video generation (examples: various Videoscape 3D demos, NewTek 
demo, PD animations, etc) that can't be touched by anything within about 10X 
of its price, and can't be touched by a lot of big, expensive workstations 
because their operating systems aren't realtime.  Also, look at the cost of 
software.  Software for that $5K Targa board for your Mac II can run over $7K, 
fine if you're institutional, but kind of narrow to think that's the Minimum 
Acceptable System.  Should one expect a $550 machine to have a 68020?  (No,
a 32-bit RISC, someday)   My point is that whatever the theory of whether it 
should have a blitter, in practice there is a lot of cool software that does 
amazing things, cheap - a gestalt win.

I've done some programming on the Amiga and I'm very, very impressed with the 
realtime exec, the long names for all the calls, the shared libraries and
so forth.  My other realtime OS work has been with RSX-11M, and M+,
RMX, C Exec, Forth and various homegrown-style ones and so far the Amiga is 
a superset of the others...and it's hardly the work of people who don't know 
how to write software.
-- 
-- 
-- Karl Lehenbauer, karl@sugar.uu.net aka uunet!sugar!karl, +1 713 274 5184

karl@sugar.uu.net (Karl Lehenbauer) (07/29/88)

I think Henry's right that "best of all is a really fast CPU."  I don't
know what the Amiga people themselves could have done to make the 68000 
itself faster, though.  Maybe it should be "best of all is a really
fast CPU and a bunch of other stuff?"  I don't want my Unix system to
run really slow while drawing characters as pixels, either.  I want a 
blit card or something like that.  I guess I want both.

I've done some programming on the Amiga and its system software is very
sophisticated.  It is hardly the work of people who don't know how to
program, as fun as it may be to stir up trouble by saying that it is.
-- 
-- 
-- Karl Lehenbauer, karl@sugar.uu.net aka uunet!sugar!karl, +1 713 274 5184

aglew@urbsdc.Urbana.Gould.COM (07/29/88)

..> Talk about capabilities?

Why has no-one mentioned one of the biggest problems with
capability based systems? The security people who evaluate
and classify secure _systems_ (C2, B1, etc., Gould knows
about them) don't like "OS in hardware" or "OS in microcode"
because then they have to understand the microcode/microarchitecture,
and there's more of it to understand than in a conventional system.
Plus the fact that capability inheritance rules have never
been satisfactorily modeled and shown to be secure (I'm sure
that I'll receive flames about that, eh? :-)

***************************************************************
Attempting to break out of the C vs ASM and HW vs SW BLT morass
***************************************************************

I notice Bob Colwell from Multiflow is a regular reader.
Just read the IEEE Transactions on Computers article on 
Multiflow's TRACE machines. 

    Questions: Exactly how does the store register file
on the floating point subsystem operate?
    You mention that there are only 8 32 bit busses across
the board edge. Exactly what are they? Are there more busses
in the system, but only a subset to any board? Are address
and data busses split? Etc.
    Everybody I know who has a reason to know the details of
the TRACE says that it is loaded down with crossbars. How and
where?


Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   
    aglew@gould.com     	- preferred, if you have MX records
    aglew@xenurus.gould.com     - if you don't
    ...!ihnp4!uiucuxc!ccvaxa!aglew  - paths may still be the only way
   
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

henry@utzoo.uucp (Henry Spencer) (07/30/88)

In article <308@laic.UUCP> darin@nova.laic.uucp (Darin Johnson) writes:
>As I recall, the Blit (or 5620, ATT told me not to call it a blit), had
>very nice speed.  However, it cost quite a lot of money ($8000?) so I wouldn't
>really call it that cost effective ...

The selling price was something like $5k, as I recall, and it *cost* a lot
less than that (as witness the rather lower price of the 630, its replacement,
despite greater functionality with similar technology).
-- 
MSDOS is not dead, it just     |     Henry Spencer at U of Toronto Zoology
smells that way.               | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu

darin@nova.laic.uucp (Darin Johnson) (07/30/88)

In article <3057@tekig5.PEN.TEK.COM>, wayneck@tekig5.PEN.TEK.COM (Wayne Knapp) writes:
> This just like comparing the Amiga to the Mac II.  The Amiga isn't that
> speedy if you are using HAM 640x400, which is 192k of memory.  Then 
> the Mac II also slows down in its 640x400x8 mode which is 256k of memory.
> But I would having a 256 color palette is much nicer than HAM in my opinion. 
> 
> P.S. The Amiga was pretty hot three years ago, but I just think that in 
> general it is no longer that wonderful.  Time for a major upgrade? 

however, it is still the biggest bang for the buck.  It is still the
only computer with NTSC video, without having to pay megabucks.

I could buy a 68030 board (and extra 32 bit ram when the prices go down),
plug it into my Amiga, and still have spent less money than a Mac II
with color - and the Mac still won't have multitasking without spending
more money.  The extra colors and higher resolution on the MacII are
nice, but the Mac II just isn't a personal computer anymore... (sort
of like a workstation, but without the performance)

Darin Johnson (...pyramid.arpa!leadsv!laic!darin)
              (...ucbvax!sun!sunncal!leadsv!laic!darin)
	"All aboard the DOOMED express!"

henry@utzoo.uucp (Henry Spencer) (07/30/88)

In article <477@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>... there is
>an awful lot the compiler can't know about -- programs that use input
>data, generate pointers, or do just about anything else that's
>interesting, are going to do many operations that are difficult for
>the compiler to validate at compile time.

My personal thesis is that the difficulty of doing this in a well-written
program -- I am not interested in poorly-written ones -- is consistently
overestimated.  Remember, the programmer is supposed to be sure that these
errors aren't going to happen, because even if they are caught immediately
they are generally a disaster for production code.  Frequently, the sort
of checks that a programmer must put in to be sure about such things are
the sort of thing that a compiler ought to be able to spot fairly easily,
e.g. validation of input data.
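
(To make that concrete, a small invented C fragment, not from anyone's
posting: the explicit validation of the input is exactly what lets a
compiler, or a human reader, conclude that the later subscript can never be
out of range, so a run-time check on it adds nothing.  NBINS, histogram and
record are made-up names.)

#define NBINS 100

int histogram[NBINS];

int record(int value)
{
    if (value < 0 || value >= NBINS)     /* the programmer's own validation */
        return -1;                       /* bad input rejected up front     */
    histogram[value]++;                  /* provably in range, no check needed */
    return 0;
}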

>I'm not sure I catch your drift on the imposition of runtime checks.
>To me, the best human debuggers have the ability to figure out the
>fastest which assumptions are being made that are wrong, thus getting
>to the heart of some problem the quickest.  If I have a program bug,
>I want the machine to express as directly as possible the intent of
>my program so that I can narrow the bug-space down to my code alone.

We may have a slight misunderstanding here, which I admit I didn't notice
in my original posting.  I'm not claiming that run-time checks aren't
useful during debugging.  However, I claim that run-time checks are not
a particularly good approach to catching problems during production use,
because "subscript error in line 32, core dumped" can be every bit as
bad as trash output in a production environment.  What one really wants
is a *guarantee* that (in the absence of hardware malfunction), it will
not happen.  This cannot be achieved by testing, regardless of how much
help the hardware provides.

>I sure hope you're right.  In fact, if progress towards this goal
>becomes evident, I'd propose we start discussing ways of turning
>architecture towards better ways of supporting this instead of
>throughput or programming environments.

This is actually a programming-environment issue, though, and architecture
is nearly irrelevant.  A much more important issue is the programming
languages being used, and how tractable they are.  C, for example, is not
a very favorable case -- too many problems with pointers.
-- 
MSDOS is not dead, it just     |     Henry Spencer at U of Toronto Zoology
smells that way.               | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu

oconnor@nuke.steinmetz (Dennis M. O'Connor) (07/30/88)

An article by ge@hobbit.sci.kun.nl (Ge' Weijers) says:
] From article <1087@ficc.UUCP>, by peter@ficc.UUCP (Peter da Silva):
] > 	Why are an Icache plus a Dcache better than just
] > 	a big shared cache as big as both?
] 
] The answer is bandwidth. The CPU does not have to stop filling
] the instruction pipeline when it accesses/writes data.
] -- 
] Ge' Weijers, Informatics dept., Nijmegen University, the Netherlands

This is indeed part of the answer. The other part is that
instruction-fetch behavior is much simpler and more predictable
than operand fetch, and that I-caches don't need to be
written-through, allowing an isolated I-cache to be simpler
and more effective than a combined operand/instruction cache.

Specialized I-caches, like a Branch-Target cache combined with pre-fetch,
can produce effective hit rates of 99%+ with only a few thousand
bits of storage. Operand caches can't do anywhere near this well.
--
 Dennis O'Connor   oconnor%sungod@steinmetz.UUCP  ARPA: OCONNORDM@ge-crd.arpa
    "Never confuse USENET with something that matters, like PIZZA."

henry@utzoo.uucp (Henry Spencer) (07/30/88)

In article <479@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>But the whole point of capability machines (to name a single example)
>is that one cannot cover the space of all interesting
>exception-possibilities using only a compiler, no matter how smart.
>For one thing, the program could be coercing data types back and
>forth such that checking an operation on a type can only be done at
>the time the operation is applied.

Yes, it could be -- but how does the *programmer* know it's right?
Remember, programs are not supposed to dump core in production operation.
Remember also Dijkstra's comments on how human minds are actually fairly
simple and find it difficult to reason about complex situations.  There
are three possibilities here:

1. Code is simple enough for human programmer to figure out easily, in
	which case the compiler should be able to do likewise, perhaps
	with a bit of help.

2. Code is wrong, human programmer is (e.g.) not validating his inputs
	properly.  Compiler should reject it.

3. The code really is right but very subtle.  Programmer has to have a
	way to tell the compiler "this is subtle but right".  This should
	not be the default, as it effectively is now.

My conjecture, not proven at the moment, is that only modest intelligence
in the compiler and occasional tidying-up of code for clarity would be
needed to make case 3 fairly rare, given favorable circumstances (it is
not clear that C constitutes a sufficiently favorable circumstance).

>But a more fundamental matter is how one manages the development of a
>large software project with dozens of programmers contributing to a
>single large end product.  The modules are all separately compiled,
>so there is no question of the compiler helping out much.

Please consider modern compiler technology, not 20-year-old versions.
Ways to achieve inter-module checking are a major concern in modern
languages, as witness (e.g.) C++ and Modula 2.  It can be done, and
ought to be done!
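
(A small invented illustration of the same kind of inter-module checking in
C itself, assuming ANSI-style prototypes: keep the prototype in a header
that both the defining module and every caller include, and the compiler can
reject a bad call in any translation unit.  The queue and enqueue names are
made up.)

/* queue.h would hold just this prototype, included by every module */
struct queue { int count; };
extern int enqueue(struct queue *q, const char *item);

/* the defining module */
int enqueue(struct queue *q, const char *item)
{
    (void)item;                          /* details don't matter here */
    return q->count++;
}

/* any calling module that includes the header gets its calls checked */
int log_item(struct queue *q)
{
    return enqueue(q, "hello");          /* OK: matches the prototype      */
    /* enqueue(q, 42);     would be rejected: int is not a char *          */
    /* enqueue(q);         would be rejected: wrong number of arguments    */
}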

>Given access to all the symbol tables, you could imagine the linker
>doing some reasonable checks of consistency (number and types of
>args, for instance), but even that fails when pointers are being
>passed (pointers to functions, even).

If the assumptions being made about those pointers are simple ones, they
should be in the code, for compilers (and humans!) to see.  If they are
complex ones, then in a multi-programmer project they are probably being
violated already.
-- 
MSDOS is not dead, it just     |     Henry Spencer at U of Toronto Zoology
smells that way.               | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu

jesup@cbmvax.UUCP (Randell Jesup) (07/30/88)

In article <3057@tekig5.PEN.TEK.COM> wayneck@tekig5.PEN.TEK.COM (Wayne Knapp) writes:
>Yes but there is a valid reason.  The Amiga is only (at least the workbench)
>updating a 320x200x2bit plane display.  The sun screen is much, much larger,
>(maybe 1000 x 800?).   It is a whole lot easier to move around 16k or so
>than up to say 100k on the sun.

	Wrong: 640x200x2bit or 640x400x2bit (for workbench, more bitplanes
are allowed for "custom" screens.)  Also, when I'm using suntools, I tend to
use 80-column wide windows, 30 or so lines high, 1 bitplane deep.  Overall,
about the same amount of display memory used for the window I'm scrolling
in each case.

>This just like comparing the Amiga to the Mac II.  The Amiga isn't that
>speedy if you are using HAM 640x400, which is 192k of memory.  Then 
>the Mac II also slows down in its 640x400x8 mode which is 256k of memory.
>But I would having a 256 color palette is much nicer than HAM in my opinion. 

	HAM only operates in lores (320x{200,400}) in the current chips.
The biggest amount of memory for a screen/window is 128K (640x400x4bits).
The Mac-II is not a reasonable comparison to the current Amigas (look at
price, when they were designed, processor, etc.)  The interesting thing
is how well, in windowing and screen update/scrolling speed, the 68000 8MHz
16-bit $600 Amiga does compared to the 68020 16MHz 32-bit Mac-II, which costs
several thousand and also uses VRAMs (a win for bus contention) that didn't
even exist when the current Amiga was designed.

>Things like blitters are nice, but I would much rather have a better CPU.
>Now how about a 68000 like CPU with say 128 data reg. and 128 address reg..
>That is somethin I really could get into using.  Things like blitters are
>a dime a dozen, I want some real power!

	Ok, where can we buy this CPU for less than $20?  In fact, where
can one buy this CPU at all?  Warning: task switch time on that will
be atrocious, and instructions will have to double in size or so to hold
those register addresses.

	Leading edge CPU design is an expensive thing compared to building
a blitter.  The last major CPU MOS (Commodore's silicon arm) designed was
the 6502/6510.  They concentrate on other things now, like the custom
chips for the C64 and now the Amiga.

>P.S. The Amiga was pretty hot three years ago, but I just think that in 
>general it is no longer that wonderful.  Time for a major upgrade? 

	The new non-interlaced chips are more or less public knowledge,
though most of the details aren't.

	I can say that we certainly aren't standing still; Gould announced a
few months ago in Germany that we are working on the Amiga 3000, based on the
68030.

-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

aglew@urbsdc.Urbana.Gould.COM (07/30/88)

> Can you say "imbalanced system"?

Getting away from the Blitter wars, what exactly do people mean
by a "balanced system", and why is it so important.

This is a somewhat rhetorical question. And yes, I know about
Amdahl's MIPS/MB/s rule of thumb, and a few others.

I ask this because I have seen several situations where building
unbalanced systems has been an attractive corporate strategy.
Eg. a small company that can't do everything at once, and wants
to maintain a high rate of delivery of new projects: 
    Start off with a really fast memory system, and good, but not 
necessarily the fastest CPUs. Deliver this as your first system. 
    Keep the memory system, but work all out on the CPU. 
Deliver this as your second system - with the advantages that the
memory system is tried and proven, so all that you have to devote
debugging and language effort to is the CPU system.
    Now the memory is tired, so spend time and effort working on
it - but keep the proven CPUs...

And so on. This is an overlapping product development strategy
- eventually, the whole system gets redone, but at any time you
don't have the expense and risk of giving the customer a completely 
new system.

Reactions?



Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   
    aglew@gould.com     	- preferred, if you have MX records
    aglew@xenurus.gould.com     - if you don't
    ...!ihnp4!uiucuxc!ccvaxa!aglew  - paths may still be the only way
   
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

colwell@mfci.UUCP (Robert Colwell) (07/31/88)

>Just read the IEEE Transactions on Computers article on 
>Multiflow's TRACE machines. 
>
>    Questions: Exactly how does the store register file
>on the floating point subsystem operate?

The data that you'd like to have end up in memory, you put into the
SRF first.  When you kick off a memory store, it's an integer opcode
-- the integer section has the DTLB.  This was done so that the
floating section can keep cranking out flops while the integer
section can keep track of loop indices, memory addresses, and other
scalar types of things.  The SRF is a legitimate target of a flop or
an integer op.

>    You mention that there are only 8 32 bit busses across
>the board edge. Exactly what are they? Are address and data busses split?

By "address bus" I assume you mean physical address bus to memory;
those are "dedicated" in the sense that they only carry addresses,
never data.  There are also a set of store buses that carry data to
the memory controllers on stores.  There are eight other buses,
four FL buses and four IL buses.  The "L" means Load, but these buses
are also used to carry certain cluster-to-cluster transfers.
Buses are all tagged as to destination, and self-drain in the event
of stalls, traps, etc.

>Are there more busses
>in the system, but only a subset to any board? 

Yes, the integer boards get the four IL buses, one of the physical
address buses each, and two IF buses (that don't count because
they're carried over very short cables to their floating-point
board-pair).  The floating boards touch all four FL buses and all
four Store buses -- that's where the eight comes from.  By the way,
if it wasn't clear -- in figure 4 of the paper, the "I3,I2,I1,I0"
boxes are the Integer boards, the Fn boxes are the Floating boards,
and the Mn boxes are the memory controllers.

>    Everybody I know who has a reason to know the details of
>the TRACE says that it is loaded down with crossbars. How and
>where?

There are a lot of crossbars, but "loaded down?"  If you have a lot
of general-purpose buses, you need a fair amount of muxing hardware
somewhere in order to take advantage of it.  Once you've paid the
price for having all those buses (bus drivers everywhere you look),
providing the crossbars to maximize the win only seems right.
Besides, in this architecture, you can put those crossbars into the
gate arrays.  The only place we couldn't make them fit was on the
store buses, so there's a bunch of PALs to handle it there (10 or so
per F board, I guess?).  The main places that come to mind are the
inputs to the Register Files (both integer and floating -- they're
built from the same gate array type), and the store file outputs.

Also, don't skim over the point we made in the article too fast about
putting a full crossbar between address generators and memory.  If
you don't do this, the compiler has to know exactly where something is in
memory (or is going to be) in order to schedule the bus to transfer
the datum.  With a crossbar, the compiler only has to know where
something is relative to everything else, an easier task.

Bob Colwell            mfci!colwell@uunet.uucp
Multiflow Computer
175 N. Main St.
Branford, CT 06405     203-488-6090

mike@arizona.edu (Mike Coffin) (07/31/88)

From article <1988Jul29.202400.28068@utzoo.uucp>, by henry@utzoo.uucp (Henry Spencer):
> ...  There are three possibilities here:
> 
> 1. Code is simple enough for human programmer to figure out easily, in
> 	which case the compiler should be able to do likewise, perhaps
> 	with a bit of help.
> 
> 2. Code is wrong, human programmer is (e.g.) not validating his inputs
> 	properly.  Compiler should reject it.
> 
> 3. The code really is right but very subtle.  Programmer has to have a
> 	way to tell the compiler "this is subtle but right".  This should
> 	not be the default, as it effectively is now.

4. Code is simple enough for a human programmer to figure out easily
given "deep" knowledge about what the program is doing, in which case
a compiler has big troubles figuring out almost anything.  I have
worked on some mathematical software --- computational group theory,
if it matters --- where the code itself looks broken unless you know
the right theorem.  The theorem establishes an invariant that makes
the code very easy to understand.

I just thought of a simple example: a piece of code relies on the fact
that an array always contains a permutation of the integers 1:N.  This
invariant is originally established by generating a random permutation
via a very clever algorithm (Floyd's), and then maintained by always
permuting the integers --- sometimes swapping them, other times
rotating several.  A procedure that relies on the fact that the array
is a permutation is going to be extremely opaque to the compiler, yet
might be transparent to a programmer.  (Especially if the array is
called "permutation".)

-- 

Mike Coffin				mike@arizona.edu
Univ. of Ariz. Dept. of Comp. Sci.	{allegra,cmcl2,ihnp4}!arizona!mike
Tucson, AZ  85721			(602)621-4252

elg@killer.DALLAS.TX.US (Eric Green) (07/31/88)

In message <1988Jul28.173620.7325@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) says:
>In article <4929@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green) writes:
>>... To go faster, they would have
>>needed a faster 68000. Which in 1985 would have probably added $200 to
>>the cost of the machine (after going through two levels of markups).
>
>Which, I would guess, is the same order of magnitude as the cost added
>by the custom chips, after the same markups.

An Amiga 500 costs $650 retail. That's with power supply, disk drive,
512K of RAM (which probably accounts for 1/3rd of the cost, by
itself), etc. Somehow, I don't see $200 of that as being blitter,
especially since there's four custom chips in there and the blitter is
only part of one of them.

Commodore has a long record of building inexpensive custom chips,
dating back to the lowly Commodore 64 (which had a custom video chip
and a custom sound chip, along with a gate-array for general "glue"
purposes). Perhaps Sun cannot build a blitter for less than $500, but
that's Sun's problem.

You cannot build an argument against blitters on the basis of cost
(also see the $30 NSC chip). You will have to fall back to the bus
bandwidth argument. On a 68000, which cannot saturate modern memories,
period, having a blitter use the extra bandwidth is smart. On a 68030
with an icache, the CPU can already saturate the bus, so having a
blitter is more of a tradeoff. Unless you have a seperate bus for the
blitter and video memory, a blitter buys you nothing with a 68030 (or
a cached 68020, for that matter).

--
Eric Lee Green    ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg
          Snail Mail P.O. Box 92191 Lafayette, LA 70509              
       MISFORTUNE, n. The kind of fortune that never misses.

karl@sugar.uu.net (Karl Lehenbauer) (08/01/88)

In article <1988Jul28.170834.6949@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer)
writes some more flamelets, and...
> ...You get a much more useful system if you break down and
> admit that you're building a multiprocessor machine, and make all the
> CPUs general-purpose.

I don't know at what point this becomes weirding out.  Isn't it OK to have a
dedicated processor on the multiport serial I/O boards?  Isn't it OK to have
microprocessors in the disk controllers?  ...in a graphics terminal?  So why
not on a graphics board?  Why can't we put a processor anywhere that's useful
and have general multiprocessing as well?  Why should my graphics CPU, with
dedicated memory and, eventually, parallel processing, hardware transforms
and such have to be complicated by the need to run user programs?  Your 
statement could be read to criticize every engineer designing a part of a 
system who used a uP for something for not implementing with random logic 
instead and adding another general purpose processor.  That's absurd, right?  
OK then, so why is the display among all things sacrosanct from the use of 
custom or dedicated hardware?
-- 
-- 
-- Karl Lehenbauer, karl@sugar.uu.net aka uunet!sugar!karl, +1 713 274 5184

peter@ficc.UUCP (Peter da Silva) (08/02/88)

In article ... henry@utzoo.uucp (Henry Spencer) writes:
> The more smarts you
> put in your display processor, the more it resembles a somewhat-crippled
> main processor.  You get a much more useful system if you break down and
> admit that you're building a multiprocessor machine, and make all the
> CPUs general-purpose.

So, the question then becomes a matter of price. When one is building a
cheap (under $1000) computer, the cost of a second 68000 (that still won't
be as fast as a blitter, since it doesn't have those nice I-caches in
your 68030) becomes a significant factor.

Here's a way to decide: you write the fastest LIFE program you can
on the 68030, using the "dumb" algorithm (calculate all cells every cycle).
Allow for clock rate differences.

The Blitter can be used to pump LIFE along at a bit over 20 generations
per second. The 68000 can't hope to catch up. How's the 68030?
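
(For calibration, a rough sketch in C of the "dumb" algorithm, recomputing
every cell every generation.  The grid size, the wraparound edge handling,
and the byte-per-cell representation are all assumptions; a serious 68030
entry would at least pack cells into bits.)

#include <string.h>

#define W 320
#define H 200

static unsigned char cells[H][W], next[H][W];

static void life_step(void)
{
    int x, y, dx, dy, n;

    for (y = 0; y < H; y++)
        for (x = 0; x < W; x++) {
            n = 0;
            for (dy = -1; dy <= 1; dy++)       /* count the 8 neighbours */
                for (dx = -1; dx <= 1; dx++)
                    if (dx || dy)
                        n += cells[(y + dy + H) % H][(x + dx + W) % W];
            next[y][x] = (n == 3) || (n == 2 && cells[y][x]);
        }
    memcpy(cells, next, sizeof cells);
}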
-- 
Peter da Silva, Ferranti International Controls Corporation, sugar!ficc!peter.
"You made a TIME MACHINE out of a VOLKSWAGE2

jesup@cbmvax.UUCP (Randell Jesup) (08/02/88)

In article <1988Jul28.171444.7068@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <4324@cbmvax.UUCP> jesup@cbmvax.UUCP (Randell Jesup) writes:
>>>> ... The techniques are
>>>> not that well known (especially in the micro community, which is notorious
>>>> for its ignorance of the state of the art -- the people at Commodore most
>>>> likely had no idea it was even possible to do a fast software BitBlt).

>>	Excuse me, I have a paper in front of me published in EUUG last fall
>>by a researcher at AT&T on optimized software BitBlits in C on an 68020.

>I don't consider a 68020 a micro, myself.  The 68020 machine I am writing
>this on is much bigger and faster than the minicomputer I was using a
>month ago.  When I say "micro", I mean something that competes with PCs.

	True, but 68020 code is largely applicable to the 68000, plus the
paper included examples in pure C and in 68000.  And 68020's are about to
be considered micros.  (Though you can build minis out of them: the
difference between mini and micro is becoming a matter more of price tag
and I/O speed than processor type - and even those are fuzzy.)

	We do compete with PC's (actually, the A500 is positioned even
lower than most PC-clones, about $600).  However, we try very hard to be
much closer to "state of the art" than other micros (multitasking, semaphores,
blitters, etc, etc).  And our OS comes with the machine, it doesn't cost
you an extra $795 or whatever.  ;-)

>As for insulting the wrong people, if you're reading papers like that, I
>would consider you a (laudable) rare exception to the (deplorable) rule.

	The only reason it ticked me off was the direct reference to
Commodore (for whom I work on Amiga system software).

>>	Fast software implementations of BitBlit tend to be 2 sources, 1
>>destination blits (therefor 16 operations).  If you expand them to 3 sources,
>>1 destination (256 operations), they tend to take up a LOT of code/ROM
>>space, or they slow down a lot, or both...
>
>This is where dynamic compilation, the original subject of this conversation
>(remember back then? :-)) wins big:  you don't need to duplicate all that
>code N times, just know how to generate it.

	Well, you need to revise your algorithms.  The paper I cited, for
example, uses effectively "canned" code pre-compiled by the compiler.  Quite
amusing code to read, it does truly evil things with the macroprocessor,
but is more or less portable (and almost entirely in C, though the fastest
version has one or two #asm's in it).

	It means you have to REALLY do code generation on the fly, if you
need 256 operations.  Not impossible, but much more of a pain to write/
support.  Having a compiler do most of your work for you makes things much
easier.  :-)
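
(Not the EUUG paper's code, just a sketch of the "canned" idea: let the
preprocessor stamp out one tight inner loop per raster operation, so nothing
has to be generated at run time.  Shifts, masks and edge words, which are
most of the real work in a BitBlt, are left out.)

#include <stddef.h>

typedef unsigned short word;                 /* 16-bit plane words, 68000-style */

#define BLIT_LOOP(name, expr)                                            \
    static void name(word *dst, const word *src, size_t n)               \
    {                                                                     \
        while (n-- > 0) {                                                 \
            word s = *src++, d = *dst;                                    \
            *dst++ = (word)(expr);                                        \
        }                                                                 \
    }

BLIT_LOOP(blit_copy,   s)                    /* D = S      */
BLIT_LOOP(blit_or,     d | s)                /* D = D | S  */
BLIT_LOOP(blit_and,    d & s)                /* D = D & S  */
BLIT_LOOP(blit_xor,    d ^ s)                /* D = D ^ S  */
BLIT_LOOP(blit_andnot, d & ~s)               /* D = D & ~S */
/* ...and so on for the rest of the 16 two-input functions.  Doing the same
   with three sources means 256 such loops, which is the code-size explosion
   mentioned above and where generating the loop on the fly starts to look
   attractive. */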

-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

jesup@cbmvax.UUCP (Randell Jesup) (08/02/88)

In article <1988Jul28.173620.7325@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <4929@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green) writes:
>>... To go faster, they would have
>>needed a faster 68000. Which in 1985 would have probably added $200 to
>>the cost of the machine (after going through two levels of markups).
>
>Which, I would guess, is the same order of magnitude as the cost added
>by the custom chips, after the same markups.

	Actually no, since the custom chips do everything else as well
(video, sprites, ram refresh, 4-channel audio, disk io, serial, parallel.)
The blitter is a small part of one chip, using the address generation/DMA
control of another (which does these for all the custom chip DMA channels).

	Back in '84-'85 14+Mhz 68000's weren't very cheap.  (and we still
would have needed the custom chips for everything else).

-- 
Randell Jesup, Commodore Engineering {uunet|rutgers|allegra}!cbmvax!jesup

wayneck@tekig5.PEN.TEK.COM (Wayne Knapp) (08/03/88)

In article <1189@ficc.UUCP>, peter@ficc.UUCP (Peter da Silva) writes:
> 
> Here's a way to decide: you write the fastest LIFE program you can
> on the 68030, using the "dumb" algorithm (calculate all cells every cycle).
> Allow for clock rate differences.
> 
> The Blitter can be used to pump LIFE along at a bit over 20 generations
> per second. The 68000 can't hope to catch up. How's the 68030?

Okay, so how does the blitter help if the cell array is the frame buffer?  I
mean it seems kind of a waste to work in an array and then to move it
out to the frame buffer.

                                            Wayne Knapp

henry@utzoo.uucp (Henry Spencer) (08/04/88)

In article <6490@megaron.arizona.edu> mike@arizona.edu (Mike Coffin) writes:
>4. Code is simple enough for a human programmer to figure out easily
>given "deep" knowledge about what the program is doing, in which case
>a compiler has big troubles figuring out almost anything...

Well, remember, I'm not talking about proof of correctness; all I want
is absence of runtime errors.  Even so, I concede that this sort of thing
is likely to happen now and then.  Having to turn the checking off once
in a while (and, preferably, having to justify this in writing) still
strikes me as better than not having it at all.
-- 
MSDOS is not dead, it just     |     Henry Spencer at U of Toronto Zoology
smells that way.               | uunet!mnetor!utzoo!henry henry@zoo.toronto.edu

scott@applix.UUCP (Scott Evernden) (08/04/88)

In article <3084@tekig5.PEN.TEK.COM> wayneck@tekig5.PEN.TEK.COM (Wayne Knapp) writes:
>In article <1189@ficc.UUCP>, peter@ficc.UUCP (Peter da Silva) writes:
>> 
>> The Blitter can be used to pump LIFE along at a bit over 20 generations
>> per second. The 68000 can't hope to catch up. How's the 68030?
>
>Okay, so how does blitter help if the cell array is the frame buffer.

The blitter can be used as a sort of "parallel processor".  You allocate
5 bitmaps, and then blit to and fro simulating a parallel adder.  Here's
the generation section of an Amiga program I wrote in early '86 which did
about 5 gens/sec for a 320 x 200 grid.  I needed 39 2-input blits per
generation.  Wiser folks than I have since improved this to about 20/sec
by going straight to the blitter and using its 3 input capability,
requiring 9 blits.  (See 'Amazing Computing' vol.2 #12 thru vol.3 #2.)

Inspired by Mark D. Niemiec's "Life Algorithms" in Byte, Jan '79...

----------------------------------------------
/* Blitter op-codes */
#define MOV  0xC0
#define COM  0x50
#define XOR  0x60
#define BIC  0x20
#define BIS  0xE0
#define AND  0x80

/* Blitter use */
#define SBLIT(s,l,t,d,c)  BltBitMap(s,l,   t,  d,Left,Top,Wid,Hyt,c,0xFF,0)
#define BBLIT(s,d,c)      BltBitMap(s,Left,Top,d,Left,Top,Wid,Hyt,c,0xFF,0)

generation()
/*
 * The blitter "program" to produce the next generation.
 * I think it might be possible to work directly with the blitter
 * to utilize its 3-input capability, but this is for some other day...
 */
{
   /* add up the neighbors above each cell */
   SBLIT(bm[SCREEN], Left-CellX, Top-CellY, bm[0], MOV);
   SBLIT(bm[SCREEN], Left,       Top-CellY, bm[1], MOV);
   SBLIT(bm[SCREEN], Left+CellX, Top-CellY, bm[2], MOV);
   BBLIT(bm[1], bm[0], XOR);
   BBLIT(bm[0], bm[1], BIC);
   BBLIT(bm[2], bm[0], XOR);
   BBLIT(bm[0], bm[2], BIC);
   BBLIT(bm[2], bm[1], BIS);
  
   /* add up the neighbors below each cell */
   SBLIT(bm[SCREEN], Left-CellX, Top+CellY, bm[2], MOV);
   SBLIT(bm[SCREEN], Left,       Top+CellY, bm[3], MOV);
   SBLIT(bm[SCREEN], Left+CellX, Top+CellY, bm[4], MOV);
   BBLIT(bm[3], bm[2], XOR);
   BBLIT(bm[2], bm[3], BIC);
   BBLIT(bm[4], bm[2], XOR);
   BBLIT(bm[2], bm[4], BIC);
   BBLIT(bm[4], bm[3], BIS);

   /* now add the 'above' and 'below' counts */
   BBLIT(bm[2], bm[0], XOR);
   BBLIT(bm[0], bm[2], BIC);
   BBLIT(bm[2], bm[1], XOR);
   BBLIT(bm[1], bm[2], BIC);
   BBLIT(bm[3], bm[1], XOR);
   BBLIT(bm[1], bm[3], BIC);
   BBLIT(bm[3], bm[2], BIS);

   /* left and right */
   SBLIT(bm[SCREEN], Left-CellX, Top, bm[3], MOV);
   SBLIT(bm[SCREEN], Left+CellX, Top, bm[4], MOV);
   BBLIT(bm[4], bm[3], XOR);
   BBLIT(bm[3], bm[4], BIC);

   /* grand total neighbors */
   BBLIT(bm[3], bm[0], XOR);
   BBLIT(bm[0], bm[3], BIC);
   BBLIT(bm[3], bm[1], XOR);
   BBLIT(bm[1], bm[3], BIC);
   BBLIT(bm[4], bm[1], XOR);
   BBLIT(bm[1], bm[4], BIC);
   BBLIT(bm[3], bm[2], BIS);
   BBLIT(bm[4], bm[2], BIS);

   /* add 'self' in and put on screen */
   SBLIT(bm[SCREEN], Left, Top, bm[0], BIS);
   BBLIT(bm[2], bm[0], BIC);
   BBLIT(bm[1], bm[0], AND);
   SBLIT(bm[0], Left, Top, bm[SCREEN], MOV);
}

-scott

peter@ficc.UUCP (Peter da Silva) (08/04/88)

In article <3084@tekig5.PEN.TEK.COM>, wayneck@tekig5.PEN.TEK.COM (Wayne Knapp) writes:
> In article <1189@ficc.UUCP>, peter@ficc.UUCP (Peter da Silva) writes:
> > The Blitter can be used to pump LIFE along at a bit over 20 generations
> > per second. The 68000 can't hope to catch up. How's the 68030?

> Okay, so how does blitter help if the cell array is the frame buffer.  I
> mean it seems kind of a waste to work in an array and then to move it 
> out to the frame buffer.  

It took me a while to parse that sentence, because of course the cell
array is the frame buffer. Then I realised that you were missing one
important point...

The Blitter is doing the work. I don't have the article here, but the
Blitter can actually do all the work of creating the next generation.
I think it takes 11 blits.
-- 
Peter da Silva, Ferranti International Controls Corporation, sugar!ficc!peter.
"You made a TIME MACHINE out of a VOLKSWAGEN BEETLE?"
"Well, I couldn't afford another deLorean."
"But how do you ever get it up to 88 miles per hour????"

elg@killer.DALLAS.TX.US (Eric Green) (08/07/88)

In message <3084@tekig5.PEN.TEK.COM>, wayneck@tekig5.PEN.TEK.COM (Wayne Knapp) says:
>In article <1189@ficc.UUCP>, peter@ficc.UUCP (Peter da Silva) writes:
>> 
>> The Blitter can be used to pump LIFE along at a bit over 20 generations
>> per second. The 68000 can't hope to catch up. How's the 68030?
>
>Okay, so how does blitter help if the cell array is the frame buffer.  I
>mean it seems kind of a waste to work in an array and then to move it 
>out to the frame buffer.  

The Amiga blitter operates on "chip" memory (the memory also accessible by
the video chip and other custom DMA devices, e.g. the sound DMA
channel which runs around 32khz or so). I assume that this is what you
are referring to as a "frame buffer". In the microcomputer world,
generally the video memory is in the same addressing space as the rest
of CPU memory -- it's just on an isolated bus, where video refresh
cannot interfere with CPU operations (at least in the case of IBM PC
type thingys, and, partially, the Amiga). In any event, blitter memory
and "video" memory are the same, so there's no need to do any moving.
I assume that this is the case with most viable blitter systems.

In the microcomputer world, "frame buffers" are generally associated
with "frame grabbers", i.e., video frame digitizers. Perhaps this is a
case where workstation/Sun terminology slams into conflicting
microcomputer terminology?

--
Eric Lee Green    ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg
          Snail Mail P.O. Box 92191 Lafayette, LA 70509              
       MISFORTUNE, n. The kind of fortune that never misses.

atbowler@watmath.waterloo.edu (Alan T. Bowler [SDG]) (08/15/88)

In article <61781@sun.uucp> guy@gorodish.Sun.COM (Guy Harris) writes:
>
>What if they look at the microcode, or directly at the hardware (if the
>underlying machine isn't implemented using microcode)?  In order for the
>environment to be truly airtight, the microcode (or hardware) has to be
>airtight.
>
>It may be that producing a sufficiently airtight compiler that implements the
>object-oriented model by producing the "right" code, including checks, is
>either impossible or more costly than producing a microcode or hardware
>implementation atop a compiler that doesn't have to be so airtight because the
>microcode or hardware is.  If people's experience points in this direction,
>fine.
>
Perhaps this is the wrong question.  There is also the danger that the
existence of the mandatory checks will increase program complexity
as the programmer finds that he has to program around the environment
to accomplish the task.  A classic example is those Pascal programs
that allocate 1 large static linear array and then run their own
memory allocator within it.  Instead of machine pointers they
pass around indexes into the array.  This has defeated the purpose
of Pascal's tight checking, reintroduced all the old danger of
running off the end of your allocated memory (into a chunk logically
attached to another pseudo pointer), and increased the notational complexity
of the program (and the probability of errors).  If you examine these programs
and what they accomplish, you will usually conclude
  1) This was the best choice that could be made given the constraints
     of the language
  2) The programmer must have been persistent to get the thing working
     at all.

lisper-bjorn@CS.YALE.EDU (Bjorn Lisper) (08/22/88)

In article <20349@watmath.waterloo.edu> atbowler@watmath.waterloo.edu (Alan
T. Bowler [SDG]) writes:
>.....  A classic example are those Pascal programs
>that allocate 1 large static linear array and then run their own
>memory allocator withing it.  Instead of machine pointers they
>pass arround indexes into the array.  This has defeated the purpose
>of the Pascal tight checking, reintroduced all the old danger of
>running off the end of your allocated memory (into a chunk logically
>attached to another pseudo pointer), and increased the notational complexity
>of the program (and probability of errors).  If you examine these programs
>and what they accomplish, you will usually conclude 
>  1) This was the best choice that could be made given the constraints
>     of the language
>  2) The programmer must have been persistent to get the thing working
>     at all.

Just a comment on this: this is the typical technique for implementing
pointers and dynamic memory handling in Fortran, which, unlike Pascal,
doesn't have pointer types and dynamic memory allocation as language
primitives. Are "those Pascal programs" perhaps translated Fortran programs?

Bjorn Lisper

Robert_P_Rubendunst@cup.portal.com (08/30/88)

Every operating system that loads code into RAM and then executes it is
using that nasty old self-modifying code.

Every cache on every computer should be flushable by CPU instruction.

jpoon@mipos2.intel.com (Jack Poon~) (10/11/89)

Greetings,

Could any experts out there educate me on WHY and HOW self-modifying code is used?
What is the advantage of using self-modifying code that non-self-modifying code
cannot achieve?
Is there any compiler which will generate code that is self-modifying?
A small and useful example of self-modifying code would be very helpful.

Thanks,
Jack Poon

slackey@bbn.com (Stan Lackey) (10/11/89)

In article <1080@mipos3.intel.com> jpoon@mipos2.intel.com (Jack Poon~) writes:
>Greetings,
>
>Could any experts out there educate me WHY and HOW does self-modifying code use?
>What the advantage of using self-modifying code that non-self-modifying code
>cannot achieve?

I'm not an expert, but I figured I'd answer anyway.

There are many examples of self-modifying code.  Perhaps the most
common is for code to change the value of an immediate operand.  This
would be an advantage, if memory space were very tight.  However, it
is discouraged mainly because modern processors often have deep
prefetching of instructions, if not instruction caches, and it can be
difficult to detect if a normal write could write data that has
already been prefetched.  So, newer architectures have specified
unpredictable behavior when this is done.

Yet there are uses for self-modifying code.  Some spreadsheet programs
compile on-the-fly to improve execution speed (the difference between
compiled and interpreted code), and some bitblt's as well.  Debuggers
often write jumps or traps into the instruction stream, to make
single-stepping and breakpoints work.

All architectures do have some kind of facility to make instruction-
space writes work; there's a sequence that clears instruction
prefetchers and instruction caches.  Often a supplier will furnish a
system service that a user program can call.
-Stan

firth@sei.cmu.edu (Robert Firth) (10/11/89)

In article <1080@mipos3.intel.com> jpoon@mipos2.intel.com (Jack Poon~) writes:

>Could any experts out there educate me WHY and HOW does self-modifying code use?
>What the advantage of using self-modifying code that non-self-modifying code
>cannot achieve?

In the spirit of this question, I won't discuss why the current consensus
is that self-modifying code is bad, but rather will give examples.

Directly self-modifying code is pretty rare nowadays, but used to be
an essential tool.  Consider how you access an array element, A[i],
on a modern machine:

	Load index-register, i
	Load accumulator, address-of-A[0](index-register)

Some older machines didn't have index registers, so your compiler
generated this:

	Load accumulator, i
	Add accumulator, address-of-A[0]
	Store accumulator, right-half-of-instruction-L
L:	Load accumulator, contents-of-address-0

So, by the time you executed L, the right operand had been modified to
read 'contents-of-address-A[i]'.  And, basically, you HAD to do it this
way.

Some debuggers still play this trick.  If you set a breakpoint at L,
the debugger actually overwrites instruction L with 'trap-to-debugger',
of course saving the old instruction for subsequent execution.
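
In miniature the trick looks like this (a toy byte-code machine standing in
for real instructions; a real debugger does the same thing to machine code,
typically through something like ptrace):

/* Toy illustration of breakpoint patching -- not real debugger code. */
#include <stdio.h>

enum { OP_NOP, OP_PRINT, OP_TRAP, OP_HALT };

static int code[] = { OP_NOP, OP_PRINT, OP_NOP, OP_HALT };

int main(void)
{
    int bp = 1;                 /* set a breakpoint at instruction 1 */
    int saved = code[bp];       /* save the original instruction     */
    code[bp] = OP_TRAP;         /* overwrite it with a trap          */

    for (int pc = 0; ; pc++) {
        switch (code[pc]) {
        case OP_TRAP:
            printf("breakpoint hit at %d\n", pc);
            code[pc] = saved;   /* put the original instruction back... */
            pc--;               /* ...and arrange to re-execute it      */
            break;
        case OP_PRINT: puts("print"); break;
        case OP_HALT:  return 0;
        case OP_NOP:   break;
        }
    }
}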

----

Rather more common, are systems where code is generated and executed
dynamically.  Of course, any code overlay system is like that: a data
area is allocated, filled with bits read from a file, and then jumped
into and obeyed as code.  A more sophisticated system allows the
dynamic replacement of code bodies by alternative versions; thus, if
during debugging you suspect the new version of "munge()" is bad, you
replace it with the old version, read in from some file in the program
development environment, and carry on running.

This doesn't need true self-modifying code; rather, it needs dynamic
variability of procedure bodies (eg called via pointers), and the
ability to remap pieces of memory from data space into code space.

----

Finally, one very powerful technique is used in some implementations
of prototyping languages.  Typically, such languages are interpreted,
so each statement in 'Object Logo Plus', or whatever, is turned into
a data structure that is crawled over by an interpreter.

The trick is this: keep track of how often a code fragment (procedure,
method, or whatever) is modified, and how often it is executed.  When
a certain stability threshold is reached, compile the sucker into true
machine code, and have all subsequent executions be direct rather than
interpreted.  Over a prototyping session of an hour or two, the user
will observe a speedup of 2x to 20x, as the more stable and heavily used
parts of the program get converted gradually into machine code.
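
A toy C rendition of the bookkeeping (the "compiled" body here is just a
statically built stand-in for real run-time code generation):

/* Illustration only: count executions, switch to the fast path when hot. */
#include <stdio.h>

#define HOT_THRESHOLD 3

struct fragment {
    long count;                                 /* executions so far    */
    int (*run)(struct fragment *, int);         /* current entry point  */
};

static int run_compiled(struct fragment *fr, int x)
{
    (void) fr;
    return x * x;                               /* the fast, "compiled" body */
}

static int run_interpreted(struct fragment *fr, int x)
{
    if (++fr->count >= HOT_THRESHOLD)           /* stable and hot enough?   */
        fr->run = run_compiled;                 /* switch future calls over */
    /* slow "interpretation" of the same body: */
    int r = 0;
    for (int i = 0; i < x; i++)
        r += x;
    return r;
}

int main(void)
{
    struct fragment sq = { 0, run_interpreted };
    for (int i = 1; i <= 5; i++)
        printf("sq(%d) = %d\n", i, sq.run(&sq, i));
    return 0;
}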

Hope that helps.

lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (10/11/89)

In article <1080@mipos3.intel.com> jpoon@mipos2.intel.com (Jack Poon~) writes:
>
>Could any experts out there educate me WHY and HOW does self-modifying code use?
>What the advantage of using self-modifying code that non-self-modifying code
>cannot achieve?

In prehistoric times, self-modifying code was used to make up for
machine limitations. This led to snarled messes, and today, the
practice mostly survives in three areas:

Copy protected PC software: the obscurity treated as virtue.

Debuggers: a debugger may write e.g. traps into the code area, and
may fill little buffers with code (and then execute the buffer).
(Note that typical hardware features, such as trap-after-each-
instruction modes, have NOT obsoleted this practice.)

Interpreters: some interpreters will actually compile little code
sequences, and then run them on the spot. Some APLs did this.  (APL
compilers aren't possible because you run into expressions whose
meaning isn't known until you compute the value of some previous
expression. So, the interpreter would just compile up-to-there, run
that, and then go back around.)
-- 
Don		D.C.Lindsay 	Carnegie Mellon Computer Science

chase@Ozona.orc.olivetti.com (David Chase) (10/11/89)

jpoon@mipos2.intel.com (Jack Poon~) writes:
>Could any experts out there educate me WHY and HOW does self-modifying
>code use?

This was discussed at length a couple of months ago; I believe Robert
Henry made some sort of a summary.

>What the advantage of using self-modifying code that non-self-modifying code
>cannot achieve?

It's possible to generate a "function" on the fly.  Koopman and Lee
had a fun paper in this summer's SIGPLAN conference (SIGPLAN Notices
July 1989) where they used self-modifying code in an interpreter.

>Is there any compiler which will generate code that self-modified?  
I had one once, but scrapped that scheme for portability reasons.
I believe that some LISP compilers make use of self-modifying code in
various ways.

>A small and useful example of self-modifying will be very helpful.

The following runs on a Motorola 68020 (with Sun calling conventions):
---------------- 
typedef int (*PF)(); PF bar;

PF partially_apply_f_to_a(f,a) unsigned int f; unsigned int a; {

  static unsigned int pa_bits[] = {
  0x20172f00, 0x203a000e, 0x2f400004,
  0x4e714ef9, 0xFFFFFFFF, 0xAAAAAAAA};

  unsigned int * code = (unsigned int *) malloc (sizeof pa_bits);
  unsigned int i;
  
  for (i = 0; i < 6; i++) code[i] = pa_bits[i];

  code[4] = f;
  code[5] = a;

  return (PF) code;
}
----------------

If you have a 68020, you can test it with the following program:

----------------
foo(a,b) { return a - b; };
main() {
  PF bar;
  printf("Executing text gives 9 - 2 = %d\n", foo(9,2));

  bar = partially_apply_f_to_a(foo,9);
  printf("Executing partial application gives 9 - 2 = %d\n", (*bar)(2));

  bar = partially_apply_f_to_a(bar,2);
  printf("Executing second partial application gives 9 - 2 = %d\n", (*bar)());
}
----------------

Note that this code is less "self-modifying" and more
"run-time-generated", though rapid recycling of a block of memory
could cause it to appear to be "self-modifying".

Enjoy,

David

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/11/89)

In article <4415@bd.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:

>into and obeyed as code.  A more sophisticated system allows the
>dynamic replacement of code bodies by alternative versions; thus, if
>during debugging you suspect the new version of "munge()" is bad, you

Anybody else remember the TSS operating system, which let a system
programmer test modified system routines on his own task on the fly
without affecting other tasks?  Of course, VM can do this, too, but
try to find a Unix system with a *real* dynamic loader like this :-)


  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

usenet@cps3xx.UUCP (Usenet file owner) (10/11/89)

In article <1080@mipos3.intel.com> jpoon@mipos2.intel.com (Jack Poon~) writes:
>Greetings,
>
>Could any experts out there educate me WHY and HOW does self-modifying code use?
>What the advantage of using self-modifying code that non-self-modifying code
>cannot achieve?

Back when I was doing assembly hacking on the RadioShack coco,  I needed
a super fast block copy.
To save a few cycles per loop, I poked the calculated indices into
the load/store instructions.
I ended up with a routine that could copy blocks at a rate
of less than 3 cycles per byte.  This is good considering each
instruction takes a minimum of 2 cycles, and it takes two instructions
to do a "load here; store there".

>Is there any compiler which will generate code that self-modified?  
>A small and useful example of self-modifying will be very helpful.

Possible, but I doubt it.
 Joe Porkka   porkka@frith.egr.msu.edu

tada@athena.mit.edu (Michael J Zehr) (10/11/89)

I found an example of self-modifying code back in the middle ages (c.
1981) in the operating system on a Z-80 chip.  As I recall, the Z-80 had
bit set, clear, and test op-codes which needed an assembly-time bit
number rather than using a register to determine which bit to
set/clear/test.  It took less space to determine the right bit, load
the opcode (I think a shift would give you the opcode for a bit, or
maybe shift and add an offset) into the right place in memory, and then
execute it.  The other option was to have 8 branch statements.

But the way to really save space was to call the routine with a
parameter which indicated which operation (set, clear, test); after
computing the opcode to indicate the right bit, you also select the opcode
to indicate the kind of operation, load that into memory, and execute
it.

*Very* kludgy, but it took a surprisingly small amount of space with
only a small penalty in speed.

-michael j zehr

scott@bbxsda.UUCP (Scott Amspoker) (10/11/89)

In article <48682@ricerca.UUCP> chase@Ozona.UUCP (David Chase) writes:
>jpoon@mipos2.intel.com (Jack Poon~) writes:
>>Could any experts out there educate me WHY and HOW does self-modifying
>>code use?

In some cases self-modifying code was necessary just to implement
subroutines.  I used to
was no subroutine branch instruction.  Anytime you did a normal
branch, the address of the next instruction was placed into the
"address recall register (ARR)".  The first thing the subroutine
had to do was store the ARR into the proper location of some
branch instruction that would ultimately act as a "return".
Obviously, subroutines were not re-entrant.

In fact, we had a standard convention of using the expression '*-*'
(read as: location counter minus location counter: absolute 0).
So, if you were maintaining some assembly code and encountered
an instruction such as:

    bra  *-*

You knew that the operand would be determined at runtime. 
Scary stuff, no?

-- 
Scott Amspoker
Basis International, Albuquerque, NM
(505) 345-5232
unmvax.cs.unm.edu!bbx!bbxsda!scott

johnl@esegue.segue.boston.ma.us (John R. Levine) (10/11/89)

In article <1080@mipos3.intel.com> jpoon@mipos2.intel.com (Jack Poon~) writes:
>Could any experts out there educate me WHY and HOW does self-modifying code
>use? What the advantage of using self-modifying code that non-self-modifying
>code cannot achieve?

Depending on your point of view, either every modern computer uses
self-modifying code, or almost nobody uses it any more.  One of the great
conceptual breakthroughs that made the modern computer possible was to store
instructions and data in the same memory, so that instructions could create
and modify other instructions.  (I believe it was due to Von Neumann, but
Babbage may have preceded him.)

In ancient days, before about 1954, there were no index registers or even
indirect addressing, so the only way you got a computer program to do
anything interesting like add up all of the elements in an array was to have
the program patch itself.  For example, if your array started at location
100, the first instruction in the loop would be something like "ADD 100"
which added the contents of location 100.  Next you wanted to add location
101, so you'd do that by adding 1 to that ADD instruction, changing it to
"ADD 101", and so forth.  There were, as you might expect, a whole slew of
clever tricks involving diddling instructions.

Since the advent of index registers, rather than modifying the add
instruction in memory, one uses an index register or indirect address word
which modifies the effect of the instruction without having to modify the
instruction in memory.  The bad side of modifiable code is that a buggy
program that is trying to change data will change code instead, which can
lead to some truly spectacular wipeouts.  (Sometimes they completely clear
memory, so there is no trace of the program at all!)  There are performance
problems associated with self-modifying code as well.  Most computers fetch
instructions considerably ahead of where they are executing; even the lowly
8086 can fetch up to 6 bytes ahead of where it is executing.  This means that
if you store into the next instruction, the CPU might or might not already
have fetched that instruction, so it might execute the old instruction or the
new one, depending on such things as interrupts, DMA, and even register
contents.  Needless to say, this kind of bug is very hard to find.

Experience suggests that in most cases, the possible damage from bugs
smashing code outweighs the possible gain from allowing modifications,
particularly since indexing and indirect addressing solve the problems most
often addressed by code modification.  Most computers now make the code
read-only, so while a particular program is running it is forcibly prevented
from modifying itself.

There are a few places where instruction modification is still used.  In some
cases where programs dynamically link to libraries, the "call" instructions
that refer to library routines are modified by the dynamic linker to point to
wherever the routine happens to have ended up.  High-performance sort
programs that run on mainframes invariably write the inner comparison loop
for the sort at the time the sort starts, so they don't have to reinterpret
the sort specifications over and over.  Incremental compilers such as are
commonly found in Lisp systems compile a function to code at the time it is
first called, install that code into the running process, and sometimes even
patch call instructions to point to it (like the dynamic linker.)
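
The data-space analogue of that call patching is easy to sketch in C: the
first call lands in a resolver, which looks up the real routine and patches
the slot so later calls go straight through.  (Invented names; a real
linker patches the call instruction or a jump table rather than a C pointer.)

#include <stdio.h>

static int real_sqr(int x) { return x * x; }

static int sqr_resolver(int x);                 /* bound lazily, see below  */
static int (*sqr_slot)(int) = sqr_resolver;     /* where the call indirects */

static int sqr_resolver(int x)
{
    sqr_slot = real_sqr;        /* "link" it: later calls skip the resolver */
    return real_sqr(x);
}

int main(void)
{
    printf("%d\n", sqr_slot(7));    /* first call resolves, prints 49    */
    printf("%d\n", sqr_slot(8));    /* goes straight through, prints 64  */
    return 0;
}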

Some people might claim that compiling a program to object code and then
loading and executing it counts as instruction modification.  After all, the
compiled code wasn't there when the computer started up, somebody must have
modified what was there, so we all modify code all the time.  I find this
definition a little precious, and prefer to consider code to be
self-modifying only in the case where the modifier and modifiee are part of
the same program.  I leave the definition of "same program" up to you.
-- 
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650
johnl@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl
Massachusetts has over 100,000 unlicensed drivers.  -The Globe

koopman@a.gp.cs.cmu.edu (Philip Koopman) (10/11/89)

In article <1080@mipos3.intel.com>, jpoon@mipos2.intel.com (Jack Poon~) writes:
> ...
> What the advantage of using self-modifying code that non-self-modifying code
> cannot achieve?
> ...

In combinator graph reduction execution of functional programming
languages, self-modifying code gives approximately a factor of
two speedup on almost any hardware.  This is because graph reduction
is by its very nature self-modifying.  My paper "A fresh look at
combinator graph reduction" in the 1989 SIGPLAN conference on
Programming Language Design and Implementation has details.

In my case, the self-modification is limited to changing the
addresses of subroutine call instructions.  If a RISC architecture
such as the R2000 had a jump to subroutine that took an address
out of the data cache (perhaps by using the succeeding word
in memory as an address, referenced through the data cache)
this would be good enough for my application.

  Phil Koopman                koopman@greyhound.ece.cmu.edu   Arpanet
  2525A Wexford Run Rd.
  Wexford, PA  15090
Senior Scientist at Harris Semiconductor, and researcher at CMU.
I don't speak for them, and they don't speak for me.

peter@ficc.uu.net (Peter da Silva) (10/11/89)

One use of self-modifying code that still survives is total balls-to-the-wall
block operations, such as BitBlts. There was a discussion of hardware
graphics coprocessors where someone (maybe Henry Spencer, as the leading
opponent of the beasts) brought this up. I have not seen such code myself,
but it involved generating the BitBlt subroutine optimised for the operation
desired on the fly and then executing it.
-- 
Peter da Silva, *NIX support guy @ Ferranti International Controls Corporation.
Biz: peter@ficc.uu.net, +1 713 274 5180. Fun: peter@sugar.hackercorp.com. `-_-'
                                                                           'U`
Quote: Structured Programming is a discipline -- not a straitjacket.

bjornl@tds.kth.se (Bj|rn Lisper) (10/11/89)

Self-modifying code can be seen as partial evaluation delayed to runtime.
Partial evaluation, in its cleanest form, is to use the knowledge of some
function arguments to simplify the function. Consider a function f(x,y) in
two arguments x,y. If we know the value of x, say that x=c, then we can
substitute c for x in the call f(c,y) and obtain a simplified "residual"
function fc(y). Usually this is done at compile-time. A prime example is
substitution of symbolic constants. Nothing prevents this from being done
at runtime, however (if, for instance, x is not known before then), which
then results in self-modifying code.
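
For a tame C rendition of the idea, the residual function can be carried
around as a (general code, captured argument) pair; generating genuinely
specialized code for fc at runtime is exactly where the self-modification
comes in:

/* Sketch only: the "residual" is data here, not freshly generated code. */
#include <stdio.h>

static int f(int x, int y) { return x * y + x; }   /* the general function */

struct residual { int c; };                        /* fc, with x fixed at c */

static struct residual specialize_f(int c)
{
    struct residual fc = { c };
    return fc;
}

static int apply_residual(struct residual fc, int y)
{
    return f(fc.c, y);    /* a real partial evaluator simplifies this body */
}

int main(void)
{
    struct residual f3 = specialize_f(3);      /* "compile" f with x = 3 */
    printf("%d\n", apply_residual(f3, 4));     /* prints 15              */
    return 0;
}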

Bjorn Lisper

pl@etana.tut.fi (Lehtinen Pertti) (10/11/89)

From article <6481@pt.cs.cmu.edu>, by koopman@a.gp.cs.cmu.edu (Philip Koopman):
> 
> In my case, the self-modification is limited to changing the
> addresses of subroutine call instructions.  If a RISC architecture
> 

	I've lately been wondering if there is any architecture
	with the possibility of executing an instruction indirectly.

	The old Intel 8080 handled interrupts by executing an
	instruction provided by the peripheral, but does any
	current architecture support anything like this?

	I think this would provide a way to create 'self-modifying'
	code without crashing into cache problems.

	I mean something like:

		exec r0	; execute instruction in register r0

	or

		exec (r0)	; execute instruction pointed to by r0

	Does anyone know?

--
pl@tut.fi				! All opinions expressed above are
Pertti Lehtinen				! purely offending and in subject
Tampere University of Technology	! to change without any further
Software Systems Laboratory		! notice

wilson@carcoar.Stanford.EDU (Paul Wilson) (10/11/89)

In article <4415@bd.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:
>Finally, one very powerful technique is used in some implementations
>of prototyping languages.  Typically, such languages are interpreted,
>so each statement in 'Object Logo Plus', or whatever, is turned into
>a data structure that is crawled over by an interpreter.
>
>The trick is this: keep track of how often a code fragment (procedure,
>method, or whatever) is modified, and how often it is executed.  When
>a certain stability threshold is reached, compile the sucker into true
>machine code, and have all subsequent executions be direct rather than
>interpreted.  Over a prototyping session of an hour or two, the user
>will observe a speedup of 2x to 20x, as the more stable and heavily used
>parts of the program get converted gradually into machine code.
>
>Hope that helps.

Does anybody actually do this?  It seems the obvious thing to do, but
I don't know of anybody who does it.  All of the dynamic compilation
systems I know of compile things the very first time they're executed,
and don't recompile unless something gets redefined.

It seems to me that in a very high-performance system you want to do
this, recompiling with a higher level of optimization when an execution
count hits a threshold.  This will heuristically improve your
response time/ execution time tradeoff in an interactive programming
environment.  (And also your space/time tradeoffs in the generated
code.)

I would be *very* interested in anybody's experiences with such a
system, especially any statistics on the frequencies of execution
of different procedures/methods.  (e.g., does a 90/10 rule apply,
or an 80/20, or what?  What does the curve look like?)

I'm interested in preventing unnecessary compilation and optimization
of things that aren't executed very frequently.  This is not just
to avoid useless compiler work, but to keep the code transparently
debuggable whenever possible.

(Yes, I know you can often debug optimized code by recording the
optimizations and backward-mapping them at debug time.  But sometimes
you can't, and it's always more work for the compiler and debugger
writers.  I'm looking at two levels of optimized code -- easily
source-mappable "transparent" optimizations, and hairy "opaque"
optimizations for important well-debugged time-critical code.
Any info to evaluate the tradeoffs would be of interest.)


   -- Paul

Paul R. Wilson                         
Software Systems Laboratory               lab ph.: (312) 996-9216
U. of Illin. at C. EECS Dept. (M/C 154)   wilson@carcoar.stanford.edu
Box 4348   Chicago,IL 60680 

joel@cfctech.UUCP (Joel Lessenberry) (10/12/89)

	I did a lot of work on IBM Series/1's with the Event Driven
 	Executive operating system...programmed in Event Driven Language.

	EDL (sorta a macro assembler) had opcode parameter naming, i.e.

	move	#1,1,p2=label1

	This would move to index reg #1 the immediate value 1, with label1
	being the address of the immediate in the opcode.  All of the instructions
	had this 'feature'.  Actually, the S/1 was not all that bad.  Anyone
	else have any experience?

					joel

 Joel Lessenberry, Distributed Systems | +1 313 948 3342
 joel@cfctech.UUCP                     | Chrysler Financial Corp.
 joel%cfctech.uucp@mailgw.cc.umich.edu | MIS, Technical Services
 {sharkey|mailrus}!cfctech!joel        | 2777 Franklin, Sfld, MI

johne@hpvcfs1.HP.COM (John Eaton) (10/12/89)

< What the advantage of using self-modifying code that non-self-modifying code
< cannot achieve?
---------
Job Security...

I once traced a problem in some PC software to a bit of SMC that worked
on an 8088 but not on an 8086.  Turned out that the larger prefetch of an
8086 would grab the old data before the write and use that instead of
what the programmer intended.


John Eaton
!hpvcfs1!johne

hascall@atanasoff.cs.iastate.edu (John Hascall) (10/12/89)

In article <9175@etana.tut.fi> pl@etana.tut.fi (Lehtinen Pertti) writes:
}From article <6481@pt.cs.cmu.edu>, by koopman@a.gp.cs.cmu.edu (Philip Koopman):
  
}	I've been lately wondering if there is any architecture
}	with possibility to execute instruction indirectly.
 
}	I mean something like:
}
}		exec (r0)	; execute instruction pointed by r0
 
}	Does anyone know?

	    Well, how about the IBM 370 series.  (not exactly modern, I know)

	    If I recall correctly:

		       EX    mask,location
		       <next instruction>

            location:  <target instruction>


	    Fetches the target instruction at "location" and ORs "mask" with
	    (some of) the target instruction and then executes the resulting
	    instruction.  Control resumes at "next instruction".
	    
	    The EX instruction is used all the time as
	    it is the only decent way to do a number of things.

John Hascall
ISU Comp Center

cooper@hpsrad.enet.dec.com (g.d.cooper in the shadowlands) (10/12/89)

One of my major reasons for using self-modifying code was the user-
implementable instructions (UUOs) on the PDP-10.  If you were doing
funky IO you could define your own system call that executed within
user space.  Since IO channel numbers are different for each file,
you would build the instruction sequences somewhere in memory and then
execute them.

A couple of months ago I was writing a device driver for RT-11 on the
PDP-11 and I had to use self-modifying code to either store or discard
data depending upon whether or not there was a free buffer.  The '11
lends itself to this sort of manipulation by allowing instruction
sequences such as:

	tstb	free			;any free buffers
	bne	1$			;yes don't worry be happy

	mov	(pc)+,@#routine		;no free buffers then we'll
		nop			;have to discard
	br	2$

1$:
	mov	(pc)+,@#routine		;move the mov instruction into
		mov	(r0)+,(r1)+	;the routine
2$:

I also remember doing similar things on the PDP-8 and the RDS-500 due
to hardware limitations but the less said about that the better.

		   Hacking assembler for a living,

				shades
============================================================================
| He paid too high a price for living | Geoffrey D. Cooper                 | 
| too long with a single dream.....   | cooper@hpsrad.enet.dec.com	   |
|-------------------------------------| business (508) 467-3678            |
| decwrl!hpsrad.enet.dec.com!cooper   | home (617) 925-1099                |
============================================================================
Note: I'm a consultant.  My opinions are *MY* opinions.

mrc@Tomobiki-Cho.CAC.Washington.EDU (Mark Crispin) (10/12/89)

In article <1619@atanasoff.cs.iastate.edu> hascall@atanasoff.UUCP (John Hascall) writes:
>In article <9175@etana.tut.fi> pl@etana.tut.fi (Lehtinen Pertti) writes:
>}From article <6481@pt.cs.cmu.edu>, by koopman@a.gp.cs.cmu.edu (Philip Koopman):
>}	I've been lately wondering if there is any architecture
>}	with possibility to execute instruction indirectly.
>	    Well, how about the IBM 370 series.  (not exactly modern, I know)

On the DEC PDP-10 (CPU for the DECsystem-10 and DECSYSTEM-20), *all*
instructions calculated an "effective" address based on an 18-bit
value (the right half of the instruction word), index register (the
low order 4 bits of the left half iff non-zero), and indirect bit (the
next higher order bit after the index register).  This was done before
any opcode decoding was done (hence, you would get any page faults
caused by the effective address decoding before you'd get any illegal
instruction faults).  Indirection was recursive, so you could do an
amazing set of references.

So, you could always execute indirect and/or indexed.  In fact, it was
fairly common to do XCT FCNTAB(FC) where FCNTAB was a table of
instructions to execute and FC was the desired index into that table.


Mark Crispin / 6158 Lariat Loop NE / Bainbridge Island, WA 98110-2020
mrc@CAC.Washington.EDU / MRC@WSMR-SIMTEL20.Army.Mil / (206) 842-2385
Atheist & Proud / 450cc Rebel pilot -- a step up from 250cc's!!!
tabesaserarenakerebanaranakattarashii...kisha no kisha ga kisha de kisha-shita
sumomo mo momo, momo mo momo, momo ni mo iroiro aru
uraniwa ni wa niwa, niwa ni wa niwa niwatori ga iru

cac@steven.COM (cac) (10/12/89)

In article <9175@etana.tut.fi>, pl@etana.tut.fi (Lehtinen Pertti) writes:
+ 	I've been lately wondering if there is any architecture
+ 	with possibility to execute instruction indirectly.
+ 	I think this would provide a way to create 'self-modifying'
+ 	code without crashing into cache problems.
+ 	I mean something like:
+ 		exec r0	; execute instructio in register r0
+ 	or
+ 		exec (r0)	; execute instruction pointed by r0
+ 	Does anyone know?

  The PDP-15 did, as I recall, and I think the PDP-10 did as well, altho
I never did assembly on the 10.

> Pertti Lehtinen				! purely offending and in subject

-- 
Charles A. Clinton		Disclaim! Disclaim!
Sierra Geophysics, Inc.		...!uw-beaver!sumax!ole!steven!cac
11255 Kirkland Way		Voice: (206) 822-5200  Telex: 5106016171
Kirkland, WA 98033 USA		FAX: (206) 827-3893

johnl@esegue.segue.boston.ma.us (John R. Levine) (10/12/89)

In article <9175@etana.tut.fi> pl@etana.tut.fi (Lehtinen Pertti) writes:
>From article <6481@pt.cs.cmu.edu>, by koopman@a.gp.cs.cmu.edu (Philip Koopman):
>	I've been lately wondering if there is any architecture
>	with possibility to execute instruction indirectly.

Lots of architectures do.  The IBM 370 has an execute instruction, as
does the DEC-20, GE 635 or whatever they call it these days (Bull DPS-8,
perhaps) and many other more antique or obscure machines.

Each has its own quirks.  On the 370, EX takes two operands, the address of
the instruction to execute and a register.  The low byte of the register is
logically or-ed with the instruction before it is executed (in the
instruction decoder, not in memory.)  In memory-to-memory move or compare
instructions, the second byte is the length, so this is how you faked
variable length string instructions on the 360.  The success of this approach
can be appreciated by noting that one of the changes in the 370 was to add
variable length string compare and move instructions.  It is expressly
forbidden to execute another execute instruction.

On the DEC-20, you can execute anything you want, including another execute
instruction.  On the '20 the execute instruction again takes two operands,
the location of the instruction to be executed and a register.  In early
implementations of the PDP-6 and PDP-10 the register was ignored, but later
they used the register field to mean things like "interpret the source
address in user context" to make it easier to fetch and store user data while
executing a system call in kernel mode.  The -20 also takes interrupts by
executing instructions in fixed addresses.  You can use a "read (or write)
word and increment pointer," thereby implementing a poor man's DMA.

I haven't seen execute instructions on any more modern machines; I guess that
the same arguments that give us read-only code say that the hackery to
implement execute isn't worth it either.  I can't see how you could implement
it on a pipelined machine without totally draining the pipeline, thus causing
a terrible performance hit.
-- 
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650
johnl@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl
Massachusetts has over 100,000 unlicensed drivers.  -The Globe

bga@odeon.ahse.cdc.com (Bruce Albrecht) (10/12/89)

In article <236@bbxsda.UUCP>, scott@bbxsda.UUCP (Scott Amspoker) writes:
> So, if you were maintaining some assembly code and enountered
> an instruction such as:
> 
>     bra  *-*
> 
> You knew that the operand would be determined at runtime. 
> Scary stuff, no?

It's equivalent to the FORTRAN assigned GOTO, or the PDP-11's JSR/JMP reg,
or other jumps calculated on the fly, it's just that the address is saved
in memory.

CDC's peripheral processors (PP's) had 4k bytes.  In order to squeeze 
everything into a PP, they relied heavily on overlays, and putting code
in buffer areas to preset constants, etc.  The program would execute the
code in the buffer, which set any jumps, constants, etc, that needed
changing, and then reuse the buffer in the main code area.

art@salt.acc.com (Art Berggreen) (10/12/89)

I seem to recall one of the primary paper tape bootstraps for PDP-11s
used a loop which read a byte from the tape into the instruction loop.
The punch pattern for the leader left the code unchanged until the
beginning of a record at which time the code was modified to jump to
a different part of the bootstrap controlled by the byte read off the
tape.  Details are a bit hazy (twas quite a while ago...).

clj@ksr.com (Chris Jones) (10/12/89)

In article <9175@etana.tut.fi>, pl@etana (Lehtinen Pertti) writes:
>
>	I've been lately wondering if there is any architecture
>	with possibility to execute instruction indirectly.
>
>	I mean something like:
>
>		exec r0	; execute instructio in register r0
>
>	or
>
>		exec (r0)	; execute instruction pointed by r0

We used to do this on Data General (16 bit) Eclipses.  Peripherals on the
machine were addressed by device codes, and the I/O instructions had the device
codes as part of the instruction (rather than being taken from a register).  To
allow the same code to service more than one device, we'd build an I/O
instruction in a register, then execute it out of the register.

The GE/Honeywell 635/645/DPS-8 line had both XEC (execute) and XED (execute
double) instructions, used to implement one or two instruction subroutines.

By the way, the instruction format on those machines was a 36-bit word,
with the 18 bits of address in the upper half of the word.  I have been told
the reason for this is so that self-modifying code wouldn't destroy opcode bits
if address manipulation caused a carry out of the address field.

tim@cayman.amd.com (Tim Olson) (10/12/89)

In article <6481@pt.cs.cmu.edu> koopman@a.gp.cs.cmu.edu (Philip Koopman) writes:
| In my case, the self-modification is limited to changing the
| addresses of subroutine call instructions.  If a RISC architecture
| such as the R2000 had a jump to subroutine that took an address
| out of the data cache (perhaps by using the suceeding word
| in memory as an address, referenced through the data cache)
| this would be good enough for my application.

But it does!  The R{2,3}000, SPARC, Am29000, (and 88K, I believe) all
have the ability to perform a register-indirect call, where the target
of the call is sourced from a general-purpose register.  These
instructions normally are used to build "big" addresses, but can be
used in your application simply by loading the modified address into
the register before performing the call.
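
At the C level that is just a call through a pointer kept in data memory, so
the "self-modification" degenerates to an ordinary store (sketch below; the
register-indirect call instruction it compiles to is up to the compiler):

#include <stdio.h>

static int plus_one(int x)  { return x + 1; }
static int times_two(int x) { return x * 2; }

static int (*slot)(int) = plus_one;     /* the word the program rewrites */

int main(void)
{
    printf("%d\n", slot(5));    /* indirect call through the slot: 6    */
    slot = times_two;           /* "modify the code" with a data store  */
    printf("%d\n", slot(5));    /* now 10; no I-cache flush needed      */
    return 0;
}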

A similar technique for branches is normally used for performing
"switch" statements via a jump table stored in data memory.


	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

aaronb@lagrange.gatech.edu (Aaron Birenboim) (10/12/89)

In article <1989Oct11.013553.3893@esegue.segue.boston.ma.us> johnl@esegue.segue.boston.ma.us (John R. Levine) writes:

>...  Most computers fetch
>instructions considerably ahead of where they are executing; even the lowly
>8086 can fetch up to 6 bytes ahead of where it is executing.  This means that
>if you store into the next instruction, the CPU might or might not already
>have fetched that instruction, so it might execute the old instruction or the
>new one, depending on such things as interrupts, DMA, and even register
>contents.  Needless to say, this kind of bug is very hard to find.

Is this really true?  So there isn't enough dependency checking in the
8086 instruction pipe to detect this type of operation, clear
the pipe, and re-fetch the altered instruction, or take some such
corrective measure?  I'm glad I don't try self-modifying code.

Aaron Birenboim        | aaronb@eedsp.gatech.edu |  Why do we have to wear
Georgia Tech Box 30735 | (404) 874-1973          |  shoes all the time?
Atlanta, GA 30332      +-------------------------+  
USENET: ...!{allegra,hplabs,ihnp4,ulysses}!gatech!gt-eedsp!aaronb

yodaiken@freal.cs.umass.edu (victor yodaiken) (10/12/89)

Forget where I saw this nice example of self-modifying code.  It was
billed as one of the earliest mutual exclusion programs.


L: move opcode["branch to L"] to L
critical region
...
move opcode["move opcode['branch to L'] to L"] to L


The first program to execute the instruction at L changes the instruction
so that all others will just busy loop, and then changes it back
when it leaves the critical region.
Of course it only works on uniprocessors.

victor

fwb@demon.siemens.com (Frederic W. Brehm) (10/12/89)

Compiler generated self-modifying code:  YES.

When I was a graduate student (in ancient times:  1974) I had occasion to
look at the generated code for both a CDC 6000-series Fortran compiler and
an IBM 370-series COBOL compiler.  Both compilers used the same technique
for inserting argument addresses in subroutines (did you know COBOL has
subroutines?!!!).  The compilers generated code that copied the argument
addresses from the argument list into the instruction address fields.

Fred
--
Frederic W. Brehm	Siemens Corporate Research	Princeton, NJ
fwb@demon.siemens.com	-or-	princeton!siemens!demon!fwb

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/12/89)

In article <9175@etana.tut.fi>, pl@etana.tut.fi (Lehtinen Pertti) writes:

|  	I've been lately wondering if there is any architecture
|  	with possibility to execute instruction indirectly.

  Among others, the GE-600 series, currently marketed as the Honeywell
DPS.  It has an execute and an execute double (execute two instructions at
the given location).

  One use of self-modifying code is to get around bad hardware design,
such as making i/o port numbers immediate addresses. If this is the
case, as it has been on a number of machines, you either need to have
one device driver per port or modify the code.

  There was some such limitation on the DPS, and the way I got around
it, rather than write self modifying code, was to build the instruction
I needed on the stack and do a XEC on it. This was useful on the models
which support pure code segments.

  I think the 8080 had i/o port as part of the address of the
instruction, and the Z80 added the ability to have the port address in a
register. Someone will undoubtedly remember it better than I do.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called
'reason' in the face of the church and common sense. Any fool can see
that the world is flat!" - anon

fwb@demon.siemens.com (Frederic W. Brehm) (10/12/89)

In article <9175@etana.tut.fi> pl@etana.tut.fi (Lehtinen Pertti) writes:
>	I've been lately wondering if there is any architecture
>	with possibility to execute instruction indirectly.
>...
>	I mean something like:
>
>		exec r0	; execute instructio in register r0
>...
>	Does anyone know?

The Data General Eclipse (16-bit) had an instruction just like this.  I
don't know if the MV series has one like it.

Fred
--
Frederic W. Brehm	Siemens Corporate Research	Princeton, NJ
fwb@demon.siemens.com	-or-	princeton!siemens!demon!fwb

EGNILGES@pucc.Princeton.EDU (Ed Nilges) (10/12/89)

I have followed this discussion with interest.  In modern very-high-
level interpreted languages like REXX, an interesting variant of the
SMC issue emerges.  This is the use of the INTERPRET verb:


     INTERPRET expression


evaluates "expression" and executes it as a REXX statement.  Some of
the same arguments pro and con are used with reference to the use of
INTERPRET: pro:


     1.  It's "efficient"

     2.  It allows me to execute more complex and sophisticated
         algorithms in which code is data and data is code, eg.,
         implementation of expert systems


con:


     1.  It's "confusing"

     2.  It breaks associated technology.  Traditional SMC breaks
         instruction caches and other such contraptions, whereas
         use of INTERPRET breaks compilers.  This last point is not
         a necessity, since the compiler could generate code for
         lazy evaluation of the expression, but the only current
         compiler for REXX simply prohibits INTERPRET.


Other examples of this SMC variant include DO in Hypertalk.

henry@utzoo.uucp (Henry Spencer) (10/12/89)

In article <1989Oct12.041940.5651@ginger.acc.com> art@salt.acc.com (Art Berggreen) writes:
>I seem to recall one of the primary paper tape bootstraps for PDP-11s
>used a loop which read a byte from the tape into the instruction loop.

Bootstraps written before the days of decent-sized EPROMs often employ
all kinds of dirty tricks to minimize size.  The classic was the toggle-in
(i.e. manually entered, none of this wimpy ROM business :-)) bootstrap
for the pdp8 with RK disks:  two instructions.  The first was "kick disk
controller" (the controller handily came up in a fairly reasonable state
after bus reset).  The second was "branch to self".  These went in at a
very specific location in low memory.  As the second-stage bootstrap came
in off disk, it overwrote low memory, and the early part of it was very
carefully crafted to pick up control and wait for the disk controller to
finish.
-- 
A bit of tolerance is worth a  |     Henry Spencer at U of Toronto Zoology
megabyte of flaming.           | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

andrew@frip.WV.TEK.COM (Andrew Klossner) (10/13/89)

[]

	"I had occasion to look at the generated code for both a CDC
	6000-series Fortran compiler and an IBM 370-series COBOL
	compiler.  Both compilers used the same technique for inserting
	argument addresses in subroutines (did you know COBOL has
	subroutines?!!!)."

Did you know that COBOL has an instruction to modify a GOTO instruction?

	ALTER LOOP-LABEL TO PROCEED TO ALL-DONE.

means change the destination of the GOTO at LOOP-LABEL from whatever it
is now to the label ALL-DONE.  (I've heard that the ANSI COBOL
committee is doing away with this, and I've heard further that members
of the committee have been sued, but I don't know details.)  Of course,
you can implement this without actually modifying any code, by
arranging that all branches which are targets of ALTER statements fetch
their targets from data space.
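
For example, a compiler could treat the altered GO TO roughly as in this C
sketch (invented names, not actual COBOL runtime code):

#include <stdio.h>

static void loop_body(void) { puts("looping"); }
static void all_done(void)  { puts("all done"); }

/* The GO TO at LOOP-LABEL compiles to an indirect jump through this cell. */
static void (*loop_label_target)(void) = loop_body;

int main(void)
{
    loop_label_target();            /* GO TO LOOP-LABEL's current target       */
    loop_label_target = all_done;   /* ALTER LOOP-LABEL TO PROCEED TO ALL-DONE */
    loop_label_target();
    return 0;
}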

	"The R{2,3}000, SPARC, Am29000, (and 88K, I believe) all have
	the ability to perform a register-indirect call, where the
	target of the call is sourced from a general-purpose register.
	These instructions normally are used to build "big" addresses ..."

On the 88k, at least, these instructions are more often used to
implement indirect procedure calls, for example to translate the C
construct

	return_value = (*procedure_ptr)(arg1, arg2);

The direct procedure call on the 88k can address plus-or-minus 128MB of
instruction space, which so far has been big enough for my codes.  :-)

  -=- Andrew Klossner   (uunet!tektronix!frip.WV.TEK!andrew)    [UUCP]
                        (andrew%frip.wv.tek.com@relay.cs.net)   [ARPA]

cassel@sce.carleton.ca (Ron Casselman) (10/13/89)

I seem to remember a piece of self-modifying code that used to run
on CP/M. I remember it because I had to disassemble it as an assignment for
a course in systems software. The program would copy itself to a different
part of memory and change all address references as it did so. It would
then load a second program into the original location. (all programs
under CP/M started at a fixed location). The second program would then
begin to run. Sorry I cannot remember any more details as to the purpose
of this self-modifying code.
 

pasek@ncrcce.StPaul.NCR.COM (Michael A. Pasek) (10/13/89)

In article <9175@etana.tut.fi> pl@etana.tut.fi (Lehtinen Pertti) writes:
>	I've been lately wondering if there is any architecture
>	with possibility to execute instruction indirectly.
>[stuff deleted]
>	I mean something like:
>		exec r0	; execute instructio in register r0

Something like this is certainly possible.  The IBM 360/370 architecture has
an "Execute" instruction.  The instruction allows you to execute a single
instruction "indirectly", with bits 8-15 of the executed instruction modified
by bits 24-31 of a register specified on the Execute instruction.  For example,
to do an immediate compare of a byte value:
         LA     R1,BYTEVAL         GET THE VALUE WE'RE LOOKING FOR
         EX     R1,COMPARE         CHECK TO SEE IF IT'S THERE
             .
             .
COMPARE  CLI    TESTVAL,0          COMPARE

In the above example, the "virgin" instruction in memory has the following
format (assuming that "TESTVAL" is at absolute location 00000123 in memory):
       95000123
It is actually executed (assuming BYTEVAL = X'AF') as:
       95AF0123

The Execute instruction can be very handy for variable-length block moves,
and is certainly much safer than self-modifying code.  Of course, you could
still call it self-modifying, it's just modified in the CPU rather than in
memory :-).

M. A. Pasek          Switching Software Development         NCR Comten, Inc.
(612) 638-7668              CNG Development               2700 N. Snelling Ave.
pasek@c10sd3.StPaul.NCR.COM                               Roseville, MN  55113

johnl@esegue.segue.boston.ma.us (John R. Levine) (10/13/89)

In article <527@eedsp.gatech.edu> aaronb@lagrange (Aaron Birenboim) writes:
>In article <1989Oct11.013553.3893@esegue.segue.boston.ma.us> I wrote:
>>even the lowly 8086 can fetch up to 6 bytes ahead of where it is executing
>>[so if you store into the next instruction it will probably execute the
>>old version]
>Is this really true?  There isn't enough dependancy checking in the 8086
>instruction pipe to detect this type of operation, clear the pipe, and
>re-fetch the altered instruction, or some such corrective measure.

You bet there isn't.  Why waste all that logic to handle the .000001% of
programs that would want to store into the next instruction?  And you thought
the 8086 wasn't a RISC.  You can patch code reliably so long as there is a
branch executed between the patcher and the patchee, since all '86 processors
flush the prefetch queue on a branch.  The advisability of patching code
remains as dubious as ever, particularly on machines in the '86 series in
which the funky instruction encoding makes it tricky to identify the correct
location to patch.
-- 
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650
johnl@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl
Massachusetts has over 100,000 unlicensed drivers.  -The Globe

bean@putter.sgi.com (Bean Anderson) (10/13/89)

In article <16557@siemens.siemens.com>, fwb@demon.siemens.com (Frederic
W. Brehm) writes:
> In article <9175@etana.tut.fi> pl@etana.tut.fi (Lehtinen Pertti) writes:
> >	I've been lately wondering if there is any architecture
> >	with possibility to execute instruction indirectly.
> >...
> >	I mean something like:
> >
> >		exec r0	; execute instructio in register r0
> >...
> >	Does anyone know?
> 

The HP300 (a discontinued machine) had an execute-top-of-stack instruction
that got considerable use -- mostly because certain instructions that had
immediate operands did not have corresponding instructions that had stack
(or register) operands.  I think the HP3000 had a similar instruction.

				Bean

slackey@bbn.com (Stan Lackey) (10/13/89)

In article <1989Oct12.184824.2465@esegue.segue.boston.ma.us> johnl@esegue.segue.boston.ma.us (John R. Levine) writes:
>In article <527@eedsp.gatech.edu> aaronb@lagrange (Aaron Birenboim) writes:
>>In article <1989Oct11.013553.3893@esegue.segue.boston.ma.us> I wrote:
>>>even the lowly 8086 can fetch up to 6 bytes ahead of where it is executing
>>>[so if you store into the next instruction it will probably execute the
>>>old version]
>>Is this really true?  There isn't enough dependancy checking in the 8086
>>instruction pipe to detect this type of operation, clear the pipe, and
>>re-fetch the altered instruction, or some such corrective measure.
>
>You bet there isn't.  Why waste all that logic to handle the .000001% of
>programs that would want to store into the next instruction?

All the PDP-11s that do prefetching or instruction caching DO check
for possible writes to the next instruction, and flush/refetch/etc. to
make it work, simply because it isn't specified NOT to work.  Even
PDP-11 compatibility mode on the VAXes does this.  Some shortcuts are
taken, like just comparing the 8? low bits of a data write to those of
the prefetched instructions; performance will be hurt a little, but
erring on the conservative side always works.

Why "waste" all that logic?  Because it's better for you in the long
run, if as much as possible of old software works on new hardware.
-Stan

colwell@mfci.UUCP (Robert Colwell) (10/13/89)

>You vendors of custom processors has better take note of this statement.  It
>is really amusing to see the Multiflow and Alliant fellows bragging at how
>well they are doing when single chip microprocessors run circles around their
>entire multiprocessor complexes.  The SUN-4 Sparcstation which did all this

I used to design micros, and I used to design workstations, too.
What they're capable of now, and where they're going in the 
performance space, isn't the constant source of surprise you imply.

I have yet to brag about anything here.  This is comp.arch, not 
comp.marketing.  Marketing and market success are peripheral issues
which I have no interest in pursuing in this forum.  Or do you think
that you have to make a lot of money off something before it can be
considered technically important?  

What interests me is building fast computers.  To that end, Multiflow
has contributed something original and innovative, and as far as I can
tell, multiple-instruction issue is a path that seems to be gaining
adherents, especially among the very micros I'm supposed to be terrified
of.  From a comp.arch point of view I think that's neat.

>"damage" is an absolute DOG compared to the most recent Sparc, MIPS, or i860
>chip sets.  No doubt a real terror from Motorola is not far away.  The lifetime
>of custom processors, even processors like the Cray line, are real short.
>Imagine the surprise of CRI executives when they realize that they are no
>longer losing market share to the Japanese super computer companies, and are
>now losing market share to microprocessors.  We are watching this happen
>internally at LLNL and do not doubt that it is happening elsewhere.

>There is no defense against the ATTACK OF THE KILLER MICROS!

We've been through this before.  When micros want to supplant the high-end
supers, they'll have to come up with answers to the age-old (well, ok, 
maybe 15 years old) problem of how to build a memory that will keep a fast
CPU fed.  Such a memory costs a Whole Lot Of Money.  When you've sunk a lot
of money into that memory, you can afford more CPU than you can buy off the
shelf.  There *is* some question about how fast one can turn out specialized
processors vs. how fast the micro guys can crank them out.  (It's clear who's
winning that race for certain data points such as Cray vs. Moto/Intel, 
though.)

But you do have a point (whether it's the one you intended, I don't know).
If LLNL personnel are getting their work done on workstations, maybe the
market for true supercomputer performance (for which you need the
kind of memory and I/O bandwidth I referred to above) isn't as big as 
people have always thought it was.  Or their workloads have migrated away
from the canonical vector codes to codes that no longer show off the big
supers to their best advantage?  If either of these is true, then you're
right, this presages a major shift in what hardware people buy.

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
31 Business Park Dr.
Branford, CT 06405     203-488-6090

mustard@sdrc.UUCP (Sandy Mustard) (10/13/89)

In article <1619@atanasoff.cs.iastate.edu>, hascall@atanasoff.cs.iastate.edu (John Hascall) writes:
> In article <9175@etana.tut.fi> pl@etana.tut.fi (Lehtinen Pertti) writes:
> 	    Well, how about the IBM 370 series.  (not exactly modern, I know)
> 	    If I recall correctly:
> 		       EX    mask,location
> 		       <next instruction>
>             location:  <target instruction>
> 	    Fetches the target instruction at "location" and ORs "mask" with
> 	    (some of) the target instruction and then executes the resulting
> 	    instruction.  Control resumes at "next instruction".
> John Hascall
> ISU Comp Center


To quote the 370/XA Principles of Operations manual,

"When the target instruction is a successful branching instruction, the
instruction address of the current PSW is replaced by the branch address
specified in the target instruction." 

thus control may resume at the target address of the branch instruction.

Sandy Mustard
mustard@sdrc.UU.NET

aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (10/13/89)

You might consider instructions of the form "Execute Register" or
"Execute Memory", where the primary instruction specified where a
(single) second instruction was to be fetched from, a form of
self-modifying code.

Certainly, these caused the same difficulties for the pipeline.

I have seen "Execute Register" used to synthesize dynamic shift
instructions, where the shift amount is not known at compile time
(dynamic shifts are much less frequent than static shifts in most
instruction mixes), and to produce the indirect syscall() system call.
"Execute Register" was also used to produce a relatively fast
instruction level simulator for a similar, but not identical,
instruction set processor.

--
Andy "Krazy" Glew,  Motorola MCD,    	    	    aglew@urbana.mcd.mot.com
1101 E. University, Urbana, IL 61801, USA.          {uunet!,}uiucuxc!udc!aglew
   
My opinions are my own; I indicate my company only so that the reader
may account for any possible bias I may have towards our products.

mcdonald@uxe.cso.uiuc.edu (10/13/89)

>>There is no defense against the ATTACK OF THE KILLER MICROS!

>We've been through this before.  When micros want to supplant the high-end
>supers, they'll have to come up with answers to the age-old (well, ok, 
>maybe 15 years old) problem of how to build a memory that will keep a fast
>CPU fed.  Such a memory costs a Whole Lot Of Money.  When you've sunk a lot


>Bob Colwell               ..!uunet!mfci!colwell
>Multiflow Computer     or colwell@multiflow.com

Prognostication:

When the need arieses (oh My, I'll leave that typo in. Our MIPS is named
Aries) the micro people will find a way to get the memory bandwidth
they will need. 4 megabit chips, itsy bitsy low capacitance wires,
a few ASICS, maybe even a few fiber optics, they will squeeze it into
the motherboard somehow. And it won't cost a Whole Lot of Money,
because it will be simple (outside the chips, of course!!) and
elegant.

Doug MCDonald

aubrey@rpp386.cactus.org (Aubrey McIntosh) (10/13/89)

In article <1989Oct12.162236.24239@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In article <1989Oct12.041940.5651@ginger.acc.com> art@salt.acc.com (Art Berggreen) writes:
>>I seem to recall one of the primary paper tape bootstraps for PDP-11s
>>used a loop which read a byte from the tape into the instruction loop.
>
>Bootstraps written before the days of decent-sized EPROMs often employ
>all kinds of dirty tricks to minimize size.  The classic was the toggle-in
>(i.e. manually entered, none of this wimpy ROM business :-)) bootstrap

I'm beginning to believe that every time I have an original idea, there must
be a hundred people working on exactly the same thing.  In my case, there
was a pdp8e connected to a mass spec, and a 16 bit machine in another
building that had the loadable programs.

When a comm packet of type 'execute there' passed its checksum,
the 'go loop' instruction was overwritten with the (successfully)
loaded code's start up address.  Saved lots of walking across campus
to re-toggle after a comm error. 



-- 
Aubrey McIntosh                  Freelance using Modula-2
                                 Real time, embedded, instruments.
Austin, TX 78723                 Enquiries welcome
1-(512)-452-1540                 aubrey%rpp386.Cactus.org@cs.utexas.edu

henry@utzoo.uucp (Henry Spencer) (10/14/89)

In article <46852@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>>>Is this really true?  There isn't enough dependancy checking in the 8086
>>>instruction pipe to detect this type of operation, clear the pipe, and
>>>re-fetch the altered instruction, or some such corrective measure.
>>
>>You bet there isn't.  Why waste all that logic to handle the .000001% of
>>programs that would want to store into the next instruction?
>
>All the PDP/11's that do prefetching or instruction caching DO check
>for possible writes to the next instruction, and flush/refetch/etc to
>make it work, simply because it isn't specified NOT to work...

I think that can fairly be considered an oversight by the 11's original
designers.  IBM did not make the same mistake on the 360 series.  If you
read the 360/370 "Principles of Operation" -- still a model of how to
document a processor -- you will find quite a bit of very carefully
phrased verbiage describing *exactly* what you can and cannot assume
about self-modifying code.  Even that is rather more generous than a
modern designer would be, though, since the 360 series arose at a time
when self-modifying code was still common practice.
-- 
A bit of tolerance is worth a  |     Henry Spencer at U of Toronto Zoology
megabyte of flaming.           | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/14/89)

Some of this discussion reminds me of discussions about threaded-code systems
like Forth.  Has anyone ever identified any ISA issues that are specific to
code like this?  I have always assumed that a fast-branching RISC would be
ideal, but I don't know if this is a correct assumption.
(I am not asking for a repeat of RISC arguments: any instruction has to show
at least a 1% speedup on an actual complete program to be considered for
special support.)

  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster
  NASA Ames Research Center  ARPA lamaster@ames.arc.nasa.gov
  Moffett Field, CA 94035     
  Phone:  (415)694-6117       

rcd@ico.ISC.COM (Dick Dunn) (10/14/89)

EGNILGES@pucc.Princeton.EDU (Ed Nilges) writes:
> I have followed this discussion with interest.  In modern very-high-
> level interpreted languages like REXX, an interesting variant of the
> SMC issue emerges...
>...   INTERPRET expression
>...evaluates "expression" and executes it as a REXX statement...

It's not a very new idea, though.  It exists in LISP (apply/eval) and
SNOBOL (compile()), both of which go way back.  It exists in the UNIX
shells (eval and ``).  The hardware analog, as others have noted, is the
EXECUTE instruction on some older machines.

It may be useful to make a distinction between self-modifying and "self-
generating", for want of a better word.

About objections to self-modification:
>      2.  It breaks associated technology.  Traditional SMC breaks
>          instruction caches and other such contraptions, whereas
>          use of INTERPRET breaks compilers.  This last point is not
>          a necessity, since the compiler could generate code for
>          lazy evaluation of the expression, but the only current
>          compiler for REXX simply prohibits INTERPRET.

In hardware, you just need a way to say "clean these things out of your
cache".  It shouldn't be hard, but it has to be provided.  It helps the
tradeoff at which generating code is practical if it isn't a privileged
instruction.
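
For concreteness, here is a minimal sketch of that step from C, assuming a
POSIX-style mmap() and a compiler that provides something like GCC's
__builtin___clear_cache(); copying a compiled C function this way is purely
illustrative and not portable:

/* Minimal sketch: copy some code into a buffer, flush the instruction
 * cache over that range, then jump to it.  Assumes POSIX mmap() and a
 * compiler offering __builtin___clear_cache(); copying a compiled C
 * function like this assumes it is position-independent and shorter
 * than CODE_SIZE, so treat it as illustration only.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define CODE_SIZE 128

static int add_one(int x) { return x + 1; }     /* the "generated" code */

int main(void)
{
    void *buf = mmap(NULL, CODE_SIZE, PROT_READ | PROT_WRITE | PROT_EXEC,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return 1;

    memcpy(buf, (void *)add_one, CODE_SIZE);    /* "generate" the code */
    __builtin___clear_cache((char *)buf, (char *)buf + CODE_SIZE);

    int (*fn)(int) = (int (*)(int))buf;
    printf("%d\n", fn(41));                     /* prints 42 if it worked */
    return 0;
}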

INTERPRET, or eval or compile or whatever, should not break a compiler.
There's no excuse for it--as Nilges says, you just have to handle it
correctly.
-- 
Dick Dunn     rcd@ico.isc.com    uucp: {ncar,nbires}!ico!rcd     (303)449-2870
   ...No DOS.  UNIX.

hedley@imagen.UUCP (Hedley Rainnie) (10/14/89)

Mention has been made of the DEC-10 and -20.  These machines also had the
feature that the registers were part of the address space, a very rare
feature indeed these days.  This allows one to code a string-search
algorithm so that the code resides in addresses 0..17 (octal, of course)
and then jump into the registers via an XCT instruction; the pleasure of
this is that the code will run at register speeds.  Nowadays cache
technology is such that this kind of coding is no longer needed, but a
little nostalgia is a good thing.

Hedley

{decwrl|sun}!imagen!iit!hedley
hedley@imagen.com
-- 
{decwrl!sun}!imagen!hedley
hedley@imagen.com

cliffhanger@cup.portal.com (Cliff C Heyer) (10/14/89)

>Could any experts out there educate me WHY and HOW does
>self-modifying code use?

The first program I ever wrote was self-modifying.  In 1976, for a class
assignment on the 8080, I wrote a ping-pong program with LEDs.  The rest
of the class wrote code for each direction, but I wrote one "shell"
routine and then inserted instructions into it after calculating the
"pong" direction.  My program was about 1/5 as long as all the others,
which was important when you had only 1 or 2K of memory.  (The teacher
had never heard of doing such a thing, and I had to convince him I
deserved an A+.)

>What the advantage of using self-modifying code that non-self-modifying code
>cannot achieve?

On DECsystem-10s and -20s I wrote code to check the serial number so that
people who stole the code could not run it (easily) on their processor.
But if you put the instructions in the code itself, a smart hacker could
remove them from the EXE file or insert a jump statement to skip over
them.  So I generated the check at run time by moving "numbers" to memory
and then executing them.  This made cracking the security much more
difficult.  I moved numbers to registers, bit-shifted them to make the
instruction, then moved the PC to the register.  (God forbid trying this
on a VAX!)

>Is there any compiler which will generate code that
>self-modified?

If you mean C or BASIC, I'm sure there are some "hacker" compilers that
allow you to play, but in general you have no way of knowing what address
locations your C compiler will create, or even what instructions your C
compiler will generate, so how can you write C statements that will
modify something that does not yet exist?  You have to use machine code,
since then you know what is where.  You can do it with any assembler as
far as I know, but you have to fuss with the OS and/or compile switches
to actually compile and run the code and suppress error messages.

>A small and useful example of self-modifying will be
>very helpful.

Read above.

baum@Apple.COM (Allen J. Baum) (10/14/89)

[]
>In article <4534@imagen.UUCP> hedley@imagen.UUCP (Hedley Rainnie) writes:
>
>Mention has been made of the Dec-10 and 20. These machines also had the
>feature that the registers were part of the address space, a very rare
>feature indeed these days, this allows one to code a string search algorithm
>and have the code reside in addresses 0..17 (octal of course), and then to
>jump into the registers via an XCT instruction, the pleasure of this is that
>the code will run at register speeds.

Except on the newer DEC-20s, it ran slower out of the register file than 
out of real memory. oops.



--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

dworkin@Solbourne.COM (Dieter Muller) (10/14/89)

In article <35636@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>[]
>>In article <4534@imagen.UUCP> hedley@imagen.UUCP (Hedley Rainnie) writes:
>>
>>[running out of the registers]
>
>Except on the newer DEC-20s, it ran slower out of the register file than 
>out of real memory. oops.

Well, it depends on how you define `real memory'.  Registers are in
`fast memory'.  Quoting from the Hardware Reference Manual, Volume I
-- Central Processor, page 1-24:

	Fast memory times are for referencing as a memory location for
	an operand; a fast memory instruction fetch takes slightly
	more time than a cache access.

And, on the next page, a nice table of access times.  The relevant
entries are:

	Memory		Read/Write

	MA20 Core	.833/.40
	KL10 Fast	.067/.067
	KL10 Cache	.133/.133

Times are in microseconds for a single word access.  So, if you had
cache installed, and your code happened to get placed in it (at the
whim of the monitor), then yes, code from memory would run faster
than code in registers.

However, if you were a financially-strapped site that decided not to
invest in such new-fangled things like cache, code would be much
faster in the registers.

	Dworkin
-- 
No, I'm this way normally.  Why do you ask?
boulder!stan!dworkin  dworkin%stan@boulder.colorado.edu  dworkin@solbourne.com
Flamer's Hotline: (303) 678-4624 (1000 - 1800 Mountain Time)

gary@dgcad.SV.DG.COM (Gary Bridgewater) (10/14/89)

In article <16557@siemens.siemens.com> fwb@demon.UUCP (Frederic W. Brehm) writes:
>In article <9175@etana.tut.fi> pl@etana.tut.fi (Lehtinen Pertti) writes:
 >>	I've been lately wondering if there is any architecture
 >>	with possibility to execute instruction indirectly.
 >>...
 >>	I mean something like:
 >>
 >>		exec r0	; execute instructio in register r0
 >>...
 >>	Does anyone know?
 >
 >The Data General Eclipse (16-bit) had an instruction just like this.  I
 >don't know if the MV series has one like it.

 The MV instruction set includes the Eclipse instruction set which includes
 the Nova instruction set. Assembly routines written for the first Nova
 run on the latest MV.
 The most useful purpose, on our I/O architecture, for the XCT instruction
 is to write self-modifying re-entrant I/O handlers:
	load accumulator 0 with an I/O instruction  with a 0 device code
	OR in the device code of interest
	XCT accumulator 0 
So you have one generic disk handler for each type of disk which can handle
any number of controllers.
You can also write a fairly tight loop via
	load accumulator 0 with an XCT 0
	XCT 0
which was kind of an interesting thing to watch back when we had lights - the
u-code address lights came on fairly solid and the address lights froze. This
can be interrupted but the cpu can't be halted since halt is only honored 
between instructions.
-- 
Gary Bridgewater, Data General Corp., Sunnyvale Ca.
gary@sv4.ceo.sv.dg.com or 
{amdahl,aeras,amdcad,mas1,matra3}!dgcad.SV.DG.COM!gary
No good deed goes unpunished.

brooks@vette.llnl.gov (Eugene Brooks) (10/15/89)

>>There is no defense against the ATTACK OF THE KILLER MICROS!

>We've been through this before.  When micros want to supplant the high-end
>supers, they'll have to come up with answers to the age-old (well, ok, 
>maybe 15 years old) problem of how to build a memory that will keep a fast
>CPU fed.  Such a memory costs a Whole Lot Of Money.  When you've sunk a lot


>Bob Colwell               ..!uunet!mfci!colwell
>Multiflow Computer     or colwell@multiflow.com

Interleaving directly on a memory chip is absolutely trivial.  The memory
chip makers will do it as soon as there is a mass-market for it.  Of course
the supercomputer vendors will be able to make use of these hot memory chips,
but their slow CPUs just won't be fast enough to keep up with microprocessors.

brooks@maddog.llnl.gov, brooks@maddog.uucp

bart@videovax.tv.tek.com (Bart Massey) (10/15/89)

I can't resist throwing in my favorite example of self-modifying code.  A
friend and I were studying how the stack looks after a panic on a VAX 785
running 4.3BSD UNIX, and noticed that all the registers are saved by
_panic.  OK, we said to ourselves, that makes sense.  Let's look at the
assembly for panic:

	$ adb /vmunix /dev/kmem
	_panic+2,5?ia
	_panic+2:	subl2	$8,sp
	_panic+5:	cvtwl	$100,-4(fp)
	_panic+b:	tstl	_panicstr
	_panic+11:	beql	_panic+19
	_panic+13:	bisl2	$4,-4(fp)
	_panic+17:
	
Hmm, no register saves there -- ah, I know!  The regmask, at _panic, must
be set to all ones using loader magic or a sed script or something.

	_panic?x
	_panic:
	_panic:		0

At this point, we began to _panic :-).  To make a long story short, sure
'nuf, after only a half-day of thinking about it, we tried

	_panic/x
	_panic:
	_panic:		fff

and found in sys/vax/locore.s

	/* trap() and syscall() save r0-r11 in the entry mask (per ../h/reg.h) */
	/* panic() is convenient place to save all for debugging */
	bisw2	$0x0fff,_trap
	bisw2	$0x0fff,_syscall
	bisw2	$0x0fff,_panic

Sigh.  Of course.  Since panic() is written in C, there's no clean way to set
its regmask, perhaps not even with an asm(), so...

					Bart Massey
					..tektronix!videovax.tv.tek.com!bart
					..tektronix!reed.bitnet!bart

jba@harald.ruc.dk (Jan B. Andersen) (10/16/89)

>In article <9175@etana.tut.fi> pl@etana.tut.fi (Lehtinen Pertti) writes:
>}From article <6481@pt.cs.cmu.edu>, by koopman@a.gp.cs.cmu.edu (Philip Koopman):
>  
>}	I've been lately wondering if there is any architecture
>}	with possibility to execute instruction indirectly.
> 
>}	I mean something like:
>}
>}		exec (r0)	; execute instruction pointed by r0
> 
>}	Does anyone know?

Data General's MV Family has an 'execute accumulator' instruction. Very
useful when implementing a debugger.

don@gp.govt.nz (Don Stokes, GPO) (10/16/89)

In article <672@sce.carleton.ca>, cassel@sce.carleton.ca (Ron Casselman) writes:
> I seem to remember a piece of self-modifying code that used to run
> on CP/M. I remember it because I had to disassemble it as an assignment for
> a course in systems software. The program would copy itself to a different
> part of memory and change all address references as it did so. It would
> then load a second program into the original location. (all programs
> under CP/M started at a fixed location). The second program would then
> begin to run. Sorry I cannot remember any more details as to the purpose
> of this self-modifying code.

Sounds a little like DDT (Dynamic Debugging Tool).  If you type DDT
<program>, DDT starts itself up at the bottom of the TPA (at this stage,
DDT is just a normal COM file), slides itself into high memory, and loads
<program> from disk into the bottom of the TPA so that the program
doesn't know that DDT is there.  Since DDT has to run in many different
memory sizes, it would have to relocate itself dynamically (this is based 
on observation of my trusty Kaypro II, not an actual study of the code).

Don Stokes  ZL2TNM  /  /                                 vuwcomp!windy!gpwd!don
Systems Programmer /GP/ Government Printing Office       PSI%0530147000028::DON
__________________/  /__Wellington, New Zealand__________don@gp.govt.nz________
Those who believe their systems are idiot-proof just do not realise how
ingenious idiots can be. 

elwin@Athena.MIT.EDU (Lee W Campbell) (10/16/89)

>>There is no defense against the ATTACK OF THE KILLER MICROS!
>
>We've been through this before.  When micros want to supplant the high-end
>supers, they'll have to come up with answers to the age-old (well, ok, 
>maybe 15 years old) problem of how to build a memory that will keep a fast
>CPU fed.  Such a memory costs a Whole Lot Of Money.  When you've sunk a lot
>  ...

Reminds me of the old joke about the Cray 1, with about 4 megawords of
fast static ram: if you buy the memory, for $1 per word (not an unreasonable
price in those days), Cray would throw in the CPU for free!

      .===.       \|/			-	   This Space		-
     { Max }      AKA     Lee		-	 Unintentionally	-
   H-headphones   /|\     Campbell	-	   Left Blank		-

roelof@idca.tds.PHILIPS.nl (R. Vuurboom) (10/17/89)

In article <480@gp.govt.nz> don@gp.govt.nz (Don Stokes, GPO) writes:
>In article <672@sce.carleton.ca>, cassel@sce.carleton.ca (Ron Casselman) writes:
>> I seem to remember a piece of self-modifying code that used to run
>> on CP/M. I remember it because I had to disassemble it as an assignment for
>
>Sounds a little like DDT (Dynamic Debugging Tool).  If you type DDT
><program> DDT starts itself up at the bottom of the TPA (at this stage,
>DDT is just a normal COM file), slides itself into high memory and loads

Didn't DDT use relative addressing? Something vaguely tells me that DDT
could (block) copy itself to another address. If it used relative 
addressing then I don't think it needed to self-modify.

Who would have thought that SMC would've been a me-too thread :-)?
Anyways heres my me too contribution:

Many moons ago I had a cute little Z80 micro called a Jupiter Ace, which
had those elegant little rubber keypads specially designed for the
one-finger, push-hard typist.
For those of you who have never heard of this magnificent machine: 
it was designed by the same guys who did the ZX-81. They left 
Sinclair's company, founded their own startup company and neither they nor 
the Jupiter Ace were ever heard of again. But I'm digressing...

The Jupiter Ace had a Forth interpreter in ROM (8K - of which 2K was a
character set generator - oh for the good old days...). What I wanted
to do was use the interpreter to trace through itself at the Z80 instruction
level. Now the Z80 didn't have any support for trace instructions on
the one hand and I didn't want to write code to interpret the 600 odd
cases that the Z80 had on the other. 

I came up with the following little routine:

		save real registers/condition codes 
		load registers/condition codes from emulated registers

		nop
		nop
		nop
		nop

		save registers and condition codes to emulated registers
		load real register/condition code values


By loading the instruction I wanted to trace in the area with nop's
and calling the routine I could let the Z80 "calculate" the condition codes 
and the changed registers for me. The next step was to calculate the
new pc and load the following instruction to be traced.

Of course, this was pretty primitive. It wouldn't work for branches
(but I could determine if a branch would take place and set the 
emulated pc accordingly and continue from there). Nor would it
work for io instructions :-) but they didn't occur often anyway.

Primitive or no it helped me trace through the interpreter. 
Wonder if the Ace is in the attic somewhere...:-)
 
-- 
Artificial Intelligence: 
When computers start selling stocks because its Friday the 13th....
Roelof Vuurboom  SSP/V3   Philips TDS Apeldoorn, The Netherlands   +31 55 432226
domain: roelof@idca.tds.philips.nl             uucp:  ...!mcvax!philapd!roelof

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/17/89)

  You *may* be thinking of the CP/M system generation.  Since the o/s was
written in absolute assembler, it was shipped to run in a small (maybe
16K) system.  When you created a larger version, there was the original
o/s plus a list of locations which needed to be changed.  The builder
took a copy of the o/s and added a constant to each of the locations in
the list.
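
A minimal sketch of that fix-up step, in C rather than on the 8080, with the
image and fix-up-list layout invented for illustration:

/* Sketch of the relocation step: add a constant to each byte of the
 * image that the relocation list says holds (the high byte of) an
 * address.  The data layout here is invented for illustration.
 */
#include <stddef.h>

void relocate(unsigned char *image,          /* copy of the o/s image   */
              const size_t *fixups,          /* offsets needing fix-up  */
              size_t nfixups,
              unsigned char page_offset)     /* pages to move it up by  */
{
    size_t i;
    for (i = 0; i < nfixups; i++)
        image[fixups[i]] += page_offset;     /* bump the high address byte */
}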

  Then again you may be thinking of something I've happily forgotten...

-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called
'reason' in the face of the church and common sense. Any fool can see
that the world is flat!" - anon

baum@Apple.COM (Allen J. Baum) (10/17/89)

Reply-To: baum@apple.UUCP (Allen Baum)
Distribution: net
--------
[]
>In article <274@ssp1.idca.tds.philips.nl> roelof@idca.tds.PHILIPS.nl (R. Vuurboom) writes:
>
>I came up with the following little routine:
>
>		save real registers/condition codes 
>		load registers/condition codes from emulated registers
>
>		nop
>		nop
>		nop
>		nop
>
>		save registers and condition codes to emulated registers
>		load real register/condition code values
>
>
>By loading the instruction I wanted to trace in the area with nop's
>and calling the routine I could let the Z80 "calculate" the condition codes 
>and the changed registers for me. The next step was to calculate the
>new pc and load the following instruction to be traced.
>
>Of course, this was pretty primitive. It wouldn't work for branches
>(but I could determine if a branch would take place and set the 
>emulated pc accordingly and continue from there). Nor would it
>work for io instructions :-) but they didn't occur often anyway.

The original Apple II monitor ROM had a single step/trace routine that
did exactly that. I think for branches I just replaced the offset,
so it always branched to a fixed place, and updated the emulated PC
with the offset if it did.


--
		  baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum

mjones@stdc01.UUCP (Michael Jones) (10/17/89)

  In article <527@eedsp.gatech.edu> aaronb@lagrange.UUCP (Aaron Birenboim) writes:
  >johnl@esegue.segue.boston.ma.us (John R. Levine) writes:
  >>...  Most computers fetch
  >>instructions considerably ahead of where they are executing; even the lowly
  >>8086 can fetch up to 6 bytes ahead of where it is executing.  This means that
  >>if you store into the next instruction, the CPU might or might not already
  >>have fetched that instruction, so it might execute the old instruction or ...

  >Is this really true?  There isn't enough dependancy checking in the
  >8086 instruction pipe to detect this type of operation, clear
  >the pipe, and re-fetch the altered instruction, or some such
  >corrective measure.  I'm glad I don't try self-modifying code.

It is most absolutely true. Some of the early copy protect schemes (I think it
may have been Lotus 1.0) used this sort of thing to disarm trace or single-step
breaking of their code. If you ran normally, the CPU did not see the write into
"here + 1", but if you were single stepping, or had an ICE, or whatever, then it
did. This happened several times in the code, with each false branch sending you
to a pile of "mystery code".

-- 
-- Michael T. Jones          Email:            ...!mcnc!rti!stdc01!mjones --
-- The wise man will pursue  Paper: 3101-H Aileen Drive, Raleigh NC 27606 --
-- excellence in all things  Voice: W:(919)361-3800  and  H:(919)851-7979 --

rec@sisyphuscpd4 (Robert Cousins) (10/18/89)

In article <480@gp.govt.nz>, don@gp.govt.nz (Don Stokes, GPO) writes:
> In article <672@sce.carleton.ca>, cassel@sce.carleton.ca (Ron
Casselman) writes:
> > I seem to remember a piece of self-modifying code that used to run
> > on CP/M. I remember it because I had to disassemble it as an assignment for
> > a course in systems software. The program would copy itself to a different
> > part of memory and change all address references as it did so. It would
> > then load a second program into the original location. (all programs
> > under CP/M started at a fixed location). The second program would then
> > begin to run. Sorry I cannot remember any more details as to the purpose
> > of this self-modifying code.
> 
> Sounds a little like DDT (Dynamic Debugging Tool).  If you type DDT
> <program> DDT starts itself up at the bottom of the TPA (at this stage,
> DDT is just a normal COM file), slides itself into high memory and loads
> <program> from disk into the bottom of the TPA so that the program
> doesn't know that DDT is there.  Since DDT has to run in many different
> memory sizes, it would have to relocate itself dynamically (this is based 
> on observation of my trusty Kaypro II, not an actual study of the code).
> 
> Don Stokes  ZL2TNM  /  /                                
vuwcomp!windy!gpwd!don
> Systems Programmer /GP/ Government Printing Office      
PSI%0530147000028::DON
> __________________/  /__Wellington, New
Zealand__________don@gp.govt.nz________
> Those who believe their systems are idiot-proof just do not realise how
> ingenious idiots can be. 

Actually, there was a whole family of relocating programs under the CP/M
family of operating systems.  A common trick was to use PRL (Page ReLocatable)
images.  These were created by assembling (!) the program at two different
origins (typically 0 and 100h) and taking the difference between the two
images.  The difference was used to create a bitmap which the loading
program could use to fix up the changed bytes for the final location.  Code
of this form could be executed at any page (page = 256 bytes) boundary.
MP/M codified this for running two programs in one 48K bank.  One partition
began at 0000h (and programs loaded at 100h) while the other began at a
higher location.  COM and PRL files could run at 100h, but only PRL files
could be run at the higher addresses.
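
The bitmap-building step is easy to sketch; here it is in C, with the names
and layout invented purely for illustration:

/* Sketch: given the same program assembled at page 0 and at page 1
 * (origins 0 and 100h), every byte that differs by exactly one is the
 * high byte of an address and gets a bit in the relocation map.  The
 * loader later adds the real load page to each flagged byte.
 */
#include <stddef.h>

void build_bitmap(const unsigned char *img0,   /* assembled at 0     */
                  const unsigned char *img1,   /* assembled at 100h  */
                  size_t len,
                  unsigned char *bitmap)       /* len/8 bytes, zeroed */
{
    size_t i;
    for (i = 0; i < len; i++)
        if (img1[i] == (unsigned char)(img0[i] + 1))   /* one page higher */
            bitmap[i >> 3] |= (unsigned char)(0x80 >> (i & 7));
}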

The process of configuring the kernel involved relocating it by this same
technique. Some companies used it for working with their BIOS also.

The technique was never highly popular, however, since people rarely needed
to change the base address of their code but were unable to relink it.

I personally preferred the TurboDOS operating system.  It was compatible with
CP/M but approximately 6 times faster on file operations, multiuser and
multiprocessor.  I still have a TurboDOS machine at home.

Robert Cousins
Dept. Mgr, Workstation Dev't.
Data General Corp.

Speaking for myself alone.

bls@cs.purdue.EDU (Brian L. Stuart) (10/18/89)

In article <33557@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov (Hugh LaMaster) writes:
>Some of this discussion reminds me of discussions about threaded-code systems
>like Forth.  Has anyone ever identified any ISA issues that are specific for
>code like this?  I have always assumed that a fast branching RISC would be
>ideal, but I don't know of this is a correct assumption.
>
>  Hugh LaMaster, m/s 233-9,  UUCP ames!lamaster

This would in fact be quite useful in some Forth implementations, but those
implementations really do not utilize self-modifying (or compile-on-the-fly)
code.  The Forth compiler I wrote (which did compile-on-the-fly and some
really strange stack manipulations) would have benefited most from a VERY
fast and simple subroutine linkage.  (What I had wasn't really bad at all;
it was a PDP-11.)  The reason for this is that the executable code I generated
consisted completely of JSR's, MOV #xxx,(R5)+'s, and RTS's.  (The R5 being
because that was my data stack pointer.)  The primitives were coded in
PDP-11 assembler and would have been well served by any standard RISC,
though some of the 11's addressing modes were handy.
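
For anyone who hasn't run into threaded code, here is the flavor of it in C --
an indirect-threaded inner loop with invented names, not the subroutine-threaded
JSR style described above -- which shows why the whole scheme lives or dies on
how cheap the dispatch is:

/* Flavor-of-Forth sketch: the "compiled" word is just an array of cells
 * holding primitive addresses and inline literals, and NEXT is "fetch
 * the next cell and jump through it".  Casting between function pointers
 * and integers is not strictly portable C, but it is the essence of
 * threaded code.
 */
#include <stdio.h>
#include <stdint.h>

typedef void (*prim)(void);

static const intptr_t *ip;      /* threaded-code "instruction pointer" */
static long stack[32];          /* data stack */
static int  sp;

static void lit(void) { stack[sp++] = (long)*ip++; }     /* push inline literal */
static void add(void) { sp--; stack[sp-1] += stack[sp]; }
static void dot(void) { printf("%ld\n", stack[--sp]); }
static void bye(void) { ip = NULL; }                      /* stop the inner loop */

int main(void)
{
    /* the "compiled" program:  2 3 + .  */
    const intptr_t program[] = {
        (intptr_t)lit, 2,
        (intptr_t)lit, 3,
        (intptr_t)add, (intptr_t)dot, (intptr_t)bye
    };

    for (ip = program; ip != NULL; )
        ((prim)*ip++)();        /* NEXT: fetch the next cell, dispatch through it */
    return 0;
}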

Brian L. Stuart
Department of Computer Science
Purdue University

newbery@rata.vuw.ac.nz (Michael Newbery) (10/18/89)

In article <1989Oct12.162236.24239@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>Bootstraps written before the days of decent-sized EPROMs often employ
>all kinds of dirty tricks to minimize size. 
 describes PDP8 bootstrap as...
>..  two instructions.  The first was "kick disk
>controller" (the controller handily came up in a fairly reasonable state
>after bus reset).  The second was "branch to self".[...]
Eons ago we had an IBM 1130; to boot it, you inserted a special binary
punched card in the reader and pressed "Program Load".  The 1130 was a
16-bit machine with an instruction format like (I think, it's been a while)
 opcode  Long Indirect Index_Reg rest
[4 bits]  [1]    [1]       [2]   [8]
But an 80-column card has only 12 rows, so the load read 12 bits into each
16-bit word as [opcode] [4 zeros] [rest].  All very well, except that the XIO
(eXecute Input Output) instruction had to be Long, so the bootstrap program
spent its first few instructions frantically modifying itself to create the
required instructions.
The IBM-supplied card used most of the 80 columns, and a friend of mine
decided he could do better.  The result was a program that modified itself
to create an XIO to read in sector 0 of the disk, which it then
executed... at about which point the program ended.  Sector 0 was read in,
containing, among other things, the interrupt handling routines for the
interrupt that resulted when the I/O completed, and some code that was in
the right place by the time the bootstrap program reached it...  The 1130
was not noted for its speediness.  The new bootstrap card only needed about
1/3 of a card.

--
Michael Newbery<newbery@rata.vuw.ac.nz> (...!uunet!vuwcomp!newbery if you must)
Abstain from wine, women and song. Mostly song.

jthomas@nmsu.edu (543) (10/19/89)

In article <1148@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) writes:

bill>   You *may* be thinking of the CP/M system generation. Since the o/s was
bill> written in absolute assembler it was shipped to run in a small (maybe
bill> 16k) system. When you created a larger version there was the original
bill> o/s and a list of locations which needed to be changed. The builder
bill> took a copy of the o/s and added a constant to each of the locations in
bill> the list.

And CP/M got that (too) from RT-11 which did it at initialization and got
it from .... ::-{})

Jim Thomas

toddpw@tybalt.caltech.edu (Todd P. Whitesel) (10/19/89)

This isn't a me too, but it is an early, and once widespread, piece of code.

The original Apple // DOS 3.3 (and its predecessors) came on a "DOS System
Master" diskette with a bunch of nifty utilities and demos. When you booted a
"Master" diskette (you could make one with a utility on the System Master disk)
it would load in from $1D00-3FFF (i.e. 16K) and figure out how much memory
there was. Then it would copy itself up to the top of RAM, via a relocator that
took about 1/2K on the disk (I don't know exactly how long it was). If you used
the INIT command to format a disk, it would be a "Slave" diskette because the
bootable DOS image was already relocated (INIT saved a copy of the DOS to
disk when it was finished).

One pirating technique was to remove the top row of DRAMs and boot a master,
forcing it to load in 32K.  Then you INITed a disk with that DOS.  After
replacing the DRAMs you booted your favorite 48K copy-protected game (which
almost always used a heavily-modified copy of DOS) and somehow got it to reboot
the 32K slave disk.  The slave thinks you only have 32K, so it doesn't overwrite
the copy-protected DOS above it in memory, and you could examine the patched
DOS at your leisure or even save its routines to disk.  (We had programs which
used the copy-protected image in 48K and internal "standard routines" to copy
a protected disk's files onto a normal disk; this was called "DeMuffin", which
was a pun on the 13-to-16-sector converter on the System Master called Muffin.)

ain't nostalgia great?

Todd Whitesel
toddpw @ tybalt.caltech.edu

davecb@yunexus.UUCP (David Collier-Brown) (10/19/89)

  The ICL System 10 had an interesting boot (which used no self-modifying
code whatsoever (:-)).   Zero, as an instruction, was interpreted as

	opcode	  operand   modifier
	======	  =======   ========
	disk_load address_0,sector_0

and was generated when one cleared the screen (!) and pressed the execute-
immediate button.

  Poof! The bootstrap appears.

  This was less desirable when someone tried to execute data, of course,
but did show a fine bit of subtlety in assigning opcodes.

--dave
-- 
David Collier-Brown,  | davecb@yunexus, ...!yunexus!davecb or
72 Abitibi Ave.,      | {toronto area...}lethe!dave 
Willowdale, Ontario,  | Joyce C-B:
CANADA. 416-223-8968  |    He's so smart he's dumb.

herndon@umn-cs.CS.UMN.EDU (Robert Herndon) (10/20/89)

Another very common use for self-modifying code that I have not
seen mentioned at all is bootstraps.  These are typically a block
of code copied into low/high memory at start-up time.  On most
machines, excepting the many workstations and micros that have
the bootstrap in ROM, the bootstrap is usually loaded where the
operating system is eventually supposed to go.

  The bootstrap, then, is usually forced to copy itself elsewhere
before reading in the operating system.  Since it typically doesn't
know how big memory is, it often opts to use heuristics to find out,
then copies itself to high/low memory, relocating itself in the process.
[No pun intended.]  If the machine supports position-independent-code,
this is an ideal place to use it.  If it doesn't, one has to diddle
instructions.  Either way, the bootstrap has to know enough about
itself to move itself somewhere and resume execution there (like
some of the 'core-wars' creatures described in Scientific American).
The PDP-11 block 0 bootstrap worked this way with position
independent code.  Since the PDP-11 supported position independent
code, it didn't have to diddle instructions.  It was easy, however,
to forget to strip the header (8 words) off.  The original UNIX
"magic number" is a tribute to this fact -- a 407 instruction
[the original magic number] would jump over the header, and the
bootstrap was off and running in spite of your omission.

  Another option, seen on some machines with instruction caches,
is to read the whole operating system into memory at some location
not overlapping the bootstrap.  At this point, all that has to
be done is to move the OS to the right location in memory and
jump to its start location.  The code to do this is typically
only a few instructions, and can be fit into a/the instruction
cache.

				Robert Herndon
				herndon@umn-cs.cs.umn.edu
-- 
Robert Herndon				Dept. of Computer Science,
...!ihnp4!umn-cs!herndon		Univ. of Minnesota,
herndon@umn-cs.ARPA			136 Lind Hall, 207 Church St. SE
herndon.umn-cs@csnet-relay.ARPA		Minneapolis, MN  55455

eob@cbnewsk.ATT.COM (eamonn.j.o'brien) (10/22/89)

In article <23062@cup.portal.com>, cliffhanger@cup.portal.com (Cliff C Heyer) writes:
> in general you 
> have no way of knowing [program] address locations that your C 
> compiler will create

... except for pointers to functions.


The following is my attempt at self modifying C code:

--------------------------------------------------

#include <stdio.h>

foo()
{
        printf( "foo()\n" );
}

endfoo(){}      /* marker: assumes the compiler puts it right after foo() */

bar()
{
        printf( "bar()\n" );
}

endbar(){}      /* marker for the end of bar() */

main()
{
        char* foo_ptr    = (char*)foo;
        char* endfoo_ptr = (char*)endfoo;
        char* bar_ptr    = (char*)bar;
        char* endbar_ptr = (char*)endbar;

        printf("before:\n");
        foo();
        bar();

        /* copy the machine code of bar() over foo(), byte by byte */
        printf("modifying:\n");
        for( ; bar_ptr < endbar_ptr; ++foo_ptr,++bar_ptr )
                *foo_ptr = *bar_ptr;

        printf("after:\n");
        foo();
        bar();
}

----------------------------------------------------

It calls foo() and bar(), overwrites foo() with bar(), and calls
foo() and bar() again. However on my system (AT&T Unix PC --
68000) it coredumps after printing "modifying:". Perhaps the
program memory is read-only.
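
For what it's worth, on systems that provide mprotect(2) one can sometimes ask
for the pages holding foo() to be made writable before the copy; a sketch,
assuming a POSIX-ish system (the kernel may still refuse for the text segment):

/* Sketch only: round the address down to a page boundary and ask for
 * write (and execute) permission on the pages holding the target.
 * Assumes POSIX mprotect() and sysconf(); many systems will say no.
 */
#include <unistd.h>
#include <sys/mman.h>

static int make_writable(void *addr, size_t len)
{
    unsigned long page = (unsigned long)sysconf(_SC_PAGESIZE);
    char *start = (char *)((unsigned long)addr & ~(page - 1));
    char *end   = (char *)addr + len;
    return mprotect(start, (size_t)(end - start),
                    PROT_READ | PROT_WRITE | PROT_EXEC);
}

Called on foo() with the length of bar() before the copy loop, it may (or may
not) turn the core dump into the intended overwrite.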
-- 
Eamonn O'Brien          ohm!eob      eob@ohm.att.com
~~~~~~~~~~~~~~
I'm not speaking for the company, I'm just losing my mind.

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (10/23/89)

  I was feeling smug about not having written any self-modifying code in
years when I remembered that I HAD written a program which modifies the
o/s in memory.

  It seems that people were posting patches for every known version of DOS
to make ^W and ^U work as they do in UNIX.  I wrote a small
program to search the core image, find the location for the patch and
the patch area, and apply the fix.  This avoids having a hacked version
of the o/s on a disk, which always worries people in these virus-ridden
days.

  It's called CTLENABL, and it works on DOS 2.0-4.0.  DOS users usually run
it in their autoexec.  Now if DOS only included the source...

-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools. They blindly follow their so-called
'reason' in the face of the church and common sense. Any fool can see
that the world is flat!" - anon

honig@ics.uci.edu (David A. Honig) (10/24/89)

I've programmed a high-performance graphics system (an IMI 455) with
techniques that exploit stored-program advantages.  To program the
thing, you write C-programs on a Unix host which write programs into
"display list memory".  These aren't compilers per se, they compute
(eg, from physical models) what kind of display commands to give to
the display generator, itself a microcoded cpu.

One way of using this thing for animation is to have the C-program
change the display-generator's program as it is being read and
interpreted by the display-generator.

Why use this technique?  It's actually simplifying, given the existing
complex architecture needed to obtain the high performance.  And it
avoids some of the problems that self-referencing, self-modifying code
has, since the C-program doesn't modify itself.
--
David A Honig

meissner@dg-rtp.dg.com (Michael Meissner) (10/24/89)

In article <173@harald.UUCP> jba@harald.ruc.dk (Jan B. Andersen) writes:
>  Data General's MV Family has an 'execute accumulator' instruction. Very
>  useful when implementing a debugger.

The execute accumulator instruction is not as useful for implementing
a debugger as you might think.  First of all, it burns a register, and
since the machine only has four integer registers, this can be deadly,
particularly, since some of the character move instructions use all
four integer accumulators.  Second of all, if the instruction takes
more than one 16-bit word, the remaining words of the instruction are
the 16-bit words following the XCT instruction, which isn't terribly
useful.

About the only time I have ever seen the XCT used is for device
drivers to build I/O instructions, with the appropriate device code
inserted into the instruction.

The language runtimes, when they wanted to build instructions on the
fly, would build them on the stack (if running under AOS, AOS/VS or
RDOS) or in low memory (if running under native UNIX), insert a long
jump back to the next location, and jump to the location where the
instructions were built.

One place where self-modifying code was used was in the general
purpose SYS function under RDOS, AOS, and AOS/VS, which typically took
four arguments: an integer giving the system call number, and pointers
to three integer-sized items that were copied to/from three
accumulators.  Except for native UNIX, system calls consisted of a
call instruction, followed by the system call number, error branch
location, and normal branch location.  The general SYS function would
have to build such a call on the fly.  Another place was generalized
functions that executed a user-specified instruction.

In terms of debugger support, there is a BKPT instruction with which the
debugger replaces the first 16 bits of the instruction.  When a BKPT
instruction is encountered, the registers are pushed, and the machine
jumps to the user breakpoint handler (most traps on the MV do not go
directly into the kernel, but jump to user trap handlers).  The
handler can then do whatever it wants, and it uses either the WPOPB
instruction to return if the BKPT instruction has been removed, or the
PBX instruction if the breakpoint is still active (Ac0 contains the
first 16-bits of the instruction to execute).
--

Michael Meissner, Data General.				If compiles where much
Uucp:		...!mcnc!rti!xyzzy!meissner		faster, when would we
Internet:	meissner@dg-rtp.DG.COM			have time for netnews?

yodaiken@freal.cs.umass.edu (victor yodaiken) (11/06/89)

In article <2630@gandalf.UUCP> carr@gandalf.UUCP (Dave Carr) writes:
>In article <1080@mipos3.intel.com>, jpoon@mipos2.intel.com (Jack Poon~) writes:
...
>> What the advantage of using self-modifying code that non-self-modifying code
>> cannot achieve?
>
[stuff about 80186 i/o deleted]

>Another thing that pops to mind, is when your software may contain lots of 
>options, and spends most of its time testing option flags and skipping
>around unused options.  Why not have the software create the software
>without the options in!  You would need the software in a run-time locatable
>format, but this is certainly feasible.
>

The Synthesis operating system, by Henry Massalin and Calton Pu of
Columbia University, is an amazing example of this method.  System calls and
system operations which will be performed repeatedly are compiled on the
fly.  For example, when a process "opens" a file, the system generates code
for reading and writing that particular file, eliminating several layers of
switches.  Context-switch code is also compiled for each process.  Basically,
code generation is used to do constant propagation.  It would be interesting
to see how this approach might do on a machine with a deep pipeline.  Probably
OK, since compilation saves hundreds of instruction executions.
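
(A crude C rendering of the idea, with invented names: Synthesis emits real
machine code, but even just choosing a specialized routine once at open time,
instead of re-testing options on every read, captures the constant-propagation
flavor.)

/* Sketch of open-time specialization: instead of one generic read path
 * that tests flags on every call, open_file() picks a routine with those
 * decisions already "compiled in".  Synthesis goes much further and
 * generates fresh machine code per file; error checking omitted here.
 */
#include <stdio.h>

struct file {
    FILE *fp;
    long (*read)(struct file *, char *, long);   /* chosen once, at open time */
};

static long read_block(struct file *f, char *buf, long n)
{
    return (long)fread(buf, 1, (size_t)n, f->fp);
}

static long read_bytewise(struct file *f, char *buf, long n)
{
    /* a second variant, standing in for a differently specialized path */
    long i;
    int  c;
    for (i = 0; i < n && (c = getc(f->fp)) != EOF; i++)
        buf[i] = (char)c;
    return i;
}

struct file open_file(const char *name, int bytewise)
{
    struct file f;
    f.fp   = fopen(name, "r");
    f.read = bytewise ? read_bytewise : read_block;   /* the decision, made once */
    return f;
}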

If there is interest, I'll try to dig up the reference.

aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (11/09/89)

>Basically, it's a performance hack which can make a system much more
>useable on interactive performance. There are systems sensitive to an
>extra IF statement in their inner-loops, particularly when the
>condition check is expensive (a few instructions can be expensive in
>this context) and the inner loop can be executed millions of times per
>request. There's always a way to replace it, but it may not result in
>acceptable performance.

On a system I worked on adding a conditional branch to the critical
path for syscall added 5-10% (correct me if I'm wrong, Wayne) for
minimal syscall timings.  Maybe not much overall, but annoying.
    (This was a "semi-static" test, which we got around by making up a
separate head and changing the syscall vector).

We have several times talked about self-modifying code on the
functional level.  Perhaps it is just a question of notation - perhaps
we just need a "statement" data type - or a question of optimization -
perhaps a good compiler should generate self-modifying code in some of
the hairy situations we have talked about in this thread, given
high-level code that is not self-modifying.

Consider Barry's example:

[B]
    ...predecessor code...
L:  {IF request_is_flagged THEN do_flag_action}
    ...successor code...

Note that I have put curly braces {} around the statement to indicate the
range to which the label L applies.
    In a high level notation the self-modification that Barry described,
removing the IF at the code that changes the assertion, may be described
as:

[S]:
    ...Elsewhere, conditions have changed
    ...so now the test does not need to be performed,
    ... ie. !request_is_flagged:
    L := skip /* Dijkstra's high level nop */
    ...

    ...Elsewhere, conditions have changed
    ... ie. request_is_flagged:
    L := do_flag_action;
    ...

This involves assignment to a variable of type "statement".

In a functional notation you might do

[F]:
    L: pointer to function

    L_skip(){}

    L_do_flag_action() { ... }

    ...predecessor code:
    (*L)();
    ...successor code

where the code that changes the request_is_flagged status changes the
function pointer: L := &L_skip, L := & L_do_flag_action.
    In a decent language you could declare the function pointer
and the values it assumes at the point at which it is used.

(Yeah, great liberties with notation...)


Obviously [S] is equivalent to [F], although more/less convenient.
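
Concretely, [F] in C might look like the sketch below (names invented); the
assignments to L are done wherever the flagged state changes, so the hot loop
pays one indirect call instead of a test:

/* Sketch of [F] in C: L is the statement-valued variable, realized as a
 * pointer to function.  The interpreter always calls through L; the code
 * that turns the flag on or off just reassigns L.
 */
static void do_nothing(void)     { }             /* L_skip            */
static void do_flag_action(void) { /* ... */ }   /* the flagged path  */

static void (*L)(void) = do_nothing;

void set_request_flagged(int flagged)            /* "elsewhere" */
{
    L = flagged ? do_flag_action : do_nothing;
}

void interpret_one_request(void)
{
    /* ...predecessor code... */
    (*L)();                                      /* the statement at label L */
    /* ...successor code... */
}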


To begin with, [F] can be implemented in the obvious way, through
indirect function call/return.

[C]
    L_skip:
    	return

    L_do_flag_action:
    	...
    	return

    ...predecessor code:
    call indirect through L
    ...successor code


    The first level of optimization is to notice that L_skip and
L_do_flag_action always return to the same place, reducing function
calling overhead to indirect branch.

[G]
    L_skip:
    	goto L'

    L_do_flag_action:
    	...
    	goto L'

    ...predecessor code:
    goto indirect L
L': /* return place */
    ...successor code


(By the way, this is an optimization that would be nice in any system that makes
calls through arrays of pointers to functions: like UNIX syscalls (sysent[]),
file system calls (NFS, the System V file system switch), device driver calls
([bc]devsw[]).  99% of these functions are called in one and only one place.)


The next level of optimization is to inline the code with an IF - getting us back
to the explicitly hand coded version:

[IF]
    ...predecessor code:
    IF L == L_skip THEN
    	skip
    ELSE
    	do_flag_action
    FI
    ...successor code

(Of course, a good programmer might have done this manually: instead
of indicating request_was_flagged in a complicated data structure, she
might have encoded it in a single variable, which she effectively did
with the function pointer.  So, the complexity of the test doesn't
concern us - just the overhead of the IF or call)


The next step is to actually do self-modifying code on the "micro" scale:

[SM]
    ...predecessor code	    L_skip: 	L_do_flag_action:
L1:    ___	 	      	nop 	    do_1
L2:    ___  	    	    	nop 	    do_2
L3:    ___  	    	    	nop 	    do_3
    ...successor code


    ...Elsewhere, conditions have changed
    ...so now the test does not need to be performed,
    ... ie. !request_is_flagged:
    L1 := nop
    L2 := nop
    L3 := nop
    ...flush cache or whatever else is necessary
    ...

    ...Elsewhere, conditions have changed
    ... ie. request_is_flagged:
    L1 := do_1
    L2 := do_2
    L3 := do_3
    ...flush cache or whatever else is necessary
    ...

Optimizations based on the code body might be desirable:

    do_action_1:    	    	do_action_2:
    	step1	    	    	    step1
    	step2.1	    	    	    step2.2
    	step3	    	    	    step3
    	step4.1	    	    	    step4.2
    	step5	    	    	    step5

Instead of 5 instructions changing, only 2 need be changed.


Alternatively, rather than micro self-modifying code, a compiler might
take [IF] and attempt to broaden the domain of the code affected. For
example, if [IF] were called within the body of another indirectly
called function that assumed values A and B, two copies of the bodies
of each of A and B might be made...





Okay, enough time wasted.  I think I have several points winding through
all this:

    (1) Self-modifying code at the "function" level is easily expressible
    	(somebody was trying to make a standard library out of this
    	(I was supposed to help, sorry))

    (2) "Micro" self-modifying code is easily expressible
    	(a) with an augmented language
    	(b) with function pointers.

    (3) It would not be unreasonable for an optimizing compiler to
    	make the sort of optimizations described above automatically,
    	achieving the performance advantages of "micro" self-modifying
    	code from a high level representation.

But is it worthwhile spending the effort on a compiler that can do this
sort of thing?  Probably not... the situations where "micro" self-modifying
code is really necessary are few and far between.  Force the implementors
to do it by hand...   Not commercially important.  Might be fun for
some academic research...

    
--
Andy "Krazy" Glew,  Motorola MCD,    	    	    aglew@urbana.mcd.mot.com
1101 E. University, Urbana, IL 61801, USA.          {uunet!,}uiucuxc!udc!aglew
   
My opinions are my own; I indicate my company only so that the reader
may account for any possible bias I may have towards our products.

bjornl@tds.kth.se (Bj|rn Lisper) (11/13/89)

In article <1989Nov6.172043.10373@world.std.com> bzs@world.std.com (Barry
Shein) writes:
%A common, general usage of self-modifying code is seen in interpreter
%(not necessarily, but often, programming languages) loops. Suppose you
%have the ability to trace or otherwise flag certain events. This is
%typically done by a check at the top of the loop:

%	IF request_is_flagged THEN do_flag_action

%The check for "request_is_flagged" can be quite expensive (say,
%hashing into a symbol table to see if a bit is set) and interpreters
%might have to go through this loop for every minor operation (eg. add
%two numbers.) If there are no items flagged or only certain subsets
%need to be checked one way to speed up this loop is to simply modify
%the top-level code so the check isn't done or loops back to a
%different check. ...

Some time ago I made a posting about the connection between partial
evaluation and self-modifying code. (Partial evaluation is simplifying
functions, or code, when the input is partially known. Self-modifying code
can be seen as partial evaluation at runtime.)  The above was an instance I
thought of when writing that posting.  The rule

IF true THEN S => S

is used to simplify conditionals where the condition is known. This can be
done either at compile-time, if the condition is known then, or at run-time,
when the condition becomes known (and we know it will not change value).

Actually, I think partial evaluation would provide a quite clean theory for
an interesting class of self-modifying programs.

Bjorn Lisper

raulmill@usc.edu (Raul Deluth Rockwell) (11/15/89)

The original message of this thread seems to have expired, so I may be
way off base, but here is what appears to be a fragment of the
question:

In article <1080@mipos3.intel.com>, jpoon@mipos2.intel.com (Jack Poon~) writes:
> Could any experts out there educate me WHY and HOW does self-modifying code use?
> ...
> What the advantage of using self-modifying code that non-self-modifying code
> cannot achieve?

In short, incremental compilers, linking loaders, debuggers and
similar system tools use self-modifying code.  It is quite possible to
have a situation where rapid allocation and deallocation of memory
will run into cache corruption.

Of course with classic environments (unix & c for example), this isn't
much of a problem, or it can be hacked around by flushing the entire
cache.  But to exclude fast/tight loader design because somebody
decided that that is a 'poor programming practice' seems rather silly.

--