[comp.arch] Execute instructions, Bit Blit, Self-modifying code, Programming in Assembler

chase@Ozona.orc.olivetti.com (David Chase) (07/28/88)

It's all my fault; I started all these silly conversations with a
question about on-the-fly generation of code to implement partial
application.  I only got one useful answer (from someone at Microsoft,
explaining how o-t-f code generation would be handled in the OS/2
world), but I haven't seen such volume in comp.arch in a long time.

So, what can I say?  Maybe you all should read before you write, and
maybe do a little thinking in between.  Here, I'll try to motivate
partial application, though it has little to do with speed.

Supposing I wanted to implement function-returning functions, and I
wanted to implement one returning *different* functions.  Partial
application (combined with some way to garbage collect on the
generated functions; trust me, it is doable) gives you that.
Supposing you wanted nested functions (in the style of Pascal) but you
didn't want to make a distinction between "real functions" (ptr to
code) and "function parameters" (ptr to code plus scope, if you can
pass nested functions as parameters).  Well, partial application
(behind the scenes) will give you that, too.  Your compiler just cooks
up a little routine that takes an extra parameter, and on entry to the
outer procedure a little bit of code is generated on to the stack
corresponding to the partial application of the little routine to the
address of the current activation record.  The address of the code on
the stack is the ptr to the nested function.  (You could get fancier,
of course, but code generation onto the stack is used in the other
tricks, too.)

By the way, the idea for using CG onto the stack to get closures as
function pointers is due to Thomas M. Breuel of (I think) MIT.  He
says he may write a paper on this trick; I hope he does.  Hearing of
this, I implemented partial application to see if I understood the
technique.

You may be thinking, "is this guy blowing smoke?  How does this work,
anyway?"  The partial application of F to A (assumed to be 32 bits) on
a Sun 3 running SunOS 3.4 is accomplished by generating code that

1) duplicates (pushes again) the top of the stack (saved PC)
2) stores A under the top of stack (sp+4)
3) a nop for alignment for the garbage collector
4) branches to the code for F

or
	movl sp@,d0          ; Pick up PC and
	movl d0,sp@-         ; shove it up one
	movl 1$,d0           ; Pick up arg
	movl d0,sp@(4)       ; and insert it
	nop                  ; Alignment to make GC happy
	jmp  _foo            ; branch to routine
    1$: .long 1              ; the argument A goes here

or (copied onto stack, or into allocated memory)

 0x20172f00
 0x203a000e
 0x2f400004
 0x4e714ef9
 0xFFFFFFFF  <- store function here
 0xAAAAAAAA  <- store argument here

So here is a use for on-the-fly code generation that is NOT easily
replaced by hardware, and is not "self-modifying" in the direct sense
of the word (though I can easily imagine code getting generated onto
the same address more than once, which could confuse some caches).

It probably also interferes with fancy global and interprocedural
register allocators; I am sort of aware, *in the abstract*, of Things
That Go Wrong.  I am interested in examples from
real life (preferably using machines that I am likely to encounter in
the future, not the past) of machines/operating systems that make this
impossible, or very expensive, or very interesting.  Given a language
like Pascal or Modula-2, one is stuck (or blessed, depending upon your
taste in these things) with nested procedures (that I think can be
passed as parameters); this little trick makes it cheaper to pass a
procedure as a parameter and cheaper to call a procedure passed as a
parameter, but at some expense to the implementation of nested
procedures, and possibly at some overall expense because of caching
tricks that no longer work.  Basically, we're talking about
engineering and economic decisions.

Comments?

David Chase
Olivetti Research Center, Menlo Park, CA

max@polya.Stanford.EDU (Max Hailperin) (07/28/88)

David Chase's posting about run-time-generated code for partial
application for closures may have left some of y'all wondering
why this is a good (= efficient) way to represent closures.  He writes
that it allows fast passing and calling of procedural parameters, but
on the surface it appears that the traditional method of passing both
a code pointer and an environment pointer would be faster.  Thus some
explanation is called for.

The rationale for passing a closure as the address of a
(run-time-generated) piece of code is that the compiler can figure out
that most closures don't really need to be closures.  This is because
most procedures use only their own local variables and truly global
variables, but nothing from intermediate levels.  Thus it is desirable
to have procedures that take procedural parameters work whether passed
a closure or a non-closed procedure.  The method Chase describes
allows for this, and thus improves performance in the common case.

So the bottom line is that this is a good way to represent closures
because it allows a cheap representation of the closures that don't
need to close over anything.  This is all implicit in Chase's desire to
calling sequence for closures and normal procedures be the same, but I
thought I'd make it explicit to help make clear why architectural
support for run-time generated code would be nice.

firth@sei.cmu.edu (Robert Firth) (07/28/88)

In article <3435@polya.Stanford.EDU> max@polya.Stanford.EDU (Max Hailperin) writes:

>The rationale for passing a closure as the address of a
>(run-time-generated) piece of code is that the compiler can figure out
>that most closures don't really need to be closures.  This is because
>most procedures use only their own local variables and truly global
>variables, but nothing from intermediate levels.  Thus it is desirable
>to have procedures that take procedural parameters work whether passed
>a closure or a non-closed procedure.  The method Chase describes
>allows for this, and thus improves performance in the common case.

Could someone please be so kind as to run the code sample by me again,
if possible with annotations about performance, since I'm afraid I
still don't see it.

The cost of passing the environment pointer to a parametric procedure
is rarely more than one instruction.  I grant you this is wasted if
the procedure being called doesn't need it, but surely there is a
minimum overhead of one instruction in getting from the runtime-
generated code back to the "real" code of the called procedure?

And, of course, when you generate the actual procedure parameter you
don't need to evaluate its environment pointer if you can see it won't
need one; likewise the entry sequence of a procedure can ignore any
passed-in environment pointer.

chase@Ozona.orc.olivetti.com (David Chase) (07/29/88)

[First, though I talk about this, credit is due to other people.  I
imagine the Lisp crowd has known about this for ages, and Thomas
Breuel pointed it out to me.]

In article <6412@aw.sei.cmu.edu> firth@bd.sei.cmu.edu.UUCP (Robert Firth) writes:
>The cost of passing the environment pointer to a parametric procedure
>is rarely more than one instruction.  I grant you this is wasted if
>the procedure being called doesn't need it, but surely there is a
>minimum overhead of one instruction in getting from the runtime-
>generated code back to the "real" code of the called procedure?
>
>And, of course, when you generate the actual procedure parameter you
>don't need to evaluate its environment pointer if you can see it won't
>need one; likewise the entry sequence of a procedure can ignore any
>passed-in environment pointer.

I think the other (major) motivation for doing this (of course, the
one that I omitted) is to simplify interlanguage stuff.  In most
implementations of Pascal, Fortran, and C, a "function" or
"subroutine" variable (where such variables are permitted at all) is
represented as a pointer to code, except for the case of (correct me
if I am wrong here) functions and procedures passed as parameters in
Pascal (*).  In that case, because the function might be nested
(inherit variables from its lexical ancestor) a pointer to the frame
(or similar structure) must be passed along with it (unless code is
generated as a I described in the previous posting).  This hurts you
in the following ways.

1) a (possibly bogus) frame pointer gets passed with EVERY procedure
passed as a parameter, whether it needs it or not.

2) ALL calls to a procedure passed as a parameter must set up the
lexical scope in case it is needed.

3) I can't pass nested functions or procedures passed as parameters to
other languages (e.g., Fortran ODE solvers or minimum finders, if you
need justification for this).  In fact, unless I tell my compiler that
"this is an interlanguage call", I can't pass ANY procedures as
parameters to subroutines written in other languages, because the
other languages won't be expecting to see that (possibly bogus) frame
pointer passed along with the procedure in the parameter list.

Given that many procedures are not nested, and given that many people
have an aversion to lexical procedures anyway (though some people love
them), I can't see how any minor speedups for the nested case justify
penalizing all other cases and hindering interlanguage calls so
severely.

As to the costs -- good code for doing this would be (on a 68k)

1) save the current LP (lexical pointer) if necessary.
2) load the one for the routine about to be called (using PC-relative
   arithmetic)
3) branch to the routine
....
4) The routine would restore the LP before returning.

This costs you a branch, I think, for each call of the routine, plus
the time to "generate" the code on entry to the lexical ancestor (one
block move plus a store).  Of course, if the nested procedure is not
passed as a parameter, then the ancestor procedure can just load the
LP and branch directly to the routine, removing the need to generate
code on the stack.  For a discussion of this and many other related
tricks, one might track down the Orbit papers in the '86 compiler
construction conference, or Guy Steele's work on the Rabbit compiler, 
or Luca Cardelli's work with his ML compiler.

Let me say it again, by cases:

1) non-nested procedure parameter
   pass as pointer to code.

2) incoming procedure parameter
   treat as pointer to code.

3) nested procedure, not passed as parameter. (+)
   treat as code + pointer, since all occurrences of the procedure
   are exposed to the compiler.

4) nested procedure, passed as parameter.
   Generate code on the fly, and use that for the parameter
   occurrence, but just load the LP and go for calls to the nested
   procedure from within the nesting procedure.

So, on-the-fly code generation only costs you when a nested procedure
is passed as a parameter.  I don't have statistics handy, so my
guesses are suspect, but I'll guess that that doesn't occur too often.

Of course, all the cost estimates change radically if I'm on a machine
with split instruction and data spaces, or some similar impediment.
That's what I was talking about in the first place, and what I wanted
to know about.  How is on-the-fly code generation handled in some of
these split cache/split bus architectures that seem to be appearing
nowadays?  Is it very costly, or even possible?  (and if so, how?)

(*) My Pascal knowledge is pretty wimpy, but I didn't see any big
block letters saying "don't pass nested procedures as parameters".

(+) This gets tricky if (say) A contains B and C, and B calls C, and B
    is passed as a parameter.  However, from within B it will be
    possible to recover the lexical environment for C, so we are OK
    (in fact, in this case the LP is the same for B and C).

David Chase
Olivetti Research Center, Menlo Park, CA