chase@Ozona.orc.olivetti.com (David Chase) (07/28/88)
It's all my fault; I started all these silly conversations with a question about on-the-fly generation of code to implement partial application. I only got one useful answer (from someone at Microsoft, explaining how on-the-fly code generation would be handled in the OS/2 world), but I haven't seen such volume in comp.arch in a long time. So, what can I say? Maybe you all should read before you write, and maybe do a little thinking in between.

Here, I'll try to motivate partial application, though it has little to do with speed. Suppose I wanted to implement function-returning functions, and I wanted one that returns *different* functions. Partial application (combined with some way to garbage-collect the generated functions; trust me, it is doable) gives you that. Suppose you wanted nested functions (in the style of Pascal) but you didn't want to make a distinction between "real functions" (a pointer to code) and "function parameters" (a pointer to code plus a scope, if you can pass nested functions as parameters). Well, partial application (behind the scenes) will give you that, too. Your compiler just cooks up a little routine that takes an extra parameter, and on entry to the outer procedure a little bit of code is generated onto the stack corresponding to the partial application of the little routine to the address of the current activation record. The address of the code on the stack is the pointer to the nested function. (You could get fancier, of course, but code generation onto the stack is used in the other tricks, too.)

By the way, the idea of using code generation onto the stack to get closures as function pointers is due to Thomas M. Breuel of (I think) MIT. He says he may write a paper on this trick; I hope he does. Hearing of this, I implemented partial application to see if I understood the technique. You may be thinking, "Is this guy blowing smoke? How does this work, anyway?"
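The effect of the generated code can be sketched in a high-level language before looking at the machine code below. Here is a minimal Python sketch of the semantics the on-the-fly code implements (the names partial_apply, thunk, and add are mine, purely for illustration) -- fix one argument of a function and get back something callable exactly like an ordinary function:

	def partial_apply(f, a):
	    """Return a new function behaving like f with its first argument
	    fixed to a -- the effect the generated stack code achieves, but
	    done here with an ordinary closure rather than machine code."""
	    def thunk(*rest):
	        return f(a, *rest)
	    return thunk

	def add(x, y):
	    return x + y

	add5 = partial_apply(add, 5)   # a "different function" made at run time
	print(add5(3))                 # -> 8

The point of the machine-code version is that add5 would be a plain code pointer, indistinguishable from any other function to its callers.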
The partial application of F to A (assumed to be 32 bits) on a Sun 3 running SunOS 3.4 is accomplished by generating code that

  1) duplicates (pushes again) the top of the stack (the saved PC),
  2) stores A under the top of stack (sp+4),
  3) executes a nop for alignment for the garbage collector, and
  4) branches to the code for F:

	movl	sp@,d0		; Pick up PC
	movl	d0,sp@-		; and shove it up one
	movl	1$,d0		; Pick up arg
	movl	d0,sp@(4)	; and insert it
	nop			; Alignment to make GC happy
	jmp	_foo		; branch to routine
  1$:	.long	1

or (copied onto the stack, or into allocated memory):

	0x20172f00
	0x203a000e
	0x2f400004
	0x4e714ef9
	0xFFFFFFFF	<- store function here
	0xAAAAAAAA	<- store argument here

So here is a use for on-the-fly code generation that is NOT easily replaced by hardware, and is not "self-modifying" in the direct sense of the word (though I can easily imagine code being generated at the same address more than once, which could confuse some caches). It probably also interferes with fancy global and interprocedural register allocators; I am aware, *in the abstract*, of Things That Go Wrong. I am interested in examples from real life (preferably on machines that I am likely to encounter in the future, not the past) of machines/operating systems that make this impossible, or very expensive, or very interesting.

Given a language like Pascal or Modula-2, one is stuck (or blessed, depending upon your taste in these things) with nested procedures (which I think can be passed as parameters); this little trick makes it cheaper to pass a procedure as a parameter and cheaper to call a procedure passed as a parameter, but at some expense to the implementation of nested procedures, and possibly at some overall expense because of caching tricks that no longer work. Basically, we're talking about engineering and economic decisions. Comments?

David Chase
Olivetti Research Center, Menlo Park, CA
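The structure of that listing is a fixed instruction template with two patchable slots at the end. A small Python sketch of the template-and-patch idea (the addresses here are placeholders, not real pointers, and make_thunk is my name for it):

	# The four fixed instruction words from the 68k listing above; the
	# two trailing slots hold the function address and the argument.
	TEMPLATE = [0x20172F00, 0x203A000E, 0x2F400004, 0x4E714EF9]

	def make_thunk(func_addr, arg):
	    """Copy the template and patch in the function and argument --
	    what the real implementation does onto the stack or heap."""
	    return TEMPLATE + [func_addr, arg]

	thunk = make_thunk(0x0000F00D, 42)   # placeholder address, argument 42
	print(hex(thunk[4]), thunk[5])

In the real thing, of course, the resulting buffer is made executable and its address handed out as an ordinary function pointer.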
max@polya.Stanford.EDU (Max Hailperin) (07/28/88)
David Chase's posting about run-time-generated code for partial application for closures may have left some of y'all wondering why this is a good (= efficient) way to represent closures. He writes that it allows fast passing and calling of procedural parameters, but on the surface it appears that the traditional method of passing both a code pointer and an environment pointer would be faster. Thus some explanation is called for.

The rationale for passing a closure as the address of a (run-time-generated) piece of code is that the compiler can figure out that most closures don't really need to be closures. This is because most procedures use only their own local variables and truly global variables, but nothing from intermediate levels. Thus it is desirable to have procedures that take procedural parameters work whether passed a closure or a non-closed procedure. The method Chase describes allows for this, and thus improves performance in the common case.

So the bottom line is that this is a good way to represent closures because it allows a good representation of closures that don't need to close over anything. This is all implicit in Chase's desire to have the calling sequence for closures and normal procedures be the same, but I thought I'd make it explicit to help make clear why architectural support for run-time-generated code would be nice.
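The uniform-calling-sequence point can be illustrated in Python, where the call site is already uniform for both kinds of procedure value. All names here (integrate, square, make_scaler) are mine, invented for illustration:

	def integrate(f, a, b, n=1000):
	    """Midpoint-rule integration; f may be a plain function or a
	    closure -- the call f(...) looks the same either way."""
	    h = (b - a) / n
	    return sum(f(a + (i + 0.5) * h) for i in range(n)) * h

	def square(x):        # uses only its own argument: no closure needed
	    return x * x

	def make_scaler(k):   # genuinely captures k from an enclosing scope
	    def scale(x):
	        return k * x
	    return scale

	print(round(integrate(square, 0.0, 1.0), 3))            # -> 0.333
	print(round(integrate(make_scaler(2.0), 0.0, 1.0), 3))  # -> 1.0

The common case is square: nothing to close over, so under the code-pointer representation it costs nothing extra, while make_scaler's result pays for its environment only because it really needs one.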
firth@sei.cmu.edu (Robert Firth) (07/28/88)
In article <3435@polya.Stanford.EDU> max@polya.Stanford.EDU (Max Hailperin) writes:
>The rationale for passing a closure as the address of a
>(run-time-generated) piece of code is that the compiler can figure out
>that most closures don't really need to be closures. This is because
>most procedures use only their own local variables and truly global
>variables, but nothing from intermediate levels. Thus it is desirable
>to have procedures that take procedural parameters work whether passed
>a closure or a non-closed procedure. The method Chase describes
>allows for this, and thus improves performance in the common case.

Could someone please be so kind as to run the code sample by me again, if possible with annotations about performance, since I'm afraid I still don't see it.

The cost of passing the environment pointer to a parametric procedure is rarely more than one instruction. I grant you this is wasted if the procedure being called doesn't need it, but surely there is a minimum overhead of one instruction in getting from the runtime-generated code back to the "real" code of the called procedure?

And, of course, when you generate the actual procedure parameter you don't need to evaluate its environment pointer if you can see it won't need one; likewise the entry sequence of a procedure can ignore any passed-in environment pointer.
chase@Ozona.orc.olivetti.com (David Chase) (07/29/88)
[First, though I talk about this, credit is due to other people. I imagine the Lisp crowd has known about this for ages, and Thomas Breuel pointed it out to me.]

In article <6412@aw.sei.cmu.edu> firth@bd.sei.cmu.edu.UUCP (Robert Firth) writes:
>The cost of passing the environment pointer to a parametric procedure
>is rarely more than one instruction. I grant you this is wasted if
>the procedure being called doesn't need it, but surely there is a
>minimum overhead of one instruction in getting from the runtime-
>generated code back to the "real" code of the called procedure?
>
>And, of course, when you generate the actual procedure parameter you
>don't need to evaluate its environment pointer if you can see it won't
>need one; likewise the entry sequence of a procedure can ignore any
>passed-in environment pointer.

I think the other (major) motivation for doing this (of course, the one that I omitted) is to simplify interlanguage calls. In most implementations of Pascal, Fortran, and C, a "function" or "subroutine" variable (where such are permitted at all) is represented as a pointer to code, except in the case of (correct me if I am wrong here) functions and procedures passed as parameters in Pascal (*). In that case, because the function might be nested (inherit variables from its lexical ancestor), a pointer to the frame (or a similar structure) must be passed along with it (unless code is generated as I described in the previous posting). This hurts you in the following ways:

1) A (possibly bogus) frame pointer gets passed with EVERY procedure passed as a parameter, whether it needs it or not.

2) ALL calls to a procedure passed as a parameter must set up the lexical scope in case it is needed.

3) I can't pass nested functions or procedures to other languages (e.g., Fortran ODE solvers or minimum finders, if you need justification for this).
In fact, unless I tell my compiler that "this is an interlanguage call", I can't pass ANY procedures as parameters to subroutines written in other languages, because the other languages won't be expecting to see that (possibly bogus) frame pointer passed along with the procedure in the parameter list. Given that many procedures are not nested, and given that many people have an aversion to lexical procedures anyway (though some people love them), I can't see how any minor speedups for the nested case justify penalizing all other cases and hindering interlanguage calls so severely.

As to the costs -- good code for doing this would (on a 68k)

1) save the current LP (lexical pointer) if necessary,
2) load the one for the routine about to be called (using PC-relative arithmetic),
3) branch to the routine, and
4) restore the LP in the routine before returning.

This costs you a branch, I think, for each call of the routine, plus the time to "generate" the code on entry to the lexical ancestor (one block move plus a store). Of course, if the nested procedure is not passed as a parameter, then the ancestor procedure can just load the LP and branch directly to the routine, removing the need to generate code on the stack. For a discussion of this and many other related tricks, one might track down the Orbit papers in the '86 compiler construction conference, or Guy Steele's work on the Rabbit compiler, or Luca Cardelli's work with his ML compiler.

Let me say it again, by cases:

1) Non-nested procedure parameter: pass as a pointer to code.
2) Incoming procedure parameter: treat as a pointer to code.
3) Nested procedure, not passed as a parameter (+): treat as code + pointer, since all occurrences of the procedure are exposed to the compiler.
4) Nested procedure, passed as a parameter: generate code on the fly, and use that for the parameter occurrence, but just load the LP and go for calls to the nested procedure from within the nesting procedure.
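The "fat" representation being argued against -- every procedure parameter carried as a (code, environment) pair, with every call site threading the environment through -- can be sketched in Python. The names and the dict-as-frame are mine, purely illustrative:

	def call(proc, *args):
	    """Call a 'fat' procedure value: a (code, env) pair.  Every call
	    site pays to pass env, even when the routine ignores it."""
	    code, env = proc
	    return code(env, *args)

	def plain(env, x):        # non-nested routine: env slot is dead weight
	    return x * x

	def nested(env, x):       # stands in for a nested routine that reads an
	    return x + env["k"]   # outer variable through its frame pointer

	print(call((plain, None), 4))        # -> 16
	print(call((nested, {"k": 7}), 4))   # -> 11

Cases 1 and 2 above drop the pair and pass bare code; case 4 hides the environment inside generated code so callers like call never see it at all.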
So, on-the-fly code generation only costs you when a nested procedure is passed as a parameter. I don't have statistics handy, so my guesses are suspect, but I'll guess that that doesn't occur too often. Of course, all the cost estimates change radically if I'm on a machine with split instruction and data spaces, or some similar impediment. That's what I was talking about in the first place, and what I wanted to know about. How is on-the-fly code generation handled in some of these split-cache/split-bus architectures that seem to be appearing nowadays? Is it very costly, or even possible? (And if so, how?)

(*) My Pascal knowledge is pretty wimpy, but I didn't see any big block letters saying "don't pass nested procedures as parameters".

(+) This gets tricky if (say) A contains B and C, B calls C, and B is passed as a parameter. However, from within B it will be possible to recover the lexical environment for C, so we are OK (in fact, in this case the LP is the same for B and C).

David Chase
Olivetti Research Center, Menlo Park, CA