moss@cs.umass.edu (Eliot Moss) (11/25/90)
Mention of architectural speedups via special branch target cache mechanisms, and the fact that they do not apply to "object-oriented" procedure calls because they are indirect, encouraged me to open up wider discussion of the issues surrounding object-oriented procedure calls (OOPCs for short).

First, a review of some of the linguistic features. A number of languages can compile a "dynamically bound" (OO) procedure call into an indirect branch, i.e., a branch through a compile-time-known offset in a run-time-known table. Other OO languages require lookup that makes obtaining a fixed, compile-time-known offset difficult. Smalltalk, for example, requires run-time method lookup in general. On the other hand, it has been empirically demonstrated that any given call site rarely varies in the class/type of the argument(s) determining the call binding (i.e., which actual code is run).

In 1984 (see POPL for that year) Deutsch and Schiffman presented the *inline method cache*, where a direct call is performed, and the prolog at the top of each method verifies that the class of the receiver is the same as it was at the previous call at the same call site. If the class is the same, then execution proceeds; if it is different, then a lookup is performed (using a full, software-managed cache, backed up by an exhaustive search up the class hierarchy), the code at the call site is changed to refer to the appropriate method, and the class-last-time is changed to reflect the appropriate binding. The calling sequence is something like this:

	call	code-last-time
	class-last-time
	...	return point ...

Note that the return is 4 bytes beyond the usual return address. Anyway, this technique is very effective (very good hit rate), but does require code modification. It appears it might work well with a properly designed smart I-cache, though.
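[A minimal simulation of the inline method cache scheme described above, for readers who want to see the moving parts. All names here (CallSite, lookup_method, etc.) are invented for illustration; in the real scheme the per-site state lives inline in the machine code at the call site, rewritten on a miss.]

```python
# Toy model of the Deutsch-Schiffman inline method cache (POPL '84).
# Names and structures are illustrative only, not from any real system.

class Klass:
    def __init__(self, name, methods, superclass=None):
        self.name = name
        self.methods = methods          # selector -> function
        self.superclass = superclass

class Obj:
    def __init__(self, cls):
        self.cls = cls

def lookup_method(cls, selector):
    """Stands in for the full software-managed cache backed by an
    exhaustive search up the class hierarchy."""
    while cls is not None:
        if selector in cls.methods:
            return cls.methods[selector]
        cls = cls.superclass
    raise AttributeError(selector)

class CallSite:
    """Per-call-site state: the receiver class seen last time and the
    code the site is currently bound to."""
    def __init__(self, selector):
        self.selector = selector
        self.class_last_time = None
        self.code_last_time = None
        self.misses = 0

    def send(self, receiver):
        # The method prolog's check: same receiver class as last time?
        if receiver.cls is not self.class_last_time:
            self.misses += 1            # rebind the call site
            self.code_last_time = lookup_method(receiver.cls, self.selector)
            self.class_last_time = receiver.cls
        return self.code_last_time(receiver)

Point  = Klass("Point",  {"area": lambda self: 0})
Circle = Klass("Circle", {"area": lambda self: 3}, superclass=Point)

site = CallSite("area")
p, c = Obj(Point), Obj(Circle)
results = [site.send(p), site.send(p), site.send(c), site.send(c)]
print(results, site.misses)   # a site that settles on one class mostly hits
```

[The empirical point above is exactly this: `misses` stays small because a given call site rarely changes receiver class, so most sends take only the prolog check.]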
Another thing that I should mention is that some informal, unpublished measurements that I heard about indicate that over 95% of call sites in a statically typed OO language (Trellis/Owl) could be statically bound. That is, there is only one *possible* target for the dynamically bound call at that many call sites. Clearly this is an opportunity for compiler optimization! It does require *global knowledge* of the program, though.

The inline method cache and similar techniques can bring the cost of an OOPC within an instruction or so of an ordinary procedure call, even with hardware speedups for statically known calls. The REAL loss of performance with an object-oriented style is losing the ability to inline the whole procedure call! (IMHO.) In the long run, we will take a considerable performance hit if we do not perform appropriate global analysis and optimization to recognize and statically bind every call we can, and thus gain the ability to inline and apply further optimizations. Interesting related ideas include method splitting and customized compilation (see the Chambers and Ungar paper from SIGPLAN '90, for example).

This is yet another case where we need experts from different specialties to try to understand each other's areas to some extent, and cooperate to achieve the best result.

Yours in OO performance ..... Eliot Moss
--
J. Eliot B. Moss, Assistant Professor
Department of Computer and Information Science
Lederle Graduate Research Center
University of Massachusetts
Amherst, MA  01003
(413) 545-4206; Moss@cs.umass.edu
--
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.
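[A sketch of the kind of global analysis Moss is alluding to: a call site can be statically bound when the receiver's static type has exactly one reachable definition of the selector. This is a minimal class-hierarchy walk; the hierarchy and method tables are invented for illustration.]

```python
# Hypothetical class-hierarchy analysis: count the possible targets of a
# dynamically bound call, given the receiver's static type.

hierarchy = {                    # class -> direct subclasses
    "Shape":  ["Circle", "Square"],
    "Circle": [],
    "Square": [],
}
defines = {                      # class -> selectors it (re)defines
    "Shape":  {"move", "area"},
    "Circle": {"area"},
    "Square": {"area"},
}

def subclasses(cls):
    """cls plus all of its transitive subclasses."""
    out = [cls]
    for sub in hierarchy[cls]:
        out.extend(subclasses(sub))
    return out

def possible_targets(static_type, selector):
    """All definitions of selector that a receiver whose static type is
    static_type could dispatch to at run time."""
    return {c for c in subclasses(static_type) if selector in defines[c]}

# 'move' has a single definition in the whole hierarchy, so a call on a
# Shape can be statically bound (and then inlined).  'area' is overridden
# by every subclass, so it must stay dynamically bound.
print(possible_targets("Shape", "move"))
print(possible_targets("Shape", "area"))
```

[The Trellis/Owl figure quoted above says the first case covers over 95% of call sites in practice, which is why the ability to inline after such analysis matters so much.]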
rouellet@crhc.uiuc.edu (Roland G. Ouellette) (11/26/90)
> The REAL loss of performance with an object oriented style is losing
> the ability to inline the whole procedure call!  (IMHO.)

I believe Ralph Johnson has been working on inlining Smalltalk procedures (and undoing the process at the right time). He's johnson@cs.uiuc.edu and may have published (this?) in OOPSLA (this year?).

					R.
--
= Roland G. Ouellette		ouellette@tarkin.enet.dec.com	=
= 1203 E. Florida Ave		rouellet@[dwarfs.]crhc.uiuc.edu	=
= Urbana, IL 61801	"You rescued me; I didn't want to be saved."	=
=						- Cyndi Lauper	=
aglew@crhc.uiuc.edu (Andy Glew) (11/27/90)
>Mention of architectural speedups via special branch target cache mechanisms,
>and the fact that they do not apply to "object-oriented" procedure calls
>because they are indirect, encouraged me to open up wider discussion of the
>issues surrounding object-oriented procedure calls (OOPCs for short).

BTCs will still help indirect procedure calls. Indirect procedure calls are, after all,

	LOAD	r1 := M[oblist+proc]
	BRANCH	[r1]

The data cache supports the first operation. The BTC supports the second operation. Of course, this requires twice as much cache state, and two operations to perform.

>First, a review of some of the linguistic features. A number of languages can
>compile a "dynamically bound" (OO) procedure call into an indirect branch,
>i.e., a branch through a compile-time-known offset in a run-time-known table.

Warning: this is yet another of my blue-sky ideas.

Many "AI" or "advanced" systems use indirection through an object table in accessing functions and data. Why not try an "indirect cache" ---- instead of a cache that responds to the address of an object, have a cache that responds to the address of the object table entry? Make it simple: support one level of indirection only (I don't want to get lost in the maze of reminiscences about machines that supported arbitrary chains of indirection in hardware). A special "load-indirect" or "branch-indirect" function would be needed. load-indirect's function would be:

	load-indirect	dest	maddr
	(1) Look up maddr associatively in the indirect cache.
	    If found, then use it...  Else, continue.
	(2) Look up maddr in the normal cache.
	    If not found, cache miss to get it -> *maddr
	(3) Look up *maddr in the normal cache.
	    If not found, cache miss to get it.
	Place the result in dest.
	Place the result in the indirect cache, indexed by maddr.

The indirect cache probably would not need to be snooped on the basis of the direct address, since the direct addresses only change as the result of garbage collections. (Multiprocessor issues?)
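[The load-indirect steps above can be simulated in a few lines. This is only a behavioral sketch of the proposal under the stated assumptions (one level of indirection, cache flushed when a garbage collection moves objects); the memory model and names are invented.]

```python
# Toy simulation of the proposed "indirect cache": a cache indexed by the
# address of the object-table entry (maddr), collapsing the two lookups
# of a load-indirect into one hit.

memory = {}          # address -> contents (object table and heap alike)
indirect_cache = {}  # maddr -> *(*maddr); flush when the GC moves objects

def load_indirect(maddr):
    # (1) associative lookup in the indirect cache: if found, use it
    if maddr in indirect_cache:
        return indirect_cache[maddr], "hit"
    # (2) fetch the table entry *maddr, (3) fetch through it
    # (both would go through the normal data cache in hardware)
    ptr = memory[maddr]
    value = memory[ptr]
    # place the result in the indirect cache, indexed by maddr
    indirect_cache[maddr] = value
    return value, "miss"

memory[100] = 5000           # object-table entry at 100 points to 5000
memory[5000] = "method code" # the object (or method) itself

first  = load_indirect(100)  # two lookups, fills the indirect cache
second = load_indirect(100)  # one lookup
print(first, second)
```

[After a GC moves the object at 5000, clearing `indirect_cache` restores correctness, which is the "no snooping on the direct address" point above: only collections change the direct addresses.]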
An indirect cache *might* be a win. How much I don't know, because I don't have any significant benchmark workloads that use such indirect accesses, and I don't have any statistics on the frequency of such operations. As usual, what is needed are real numbers characterizing such "OOPS" applications.
--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]
henry@zoo.toronto.edu (Henry Spencer) (11/27/90)
In article <MOSS.90Nov24181741@ibis.cs.umass.edu> moss@cs.umass.edu (Eliot Moss) writes:
>Mention of architectural speedups via special branch target cache mechanisms,
>and the fact that they do not apply to "object-oriented" procedure calls
>because they are indirect...

??? The AMD 29k BTC, the major example that I know about, works regardless of what kind of branch you use. It even works on interrupts, as I recall. Caches have the advantage of being able to observe run-time behavior, so they don't *care* whether the branches are statically predictable.
--
"I'm not sure it's possible     | Henry Spencer at U of Toronto Zoology
to explain how X works."        | henry@zoo.toronto.edu   utzoo!henry