[comp.arch] Object-Oriented Procedure Call

moss@cs.umass.edu (Eliot Moss) (11/25/90)

Mention of architectural speedups via special branch target cache mechanisms,
and the fact that they do not apply to "object-oriented" procedure calls
because they are indirect, encouraged me to open up wider discussion of the
issues surrounding object-oriented procedure calls (OOPCs for short).

First, a review of some of the linguistic features. A number of languages can
compile a "dynamically bound" (OO) procedure call into an indirect branch,
i.e., a branch through a compile-time-known offset in a run-time-known table.
Other OO languages require lookup that makes obtaining a fixed,
compile-time-known offset difficult. Smalltalk, for example, requires run-time
method lookup in general.

On the other hand, it has been empirically demonstrated that any given call
site rarely varies in the class/type of the argument(s) determining the call
binding (i.e., which actual code is run). In 1984 (see POPL for that year)
Deutsch and Schiffman presented the *inline method cache*, where a direct call
is performed, and the prolog at the top of each method verifies that the class
of the receiver is the same as it was at the previous call at the same call
site. If the class is the same, execution proceeds; if it differs, a lookup
is performed (using a full, software-managed cache, backed up by an
exhaustive search up the class hierarchy), the code at the call site is
patched to refer to the appropriate method, and class-last-time is updated
to reflect the new binding. The calling sequence is something like this:
		call code-last-time
		class-last-time
		... return point ...

Note that the return is 4 bytes beyond the usual return address. Anyway, this
technique is very effective (very good hit rate), but does require code
modification. It appears it might work well with a properly designed smart
I-cache, though.

Another thing that I should mention is that some informal, unpublished
measurements that I heard about indicate that over 95% of call sites in a
statically typed OO language (Trellis/Owl) could be statically bound. That is,
there is only one *possible* target for the dynamically bound call at that
many call sites. Clearly this is an opportunity for compiler optimization! It
does require *global knowledge* of the program, though.

The inline method cache and similar techniques can bring the cost of OOPC
within an instruction or so of ordinary procedure call, even with hardware
speedups for statically known calls. The REAL loss of performance with an
object-oriented style is losing the ability to inline the whole procedure
call! (IMHO.) In the long run, we will take a considerable performance hit if
we do not perform appropriate global analysis and optimization to recognize
and statically bind every call we can, and thus gain the ability to inline and
apply further optimizations. Interesting related ideas include method
splitting and customized compilation (see the Chambers and Ungar paper from
SIGPLAN 90, for example).

This is yet another case where we need experts from different specialties to
try to understand each other's areas to some extent, and cooperate to achieve
the best result.

Yours in OO performance .....				Eliot Moss
--

		J. Eliot B. Moss, Assistant Professor
		Department of Computer and Information Science
		Lederle Graduate Research Center
		University of Massachusetts
		Amherst, MA  01003
		(413) 545-4206; Moss@cs.umass.edu
-- 
Send compilers articles to compilers@iecc.cambridge.ma.us or
{ima | spdcc | world}!iecc!compilers.  Meta-mail to compilers-request.

rouellet@crhc.uiuc.edu (Roland G. Ouellette) (11/26/90)

> The REAL loss of performance with an object oriented style is losing
> the ability to inline the whole procedure call! (IMHO.)

I believe Ralph Johnson has been working on inlining Smalltalk
procedures (and undoing the process at the right time).  He's
johnson@cs.uiuc.edu and may have published (this?) in OOPSLA (this
year?).

R.
--
= Roland G. Ouellette			ouellette@tarkin.enet.dec.com	=
= 1203 E. Florida Ave			rouellet@[dwarfs.]crhc.uiuc.edu	=
= Urbana, IL 61801	   "You rescued me; I didn't want to be saved." =
=							- Cyndi Lauper	=

aglew@crhc.uiuc.edu (Andy Glew) (11/27/90)

>Mention of architectural speedups via special branch target cache mechanisms,
>and the fact that they do not apply to "object-oriented" procedure calls
>because they are indirect, encouraged me to open up wider discussion of the
>issues surrounding object-oriented procedure calls (OOPCs for short).

BTC's will still help indirect procedure calls.
Indirect procedure calls are, after all, 

    LOAD r1 := M[oblist+proc]
    BRANCH [r1]

The data cache supports the first operation.
The BTC supports the second operation.

Of course, this requires twice as much cache state, and two operations
to perform.


>First, a review of some of the linguistic features. A number of languages can
>compile a "dynamically bound" (OO) procedure call into an indirect branch,
>i.e., a branch through a compile-time-known offset in a run-time-known table.

Warning: this is yet another of my blue-sky ideas.

Many "AI" or "advanced" systems use indirection through an object
table in accessing functions and data.

Why not try an "indirect cache" ---- instead of a cache that responds
to the address of an object, have a cache that responds to the address
of the object table entry?
    Make it simple: support one level of indirection only (I don't
want to get lost in the maze of reminiscences about machines that
supported arbitrary chains of indirection in hardware).

A special "load-indirect" or "branch-indirect" function would be needed.
load-indirect's function would be:

    load-indirect   dest    maddr

    (1)	Look up maddr associatively in indirect cache.
    	If found, then use it...
    	Else, continue

    (2) Look up maddr in normal cache.
    	If not found, cache miss to get it -> *maddr

    (3) Look up *maddr in normal cache
    	If not found, cache miss to get it 

    	Place result in dest.
    	Place result in indirect cache, indexed by maddr.

The indirect cache probably would not need to be snooped on the basis
of the direct address, since the direct addresses only change as the
result of garbage collections. (Multiprocessor issues?)

An indirect cache *might* be a win.  How much I don't know, because I
don't have any significant benchmark workloads that use such indirect
accesses.  I don't have any statistics on the frequency of such
operations.  As usual, what is needed are real numbers characterizing
such "OOPS" applications.  

--
Andy Glew, a-glew@uiuc.edu [get ph nameserver from uxc.cso.uiuc.edu:net/qi]

henry@zoo.toronto.edu (Henry Spencer) (11/27/90)

In article <MOSS.90Nov24181741@ibis.cs.umass.edu> moss@cs.umass.edu (Eliot Moss) writes:
>Mention of architectural speedups via special branch target cache mechanisms,
>and the fact that they do not apply to "object-oriented" procedure calls
>because they are indirect...

???  The AMD 29k BTC, the major example that I know about, works regardless
of what kind of branch you use.  It even works on interrupts, as I recall.
Caches have the advantage of being able to observe run-time behavior, so
they don't *care* whether the branches are statically predictable.
-- 
"I'm not sure it's possible            | Henry Spencer at U of Toronto Zoology
to explain how X works."               |  henry@zoo.toronto.edu   utzoo!henry