[comp.arch] Software modularity vs. instruction locality

moss@takahe.cs.umass.edu (Eliot &) (11/06/89)

While I would agree that many subroutine calls, or even a threaded
interpreter, would tend to undermine *sequential* instruction reference, and
thus substantially affect the performance of a sequential pre-fetch buffer, a
true cache is insensitive to jumps. What matters for cache performance is
the *total volume* of instruction locations referenced over a range of time,
and whether they tend to fit in the cache. Processor performance may still be
adversely affected, since it may still be faster to buffer and deliver
sequential instruction words than to take a jump, especially on a fast
pipelined machine, like many RISC CPUs.
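To make the point concrete, here is a toy direct-mapped cache model (a sketch with invented parameters, not any real machine): a trace that jumps constantly between two small routines still hits almost always, while a strictly sequential sweep over more code than the cache holds misses on every line, every pass.

```python
def hit_rate(trace, lines=256, line_size=16):
    """Toy direct-mapped cache: fraction of hits for a trace of byte addresses."""
    cache = {}                         # set index -> tag
    hits = 0
    for addr in trace:
        block = addr // line_size
        idx, tag = block % lines, block // lines
        if cache.get(idx) == tag:
            hits += 1
        else:
            cache[idx] = tag           # fill the line on a miss
    return hits / len(trace)

# Jumpy trace: control ping-pongs between two tiny routines; both fit in the cache.
jumpy = [a for _ in range(1000)
         for a in list(range(0, 64, 4)) + list(range(4160, 4224, 4))]
# Sequential trace: two straight passes over 64 KB, far bigger than the 4 KB cache.
seq = list(range(0, 65536, 4)) * 2

print(hit_rate(jumpy))   # -> 0.99975 (full of jumps, but the working set fits)
print(hit_rate(seq))     # -> 0.75    (no jumps, but every line misses once per pass)
```

The jumpy trace pays only its cold misses; the sequential one never gets to reuse a line.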

--

		J. Eliot B. Moss, Assistant Professor
		Department of Computer and Information Science
		Lederle Graduate Research Center
		University of Massachusetts
		Amherst, MA  01003
		(413) 545-4206; Moss@cs.umass.edu

shri@ccs1.cs.umass.edu (H.Shrikumar{shri@ncst.in}) (11/07/89)

In article <1TMk2X#Qggn6=eric@snark.uu.net> eric@snark.uu.net (Eric S. Raymond) 
writes:
>In <1989Nov4.004529.10049@ico.isc.com> Dick Dunn wrote:
>>                                     Second, I would expect better locality
>> for code reference than for data reference, hence the I cache ought to do
>> more good than the D cache.  Aren't the pathological cache-busting programs
>> generally ones which spray data accesses all over the place?
>
>Not necessarily. There's a subtle problem here; good software modularity
>practices tend to hurt code locality. If you're calling subroutines a lot
>in generated code the PC jumps all over the shop.

This happens for example in a FORTH machine.  FORTH typically is
subroutine threaded, so there is a flurry of subroutine calls
happening at about 4 million a second (in an 8-10 MHz (?) Novix 2016
Forth CPU).

In Forth there is a subroutine call every five or so instructions,
I would guess.

Wonder if some better cache philosophy can help here.
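For readers unfamiliar with subroutine threading, a sketch of why the call density is so high (illustrative word names, not the Novix instruction set): every colon definition compiles into a sequence of calls to its constituent words, so nested definitions produce a call every few primitives.

```python
# Minimal subroutine-threaded Forth sketch: colon definitions are just
# sequences of calls, so the PC hops between words constantly.
stack = []

def lit(n):        # push a literal
    stack.append(n)

def add():         # Forth +
    b, a = stack.pop(), stack.pop()
    stack.append(a + b)

def dup():         # Forth DUP
    stack.append(stack[-1])

# : DOUBLE  DUP + ;      -- compiles to two subroutine calls
def double():
    dup()
    add()

# : QUAD  DOUBLE DOUBLE ;  -- calls nest, multiplying the call count
def quad():
    double()
    double()

lit(3)
quad()
print(stack)   # -> [12]
```

Executing QUAD once makes six calls to run four primitive operations, much like the one-call-per-five-instructions guess above.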

-- shrikumar ( shri@ccs1.cs.umass.edu, shri@ncst.in )

rmb@mipon2.intel.com (Bob Bentley) (11/10/89)

In article <1TMk2X#Qggn6=eric@snark.uu.net> eric@snark.uu.net (Eric S. Raymond) writes:
>
>             ... There's a subtle problem here; good software modularity
>practices tend to hurt code locality. If you're calling subroutines a lot
>in generated code the PC jumps all over the shop.
>
>I have no statistics on this, but I can easily imagine something like, say,
>the inner loop of a threaded-code interpreter busting hell out of the I cache
>because it's doing the equivalent of an indexed call indirect to some far-off
>routine every couple of instructions.
>
>Has anyone done any systematic investigation of this issue?
>-- 

We did a pretty thorough study of cache behavior at BiiN (now, alas, defunct).
This was prompted by the large variance which we observed in cache hit ratios
for different benchmarks; in particular, OS-intensive benchmarks were *much*
worse than CPU-only (Dhrystone, etc.) or Unix utility (grep, diff, etc.) tests.

There were a number of contributing causes to the observed cache behavior
(use of a sub-sectored caching scheme was a major one, since it led to an
effective cache occupancy of < 40%).  However, the nature of OS code was
certainly a factor.  The BiiN OS was written in Ada in a highly modular fashion.
The first versions of the OS contained considerable amounts of diagnostic code,
and very few routines were inlined.  The result was that there was very little
either spatial or temporal locality in the OS code.  Calls/branches were very
frequent; tight loops were very rare.  Though not as bad as Eric's hypothetical
example, the effect on cache hit ratio and hence on overall system performance 
was still significant.  There are some definite negative performance
implications to modular programming techniques which need to be kept in mind.
(This is not to say that I am opposed to modular programming techniques,
especially for projects as large and complex as the BiiN OS - as a former
colleague of mine used to ask, "Do you want it to go fast or do you want it to
work?").
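The sub-sectoring effect can be modeled in a few lines: one tag per line, a valid bit per sub-sector, and only the referenced sub-sector filled on a miss.  With a scattered reference stream, most sub-sectors never become valid.  The sizes below are invented for illustration, not BiiN's actual parameters.

```python
import random

LINES, SECTORS = 128, 4        # 4 sub-sectors share a single tag per line
random.seed(1)
cache = {}                     # set index -> (tag, set of valid sub-sectors)

for _ in range(10000):
    block = random.randrange(1 << 16)          # scattered reference stream
    line, sector = divmod(block, SECTORS)
    idx, tag = line % LINES, line // LINES
    entry = cache.get(idx)
    if entry and entry[0] == tag:
        entry[1].add(sector)                   # hit: one more sector fills in
    else:
        cache[idx] = (tag, {sector})           # miss: new tag, ONE valid sector

valid = sum(len(sectors) for _, sectors in cache.values())
occupancy = valid / (LINES * SECTORS)
print(occupancy)   # well under half the cache actually holds valid data
```

Each new tag invalidates all the sub-sectors under the old one, so effective occupancy stays low, in the same spirit as the < 40% figure above.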

	Bob Bentley

--------------------------------------------------------------------------------
|	Intel Corp., M/S JF1-58			UUCP:   rmb@omefs3.intel.com   |
|	2111 N.E. 25th Avenue			Phone:  (503) 696-4728         |
|	Hillsboro, Oregon 97124			Fax:    (503) 696-4515         |
--------------------------------------------------------------------------------

crowl@cs.rochester.edu (Lawrence Crowl) (11/10/89)

In article <1TMk2X#Qggn6=eric@snark.uu.net>
eric@snark.uu.net (Eric S. Raymond) writes:
>good software modularity practices tend to hurt code locality. If you're
>calling subroutines a lot in generated code the PC jumps all over the shop.

For the purposes of caches and locality, it doesn't matter if the PC jumps all
over the shop.  What matters is if the PC jumps to someplace it has recently
been.  Unrolling loops reduces PC jumping, but also reduces locality and cache
effectiveness.  Replacing several calls to a procedure by the inline code to
achieve the same function produces the same effect.  By leaving the calls in
place, I reduce the number of (static) instructions that the PC visits, and
thereby increase the PC locality.  In short, MODULARITY IMPROVES PC LOCALITY.
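The static-footprint half of this argument is easy to quantify with schematic numbers (illustrative "instruction slots", not any real ISA): one shared copy plus its call sites versus a private copy per site.

```python
PROC_LEN = 20      # instructions in the body of a shared helper procedure
CALL_LEN = 1       # one call instruction at each use site
SITES = 12         # number of places the helper is used

with_calls = PROC_LEN + SITES * CALL_LEN   # one copy, plus the calls
fully_inlined = SITES * PROC_LEN           # a private copy per site

print(with_calls, fully_inlined)   # -> 32 240
```

The called version touches 32 static instructions where the inlined version touches 240; the smaller footprint is exactly what keeps the PC revisiting lines already in the cache.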

There are some tradeoffs involved in modularity that affect performance. 
First, the additional call instructions may increase code size for very small
procedures. (Automatic inlining helps here.)  Second, the PC jumps drain the
instruction pipeline.  If the drain is sufficiently frequent, lower PC
locality may improve performance.  Third, procedures may be spread over many
pages and each occupy only a small portion of its page.  The result is that
while the INSTRUCTION locality increases, the PAGE locality decreases.  This
could reduce performance via translation cache misses and paging.

Suppose we have a highly modular program, and an inlining compiler.  The
compiler may choose to call a procedure, or expand it inline.  For those
procedures where calls are a net performance loss, the compiler will choose to
expand them inline.  Given a modular program, an inlining compiler may emit
either modular code or non-modular code.  The reverse is not true.  So, given
an inlining compiler, MODULARITY IS ALWAYS GOOD.

(There are also macro-efficiency and software engineering reasons to prefer
modularity.  However, they are outside the scope of this newsgroup.)
-- 
  Lawrence Crowl		716-275-9499	University of Rochester
		      crowl@cs.rochester.edu	Computer Science Department
...!{allegra,decvax,rutgers}!rochester!crowl	Rochester, New York,  14627

eric@snark.uu.net (Eric S. Raymond) (11/13/89)

In <1989Nov9.205356.17585@cs.rochester.edu> Lawrence Crowl wrote:
> (There are also macro-efficiency and software engineering reasons to prefer
> modularity.

Hearty agreement on that (I'm a rather vigorous partisan of ADT methods,
myself).

You describe a set of heuristics for choosing inlining or non-inlining that
make intuitive sense (at least they do now that I've reread your posting
about three times :-)). But it sounds to me as though effective use of those
heuristics might require maintenance of an awful lot of `macro' information
about the code object and its call patterns, including a global flow
analysis. Is this actually done? If not, what kind of less elaborate state
has been found to suffice?
-- 
      Eric S. Raymond = eric@snark.uu.net    (mad mastermind of TMN-Netnews)

mef@aplcen.apl.jhu.edu (Marty Fraeman) (11/16/89)

In article <6374@dime.cs.umass.edu> shri@ccs1.cs.umass.edu (H.Shrikumar{shri@ncst.in}) writes:
>In article <1TMk2X#Qggn6=eric@snark.uu.net> eric@snark.uu.net (Eric S. Raymond) 
>writes:
>>In <1989Nov4.004529.10049@ico.isc.com> Dick Dunn wrote:
>>>                                     Second, I would expect better locality
>>> for code reference than for data reference, hence the I cache ought to do
>>> more good than the D cache.  Aren't the pathological cache-busting programs
>>> generally ones which spray data accesses all over the place?
>>
>>Not necessarily. There's a subtle problem here; good software modularity
>>practices tend to hurt code locality. If you're calling subroutines a lot
>>in generated code the PC jumps all over the shop.
>
>This happens for example in a FORTH machine, FORTH typically is
>subroutine threaded, so there is a flurry of subroutine calls
>happening at about 4 million a second. (in a 8-10 Mhz (?) Novix 2016 
>Forth CPU).
>
>In forth there is a subroutine call every five or so instructions
>I would guess.

We have looked at a similar issue in Forth.  Over 90% of sequential
code accesses are less than 6.25 instructions long on the SC32 Forth
engine.  This machine can execute most Forth primitives with a single
one-cycle instruction, except for load and store, which take two cycles.
Subroutine calls are one cycle and most returns take zero cycles since
they can generally be combined with another instruction.  

We also looked at the effectiveness of instruction caches on this 
machine and found that fairly small caches (<16KB) could still achieve 
>95% hit rates.  However, since the size of the programs we studied 
was fairly modest our I-cache size result should be taken with a grain 
of salt.  On the other hand the size of programs we studied was 
comparable to the size of single threads on typical real-time 
applications we've developed in the past so I believe there is some 
significance to our data.

As a final comment on I-cache effectiveness in Forth, keep in mind that
while Forth instruction traces hop all over the place, the hierarchical
nature of most Forth implementations keeps code size much smaller than
usual.
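The run-length statistic quoted above can be computed from an instruction-address trace in a few lines; the trace below is invented for illustration, not SC32 data.

```python
def run_lengths(trace, step=1):
    """Lengths of maximal sequential runs in a trace of instruction addresses."""
    runs, length = [], 1
    for prev, cur in zip(trace, trace[1:]):
        if cur == prev + step:        # still sequential
            length += 1
        else:                         # a jump: close off the current run
            runs.append(length)
            length = 1
    runs.append(length)
    return runs

# Made-up trace: three routines with jumps between them.
trace = [0, 1, 2, 100, 101, 102, 103, 50, 51, 0, 1, 2, 3, 4]
print(run_lengths(trace))   # -> [3, 4, 2, 5]
```

Gathering such runs over a real trace and taking the 90th percentile gives the "less than 6.25 instructions" sort of figure.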

	Marty Fraeman

	mef@aplcen.apl.jhu.edu
	301-953-5000, x8360

	JHU/Applied Physics Laboratory
	Johns Hopkins Road
	Laurel, Md. 20707

peter@ficc.uu.net (Peter da Silva) (11/16/89)

In article <6374@dime.cs.umass.edu> shri@ccs1.cs.umass.edu (H.Shrikumar{shri@ncst.in}) writes:
> In forth there is a subroutine call every five or so instructions
> I would guess.

Yeh, but most of these calls would be to the same place, so the subroutine
would be in the cache. You'd expect that { @ ! (colon) (exit) + - / * } and
a few others would be in the cache most of the time.

> Wonder if some better cache philosophy can help here.

I'd worry more about pipeline flushes.
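The claim that a handful of primitives absorb most of the calls can be sketched with an invented call-frequency distribution (the 90/10 split below is an assumption, not a measurement):

```python
import random

random.seed(2)
primitives = ["@", "!", "(colon)", "(exit)", "+", "-", "*", "/"]
rare_words = ["word%d" % i for i in range(200)]   # hypothetical user-defined words

# Assume 90% of threaded-code calls target the common primitives.
calls = [random.choice(primitives) if random.random() < 0.9
         else random.choice(rare_words)
         for _ in range(10000)]

resident = set(primitives)           # the handful of words that stay cached
hit_fraction = sum(w in resident for w in calls) / len(calls)
print(hit_fraction)                  # close to the assumed 0.9
```

However skewed the jump targets look, a cache only big enough for the hot primitives already covers the bulk of the calls.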
-- 
`-_-' Peter da Silva <peter@ficc.uu.net> <peter@sugar.hackercorp.com>.
 'U`  --------------  +1 713 274 5180.
"*Real* wizards don't whine about how they paid their dues"
	-- Quentin Johnson quent@atanasoff.cs.iastate.edu

freudent@eric.nyu.edu (Eric Freudenthal) (11/16/89)

   In <1989Nov4.004529.10049@ico.isc.com> Dick Dunn wrote:
   >                                     Second, I would expect better locality
   > for code reference than for data reference, hence the I cache ought to do
   > more good than the D cache.  Aren't the pathological cache-busting programs
   > generally ones which spray data accesses all over the place?
   
No.  That is typical of pathological demand-paged-memory-busting
techniques.  The reason is that caches typically consist of a large
set of relatively short entries (a few-to-several bytes each), as
compared to demand-paged memory, which consists of a set of LONG
entries, measured in kilobytes.  A truly random cache access pattern is
actually pretty good: the lookup scheme is usually a simple hash of the
address (typically (addr >> m)), and while it is easy to construct
regular access patterns that defeat such a hash, random references
spread evenly across the cache.  Furthermore, the amount of memory
moved (and therefore the time spent moving it) for a single cache
update is not very great, so if the extra data is never used, very
little effort was wasted.  In comparison, a single page-in takes so
much time, and moves so much data, that random addressing patterns
will cause pathologically bad behavior.

Caches, however, are only useful if the same data is used
repetitively before it is flushed.  This seems to be more an issue of
algorithm design than programming style.
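The asymmetry shows up in a back-of-envelope expected-cost calculation.  The latencies below are illustrative round numbers, not measurements from any system.

```python
# Illustrative round-number latencies: a cache miss refills one short line
# from main memory; a page fault pulls kilobytes from disk.
cache_hit_ns, cache_miss_ns = 100, 1_000
page_hit_ns, page_fault_ns = 100, 20_000_000

def mean_access(hit_cost, miss_cost, miss_rate):
    """Expected cost per reference at a given miss rate."""
    return (1 - miss_rate) * hit_cost + miss_rate * miss_cost

# Random addressing drives both miss rates up, but the penalty gap dominates:
print(mean_access(cache_hit_ns, cache_miss_ns, 0.5))   # -> 550.0 ns
print(mean_access(page_hit_ns, page_fault_ns, 0.5))    # -> 10000050.0 ns
```

A 50% cache miss rate costs a small constant factor; the same miss rate against the pager is catastrophic, which is why random access is a paging-buster rather than a cache-buster.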
 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
				Eric Freudenthal
				NYU Ultracomputer Lab
				715 Broadway, 10th floor
				New York, NY  10012
				
				Phone:(212) 998-3345 work
				      (718) 789-4486 home
				Email:freudent@ultra.nyu.edu