moss@takahe.cs.umass.edu (Eliot &) (11/06/89)
While I would agree that many subroutine calls, or even a threaded interpreter, would tend to undermine *sequential* instruction reference, and thus substantially affect the performance of a sequential pre-fetch buffer, a true cache is insensitive to jumps. What matters for cache performance is the *total volume* of instruction locations referenced over a range of time, and whether they tend to fit in the cache.

Processor performance may still be adversely affected, since it may still be faster to buffer and deliver sequential instruction words than to take a jump, especially on a fast pipelined machine, like many RISC CPUs.

--
J. Eliot B. Moss, Assistant Professor
Department of Computer and Information Science
Lederle Graduate Research Center
University of Massachusetts
Amherst, MA  01003
(413) 545-4206; Moss@cs.umass.edu
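Moss's point can be illustrated with a toy simulation: a jump-heavy trace and a sequential trace of the same total volume get nearly the same hit rate once the working set fits in the cache. This is a minimal sketch with invented parameters (an LRU cache of 64 four-word lines), not a model of any real machine.

```python
# Toy instruction cache (fully associative, LRU), illustrating that
# hit rate tracks the total volume of locations touched, not whether
# the reference stream is sequential.  All sizes are made up.
import random
from collections import OrderedDict

def hit_rate(trace, capacity=64, line_size=4):
    cache = OrderedDict()              # line number -> True, in LRU order
    hits = 0
    for addr in trace:
        line = addr // line_size
        if line in cache:
            hits += 1
            cache.move_to_end(line)    # mark as most recently used
        else:
            cache[line] = True         # fill on miss
            if len(cache) > capacity:
                cache.popitem(last=False)   # evict least recently used
    return hits / len(trace)

# Sequential trace: 10 passes over 200 consecutive instruction words.
sequential = list(range(200)) * 10

# Jumpy trace: same volume of references, but bouncing among ten small
# 20-word "subroutines" scattered through a 4 KB region.
random.seed(0)
subs = [random.randrange(0, 1024) * 4 for _ in range(10)]
jumpy = []
for _ in range(100):
    base = random.choice(subs)
    jumpy.extend(range(base, base + 20))

print(f"sequential: {hit_rate(sequential):.3f}")   # 0.975
print(f"jumpy:      {hit_rate(jumpy):.3f}")
```

Both traces touch roughly 50 distinct cache lines, well under the 64-line capacity, so after compulsory misses both hit rates are above 95% despite the jumpy trace taking a branch every 20 words.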
shri@ccs1.cs.umass.edu (H.Shrikumar{shri@ncst.in}) (11/07/89)
In article <1TMk2X#Qggn6=eric@snark.uu.net> eric@snark.uu.net (Eric S. Raymond) writes:
>In <1989Nov4.004529.10049@ico.isc.com> Dick Dunn wrote:
>> Second, I would expect better locality
>> for code reference than for data reference, hence the I cache ought to do
>> more good than the D cache.  Aren't the pathological cache-busting programs
>> generally ones which spray data accesses all over the place?
>
>Not necessarily. There's a subtle problem here; good software modularity
>practices tend to hurt code locality. If you're calling subroutines a lot
>in generated code the PC jumps all over the shop.

This happens for example in a FORTH machine. FORTH is typically subroutine threaded, so there is a flurry of subroutine calls happening at about 4 million a second (on an 8-10 MHz (?) Novix 2016 Forth CPU).

In Forth there is a subroutine call every five or so instructions, I would guess.

Wonder if some better cache philosophy can help here.

-- shrikumar ( shri@ccs1.cs.umass.edu, shri@ncst.in )
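Subroutine threading means a compiled colon definition is nothing but a sequence of calls, which is why the PC takes a call or return every few instructions. Here is a toy model in Python that counts the calls; the plumbing is invented for illustration and is not how the Novix chip actually works.

```python
# A toy subroutine-threaded "Forth": a compiled word is just a list of
# subroutine calls, so nearly every executed item is a call.
# (Illustrative sketch; the wrappers are invented for counting.)

stack = []
calls = 0

def prim(fn):
    """Wrap a primitive word so we can count subroutine calls to it."""
    def wrapped():
        global calls
        calls += 1
        fn()
    return wrapped

dup  = prim(lambda: stack.append(stack[-1]))
star = prim(lambda: stack.append(stack.pop() * stack.pop()))

def colon(*words):
    """A colon definition: executing it just calls each word in turn."""
    def wrapped():
        global calls
        calls += 1
        for word in words:
            word()
    return wrapped

square = colon(dup, star)    # : SQUARE  DUP * ;

stack.append(3)
square()
print(stack, calls)          # [9] 3 -- three calls to square one number
```

Even this two-primitive word costs three subroutine calls, consistent with shri's guess of a call every handful of instructions.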
rmb@mipon2.intel.com (Bob Bentley) (11/10/89)
In article <1TMk2X#Qggn6=eric@snark.uu.net> eric@snark.uu.net (Eric S. Raymond) writes:
>
> ... There's a subtle problem here; good software modularity
>practices tend to hurt code locality. If you're calling subroutines a lot
>in generated code the PC jumps all over the shop.
>
>I have no statistics on this, but I can easily imagine something like, say,
>the inner loop of a threaded-code interpreter busting hell out of the I cache
>because it's doing the equivalent of an indexed call indirect to some far-off
>routine every couple of instructions.
>
>Has anyone done any systematic investigation of this issue?

We did a pretty thorough study of cache behavior at BiiN (now, alas, defunct). This was prompted by the large variance which we observed in cache hit ratios for different benchmarks; in particular, OS-intensive benchmarks were *much* worse than CPU-only (Dhrystone, etc.) or Unix utility (grep, diff, etc.) tests.

There were a number of contributing causes to the observed cache behavior (use of a sub-sectored caching scheme was a major one, since it led to an effective cache occupancy of < 40%). However, the nature of OS code was certainly a factor. The BiiN OS was written in Ada in a highly modular fashion. The first versions of the OS contained considerable amounts of diagnostic code, and very few routines were inlined. The result was that there was very little either spatial or temporal locality in the OS code: calls and branches were very frequent, tight loops were very rare. Though not as bad as Eric's hypothetical example, the effect on cache hit ratio, and hence on overall system performance, was still significant.

There are some definite negative performance implications to modular programming techniques which need to be kept in mind.
(This is not to say that I am opposed to modular programming techniques, especially for projects as large and complex as the BiiN OS - as a former colleague of mine used to ask, "Do you want it to go fast or do you want it to work?")

Bob Bentley
--------------------------------------------------------------------------------
| Intel Corp., M/S JF1-58        UUCP:  rmb@omefs3.intel.com                   |
| 2111 N.E. 25th Avenue          Phone: (503) 696-4728                         |
| Hillsboro, Oregon 97124        Fax:   (503) 696-4515                         |
--------------------------------------------------------------------------------
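The sub-sectoring effect Bob mentions can be sketched numerically: the cache tags whole sectors but fills only the sub-block that missed, so with a jumpy reference stream most of each resident sector sits empty. The geometry and stream below are invented (the post gives no BiiN parameters), but the mechanism is the same.

```python
# Sketch of why sub-sectoring depresses effective occupancy: a tag
# covers a whole sector, but a miss fills only one sub-block, leaving
# the rest of the sector allocated-but-invalid.  All sizes invented.
import random
random.seed(1)

SECTORS, SUBBLOCKS = 128, 4      # cache holds 128 sectors of 4 sub-blocks
valid = {}                       # sector tag -> set of valid sub-blocks

# Jumpy reference stream: random words scattered over 4096 sectors,
# like modular OS code with calls and branches everywhere.
for _ in range(10000):
    sector = random.randrange(4096)
    sub = random.randrange(SUBBLOCKS)
    if sector not in valid:
        if len(valid) >= SECTORS:
            valid.pop(random.choice(list(valid)))   # evict a random sector
        valid[sector] = set()
    valid[sector].add(sub)       # fill only the sub-block that missed

occupancy = sum(len(s) for s in valid.values()) / (SECTORS * SUBBLOCKS)
print(f"effective occupancy: {occupancy:.0%}")
```

With little locality, each sector is evicted before more than one or two of its four sub-blocks ever get filled, so effective occupancy lands well under the 40% figure the BiiN study observed.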
crowl@cs.rochester.edu (Lawrence Crowl) (11/10/89)
In article <1TMk2X#Qggn6=eric@snark.uu.net> eric@snark.uu.net (Eric S. Raymond) writes:
>good software modularity practices tend to hurt code locality. If you're
>calling subroutines a lot in generated code the PC jumps all over the shop.

For the purposes of caches and locality, it doesn't matter if the PC jumps all over the shop. What matters is if the PC jumps to someplace it has recently been. Unrolling loops reduces PC jumping, but also reduces locality and cache effectiveness. Replacing several calls to a procedure by the inline code to achieve the same function produces the same effect. By leaving the calls in place, I reduce the number of (static) instructions that the PC visits, and thereby increase the PC locality. In short, MODULARITY IMPROVES PC LOCALITY.

There are some tradeoffs involved in modularity that affect performance. First, the additional call instructions may increase code size for very small procedures. (Automatic inlining helps here.) Second, the PC jumps drain the instruction pipeline. If the drain is sufficiently frequent, lower PC locality may improve performance. Third, procedures may be spread over many pages and each occupy only a small portion of its page. The result is that while the INSTRUCTION locality increases, the PAGE locality decreases. This could reduce performance via translation cache misses and paging.

Suppose we have a highly modular program, and an inlining compiler. The compiler may choose to call a procedure, or expand it inline. For those procedures where calls are a net performance loss, the compiler will choose to expand them inline. Given a modular program, an inlining compiler may emit either modular code or non-modular code. The reverse is not true. So, given an inlining compiler, MODULARITY IS ALWAYS GOOD.

(There are also macro-efficiency and software engineering reasons to prefer modularity. However, they are outside the scope of this newsgroup.)
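The static-footprint argument is easy to put in numbers. A back-of-the-envelope sketch, with all the sizes invented for illustration:

```python
# Crowl's point in miniature: leaving calls in place shrinks the
# static instruction footprint the PC visits.  Toy model: a program
# uses a 20-"instruction" helper from 8 different call sites.
# (All numbers are invented for illustration.)

helper_size = 20     # instructions in the shared procedure
call_sites  = 8      # places that use it
call_cost   = 1      # one call instruction per site

# Modular: one copy of the helper, plus a call instruction per site.
modular_static = helper_size + call_sites * call_cost

# Fully inlined: a complete copy of the helper at every site.
inlined_static = call_sites * helper_size

print(modular_static, inlined_static)   # 28 vs 160 static instructions
```

The dynamic instruction counts are nearly identical, but the modular layout touches 28 static instructions where the inlined one touches 160, so the modular version has far better odds of fitting in a small I-cache.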
--
Lawrence Crowl                  716-275-9499    University of Rochester
crowl@cs.rochester.edu                          Computer Science Department
...!{allegra,decvax,rutgers}!rochester!crowl    Rochester, New York, 14627
eric@snark.uu.net (Eric S. Raymond) (11/13/89)
In <1989Nov9.205356.17585@cs.rochester.edu> Lawrence Crowl wrote:
> (There are also macro-efficiency and software engineering reasons to prefer
> modularity.

Hearty agreement on that (I'm a rather vigorous partisan of ADT methods, myself).

You describe a set of heuristics for choosing inlining or non-inlining that make intuitive sense (at least they do now that I've reread your posting about three times :-)). But it sounds to me as though effective use of those heuristics might require maintenance of an awful lot of `macro' information about the code object and its call patterns, including a global flow analysis.

Is this actually done? If not, what kind of less elaborate state has been found to suffice?

--
Eric S. Raymond = eric@snark.uu.net  (mad mastermind of TMN-Netnews)
mef@aplcen.apl.jhu.edu (Marty Fraeman) (11/16/89)
In article <6374@dime.cs.umass.edu> shri@ccs1.cs.umass.edu (H.Shrikumar{shri@ncst.in}) writes:
>In article <1TMk2X#Qggn6=eric@snark.uu.net> eric@snark.uu.net (Eric S. Raymond)
>writes:
>>In <1989Nov4.004529.10049@ico.isc.com> Dick Dunn wrote:
>>> Second, I would expect better locality
>>> for code reference than for data reference, hence the I cache ought to do
>>> more good than the D cache.  Aren't the pathological cache-busting programs
>>> generally ones which spray data accesses all over the place?
>>
>>Not necessarily. There's a subtle problem here; good software modularity
>>practices tend to hurt code locality. If you're calling subroutines a lot
>>in generated code the PC jumps all over the shop.
>
>This happens for example in a FORTH machine, FORTH typically is
>subroutine threaded, so there is a flurry of subroutine calls
>happening at about 4 million a second. (in a 8-10 Mhz (?) Novix 2016
>Forth CPU).
>
>In forth there is a subroutine call every five or so instructions
>I would guess.

We have looked at a similar issue in Forth. Over 90% of sequential code accesses are less than 6.25 instructions long on the SC32 Forth engine. This machine can execute most Forth primitives with a single one-cycle instruction, except for load and store, which take two cycles. Subroutine calls are one cycle, and most returns take zero cycles since they can generally be combined with another instruction.

We also looked at the effectiveness of instruction caches on this machine and found that fairly small caches (<16KB) could still achieve >95% hit rates. However, since the size of the programs we studied was fairly modest, our I-cache size result should be taken with a grain of salt. On the other hand, the size of the programs we studied was comparable to the size of single threads in typical real-time applications we've developed in the past, so I believe there is some significance to our data.
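The measurement Fraeman describes amounts to walking an instruction-address trace and collecting the lengths of maximal sequential runs. A minimal sketch, using a made-up trace rather than real SC32 data:

```python
# Collect sequential-run lengths from an instruction address trace:
# a run ends whenever the next fetch is not the following address
# (i.e., a taken call, branch, or return).  Trace below is invented.

def run_lengths(trace):
    runs, current = [], 1
    for prev, addr in zip(trace, trace[1:]):
        if addr == prev + 1:       # sequential fetch continues the run
            current += 1
        else:                      # control transfer: close the run
            runs.append(current)
            current = 1
    runs.append(current)
    return runs

# Made-up trace: short straight-line bursts broken by calls/returns.
trace = [10, 11, 12, 100, 101, 13, 14, 15, 16, 200, 201, 202]
runs = run_lengths(trace)
short = sum(1 for r in runs if r < 7) / len(runs)
print(runs, f"{short:.0%} of runs shorter than 7")
```

Applied to a real SC32 trace, the same tally would yield the "over 90% of runs under 6.25 instructions" statistic quoted above.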
As a final comment on I-cache effectiveness in Forth, keep in mind that while Forth instruction traces hop all over the place, the hierarchical nature of most Forth implementations keeps code size much smaller than usual.

Marty Fraeman			mef@aplcen.apl.jhu.edu
301-953-5000, x8360
JHU/Applied Physics Laboratory
Johns Hopkins Road
Laurel, Md. 20707
peter@ficc.uu.net (Peter da Silva) (11/16/89)
In article <6374@dime.cs.umass.edu> shri@ccs1.cs.umass.edu (H.Shrikumar{shri@ncst.in}) writes:
> In forth there is a subroutine call every five or so instructions
> I would guess.

Yeh, but most of these calls would be to the same place, so the subroutine would be in the cache. You'd expect that { @ ! (colon) (exit) + - / * } and a few others would be in the cache most of the time.

> Wonder if some better cache philosophy can help here.

I'd worry more about pipeline flushes.
--
 `-_-' Peter da Silva <peter@ficc.uu.net> <peter@sugar.hackercorp.com>.
  'U`  --------------  +1 713 274 5180.
"*Real* wizards don't whine about how they paid their dues"
	-- Quentin Johnson quent@atanasoff.cs.iastate.edu
freudent@eric.nyu.edu (Eric Freudenthal) (11/16/89)
In <1989Nov4.004529.10049@ico.isc.com> Dick Dunn wrote:
> Second, I would expect better locality
> for code reference than for data reference, hence the I cache ought to do
> more good than the D cache.  Aren't the pathological cache-busting programs
> generally ones which spray data accesses all over the place?

No. That is typical of pathological demand-paged-memory-busting techniques. The reason is that caches typically consist of a large set of relatively short entries (a few to several bytes each), as compared to demand-paged memory, which consists of a set of LONG entries, measured in kilobytes.

A truly random cache access pattern is actually pretty good, because the lookup scheme is usually a simple hash function (typically (addr >> m)): random addresses spread evenly across it, whereas it is easy to construct systematic access patterns that hash badly. Furthermore, the amount of memory moved (and therefore the time involved in moving it) for a single cache update is not very great, so if the extra stuff is never used, very little effort was wasted. In comparison, a single page-in takes so much time, and moves so much data, that random addressing patterns will cause pathologically bad behavior.

Caches, however, are only useful if the same data is used repetitively before it is flushed. This seems to be more an issue of algorithm design than programming style.

- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Eric Freudenthal
NYU Ultracomputer Lab
715 Broadway, 10th floor
New York, NY  10012
Phone: (212) 998-3345 work
       (718) 789-4486 home
Email: freudent@ultra.nyu.edu
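The granularity argument is worth putting in numbers: a miss that pulls in one cache line wastes a few bytes if the rest goes unused, while a page fault that pulls in a whole page wastes kilobytes. The sizes below are plausible-for-the-era guesses, not figures from the post.

```python
# Freudenthal's granularity argument in numbers: per random reference
# that misses, how many bytes are moved but never used?
# (Sizes are illustrative guesses, not measured values.)

line_size = 16        # bytes per cache line
page_size = 4096      # bytes per demand-paged memory page
used      = 4         # bytes actually referenced (one word)

line_waste = line_size - used    # wasted per cache fill
page_waste = page_size - used    # wasted per page-in

print(line_waste, page_waste)    # 12 vs 4092 bytes moved for nothing
```

A factor of several hundred in wasted traffic per miss, before even counting the vastly longer latency of a disk page-in, which is why random access patterns are merely mediocre for a cache but pathological for paging.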