jcallen@Encore.COM (Jerry Callen) (11/01/90)
davidsen@crdos1.crd.ge.com (bill davidsen) writes:

> The feature is intelligent I-cache, which only stores instructions
>which are the target of jumps.

This is, of course, the AMD29K branch target cache, which stores the
target of the branch and the next 3 instructions. The idea is that this
gives the memory system 4 clock ticks in which to deliver the 4th
instruction after the branch target, and then hopefully the memory system
can deliver subsequent instructions every cycle (page mode DRAM, for
instance).

I attended an AMD marketing pitch a few years ago on the 29K; the
argument was that the branch target cache enables the 29K to squeeze
reasonable performance out of a cheap DRAM memory system. Can anyone
comment on how effective the branch target cache is?

-- 
Jerry Callen
jcallen@encore.com
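For concreteness, the lookup side of such a cache can be sketched in C
roughly as follows; the entry count, direct-mapped indexing, and refill
policy here are assumptions for illustration, not the actual 29K design:

    #include <stdint.h>
    #include <string.h>

    #define BTC_ENTRIES     32      /* assumed entry count, not the real 29K figure   */
    #define WORDS_PER_ENTRY  4      /* the branch target plus the next 3 instructions */

    struct btc_entry {
        int      valid;
        uint32_t tag;                       /* address of the branch target         */
        uint32_t insn[WORDS_PER_ENTRY];     /* instructions cached from that target */
    };

    static struct btc_entry btc[BTC_ENTRIES];

    /* On a taken branch, a hit supplies the first four instructions while the
     * DRAM starts a page-mode access; a miss pays the full memory latency and
     * the entry is refilled for next time.
     */
    int btc_lookup(uint32_t target, const uint32_t *memory,
                   uint32_t out[WORDS_PER_ENTRY])
    {
        struct btc_entry *e = &btc[(target / 4) % BTC_ENTRIES];

        if (e->valid && e->tag == target) {
            memcpy(out, e->insn, sizeof e->insn);
            return 1;                       /* hit: 4 cycles of slack for memory */
        }
        e->valid = 1;                       /* miss: refill from main memory     */
        e->tag = target;
        memcpy(e->insn, &memory[target / 4], sizeof e->insn);
        memcpy(out, e->insn, sizeof e->insn);
        return 0;
    }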
rfk@egr.duke.edu (Robert F. Krick) (11/02/90)
From article <2823@crdos1.crd.ge.COM>, by davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr):
> 
> The feature is intelligent I-cache, which only stores instructions
> which are the target of jumps.
> 
> [stuff deleted]
> 
> If anyone has any info on recent work (if any) I'd like to hear it. If
> there are any good papers I should look up I'd like to see them, too.
> Obviously this must either be harder to do than I think, or provide less
> benefit, or everyone would be doing it.

Although not a "smart I-cache", Dr. Apostolos Dollas (my advisor) and I
have been working to *eliminate* the performance degradation associated
with I-cache misses and branch penalties. To accomplish this, we have
developed the Sustained Performance Architecture (SPA). This architecture
is based on the concept that program flow information (extracted prior to
execution) can be used to identify those basic blocks which are candidates
for execution within the next n cycles (where n is the number of cycles of
latency for the memory). The instructions from these basic blocks can be
prefetched from multiple banks of instruction memory such that the
required instructions are available when they are needed by the processor.
(Of course, some instructions will be prefetched unnecessarily, but these
are not delivered to the processor.) In essence, SPA leverages information
available when the program is compiled in order to improve performance.

Since the program flow information is extracted prior to execution, SPA
cannot *improve* the performance for branches which have dynamically
defined targets (i.e. jmp r2). With the notable exception of subroutine
returns which SPA can handle without any loss in performance, the
degradation associated with this class of branch instructions is
insignificant ( << 1%), because these instructions are sufficiently rare
in compiled code from languages such as FORTRAN and C.

We have developed a working `proof of concept' prototype. In addition, we
have applied for patent protection in both the USA and Japan. If you would
like more details about this architecture, please see our article in the
December '89 issue of the ACM "Computer Architecture News". If there is
enough demand, I will make additional technical reports available via ftp.

========================================================================
Robert F. Krick; Dept. of Elec. Eng.; Duke University; Durham, NC 27706
Internet: rfk@cameron.egr.duke.edu          AT&T: (919)660-5268
   "When the branching gets conditional, the prefetching gets tough,
    and the tough get going!"
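The general flavor of compile-time lookahead prefetching can be sketched
in C; this is only a generic illustration built on assumed data
structures, not the patented SPA design itself:

    #include <stdio.h>
    #include <stddef.h>

    #define MAX_SUCCS 2    /* fall-through plus one statically known branch target */

    /* Control-flow information extracted by the compiler before execution:
     * each basic block records its start address and the blocks that can
     * follow it.
     */
    struct basic_block {
        unsigned long start_addr;
        size_t nsuccs;
        struct basic_block *succ[MAX_SUCCS];
    };

    /* Stand-in for starting a fetch from one of the interleaved
     * instruction-memory banks.
     */
    static void bank_prefetch(unsigned long addr)
    {
        printf("prefetch block at 0x%lx\n", addr);
    }

    /* Prefetch every block that could be reached within `depth` more branches.
     * Blocks fetched here but never executed are simply dropped, as the
     * posting describes.
     */
    void prefetch_candidates(const struct basic_block *bb, int depth)
    {
        size_t i;

        if (bb == NULL || depth < 0)
            return;
        bank_prefetch(bb->start_addr);
        for (i = 0; i < bb->nsuccs; i++)
            prefetch_candidates(bb->succ[i], depth - 1);
    }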
jbuck@galileo.berkeley.edu (Joe Buck) (11/02/90)
In article <1157@cameron.egr.duke.edu>, rfk@egr.duke.edu (Robert F. Krick) writes:
|> Since the program flow information is extracted prior to execution, SPA
|> cannot *improve* the performance for branches which have dynamically
|> defined targets (i.e. jmp r2). With the notable exception of subroutine
|> returns which SPA can handle without any loss in performance, the degradation
|> associated with this class of branch instructions is insignificant ( << 1%),
|> because these instructions are sufficiently rare in compiled code from
|> languages such as FORTRAN and C.

Object-oriented languages are becoming more popular; there are many more
indirect branches (in C++, these are virtual function calls) in such
languages. However, I have no statistics on what fraction of instructions
in such programs are indirect branches, and this may vary widely from
program to program. You'd do very poorly on Forth code, but I doubt that
you care all that much. :-)

> We have developed a working `proof of concept' prototype. In addition,
> we have applied for patent protection in both the USA and Japan.

Have you gotten it to the point where you can run the SPECmarks on it
(meaning that you've finished Fortran and C compilers that can provide
the scheduling information)?

--
Joe Buck
jbuck@galileo.berkeley.edu       {uunet,ucbvax}!galileo.berkeley.edu!jbuck
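For readers who have not looked at the generated code, a virtual call
boils down to a fetch through a per-object function-pointer table, so the
branch target is only known at run time. A small C analogue (the names
and structure here are illustrative):

    #include <stdio.h>

    /* A hand-rolled "vtable" in C: roughly what a C++ compiler emits for a
     * virtual function call.  The call in the loop below is an indirect
     * branch whose target depends on which table the object points at, so
     * it cannot be resolved before execution.
     */
    struct shape_ops {
        double (*area)(double size);
    };

    static double square_area(double s) { return s * s; }
    static double circle_area(double r) { return 3.14159 * r * r; }

    static const struct shape_ops square_vtbl = { square_area };
    static const struct shape_ops circle_vtbl = { circle_area };

    struct shape {
        const struct shape_ops *ops;    /* bound at run time */
        double size;
    };

    int main(void)
    {
        struct shape shapes[2] = { { &square_vtbl, 2.0 }, { &circle_vtbl, 2.0 } };
        int i;

        /* Same call site, dynamically defined target: square_area on one
         * iteration, circle_area on the next.
         */
        for (i = 0; i < 2; i++)
            printf("area = %f\n", shapes[i].ops->area(shapes[i].size));
        return 0;
    }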
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (11/02/90)
In article <1157@cameron.egr.duke.edu> rfk@egr.duke.edu (Robert F. Krick) writes:
| Since the program flow information is extracted prior to execution, SPA
| cannot *improve* the performance for branches which have dynamically
| defined targets (i.e. jmp r2). With the notable exception of subroutine
| returns which SPA can handle without any loss in performance, the degradation
| associated with this class of branch instructions is insignificant ( << 1%),
| because these instructions are sufficiently rare in compiled code from
| languages such as FORTRAN and C.

	register int (*state)();

	/* code */
	if (m < n) (*state)(foo, mumble, barf);

  I agree that this is an infrequent case.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    The Twin Peaks Halloween costume: stark naked in a body bag
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (11/02/90)
In article <13120@encore.Encore.COM> jcallen@encore.Com (Jerry Callen) writes:
| I attended an AMD marketing pitch a few years ago on the 29K; the
| argument was that the branch target cache enables the 29K to squeeze
| reasonable performance out of a cheap DRAM memory system. Can anyone
| comment on how effective the branch target cache is?

  A number of people have pointed out that a branch target cache makes
really good use of the cache in terms of keeping the CPU busy, but is
less effective in reducing main memory loading. I could be easily
convinced that a small branch target cache, like 4k or so, and a larger
general cache for both I and D would be desirable. Then you get into
delays caused by multilevel cache...
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    The Twin Peaks Halloween costume: stark naked in a body bag
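A toy cycle-cost model makes the trade-off visible; every size and latency
below is an invented round number for illustration, not a measurement of
any real machine:

    /* Toy model of a split lookup: a small branch target cache backed by a
     * larger unified cache, backed by DRAM.  The extra level helps when it
     * hits, but adds its own latency on the way to main memory.
     */
    #define BTC_HIT_CYCLES   1
    #define L2_HIT_CYCLES    3
    #define DRAM_CYCLES     12

    int fetch_cost(int hits_btc, int hits_l2)
    {
        if (hits_btc)
            return BTC_HIT_CYCLES;
        if (hits_l2)
            return BTC_HIT_CYCLES + L2_HIT_CYCLES;      /* probe BTC, then L2 */
        return BTC_HIT_CYCLES + L2_HIT_CYCLES + DRAM_CYCLES;
    }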
tim@proton.amd.com (Tim Olson) (11/04/90)
In article <13120@encore.Encore.COM> jcallen@encore.Com (Jerry Callen) writes:
| I attended an AMD marketing pitch a few years ago on the 29K; the
| argument was that the branch target cache enables the 29K to squeeze
| reasonable performance out of a cheap DRAM memory system. Can anyone
| comment on how effective the branch target cache is?

Mark Hill's thesis compared a number of different caches, including the
29K's BTC. He concluded that the BTC was more effective than a regular
cache up to about 2K bytes.

-- 
	Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
rfk@duke.cs.duke.edu (Robert F. Krick) (11/04/90)
I have accumulated several followups to my last posting. Here are some
replies:

In article <39324@ucbvax.BERKELEY.EDU>, jbuck@galileo.berkeley.edu (Joe Buck) writes:
> Object-oriented languages are becoming more popular; there are many more
> indirect branches (in C++, these are virtual function calls) in such
> languages. However, I have no statistics on what fraction of instructions
> in such programs are indirect branches, and this may vary widely from
> program to program. You'd do very poorly on Forth code, but I doubt that
> you care all that much. :-)

Indeed, the Sustained Performance Architecture (SPA) which I described may
not be appropriate for all languages because of the requirement that
"most" of the program flow must be known prior to execution. As I said,
SPA cannot *improve* the performance for portions of the program where the
branches have dynamically defined targets. A special-purpose i-cache could
easily be incorporated in SPA to handle these situations explicitly, in
the same manner as other architectures. The decision to incorporate such
an i-cache would depend on the frequency and penalty associated with these
dynamically defined branch targets.

>> We have developed a working `proof of concept' prototype. In addition,
>> we have applied for patent protection in both the USA and Japan.

> Have you gotten it to the point where you can run the SPECmarks on it
> (meaning that you've finished Fortran and C compilers that can provide
> the scheduling information)?

The goal of our `proof of concept' prototype was to demonstrate that
critical aspects of the architecture could be implemented in hardware. We
have accomplished this goal by building a working system. We do not have
any plans to develop the existing prototype further into a system capable
of executing the SPECmarks. However, we are actively seeking licensing
and/or funding which will enable us to develop the next-generation
prototype, namely, a full-fledged general-purpose computer. My current
research is designing and implementing the necessary `scheduling'
algorithms for the compiler.

In article <13120@encore.Encore.COM>, jcallen@encore.Com (Jerry Callen) writes:
> This is, of course, the AMD29K branch target cache, which stores the
> target of the branch and the next 3 instructions.

As you realize, any technique which is heuristic or based upon locality of
reference cannot be 100% effective. As a result, there will *always* be
some performance degradation for techniques such as branch prediction,
i-caches, branch target caches, etc. In contrast, the SPA concept
addresses the *cause* of performance degradation in instruction sequencing
by looking ahead in the program flow to prefetch all instructions which
may be required. As the disparity between processor speed and main memory
speed increases, heuristic and locality-based solutions are becoming
increasingly expensive and IMHO will eventually give way to structured
solutions which provide higher performance at lower cost.

In article <2829@crdos1.crd.ge.COM>, davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>> In article <1157@cameron.egr.duke.edu> rfk@egr.duke.edu (Robert F. Krick)
>> writes:
>> Since the program flow information is extracted prior to execution, SPA
>> cannot *improve* the performance for branches which have dynamically
>> defined targets (i.e. jmp r2). With the notable exception of subroutine
>> returns which SPA can handle without any loss in performance, the degradation
>> associated with this class of branch instructions is insignificant ( << 1%),
>> because these instructions are sufficiently rare in compiled code from
>> languages such as FORTRAN and C.
>
>	register int (*state)();
>
>	/* code */
>	if (m < n) (*state)(foo, mumble, barf);
>
>  I agree that this is an infrequent case.

Thanks for your support! There is another (possibly more frequent)
situation in which branches with dynamically defined targets may be used:
a branch table for the switch/case construct in C. For small branch
tables, it is more appropriate in SPA to generate code for a binary
decision tree (despite the slight increase in code size). The code for
these binary decision trees can then be executed without any loss in
performance. (A small C sketch of this transformation follows below.)

========================================================================
Robert F. Krick; Dept. of Elec. Eng.; Duke University; Durham, NC 27706
Internet: rfk@cameron.egr.duke.edu          AT&T: (919)660-5268
   "When the branching gets conditional, the prefetching gets tough,
    and the tough get going!"
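A minimal sketch of the transformation described above, assuming a dense
four-way switch; this is an illustrative example, not SPA compiler output:

    /* Branch-table form: typically compiled to an indirect jump through a
     * table ("jmp *table[op]"), i.e. a branch with a dynamically defined
     * target.
     */
    int dispatch_table(int op)
    {
        switch (op) {
        case 0: return 10;
        case 1: return 11;
        case 2: return 12;
        case 3: return 13;
        default: return -1;
        }
    }

    /* Binary-decision-tree form: the same mapping expressed as nested
     * compares.  Code is slightly larger, but every branch target is known
     * at compile time, so a static prefetcher can follow it.
     */
    int dispatch_tree(int op)
    {
        if (op < 2) {
            if (op == 0) return 10;
            if (op == 1) return 11;
        } else {
            if (op == 2) return 12;
            if (op == 3) return 13;
        }
        return -1;
    }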